
Hypo3D

Exploring Hypothetical Reasoning in 3D Scenes

Imperial College London
ICML 2025

Overview of the Hypo3D benchmark. ① Examples of five types of context changes. ② Sample questions that include scale-based, direction-based, and semantic reasoning, all requiring open-ended answers. ③ A radar chart illustrating the significant performance gap between models and humans, especially on direction-based questions.


Example of hypothetical reasoning in a 3D scene. Given a 3D scene and an anchor-based frame description (Scene Orientation), models first align the scene to the specified frame. Then, based on a context change description and a question, models hypothetically modify the aligned scene and answer questions about the changed scene.
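This two-stage protocol (align to the specified frame, then apply the change) maps naturally onto a single text prompt. The sketch below shows one plausible way to assemble such a prompt; the function name, argument names, and instruction wording are illustrative assumptions, not the benchmark's official prompt format.

def build_hypo3d_prompt(scene_orientation: str, context_change: str, question: str) -> str:
    """Compose a hypothetical-reasoning prompt (illustrative; not the official Hypo3D prompt).

    scene_orientation: anchor-based frame description, e.g. "The front of the
        whiteboard faces the north side of the room."
    context_change:    hypothetical edit, e.g. "The sofa is moved to the east wall."
    question:          open-ended question about the changed scene.
    """
    return (
        "You are given a 3D indoor scene (rendered as an image or described in text).\n"
        f"Scene orientation: {scene_orientation}\n"
        "First, align the scene to this reference frame.\n"
        f"Hypothetical change: {context_change}\n"
        "Imagine the scene after this change has been applied.\n"
        f"Question: {question}\n"
        "Answer with a short open-ended phrase."
    )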


Dataset Statistics. ① Word cloud of context change descriptions. ② Frequency distribution of context change types across 7,727 instances. ③ Distribution of question types across change categories, with scale-based questions consistently the most frequent, followed by direction-based and then semantic questions.

Introduction

The rise of vision-language foundation models has narrowed the gap between human and machine capabilities in 3D scene reasoning. However, existing 3D reasoning benchmarks assume real-time access to the scene, which is impractical given the high cost of frequent scene updates. To address this, we introduce Hypothetical 3D Reasoning (Hypo3D), a benchmark designed to evaluate models' ability to reason without access to real-time scene data: models must imagine the scene state from a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering benchmark comprising 7,727 context changes across 700 indoor scenes, yielding 14,885 question-answer pairs. An anchor-based world frame is established for each scene, ensuring that directional terms in context changes and QAs consistently refer to a shared global frame. Extensive experiments show that state-of-the-art models struggle to reason effectively in hypothetically changed scenes, revealing a substantial performance gap relative to humans, particularly in scenarios involving movement changes and direction-based reasoning. Even when a change is irrelevant to the question, models often incorrectly adjust their answers.
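Concretely, each benchmark instance pairs one context change with a question about the hypothetically changed scene. The record below is a hypothetical illustration of such an instance; the field names and example values are assumptions for exposition, not the released data schema.

# Hypothetical illustration of one Hypo3D instance; field names and values
# are assumptions, not the released data format.
example_instance = {
    "scene_id": "scene0000_00",   # one of the 700 indoor scenes
    "change_type": "movement",    # movement / removal / attribute / addition / replacement
    "context_change": "The chair is moved to the left side of the desk.",
    "question": "What object is now closest to the door?",
    "answer": "the chair",        # open-ended, short-phrase answer
}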

Hypo3D Leaderboard

Table 1. Exact-match (EM) and partial-match (PM) accuracy (%) of ten foundation models and human evaluators on Hypo3D.

The highest model performance for each type of context change is in bold, while the best-performing model within each family is underlined.

Model                     Movement         Removal          Attribute        Addition         Replacement      Overall
                          EM     PM        EM     PM        EM     PM        EM     PM        EM     PM        EM     PM

📝 LLM (Scene Caption)
Llama-3.2 3B              25.31  28.37     29.85  33.65     24.95  29.59     26.78  30.78     23.75  27.68     26.08  29.91
GPT-4o API (Text)         35.76  38.66     36.88  41.71     34.05  39.58     39.74  43.28     31.33  35.24     35.54  39.65

🗺️ 2D VLM (Non-Semantic Top-View Map)
Qwen2-VL 7B               29.23  35.08     30.71  34.69     29.04  33.94     31.48  35.17     28.41  33.10     29.68  34.47
Qwen2-VL 72B              33.02  37.38     33.88  37.57     33.48  37.62     35.95  40.29     30.66  34.64     33.39  37.51
LLaVA-OV 7B               30.34  34.17     29.81  33.24     31.37  36.13     33.12  35.64     28.41  31.81     30.62  34.34
LLaVA-OV 72B              36.46  39.83     36.45  40.22     35.70  40.46     39.64  42.25     33.83  37.85     36.38  40.13
Claude 3.5 Sonnet API     17.49  30.24     19.90  27.34     22.96  33.47     22.90  31.61     20.35  27.70     20.42  30.29
GPT-4o API                34.49  37.69     32.85  36.53     31.23  35.38     38.09  40.70     30.04  33.22     33.58  36.75

🗺️ 2D VLM (Semantic Top-View Map)
Qwen2-VL 7B               31.26  36.41     38.09  41.90     34.83  39.41     37.64  41.41     31.86  36.62     34.40  38.91
Qwen2-VL 72B              38.42  42.56     47.36  51.05     46.76  51.10     47.63  50.87     44.43  48.78     44.25  48.25
LLaVA-OV 7B               33.32  36.80     34.34  37.84     34.98  39.50     38.96  41.98     33.93  38.33     34.81  38.60
LLaVA-OV 72B              39.39  42.99     43.44  46.87     44.57  49.37     46.12  49.06     44.10  48.18     43.01  46.83
Claude 3.5 Sonnet API     30.92  42.98     40.26  48.54     42.29  52.72     43.16  51.59     43.28  50.73     38.86  48.65
GPT-4o API                40.77  43.79     47.36  50.40     47.42  51.39     50.59  53.77     44.24  47.68     45.50  48.82

🧊 3D VLM (RGB-D Video, Point Cloud)
LEO 7B                    14.40  22.96     18.54  22.82     14.35  21.56     14.64  24.83     11.76  19.50     14.83  22.40
LLaVA-3D 7B               31.63  35.11     30.60  33.91     31.60  36.16     33.67  36.70     30.42  34.16     31.56  35.23

👨‍🔬 Human
Human                     95.00  96.00     93.00  95.00     93.00  94.83     89.00  90.67     85.00  86.00     91.00  92.50
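As a rough illustration of how scores of this kind are computed, the sketch below assumes SQuAD-style answer normalization, with EM counting a prediction correct only when it exactly matches the normalized ground truth and PM awarding partial credit via token-level F1. These definitions are assumptions for illustration and may differ from the paper's exact protocol.

import re

def _normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)

def exact_match(prediction: str, answer: str) -> float:
    """EM: 1.0 iff the normalized prediction equals the normalized answer."""
    return float(_normalize(prediction) == _normalize(answer))

def partial_match(prediction: str, answer: str) -> float:
    """PM (assumed here): token-level F1 between prediction and answer."""
    pred, gold = _normalize(prediction).split(), _normalize(answer).split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if not pred or not gold or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

Under these assumed definitions, partial_match("wooden chair", "chair") returns roughly 0.67 while exact_match returns 0.0, and exact_match("the chair", "chair") returns 1.0 because articles are stripped during normalization.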

Qualitative Visualizations


🔍 Five Key Insights from the Hypo3D Benchmark

  • Insight 1: Models struggle with hypothetical movement and replacement changes.
  • Insight 2: Models struggle with direction-based questions.
  • Insight 3: Anchor-based frame definition improves orientation understanding.
  • Insight 4: Reasoning in hypothetically changed scenes is more challenging than in unchanged scenes.
  • Insight 5: Models hallucinate answer changes when the context change is irrelevant to the question.

BibTeX

@article{mao2025hypo3d,
  title={Hypo3D: Exploring Hypothetical Reasoning in 3D Scenes},
  author={Mao, Ye and Luo, Weixun and Jing, Junpeng and Qiu, Anlan and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2502.00954},
  year={2025}
}