The rise of vision-language foundation models marks a step toward closing the gap between human and machine capabilities in 3D scene reasoning. However, existing 3D reasoning benchmarks assume real-time access to the scene, which is impractical given the high cost of frequent scene updates. To address this, we introduce Hypo3D (Hypothetical 3D Reasoning), a benchmark designed to evaluate models' ability to reason without access to real-time scene data: models must first imagine the scene state implied by a provided change description and then reason over that imagined state. Hypo3D is formulated as a 3D Visual Question Answering benchmark comprising 7,727 context changes across 700 indoor scenes, yielding 14,885 question-answer pairs. An anchor-based world frame is established for every scene, so that directional terms in context changes and QAs consistently refer to a shared global frame. Extensive experiments show that state-of-the-art models struggle to reason effectively in hypothetically changed scenes, revealing a substantial performance gap relative to humans, particularly in scenarios involving movement changes and directional reasoning. Even when a change is irrelevant to the question, models often incorrectly adjust their answers.
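To make the task format concrete, the sketch below shows how a single Hypo3D-style item could be assembled into a prompt for the caption-based LLM setting in Table 1. The record fields (`scene_caption`, `change_description`, `question`, `answer`) and the prompt template are illustrative assumptions rather than the benchmark's actual schema.

```python
from dataclasses import dataclass

# Illustrative record layout; the released Hypo3D data format may differ.
@dataclass
class Hypo3DItem:
    scene_id: str            # indoor scan identifier
    scene_caption: str       # textual description of the original scene
    change_description: str  # hypothetical context change (movement, removal, ...)
    question: str            # question about the scene *after* the change
    answer: str              # ground-truth short answer


def build_prompt(item: Hypo3DItem) -> str:
    """Compose a prompt that asks the model to imagine the changed scene first."""
    return (
        f"Scene description: {item.scene_caption}\n"
        f"Hypothetical change: {item.change_description}\n"
        "Imagine the scene after this change is applied, then answer.\n"
        f"Question: {item.question}\n"
        "Answer concisely:"
    )


# Example usage with a made-up item.
item = Hypo3DItem(
    scene_id="scene0000_00",
    scene_caption="A living room with a sofa facing a TV and a lamp in the corner.",
    change_description="The lamp is moved to the left side of the sofa.",
    question="Which object is now on the left side of the sofa?",
    answer="lamp",
)
print(build_prompt(item))
```

In the benchmark itself, directional terms in the change and question are grounded in the anchor-based world frame, so the 2D VLM baselines in Table 1 additionally receive a (semantic or non-semantic) top-view map of the scene rather than a caption alone.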
Table 1. Exact match (EM) and partial match (PM) accuracy of ten foundation models and human evaluators on Hypo3D.
The highest model performance for each type of context change is in bold, while the best-performing model within each family is underlined. An illustrative sketch of how EM and PM could be scored is given below the table.
Model | Movement EM | Movement PM | Removal EM | Removal PM | Attribute EM | Attribute PM | Addition EM | Addition PM | Replacement EM | Replacement PM | Overall EM | Overall PM |
---|---|---|---|---|---|---|---|---|---|---|---|---|
📝 LLM (Scene Caption) | ||||||||||||
Llama-3.2 3B | 25.31 | 28.37 | 29.85 | 33.65 | 24.95 | 29.59 | 26.78 | 30.78 | 23.75 | 27.68 | 26.08 | 29.91 |
GPT-4o API (Text) | 35.76 | 38.66 | 36.88 | 41.71 | 34.05 | 39.58 | 39.74 | 43.28 | 31.33 | 35.24 | 35.54 | 39.65 |
🗺️ 2D VLM (Non-Semantic Top-View Map) | ||||||||||||
Qwen2-VL 7B | 29.23 | 35.08 | 30.71 | 34.69 | 29.04 | 33.94 | 31.48 | 35.17 | 28.41 | 33.10 | 29.68 | 34.47 |
Qwen2-VL 72B | 33.02 | 37.38 | 33.88 | 37.57 | 33.48 | 37.62 | 35.95 | 40.29 | 30.66 | 34.64 | 33.39 | 37.51 |
LLaVA-OV 7B | 30.34 | 34.17 | 29.81 | 33.24 | 31.37 | 36.13 | 33.12 | 35.64 | 28.41 | 31.81 | 30.62 | 34.34 |
LLaVA-OV 72B | 36.46 | 39.83 | 36.45 | 40.22 | 35.70 | 40.46 | 39.64 | 42.25 | 33.83 | 37.85 | 36.38 | 40.13 |
Claude 3.5 Sonnet API | 17.49 | 30.24 | 19.90 | 27.34 | 22.96 | 33.47 | 22.90 | 31.61 | 20.35 | 27.70 | 20.42 | 30.29 |
GPT-4o API | 34.49 | 37.69 | 32.85 | 36.53 | 31.23 | 35.38 | 38.09 | 40.70 | 30.04 | 33.22 | 33.58 | 36.75 |
🗺️ 2D VLM (Semantic Top-View Map) | ||||||||||||
Qwen2-VL 7B | 31.26 | 36.41 | 38.09 | 41.90 | 34.83 | 39.41 | 37.64 | 41.41 | 31.86 | 36.62 | 34.40 | 38.91 |
Qwen2-VL 72B | 38.42 | 42.56 | 47.36 | 51.05 | 46.76 | 51.10 | 47.63 | 50.87 | 44.43 | 48.78 | 44.25 | 48.25 |
LLaVA-OV 7B | 33.32 | 36.80 | 34.34 | 37.84 | 34.98 | 39.50 | 38.96 | 41.98 | 33.93 | 38.33 | 34.81 | 38.60 |
LLaVA-OV 72B | 39.39 | 42.99 | 43.44 | 46.87 | 44.57 | 49.37 | 46.12 | 49.06 | 44.10 | 48.18 | 43.01 | 46.83 |
Claude 3.5 Sonnet API | 30.92 | 42.98 | 40.26 | 48.54 | 42.29 | 52.72 | 43.16 | 51.59 | 43.28 | 50.73 | 38.86 | 48.65 |
GPT-4o API | 40.77 | 43.79 | 47.36 | 50.40 | 47.42 | 51.39 | 50.59 | 53.77 | 44.24 | 47.68 | 45.50 | 48.82 |
🧊 3D VLM (RGB-D Video, Point Cloud) | ||||||||||||
LEO 7B | 14.40 | 22.96 | 18.54 | 22.82 | 14.35 | 21.56 | 14.64 | 24.83 | 11.76 | 19.50 | 14.83 | 22.40 |
LLaVA-3D 7B | 31.63 | 35.11 | 30.60 | 33.91 | 31.60 | 36.16 | 33.67 | 36.70 | 30.42 | 34.16 | 31.56 | 35.23 |
👨🔬 Human | ||||||||||||
Human | 95.00 | 96.00 | 93.00 | 95.00 | 93.00 | 94.83 | 89.00 | 90.67 | 85.00 | 86.00 | 91.00 | 92.50 |
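For reference, below is a minimal sketch of how the EM and PM numbers above could be scored. It assumes EM accepts a prediction only when it equals the ground-truth answer after lowercasing, punctuation stripping, and whitespace normalization, and that PM additionally accepts predictions that contain, or are contained in, the ground truth; Hypo3D's official evaluation script may normalize and match answers differently.

```python
import re
import string


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(pred: str, gold: str) -> bool:
    return _normalize(pred) == _normalize(gold)


def partial_match(pred: str, gold: str) -> bool:
    p, g = _normalize(pred), _normalize(gold)
    return p == g or g in p or p in g


def accuracy(preds, golds, match_fn) -> float:
    """Percentage of (prediction, ground truth) pairs accepted by match_fn."""
    hits = sum(match_fn(p, g) for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)


# Example: scoring two toy predictions against their ground truths.
preds = ["the lamp", "two chairs"]
golds = ["lamp", "three chairs"]
print(f"EM: {accuracy(preds, golds, exact_match):.2f}")    # 0.00
print(f"PM: {accuracy(preds, golds, partial_match):.2f}")  # 50.00
```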
```bibtex
@article{mao2025hypo3d,
  title={Hypo3D: Exploring Hypothetical Reasoning in 3D},
  author={Mao, Ye and Luo, Weixun and Jing, Junpeng and Qiu, Anlan and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2502.00954},
  year={2025}
}
```