
Hypo3D

Exploring Hypothetical Reasoning in 3D Scenes

Imperial College London
ICML 2025

Overview of the Hypo3D benchmark. ① Examples of five types of context changes. ② Sample questions that include scale-based, direction-based, and semantic reasoning, all requiring open-ended answers. ③ A radar chart illustrating the significant performance gap between models and humans, especially on direction-based questions.


Example of hypothetical reasoning in a 3D scene. Given a 3D scene and an anchor-based frame description (Scene Orientation), models first align the scene to the specified frame. Then, based on a context change description and a question, models hypothetically modify the aligned scene and answer questions about the changed scene.
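This two-stage protocol (align to the specified frame, then apply the change) maps naturally onto a single text prompt. The sketch below shows one plausible way to assemble such a prompt; the function name, argument names, and instruction wording are illustrative assumptions, not the benchmark's official prompt format.

def build_hypo3d_prompt(scene_orientation: str, context_change: str, question: str) -> str:
    """Compose a hypothetical-reasoning prompt (illustrative; not the official Hypo3D prompt).

    scene_orientation: anchor-based frame description, e.g. "The front of the
        whiteboard faces the north side of the room."
    context_change:    hypothetical edit, e.g. "The sofa is moved to the east wall."
    question:          open-ended question about the changed scene.
    """
    return (
        "You are given a 3D indoor scene (rendered as an image or described in text).\n"
        f"Scene orientation: {scene_orientation}\n"
        "First, align the scene to this reference frame.\n"
        f"Hypothetical change: {context_change}\n"
        "Imagine the scene after this change has been applied.\n"
        f"Question: {question}\n"
        "Answer with a short open-ended phrase."
    )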


Dataset Statistics. ① Word cloud of context change descriptions. ② Frequency distribution of context change types across 7,727 instances. ③ Distribution of question types across change categories, with scale-based questions consistently the most frequent, followed by direction-based and then semantic questions.

Introduction

The rise of vision-language foundation models has narrowed the gap between human and machine capabilities in 3D scene reasoning. However, existing 3D reasoning benchmarks assume real-time access to the scene, which is impractical given the high cost of frequent scene updates. To address this, we introduce Hypothetical 3D Reasoning (Hypo3D), a benchmark designed to evaluate models' ability to reason without access to real-time scene data: models must imagine the scene state from a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering benchmark comprising 7,727 context changes across 700 indoor scenes, yielding 14,885 question-answer pairs. An anchor-based world frame is established for each scene, ensuring that directional terms in context changes and QAs consistently refer to a shared global frame. Extensive experiments show that state-of-the-art models struggle to reason effectively in hypothetically changed scenes, revealing a substantial performance gap relative to humans, particularly in scenarios involving movement changes and direction-based reasoning. Even when a change is irrelevant to the question, models often incorrectly adjust their answers.
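Concretely, each benchmark instance pairs one context change with a question about the hypothetically changed scene. The record below is a hypothetical illustration of such an instance; the field names and example values are assumptions for exposition, not the released data schema.

# Hypothetical illustration of one Hypo3D instance; field names and values
# are assumptions, not the released data format.
example_instance = {
    "scene_id": "scene0000_00",   # one of the 700 indoor scenes
    "change_type": "movement",    # movement / removal / attribute / addition / replacement
    "context_change": "The chair is moved to the left side of the desk.",
    "question": "What object is now closest to the door?",
    "answer": "the chair",        # open-ended, short-phrase answer
}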

Hypo3D Leaderboard

Table 1. Exact-match (EM) and partial-match (PM) accuracy (%) of ten foundation models and human evaluators on Hypo3D.

The highest model performance for each type of context change is in bold, while the best-performing model within each family is underlined.

Model                     Movement         Removal          Attribute        Addition         Replacement      Overall
                          EM     PM        EM     PM        EM     PM        EM     PM        EM     PM        EM     PM

📝 LLM (Scene Caption)
Llama-3.2 3B              25.31  28.37     29.85  33.65     24.95  29.59     26.78  30.78     23.75  27.68     26.08  29.91
GPT-4o API (Text)         35.76  38.66     36.88  41.71     34.05  39.58     39.74  43.28     31.33  35.24     35.54  39.65

🗺️ 2D VLM (Non-Semantic Top-View Map)
Qwen2-VL 7B               29.23  35.08     30.71  34.69     29.04  33.94     31.48  35.17     28.41  33.10     29.68  34.47
Qwen2-VL 72B              33.02  37.38     33.88  37.57     33.48  37.62     35.95  40.29     30.66  34.64     33.39  37.51
LLaVA-OV 7B               30.34  34.17     29.81  33.24     31.37  36.13     33.12  35.64     28.41  31.81     30.62  34.34
LLaVA-OV 72B              36.46  39.83     36.45  40.22     35.70  40.46     39.64  42.25     33.83  37.85     36.38  40.13
Claude 3.5 Sonnet API     17.49  30.24     19.90  27.34     22.96  33.47     22.90  31.61     20.35  27.70     20.42  30.29
GPT-4o API                34.49  37.69     32.85  36.53     31.23  35.38     38.09  40.70     30.04  33.22     33.58  36.75

🗺️ 2D VLM (Semantic Top-View Map)
Qwen2-VL 7B               31.26  36.41     38.09  41.90     34.83  39.41     37.64  41.41     31.86  36.62     34.40  38.91
Qwen2-VL 72B              38.42  42.56     47.36  51.05     46.76  51.10     47.63  50.87     44.43  48.78     44.25  48.25
LLaVA-OV 7B               33.32  36.80     34.34  37.84     34.98  39.50     38.96  41.98     33.93  38.33     34.81  38.60
LLaVA-OV 72B              39.39  42.99     43.44  46.87     44.57  49.37     46.12  49.06     44.10  48.18     43.01  46.83
Claude 3.5 Sonnet API     30.92  42.98     40.26  48.54     42.29  52.72     43.16  51.59     43.28  50.73     38.86  48.65
GPT-4o API                40.77  43.79     47.36  50.40     47.42  51.39     50.59  53.77     44.24  47.68     45.50  48.82

🧊 3D VLM (RGB-D Video, Point Cloud)
LEO 7B                    14.40  22.96     18.54  22.82     14.35  21.56     14.64  24.83     11.76  19.50     14.83  22.40
LLaVA-3D 7B               31.63  35.11     30.60  33.91     31.60  36.16     33.67  36.70     30.42  34.16     31.56  35.23

👨‍🔬 Human
Human                     95.00  96.00     93.00  95.00     93.00  94.83     89.00  90.67     85.00  86.00     91.00  92.50
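As a rough illustration of how scores of this kind are computed, the sketch below assumes SQuAD-style answer normalization, with EM counting a prediction correct only when it exactly matches the normalized ground truth and PM awarding partial credit via token-level F1. These definitions are assumptions for illustration and may differ from the paper's exact protocol.

import re

def _normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in {"a", "an", "the"}]
    return " ".join(tokens)

def exact_match(prediction: str, answer: str) -> float:
    """EM: 1.0 iff the normalized prediction equals the normalized answer."""
    return float(_normalize(prediction) == _normalize(answer))

def partial_match(prediction: str, answer: str) -> float:
    """PM (assumed here): token-level F1 between prediction and answer."""
    pred, gold = _normalize(prediction).split(), _normalize(answer).split()
    common = sum(min(pred.count(t), gold.count(t)) for t in set(pred))
    if not pred or not gold or common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

Under these assumed definitions, partial_match("wooden chair", "chair") returns roughly 0.67 while exact_match returns 0.0, and exact_match("the chair", "chair") returns 1.0 because articles are stripped during normalization.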

Qualitative Visualizations


🔍 Five Key Insights from the Hypo3D Benchmark

  • Insight 1: Models struggle with hypothetical movement and replacement changes.
  • Insight 2: Models struggle with direction-based questions.
  • Insight 3: Anchor-based frame definition improves orientation understanding.
  • Insight 4: Reasoning in hypothetically changed scenes is more challenging than in unchanged scenes.
  • Insight 5: Models hallucinate answer changes when the context change is irrelevant to the question.

BibTeX

@article{mao2025hypo3d,
  title={Hypo3D: Exploring Hypothetical Reasoning in 3D Scenes},
  author={Mao, Ye and Luo, Weixun and Jing, Junpeng and Qiu, Anlan and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2502.00954},
  year={2025}
}