3D visual question answering (3D VQA), a key task for evaluating 3D reasoning, has advanced significantly with the rise of vision-language models and benchmark datasets. Existing 3D VQA benchmarks assume that the scene is accessible in real time during questioning, which is impractical given the high cost of frequent scene updates. To address this, we introduce \textit{Hypothetical 3D Reasoning}, termed Hypo3D, a novel 3D VQA benchmark that evaluates models' ability to reason without access to the real-time scene. Models must hypothetically update a past scene based on a description of the context change before answering. The Hypo3D benchmark comprises 7,727 context changes across 700 indoor scenes, yielding 14,885 question-answer pairs. An anchor-based world frame is established for every scene, so that directional terms in context changes and QA pairs are consistently referenced to a fixed coordinate system. Extensive experiments reveal that state-of-the-art foundation models lag well behind humans when reasoning over hypothetically changed scenes, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers. The benchmark will be publicly released.
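The sketch below illustrates the idea behind an anchor-based world frame; it is not the released Hypo3D code, and the anchor object, axis conventions, and helper names are assumptions for illustration. The facing direction of a fixed anchor object defines "front", so terms like "left" or "behind" in a context change resolve to the same direction regardless of camera viewpoint.

```python
# Illustrative sketch (hypothetical names, not the Hypo3D implementation):
# resolve directional terms in a fixed, anchor-based world frame.
import numpy as np

# Assumed anchor: an object whose facing direction (in scene coordinates)
# defines the world frame's "front" axis; "up" is gravity-aligned.
anchor_facing = np.array([0.0, 1.0, 0.0])
up = np.array([0.0, 0.0, 1.0])

front = anchor_facing / np.linalg.norm(anchor_facing)
right = np.cross(front, up)          # right-hand rule gives the "right" axis
right /= np.linalg.norm(right)

# Directional terms used in context changes / QAs, expressed in the fixed frame.
DIRECTIONS = {
    "front": front,
    "behind": -front,
    "right": right,
    "left": -right,
}

def displace(position, term, distance=1.0):
    """Hypothetically move an object one step in a named direction."""
    return position + distance * DIRECTIONS[term]

# Example context change: "The chair is moved to the left of its current position."
chair_pos = np.array([2.0, 3.0, 0.0])
print(displace(chair_pos, "left"))   # same result for any camera viewpoint
```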
@article{mao2025hypo3d,
  title={Hypo3D: Exploring Hypothetical Reasoning in 3D},
  author={Mao, Ye and Luo, Weixun and Jing, Junpeng and Qiu, Anlan and Mikolajczyk, Krystian},
  journal={arXiv preprint arXiv:2502.00954},
  year={2025}
}