Figure: POMA-3D Overview. POMA-3D is a self-supervised 3D model pretrained on the large-scale point map dataset ScenePoint via alignment with 2D foundation models and the POMA-JEPA objective. The 3D features from pretrained POMA-3D transfer effectively to diverse 3D understanding tasks, including 3D visual question answering, embodied navigation, scene retrieval, and localization.
POMA-3D is the first self-supervised 3D representation learning model based on multiview point maps.
In this paper, we present POMA-3D, the first self-supervised model for 3D understanding learned from point maps. To pretrain POMA-3D, we introduce ScenePoint, a large-scale point map dataset constructed from 6K room-level RGB-D scenes and 1M 2D image scenes. Point maps encode explicit 3D coordinates on a structured 2D grid and therefore share the same input format as 2D foundation models, which lets us leverage rich 2D priors via vision-language alignment during training. Moreover, because point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint-embedding architecture that enforces geometrically consistent point map features across views. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. Its features benefit diverse 3D tasks, including 3D visual question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). POMA-3D thus explores a point map route to 3D scene understanding, addressing the scarcity of pretrained priors and the limited data available for 3D representation learning.
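To make the point map format concrete, the sketch below shows one standard way such an H×W×3 coordinate grid can be obtained from an RGB-D frame by unprojecting depth through pinhole intrinsics and a camera-to-world pose. The function name, argument layout, and NumPy-based style are illustrative assumptions, not the ScenePoint construction code.

```python
import numpy as np

def depth_to_point_map(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Unproject an H x W depth map into an H x W x 3 point map.

    depth:        (H, W) metric depth.
    K:            (3, 3) pinhole intrinsics.
    cam_to_world: (4, 4) camera-to-world pose mapping camera coordinates
                  into the shared canonical frame.
    Returns an (H, W, 3) grid of 3D coordinates aligned with the pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel column / row indices
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Back-project each pixel to camera coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)   # (H, W, 4), homogeneous

    # Transform into the canonical frame and drop the homogeneous coordinate.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[..., :3]
```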
POMA-3D is pretrained with two objectives: (1) aligning the [CLS] embedding of the point map context encoder with image and text embeddings from a frozen FG-CLIP via *L*<sub>view</sub> and *L*<sub>scene</sub>, and (2) predicting the target encoder's embeddings of masked point map patches from the context encoder's embeddings of the unmasked patches through a predictor network optimized by *L*<sub>pjepa</sub>. The target encoder is updated via EMA of the context encoder and is used for downstream 3D understanding.
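Below is a minimal PyTorch-style sketch of these two objectives, assuming a ViT-like point map encoder whose first output token is [CLS] and a frozen FG-CLIP that supplies the image and text embeddings; the module interfaces, the InfoNCE form of the alignment losses, the smooth-L1 prediction loss, and the unweighted sum are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum=0.996):
    """Exponential-moving-average update of the target encoder."""
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.data.mul_(momentum).add_(p_c.data, alpha=1.0 - momentum)

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def pretrain_step(context_enc, target_enc, predictor, point_maps, clip_img_emb, clip_txt_emb, mask):
    """One training step; clip_* embeddings come from the frozen FG-CLIP."""
    # (1) Alignment: [CLS] of the visible point map tokens vs. FG-CLIP embeddings.
    ctx_tokens = context_enc(point_maps, mask=mask)        # (B, 1 + N_visible, D)
    cls = ctx_tokens[:, 0]
    l_view = info_nce(cls, clip_img_emb)                   # per-view image alignment (L_view)
    l_scene = info_nce(cls, clip_txt_emb)                  # scene-level text alignment (L_scene)

    # (2) POMA-JEPA: predict target embeddings of masked patches from visible ones.
    with torch.no_grad():
        tgt_tokens = target_enc(point_maps)                # (B, 1 + N, D), no masking
    pred = predictor(ctx_tokens, mask)                     # embeddings at the masked positions
    l_pjepa = F.smooth_l1_loss(pred, tgt_tokens[:, 1:][mask])

    # After loss.backward() and optimizer.step(), call ema_update(target_enc, context_enc).
    return l_view + l_scene + l_pjepa
```

In a full training loop, only the context encoder and predictor receive gradients; the target encoder is refreshed by `ema_update` after each optimizer step, matching the EMA scheme described above.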
| Method | Modality | ScanQA EM@1 | ScanQA EM@10 | SQA3D EM@1 | SQA3D EM@10 | Hypo3D EM@1 | Hypo3D EM@10 | MSNN 4 dire. | MSNN 8 dire. |
|---|---|---|---|---|---|---|---|---|---|
| **3D LLM Models** | | | | | | | | | |
| LEO | PC | 24.5 | – | 50.0 | – | 16.2 | – | – | – |
| LLaVA-3D | RGB-D | 27.0 | – | 55.6 | – | 33.1 | – | 22.9 | 12.3 |
| Video-3D LLM | RGB-D | 30.1 | – | 58.6 | – | – | – | – | – |
| **2D LMM Models** | | | | | | | | | |
| Qwen2.5-VL 7B | RGB | 23.7 | – | 47.8 | – | 30.9 | – | 21.8 | 2.87 |
| LLaVA-OV 7B | RGB | 20.8 | – | 47.7 | – | 33.2 | – | 24.0 | 5.83 |
| SplatTalk | 3DGS | 22.4 | – | 47.6 | – | – | – | – | – |
| POMA-3D<sub>llm</sub> | PM | 21.3 | – | 51.6 | – | 35.9 | – | 36.9 | 21.4 |
| **Specialist Models** | | | | | | | | | |
| ScanQA | PC | 21.1 | – | 47.2 | – | – | – | – | – |
| ScanRefer + MCAN | PC | 18.6 | – | – | – | – | – | – | – |
| SQA3D | PC | – | – | 46.6 | – | – | – | – | – |
| 3D-ViSTA | PC | 22.4 | 52.1 | 48.5 | 85.6 | 31.0 | 81.2 | 39.9 | 20.1 |
| SceneVerse | PC | 22.7 | 51.5 | 49.9 | 85.0 | 31.6 | 80.3 | 36.0 | 19.5 |
| FG-CLIP | PM | 20.9 | 49.9 | 49.5 | 89.7 | 31.1 | 82.1 | 39.3 | 20.4 |
| POMA-3D<sub>spec</sub> | PM | 22.3 | 52.3 | 51.1 | 91.2 | 33.4 | 84.8 | 40.4 | 21.2 |
3D VQA results on ScanQA, SQA3D, and Hypo3D, and embodied navigation results on MSNN. “4 dire.” and “8 dire.” denote four- and eight-way navigation, respectively.
| Method | Modality | ScanRefer R@1-1 | ScanRefer R@1-5 | ScanRefer R@5-1 | ScanRefer R@5-5 | Nr3D R@1-1 | Nr3D R@1-5 | Nr3D R@5-1 | Nr3D R@5-5 | Sr3D R@1-1 | Sr3D R@1-5 | Sr3D R@5-1 | Sr3D R@5-5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D-ViSTA | PC | 0.48 | 2.27 | 0.24 | 2.03 | 0.45 | 0.60 | 0.15 | 0.60 | 0.33 | 1.15 | 0.33 | 1.48 |
| SceneVerse | PC | 0.24 | 2.27 | 0.83 | 2.03 | 0.26 | 1.82 | 0.26 | 1.56 | 0.28 | 1.99 | 0.28 | 1.70 |
| FG-CLIP | RGB | 5.10 | 16.4 | 14.9 | 42.2 | 1.37 | 6.71 | 5.18 | 17.2 | 1.35 | 6.42 | 1.86 | 10.1 |
| FG-CLIP | PM | 0.50 | 2.00 | 0.25 | 2.81 | 0.46 | 1.98 | 0.46 | 2.13 | 0.34 | 1.18 | 0.17 | 0.84 |
| POMA-3D | PM | 9.31 | 27.9 | 29.4 | 59.4 | 8.10 | 15.7 | 15.0 | 42.2 | 3.89 | 14.0 | 6.59 | 20.7 |
Scene retrieval results. The metric R@M-N denotes recall@N when retrieving the correct 3D scene given M language utterances describing it. All methods are evaluated in the zero-shot setting.
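As a reference for how this metric can be computed, the sketch below scores retrieval from pre-extracted embeddings, assuming scenes and utterances live in a shared embedding space, the M utterances per query are fused by averaging, and ranking uses cosine similarity; the function name and the fusion rule are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def recall_at_n(scene_emb, utterance_emb, gt_scene_ids, M=1, N=5):
    """Compute R@M-N for scene retrieval.

    scene_emb:     (S, D) embeddings of the candidate 3D scenes.
    utterance_emb: (Q, U, D) embeddings of U >= M utterances per query.
    gt_scene_ids:  (Q,) index of the ground-truth scene for each query.
    Returns the fraction of queries whose ground-truth scene appears in the
    top-N scenes ranked by cosine similarity to the fused M-utterance query.
    """
    query = F.normalize(utterance_emb[:, :M].mean(dim=1), dim=-1)   # fuse M utterances
    scenes = F.normalize(scene_emb, dim=-1)
    sims = query @ scenes.t()                                       # (Q, S) cosine similarities
    topn = sims.topk(N, dim=-1).indices                             # (Q, N)
    hits = (topn == gt_scene_ids.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# e.g. recall_at_n(scene_emb, utt_emb, gt_ids, M=5, N=1) corresponds to the R@5-1 column.
```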
Figure: Qualitative results of embodied localization. Top: text describing the agent's current situation. Bottom: merged multi-view point maps, where red regions indicate the point map views retrieved by POMA-3D based on the situational text.
@article{mao2025poma3d,
  title={POMA-3D: The Point Map Way to 3D Scene Understanding},
  author={Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  journal={arXiv},
  year={2025}
}