POMA-3D: The Point Map Way to 3D Scene Understanding

Figure: POMA-3D Overview. POMA-3D is a self-supervised 3D model pretrained on the large-scale point map dataset ScenePoint via alignment with 2D foundation models and the POMA-JEPA objective. The 3D features from pretrained POMA-3D transfer effectively to diverse 3D understanding tasks, including 3D visual question answering, embodied navigation, scene retrieval, and localization.

POMA-3D is the first self-supervised 3D representation learning model based on multiview point maps.

Abstract

In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, we design a view-to-scene alignment strategy. Moreover, because point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning.
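
Concretely, a point map is an H x W grid whose entries are 3D coordinates rather than colors. The sketch below illustrates one common way such a map can be built, by back-projecting a depth image into a canonical scene frame; it is a minimal illustration with assumed names (e.g., depth_to_point_map), not code from the POMA-3D release.

# Minimal sketch (not the released POMA-3D code): a point map stores an
# explicit 3D coordinate for every pixel of an H x W grid, so it keeps
# global geometry while matching the 2D input format of image backbones.
import numpy as np

def depth_to_point_map(depth, K, cam_to_world):
    """Back-project a depth map (H, W) into a point map (H, W, 3).

    K            : (3, 3) camera intrinsics
    cam_to_world : (4, 4) camera-to-canonical-space pose
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # homogeneous pixels (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                           # per-pixel camera rays
    pts_cam = rays * depth[..., None]                         # 3D points in the camera frame
    pts_h = np.concatenate([pts_cam, np.ones((H, W, 1))], -1) # homogeneous (H, W, 4)
    pts_world = pts_h @ cam_to_world.T                        # transform to the canonical frame
    return pts_world[..., :3].astype(np.float32)              # point map (H, W, 3)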

POMA-3D Architecture

POMA-3D is pretrained with two objectives: (1) aligning the [CLS] embedding from the point map context encoder with image and text embeddings from the frozen FG-CLIP model via L_view and L_scene, and (2) predicting masked point map embeddings produced by the target encoder from the unmasked embeddings of the context encoder through a predictor network optimized with L_pjepa. The target encoder is updated via EMA and used for downstream 3D understanding.
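
The sketch below shows how such a training step could be organized. It is a simplified approximation under assumed module interfaces (ctx_enc, tgt_enc, predictor, fg_clip), a generic InfoNCE-style alignment loss, and equal loss weights; it is not the authors' implementation.

# Simplified one-step sketch of the two pretraining objectives (assumed names,
# not the official code): FG-CLIP alignment on the [CLS] token plus POMA-JEPA
# masked-feature prediction with an EMA-updated target encoder.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def pretrain_step(point_maps, images, texts, ctx_enc, tgt_enc, predictor, fg_clip, mask):
    # (1) View/scene alignment: [CLS] of the point map context encoder vs. frozen FG-CLIP.
    tokens, cls = ctx_enc(point_maps, mask=mask)               # unmasked patch tokens + [CLS]
    with torch.no_grad():
        img_emb, txt_emb = fg_clip(images, texts)              # frozen 2D image/text priors
    loss_view = info_nce(cls, img_emb)
    loss_scene = info_nce(cls, txt_emb)

    # (2) POMA-JEPA: predict target-encoder features at masked positions.
    with torch.no_grad():
        target_tokens = tgt_enc(point_maps)                    # full, unmasked view
    pred = predictor(tokens, mask)                             # embeddings at masked positions
    loss_pjepa = F.smooth_l1_loss(pred, target_tokens[:, mask])

    return loss_view + loss_scene + loss_pjepa

@torch.no_grad()
def ema_update(tgt_enc, ctx_enc, momentum=0.996):
    # Target encoder tracks the context encoder via an exponential moving average.
    for pt, pc in zip(tgt_enc.parameters(), ctx_enc.parameters()):
        pt.mul_(momentum).add_(pc, alpha=1 - momentum)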

3D VQA & Embodied Navigation Results

Method             Modality   ScanQA           SQA3D            Hypo3D           MSNN
                              EM@1    EM@10    EM@1    EM@10    EM@1    EM@10    4 dire.  8 dire.
3D LLM Models
LEO                PC         24.5    -        50.0    -        16.2    -        -        -
LLaVA-3D           RGB-D      27.0    -        55.6    -        33.1    -        22.9     12.3
Video-3D LLM       RGB-D      30.1    -        58.6    -        -       -        -        -
2D LMM-based Models (Pretrained/LoRA-tuned)
Qwen2.5-VL 7B      RGB        23.7    -        47.8    -        30.9    -        21.8     2.87
LLaVA-OV 7B        RGB        20.8    -        47.7    -        33.2    -        24.0     5.83
SplatTalk          3DGS       22.4    -        47.6    -        -       -        -        -
POMA-3D_llm        PM         21.3    -        51.6    -        35.9    -        36.9     21.4
Specialist Models
ScanQA             PC         21.1    47.2     -       -        -       -        -        -
SQA3D              PC         -       -        46.6    -        -       -        -        -
3D-ViSTA           PC         22.4    52.1     48.5    85.6     31.0    81.2     39.9     20.1
SceneVerse         PC         22.7    51.5     49.9    85.0     31.6    80.3     36.0     19.5
FG-CLIP            PM         20.9    49.9     49.5    89.7     31.1    82.1     39.3     20.4
POMA-3D_spec       PM         22.3    52.3     51.1    91.2     33.4    84.8     40.4     21.2

3D VQA results on ScanQA, SQA3D, and Hypo3D, and embodied navigation results on MSNN. "4 dire." and "8 dire." denote the four-direction and eight-direction navigation settings, respectively. A dash indicates a result not reported for that benchmark.

Scene Retrieval Results

Method        Mod.   ScanRefer                        Nr3D                             Sr3D
                     R@1-1   R@1-5   R@5-1   R@5-5    R@1-1   R@1-5   R@5-1   R@5-5    R@1-1   R@1-5   R@5-1   R@5-5
3D-ViSTA      PC     0.48    2.27    0.24    2.03     0.45    0.60    0.15    0.60     0.33    1.15    0.33    1.48
SceneVerse    PC     0.24    2.27    0.83    2.03     0.26    1.82    0.26    1.56     0.28    1.99    0.28    1.70
FG-CLIP       RGB    5.10    16.4    14.9    42.2     1.37    6.71    5.18    17.2     1.35    6.42    1.86    10.1
FG-CLIP       PM     0.50    2.00    0.25    2.81     0.46    1.98    0.46    2.13     0.34    1.18    0.17    0.84
POMA-3D       PM     9.31    27.9    29.4    59.4     8.10    15.7    15.0    42.2     3.89    14.0    6.59    20.7

Scene Retrieval Results. The metric R@M-N denotes recall@N for retrieving the correct 3D scene from M language utterances. All methods are evaluated in the zero-shot setting.
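.
For clarity, the sketch below shows one way the R@M-N protocol could be computed: embed the M utterances of each query and all candidate scenes, aggregate utterance-to-scene similarities (summation here is an assumed aggregation choice), and check whether the ground-truth scene lands in the top N. It is an illustrative sketch, not the official evaluation script.

# Hypothetical sketch of the R@M-N scene retrieval metric (not official code):
# score every candidate scene from M utterances and count how often the
# ground-truth scene appears among the top-N retrieved scenes.
import torch
import torch.nn.functional as F

def recall_at_m_n(text_emb, scene_emb, gt_scene_ids, N):
    """
    text_emb      : (Q, M, D) embeddings of the M utterances per query
    scene_emb     : (S, D)    embeddings of candidate 3D scenes (point maps)
    gt_scene_ids  : (Q,)      index of the correct scene for each query
    """
    text_emb = F.normalize(text_emb, dim=-1)
    scene_emb = F.normalize(scene_emb, dim=-1)
    sim = torch.einsum('qmd,sd->qms', text_emb, scene_emb)   # (Q, M, S) cosine similarities
    scores = sim.sum(dim=1)                                   # aggregate the M utterances
    topn = scores.topk(N, dim=-1).indices                     # (Q, N) retrieved scene ids
    hits = (topn == gt_scene_ids[:, None]).any(dim=-1)
    return hits.float().mean().item()                         # recall@N over all queries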

Embodied Localization Results


Figure: Qualitative results of embodied localization. Top: the text describing the agent's current situation. Bottom: merged multi-view point maps, where red regions indicate the point map views retrieved by POMA-3D from the situational text.
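
As an illustration of this retrieval-style localization, the sketch below matches a situational description against the embeddings of a scene's point map views and returns the best-scoring views. The encoder names and interfaces are placeholders, not the released POMA-3D API.

# Illustrative sketch of embodied localization as view retrieval (assumed
# interfaces): embed the situational text and every point map view of the
# scene in a shared space, then keep the top-k most similar views.
import torch
import torch.nn.functional as F

@torch.no_grad()
def localize(situation_text, view_point_maps, poma_encoder, text_encoder, k=2):
    txt = F.normalize(text_encoder(situation_text), dim=-1)      # (1, D) text embedding
    views = F.normalize(poma_encoder(view_point_maps), dim=-1)   # (V, D) view embeddings
    sim = (views @ txt.t()).squeeze(-1)                          # (V,) cosine similarities
    return sim.topk(min(k, sim.numel())).indices                 # indices of retrieved views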

Citation

@article{poma3d2025,
  title={POMA-3D: The Point Map Way to 3D Scene Understanding},
  author={Ye Mao and Weixun Luo and Ranran Huang and Junpeng Jing and Krystian Mikolajczyk},
  journal={arXiv},
  year={2025}
}
            