POMA-3D: The Point Map Way to 3D Scene Understanding

Figure: POMA-3D Overview. POMA-3D is a self-supervised 3D model pretrained on the large-scale point map dataset ScenePoint via alignment with 2D foundation models and the POMA-JEPA objective. The 3D features from pretrained POMA-3D transfer effectively to diverse 3D understanding tasks, including 3D visual question answering, embodied navigation, scene retrieval, and localization.

POMA-3D is the first self-supervised 3D representation learning model based on multiview point maps.

Abstract

In this paper, we present POMA-3D, the first self-supervised model for 3D understanding learned from point maps. To pretrain POMA-3D, we introduce ScenePoint, a large-scale point map dataset constructed from 6K room-level RGB-D scenes and 1M 2D image scenes. Point maps encode explicit 3D coordinates on a structured 2D grid, so they share the input format of 2D foundation models; we exploit this to transfer rich 2D priors via vision-language alignment during training. Moreover, because point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint-embedding architecture that enforces geometrically consistent point map features across views. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. Its features benefit diverse 3D tasks, including 3D visual question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). POMA-3D thus explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and the limited data available for 3D representation learning.
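
Concretely, a point map is an image-shaped grid whose pixels store 3D coordinates rather than colors. The sketch below shows one common way to build such a grid from a depth image and pinhole intrinsics; the function name, arguments, and the optional canonical-frame transform are illustrative assumptions, not the released ScenePoint pipeline.

import numpy as np

def depth_to_point_map(depth, fx, fy, cx, cy, cam_to_canonical=None):
    """Back-project a depth image (H, W) into a point map (H, W, 3).

    Each pixel stores the XYZ coordinates of the observed surface point,
    so the result keeps the 2D grid structure of an ordinary image.
    Illustrative sketch; not the official POMA-3D preprocessing code.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices

    # Pinhole back-projection into the camera frame.
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    points = np.stack([x, y, z], axis=-1)  # (H, W, 3)

    # Optionally express the coordinates in a shared canonical space,
    # which is what makes multi-view point maps comparable across views.
    if cam_to_canonical is not None:  # assumed 4x4 rigid transform
        R, t = cam_to_canonical[:3, :3], cam_to_canonical[:3, 3]
        points = points @ R.T + t

    return points.astype(np.float32)

Because the output keeps the (H, W) layout, it can be fed to architectures designed for images, which is what enables the alignment with 2D foundation models described below.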

POMA-3D Architecture

POMA-3D is pretrained with two objectives: (1) aligning the [CLS] embeddings produced by the point map context encoder with image and text embeddings from a frozen FG-CLIP model via the L_view and L_scene losses, and (2) predicting masked point map embeddings from the target encoder, using the unmasked embeddings from the context encoder passed through a predictor network optimized with L_pjepa. The target encoder is updated as an exponential moving average (EMA) of the context encoder and is used for downstream 3D understanding.
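
As a rough illustration of how these pieces fit together, the following PyTorch-style sketch combines the alignment terms with the POMA-JEPA prediction loss and the EMA update. The module interfaces, the cosine-distance form used for L_view and L_scene, the loss weights, and the momentum value are assumptions for readability, not the paper's exact formulation or hyperparameters.

import torch
import torch.nn.functional as F

def pretrain_step(point_maps, clip_img_emb, clip_txt_emb,
                  context_enc, target_enc, predictor, mask,
                  w_align=1.0, w_jepa=1.0, momentum=0.996):
    """One schematic POMA-3D pretraining step (illustrative, not the official code)."""
    # (1) Alignment: the context encoder's [CLS] embedding is pulled towards the
    #     frozen FG-CLIP image/text embeddings (stand-ins for L_view / L_scene,
    #     written here as cosine distances for brevity).
    cls_emb, patch_emb = context_enc(point_maps, mask=mask)  # assumed interface
    cls_emb = F.normalize(cls_emb, dim=-1)
    l_view = 1 - (cls_emb * F.normalize(clip_img_emb, dim=-1)).sum(-1).mean()
    l_scene = 1 - (cls_emb * F.normalize(clip_txt_emb, dim=-1)).sum(-1).mean()

    # (2) POMA-JEPA: predict the target encoder's embeddings of the masked point
    #     map patches from the visible (unmasked) context embeddings.
    with torch.no_grad():
        target_emb = target_enc(point_maps)      # full, unmasked input
    pred_emb = predictor(patch_emb, mask)        # predictions at the masked slots
    l_pjepa = F.smooth_l1_loss(pred_emb, target_emb[:, mask])

    loss = w_align * (l_view + l_scene) + w_jepa * l_pjepa
    loss.backward()  # gradients for context encoder + predictor (optimizer step omitted)

    # (3) EMA update: the target encoder slowly tracks the context encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(momentum).add_((1 - momentum) * p_c)

    return loss.detach()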

3D VQA & Embodied Navigation Results

Method | Modality | ScanQA (EM@1 / EM@10) | SQA3D (EM@1 / EM@10) | Hypo3D (EM@1 / EM@10) | MSNN (4 dire. / 8 dire.)
3D LLM Models
LEO | PC | 24.5 / - | 50.0 / - | 16.2 / - | - / -
LLaVA-3D | RGB-D | 27.0 / - | 55.6 / - | 33.1 / - | 22.9 / 12.3
Video-3D LLM | RGB-D | 30.1 / - | 58.6 / - | - / - | - / -
2D LMM Models
Qwen2.5-VL 7B | RGB | 23.7 / - | 47.8 / - | 30.9 / - | 21.8 / 2.87
LLaVA-OV 7B | RGB | 20.8 / - | 47.7 / - | 33.2 / - | 24.0 / 5.83
SplatTalk | 3DGS | 22.4 / - | 47.6 / - | - / - | - / -
POMA-3D_llm | PM | 21.3 / - | 51.6 / - | 35.9 / - | 36.9 / 21.4
Specialist Models
ScanQA | PC | 21.1 / 47.2 | - / - | - / - | - / -
ScanRefer + MCAN | PC | 18.6 / - | - / - | - / - | - / -
SQA3D | PC | - / - | 46.6 / - | - / - | - / -
3D-ViSTA | PC | 22.4 / 52.1 | 48.5 / 85.6 | 31.0 / 81.2 | 39.9 / 20.1
SceneVerse | PC | 22.7 / 51.5 | 49.9 / 85.0 | 31.6 / 80.3 | 36.0 / 19.5
FG-CLIP | PM | 20.9 / 49.9 | 49.5 / 89.7 | 31.1 / 82.1 | 39.3 / 20.4
POMA-3D_spec | PM | 22.3 / 52.3 | 51.1 / 91.2 | 33.4 / 84.8 | 40.4 / 21.2

3D VQA results on ScanQA, SQA3D, and Hypo3D, and embodied navigation results on MSNN. “4 dire.” and “8 dire.” denote four- and eight-way navigation, respectively.

Scene Retrieval Results

Method | Mod. | ScanRefer (R@1-1 / R@1-5 / R@5-1 / R@5-5) | Nr3D (R@1-1 / R@1-5 / R@5-1 / R@5-5) | Sr3D (R@1-1 / R@1-5 / R@5-1 / R@5-5)
3D-ViSTA | PC | 0.48 / 2.27 / 0.24 / 2.03 | 0.45 / 0.60 / 0.15 / 0.60 | 0.33 / 1.15 / 0.33 / 1.48
SceneVerse | PC | 0.24 / 2.27 / 0.83 / 2.03 | 0.26 / 1.82 / 0.26 / 1.56 | 0.28 / 1.99 / 0.28 / 1.70
FG-CLIP | RGB | 5.10 / 16.4 / 14.9 / 42.2 | 1.37 / 6.71 / 5.18 / 17.2 | 1.35 / 6.42 / 1.86 / 10.1
FG-CLIP | PM | 0.50 / 2.00 / 0.25 / 2.81 | 0.46 / 1.98 / 0.46 / 2.13 | 0.34 / 1.18 / 0.17 / 0.84
POMA-3D | PM | 9.31 / 27.9 / 29.4 / 59.4 | 8.10 / 15.7 / 15.0 / 42.2 | 3.89 / 14.0 / 6.59 / 20.7

Scene Retrieval Results. The metric R@M-N denotes recall@N for retrieving the correct 3D scene from M language utterances. All methods are evaluated in the zero-shot setting.
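As a reference for how this metric can be computed, the sketch below scores each query's M utterance embeddings against all candidate scene embeddings, averages the per-utterance similarities, and counts a hit when the ground-truth scene lands in the top N. The averaging step and all names are illustrative assumptions; the paper's exact aggregation may differ.

import numpy as np

def recall_at_m_n(text_emb, scene_emb, gt_scene_ids, m, n):
    """Compute R@M-N: recall@N for retrieving the correct scene from M utterances.

    text_emb:      (Q, M, D) embeddings of the M utterances per query
    scene_emb:     (S, D)    embeddings of the candidate scenes
    gt_scene_ids:  (Q,)      index of the ground-truth scene for each query
    Illustrative sketch only.
    """
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    scene_emb = scene_emb / np.linalg.norm(scene_emb, axis=-1, keepdims=True)

    # Cosine similarity of every utterance to every scene, averaged over the M utterances.
    sim = np.einsum('qmd,sd->qms', text_emb[:, :m], scene_emb).mean(axis=1)  # (Q, S)

    # A query counts as a hit if its ground-truth scene is among the top-N ranked scenes.
    top_n = np.argsort(-sim, axis=1)[:, :n]
    hits = (top_n == gt_scene_ids[:, None]).any(axis=1)
    return hits.mean()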

Embodied Localization Results


Figure: Qualitative results of embodied localization. Top: text describing the agent's current situation. Bottom: merged multi-view point maps, where red regions indicate the point map views retrieved by POMA-3D based on the situational text.

Citation

@article{poma3d2025yourname,
  title={POMA-3D: The Point Map Way to 3D Scene Understanding},
  author={Ye Mao and Weixun Luo and Ranran Huang and Junpeng Jing and Krystian Mikolajczyk},
  journal={arXiv},
  year={2025}
}