Figure: POMA-3D Overview. POMA-3D is a self-supervised 3D model pretrained on the large-scale point map dataset ScenePoint via alignment with 2D foundation models and the POMA-JEPA objective. The 3D features from pretrained POMA-3D transfer effectively to diverse 3D understanding tasks, including 3D visual question answering, embodied navigation, scene retrieval, and localization.
POMA-3D is the first self-supervised 3D representation learning model based on multiview point maps.
In this paper, we present POMA-3D, the first self-supervised model for 3D understanding learned from point maps. To pretrain POMA-3D, we introduce ScenePoint, a large-scale point map dataset constructed from 6K room-level RGB-D scenes and 1M 2D image scenes. Point maps encode explicit 3D coordinates on a structured 2D grid and therefore share the same input format as 2D foundation models, which lets us leverage rich 2D priors via vision-language alignment during training. Moreover, because point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint-embedding architecture that enforces geometrically consistent point map features across views. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. Its features benefit diverse 3D tasks, including 3D visual question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). POMA-3D thus explores a point map route to 3D scene understanding, addressing the scarcity of pretrained priors and the limited data available for 3D representation learning.
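To make the point map format concrete, the sketch below shows one standard way such an H×W×3 coordinate grid can be obtained from an RGB-D frame by unprojecting depth through pinhole intrinsics and a camera-to-world pose. The function name, argument layout, and NumPy-based style are illustrative assumptions, not the ScenePoint construction code.

```python
import numpy as np

def depth_to_point_map(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Unproject an H x W depth map into an H x W x 3 point map.

    depth:        (H, W) metric depth.
    K:            (3, 3) pinhole intrinsics.
    cam_to_world: (4, 4) camera-to-world pose mapping camera coordinates
                  into the shared canonical frame.
    Returns an (H, W, 3) grid of 3D coordinates aligned with the pixels.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel column / row indices
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Back-project each pixel to camera coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)   # (H, W, 4), homogeneous

    # Transform into the canonical frame and drop the homogeneous coordinate.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[..., :3]
```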
POMA-3D is pretrained with two objectives: (1) aligning the [CLS] embedding of the point map context encoder with image and text embeddings from a frozen FG-CLIP via *L*<sub>view</sub> and *L*<sub>scene</sub>, and (2) predicting the target encoder's embeddings of masked point map patches from the context encoder's embeddings of the unmasked patches through a predictor network optimized by *L*<sub>pjepa</sub>. The target encoder is updated via EMA of the context encoder and is used for downstream 3D understanding.
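Below is a minimal PyTorch-style sketch of these two objectives, assuming a ViT-like point map encoder whose first output token is [CLS] and a frozen FG-CLIP that supplies the image and text embeddings; the module interfaces, the InfoNCE form of the alignment losses, the smooth-L1 prediction loss, and the unweighted sum are assumptions for illustration, not the released training code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_enc, context_enc, momentum=0.996):
    """Exponential-moving-average update of the target encoder."""
    for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
        p_t.data.mul_(momentum).add_(p_c.data, alpha=1.0 - momentum)

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def pretrain_step(context_enc, target_enc, predictor, point_maps, clip_img_emb, clip_txt_emb, mask):
    """One training step; clip_* embeddings come from the frozen FG-CLIP."""
    # (1) Alignment: [CLS] of the visible point map tokens vs. FG-CLIP embeddings.
    ctx_tokens = context_enc(point_maps, mask=mask)        # (B, 1 + N_visible, D)
    cls = ctx_tokens[:, 0]
    l_view = info_nce(cls, clip_img_emb)                   # per-view image alignment (L_view)
    l_scene = info_nce(cls, clip_txt_emb)                  # scene-level text alignment (L_scene)

    # (2) POMA-JEPA: predict target embeddings of masked patches from visible ones.
    with torch.no_grad():
        tgt_tokens = target_enc(point_maps)                # (B, 1 + N, D), no masking
    pred = predictor(ctx_tokens, mask)                     # embeddings at the masked positions
    l_pjepa = F.smooth_l1_loss(pred, tgt_tokens[:, 1:][mask])

    # After loss.backward() and optimizer.step(), call ema_update(target_enc, context_enc).
    return l_view + l_scene + l_pjepa
```

In a full training loop, only the context encoder and predictor receive gradients; the target encoder is refreshed by `ema_update` after each optimizer step, matching the EMA scheme described above.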
| Method | Modality | ScanQA EM@1 | ScanQA EM@10 | SQA3D EM@1 | SQA3D EM@10 | Hypo3D EM@1 | Hypo3D EM@10 | MSNN 4 dire. | MSNN 8 dire. |
|---|---|---|---|---|---|---|---|---|---|
| **3D LLM Models** | | | | | | | | | |
| LEO | PC | 24.5 | – | 50.0 | – | 16.2 | – | – | – |
| LLaVA-3D | RGB-D | 27.0 | – | 55.6 | – | 33.1 | – | 22.9 | 12.3 |
| Video-3D LLM | RGB-D | 30.1 | – | 58.6 | – | – | – | – | – |
| **2D LMM Models** | | | | | | | | | |
| Qwen2.5-VL 7B | RGB | 23.7 | – | 47.8 | – | 30.9 | – | 21.8 | 2.87 |
| LLaVA-OV 7B | RGB | 20.8 | – | 47.7 | – | 33.2 | – | 24.0 | 5.83 |
| SplatTalk | 3DGS | 22.4 | – | 47.6 | – | – | – | – | – |
| POMA-3D<sub>llm</sub> | PM | 21.3 | – | 51.6 | – | 35.9 | – | 36.9 | 21.4 |
| **Specialist Models** | | | | | | | | | |
| ScanQA | PC | 21.1 | – | 47.2 | – | – | – | – | – |
| ScanRefer + MCAN | PC | 18.6 | – | – | – | – | – | – | – |
| SQA3D | PC | – | – | 46.6 | – | – | – | – | – |
| 3D-ViSTA | PC | 22.4 | 52.1 | 48.5 | 85.6 | 31.0 | 81.2 | 39.9 | 20.1 |
| SceneVerse | PC | 22.7 | 51.5 | 49.9 | 85.0 | 31.6 | 80.3 | 36.0 | 19.5 |
| FG-CLIP | PM | 20.9 | 49.9 | 49.5 | 89.7 | 31.1 | 82.1 | 39.3 | 20.4 |
| POMA-3D<sub>spec</sub> | PM | 22.3 | 52.3 | 51.1 | 91.2 | 33.4 | 84.8 | 40.4 | 21.2 |
3D VQA results on ScanQA, SQA3D, and Hypo3D, and embodied navigation results on MSNN. “4 dire.” and “8 dire.” denote four- and eight-way navigation, respectively.
| Method | Modality | ScanRefer R@1-1 | ScanRefer R@1-5 | ScanRefer R@5-1 | ScanRefer R@5-5 | Nr3D R@1-1 | Nr3D R@1-5 | Nr3D R@5-1 | Nr3D R@5-5 | Sr3D R@1-1 | Sr3D R@1-5 | Sr3D R@5-1 | Sr3D R@5-5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D-ViSTA | PC | 0.48 | 2.27 | 0.24 | 2.03 | 0.45 | 0.60 | 0.15 | 0.60 | 0.33 | 1.15 | 0.33 | 1.48 |
| SceneVerse | PC | 0.24 | 2.27 | 0.83 | 2.03 | 0.26 | 1.82 | 0.26 | 1.56 | 0.28 | 1.99 | 0.28 | 1.70 |
| FG-CLIP | RGB | 5.10 | 16.4 | 14.9 | 42.2 | 1.37 | 6.71 | 5.18 | 17.2 | 1.35 | 6.42 | 1.86 | 10.1 |
| FG-CLIP | PM | 0.50 | 2.00 | 0.25 | 2.81 | 0.46 | 1.98 | 0.46 | 2.13 | 0.34 | 1.18 | 0.17 | 0.84 |
| POMA-3D | PM | 9.31 | 27.9 | 29.4 | 59.4 | 8.10 | 15.7 | 15.0 | 42.2 | 3.89 | 14.0 | 6.59 | 20.7 |
Scene retrieval results. The metric R@M-N denotes recall@N when retrieving the correct 3D scene given M language utterances describing it. All methods are evaluated in the zero-shot setting.
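As a reference for how this metric can be computed, the sketch below scores retrieval from pre-extracted embeddings, assuming scenes and utterances live in a shared embedding space, the M utterances per query are fused by averaging, and ranking uses cosine similarity; the function name and the fusion rule are illustrative assumptions rather than the paper's exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def recall_at_n(scene_emb, utterance_emb, gt_scene_ids, M=1, N=5):
    """Compute R@M-N for scene retrieval.

    scene_emb:     (S, D) embeddings of the candidate 3D scenes.
    utterance_emb: (Q, U, D) embeddings of U >= M utterances per query.
    gt_scene_ids:  (Q,) index of the ground-truth scene for each query.
    Returns the fraction of queries whose ground-truth scene appears in the
    top-N scenes ranked by cosine similarity to the fused M-utterance query.
    """
    query = F.normalize(utterance_emb[:, :M].mean(dim=1), dim=-1)   # fuse M utterances
    scenes = F.normalize(scene_emb, dim=-1)
    sims = query @ scenes.t()                                       # (Q, S) cosine similarities
    topn = sims.topk(N, dim=-1).indices                             # (Q, N)
    hits = (topn == gt_scene_ids.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# e.g. recall_at_n(scene_emb, utt_emb, gt_ids, M=5, N=1) corresponds to the R@5-1 column.
```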
Figure: Qualitative results of embodied localization. Top: text describing the agent's current situation. Bottom: merged multi-view point maps, where red regions indicate the point map views retrieved by POMA-3D based on the situational text.
@article{mao2025poma3d,
  title={POMA-3D: The Point Map Way to 3D Scene Understanding},
  author={Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  journal={arXiv},
  year={2025}
}