Tango3D: Towards Alignment for Global and Local 2D-3D Correspondence

Abstract

Existing 3D foundation models typically align point clouds to a frozen vision-language space such as CLIP, achieving strong cross-modal retrieval by compressing each 3D shape into a single global vector. However, this global-only alignment cannot establish fine-grained pixel-to-point correspondence. To address this, we present Tango3D, a foundation model that unifies dense correspondence and global retrieval. We use a geometry-aware 2D visual backbone to encode images into 2D patch features and a pretrained 3D VAE to encode point clouds into 3D tokens. Both are mapped into a single shared space to achieve local pixel-to-point alignment and global semantic alignment at the same time. To stabilize the joint learning of the dense and global objectives, we introduce a three-stage progressive training strategy. Experiments show that Tango3D is the first model to achieve object-level pixel-to-point alignment while maintaining competitive global retrieval. By establishing a fine-grained aligned feature space, Tango3D injects rich semantics into purely geometric 3D tokens, paving the way for a wide range of dense 3D downstream tasks.

Method

Pixel-to-Point Correspondence: 2D → 3D

A single query pixel on the 2D image accurately localizes the corresponding geometric region on the 3D mesh, confirming robust fine-grained spatial grounding.
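One plausible way to read such a query out of a shared descriptor space is a nearest-neighbour lookup by cosine similarity. The sketch below is illustrative only and is not taken from the Tango3D code: the function name `pixel_to_point`, the array layouts, and the `patch_size` parameter are all assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def pixel_to_point(patch_feats, point_feats, pixel_xy, patch_size, top_k=1):
    """Return indices and similarities of the 3D tokens closest to one pixel.

    Hypothetical layout (an assumption, not the authors' API):
      patch_feats: (Hp, Wp, D) 2D patch descriptors in the shared space
      point_feats: (N, D) 3D token descriptors in the same space
      pixel_xy:    (x, y) pixel coordinates of the query
    """
    x, y = pixel_xy
    # Map the pixel to the patch that contains it.
    row, col = y // patch_size, x // patch_size
    query = l2_normalize(patch_feats[row, col])
    # Cosine similarity between the query patch and every 3D token.
    sims = l2_normalize(point_feats) @ query
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]
```

Calling `pixel_to_point(patch_feats, point_feats, (x, y), patch_size)` with `top_k` greater than 1 would return a small neighbourhood of tokens, which may be a more forgiving way to highlight a region than a single best match.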

Cross-Instance Point Matching: 2D → 3D

A pixel on one object successfully retrieves the semantically equivalent 3D region on entirely different instances, indicating that the shared space captures category-level semantics rather than merely overfitting to exact shapes.

Reverse Grounding: 3D → 2D

A single 3D point correctly grounds back to corresponding pixels across diverse camera poses, verifying strict multi-view consistency of the shared descriptor space.
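The reverse direction can be sketched the same way: one 3D descriptor is compared against the patch grid of every view, and the best patch per view is converted back to pixel coordinates. This is a minimal illustration under assumed data layouts; `ground_point_in_views` and its arguments are hypothetical names, not part of any released interface.

```python
import numpy as np


def ground_point_in_views(point_feat, view_patch_feats, patch_size):
    """Locate one 3D descriptor in several rendered views.

    Hypothetical layout (an assumption):
      point_feat:       (D,) descriptor of the query 3D point
      view_patch_feats: list of (Hp, Wp, D) patch grids, one per camera pose
    Returns one (x, y, similarity) tuple per view, where (x, y) is the
    pixel centre of the best-matching patch.
    """
    q = point_feat / (np.linalg.norm(point_feat) + 1e-8)
    hits = []
    for feats in view_patch_feats:
        hp, wp, d = feats.shape
        flat = feats.reshape(-1, d)
        flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
        sims = flat @ q
        best = int(sims.argmax())
        # Convert the flat patch index back to a (row, col) grid position.
        row, col = divmod(best, wp)
        hits.append(((col + 0.5) * patch_size,
                     (row + 0.5) * patch_size,
                     float(sims[best])))
    return hits
```

If the descriptor space is consistent across views, the per-view similarities returned here should stay high for every pose in which the point is visible; a sharp drop in one view would suggest occlusion or a mismatch.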

Cross-Instance Shape Matching: 3D → 3D

Pure 3D-to-3D queries reveal that the 3D tokens acquire semantic part-level alignment, accurately matching corresponding structural parts across different shapes without any 2D input.
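For matching parts between two shapes, a common generic filter is to keep only mutual nearest neighbours, i.e. pairs of tokens that each pick the other as their best match. The sketch below applies that filter to two sets of 3D token descriptors. Note that mutual nearest-neighbour filtering is a standard technique swapped in here for illustration; the source does not state that Tango3D uses it, and `mutual_matches` is a hypothetical name.

```python
import numpy as np


def mutual_matches(tokens_a, tokens_b):
    """Return (i, j) pairs where token i of shape A and token j of shape B
    are each other's nearest neighbour under cosine similarity.

    tokens_a: (Na, D), tokens_b: (Nb, D); both layouts are assumptions.
    """
    a = tokens_a / (np.linalg.norm(tokens_a, axis=1, keepdims=True) + 1e-8)
    b = tokens_b / (np.linalg.norm(tokens_b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                      # (Na, Nb) cosine similarities
    a_to_b = sim.argmax(axis=1)        # best B token for each A token
    b_to_a = sim.argmax(axis=0)        # best A token for each B token
    # Keep A tokens whose best match in B points straight back at them.
    keep = b_to_a[a_to_b] == np.arange(len(a))
    return [(int(i), int(a_to_b[i])) for i in np.flatnonzero(keep)]
```

The mutual check discards one-sided matches, which tends to matter when the two shapes have different numbers of parts.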

Part Transfer: 2D → 3D

By coupling SAM with our local descriptor space, an arbitrary 2D part mask is transferred onto the 3D mesh without any part-level supervision.
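One simple way such a transfer could work is to let every 3D token inherit the mask value of its most similar 2D patch. The sketch below assumes the binary pixel mask has already been produced elsewhere (for example by SAM) and does not call SAM itself. The function name, the majority-vote pooling of the mask to patch resolution, and the nearest-patch rule are all illustrative assumptions rather than the authors' procedure.

```python
import numpy as np


def transfer_part_mask(patch_feats, point_feats, pixel_mask, patch_size):
    """Label each 3D token as inside or outside a 2D part mask.

    Hypothetical layout (an assumption):
      patch_feats: (Hp, Wp, D) 2D patch descriptors
      point_feats: (N, D) 3D token descriptors
      pixel_mask:  (H, W) boolean mask with H >= Hp * patch_size
                   and W >= Wp * patch_size
    Returns a boolean array of shape (N,).
    """
    hp, wp, d = patch_feats.shape
    # Pool the pixel mask to patch resolution: a patch belongs to the part
    # if more than half of its pixels are masked.
    m = pixel_mask[:hp * patch_size, :wp * patch_size].astype(float)
    patch_mask = m.reshape(hp, patch_size, wp, patch_size).mean(axis=(1, 3)) > 0.5

    p = patch_feats.reshape(-1, d)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-8)
    q = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    # Each 3D token takes the label of its most similar 2D patch.
    nearest = (q @ p.T).argmax(axis=1)
    return patch_mask.reshape(-1)[nearest]
```

A similarity threshold could be added so that tokens with no convincing 2D match (for example, geometry hidden from the camera) stay unlabeled; that refinement is omitted here for brevity.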

Image-to-Shape Retrieval: 2D → 3D

Using the 2D global descriptor, Tango3D retrieves the most similar 3D shapes from a large gallery ranked by cosine similarity.

Gallery layout: query image, followed by the Top-1 to Top-4 retrieved shapes and the two lowest-ranked shapes (Last-2, Last-1).
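Ranking a gallery by cosine similarity, as described above, can be sketched in a few lines. The descriptor arrays are assumed to come from the model's global heads; `rank_gallery` and `top_and_last` are hypothetical helper names used only for this illustration.

```python
import numpy as np


def rank_gallery(query_desc, gallery_descs):
    """Sort gallery shapes by cosine similarity to a query descriptor.

    query_desc:    (D,) global descriptor of the query (assumed layout)
    gallery_descs: (M, D) global descriptors of the gallery shapes
    Returns (order, sims), both sorted from most to least similar.
    """
    q = query_desc / (np.linalg.norm(query_desc) + 1e-8)
    g = gallery_descs / (np.linalg.norm(gallery_descs, axis=1, keepdims=True) + 1e-8)
    sims = g @ q
    order = np.argsort(-sims, kind="stable")
    return order, sims[order]


def top_and_last(order, k_top=4, k_last=2):
    """Pick the best k_top and the worst k_last gallery indices."""
    return order[:k_top], order[-k_last:]
```

Showing the lowest-ranked shapes next to the top results gives a quick visual check that the similarity scores separate related shapes from unrelated ones. The same two functions apply unchanged to shape-to-shape retrieval when the query descriptor comes from the 3D branch.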

Shape-to-Shape Retrieval: 3D → 3D

Using only the 3D global descriptor without any 2D input, retrieved shapes share fine-grained geometric and topological structure beyond basic category labels.

Gallery layout: query 3D shape, followed by the Top-1 to Top-4 retrieved shapes and the two lowest-ranked shapes (Last-2, Last-1).