Improve model card: Add metadata, links, abstract, and usage for Concerto

#1
by nielsr HF Staff - opened
Files changed (1)
README.md +42 -1
README.md CHANGED
@@ -1 +1,42 @@
- Model weights for [Concerto](arxiv.org/abs/2510.23607)
+ ---
+ pipeline_tag: graph-ml
+ library_name: pytorch
+ license: apache-2.0
+ tags:
+ - 3d
+ - point-cloud
+ - self-supervised-learning
+ ---
+
+ # Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
+
+ This repository contains the model weights for **Concerto**, a novel approach for learning robust spatial representations presented in the paper [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607).
+
+ - **Paper:** [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607)
+ - **Project Page:** [https://pointcept.github.io/Concerto/](https://pointcept.github.io/Concerto/)
+ - **Codebase:** [https://github.com/Pointcept/Pointcept](https://github.com/Pointcept/Pointcept)
+
+ ## Abstract
+ Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
+
+ ## Usage
+ For detailed installation, data preparation, training, and testing instructions, please refer to the [official GitHub repository](https://github.com/Pointcept/Pointcept).
+
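+ As a minimal sketch (the `repo_id` and checkpoint filename below are placeholders, not confirmed names; check this repository's file listing for the actual checkpoint), the released weights can be downloaded with `huggingface_hub` and loaded as a standard PyTorch state dict, then plugged into a Concerto model built from the Pointcept codebase:
+
+ ```python
+ # Minimal sketch: download a Concerto checkpoint from the Hub and inspect it.
+ # NOTE: repo_id and filename are placeholders -- replace them with the actual
+ # values shown in this repository's file listing.
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ ckpt_path = hf_hub_download(
+     repo_id="Pointcept/Concerto",   # placeholder repo id
+     filename="concerto.pth",        # placeholder checkpoint filename
+ )
+
+ # The checkpoint is a regular PyTorch object; build the Concerto model with the
+ # Pointcept codebase (linked above) and load these weights into it.
+ state_dict = torch.load(ckpt_path, map_location="cpu")
+ print(sorted(state_dict.keys())[:5] if isinstance(state_dict, dict) else type(state_dict))
+ ```
+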
+ ## Citation
+ If you find Concerto or the Pointcept codebase useful in your research, please cite the following papers:
+
+ ```bibtex
+ @misc{pointcept2023,
+   title={Pointcept: A Codebase for Point Cloud Perception Research},
+   author={Pointcept Contributors},
+   howpublished = {\url{https://github.com/Pointcept/Pointcept}},
+   year={2023}
+ }
+
+ @article{zhang2025concerto,
+   title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations},
+   author={Zhang, Yujia and Wu, Xiaoyang and Lao, Yixing and Wang, Chengyao and Tian, Zhuotao and Wang, Naiyan and Zhao, Hengshuang},
+   journal={Conference on Neural Information Processing Systems},
+   year={2025},
+ }
+ ```