Improve model card: Add metadata, links, abstract, and usage for Concerto
#1 by nielsr (HF Staff) · opened

README.md CHANGED

---
pipeline_tag: graph-ml
library_name: pytorch
license: apache-2.0
tags:
- 3d
- point-cloud
- self-supervised-learning
---

# Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

This repository contains the model weights for **Concerto**, a novel approach to learning robust spatial representations, presented in the paper [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607).

- **Paper:** [Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations](https://huggingface.co/papers/2510.23607)
- **Project Page:** [https://pointcept.github.io/Concerto/](https://pointcept.github.io/Concerto/)
- **Codebase:** [https://github.com/Pointcept/Pointcept](https://github.com/Pointcept/Pointcept)

## Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
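
The abstract describes the open-world "translator" as a linear projection from Concerto features into CLIP's language space. The sketch below illustrates that idea only; the feature widths, variable names, and the cosine-similarity scoring over normalized embeddings are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CONCERTO_DIM = 512  # assumed width of Concerto point features (hypothetical)
CLIP_DIM = 768      # assumed width of the CLIP text embedding space (hypothetical)

# The "translator" from the abstract: a single linear map into CLIP space.
translator = nn.Linear(CONCERTO_DIM, CLIP_DIM, bias=False)

point_feats = torch.randn(2048, CONCERTO_DIM)  # stand-in for per-point Concerto features
text_feats = torch.randn(20, CLIP_DIM)         # stand-in for CLIP text embeddings of 20 prompts

# Project point features into CLIP space, then score against the text embeddings.
proj = F.normalize(translator(point_feats), dim=-1)
text = F.normalize(text_feats, dim=-1)
logits = proj @ text.T        # (num_points, num_classes) cosine similarities
pred = logits.argmax(dim=-1)  # open-vocabulary per-point class assignment
```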

## Usage

For detailed installation, data preparation, training, and testing instructions, please refer to the [official GitHub repository](https://github.com/Pointcept/Pointcept).
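
As a starting point, the weights can be pulled from the Hub and inspected before being wired into a Pointcept model. This is a minimal sketch under stated assumptions: the `repo_id` and `filename` below are placeholders, so check this repository's file listing for the actual names.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id and checkpoint name; substitute the real ones from
# this repository's "Files and versions" tab.
ckpt_path = hf_hub_download(
    repo_id="Pointcept/Concerto",
    filename="model.pth",
)

# Load on CPU and inspect parameter names before loading into a model.
state = torch.load(ckpt_path, map_location="cpu")
# Some checkpoints nest the weights under a "state_dict" key.
state_dict = state.get("state_dict", state)
print(list(state_dict.keys())[:10])
```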

## Citation

If you find Concerto or the Pointcept codebase useful in your research, please cite the following papers:

```bibtex
@misc{pointcept2023,
  title={Pointcept: A Codebase for Point Cloud Perception Research},
  author={Pointcept Contributors},
  howpublished={\url{https://github.com/Pointcept/Pointcept}},
  year={2023}
}

@article{zhang2025concerto,
  title={Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations},
  author={Zhang, Yujia and Wu, Xiaoyang and Lao, Yixing and Wang, Chengyao and Tian, Zhuotao and Wang, Naiyan and Zhao, Hengshuang},
  journal={Conference on Neural Information Processing Systems},
  year={2025}
}
```