LAM3C
LAM3C is a self-supervised learning method trained on point clouds reconstructed from unlabeled indoor walkthrough videos. This repository provides pretrained Point Transformer V3 (PTv3) backbones for feature extraction and downstream 3D scene understanding.
- LAM3C is not a raw-video model. The released checkpoints take point clouds as input, not videos.
- The expected per-point input is 9D: XYZ + RGB + normals.
- The backbone checkpoints are feature extractors. They do not include a task-specific segmentation head unless explicitly stated.
arXiv: 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds (CVPR 2026)
GitHub: ryosuke-yamada/lam3c
What makes LAM3C different?
Most 3D self-supervised learning methods rely on real 3D scans, which are expensive to collect at scale. LAM3C instead learns from RoomTours, a large collection of point clouds reconstructed from unlabeled room-tour videos gathered from the web.
The method combines:
- RoomTours, a scalable video-generated point cloud (VGPC) pre-training dataset
- Point Transformer v3 backbones
- a noise-robust self-supervised objective with:
  - Laplacian smoothing loss
  - noise consistency loss
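The two noise-robustness terms can be illustrated with a minimal sketch. The exact formulations are in the paper; the k-NN graph construction, weighting, and loss shapes below are illustrative assumptions, not the released implementation:

```python
import torch

def laplacian_smoothing_loss(xyz, feat, k=8):
    """Illustrative sketch: pull each point's feature toward the mean
    feature of its k nearest spatial neighbors."""
    dist = torch.cdist(xyz, xyz)                          # (N, N) pairwise distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]  # drop self at index 0
    neighbor_mean = feat[knn].mean(dim=1)                 # (N, C) mean over neighbors
    return ((feat - neighbor_mean) ** 2).sum(dim=1).mean()

def noise_consistency_loss(feat_clean, feat_noisy):
    """Illustrative sketch: penalize feature drift under input perturbation."""
    return ((feat_clean - feat_noisy) ** 2).sum(dim=1).mean()

xyz = torch.randn(64, 3)     # dummy point coordinates
feat = torch.randn(64, 16)   # dummy per-point features
loss = laplacian_smoothing_loss(xyz, feat) + noise_consistency_loss(
    feat, feat + 0.01 * torch.randn_like(feat)
)
```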
Available checkpoints
| Checkpoint | Backbone | Params | Training data | Intended use |
|---|---|---|---|---|
| lam3c_roomtours49k_ptv3-base | PTv3-Base | 121M | RoomTours-49k | Pre-trained backbone (feature extractor) |
| lam3c_roomtours49k_ptv3-large | PTv3-Large | 224M | RoomTours-49k | Pre-trained backbone (feature extractor) |
| lam3c_ptv3-base_roomtours49k_probe-head_scannet | PTv3-Base (+ linear probing head) | - | ScanNet | Linear probing head trained on ScanNet |
| lam3c_ptv3-large_roomtours49k_probe-head_scannet | PTv3-Large (+ linear probing head) | - | ScanNet | Linear probing head trained on ScanNet |
Quickstart
Load a pretrained backbone
```python
import torch
from lam3c.model import PointTransformerV3

device = "cuda" if torch.cuda.is_available() else "cpu"
model = PointTransformerV3.from_pretrained("aist-cvrt/lam3c").to(device)
model.eval()
```
Extract point features
```python
point = transform(point)  # prepares xyz / rgb / normals
point = model(point)
features = point.feat
```
Use the backbone features for linear probing, full fine-tuning, or as initialization for segmentation heads. The exact preprocessing pipeline and checkpoint-specific loading utilities are provided in the code repository.
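As a minimal sketch of the linear probing use case: a single linear layer is trained on top of frozen backbone features. The `features` tensor below stands in for the per-point embeddings returned by the backbone (`point.feat`); the feature dimension and class count are placeholder assumptions:

```python
import torch
import torch.nn as nn

# Stand-ins for frozen backbone outputs (placeholder shapes).
num_points, feat_dim, num_classes = 1024, 64, 20
features = torch.randn(num_points, feat_dim)          # per-point embeddings
labels = torch.randint(0, num_classes, (num_points,))  # per-point labels

# Linear probing: only the head is trained; the backbone stays frozen.
head = nn.Linear(feat_dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

for _ in range(5):  # a few illustrative steps
    logits = head(features)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```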
Intended uses
LAM3C is intended for:
- self-supervised point cloud feature extraction
- initialization for 3D semantic segmentation
- initialization for 3D instance segmentation
- representation learning research on indoor point clouds
Training data
The backbone checkpoints are pretrained on RoomTours, a dataset of 49,219 indoor scenes reconstructed from unlabeled indoor walkthrough videos. The paper reports that the authors' independently collected portion of RoomTours contains 3,462 videos from 19 countries, producing 15,921 indoor sequences after CLIP-based filtering and scene splitting.
RoomTours scenes are reconstructed with an off-the-shelf feed-forward 3D reconstruction model and then aligned in Z-up orientation and scale. The resulting point clouds use 9D input features: coordinates, colors, and normals.
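The 9D per-point layout described above can be assembled as follows. This is a sketch only: normal estimation is out of scope here, so the normals are random placeholders rather than the output of a real pipeline:

```python
import torch

n = 2048
xyz = torch.randn(n, 3)  # coordinates (Z-up orientation, metric scale)
rgb = torch.rand(n, 3)   # colors in [0, 1]
# Unit-length placeholder normals; a real pipeline estimates these from geometry.
normals = torch.nn.functional.normalize(torch.randn(n, 3), dim=1)

# 9D per-point input: XYZ + RGB + normals.
points = torch.cat([xyz, rgb, normals], dim=1)  # shape (n, 9)
```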
The released LAM3C checkpoints do not use real 3D scans as pre-training inputs. However, the paper explicitly notes that the reconstruction model used to create RoomTours was itself trained with real point clouds.
Model details
- Architecture: Point Transformer v3 backbone
- Learning paradigm: self-supervised teacher-student clustering
- Noise robustness: Laplacian smoothing + noise consistency
- Default pre-training setup in the paper: 8 NVIDIA H200 GPUs, total batch size 16, AdamW, OneCycleLR, 145,600 iterations for the default PTv3-Base setting
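The reported optimizer and schedule can be sketched as below. The paper specifies AdamW, OneCycleLR, batch size 16, and 145,600 iterations; the peak learning rate and weight decay here are placeholder assumptions, not paper values, and a plain linear layer stands in for the PTv3 backbone:

```python
import torch

model = torch.nn.Linear(9, 64)  # stand-in for the PTv3 backbone

# AdamW + OneCycleLR over 145,600 iterations, as reported for pre-training.
# lr and weight_decay are placeholder assumptions.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-4, total_steps=145_600
)

for step in range(3):  # illustrative steps only
    loss = model(torch.randn(16, 9)).pow(2).mean()  # dummy loss, batch size 16
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # one scheduler step per iteration
```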
Evaluation
Selected results from the paper are shown below.
Semantic segmentation
| Checkpoint | ScanNet (LP) | ScanNet (Full-FT) | ScanNet200 (LP) | ScanNet200 (Full-FT) | ScanNet++ Val (LP) | ScanNet++ Val (Full-FT) | S3DIS Area 5 (LP) | S3DIS Area 5 (Full-FT) |
|---|---|---|---|---|---|---|---|---|
| lam3c_roomtours49k_ptv3-base | 66.0 | 75.1 | 25.3 | 35.1 | 34.2 | 43.1 | 65.7 | 72.9 |
| lam3c_roomtours49k_ptv3-large* | 69.5 | 79.5 | 28.1 | 35.5 | 35.9 | 43.1 | 69.5 | 75.5 |
Instance segmentation
| Checkpoint | ScanNet (LP) | ScanNet (Full-FT) | ScanNet200 (LP) | ScanNet200 (Full-FT) | ScanNet++ Val (LP) | ScanNet++ Val (Full-FT) | S3DIS Area 5 (LP) | S3DIS Area 5 (Full-FT) |
|---|---|---|---|---|---|---|---|---|
| lam3c_roomtours49k_ptv3-base | 25.1 | 39.7 | 8.3 | 19.6 | 11.3 | 20.5 | 21.6 | 45.7 |
| lam3c_roomtours49k_ptv3-large* | 28.6 | 41.7 | 9.5 | 21.9 | 12.1 | 21.1 | 27.8 | 47.2 |
* In the paper, the PTv3-Large variant uses 434k pre-training steps.
For full benchmark details and experimental settings, please refer to the paper.
Limitations
- RoomTours is built from indoor walkthrough videos, including real-estate tours and apartment viewings. Performance may not transfer directly to cluttered, industrial, outdoor, or LiDAR-native domains.
- Video-generated point clouds are noisy and incomplete. The paper shows examples with blurred object boundaries, doubled walls/floors, and weaker global geometric consistency than models pretrained on accurate real scans.
- In very small-data downstream regimes, the paper reports drops attributable to the domain gap between real scans and video-generated point clouds.
- Although the released checkpoints are pretrained without real 3D scans as inputs, the RoomTours reconstruction pipeline depends on a reconstruction model trained with real point clouds.
Citation
If you use LAM3C in your research, please cite:
```bibtex
@inproceedings{yamada2026lam3c,
  title={3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds},
  author={Yamada, Ryousuke and Ide, Kohsuke and Fukuhara, Yoshihiro and Kataoka, Hirokatsu and Puy, Gilles and Bursuc, Andrei and Asano, Yuki M.},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```
License
The model weights are released under CC BY-NC 4.0. This license allows non-commercial reuse subject to attribution. Please review the license terms before using the model in products or services.