Abstract
Sapiens2 is a family of high-resolution transformers for human-centric vision that achieves state-of-the-art performance through combined pretraining objectives, a large-scale curated dataset of human images, and architectural improvements that enable detailed dense prediction and semantic understanding.
We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state of the art, improving over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), and normal estimation (45.6% lower angular error), and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2
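To make the combined pretraining objective concrete, below is a minimal, hypothetical PyTorch sketch of how a masked-reconstruction term and a self-distilled contrastive term can be summed into a single loss. The module shapes, mask ratio, temperatures, and loss weight are assumptions chosen only to make the example runnable; this is not the released Sapiens2 training code.

```python
# Illustrative sketch (not the released Sapiens2 code): combining a masked
# image reconstruction loss with a DINO-style self-distilled contrastive loss.
import torch
import torch.nn.functional as F
from torch import nn


class ToyPatchEncoder(nn.Module):
    """Stand-in for a ViT encoder: projects patchified pixels to embeddings."""

    def __init__(self, patch_dim: int = 768, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.proj(patches)  # (B, N, embed_dim)


def masked_reconstruction_loss(encoder, decoder, patches, mask_ratio=0.75):
    """MAE-style term: reconstruct pixel values of randomly masked patches."""
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    perm = torch.rand(b, n, device=patches.device).argsort(dim=1)
    keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]

    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    latents = encoder(visible)
    # Decode a pooled latent for brevity; a real MAE decoder also consumes
    # learned mask tokens at the masked positions.
    pred = decoder(latents.mean(dim=1, keepdim=True)).expand(-1, mask_idx.shape[1], -1)
    target = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, d))
    return F.mse_loss(pred, target)


def self_distillation_loss(student_feats, teacher_feats, t_student=0.1, t_teacher=0.04):
    """DINO-style term: match the student's distribution to a sharpened,
    stop-gradient teacher distribution over a shared projection space."""
    teacher_probs = F.softmax(teacher_feats.detach() / t_teacher, dim=-1)
    student_logp = F.log_softmax(student_feats / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    patches = torch.randn(2, 196, 768)   # fake batch of patchified images
    encoder = ToyPatchEncoder()
    decoder = nn.Linear(256, 768)         # predicts pixel patches
    head = nn.Linear(256, 128)            # projection head for distillation

    # Two "views" stand in for the augmented crops fed to student and teacher.
    student_feats = head(encoder(patches).mean(dim=1))
    teacher_feats = head(encoder(patches + 0.01 * torch.randn_like(patches)).mean(dim=1))

    loss = (
        masked_reconstruction_loss(encoder, decoder, patches)
        + 1.0 * self_distillation_loss(student_feats, teacher_feats)  # weight is a guess
    )
    loss.backward()
    print(f"combined pretraining loss: {loss.item():.3f}")
```

In practice, the reconstruction term pushes the backbone to retain low-level spatial detail for dense prediction, while the self-distillation term shapes semantically discriminative features for zero-shot and few-label use; summing them (with some weighting) is one straightforward way to get both from a single pretraining run.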
Community
I love it. Was waiting for something to replace MediaPipe.
Would be awesome to see this used in ASL translation systems!
Family of human-centric vision transformers pretrained on 1 billion images at 1K and 4K resolution.
The models showcase strong generalization to multiple downstream tasks.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning (2026)
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment (2026)
- GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining (2026)
- THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond (2026)
- dinov3.seg: Open-Vocabulary Semantic Segmentation with DINOv3 (2026)
- UTICA: Multi-Objective Self-Distillation Foundation Model Pretraining for Time Series Classification (2026)
- Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders (2026)