---
title: GestureLSM Demo
emoji: "\U0001F57A"
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "4.42.0"
app_file: hf_space/app.py
pinned: false
---
[Papers with Code: Gesture Generation on BEAT2](https://paperswithcode.com/sota/gesture-generation-on-beat2?p=gesturelsm-latent-shortcut-based-co-speech) <a href="https://arxiv.org/abs/2501.18898"><img src="https://img.shields.io/badge/arxiv-gray?logo=arxiv"></a>
# GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]
# Release Plans
- [x] Inference Code
- [x] Pretrained Models
- [x] A web demo
- [x] Training Code
- [x] Clean Code to make it look nicer
- [x] Support for [MeanFlow](https://arxiv.org/abs/2505.13447)
- [x] Unified training and testing pipeline
- [ ] MeanFlow Training Code (Coming Soon)
- [ ] Merge with [Intentional-Gesture](https://github.com/andypinxinliu/Intentional-Gesture)
## Code Updates
**Latest Update**: The codebase has been cleaned and restructured. For legacy or historical information, please check out the `old` branch.
**New Features**:
- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in `configs_new/` directory
- Updated checkpoint files with improved performance
# Installation
## Build Environment
```bash
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
```
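If you want to confirm the environment before moving on, a quick check like the following (run inside the activated `gesturelsm` environment) verifies the PyTorch install and CUDA visibility:
```python
# Quick sanity check of the freshly built environment.
import torch

print("PyTorch version:", torch.__version__)         # expected 2.1.2
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```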
## Code Structure
Understanding the codebase structure will help you navigate and customize the project effectively.
```
GestureLSM/
├── configs_new/                  # New unified configuration files
│   ├── diffusion_rvqvae_128.yaml # Diffusion model config
│   ├── shortcut_rvqvae_128.yaml  # Shortcut model config
│   └── meanflow_rvqvae_128.yaml  # MeanFlow model config
├── configs/                      # Legacy configuration files (deprecated)
├── ckpt/                         # Pretrained model checkpoints
│   ├── new_540_diffusion.bin     # Diffusion model weights
│   ├── shortcut_reflow.bin       # Shortcut model weights
│   ├── meanflow.pth              # MeanFlow model weights
│   └── net_300000_*.pth          # RVQ-VAE model weights
├── models/                       # Model implementations
│   ├── Diffusion.py              # Diffusion model
│   ├── LSM.py                    # Latent Shortcut Model
│   ├── MeanFlow.py               # MeanFlow model
│   ├── layers/                   # Neural network layers
│   ├── vq/                       # Vector quantization modules
│   └── utils/                    # Model utilities
├── dataloaders/                  # Data loading and preprocessing
│   ├── beat_sep_lower.py         # Main dataset loader
│   ├── pymo/                     # Motion processing library
│   └── utils/                    # Data utilities
├── trainer/                      # Training framework
│   ├── base_trainer.py           # Base trainer class
│   └── generative_trainer.py     # Generative model trainer
├── utils/                        # General utilities
│   ├── config.py                 # Configuration management
│   ├── metric.py                 # Evaluation metrics
│   └── rotation_conversions.py   # Rotation utilities
├── demo/                         # Demo and visualization
│   ├── examples/                 # Sample audio files
│   └── install_mfa.sh            # MFA installation script
├── datasets/                     # Dataset storage
│   ├── BEAT_SMPL/                # Original BEAT dataset
│   ├── beat_cache/               # Preprocessed cache
│   └── hub/                      # SMPL models and pretrained weights
├── outputs/                      # Training outputs and logs
│   └── weights/                  # Saved model weights
├── train.py                      # Unified training/testing script
├── demo.py                       # Web demo script
├── rvq_beatx_train.py            # RVQ-VAE training script
└── requirements.txt              # Python dependencies
```
### Key Components
#### **Model Architecture**
- **`models/Diffusion.py`**: Denoising diffusion model for high-quality generation
- **`models/LSM.py`**: Latent Shortcut Model for fast inference
- **`models/MeanFlow.py`**: Flow-based model for single-step generation
- **`models/vq/`**: Vector quantization modules for latent space compression
#### **Configuration System**
- **`configs_new/`**: New unified configuration files for all models
- **`configs/`**: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths (a quick way to inspect one is sketched below)
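Since the configs are plain YAML, a quick way to inspect a model's settings before launching a run is to load the file with PyYAML. This is only a minimal sketch; the keys printed depend entirely on the chosen file, and PyYAML is assumed to be available in the environment.
```python
# Minimal sketch: print the top-level settings of a config from configs_new/.
# Assumes PyYAML is installed; the keys shown depend on the chosen file.
import yaml

with open("configs_new/shortcut_rvqvae_128.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    print(f"{key}: {value}")
```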
#### **Data Pipeline**
- **`dataloaders/beat_sep_lower.py`**: Main dataset loader for BEAT dataset
- **`dataloaders/pymo/`**: Motion processing library for gesture data
- **`datasets/beat_cache/`**: Preprocessed data cache for faster loading
#### **Training Framework**
- **`train.py`**: Unified script for training and testing all models
- **`trainer/`**: Training framework with base and generative trainers
- **`optimizers/`**: Optimizer and scheduler implementations
#### **Utilities**
- **`utils/config.py`**: Configuration management and validation
- **`utils/metric.py`**: Evaluation metrics (FGD, etc.)
- **`utils/rotation_conversions.py`**: 3D rotation utilities (see the usage sketch below)
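If, as in many gesture and motion codebases, `utils/rotation_conversions.py` mirrors PyTorch3D's transforms module, typical usage would look roughly like the sketch below. The function names (`axis_angle_to_matrix`, `matrix_to_rotation_6d`) and the joint count are assumptions; check the file itself for the actual API before relying on this.
```python
# Sketch: convert axis-angle poses to a 6D rotation representation.
# ASSUMPTION: utils/rotation_conversions.py provides PyTorch3D-style helpers
# axis_angle_to_matrix and matrix_to_rotation_6d; verify against the file.
import torch
from utils.rotation_conversions import axis_angle_to_matrix, matrix_to_rotation_6d

poses = torch.randn(8, 55, 3)             # (batch, joints, axis-angle); 55 is illustrative
rot_mats = axis_angle_to_matrix(poses)    # (batch, joints, 3, 3)
rot_6d = matrix_to_rotation_6d(rot_mats)  # (batch, joints, 6)
print(rot_6d.shape)
```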
### Getting Started with the Code
1. **For Training**: Use `train.py` with configs from `configs_new/`
2. **For Inference**: Use `demo.py` for the web interface, or `train.py --mode test`
3. **For Customization**: Modify the config files in the `configs_new/` directory
4. **For New Models**: Add the model implementation in the `models/` directory (a minimal skeleton is sketched below)
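For item 4, the trainers in `trainer/` define the interface a new model must satisfy, so the best starting point is to copy an existing model such as `models/LSM.py`. The skeleton below is purely illustrative: the class name, constructor arguments, and `forward()` signature are hypothetical placeholders, not the interface this codebase actually expects.
```python
# models/my_model.py -- purely illustrative skeleton for adding a new model.
# The class name, constructor arguments, and forward() signature are HYPOTHETICAL;
# mirror models/LSM.py for the interface the trainers in trainer/ actually use.
import torch
import torch.nn as nn

class MyGestureGenerator(nn.Module):
    def __init__(self, latent_dim: int = 128, audio_dim: int = 768):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim, latent_dim),
            nn.SiLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, latents: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # Condition gesture latents on projected audio features (placeholder logic).
        return self.backbone(latents + self.audio_proj(audio_feat))
```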
## Results

This table compares the 1-speaker and all-speaker results. RAG-Gesture refers to [**Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis**](https://arxiv.org/abs/2412.06786), accepted to CVPR 2025. The 1-speaker statistics use speaker ID 2 ('scott') to stay consistent with previous SOTA methods. The RAG-Gesture numbers are copied directly from its repository, so they differ from the statistics in the current paper.
## Important Notes
### Model Performance
- The statistics reported in the paper are based on the 1-speaker setting with speaker ID 2 ('scott'), to be consistent with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on 1-speaker data.
- To use all speakers, modify the config files to include all speaker IDs.
- April 16, 2025: updated the pretrained models to include all speakers (RVQ-VAEs, Shortcut).
- No hyperparameter tuning was done for the all-speaker setting; the same settings as 1-speaker are used.
### Model Design Choices
- No speaker embedding is included, so the model can generate gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional: gesture types are typically unknown for novel speakers and settings, so this choice is more realistic for real-world applications.
- If you want better FGD scores, you can try adding gesture type information.
### Code Structure
- **Current Version**: Clean, unified codebase with MeanFlow support
- **Legacy Code**: Available in the `old` branch for historical reference
- **Accepted to ICCV 2025** - Thanks to all co-authors!
## Download Models
### Pretrained Models (Updated)
```bash
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder
# Option 2: From Huggingface Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt
# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
```
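If you prefer to download from the Hugging Face Hub in Python rather than through the CLI, `huggingface_hub` (pip-installable, and a dependency of most HF tooling) provides `snapshot_download`; a minimal sketch equivalent to the CLI command above:
```python
# Minimal sketch: fetch the pretrained checkpoints from the Hub programmatically.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="pliu23/GestureLSM", local_dir="./ckpt")
```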
### Available Checkpoints
- **Diffusion Model**: `ckpt/new_540_diffusion.bin`
- **Shortcut Model**: `ckpt/shortcut_reflow.bin`
- **MeanFlow Model**: `ckpt/meanflow.pth`
- **RVQ-VAE Models**: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
## Download Dataset
> Needed for evaluation and training; not necessary for running the web demo or inference.
### Download BEAT2 Dataset from Hugging Face
The original dataset download method is no longer available. Please use the Hugging Face dataset:
```bash
# Download BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --repo-type dataset --local-dir ./datasets/BEAT2
```
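The same download can be done from Python with `huggingface_hub`; since BEAT2 is a dataset repo, `repo_type="dataset"` is required (a minimal sketch):
```python
# Minimal sketch: fetch the BEAT2 dataset repo programmatically.
# repo_type="dataset" is required because the default repo type is "model".
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="H-Liu1997/BEAT2",
    repo_type="dataset",
    local_dir="./datasets/BEAT2",
)
```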
**Dataset Information**:
- **Source**: [H-Liu1997/BEAT2 on Hugging Face](https://huggingface.co/datasets/H-Liu1997/BEAT2)
- **Size**: ~4.1K samples
- **Format**: CSV with train/test splits
- **License**: Apache 2.0
### Legacy Download (Deprecated)
> The original download method is no longer working
```bash
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh
```
## Testing/Evaluation
> **Note**: Requires dataset download for evaluation. For inference only, see the Demo section below.
### Unified Testing Pipeline
The codebase now uses a unified `train.py` script for both training and testing. Use the `--mode test` flag for evaluation:
```bash
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test
# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test
```
### Model Comparison
| Model | Steps | Description | Key Features | Use Case |
|-------|-------|-------------|--------------|----------|
| **Diffusion** | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| **Shortcut** | 2-4 | Latent shortcut with reflow | Fast inference, good quality | **Recommended for most users** |
| **MeanFlow** | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |
### Performance Comparison
| Model | Steps | FGD Score ↓ | Beat Constancy ↑ | L1Div Score ↑ | Inference Speed |
|-------|-------|-------------|------------------|---------------|-----------------|
| **MeanFlow** | 1 | **0.4031** | **0.7489** | 12.4631 | **Fastest** |
| **Diffusion** | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| **Shortcut** | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| **Shortcut-ReFlow** | 2 | 0.4104 | 0.7182 | **13.678** | Fast |
**Legend**:
- **FGD Score** (↓): Lower is better; measures gesture quality
- **Beat Constancy** (↑): Higher is better; measures audio-gesture synchronization
- **L1Div Score** (↑): Higher is better; measures diversity of generated gestures
**Recommendation**: **MeanFlow** offers the best FGD and Beat Constancy scores with the fastest inference speed, making it the recommended choice; **Shortcut-ReFlow** achieves the highest L1Div (diversity) while remaining fast.
### Legacy Testing (Deprecated)
> For reference only - use the unified pipeline above instead
```bash
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml
python test.py -c configs/diffuser_rvqvae_128.yaml
```
## Train RVQ-VAEs (1-speaker)
> Requires dataset download.
```bash
bash train_rvq.sh
```
## Training
> **Note**: Requires dataset download for training.
### Unified Training Pipeline
The codebase now uses a unified `train.py` script for training all models. Use the new configuration files in `configs_new/`:
```bash
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml
# Train Shortcut Model
python train.py --config configs_new/shortcut_rvqvae_128.yaml
# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml
```
### Training Configuration
- **Config Directory**: Use `configs_new/` for the latest configurations
- **Output Directory**: Models are saved to `./outputs/weights/`
- **Logging**: Supports Weights & Biases integration (configure in config files)
- **GPU Support**: Configure GPU usage in the config files (a config-override sketch follows below)
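If you want to tweak a few settings without hand-editing the shipped YAML files, one option is to derive a run-specific config programmatically and pass it to `train.py`, as sketched below. The overridden key names (`out_path`, `wandb`) are hypothetical placeholders; use the field names that actually appear in the `configs_new/` files.
```python
# Sketch: derive a run-specific config and launch training with it.
# The overridden keys ("out_path", "wandb") are HYPOTHETICAL -- replace them
# with the real field names found in the configs_new/ YAML files.
import subprocess
import yaml

with open("configs_new/shortcut_rvqvae_128.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["out_path"] = "./outputs/weights/my_run/"  # hypothetical key
cfg["wandb"] = False                           # hypothetical key

with open("configs_new/my_run.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

subprocess.run(
    ["python", "train.py", "--config", "configs_new/my_run.yaml"],
    check=True,
)
```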
### Legacy Training (Deprecated)
> For reference only - use the unified pipeline above instead
```bash
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml
```
## Quick Start
### Demo/Inference (No Dataset Required)
```bash
# Run the web demo with Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```
### Testing with Your Own Data
```bash
# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
```
## Demo
The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.
```bash
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```
**Features**:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures
# Acknowledgments
Our code partially borrows from [SynTalker](https://github.com/RobinWitch/SynTalker/tree/main), [EMAGE](https://github.com/PantoMatrix/PantoMatrix/tree/main/scripts/EMAGE_2024), and [DiffuseStyleGesture](https://github.com/YoungSeng/DiffuseStyleGesture). Many thanks to these useful repos; please check them out.
# Citation
If you find our code or paper helpful, please consider citing:
```bibtex
@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
```