---
title: GestureLSM Demo
emoji: "\U0001F57A"
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "4.42.0"
app_file: hf_space/app.py
pinned: false
---

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/gesturelsm-latent-shortcut-based-co-speech/gesture-generation-on-beat2)](https://paperswithcode.com/sota/gesture-generation-on-beat2?p=gesturelsm-latent-shortcut-based-co-speech)

# GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]

# 📝 Release Plans

- [x] Inference Code
- [x] Pretrained Models
- [x] A web demo
- [x] Training Code
- [x] Code cleanup and restructuring
- [x] Support for [MeanFlow](https://arxiv.org/abs/2505.13447)
- [x] Unified training and testing pipeline
- [ ] MeanFlow Training Code (Coming Soon)
- [ ] Merge with [Intentional-Gesture](https://github.com/andypinxinliu/Intentional-Gesture)

## 🔄 Code Updates

**Latest Update**: The codebase has been cleaned and restructured. For legacy or historical information, please check out the `old` branch.

**New Features**:
- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in the `configs_new/` directory
- Updated checkpoint files with improved performance

# ⚒️ Installation

## Build Environment

```
conda create -n gesturelsm python=3.12
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
```
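After installation, a quick sanity check can confirm that PyTorch and CUDA are set up before downloading models. This is a minimal sketch (not part of the repo) that only uses standard PyTorch calls; the exact versions printed depend on your environment:

```python
# verify_env.py -- throwaway sanity check for the conda environment created above
import torch
import torchvision
import torchaudio

print(f"torch {torch.__version__}, torchvision {torchvision.__version__}, torchaudio {torchaudio.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # Confirm the GPU that training/inference will run on
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```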
## 📁 Code Structure

Understanding the codebase structure will help you navigate and customize the project effectively.

```
GestureLSM/
├── 📁 configs_new/                # New unified configuration files
│   ├── diffusion_rvqvae_128.yaml  # Diffusion model config
│   ├── shortcut_rvqvae_128.yaml   # Shortcut model config
│   └── meanflow_rvqvae_128.yaml   # MeanFlow model config
├── 📁 configs/                    # Legacy configuration files (deprecated)
├── 📁 ckpt/                       # Pretrained model checkpoints
│   ├── new_540_diffusion.bin      # Diffusion model weights
│   ├── shortcut_reflow.bin        # Shortcut model weights
│   ├── meanflow.pth               # MeanFlow model weights
│   └── net_300000_*.pth           # RVQ-VAE model weights
├── 📁 models/                     # Model implementations
│   ├── Diffusion.py               # Diffusion model
│   ├── LSM.py                     # Latent Shortcut Model
│   ├── MeanFlow.py                # MeanFlow model
│   ├── 📁 layers/                 # Neural network layers
│   ├── 📁 vq/                     # Vector quantization modules
│   └── 📁 utils/                  # Model utilities
├── 📁 dataloaders/                # Data loading and preprocessing
│   ├── beat_sep_lower.py          # Main dataset loader
│   ├── 📁 pymo/                   # Motion processing library
│   └── 📁 utils/                  # Data utilities
├── 📁 trainer/                    # Training framework
│   ├── base_trainer.py            # Base trainer class
│   └── generative_trainer.py      # Generative model trainer
├── 📁 utils/                      # General utilities
│   ├── config.py                  # Configuration management
│   ├── metric.py                  # Evaluation metrics
│   └── rotation_conversions.py    # Rotation utilities
├── 📁 demo/                       # Demo and visualization
│   ├── examples/                  # Sample audio files
│   └── install_mfa.sh             # MFA installation script
├── 📁 datasets/                   # Dataset storage
│   ├── BEAT_SMPL/                 # Original BEAT dataset
│   ├── beat_cache/                # Preprocessed cache
│   └── hub/                       # SMPL models and pretrained weights
├── 📁 outputs/                    # Training outputs and logs
│   └── weights/                   # Saved model weights
├── train.py                       # Unified training/testing script
├── demo.py                        # Web demo script
├── rvq_beatx_train.py             # RVQ-VAE training script
└── requirements.txt               # Python dependencies
```

### 🔧 Key Components

#### **Model Architecture**
- **`models/Diffusion.py`**: Denoising diffusion model for high-quality generation
- **`models/LSM.py`**: Latent Shortcut Model for fast inference
- **`models/MeanFlow.py`**: Flow-based model for single-step generation
- **`models/vq/`**: Vector quantization modules for latent-space compression

#### **Configuration System**
- **`configs_new/`**: New unified configuration files for all models
- **`configs/`**: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths

#### **Data Pipeline**
- **`dataloaders/beat_sep_lower.py`**: Main dataset loader for the BEAT dataset
- **`dataloaders/pymo/`**: Motion processing library for gesture data
- **`datasets/beat_cache/`**: Preprocessed data cache for faster loading

#### **Training Framework**
- **`train.py`**: Unified script for training and testing all models
- **`trainer/`**: Training framework with base and generative trainers
- **`optimizers/`**: Optimizer and scheduler implementations

#### **Utilities**
- **`utils/config.py`**: Configuration management and validation
- **`utils/metric.py`**: Evaluation metrics (FGD, etc.)
- **`utils/rotation_conversions.py`**: 3D rotation utilities

### 🚀 Getting Started with the Code

1. **For Training**: Use `train.py` with configs from `configs_new/`
2. **For Inference**: Use `demo.py` for the web interface or `train.py --mode test`
3. **For Customization**: Modify the config files in the `configs_new/` directory
4. **For New Models**: Add the model implementation in the `models/` directory

## Results

![Beat Results](beat-new.png)

This table shows the results of the 1-speaker and all-speaker comparisons. RAG-Gesture refers to [**Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis**](https://arxiv.org/abs/2412.06786), accepted by CVPR 2025. The 1-speaker stats are based on speaker id 2 ('scott') to be consistent with previous SOTA methods; they are copied directly from the RAG-Gesture repo and differ from the stats in the current paper.

## Important Notes

### Model Performance
- The statistics reported in the paper are based on 1 speaker (speaker id 2, 'scott') to be consistent with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on 1 speaker.
- If you want to use all speakers, please modify the config files to include all speaker ids.
- April 16, 2025: updated the pretrained models (RVQ-VAEs, Shortcut) to include all speakers.
- No hyperparameter tuning was done for the all-speaker setting; the same settings as 1-speaker are used.

### Model Design Choices
- No speaker embedding is included, so the model can generate gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional: gesture types are typically unknown for novel speakers and settings, making this approach more realistic for real-world applications.
- If you want better FGD scores, you can try adding gesture type information.

### Code Structure
- **Current Version**: Clean, unified codebase with MeanFlow support
- **Legacy Code**: Available in the `old` branch for historical reference
- **Accepted to ICCV 2025** - Thanks to all co-authors!

## Download Models

### Pretrained Models (Updated)

```
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder

# Option 2: From the Hugging Face Hub
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt

# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
```

### Available Checkpoints

- **Diffusion Model**: `ckpt/new_540_diffusion.bin`
- **Shortcut Model**: `ckpt/shortcut_reflow.bin`
- **MeanFlow Model**: `ckpt/meanflow.pth`
- **RVQ-VAE Models**: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
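If you prefer to fetch the checkpoints from Python rather than the CLI, the `huggingface_hub` package (the same package that provides `huggingface-cli`) offers an equivalent call. A minimal sketch, using the repo id and target directory from the commands above:

```python
# download_ckpt.py -- programmatic alternative to the huggingface-cli command above
from huggingface_hub import snapshot_download

# Download every file in the model repo into ./ckpt
snapshot_download(repo_id="pliu23/GestureLSM", local_dir="./ckpt")
```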
## Download Dataset

> For evaluation and training; not necessary for running the web demo or inference.

### Download BEAT2 Dataset from Hugging Face

The original dataset download method is no longer available. Please use the Hugging Face dataset instead:

```bash
# Download the BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --repo-type dataset --local-dir ./datasets/BEAT2
```

**Dataset Information**:
- **Source**: [H-Liu1997/BEAT2 on Hugging Face](https://huggingface.co/datasets/H-Liu1997/BEAT2)
- **Size**: ~4.1K samples
- **Format**: CSV with train/test splits
- **License**: Apache 2.0
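The dataset can also be fetched from Python with `huggingface_hub`, as with the checkpoints above. A minimal sketch; note the `repo_type="dataset"` argument, since BEAT2 is hosted as a dataset repo rather than a model repo:

```python
# download_beat2.py -- fetch the BEAT2 dataset repo into ./datasets/BEAT2
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="H-Liu1997/BEAT2",
    repo_type="dataset",          # BEAT2 is a dataset repo, not a model repo
    local_dir="./datasets/BEAT2",
)
```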
### Legacy Download (Deprecated)

> The original download method no longer works.

```bash
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh
```

## Testing/Evaluation

> **Note**: Requires the dataset download for evaluation. For inference only, see the Demo section below.

### Unified Testing Pipeline

The codebase now uses a unified `train.py` script for both training and testing. Use the `--mode test` flag for evaluation:

```bash
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test

# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test
```
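To evaluate all three pretrained models in one go, a small driver script can loop over the (config, checkpoint) pairs above. This is only a convenience sketch wrapping the exact commands shown, not part of the official pipeline:

```python
# eval_all.py -- run `train.py --mode test` for each pretrained model (sketch)
import subprocess

RUNS = [
    ("configs_new/diffusion_rvqvae_128.yaml", "ckpt/new_540_diffusion.bin"),
    ("configs_new/shortcut_rvqvae_128.yaml", "ckpt/shortcut_reflow.bin"),
    ("configs_new/meanflow_rvqvae_128.yaml", "ckpt/meanflow.pth"),
]

for config, ckpt in RUNS:
    print(f"Evaluating {ckpt} with {config}")
    subprocess.run(
        ["python", "train.py", "--config", config, "--ckpt", ckpt, "--mode", "test"],
        check=True,  # stop immediately if one evaluation run fails
    )
```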
### Model Comparison

| Model | Steps | Description | Key Features | Use Case |
|-------|-------|-------------|--------------|----------|
| **Diffusion** | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| **Shortcut** | 2-4 | Latent shortcut with reflow | Fast inference, good quality | **Recommended for most users** |
| **MeanFlow** | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |

### Performance Comparison

| Model | Steps | FGD Score ↓ | Beat Constancy ↑ | L1Div Score ↑ | Inference Speed |
|-------|-------|-------------|------------------|---------------|-----------------|
| **MeanFlow** | 1 | **0.4031** | **0.7489** | 12.4631 | **Fastest** |
| **Diffusion** | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| **Shortcut** | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| **Shortcut-ReFlow** | 2 | 0.4104 | 0.7182 | **13.678** | Fast |

**Legend**:
- **FGD Score** (↓): Lower is better - measures gesture quality
- **Beat Constancy** (↑): Higher is better - measures audio-gesture synchronization
- **L1Div Score** (↑): Higher is better - measures diversity of generated gestures

**Recommendation**: **MeanFlow** offers the best FGD and Beat Constancy scores with the fastest (single-step) inference.

### Legacy Testing (Deprecated)

> For reference only - use the unified pipeline above instead.

```bash
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml
python test.py -c configs/diffuser_rvqvae_128.yaml
```

## Train RVQ-VAEs (1-speaker)

> Requires the dataset download.

```
bash train_rvq.sh
```

## Training

> **Note**: Requires the dataset download for training.

### Unified Training Pipeline

The codebase now uses a unified `train.py` script for training all models. Use the new configuration files in `configs_new/`:

```bash
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml

# Train Shortcut Model
python train.py --config configs_new/shortcut_rvqvae_128.yaml

# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml
```

### Training Configuration

- **Config Directory**: Use `configs_new/` for the latest configurations
- **Output Directory**: Models are saved to `./outputs/weights/`
- **Logging**: Supports Weights & Biases integration (configure in the config files)
- **GPU Support**: Configure GPU usage in the config files

### Legacy Training (Deprecated)

> For reference only - use the unified pipeline above instead.

```bash
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml
```

## Quick Start

### Demo/Inference (No Dataset Required)

```bash
# Run the web demo with the Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```

### Testing with Your Own Data

```bash
# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
```

## Demo

The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.

```bash
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```

**Features**:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures

# 🙏 Acknowledgments

Thanks to [SynTalker](https://github.com/RobinWitch/SynTalker/tree/main), [EMAGE](https://github.com/PantoMatrix/PantoMatrix/tree/main/scripts/EMAGE_2024), and [DiffuseStyleGesture](https://github.com/YoungSeng/DiffuseStyleGesture); our code partially borrows from them. Please check out these useful repos.

# 📖 Citation

If you find our code or paper helpful, please consider citing:

```bibtex
@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
```