---
title: GestureLSM Demo
emoji: "\U0001F57A"
colorFrom: purple
colorTo: pink
sdk: gradio
sdk_version: "4.42.0"
app_file: hf_space/app.py
pinned: false
---

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/gesturelsm-latent-shortcut-based-co-speech/gesture-generation-on-beat2)](https://paperswithcode.com/sota/gesture-generation-on-beat2?p=gesturelsm-latent-shortcut-based-co-speech) <a href="https://arxiv.org/abs/2501.18898"><img src="https://img.shields.io/badge/arxiv-gray?logo=arxiv"></a>



# GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [ICCV 2025]


# πŸ“ Release Plans

- [x] Inference Code
- [x] Pretrained Models
- [x] A web demo
- [x] Training Code
- [x] Clean Code to make it look nicer
- [x] Support for [MeanFlow](https://arxiv.org/abs/2505.13447)
- [x] Unified training and testing pipeline
- [ ] MeanFlow Training Code (Coming Soon)
- [ ] Merge with [Intentional-Gesture](https://github.com/andypinxinliu/Intentional-Gesture)

## πŸ”„ Code Updates

**Latest Update**: The codebase has been cleaned and restructured. For legacy or historical information, please check out the `old` branch.

**New Features**:
- Added MeanFlow model support
- Unified training and testing pipeline using `train.py`
- New configuration files in `configs_new/` directory
- Updated checkpoint files with improved performance

# βš’οΈ Installation

## Build Environment

```bash
conda create -n gesturelsm python=3.11  # PyTorch 2.1.2 conda builds require Python <= 3.11
conda activate gesturelsm
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
bash demo/install_mfa.sh
```
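
After installation, a quick sanity check helps catch environment problems before training or running the demo. This is a minimal, illustrative snippet; it only assumes the `gesturelsm` environment created above.

```python
# Quick environment sanity check (illustrative).
import torch

print(f"PyTorch version: {torch.__version__}")          # expect 2.1.2
print(f"CUDA available:  {torch.cuda.is_available()}")  # True on a GPU machine
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```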

## πŸ“ Code Structure

Understanding the codebase structure will help you navigate and customize the project effectively.

```
GestureLSM/
β”œβ”€β”€ πŸ“ configs_new/              # New unified configuration files
β”‚   β”œβ”€β”€ diffusion_rvqvae_128.yaml    # Diffusion model config
β”‚   β”œβ”€β”€ shortcut_rvqvae_128.yaml     # Shortcut model config
β”‚   └── meanflow_rvqvae_128.yaml     # MeanFlow model config
β”œβ”€β”€ πŸ“ configs/                  # Legacy configuration files (deprecated)
β”œβ”€β”€ πŸ“ ckpt/                     # Pretrained model checkpoints
β”‚   β”œβ”€β”€ new_540_diffusion.bin        # Diffusion model weights
β”‚   β”œβ”€β”€ shortcut_reflow.bin          # Shortcut model weights
β”‚   β”œβ”€β”€ meanflow.pth                 # MeanFlow model weights
β”‚   └── net_300000_*.pth            # RVQ-VAE model weights
β”œβ”€β”€ πŸ“ models/                   # Model implementations
β”‚   β”œβ”€β”€ Diffusion.py                 # Diffusion model
β”‚   β”œβ”€β”€ LSM.py                       # Latent Shortcut Model
β”‚   β”œβ”€β”€ MeanFlow.py                  # MeanFlow model
β”‚   β”œβ”€β”€ πŸ“ layers/                   # Neural network layers
β”‚   β”œβ”€β”€ πŸ“ vq/                       # Vector quantization modules
β”‚   └── πŸ“ utils/                    # Model utilities
β”œβ”€β”€ πŸ“ dataloaders/              # Data loading and preprocessing
β”‚   β”œβ”€β”€ beat_sep_lower.py            # Main dataset loader
β”‚   β”œβ”€β”€ πŸ“ pymo/                     # Motion processing library
β”‚   └── πŸ“ utils/                    # Data utilities
β”œβ”€β”€ πŸ“ trainer/                  # Training framework
β”‚   β”œβ”€β”€ base_trainer.py              # Base trainer class
β”‚   └── generative_trainer.py        # Generative model trainer
β”œβ”€β”€ πŸ“ utils/                    # General utilities
β”‚   β”œβ”€β”€ config.py                    # Configuration management
β”‚   β”œβ”€β”€ metric.py                    # Evaluation metrics
β”‚   └── rotation_conversions.py      # Rotation utilities
β”œβ”€β”€ πŸ“ demo/                     # Demo and visualization
β”‚   β”œβ”€β”€ examples/                    # Sample audio files
β”‚   └── install_mfa.sh               # MFA installation script
β”œβ”€β”€ πŸ“ datasets/                 # Dataset storage
β”‚   β”œβ”€β”€ BEAT_SMPL/                   # Original BEAT dataset
β”‚   β”œβ”€β”€ beat_cache/                  # Preprocessed cache
β”‚   └── hub/                         # SMPL models and pretrained weights
β”œβ”€β”€ πŸ“ outputs/                  # Training outputs and logs
β”‚   └── weights/                     # Saved model weights
β”œβ”€β”€ train.py                     # Unified training/testing script
β”œβ”€β”€ demo.py                      # Web demo script
β”œβ”€β”€ rvq_beatx_train.py          # RVQ-VAE training script
└── requirements.txt             # Python dependencies
```

### πŸ”§ Key Components

#### **Model Architecture**
- **`models/Diffusion.py`**: Denoising diffusion model for high-quality generation
- **`models/LSM.py`**: Latent Shortcut Model for fast inference
- **`models/MeanFlow.py`**: Flow-based model for single-step generation
- **`models/vq/`**: Vector quantization modules for latent space compression

#### **Configuration System**
- **`configs_new/`**: New unified configuration files for all models
- **`configs/`**: Legacy configuration files (deprecated)
- Each config file contains model parameters, training settings, and data paths
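
As a rough illustration, a config can be inspected with a few lines of Python. The snippet below only lists top-level sections; the actual schema is defined by the files in `configs_new/` and the loader in `utils/config.py`.

```python
# Illustrative only: peek at a config's top-level structure.
# The real schema lives in configs_new/ and utils/config.py.
import yaml

with open("configs_new/shortcut_rvqvae_128.yaml") as f:
    cfg = yaml.safe_load(f)

for key, value in cfg.items():
    print(f"{key}: {type(value).__name__}")
```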

#### **Data Pipeline**
- **`dataloaders/beat_sep_lower.py`**: Main dataset loader for BEAT dataset
- **`dataloaders/pymo/`**: Motion processing library for gesture data
- **`datasets/beat_cache/`**: Preprocessed data cache for faster loading

#### **Training Framework**
- **`train.py`**: Unified script for training and testing all models
- **`trainer/`**: Training framework with base and generative trainers
- **`optimizers/`**: Optimizer and scheduler implementations

#### **Utilities**
- **`utils/config.py`**: Configuration management and validation
- **`utils/metric.py`**: Evaluation metrics (FGD, etc.; see the sketch below)
- **`utils/rotation_conversions.py`**: 3D rotation utilities
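
For intuition, FGD is the standard Fréchet distance between Gaussian fits of real and generated gesture features. A minimal sketch of that formula follows; the project's own implementation in `utils/metric.py` may differ in details (notably the learned feature extractor, which is not shown here).

```python
# Minimal sketch of the Frechet Gesture Distance (FGD) formula:
#   FGD = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})
# `real` and `gen` are (num_samples, feature_dim) gesture features;
# the feature extractor that produces them is omitted.
import numpy as np
from scipy import linalg

def frechet_distance(real: np.ndarray, gen: np.ndarray) -> float:
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```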

### πŸš€ Getting Started with the Code

1. **For Training**: Use `train.py` with configs from `configs_new/`
2. **For Inference**: Use `demo.py` for web interface or `train.py --mode test`
3. **For Customization**: Modify config files in `configs_new/` directory
4. **For New Models**: Add model implementation in `models/` directory

## Results

![Beat Results](beat-new.png)

This table shows the results of the 1-speaker and all-speaker comparisons. RAG-Gesture refers to [**Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis**](https://arxiv.org/abs/2412.06786), accepted by CVPR 2025. The 1-speaker stats are based on speaker id 2 ('scott') to stay consistent with previous SOTA methods. I copied the stats directly from the RAG-Gesture repo; they differ from the stats in the current paper.

## Important Notes

### Model Performance
- The statistics reported in the paper are based on the 1-speaker setting with speaker id 2 ('scott'), for consistency with previous SOTA methods.
- The pretrained models (RVQ-VAEs, Diffusion, Shortcut, MeanFlow) are trained on the 1-speaker setting.
- If you want to use all-speaker, please modify the config files to include all speaker ids.
- April 16, 2025: updated the pretrained models (RVQ-VAEs, Shortcut) to include all speakers.
- No hyperparameter tuning was done for the all-speaker setting; the 1-speaker settings are reused.

### Model Design Choices
- No speaker embedding is included to make the model capable of generating gestures for novel speakers.
- No gesture type information is used in the current version. This is intentional as gesture types are typically unknown for novel speakers and settings, making this approach more realistic for real-world applications.
- If you want to see better FGD scores, you can try adding gesture type information.

### Code Structure
- **Current Version**: Clean, unified codebase with MeanFlow support
- **Legacy Code**: Available in the `old` branch for historical reference
- **Accepted to ICCV 2025** - Thanks to all co-authors!





## Download Models

### Pretrained Models (Updated)
```bash
# Option 1: From Google Drive
# Download the pretrained models (Diffusion + Shortcut + MeanFlow + RVQ-VAEs)
gdown https://drive.google.com/drive/folders/1OfYWWJbaXal6q7LttQlYKWAy0KTwkPRw?usp=drive_link -O ./ckpt --folder

# Option 2: From the Hugging Face Hub (takes a repo id, not a URL)
huggingface-cli download pliu23/GestureLSM --local-dir ./ckpt

# Download the SMPL model
gdown https://drive.google.com/drive/folders/1MCks7CMNBtAzU2XihYezNmiGT_6pWex8?usp=drive_link -O ./datasets/hub --folder
```

### Available Checkpoints
- **Diffusion Model**: `ckpt/new_540_diffusion.bin`
- **Shortcut Model**: `ckpt/shortcut_reflow.bin` 
- **MeanFlow Model**: `ckpt/meanflow.pth`
- **RVQ-VAE Models**: `ckpt/net_300000_upper.pth`, `ckpt/net_300000_hands.pth`, `ckpt/net_300000_lower.pth`
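
The same download can also be scripted from Python with `huggingface_hub`, which the CLI wraps:

```python
# Python alternative to the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="pliu23/GestureLSM", local_dir="./ckpt")
```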

## Download Dataset
> Required for evaluation and training; not needed for the web demo or inference.

### Download BEAT2 Dataset from Hugging Face
The original dataset download method is no longer available. Please use the Hugging Face dataset:

```bash
# Download BEAT2 dataset from Hugging Face
huggingface-cli download H-Liu1997/BEAT2 --local-dir ./datasets/BEAT2
```

**Dataset Information**:
- **Source**: [H-Liu1997/BEAT2 on Hugging Face](https://huggingface.co/datasets/H-Liu1997/BEAT2)
- **Size**: ~4.1K samples
- **Format**: CSV with train/test splits
- **License**: Apache 2.0
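
As with the models, the dataset can be fetched from Python; note `repo_type="dataset"`, which is required for dataset repos:

```python
# Python alternative to the huggingface-cli command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="H-Liu1997/BEAT2",
    repo_type="dataset",   # BEAT2 is a dataset repo, not a model repo
    local_dir="./datasets/BEAT2",
)
```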

### Legacy Download (Deprecated)
> The original download method is no longer working
```bash
# This command is deprecated and no longer works
# bash preprocess/bash_raw_cospeech_download.sh
```

## Testing/Evaluation

> **Note**: Requires dataset download for evaluation. For inference only, see the Demo section below.

### Unified Testing Pipeline

The codebase now uses a unified `train.py` script for both training and testing. Use the `--mode test` flag for evaluation:

```bash
# Test Diffusion Model (20 steps)
python train.py --config configs_new/diffusion_rvqvae_128.yaml --ckpt ckpt/new_540_diffusion.bin --mode test

# Test Shortcut Model (2-step reflow)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test

# Test MeanFlow Model (1-step flow-based)
python train.py --config configs_new/meanflow_rvqvae_128.yaml --ckpt ckpt/meanflow.pth --mode test
```

### Model Comparison

| Model | Steps | Description | Key Features | Use Case |
|-------|-------|-------------|--------------|----------|
| **Diffusion** | 20 | Denoising diffusion model | High quality, slower inference | High-quality generation |
| **Shortcut** | 2-4 | Latent shortcut with reflow | Fast inference, good quality | **Recommended for most users** |
| **MeanFlow** | 1 | Flow-based generation | Fastest inference, single step | Real-time applications |
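
The step counts above map directly onto sampler cost, since each step is one forward pass of the network. The sketch below is illustrative only: `model(x, t)` is a hypothetical velocity predictor standing in for the actual networks in `models/`.

```python
# Illustrative only: why fewer sampling steps mean faster inference.
# `model` is a hypothetical velocity predictor v(x, t); the real networks
# live in models/Diffusion.py, models/LSM.py, and models/MeanFlow.py.
import torch

@torch.no_grad()
def euler_sample(model, noise: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x, dt = noise, 1.0 / num_steps
    for i in range(num_steps):  # one network call per step
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t)
    return x

# num_steps=20 costs 20 forward passes (Diffusion); num_steps=1 is a
# single pass, which is what makes MeanFlow-style sampling the fastest.
```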

### Performance Comparison

| Model | Steps | FGD Score ↓ | Beat Constancy ↑ | L1Div Score ↑ | Inference Speed |
|-------|-------|-------------|------------------|---------------|-----------------|
| **MeanFlow** | 1 | **0.4031** | **0.7489** | 12.4631 | **Fastest** |
| **Diffusion** | 20 | 0.4100 | 0.7384 | 12.5752 | Slowest |
| **Shortcut** | 20 | 0.4040 | 0.7144 | 13.4874 | Fast |
| **Shortcut-ReFlow** | 2 | 0.4104 | 0.7182 | **13.678** | Fast |

**Legend**: 
- **FGD Score** (↓): Lower is better - measures gesture quality
- **Beat Constancy** (↑): Higher is better - measures audio-gesture synchronization  
- **L1Div Score** (↑): Higher is better - measures diversity of generated gestures

**Recommendation**: **MeanFlow** offers the best FGD and Beat Constancy scores with the fastest inference speed, while **Shortcut-ReFlow** yields the highest diversity (L1Div).

### Legacy Testing (Deprecated)
> For reference only - use the unified pipeline above instead
```bash
# Old testing commands (deprecated)
python test.py -c configs/shortcut_rvqvae_128.yaml
python test.py -c configs/shortcut_reflow_test.yaml  
python test.py -c configs/diffuser_rvqvae_128.yaml
```

## Train RVQ-VAEs (1-speaker)
> **Note**: Requires dataset download.
```bash
bash train_rvq.sh
```

## Training

> **Note**: Requires dataset download for training.

### Unified Training Pipeline

The codebase now uses a unified `train.py` script for training all models. Use the new configuration files in `configs_new/`:

```bash
# Train Diffusion Model
python train.py --config configs_new/diffusion_rvqvae_128.yaml

# Train Shortcut Model  
python train.py --config configs_new/shortcut_rvqvae_128.yaml

# Train MeanFlow Model
python train.py --config configs_new/meanflow_rvqvae_128.yaml
```

### Training Configuration

- **Config Directory**: Use `configs_new/` for the latest configurations
- **Output Directory**: Models are saved to `./outputs/weights/`
- **Logging**: Supports Weights & Biases integration (configure in config files; see the sketch below)
- **GPU Support**: Configure GPU usage in the config files
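
If W&B logging is enabled, the integration typically amounts to something like the sketch below. This is illustrative only: the project wires logging up through the trainer and the YAML configs, and the project name and logged fields here are hypothetical.

```python
# Illustrative only: typical Weights & Biases usage. The project name and
# logged fields are hypothetical; the real hook is configured via YAML.
import wandb

run = wandb.init(project="gesturelsm", config={"model": "shortcut"})
wandb.log({"train/loss": 0.123, "step": 1})
run.finish()
```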

### Legacy Training (Deprecated)
> For reference only - use the unified pipeline above instead
```bash
# Old training commands (deprecated)
python train.py -c configs/shortcut_rvqvae_128.yaml
python train.py -c configs/diffuser_rvqvae_128.yaml
```


## Quick Start

### Demo/Inference (No Dataset Required)
```bash
# Run the web demo with Shortcut model
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```

### Testing with Your Own Data
```bash
# Test with your own audio and text (requires pretrained models)
python train.py --config configs_new/shortcut_rvqvae_128.yaml --ckpt ckpt/shortcut_reflow.bin --mode test
```

## Demo

The demo provides a web interface for gesture generation. It uses the Shortcut model by default for fast inference.

```bash
python demo.py -c configs/shortcut_rvqvae_128_hf.yaml
```

**Features**:
- Web-based interface for easy interaction
- Real-time gesture generation
- Support for custom audio and text input
- Visualization of generated gestures
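
For orientation, a Gradio app of this shape boils down to the skeleton below. It is a sketch only: `generate_gesture` is a hypothetical stub, and `demo.py` is the real entry point with the project's actual inputs and rendering.

```python
# Skeletal Gradio interface of the same shape as the demo (illustrative).
# `generate_gesture` is a hypothetical stub; see demo.py for the real one.
import gradio as gr

def generate_gesture(audio_path: str, text: str) -> str:
    """Would run the Shortcut model and return a rendered video path."""
    raise NotImplementedError("see demo.py for the real pipeline")

demo = gr.Interface(
    fn=generate_gesture,
    inputs=[gr.Audio(type="filepath"), gr.Textbox(label="Transcript")],
    outputs=gr.Video(),
    title="GestureLSM Demo",
)

if __name__ == "__main__":
    demo.launch()
```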



# πŸ™ Acknowledgments
Thanks to [SynTalker](https://github.com/RobinWitch/SynTalker/tree/main), [EMAGE](https://github.com/PantoMatrix/PantoMatrix/tree/main/scripts/EMAGE_2024), and [DiffuseStyleGesture](https://github.com/YoungSeng/DiffuseStyleGesture); our code partially borrows from them. Please check out these useful repos.


# πŸ“– Citation

If you find our code or paper helpful, please consider citing:

```bibtex
@inproceedings{liu2025gesturelsmlatentshortcutbased,
  title={{GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling}},
  author={Pinxin Liu and Luchuan Song and Junhua Huang and Chenliang Xu},
  booktitle={IEEE/CVF International Conference on Computer Vision},
  year={2025},
}
```