pipeline_tag: audio-to-audio
---

# Xcodec2 (Transformers-compatible version)

The X-Codec2 model was proposed in [Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis](https://huggingface.co/papers/2502.04128).

X-Codec2 is a neural audio codec designed to improve speech synthesis and general audio generation for large language model (LLM) pipelines. It extends the original X-Codec by refining how semantic and acoustic information is integrated and tokenized, enabling efficient and high-fidelity audio representation.

Its architecture is based on X-Codec with several major differences:

- **Unified Semantic-Acoustic Tokenization**: X-Codec2 fuses outputs from a semantic encoder (e.g., Wav2Vec2-BERT) and an acoustic encoder into a single embedding, capturing both high-level meaning (e.g., text content, emotion) and low-level audio details (e.g., timbre).
- **Single-Stage Vector Quantization (VQ)**: Unlike the multi-layer residual VQ in most approaches (e.g., [X-Codec](./xcodec), [DAC](./dac), [EnCodec](./encodec)), X-Codec2 uses a single-layer Feature-Space Quantization (FSQ) for stability and compatibility with causal, autoregressive LLMs.
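The single-stage idea can be pictured with a toy scalar-quantization sketch. This is illustrative only — the grid size, rounding scheme, and the `fsq_quantize`/`fsq_code` helpers below are assumptions for exposition, not X-Codec2's actual quantizer: each dimension of a latent vector is snapped independently to a small grid of values, so the codebook is implicit and a whole vector maps to a single integer token.

```python
def fsq_quantize(z, levels=5):
    """Snap each dimension of z (values in [-1, 1]) to a grid of `levels` points.

    The "codebook" is the implicit grid itself, so no nearest-neighbor
    codebook search is needed — each dimension is rounded independently.
    """
    step = 2.0 / (levels - 1)  # spacing between grid points
    return [round((x + 1.0) / step) * step - 1.0 for x in z]


def fsq_code(zq, levels=5):
    """Map a quantized vector to one integer token via mixed-radix encoding."""
    step = 2.0 / (levels - 1)
    indices = [round((x + 1.0) / step) for x in zq]  # per-dimension level index
    code = 0
    for i in indices:
        code = code * levels + i
    return code


z = [0.31, -0.87, 0.02]
zq = fsq_quantize(z)   # each entry snapped to {-1.0, -0.5, 0.0, 0.5, 1.0}
token = fsq_code(zq)   # one integer id for the whole vector
```

Because the whole vector is quantized in a single pass, there is no residual cascade of codebooks to model downstream — an autoregressive LLM only has to predict one token stream.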

This model was contributed by [Steven Zheng](https://huggingface.co/Steveeeeeeen) and [Eric Bezzam](https://huggingface.co/bezzam).

The original code can be found [here](https://github.com/zhenye234/X-Codec-2.0), and original checkpoints [here](https://huggingface.co/HKUSTAudio/xcodec2).