Pre-training dataset format?

by asif00 - opened Mar 21, 2025

Mar 21, 2025

Has anyone tried pre-training it for a new language?
I'm a bit confused about how the pre-training datasets should be structured:
text_QA_dataset: [I'm assuming this is for training the LLM]
TTS_dataset: [This is for training the TTS]
I'm just unsure what format each one should take. An example (a sample) or a dataset link for both types would be awesome! Thanks in advance!

amuvarma

Canopy Labs org Mar 21, 2025

See the GitHub repo for more info. There is also a similar issue there where I go over the format. Happy to go into more detail if anything is unclear (an issue posted on GitHub will probably be looked at sooner): https://github.com/canopyai/Orpheus-TTS

amuvarma changed discussion status to closed Mar 21, 2025
