How to use Supertone/supertonic-3 with Supertonic:
    from supertonic import TTS

    tts = TTS(auto_download=True)
    style = tts.get_voice_style(voice_name="M1")
    text = "The train delay was announced at 4:45 PM on Wed, Apr 3, 2024 due to track maintenance."
    wav, duration = tts.synthesize(text, voice_style=style)
    tts.save_audio(wav, "output.wav")
Question about the Russian dataset?
The model currently often generates И (the vowel [i]) instead of Й (the voiced palatal approximant [j]) during synthesis. This distinction is contrastive in many cases, for example in the plural: герой [hero] vs. герои [heroes].
The same happens with the letter Ё ([jo]), which is synthesized as Е ([jɛ]). For example, Всё | vsʲˈo [everything] and Все | vsʲˈe [everyone / everybody] should be pronounced differently, so that the synthesized speech does not sound unnatural to a native Russian speaker. Many other words contain the letter Ё as well.
Was there over-normalization during the data preparation stage? Or what else could be causing this problem?
Hi @rraaww,
Thank you so much for trying the model and for such a careful, detailed report — the examples you gave (герой/герои, Всё/Все) capture the issue exactly. You are absolutely right that these contrasts are phonologically essential in Russian, and we are sorry that the current model does not preserve them as a native speaker would expect.
A bit of context on what is happening:
Supertonic 3 uses a shared character vocabulary across many languages and scripts. To support this, the current text pipeline applies Unicode NFKD normalization, which decomposes characters with diacritics into a base character plus a combining mark. So й is represented as и + a combining breve, and ё as е + a combining diaeresis. The model then has to learn that the combining mark changes the pronunciation of the preceding letter — turning и into [j] and е into [jo].
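The decomposition described above can be checked with Python's standard `unicodedata` module. This is only an illustration of what NFKD does to the two letters, not the Supertonic text pipeline itself; the helper name `describe_nfkd` is made up for this example.

```python
import unicodedata


def describe_nfkd(text: str) -> list[str]:
    """Return the Unicode names of the code points in the NFKD form of `text`."""
    return [unicodedata.name(ch) for ch in unicodedata.normalize("NFKD", text)]


# Escapes are used so the inputs are guaranteed to be the precomposed letters:
# U+0439 is й (CYRILLIC SMALL LETTER SHORT I), U+0451 is ё (CYRILLIC SMALL LETTER IO).
for letter in ("\u0439", "\u0451"):
    print(unicodedata.name(letter), "->", describe_nfkd(letter))

# CYRILLIC SMALL LETTER SHORT I -> ['CYRILLIC SMALL LETTER I', 'COMBINING BREVE']
# CYRILLIC SMALL LETTER IO -> ['CYRILLIC SMALL LETTER IE', 'COMBINING DIAERESIS']
```

After NFKD, each of the two letters is therefore two code points: the base vowel followed by a combining mark.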
So the information is not over-normalized away in preprocessing — the marks reach the model — but the signal is structurally weaker than the base letters. In a multilingual model where the same combining diaeresis is shared across many languages and is dominated by other languages' usage, and where the base и and е are far more common on their own than in combination with the mark, the model tends to under-realize the diacritic and fall back to the base vowel. That is the "И instead of Й" and "Е instead of Ё" pattern you are hearing.
For ё specifically there is also a data-side factor: written Russian very often replaces ё with plain е (consistently writing ё, known as "ёфикация", is the exception rather than the rule), so even in ASR-transcribed training text the letter appears much less often than its true phonological frequency would suggest. This compounds the structural weakness above.
We are reviewing these language-specific pronunciation issues as part of our next improvements. A more robust fix would likely involve changes to tokenization or text representation, for example preserving й and ё (and other language-specific letters) as single tokens rather than relying on decomposed combining marks, and possibly rebalancing or restoring ё in the training text.
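As a rough sketch of the "single token" idea, one option is to keep NFKD for everything else but put a small set of protected letters back into their precomposed form afterwards. The set `PROTECTED` and the function `normalize_keep_letters` are hypothetical names for illustration; this is not how Supertonic tokenizes text, and a real fix would also need a matching vocabulary and retraining.

```python
import unicodedata

# Hypothetical set of letters that should stay single code points:
# й (U+0439), Й (U+0419), ё (U+0451), Ё (U+0401).
PROTECTED = "\u0439\u0419\u0451\u0401"

# Map each protected letter's decomposed form back to the precomposed letter.
RECOMPOSE = {unicodedata.normalize("NFKD", ch): ch for ch in PROTECTED}


def normalize_keep_letters(text: str) -> str:
    """Apply NFKD, then restore the protected letters as single code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    for sequence, letter in RECOMPOSE.items():
        decomposed = decomposed.replace(sequence, letter)
    return decomposed


word = normalize_keep_letters("геро\u0439")
print(len(word))  # 5: the final й is one code point again
```

Other accented characters are still decomposed as before, so the change stays limited to the letters listed in `PROTECTED`.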
At the moment, we do not have a near-term plan to release a Russian-specific checkpoint or a quick Russian-only retraining update. But we agree this is an important issue, and if we make improvements to Russian pronunciation in a future release, we will share an update in the model card and discussions.
Thanks again for the careful examples and for the honest feedback — reports like this are exactly what helps us decide what to improve next.
Hi @rraaww,
A quick update and correction from our side.
After investigating this again, we found that at least part of this issue was caused by the Python package preprocessing path, not by the model weights themselves. The package was applying NFKD normalization and then removing combining diacritic marks, which could turn cases like й into и and ё into е before the text reached the model.
We have fixed this in the Python package so that these combining marks are preserved properly.
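To illustrate the kind of bug described above, here is a minimal sketch using only the standard library. It is not the package's actual code; both function names are made up for this example. It shows how normalizing and then dropping combining marks collapses герой into герои, while keeping the marks preserves the contrast.

```python
import unicodedata


def normalize_and_strip_marks(text: str) -> str:
    """NFKD-normalize, then drop combining marks (the lossy behaviour)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() is non-zero for marks such as the breve and diaeresis.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_keep_marks(text: str) -> str:
    """NFKD-normalize and keep the combining marks."""
    return unicodedata.normalize("NFKD", text)


singular = "геро\u0439"  # герой, ending in й (U+0439)
plural = "геро\u0438"    # герои, ending in и (U+0438)

print(normalize_and_strip_marks(singular) == normalize_and_strip_marks(plural))  # True
print(normalize_keep_marks(singular) == normalize_keep_marks(plural))            # False
```

With the marks stripped, the two words reach the model as identical input, so no amount of training could separate them.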
Once the updated package is released, please try again with:
pip install -U supertonic
Sorry for the earlier inaccurate explanation, and thank you again for the detailed examples. They helped us catch a real SDK-side bug.