T5Gemma-TTS-2b-2b
T5Gemma-TTS-2b-2b is a multilingual Text-to-Speech (TTS) model developed as a personal project. It utilizes an Encoder-Decoder LLM architecture, supporting English, Chinese, and Japanese.
🌟 Overview
This model is an Encoder-Decoder LLM based TTS system initialized from the weights of google/t5gemma-2b-2b-ul2. While it leverages pre-trained LLM weights, the audio component has been trained from scratch specifically for TTS tasks.
You can try the interactive demo on Hugging Face Spaces: T5Gemma-TTS Demo
Key Features
- Multilingual Support: Supports English, Chinese, and Japanese.
- Voice Cloning: Capable of zero-shot voice cloning from reference audio.
- Duration Control: Allows users to control the speed and length of the generated audio explicitly.
- Open Source Code: Training code and inference scripts are available on GitHub.
Note: This is a hobby project. There are no formal objective evaluation metrics (WER/CER, SIM-O, etc.) available at this time.
🏗️ Technical Details
Architecture
The architecture is inspired by VoiceStar (arXiv:2505.19462). It adopts mechanisms such as PM-RoPE for length control.
- Base Model: google/t5gemma-2b-2b-ul2 (Weights used for initialization).
- Audio Codec: XCodec2 and its derivatives.
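The core idea behind PM-RoPE-style length control is that the decoder's positional signal encodes *progress through the utterance* rather than a raw token index, so sequences of different target lengths all terminate at the same final position. The sketch below is a loose illustration of that idea using standard rotary embeddings with rescaled positions; the `progress_range` value is an arbitrary illustrative choice, and the exact formulation differs — see the VoiceStar paper (arXiv:2505.19462) for details.

```python
import numpy as np

def rope_cos_sin(positions, dim, base=10000.0):
    """Standard RoPE cos/sin tables for (possibly fractional) positions."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = np.outer(positions, inv_freq)            # (T, dim/2)
    return np.cos(angles), np.sin(angles)

def progress_positions(num_tokens, progress_range=2048.0):
    """Map token indices onto a fixed [0, progress_range) scale so the
    decoder sees 'fraction of the utterance completed' instead of a raw
    index. progress_range is an illustrative constant, not from the paper."""
    return np.arange(num_tokens) / num_tokens * progress_range

# A short and a long target reach (almost) the same final position,
# so the model can tell how close it is to the end of the utterance.
short = progress_positions(250)   # e.g. 5 s of audio at 50 tokens/s
long_ = progress_positions(500)   # e.g. 10 s of audio at 50 tokens/s
cos_s, sin_s = rope_cos_sin(short, dim=64)
```

With raw indices, a 10-second target would end at position 499 and a 5-second target at 249, giving the model no consistent "end is near" signal; the rescaled scheme makes both end near `progress_range`.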
Training Data
The model was trained on approximately 170,000 hours of publicly available speech data, mainly from the Emilia and Libriheavy datasets.
| Language | Approx. Hours |
|---|---|
| English | ~100k hours |
| Chinese | ~50k hours |
| Japanese | ~20k hours |
Training Hardware
Training was conducted on the AMD Developer Cloud using 8x MI300X GPUs for approximately 2 weeks.
- You can check the training logs here: WandB
🎧 Audio Samples
Below are some samples generated by T5Gemma-TTS-2b-2b.
1. Multilingual TTS
Basic text-to-speech generation in supported languages.
| Language | Text Prompt | Audio |
|---|---|---|
| English | "The old library was silent, save for the gentle ticking of a clock somewhere in the shadows. As I ran my fingers along the dusty spines of the books, I felt a strange sense of nostalgia, as if I had lived a thousand lives within these walls." | |
| Chinese | "那是一个宁静的夜晚,月光洒在湖面上,波光粼粼。微风轻拂,带来了远处花朵的清香。我独自坐在岸边,心中涌起一股莫名的感动,仿佛整个世界都在这一刻静止了。" | |
| Japanese | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" | |
2. Duration Control
Examples of generating the same text with different duration constraints.
English Sample
Text: "This new model allows users to strictly control the duration of the generated speech."
| Target Duration | Generated Audio |
|---|---|
| 3.0s (Fast) | |
| 5.0s (Normal) | |
| 7.0s (Slow) | |
Japanese Sample
Text: "このモデルでは、生成音声の長さを自由に調整できます。"
| Target Duration | Generated Audio |
|---|---|
| 3.0s (Fast) | |
| 5.0s (Normal) | |
| 7.0s (Slow) | |
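Duration control in codec-based TTS ultimately comes down to fixing the number of audio tokens to generate, since the codec emits tokens at a constant frame rate. The helper below is a hypothetical illustration; the 50 tokens-per-second rate is an assumed value for an XCodec2-style single-codebook codec, not a confirmed figure for this model.

```python
# Assumed codec token rate (tokens per second of audio). Illustrative only.
FRAME_RATE_HZ = 50

def duration_to_tokens(seconds: float, frame_rate: int = FRAME_RATE_HZ) -> int:
    """Number of audio-codec tokens needed for a target duration."""
    return round(seconds * frame_rate)

# The three target durations from the tables above:
for target in (3.0, 5.0, 7.0):
    print(f"{target:.1f}s -> {duration_to_tokens(target)} tokens")
```

Under this assumption, asking for a 7-second utterance means committing the decoder to roughly 350 autoregressive steps, which is why duration and generation cost scale together.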
3. Voice Cloning (Zero-shot)
Examples of cloning a voice from a reference audio clip.
Note: The reference audio samples below were generated using NandemoGHS/Anime-Llasa-3B and gemini-2.5-pro-preview-tts.
| Case | Reference Audio | Generated Audio |
|---|---|---|
| Example 1 | ||
| Example 2 | ||
| Example 3 |
🚀 Usage
For inference code, installation instructions, and training scripts, please refer to the GitHub repository:
👉 GitHub
⚠️ Limitations
- Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases.
- Duration Control: While the model supports explicit duration specification, control is not perfect. Generated audio may differ from the specified duration, and even when the duration matches, the speech pacing or naturalness may not always be optimal.
- Audio Quality: Quality depends on training data characteristics. Performance may vary for voices, accents, or speaking styles underrepresented in the training data.
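The inference-speed limitation can be made concrete with a back-of-envelope real-time-factor (RTF) estimate: autoregressive decoding pays a fixed per-token cost, so generation time grows linearly with audio length. Both constants below are illustrative assumptions, not measured values for this model.

```python
FRAME_RATE_HZ = 50   # assumed codec token rate (tokens per second of audio)
STEP_MS = 30.0       # assumed latency of one autoregressive decoder step

def estimated_rtf(step_ms: float = STEP_MS, frame_rate: int = FRAME_RATE_HZ) -> float:
    """Real-time factor: generation time divided by audio duration.
    RTF > 1 means generation is slower than real time."""
    return (step_ms / 1000.0) * frame_rate

print(estimated_rtf())  # 1.5 -> 10 s of audio takes ~15 s to generate
```

Even with a modest per-step latency, an RTF above 1 rules out low-latency streaming use without further optimization (e.g. batching, speculative decoding, or a faster codec).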
📜 License
This model is released under a Dual License policy. Users must strictly comply with BOTH of the following sets of terms:
- Gemma Terms of Use: Since this model is derived from google/t5gemma-2b-2b-ul2, you must adhere to the Gemma Terms of Use.
- CC-BY-NC 4.0: Due to the constraints of the training datasets (such as Emilia), this model is restricted to Non-Commercial Use Only.
⚠️ Important Note on Codec: The audio codec used, XCodec2, is also released under a CC-BY-NC license. Please ensure you also follow their license terms when using the generated audio.
Ethical Restrictions: Do not use this model to impersonate specific individuals (e.g., voice cloning of voice actors, celebrities, or public figures) without their explicit consent.
🙏 Acknowledgments
I would like to thank the following for their open-source contributions, which made this project possible:
- VoiceStar - Architecture inspiration
- T5Gemma - Base model
- XCodec2 and XCodec2-Variant - Audio codec
🖊️ Citation
If you use this model in your work, please cite it as follows:
```bibtex
@misc{t5gemma-tts,
  author       = {Aratako},
  title        = {T5Gemma-TTS-2b-2b: An Encoder-Decoder LLM-based TTS Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/T5Gemma-TTS-2b-2b}}
}
```
