
T5Gemma-TTS-2b-2b

GitHub WandB Demo Space

The Japanese version of this README is available here.

T5Gemma-TTS-2b-2b is a multilingual Text-to-Speech (TTS) model developed as a personal project. It utilizes an Encoder-Decoder LLM architecture, supporting English, Chinese, and Japanese.

🌟 Overview

This model is an Encoder-Decoder LLM based TTS system initialized from the weights of google/t5gemma-2b-2b-ul2. While it leverages pre-trained LLM weights, the audio component has been trained from scratch specifically for TTS tasks.

You can try the interactive demo on Hugging Face Spaces: T5Gemma-TTS Demo

Key Features

  • Multilingual Support: Supports English, Chinese, and Japanese.
  • Voice Cloning: Capable of zero-shot voice cloning from reference audio.
  • Duration Control: Allows users to control the speed and length of the generated audio explicitly.
  • Open Source Code: Training code and inference scripts are available on GitHub.

Note: This is a hobby project. There are no formal objective evaluation metrics (WER/CER, SIM-O, etc.) available at this time.

🏗️ Technical Details

Architecture

The architecture is inspired by VoiceStar (arXiv:2505.19462). It adopts mechanisms such as PM-RoPE for length control.
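To illustrate the length-control idea, here is a minimal sketch of how a progress-monitoring variant of RoPE can encode "distance to the target end" directly in the position indices. This is an illustrative reconstruction of the concept, not the model's actual implementation; the `max_pos` constant and the linear remapping are assumptions.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Standard RoPE: one rotation angle per (position, frequency-pair)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)  # shape (len(positions), dim // 2)

def progress_positions(num_tokens, target_tokens, max_pos=1000.0):
    # Sketch of the progress-monitoring idea: token indices are rescaled so
    # that `max_pos` always lines up with the requested end of the utterance.
    # The decoder can then read "how far along am I?" from the position alone.
    return np.arange(num_tokens) * (max_pos / target_tokens)

# With a 100-token target, token 50 sits exactly at the halfway position.
pos = progress_positions(100, 100)
angles = rope_angles(pos, dim=64)
```

Because the positions are normalized to the target length rather than absolute indices, the same mechanism generalizes to different requested durations at inference time.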

Training Data

The model was trained on approximately 170,000 hours of publicly available speech datasets (mainly Emilia and libriheavy).

| Language | Approx. Hours |
|----------|---------------|
| English  | ~100k hours   |
| Chinese  | ~50k hours    |
| Japanese | ~20k hours    |

Training Hardware

Training was conducted on the AMD Developer Cloud using 8x MI300X GPUs for approximately 2 weeks.
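As a rough back-of-the-envelope figure (assuming "approximately 2 weeks" means 14 days of continuous training, which is an assumption):

```python
gpus = 8          # MI300X accelerators
days = 14         # "approximately 2 weeks", assumed continuous
gpu_hours = gpus * days * 24
print(gpu_hours)  # 2688 GPU-hours
```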

  • You can check the training logs here: WandB

🎧 Audio Samples

Below are some samples generated by T5Gemma-TTS-2b-2b.

1. Multilingual TTS

Basic text-to-speech generation in supported languages.

| Language | Text Prompt | Audio |
|----------|-------------|-------|
| English  | "The old library was silent, save for the gentle ticking of a clock somewhere in the shadows. As I ran my fingers along the dusty spines of the books, I felt a strange sense of nostalgia, as if I had lived a thousand lives within these walls." | (audio sample) |
| Chinese  | "那是一个宁静的夜晚,月光洒在湖面上,波光粼粼。微风轻拂,带来了远处花朵的清香。我独自坐在岸边,心中涌起一股莫名的感动,仿佛整个世界都在这一刻静止了。" | (audio sample) |
| Japanese | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" | (audio sample) |

2. Duration Control

Examples of generating the same text with different duration constraints.

English Sample

Text: "This new model allows users to strictly control the duration of the generated speech."

| Target Duration | Generated Audio |
|-----------------|-----------------|
| 3.0s (Fast)     | (audio sample)  |
| 5.0s (Normal)   | (audio sample)  |
| 7.0s (Slow)     | (audio sample)  |

Japanese Sample

Text: "このモデルでは、生成音声の長さを自由に調整できます。"

| Target Duration | Generated Audio |
|-----------------|-----------------|
| 3.0s (Fast)     | (audio sample)  |
| 5.0s (Normal)   | (audio sample)  |
| 7.0s (Slow)     | (audio sample)  |
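Under the hood, an explicit target duration presumably has to be converted into a budget of audio codec tokens before autoregressive decoding. A minimal sketch of that mapping, assuming an XCodec2-style codec with a fixed token rate (the 50 tokens/second figure here is an assumed placeholder, not a confirmed value for this model):

```python
def duration_to_tokens(seconds: float, tokens_per_second: int = 50) -> int:
    # Map a requested duration to a codec-token budget.
    # tokens_per_second is an assumption for illustration.
    return round(seconds * tokens_per_second)

for target in (3.0, 5.0, 7.0):
    print(f"{target}s -> {duration_to_tokens(target)} tokens")
```

This also makes the latency limitation below concrete: longer requested durations mean proportionally more autoregressive decode steps.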

3. Voice Cloning (Zero-shot)

Examples of cloning a voice from a reference audio clip.

Note: The reference audio samples below were generated using NandemoGHS/Anime-Llasa-3B and gemini-2.5-pro-preview-tts.

| Case      | Reference Audio | Generated Audio |
|-----------|-----------------|-----------------|
| Example 1 | (audio sample)  | (audio sample)  |
| Example 2 | (audio sample)  | (audio sample)  |
| Example 3 | (audio sample)  | (audio sample)  |

🚀 Usage

For inference code, installation instructions, and training scripts, please refer to the GitHub repository:

👉 GitHub

⚠️ Limitations

  • Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases.
  • Duration Control: While the model supports explicit duration specification, control is not perfect. Generated audio may differ from the specified duration, and even when the duration matches, the speech pacing or naturalness may not always be optimal.
  • Audio Quality: Quality depends on training data characteristics. Performance may vary for voices, accents, or speaking styles underrepresented in the training data.

📜 License

This model is released under a Dual License policy. Users must strictly comply with BOTH of the following sets of terms:

  1. Gemma Terms of Use: Since this model is derived from google/t5gemma-2b-2b-ul2, you must adhere to the Gemma Terms of Use.
  2. CC-BY-NC 4.0: Due to the constraints of the training datasets (such as Emilia), this model is restricted to Non-Commercial Use Only.

⚠️ Important Note on Codec: The audio codec used, XCodec2, is also released under a CC-BY-NC license. Please ensure you also follow their license terms when using the generated audio.

Ethical Restrictions: Do not use this model to impersonate specific individuals (e.g., voice cloning of voice actors, celebrities, or public figures) without their explicit consent.

🙏 Acknowledgments

I would like to thank the following for their open-source contributions, which made this project possible:

🖊️ Citation

If you use this model in your work, please cite it as follows:

@misc{t5gemma-tts,
  author = {Aratako},
  title = {T5Gemma-TTS-2b-2b: An Encoder-Decoder LLM-based TTS Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/Aratako/T5Gemma-TTS-2b-2b}}
}