Extending Chatterbox TTS Model for Japanese-English Multilingual Support


YouTube video ID: g83rfFfPkHU

Source: YouTube video by Jarods Journey


Introduction

In this update the creator walks through the process of adapting the state‑of‑the‑art text‑to‑speech model Chatterbox to handle both English and Japanese. The original model lacked official fine‑tuning code, so a community‑provided script was used as a base.

Key Steps

  • Tokenizer Expansion: The default Chatterbox tokenizer only covers English. A new BPE tokenizer was trained on a large Japanese transcript file (converted to hiragana) with a 500‑token vocabulary. This tokenizer was then merged with the original 704‑token English tokenizer, resulting in a combined vocabulary of about 2,000 tokens.
  • Embedding Table Extension: To accommodate the larger vocabulary, the text‑embedding matrix of the T3 text model was extended from 704 to 2,000 entries using a custom script.
  • Configuration Update: The model configuration was edited to reflect the new vocab size, enabling the model to learn representations for the added Japanese tokens.
  • Freezing Existing Tokens: During fine‑tuning, the first 704 English token embeddings were frozen to preserve the already‑trained English speech quality while allowing the new Japanese tokens to be learned.
  • Data Preparation: Japanese audio transcripts from the Amelia dataset were concatenated into a single text file and normalized to hiragana (converting kanji to their readings) before tokenizer training.
  • Training & Inference: A new training run was started because the previous run was corrupted by a katakana handling mistake. Early inference shows clear English output and emerging Japanese speech, though the Japanese side still needs more training.
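
The embedding extension and freezing steps above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the creator's script: the helper names `extend_embedding` and `freeze_first_rows`, the embedding width of 16, and the use of a gradient hook to freeze rows are all hypothetical, and the real T3 module layout may differ. Only the sizes 704 and 2,000 come from the text.

```python
import torch
import torch.nn as nn

OLD_VOCAB = 704   # original English vocabulary size (from the text above)
NEW_VOCAB = 2000  # combined vocabulary size (from the text above)


def extend_embedding(old: nn.Embedding, new_size: int) -> nn.Embedding:
    """Return a larger embedding whose first rows are copied from `old`."""
    if new_size < old.num_embeddings:
        raise ValueError("new_size must not be smaller than the old vocabulary")
    new = nn.Embedding(new_size, old.embedding_dim)
    with torch.no_grad():
        # Keep the trained rows; the added rows keep their random init.
        new.weight[: old.num_embeddings] = old.weight
    return new


def freeze_first_rows(emb: nn.Embedding, n_frozen: int) -> None:
    """Zero the gradient of the first `n_frozen` rows on every backward pass."""
    def mask_grad(grad: torch.Tensor) -> torch.Tensor:
        grad = grad.clone()
        grad[:n_frozen] = 0
        return grad

    emb.weight.register_hook(mask_grad)


# Toy stand-in for the text-embedding table; the width of 16 is hypothetical.
old_emb = nn.Embedding(OLD_VOCAB, 16)
new_emb = extend_embedding(old_emb, NEW_VOCAB)
freeze_first_rows(new_emb, OLD_VOCAB)

# One dummy optimizer step: ID 3 is an old token, ID 1500 is a new token.
before = new_emb.weight.detach().clone()
opt = torch.optim.SGD(new_emb.parameters(), lr=0.1)
loss = new_emb(torch.tensor([3, 1500])).sum()
loss.backward()
opt.step()
after = new_emb.weight.detach()

print("old rows unchanged:", torch.equal(after[:OLD_VOCAB], before[:OLD_VOCAB]))
print("new row updated:", not torch.equal(after[1500], before[1500]))
```

Masking the gradient only keeps the old rows fixed when the optimizer applies no other update to them. With decoupled weight decay (for example `torch.optim.AdamW` with a non-zero `weight_decay`), the frozen rows would still shrink, so the decay would need to be disabled for this parameter or the rows restored after each step.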

Challenges Encountered

  • Katakana Issue: Using the pykakasi conversion library introduced duplicate tokens when handling katakana, which broke tokenization. The fix was to drop katakana entirely and use hiragana only.
  • Token Overlap for Future Languages: Adding languages that share characters with English (e.g., German) would require un‑freezing overlapping tokens and possibly mixing English data to avoid catastrophic forgetting.
  • Checkpoint Management: The earlier run was lost, so a fresh checkpoint had to be created.
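
Duplicate tokens of the kind described above can be caught with a check at merge time. The sketch below is one assumed way to combine two vocabularies, not the creator's merge script: `merge_vocabs` is a hypothetical helper and the toy token strings are made up for illustration. It keeps every original ID unchanged, appends only unseen tokens, and reports the tokens it skipped.

```python
from typing import Dict, List, Tuple


def merge_vocabs(base: Dict[str, int], extra: List[str]) -> Tuple[Dict[str, int], List[str]]:
    """Append tokens from `extra` to `base` without changing existing IDs.

    Returns the merged vocabulary and the tokens that were skipped because
    they were already present (in `base` or earlier in `extra`).
    """
    merged = dict(base)
    duplicates: List[str] = []
    next_id = max(base.values(), default=-1) + 1
    for token in extra:
        if token in merged:
            duplicates.append(token)
            continue
        merged[token] = next_id
        next_id += 1
    return merged, duplicates


# Toy data (hypothetical): a tiny "English" vocabulary and some new tokens.
english = {"[PAD]": 0, "a": 1, "b": 2, "th": 3}
japanese = ["あ", "い", "a", "かい", "あ"]

merged, dupes = merge_vocabs(english, japanese)
print(len(merged), dupes)

# IDs must stay unique and contiguous so they can index an embedding table.
assert sorted(merged.values()) == list(range(len(merged)))
```

Appending new tokens after the existing IDs is what lets the original rows of the embedding table keep their meaning, so the frozen English embeddings remain valid after the merge.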

Tools & Infrastructure

  • Beam Cloud: The author leveraged Beam Cloud to run training jobs directly from Python without container orchestration, simplifying remote execution.
  • Tortoise Knowledge: Prior experience with the Tortoise TTS project helped in adapting the BPE tokenizer and embedding extensions.

Current Status & Next Steps

  • A new Japanese‑English model is actively training with a 2,000‑token vocabulary.
  • The creator plans to release a tutorial on fine‑tuning Chatterbox by next week.
  • Model checkpoints (including a toy Japanese‑only model) will be shared with channel members, with broader releases to follow.

Practical Takeaways

  1. Extending a TTS model to new languages primarily involves expanding the tokenizer and embedding matrix.
  2. Freezing the existing language's embeddings preserves its speech quality while the new token embeddings are trained.
  3. Normalizing non‑Latin scripts (hiragana for Japanese) simplifies tokenization.
  4. Cloud‑based training platforms like Beam Cloud can accelerate experimentation.
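
For the script normalization takeaway, a quick check before tokenizer training can confirm that a transcript really contains only hiragana. The sketch below is an assumption about how such a check might look, not part of the video's pipeline: `script_of`, `script_counts`, and `is_hiragana_only` are hypothetical helpers, and only the basic CJK Unified Ideographs block is treated as kanji.

```python
from collections import Counter
from typing import Dict

# Unicode block ranges (inclusive) for the scripts discussed above.
BLOCKS = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),  # CJK Unified Ideographs (basic block only)
}


def script_of(ch: str) -> str:
    """Classify one character by Unicode block; anything else is 'other'."""
    cp = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return "other"


def script_counts(text: str) -> Dict[str, int]:
    """Count non-whitespace characters per script."""
    return dict(Counter(script_of(ch) for ch in text if not ch.isspace()))


def is_hiragana_only(text: str) -> bool:
    """True if every non-whitespace character falls in the hiragana block."""
    return all(script_of(ch) == "hiragana" for ch in text if not ch.isspace())


samples = ["こんにちは", "コーヒー を のむ", "日本語"]
for line in samples:
    print(line, script_counts(line), is_hiragana_only(line))
```

Two caveats apply to this sketch: punctuation such as "。" is classified as "other" and fails the strict check, and the long vowel mark "ー" sits in the katakana block even when it appears in hiragana text, so a real pipeline would need an explicit policy for both.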

By enlarging the tokenizer, extending the embedding table, and freezing the existing English token embeddings, the Chatterbox TTS model can be fine‑tuned toward speaking both English and Japanese, although the Japanese output still needs more training. The same workflow can be adapted for other languages with similar script considerations.
