Japanese text often includes kanji, numbers, symbols, and units whose pronunciation depends heavily on context.
When generating speech from such text, relying on the original written form alone can result in incorrect or unnatural pronunciation.
To reduce these issues, Supertone Text-to-Speech supports a normalized text feature for Japanese.
By providing a pronunciation-oriented version of the input text alongside the original text, the TTS engine can produce clearer and more accurate speech.
This guide explains why normalized text is useful, when to use it, and how to generate it effectively.

Why Normalized Text Is Needed

Japanese writing and spoken Japanese are not always aligned one-to-one. Kanji may have multiple readings, numbers are spoken differently depending on context, and symbols or units are often expanded or transformed when read aloud. These characteristics make it difficult for TTS systems to consistently infer correct pronunciation from raw text alone. Normalized text fills this gap by explicitly describing how the sentence should be spoken, while keeping the original text intact for semantic understanding.

How It Works

When generating speech, you provide the original Japanese text written naturally using kanji and kana, along with a normalized version of the same text that represents its pronunciation. The TTS engine uses both together: the original text preserves meaning and context, while the normalized text guides pronunciation.
If normalized text is not provided, the system relies only on the original text.
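
As a rough illustration of the fallback described above, the sketch below assembles a request body that carries the normalized text only when one is available. The function name `build_tts_payload` and the field names `text` and `normalized_text` are hypothetical placeholders, not the documented API; check the Text-to-Speech API reference for the actual parameter names.

```python
from typing import Optional


def build_tts_payload(original_text: str, normalized_text: Optional[str] = None) -> dict:
    """Assemble a request body for a Japanese TTS call.

    NOTE: the keys "text" and "normalized_text" are assumptions made for
    illustration; they are not taken from the official API reference.
    """
    payload = {"text": original_text}
    # Attach the pronunciation hint only when one was supplied; otherwise
    # the engine would rely on the original text alone.
    if normalized_text:
        payload["normalized_text"] = normalized_text
    return payload
```

Calling `build_tts_payload("今日は10%オフだよ。")` yields a body with the original text only, while passing a second argument adds the pronunciation-oriented version alongside it.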

When You Should Use Normalized Text

Normalized text is especially useful when your input contains numbers, measurement units, symbols, abbreviations, kanji with ambiguous readings, or mixed Japanese–English expressions. It is strongly recommended for narration, announcements, audiobooks, and character voices where pronunciation accuracy matters.
For short or casual conversational sentences, normalized text may not be necessary.

Generating Normalized Text with LLMs

In most workflows, an LLM generates the normalized text before the TTS API is called. Below is a recommended prompt for converting Japanese text into a normalized pronunciation form.
The prompt produces structured JSON output that can be used directly in a TTS request. The prompt text begins on the next line and includes the response format, conversion rules, and examples.
You will receive a Japanese sentence that may contain kanji, numbers, symbols, and units.
For the given input, provide:
- the original text (natural Japanese text using standard kanji–kana mixed notation, without furigana)
- the normalized text, converted according to the rules below.

Important:
- You must respond only with pure JSON format.
- Do not include any explanations or additional text.
- In original_text, do not include furigana (ruby annotations).

Response Format

{
  "original_text": "[natural Japanese Text]",
  "normalized_text": "[converted Text]"
}

Transcription Conversion Rules
1. Convert all kanji into hiragana using context-appropriate readings.
2. Keep katakana as is.
3. Preserve punctuation exactly as written.
4. Convert Arabic numerals into hiragana.
5. Expand units and English abbreviations into full katakana forms.
6. Apply natural phonological changes such as gemination and sound alternations.

Conversion Examples

{
  "original_text": "今日はどんな一日だったの?",
  "normalized_text": "きょうはどんないちにちだったの?"
}

{
  "original_text": "今日は10%オフだよ。身長は170cm、体重は60kgだって!",
  "normalized_text": "きょうはじゅっパーセントオフだよ。しんちょうはひゃくななじゅっセンチメートル、たいじゅうはろくじゅっキログラムだって!"
}
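
Outside the prompt itself, it can help to sanity-check the LLM's reply before passing it to the TTS request. The following is a minimal sketch, not part of the Supertone API: the function name `check_normalized` and the specific checks are assumptions derived from the conversion rules above (no kanji left, no Arabic numerals left, punctuation preserved). It cannot tell whether a reading is correct, only whether the output has the expected shape.

```python
import json
import re

# CJK ideograph ranges (kanji) plus the iteration mark.
KANJI = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf々]")
# ASCII and full-width Arabic digits.
DIGITS = re.compile(r"[0-9０-９]")
# Punctuation marks compared between the two texts (an illustrative subset).
PUNCT = "、。!?！？"


def check_normalized(raw_json: str) -> list[str]:
    """Return a list of problems found in an LLM reply; an empty list means it passed."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in ("original_text", "normalized_text"):
        if not isinstance(data.get(key), str):
            problems.append(f"missing or non-string field: {key}")
    if problems:
        return problems

    original, normalized = data["original_text"], data["normalized_text"]
    if KANJI.search(normalized):  # rule 1: all kanji become hiragana
        problems.append("normalized_text still contains kanji")
    if DIGITS.search(normalized):  # rule 4: numerals become hiragana
        problems.append("normalized_text still contains Arabic numerals")
    # rule 3: punctuation should appear in the same order in both texts
    if [c for c in original if c in PUNCT] != [c for c in normalized if c in PUNCT]:
        problems.append("punctuation differs between the two texts")
    return problems
```

Both conversion examples above pass these checks with an empty list. A reply that leaves 今日 unconverted in `normalized_text` would be reported, at which point the request could be retried or sent without normalized text.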