Japanese text often includes kanji, numbers, symbols, and units whose pronunciation depends heavily on context.
When generating speech from such text, relying on the original written form alone can result in incorrect or unnatural pronunciation.
To reduce these issues, Supertone Text-to-Speech supports a normalized text feature for Japanese.
When you provide a pronunciation-oriented version of the input text alongside the original, the TTS engine can produce clearer and more accurate speech.
This guide explains why normalized text is useful, when to use it, and how to generate it effectively.

Why Normalized Text Is Needed

Japanese writing and spoken Japanese are not always aligned one-to-one. Kanji may have multiple readings, numbers are spoken differently depending on context, and symbols or units are often expanded or transformed when read aloud. These characteristics make it difficult for TTS systems to consistently infer correct pronunciation from raw text alone. Normalized text fills this gap by explicitly describing how the sentence should be spoken, while keeping the original text intact for semantic understanding.

How It Works

When generating speech, you provide the original Japanese text written naturally using kanji and kana, along with a normalized version of the same text that represents its pronunciation. The TTS engine uses both together:
the original text preserves meaning and context, while the normalized text guides pronunciation.
If normalized text is not provided, the system relies only on the original text.
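
The sketch below shows what such a request could look like in Python. It is a minimal illustration assuming a JSON-over-HTTP interface: the endpoint URL, the authentication header, and the field names text, normalized_text, and language are placeholders rather than confirmed Supertone API parameters, so check the API reference for the exact names.

import os
import requests

# Minimal sketch of a TTS request carrying both the original and the normalized text.
# Endpoint, header, and field names below are placeholders; consult the API reference.
API_KEY = os.environ["SUPERTONE_API_KEY"]                   # assumed environment variable
TTS_ENDPOINT = "https://api.example.com/v1/text-to-speech"  # placeholder URL

payload = {
    "text": "今日はどんな一日だったの？",                      # original kanji-kana text
    "normalized_text": "きょうはどんないちにちだったの？",      # pronunciation guide (assumed field name)
    "language": "ja",                                        # assumed parameter
}

response = requests.post(
    TTS_ENDPOINT,
    headers={"x-api-key": API_KEY},  # assumed auth header
    json=payload,
)
response.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(response.content)        # assumes the API returns raw audio bytes

If the normalized_text field is omitted, the engine falls back to inferring pronunciation from the original text alone, as described above.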

When You Should Use Normalized Text

Normalized text is especially useful when your input contains numbers, measurement units, symbols, abbreviations, kanji with ambiguous readings, or mixed Japanese–English expressions. It is strongly recommended for narration, announcements, audiobooks, and character voices where pronunciation accuracy matters.
For short or casual conversational sentences, normalized text may not be necessary.

Generating Normalized Text with LLMs

In most workflows, an LLM is used to generate the normalized text before the TTS API is called. Below is a recommended prompt for converting Japanese text into a normalized pronunciation form; it produces structured JSON output that can be used directly in a TTS request. A sketch of wiring this prompt into an LLM call follows the conversion examples.
You will receive a Japanese sentence that may contain kanji, numbers, symbols, and units.
For the given input, provide:
- the original text (natural Japanese text using standard kanji–kana mixed notation, without furigana)
- the normalized text, converted according to the rules below.

Important:
- You must respond only in pure JSON format.
- Do not include any explanations or additional text.
- In original_text, do not include furigana (ruby annotations).

Response Format

{
  "original_text": "[natural Japanese Text]",
  "normalized_text": "[converted Text]"
}

Transcription Conversion Rules
1. Convert all kanji into hiragana using context-appropriate readings.
2. Keep katakana as is.
3. Preserve punctuation exactly as written.
4. Convert Arabic numerals into hiragana.
5. Expand units and English abbreviations into full katakana forms.
6. Apply natural phonological changes such as gemination and sound alternations.

Conversion Examples

{
  "original_text": "ไปŠๆ—ฅใฏใฉใ‚“ใชไธ€ๆ—ฅใ ใฃใŸใฎ๏ผŸ",
  "normalized_text": "ใใ‚‡ใ†ใฏใฉใ‚“ใชใ„ใกใซใกใ ใฃใŸใฎ๏ผŸ"
}

{
  "original_text": "ไปŠๆ—ฅใฏ10%ใ‚ชใƒ•ใ ใ‚ˆใ€‚่บซ้•ทใฏ170cmใ€ไฝ“้‡ใฏ60kgใ ใฃใฆ๏ผ",
  "normalized_text": "ใใ‚‡ใ†ใฏใ˜ใ‚…ใฃใƒ‘ใƒผใ‚ปใƒณใƒˆใ‚ชใƒ•ใ ใ‚ˆใ€‚ใ—ใ‚“ใกใ‚‡ใ†ใฏใฒใ‚ƒใใชใชใ˜ใ‚…ใฃใ‚ปใƒณใƒใƒกใƒผใƒˆใƒซใ€ใŸใ„ใ˜ใ‚…ใ†ใฏใ‚ใใ˜ใ‚…ใฃใ‚ญใƒญใ‚ฐใƒฉใƒ ใ ใฃใฆ๏ผ"
}
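
As a reference, here is one way to wire the prompt above into an LLM call. This is a minimal sketch assuming the OpenAI Python SDK and a model that supports JSON-mode output; the normalize_japanese helper is illustrative, not part of the Supertone API, and any LLM client that can return structured JSON may be substituted.

import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM client works

NORMALIZATION_PROMPT = """\
You will receive a Japanese sentence that may contain kanji, numbers, symbols, and units.
For the given input, provide:
- the original text (natural Japanese text using standard kanji-kana mixed notation, without furigana)
- the normalized text, converted according to the rules below.

Important:
- You must respond only in pure JSON format, with the keys original_text and normalized_text.
- Do not include any explanations or additional text.
- In original_text, do not include furigana (ruby annotations).

Transcription Conversion Rules
1. Convert all kanji into hiragana using context-appropriate readings.
2. Keep katakana as is.
3. Preserve punctuation exactly as written.
4. Convert Arabic numerals into hiragana.
5. Expand units and English abbreviations into full katakana forms.
6. Apply natural phonological changes such as gemination and sound alternations.
"""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def normalize_japanese(text: str) -> dict:
    """Return {"original_text": ..., "normalized_text": ...} for a Japanese sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any model with reliable JSON output works
        messages=[
            {"role": "system", "content": NORMALIZATION_PROMPT},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},  # enforce pure JSON output
    )
    return json.loads(response.choices[0].message.content)

result = normalize_japanese("今日は10%オフだよ。")
print(result["original_text"], "->", result["normalized_text"])

The returned original_text and normalized_text pair can then be passed directly to the TTS request shown earlier.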