When generating speech from such text, relying on the original written form alone can result in incorrect or unnatural pronunciation. To reduce these issues, Supertone Text-to-Speech supports a normalized text feature for Japanese.
By providing a pronunciation-oriented version of the input text alongside the original text, the TTS engine can produce clearer and more accurate speech. This guide explains why normalized text is useful, when to use it, and how to generate it effectively.
Why Normalized Text Is Needed
Japanese writing and spoken Japanese are not always aligned one-to-one. Kanji may have multiple readings, numbers are spoken differently depending on context, and symbols or units are often expanded or transformed when read aloud. These characteristics make it difficult for TTS systems to consistently infer correct pronunciation from raw text alone. Normalized text fills this gap by explicitly describing how the sentence should be spoken, while keeping the original text intact for semantic understanding.
How It Works
When generating speech, you provide the original Japanese text, written naturally using kanji and kana, along with a normalized version of the same text that represents its pronunciation. The TTS engine uses both together: the original text preserves meaning and context, while the normalized text guides pronunciation.
If normalized text is not provided, the system relies only on the original text.
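The pairing described above can be sketched as a small helper that builds a request body. This is a minimal illustration, not the actual API contract: the function name `build_tts_payload` and the field names `text`, `language`, and `normalized_text` are assumptions made for this example, and the kana reading is only one possible pronunciation-oriented form. Check the API reference for the real parameter names and the expected normalized format.

```python
from typing import Optional


def build_tts_payload(text: str, normalized_text: Optional[str] = None) -> dict:
    """Build a request body for a Japanese TTS call.

    All field names here are hypothetical placeholders, not confirmed API fields.
    """
    payload = {"text": text, "language": "ja"}
    # Attach the pronunciation guide only when one was supplied.
    # Without it, the engine relies on the original text alone.
    if normalized_text and normalized_text.strip():
        payload["normalized_text"] = normalized_text.strip()
    return payload


# Original text keeps kanji and digits; the normalized text spells out the reading.
original = "明日は7時に起きます。"
normalized = "あしたはしちじにおきます。"

with_guide = build_tts_payload(original, normalized)
without_guide = build_tts_payload(original)
```

Leaving the normalized field out entirely, rather than sending an empty string, keeps the fallback case explicit: the request then carries only the original text.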
When You Should Use Normalized Text
Normalized text is especially useful when your input contains numbers, measurement units, symbols, abbreviations, kanji with ambiguous readings, or mixed Japanese-English expressions. It is strongly recommended for narration, announcements, audiobooks, and character voices where pronunciation accuracy matters. For short or casual conversational sentences, normalized text may not be necessary.
Generating Normalized Text with LLMs
In most workflows, normalized text is generated with an LLM before calling the TTS API. Below is a recommended prompt for converting Japanese text into a normalized pronunciation form. The prompt produces structured JSON output that can be used directly in a TTS request.