- The audio file.
- The phoneme symbols actually spoken.
- The start time and duration of each phoneme.
Set include_phonemes: true on a TTS request.
Python — request audio + phonemes
TypeScript — request audio + phonemes
Map phonemes to visemes
A common rendering pipeline maps each IPA-style symbol to a small set of mouth shapes (visemes), then drives a 3D rig or 2D sprite by interpolating between them.
Stream phonemes in real time
When you call stream_speech with include_phonemes: true, the response becomes NDJSON. Parse each line as it arrives to drive lip-sync in real time:
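As a rough illustration of the parsing step, the sketch below buffers raw byte chunks and yields one JSON object per complete NDJSON line. It is not the SDK's own streaming helper: iter_ndjson is a hypothetical name, and the phonemes, symbol, start, and duration fields in the sample data are assumptions made for the example, so check the response shape in the API reference before relying on them.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per complete NDJSON line.

    Network chunks can end in the middle of a line, so bytes are
    buffered until a newline arrives.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():  # skip blank keep-alive lines
                yield json.loads(line)
    if buffer.strip():  # final line without a trailing newline
        yield json.loads(buffer)


# Hypothetical chunk layout, for illustration only. The first line is
# deliberately split across two chunks to show the buffering.
fake_stream = [
    b'{"phonemes": [{"symbol": "h", "start": 0.0, "dura',
    b'tion": 0.08}]}\n{"phonemes": [{"symbol": "", "start": 0.08, "duration": 0.1}]}\n',
]

for message in iter_ndjson(fake_stream):
    for phoneme in message.get("phonemes", []):
        print(repr(phoneme["symbol"]), phoneme["start"], phoneme["duration"])
```

In a real client, the chunks would come from whatever byte iterator your HTTP library or SDK exposes for the streaming response.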
Tips
- Use the model that supports phonemes. sona_speech_2, sona_speech_2_flash, and sona_speech_1 all support phonemes. supertonic_api_3 and supertonic_api_1 do not.
- Smooth transitions. Real mouths don’t snap between shapes; most engines interpolate viseme weights over 50–80 ms. The phoneme durations from the API are a good starting point for those tweens.
- Stress and pauses. Empty symbol values mark silences/pauses. Return the mouth to the rest pose during those.
- Localize your mapping. Phoneme → viseme tables differ across languages. Tune your mapping for Korean and Japanese if you’re shipping multilingual content.
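The mapping and smoothing ideas above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production lip-sync engine: the Phoneme fields, the viseme names, and the tiny PHONEME_TO_VISEME table are all hypothetical, and the linear cross-fade of 60 ms is just one value inside the 50–80 ms range mentioned in the tips.

```python
from dataclasses import dataclass
from typing import Dict, List

REST = "rest"

# Hypothetical, deliberately tiny phoneme -> viseme table.
# A real table is larger and should be tuned per language.
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "teeth_on_lip", "v": "teeth_on_lip",
    "a": "open", "ɑ": "open",
    "i": "wide", "e": "wide",
    "o": "round", "u": "round",
}


@dataclass
class Phoneme:
    symbol: str      # empty string marks a silence/pause
    start: float     # seconds from the start of the audio
    duration: float  # seconds


def viseme_for(symbol: str) -> str:
    """Map a symbol to a viseme; pauses and unknown symbols use the rest pose."""
    return PHONEME_TO_VISEME.get(symbol, REST)


def viseme_weights(phonemes: List[Phoneme], t: float, blend: float = 0.06) -> Dict[str, float]:
    """Return viseme weights at time t.

    Each phoneme fades in linearly from the previous phoneme's viseme
    over `blend` seconds (capped at the phoneme's own duration).
    """
    for i, p in enumerate(phonemes):
        if p.start <= t < p.start + p.duration:
            current = viseme_for(p.symbol)
            previous = viseme_for(phonemes[i - 1].symbol) if i > 0 else REST
            window = min(blend, p.duration)
            alpha = min(1.0, (t - p.start) / window) if window > 0 else 1.0
            weights: Dict[str, float] = {}
            weights[previous] = weights.get(previous, 0.0) + (1.0 - alpha)
            weights[current] = weights.get(current, 0.0) + alpha
            return {name: w for name, w in weights.items() if w > 0.0}
    return {REST: 1.0}  # before the first phoneme, in a gap, or after the last


timeline = [Phoneme("m", 0.0, 0.1), Phoneme("a", 0.1, 0.2), Phoneme("", 0.3, 0.1)]
for t in (0.05, 0.13, 0.2, 0.39):
    print(t, viseme_weights(timeline, t))
```

Call viseme_weights once per rendered frame with the current audio playback time and apply the returned weights to your rig's blend shapes or sprite layers. Engines with their own tweening can skip the cross-fade and use only viseme_for plus the phoneme start times.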
Related
Pronunciation and phonemes
Reference for include_phonemes and the response shape.
Stream speech
NDJSON streaming for real-time lip sync.