A common rendering pipeline maps each IPA-style symbol to a small set of mouth shapes (visemes), then drives a 3D rig or 2D sprite by interpolating between them.
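As a rough illustration of that mapping step, the sketch below collapses a handful of symbols into five mouth shapes. The viseme names and the table contents are hypothetical placeholders, not values defined by the API; a production table would cover the full symbol set your voices emit.

```python
# Hypothetical viseme names and a deliberately tiny phoneme-to-viseme table.
REST = "rest"

PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",   # lips pressed together
    "f": "lip_teeth", "v": "lip_teeth",            # lower lip against upper teeth
    "a": "open", "ɑ": "open", "æ": "open",         # jaw dropped
    "i": "wide", "e": "wide",                      # lips spread
    "u": "round", "o": "round", "w": "round",      # lips rounded
}


def to_viseme(symbol: str) -> str:
    """Return the viseme for a phoneme symbol, falling back to the rest pose."""
    return PHONEME_TO_VISEME.get(symbol, REST)


# Unknown or empty symbols fall back to the rest pose.
visemes = [to_viseme(s) for s in ["m", "a", "p", ""]]
```

Falling back to the rest pose for unmapped symbols keeps the rig in a neutral state instead of raising an error mid-playback.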
In your render loop, advance the current audio time and look up the active viseme for that timestamp. Tween viseme weights so the mouth doesn’t snap between shapes.
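A minimal sketch of that per-frame step, assuming you have already converted the phoneme data into a list of cues sorted by start time. The `Cue` type, the function names, and the 60 ms default blend time are illustrative choices, not part of the API.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Cue:
    start: float      # seconds from the start of the audio
    duration: float   # seconds
    viseme: str


def active_viseme(cues: list[Cue], t: float, rest: str = "rest") -> str:
    """Return the viseme whose time span contains t, else the rest pose."""
    starts = [c.start for c in cues]      # cues must be sorted by start time
    i = bisect_right(starts, t) - 1       # index of the last cue starting at or before t
    if i >= 0 and t < cues[i].start + cues[i].duration:
        return cues[i].viseme
    return rest


def step_weights(weights: dict[str, float], target: str, dt: float,
                 blend_time: float = 0.06) -> dict[str, float]:
    """Move each weight linearly toward 1.0 (target viseme) or 0.0 (all others)."""
    max_step = dt / blend_time            # largest change allowed this frame
    out = {}
    for name, w in weights.items():
        goal = 1.0 if name == target else 0.0
        delta = max(-max_step, min(max_step, goal - w))
        out[name] = w + delta
    return out


# One frame of a render loop: 30 ms after the previous frame, audio time 0.12 s.
cues = [Cue(0.00, 0.10, "closed"), Cue(0.10, 0.15, "open")]
weights = {"rest": 1.0, "closed": 0.0, "open": 0.0}
weights = step_weights(weights, active_viseme(cues, 0.12), dt=0.03)
```

Clamping the per-frame change makes the blend speed independent of frame rate: a full transition always takes about `blend_time` seconds.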
When you call stream_speech with include_phonemes: true, the response is streamed as NDJSON (newline-delimited JSON), one JSON object per line. Parse each line as it arrives to drive lip-sync in real time:
```python
import base64
import json

# Assumes client, VOICE_ID, schedule_audio, and schedule_phonemes are defined elsewhere.
response = client.text_to_speech.stream_speech(
    voice_id=VOICE_ID,
    text="Streaming lip sync in real time.",
    language="en",
    model="sona_speech_1",
    include_phonemes=True,
)

for line in response.result.iter_lines():
    if not line:
        continue  # skip blank lines
    payload = json.loads(line)
    audio_chunk = base64.b64decode(payload["audio_base64"])
    schedule_audio(audio_chunk)
    schedule_phonemes(payload["phonemes"])
```
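The streaming example leaves schedule_phonemes undefined. One way it might start is by flattening each chunk's phoneme data into absolute-time cues. This sketch assumes the phonemes object holds three parallel lists named `symbols`, `start_times_seconds`, and `durations_seconds`, and that start times are relative to the chunk; both are assumptions here, so confirm the field names and the time base against the API reference before relying on them.

```python
def phonemes_to_cues(phonemes: dict, chunk_offset: float = 0.0) -> list[tuple[float, float, str]]:
    """Flatten one chunk's phoneme data into (start, duration, symbol) tuples.

    Assumed (unverified) layout: three parallel lists under the keys below,
    with start times measured from the beginning of the chunk.
    """
    symbols = phonemes.get("symbols", [])
    starts = phonemes.get("start_times_seconds", [])
    durations = phonemes.get("durations_seconds", [])
    return [
        (chunk_offset + start, duration, symbol)
        for symbol, start, duration in zip(symbols, starts, durations)
    ]


# Hand-written sample in the assumed shape, for a chunk that begins 1.0 s into playback.
sample = {
    "symbols": ["h", "ə", ""],
    "start_times_seconds": [0.0, 0.08, 0.2],
    "durations_seconds": [0.08, 0.12, 0.1],
}
cues = phonemes_to_cues(sample, chunk_offset=1.0)
```

If the API already reports start times relative to the whole utterance, leave `chunk_offset` at 0.0.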
Use a model that supports phonemes. sona_speech_2, sona_speech_2_flash, and sona_speech_1 all return phonemes; supertonic_api_3 and supertonic_api_1 do not.
Smooth transitions. Real mouths don’t snap between shapes; most engines interpolate viseme weights over 50–80 ms. The phoneme durations from the API are a good starting point for those tweens.
Stress and pauses. Empty symbol values mark silences or pauses; return the mouth to the rest pose for their duration.
Localize your mapping. Phoneme → viseme tables differ across languages. Tune your mapping for Korean and Japanese if you’re shipping multilingual content.
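Several of these tips can be folded into one small lookup layer. The sketch below guards against models that return no phonemes, treats empty symbols as pauses, and selects a per-language table. The model names come from the list above; the table contents, viseme names, and function names are hypothetical placeholders that you would replace with mappings tuned for each language.

```python
PHONEME_MODELS = {"sona_speech_2", "sona_speech_2_flash", "sona_speech_1"}

# Hypothetical, deliberately incomplete per-language tables.
VISEME_TABLES = {
    "en": {"m": "closed", "a": "open", "u": "round"},
    "ko": {"m": "closed", "a": "open", "ɯ": "wide"},
    "ja": {"m": "closed", "a": "open", "ɯ": "narrow"},
}


def check_model(model: str) -> None:
    """Fail early if the chosen model does not return phoneme data."""
    if model not in PHONEME_MODELS:
        raise ValueError(f"{model} does not return phonemes")


def viseme_for(symbol: str, language: str, rest: str = "rest") -> str:
    """Map a symbol to a viseme using the table for the given language."""
    if not symbol:                        # empty symbol marks a silence or pause
        return rest
    table = VISEME_TABLES.get(language, VISEME_TABLES["en"])
    return table.get(symbol, rest)        # unmapped symbols also rest


check_model("sona_speech_1")              # passes silently
pause = viseme_for("", "ko")
same_symbol = (viseme_for("ɯ", "ko"), viseme_for("ɯ", "ja"))
```

Keeping one table per language lets the same symbol resolve to different mouth shapes without branching in the render loop.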