Generate phonemes for lip sync

To lip-sync a character to generated speech, you need three things in sync:

The audio file.
The phoneme symbols actually spoken.
The start time and duration of each phoneme.

Supertone returns all three when you pass include_phonemes: true on a TTS request.

Python — request audio + phonemes

import base64
import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Welcome to the workshop.",
        language="en",
        model="sona_speech_2",
        include_phonemes=True,
    )

    audio_bytes = base64.b64decode(response.result.audio_base64)
    with open("speech.wav", "wb") as f:
        f.write(audio_bytes)

    phonemes = response.result.phonemes
    for symbol, start, duration in zip(
        phonemes.symbols,
        phonemes.start_times_seconds,
        phonemes.durations_seconds,
    ):
        print(f"{start:7.3f}s  {duration:5.3f}s  {symbol!r}")
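The three lists are parallel: index i of each one describes the same phoneme. As a minimal sketch, assuming the field names used above, you could pack them into records with an explicit end time before handing them to an animation system. `PhonemeEvent` and `to_events` are illustrative names, not part of the SDK, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeEvent:
    """One phoneme with its timing (hypothetical helper, not an SDK type)."""

    symbol: str
    start: float
    duration: float

    @property
    def end(self) -> float:
        return self.start + self.duration


def to_events(symbols, starts, durations):
    """Zip the three parallel lists into records, refusing mismatched lengths."""
    if not (len(symbols) == len(starts) == len(durations)):
        raise ValueError("phoneme lists must have the same length")
    return [PhonemeEvent(s, t, d) for s, t, d in zip(symbols, starts, durations)]


# Made-up sample values, for illustration only.
events = to_events(["h", "ə", ""], [0.0, 0.08, 0.20], [0.08, 0.12, 0.10])
```

Checking the lengths up front turns a silent truncation by `zip` into an explicit error.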

TypeScript — request audio + phonemes

import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Welcome to the workshop.",
    language: "en",
    model: "sona_speech_2",
    includePhonemes: true,
  },
});

const result = response.result as {
  audioBase64: string;
  phonemes?: { symbols?: string[]; startTimesSeconds?: number[]; durationsSeconds?: number[] };
};

fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

const { symbols = [], startTimesSeconds = [], durationsSeconds = [] } = result.phonemes ?? {};
for (let i = 0; i < symbols.length; i++) {
  console.log(
    `${startTimesSeconds[i].toFixed(3)}s  ${durationsSeconds[i].toFixed(3)}s  ${symbols[i]}`,
  );
}

Map phonemes to visemes

A common rendering pipeline maps each IPA-style symbol to a small set of mouth shapes (visemes), then drives a 3D rig or 2D sprite by interpolating between them.

// Minimal English IPA → viseme mapping (extend for your rig)
const PHONEME_TO_VISEME: Record<string, string> = {
  // Closed lip
  "p": "BMP",
  "b": "BMP",
  "m": "BMP",
  // Open vowel
  "ɑ": "Aa",
  "ʌ": "Aa",
  "ɐ": "Aa",
  // Wide smile
  "iː": "Ee",
  "i": "Ee",
  // Rounded
  "uː": "Oo",
  "u": "Oo",
  "o": "Oo",
  // Fricative
  "f": "FV",
  "v": "FV",
  "θ": "Th",
  // Silence
  "": "Rest",
};

interface VisemeKeyframe {
  time: number;
  duration: number;
  viseme: string;
}

function buildVisemeTrack(
  symbols: string[],
  starts: number[],
  durations: number[],
): VisemeKeyframe[] {
  return symbols.map((symbol, i) => ({
    time: starts[i],
    duration: durations[i],
    viseme: PHONEME_TO_VISEME[symbol] ?? "Rest",
  }));
}

In your render loop, advance the current audio time and look up the active viseme for that timestamp. Tween viseme weights so the mouth doesn’t snap between shapes.
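The timestamp lookup can be sketched with a binary search over keyframe start times. This is a minimal sketch, not SDK code: the track values are made up, the tuple layout mirrors the `VisemeKeyframe` fields above, and `active_viseme` is an illustrative name.

```python
from bisect import bisect_right

# Keyframes as (start_seconds, duration_seconds, viseme), sorted by start time.
# Made-up values mirroring the VisemeKeyframe shape.
TRACK = [
    (0.00, 0.10, "BMP"),
    (0.10, 0.15, "Aa"),
    (0.30, 0.10, "Oo"),
]
STARTS = [start for start, _, _ in TRACK]


def active_viseme(t: float) -> str:
    """Return the viseme whose keyframe covers time t, else "Rest"."""
    i = bisect_right(STARTS, t) - 1  # last keyframe starting at or before t
    if i < 0:
        return "Rest"
    start, duration, viseme = TRACK[i]
    # Gaps between keyframes and time past the final keyframe fall back to rest.
    return viseme if t < start + duration else "Rest"
```

Call `active_viseme` once per frame with the audio clock's current time; the search stays logarithmic in the number of keyframes even for long utterances.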

Stream phonemes in real time

When you call stream_speech with include_phonemes: true, the response becomes NDJSON (newline-delimited JSON, one JSON object per line). Parse each line as it arrives to drive lip-sync in real time:

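import base64
import json

# Assumes an open `client` and VOICE_ID, as in the request example above.
response = client.text_to_speech.stream_speech(
    voice_id=VOICE_ID,
    text="Streaming lip sync in real time.",
    language="en",
    model="sona_speech_1",
    include_phonemes=True,
)

# Each non-empty line is one JSON object with an audio chunk and its phonemes.
for line in response.result.iter_lines():
    if not line:
        continue
    payload = json.loads(line)
    audio_chunk = base64.b64decode(payload["audio_base64"])
    # schedule_audio and schedule_phonemes are placeholders for your own
    # playback and animation code.
    schedule_audio(audio_chunk)
    schedule_phonemes(payload["phonemes"])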
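If your HTTP layer hands you raw byte chunks rather than complete lines, a chunk can end in the middle of a record. The following is a minimal sketch of a line buffer that is independent of the SDK. `NdjsonBuffer` is an illustrative name, the two chunks are made up, and the `audio_base64` key is taken from the loop above.

```python
import json


class NdjsonBuffer:
    """Accumulates bytes and yields one parsed JSON object per complete line."""

    def __init__(self) -> None:
        self._pending = b""

    def feed(self, chunk: bytes):
        self._pending += chunk
        # Everything before the last newline is complete; keep the tail for later.
        *lines, self._pending = self._pending.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)


buffer = NdjsonBuffer()
records = []
# Two made-up network chunks; the first ends in the middle of a record.
for chunk in [b'{"audio_base64": "AAAA"}\n{"audio_', b'base64": "BBBB"}\n']:
    records.extend(buffer.feed(chunk))
```

Splitting on the newline byte is safe for UTF-8 payloads, because that byte never occurs inside a multi-byte character.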

Tips

Use a model that supports phonemes. sona_speech_2, sona_speech_2_flash, and sona_speech_1 all support phonemes. supertonic_api_3 and supertonic_api_1 do not.
Smooth transitions. Real mouths don’t snap between shapes — most engines interpolate viseme weights over 50–80 ms. The phoneme durations from the API are a good starting point for those tweens.
Stress and pauses. Empty symbol values mark silences/pauses — return the mouth to the rest pose during those.
Localize your mapping. Phoneme → viseme tables differ across languages. Tune your mapping for Korean and Japanese if you’re shipping multilingual content.
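The smoothing tip can be sketched as a linear crossfade: during a short window after each switch, the previous viseme's weight falls from 1 to 0 while the new one rises. This is a minimal sketch under assumptions: 60 ms is an arbitrary value inside the 50 to 80 ms range mentioned above, and `blend_weights` is an illustrative name rather than an engine or SDK function.

```python
BLEND_SECONDS = 0.06  # assumed value within the 50 to 80 ms range


def blend_weights(prev_viseme: str, next_viseme: str, since_switch: float) -> dict[str, float]:
    """Linear crossfade from prev_viseme to next_viseme.

    since_switch is the time in seconds since next_viseme became active.
    """
    if prev_viseme == next_viseme:
        return {next_viseme: 1.0}
    # Clamp progress to [0, 1] so weights stay valid before and after the window.
    progress = min(max(since_switch / BLEND_SECONDS, 0.0), 1.0)
    return {prev_viseme: 1.0 - progress, next_viseme: progress}
```

For phonemes shorter than the blend window, you may want to shrink the window to the phoneme's duration so the target shape is still reached.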

Related

Pronunciation and phonemes

Reference for include_phonemes and the response shape.

Stream speech

NDJSON streaming for real-time lip sync.
