# Generate phonemes for lip sync

> Request phoneme timestamps alongside the audio and use them to drive viseme animation.

To lip-sync a character to generated speech, you need three things in sync:

1. The audio file.
2. The phoneme symbols actually spoken.
3. The start time and duration of each phoneme.

Supertone returns all three when you pass `include_phonemes: true` on a TTS request.

## Python — request audio + phonemes

```python theme={"dark"}
import base64
import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Welcome to the workshop.",
        language="en",
        model="sona_speech_2",
        include_phonemes=True,
    )

    audio_bytes = base64.b64decode(response.result.audio_base64)
    with open("speech.wav", "wb") as f:
        f.write(audio_bytes)

    phonemes = response.result.phonemes
    for symbol, start, duration in zip(
        phonemes.symbols,
        phonemes.start_times_seconds,
        phonemes.durations_seconds,
    ):
        print(f"{start:7.3f}s  {duration:5.3f}s  {symbol!r}")
```

## TypeScript — request audio + phonemes

```typescript theme={"dark"}
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Welcome to the workshop.",
    language: "en",
    model: "sona_speech_2",
    includePhonemes: true,
  },
});

const result = response.result as {
  audioBase64: string;
  phonemes?: { symbols?: string[]; startTimesSeconds?: number[]; durationsSeconds?: number[] };
};

fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

const { symbols = [], startTimesSeconds = [], durationsSeconds = [] } = result.phonemes ?? {};
for (let i = 0; i < symbols.length; i++) {
  console.log(
    `${startTimesSeconds[i].toFixed(3)}s  ${durationsSeconds[i].toFixed(3)}s  ${symbols[i]}`,
  );
}
```

## Map phonemes to visemes

A common rendering pipeline maps each IPA-style symbol to one of a small set of mouth shapes (visemes), then drives a 3D rig or 2D sprite by interpolating between them.

```typescript theme={"dark"}
// Minimal English IPA → viseme mapping (extend for your rig)
const PHONEME_TO_VISEME: Record<string, string> = {
  // Closed lip
  "p": "BMP",
  "b": "BMP",
  "m": "BMP",
  // Open vowel
  "ɑ": "Aa",
  "ʌ": "Aa",
  "ɐ": "Aa",
  // Wide smile
  "iː": "Ee",
  "i": "Ee",
  // Rounded
  "uː": "Oo",
  "u": "Oo",
  "o": "Oo",
  // Fricative
  "f": "FV",
  "v": "FV",
  "θ": "Th",
  // Silence
  "": "Rest",
};

interface VisemeKeyframe {
  time: number;
  duration: number;
  viseme: string;
}

function buildVisemeTrack(
  symbols: string[],
  starts: number[],
  durations: number[],
): VisemeKeyframe[] {
  return symbols.map((symbol, i) => ({
    time: starts[i],
    duration: durations[i],
    viseme: PHONEME_TO_VISEME[symbol] ?? "Rest",
  }));
}
```

In your render loop, read the current audio playback time and look up the viseme that is active at that timestamp. Tween viseme weights so the mouth doesn't snap between shapes.
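The lookup and tween can be sketched as follows, assuming the keyframes produced by `buildVisemeTrack` are sorted by `time`. The helper names `activeIndex` and `visemeWeights` and the 60 ms blend window are illustrative assumptions, not part of the Supertone SDK.

```typescript
// Hypothetical render-loop helpers; not part of the Supertone SDK.
interface VisemeKeyframe {
  time: number; // start time in seconds
  duration: number; // length in seconds
  viseme: string;
}

// Binary search for the last keyframe that starts at or before `t`.
// Assumes `track` is sorted by `time`. Returns -1 before the first keyframe.
function activeIndex(track: VisemeKeyframe[], t: number): number {
  let lo = 0;
  let hi = track.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (track[mid].time <= t) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}

// Blend weights at time `t`: crossfade from the previous viseme into the
// active one over the first `blendSeconds` of the active keyframe.
function visemeWeights(
  track: VisemeKeyframe[],
  t: number,
  blendSeconds = 0.06, // assumed value, chosen from a typical 50–80 ms range
): Record<string, number> {
  const i = activeIndex(track, t);
  const last = track[track.length - 1];
  // Before the first keyframe or after the last one ends: rest pose.
  if (i < 0 || t >= last.time + last.duration) return { Rest: 1 };

  const current = track[i];
  const previous = i > 0 ? track[i - 1].viseme : "Rest";
  // Never blend for longer than the keyframe itself lasts.
  const window = Math.min(blendSeconds, current.duration);
  const alpha = window > 0 ? Math.min((t - current.time) / window, 1) : 1;

  const weights: Record<string, number> = {};
  weights[previous] = 1 - alpha;
  weights[current.viseme] = (weights[current.viseme] ?? 0) + alpha;
  return weights;
}
```

In a browser, `t` would typically come from the `currentTime` of the audio element that is playing the speech, and the returned weights would be applied to the rig's blend shapes once per frame.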

## Stream phonemes in real time

When you call `stream_speech` with `include_phonemes: true`, the response is returned as NDJSON (newline-delimited JSON), with one JSON object per line. Parse each line as it arrives to drive lip sync in real time:

```python theme={"dark"}
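import base64
import json

# Reuses `client` and `VOICE_ID` from the Python example above; create the
# client the same way and keep it open while you consume the stream.
response = client.text_to_speech.stream_speech(
    voice_id=VOICE_ID,
    text="Streaming lip sync in real time.",
    language="en",
    model="sona_speech_1",
    include_phonemes=True,
)

# Each non-empty line is one JSON object with an audio chunk and its phonemes.
for line in response.result.iter_lines():
    if not line:
        continue
    payload = json.loads(line)
    audio_chunk = base64.b64decode(payload["audio_base64"])
    # schedule_audio and schedule_phonemes are placeholders for your own
    # playback and animation hooks; they are not part of the SDK.
    schedule_audio(audio_chunk)
    schedule_phonemes(payload["phonemes"])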
```
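In Python, `iter_lines()` takes care of splitting the stream into lines. If you consume the same NDJSON stream from a client that hands you raw text chunks, a line can be split across two chunks, so you need to buffer the incomplete tail. The class below is a generic sketch of that buffering; `NdjsonBuffer` is a hypothetical name and nothing in it is specific to the Supertone SDK.

```typescript
// Generic NDJSON line splitter; a sketch, not tied to any SDK call.
class NdjsonBuffer {
  private pending = "";

  // Feed one decoded text chunk; returns every JSON object completed by it.
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an incomplete line, or "" if the chunk ended on "\n".
    this.pending = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0) // skip blank keep-alive lines
      .map((line) => JSON.parse(line) as unknown);
  }
}

// Usage: objects come out only once their closing newline has arrived.
const buffer = new NdjsonBuffer();
const first = buffer.push('{"audio_base64":"AAAA","phonemes":{}}\n{"audio_');
const second = buffer.push('base64":"BBBB","phonemes":{}}\n');
console.log(first.length, second.length);
```

If the chunks arrive as bytes, decode them with a `TextDecoder` in streaming mode (`decoder.decode(bytes, { stream: true })`) before pushing them, so that multi-byte characters split across chunks are not corrupted.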

## Tips

* **Use a model that supports phonemes.** `sona_speech_2`, `sona_speech_2_flash`, and `sona_speech_1` all support phonemes. `supertonic_api_3` and `supertonic_api_1` do not.
* **Smooth transitions.** Real mouths don't snap between shapes, so most engines interpolate viseme weights over 50–80 ms. The phoneme durations from the API are a good starting point for those tweens.
* **Handle pauses.** Empty strings in `symbols` mark silences and pauses. Return the mouth to the rest pose during those.
* **Localize your mapping.** Phoneme → viseme tables differ across languages. Tune your mapping for Korean and Japanese if you're shipping multilingual content.
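
The pause and localization tips can be folded into one preprocessing pass. The sketch below takes the mapping table as a parameter so you can supply a different table per language, sends empty and unmapped symbols to the rest pose, and merges neighboring keyframes that resolve to the same viseme. `buildCleanTrack` and the sample table are hypothetical, and the table is a placeholder rather than a vetted phoneme inventory.

```typescript
// Hypothetical preprocessing helper; names and table are illustrative only.
interface VisemeKeyframe {
  time: number;
  duration: number;
  viseme: string;
}

type VisemeMap = Record<string, string>;

function buildCleanTrack(
  symbols: string[],
  starts: number[],
  durations: number[],
  map: VisemeMap, // pass the table that matches the request language
): VisemeKeyframe[] {
  const track: VisemeKeyframe[] = [];
  for (let i = 0; i < symbols.length; i++) {
    // Empty symbols are pauses; unmapped symbols also fall back to rest.
    const viseme = symbols[i] === "" ? "Rest" : (map[symbols[i]] ?? "Rest");
    const prev = track[track.length - 1];
    if (prev && prev.viseme === viseme) {
      // Same shape as the previous keyframe: extend it instead of re-triggering.
      prev.duration = starts[i] + durations[i] - prev.time;
    } else {
      track.push({ time: starts[i], duration: durations[i], viseme });
    }
  }
  return track;
}

// Usage with a tiny placeholder table.
const sampleMap: VisemeMap = { m: "BMP", b: "BMP", "ɑ": "Aa" };
const cleaned = buildCleanTrack(
  ["m", "b", "", "ɑ", "x"],
  [0, 0.1, 0.2, 0.3, 0.4],
  [0.1, 0.1, 0.1, 0.1, 0.1],
  sampleMap,
);
console.log(cleaned.map((k) => k.viseme).join(" "));
```

Merging keeps the track short, which makes the per-frame lookup cheaper and avoids restarting a tween into a shape the mouth is already holding.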

## Related

<CardGroup cols={2}>
  <Card title="Pronunciation and phonemes" icon="face-smile" href="/en/docs/text-to-speech/pronunciation-and-phonemes">
    Reference for `include_phonemes` and the response shape.
  </Card>

  <Card title="Stream speech" icon="bolt" href="/en/docs/text-to-speech/stream-speech">
    NDJSON streaming for real-time lip sync.
  </Card>
</CardGroup>
