# Pronunciation and phonemes

> Get phoneme symbols and timestamps for lip-sync, animation, and pronunciation control.

The Supertone API can return **phoneme data** alongside the audio — the individual sound units the model spoke, with their start times and durations. This is the data you need to drive lip-sync in games and animation, build karaoke-style word highlighting, or analyze pronunciation.

To turn it on, set `include_phonemes: true` on a TTS request.

<Note>
  Supported on `sona_speech_2`, `sona_speech_2_flash`, and `sona_speech_1`. Not supported on `supertonic_api_3` or `supertonic_api_1`.
</Note>

## Usage

<Tabs>
  <Tab title="Python">
    ```python theme={"dark"}
    import base64
    import os
    from supertone import Supertone

    VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
        response = client.text_to_speech.create_speech(
            voice_id=VOICE_ID,
            text="Hello, world.",
            language="en",
            include_phonemes=True,
        )

        result = response.result
        with open("speech.wav", "wb") as f:
            f.write(base64.b64decode(result.audio_base64))

        for symbol, start, duration in zip(
            result.phonemes.symbols,
            result.phonemes.start_times_seconds,
            result.phonemes.durations_seconds,
        ):
            print(f"{symbol!r} at {start:.3f}s for {duration:.3f}s")
    ```
  </Tab>

  <Tab title="TypeScript">
    ```typescript theme={"dark"}
    import { Supertone } from "@supertone/supertone";
    import * as fs from "node:fs";

    const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

    const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

    const response = await client.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: "Hello, world.",
        language: "en",
        includePhonemes: true,
      },
    });

    const result = response.result as {
      audioBase64: string;
      phonemes?: {
        symbols?: string[];
        startTimesSeconds?: number[];
        durationsSeconds?: number[];
      };
    };

    fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

    const symbols = result.phonemes?.symbols ?? [];
    const starts = result.phonemes?.startTimesSeconds ?? [];
    const durations = result.phonemes?.durationsSeconds ?? [];

    for (let i = 0; i < symbols.length; i++) {
      console.log(`${symbols[i]} at ${starts[i].toFixed(3)}s for ${durations[i].toFixed(3)}s`);
    }
    ```
  </Tab>

  <Tab title="cURL">
    ```bash theme={"dark"}
    VOICE_ID="20160a4c5ba38967330c84"

    curl -X POST "https://supertoneapi.com/v1/text-to-speech/$VOICE_ID" \
      -H "x-sup-api-key: $SUPERTONE_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "text": "Hello, world.",
        "language": "en",
        "include_phonemes": true
      }'
    ```

    With `include_phonemes` enabled, the endpoint returns JSON rather than binary audio:

    ```json theme={"dark"}
    {
      "audio_base64": "UklGRnoGAABXQVZF...",
      "phonemes": {
        "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
        "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
        "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
      }
    }
    ```
  </Tab>
</Tabs>

## Response shape

| Field                          | Description                                                                     |
| ------------------------------ | ------------------------------------------------------------------------------- |
| `audio_base64`                 | Base64-encoded audio in the requested `output_format` (`wav` or `mp3`).         |
| `phonemes.symbols`             | Phoneme symbols in IPA-style notation. Empty strings represent silences/pauses. |
| `phonemes.start_times_seconds` | Start time of each symbol, in seconds from the start of the clip.               |
| `phonemes.durations_seconds`   | Duration of each symbol, in seconds.                                            |

The three phoneme arrays are index-aligned: `symbols[i]`, `start_times_seconds[i]`, and `durations_seconds[i]` all describe the same phoneme.
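
Because the arrays share an index, it is often convenient to zip them into one record per phoneme before further processing. The following is a minimal sketch that uses the sample values from the cURL response above; the `Phoneme` class and `to_phonemes` helper are illustrative names and are not part of the Supertone SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phoneme:
    symbol: str      # "" marks a silence/pause
    start: float     # seconds from the start of the clip
    duration: float  # seconds

    @property
    def end(self) -> float:
        return self.start + self.duration

    @property
    def is_silence(self) -> bool:
        return self.symbol == ""


def to_phonemes(phonemes: dict) -> list:
    """Zip the three aligned arrays into one record per phoneme."""
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    # Defensive check: the arrays are documented as aligned.
    if not (len(symbols) == len(starts) == len(durations)):
        raise ValueError("phoneme arrays have different lengths")
    return [Phoneme(s, t, d) for s, t, d in zip(symbols, starts, durations)]


# Values copied from the sample JSON response.
sample = {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162],
}

for p in to_phonemes(sample):
    label = "(silence)" if p.is_silence else p.symbol
    print(f"{label:>9}  {p.start:.3f}s -> {p.end:.3f}s")
```

Working with records rather than three parallel lists makes it harder to get the indexes out of step when you filter out silences or sort by time.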

## Streaming with phonemes

When you call `stream_speech` with `include_phonemes: true`, the response becomes **NDJSON** (newline-delimited JSON). Each line is a chunk with its own `audio_base64` and `phonemes` data:

```jsonl theme={"dark"}
{"audio_base64":"...","phonemes":{"symbols":["","h"],"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}
{"audio_base64":"...","phonemes":{"symbols":["ɐ","ɡ"],"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}
```

Parse each line as it arrives to drive your lip-sync renderer in real time.
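
The per-line parsing can be sketched as below. This is only an illustration of handling NDJSON: the `iter_chunks` helper is a hypothetical name, the demo lines stand in for a real streamed body, and `"AAAA"` is placeholder audio. How you obtain an iterator of lines depends on your SDK or HTTP client and is not shown here.

```python
import base64
import json
from typing import Iterable, Iterator, Tuple, Union


def iter_chunks(lines: Iterable[Union[bytes, str]]) -> Iterator[Tuple[bytes, dict]]:
    """Yield (audio_bytes, phonemes) for each non-empty NDJSON line."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        chunk = json.loads(line)
        audio = base64.b64decode(chunk["audio_base64"])
        # Fall back to an empty dict if a chunk carries no phoneme data.
        yield audio, chunk.get("phonemes") or {}


# Stand-in for the streamed response body (placeholder audio payloads).
demo_stream = [
    '{"audio_base64":"AAAA","phonemes":{"symbols":["","h"],'
    '"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}\n',
    "\n",
    '{"audio_base64":"AAAA","phonemes":{"symbols":["ɐ","ɡ"],'
    '"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}\n',
]

for audio, phonemes in iter_chunks(demo_stream):
    symbols = phonemes.get("symbols", [])
    starts = phonemes.get("start_times_seconds", [])
    print(f"{len(audio)} audio bytes, phonemes: {list(zip(symbols, starts))}")
```

In the example chunks above, the second chunk's start times continue from where the first chunk ends (0.05 + 0.08 = 0.13), which suggests timestamps are relative to the whole clip rather than to each chunk. Confirm this against your own output before relying on it.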

## Use cases

* **Lip-sync in games and animation.** Map each phoneme to a viseme (mouth shape) and play visemes in sync with the audio. Most engines come with a default phoneme-to-viseme table — Supertone's symbols are standard IPA-style and compatible with most rigs.
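* **Karaoke / word highlighting.** Use phoneme start times to highlight words as they're spoken. The response shape above has no word-level field, so you need to map phonemes back to words yourself.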
* **Pronunciation analysis.** Compare actual phonemes against an expected sequence to check pronunciation in language-learning apps.
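
As a rough illustration of the lip-sync case, the sketch below looks up which viseme should be showing at a given playback time. The `VISEMES` table and the `viseme_at` helper are hypothetical: the viseme names are invented for the example, so substitute the mapping your engine or rig expects. It also assumes the start times are sorted in ascending order, as they are in the sample response.

```python
from bisect import bisect_right

# Hypothetical phoneme-to-viseme table; replace with your rig's own mapping.
VISEMES = {"": "rest", "h": "open", "ɐ": "open", "ɡ": "closed_back", "ʌ": "open"}


def viseme_at(t, symbols, starts, durations, default="rest"):
    """Return the viseme for playback time t (seconds)."""
    i = bisect_right(starts, t) - 1  # last phoneme starting at or before t
    if i < 0 or t >= starts[i] + durations[i]:
        return default  # before the first phoneme, in a gap, or past the end
    return VISEMES.get(symbols[i], default)


# Values copied from the sample JSON response.
symbols = ["", "h", "ɐ", "ɡ", "ʌ", ""]
starts = [0, 0.092, 0.197, 0.255, 0.29, 0.58]
durations = [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]

for t in (0.05, 0.10, 0.26, 0.40, 1.00):
    print(f"{t:.2f}s -> {viseme_at(t, symbols, starts, durations)}")
```

A binary search keeps each lookup cheap enough to call once per rendered frame, and unknown symbols fall back to the default shape instead of raising.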

For an end-to-end example, see [Generate phonemes for lip sync](/en/docs/examples/lip-sync-phonemes).

## Related

<CardGroup cols={2}>
  <Card title="Lip sync example" icon="face-smile" href="/en/docs/examples/lip-sync-phonemes">
    Build a phoneme → viseme pipeline.
  </Card>

  <Card title="Normalized text" icon="language" href="/en/docs/text-to-speech/normalized-text">
    Improve pronunciation for ambiguous inputs.
  </Card>
</CardGroup>