Pronunciation and phonemes

The Supertone API can return phoneme data alongside the audio — the individual sound units the model spoke, with their start times and durations. This is the data you need to drive lip-sync in games and animation, build karaoke-style word highlighting, or analyze pronunciation. To turn it on, set include_phonemes: true on a TTS request.

Supported on sona_speech_2, sona_speech_2_flash, and sona_speech_1. Not supported on supertonic_api_3 or supertonic_api_1.

Usage

Python

import base64
import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Hello, world.",
        language="en",
        include_phonemes=True,
    )

    result = response.result
    with open("speech.wav", "wb") as f:
        f.write(base64.b64decode(result.audio_base64))

    for symbol, start, duration in zip(
        result.phonemes.symbols,
        result.phonemes.start_times_seconds,
        result.phonemes.durations_seconds,
    ):
        print(f"{symbol!r} at {start:.3f}s for {duration:.3f}s")

TypeScript

import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Hello, world.",
    language: "en",
    includePhonemes: true,
  },
});

const result = response.result as {
  audioBase64: string;
  phonemes?: {
    symbols?: string[];
    startTimesSeconds?: number[];
    durationsSeconds?: number[];
  };
};

fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

const symbols = result.phonemes?.symbols ?? [];
const starts = result.phonemes?.startTimesSeconds ?? [];
const durations = result.phonemes?.durationsSeconds ?? [];

for (let i = 0; i < symbols.length; i++) {
  console.log(`${symbols[i]} at ${starts[i].toFixed(3)}s for ${durations[i].toFixed(3)}s`);
}

cURL

VOICE_ID="20160a4c5ba38967330c84"

curl -X POST "https://supertoneapi.com/v1/text-to-speech/$VOICE_ID" \
  -H "x-sup-api-key: $SUPERTONE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world.",
    "language": "en",
    "include_phonemes": true
  }'

Returns JSON (not binary audio):

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}

Response shape

| Field | Description |
| --- | --- |
| `audio_base64` | Base64-encoded audio in the requested `output_format` (`wav` or `mp3`). |
| `phonemes.symbols` | Phoneme symbols in IPA-style notation. Empty strings represent silences/pauses. |
| `phonemes.start_times_seconds` | Start time of each symbol within the clip, in seconds. |
| `phonemes.durations_seconds` | Duration of each symbol, in seconds. |

The three phoneme arrays are aligned — symbols[i], start_times_seconds[i], and durations_seconds[i] describe the same phoneme.
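
Because the arrays are index-aligned, it is often convenient to zip them into one record per phoneme before doing anything else. The following is a minimal sketch, not part of the SDK: `PhonemeEvent` and `to_events` are hypothetical helper names, and the input is assumed to be the `phonemes` object from the JSON response, already parsed into a Python dict.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PhonemeEvent:
    """Hypothetical helper record; not an SDK type."""
    symbol: str      # IPA-style symbol; "" marks a silence or pause
    start: float     # seconds from the start of the clip
    duration: float  # seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def to_events(phonemes: dict) -> list[PhonemeEvent]:
    """Zip the three aligned arrays into one record per phoneme."""
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    # The arrays are documented as aligned; fail loudly if they are not.
    if not (len(symbols) == len(starts) == len(durations)):
        raise ValueError("phoneme arrays have different lengths")
    return [PhonemeEvent(s, t, d) for s, t, d in zip(symbols, starts, durations)]


# Sample payload copied from the JSON response shown earlier on this page.
sample = {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162],
}
events = to_events(sample)
spoken = [e for e in events if e.symbol]  # drop silences
```

Keeping the silences in `events` and filtering them only where needed preserves the full timeline, which matters if you later want to close the mouth during pauses.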

Streaming with phonemes

When you call stream_speech with include_phonemes: true, the response becomes NDJSON (newline-delimited JSON). Each line is a chunk with its own audio_base64 and phonemes data:

{"audio_base64":"...","phonemes":{"symbols":["","h"],"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}
{"audio_base64":"...","phonemes":{"symbols":["ɐ","ɡ"],"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}

Parse each line as it arrives to drive your lip-sync renderer in real time.
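
A rough sketch of that per-line parsing is shown here. This page does not show how the SDK exposes the streamed body, so the hypothetical `iter_chunks` helper simply accepts any iterable of text lines; adapt the input to whatever your HTTP client or SDK returns. In the sample above, start times appear to continue across chunks (0.05 + 0.08 = 0.13), which suggests they are relative to the start of the stream, but treat that as an inference to verify rather than a guarantee.

```python
from __future__ import annotations

import base64
import json
from typing import Iterable, Iterator


def iter_chunks(
    lines: Iterable[str],
) -> Iterator[tuple[bytes, list[tuple[str, float, float]]]]:
    """Yield (audio_bytes, [(symbol, start, duration), ...]) per NDJSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        chunk = json.loads(line)
        audio = base64.b64decode(chunk["audio_base64"])
        ph = chunk.get("phonemes") or {}
        timings = list(zip(
            ph.get("symbols", []),
            ph.get("start_times_seconds", []),
            ph.get("durations_seconds", []),
        ))
        yield audio, timings


# Stand-in for the streamed body; "AAAA" is placeholder base64, not real audio.
ndjson = [
    '{"audio_base64":"AAAA","phonemes":{"symbols":["","h"],'
    '"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}',
    '{"audio_base64":"AAAA","phonemes":{"symbols":["ɐ","ɡ"],'
    '"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}',
]

timeline = []
for audio, timings in iter_chunks(ndjson):
    # Hand `audio` to your player here; queue `timings` for the renderer.
    timeline.extend(timings)
```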

Use cases

Lip-sync in games and animation. Map each phoneme to a viseme (mouth shape) and play visemes in sync with the audio. Most engines come with a default phoneme-to-viseme table — Supertone’s symbols are standard IPA-style and compatible with most rigs.
Karaoke / word highlighting. Use phoneme start times to highlight words as they’re spoken.
Pronunciation analysis. Compare actual phonemes against an expected sequence to check pronunciation in language-learning apps.
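
For the pronunciation-analysis case, one simple way to compare two phoneme sequences is a sequence diff. The sketch below uses Python's standard `difflib.SequenceMatcher`; this is an illustrative technique, not something the API performs for you. The helper name `compare_phonemes` and both symbol sequences are made up for the example, and where the "actual" sequence comes from depends on your application.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def compare_phonemes(expected: list[str], actual: list[str]):
    """Return a similarity ratio in [0, 1] and the non-matching spans."""
    # Ignore silences ("") so pauses do not count as pronunciation errors.
    expected = [s for s in expected if s]
    actual = [s for s in actual if s]
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    diffs = [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs


# Hypothetical sequences, for illustration only.
score, diffs = compare_phonemes(
    expected=["h", "ə", "l", "oʊ"],
    actual=["", "h", "ɛ", "l", "oʊ", ""],
)
for tag, want, got in diffs:
    print(f"{tag}: expected {want}, got {got}")
```

A diff treats every substitution as equally wrong. A production scorer would probably weight near-misses (for example, two similar vowels) less heavily than unrelated sounds.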

For an end-to-end example, see Generate phonemes for lip sync.
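
As a quick orientation, the core of a phoneme-to-viseme step can be sketched in a few lines. Everything named here is an assumption for illustration: the viseme labels (`rest`, `open`, `closed_back`) and the tiny `VISEME_TABLE` are invented, and a real project would use the table that matches its engine or rig.

```python
from __future__ import annotations

# Hypothetical, deliberately tiny phoneme -> viseme table.
VISEME_TABLE = {
    "": "rest",  # empty symbol = silence or pause
    "h": "open",
    "ɐ": "open",
    "ʌ": "open",
    "ɡ": "closed_back",
}


def to_viseme_track(symbols, starts, durations, default="rest"):
    """Build viseme keyframes, merging consecutive identical visemes."""
    track = []
    for sym, start, dur in zip(symbols, starts, durations):
        viseme = VISEME_TABLE.get(sym, default)  # unknown symbols fall back
        if track and track[-1]["viseme"] == viseme:
            track[-1]["end"] = start + dur  # extend the previous keyframe
        else:
            track.append({"viseme": viseme, "start": start, "end": start + dur})
    return track


# Sample data copied from the JSON response shown earlier on this page.
track = to_viseme_track(
    ["", "h", "ɐ", "ɡ", "ʌ", ""],
    [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    [0.092, 0.104, 0.058, 0.034, 0.29, 0.162],
)
for key in track:
    print(f'{key["viseme"]:<12} {key["start"]:.3f}s to {key["end"]:.3f}s')
```

Merging adjacent identical visemes keeps the mouth from re-triggering the same shape on every phoneme, which tends to look jittery.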

Related

Lip sync example. Build a phoneme → viseme pipeline.
Normalized text. Improve pronunciation for ambiguous inputs.
