The Supertone API can return phoneme data alongside the audio — the individual sound units the model spoke, with their start times and durations. This is the data you need to drive lip-sync in games and animation, build karaoke-style word highlighting, or analyze pronunciation. To turn it on, set include_phonemes: true on a TTS request.
Supported on sona_speech_2, sona_speech_2_flash, and sona_speech_1. Not supported on supertonic_api_3 or supertonic_api_1.

Usage

import base64
import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Hello, world.",
        language="en",
        include_phonemes=True,
    )

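    result = response.result

    # The audio arrives Base64-encoded; decode it before writing to disk.
    with open("speech.wav", "wb") as f:
        f.write(base64.b64decode(result.audio_base64))

    # The three phoneme arrays are index-aligned, so zip() pairs each symbol
    # with its start time and duration.
    for symbol, start, duration in zip(
        result.phonemes.symbols,
        result.phonemes.start_times_seconds,
        result.phonemes.durations_seconds,
    ):
        print(f"{symbol!r} at {start:.3f}s for {duration:.3f}s")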

Response shape

Field                           Description
audio_base64                    Base64-encoded audio in the requested output_format (wav or mp3).
phonemes.symbols                Phoneme symbols in IPA-style notation. Empty strings represent silences/pauses.
phonemes.start_times_seconds    Start time of each symbol within the clip.
phonemes.durations_seconds      Duration of each symbol.

The three phoneme arrays are aligned: symbols[i], start_times_seconds[i], and durations_seconds[i] describe the same phoneme.
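
Because the arrays are index-aligned, it is often convenient to fold them into one record per phoneme before passing them to other code. The following is a minimal sketch that works on plain lists; PhonemeEvent and to_events are hypothetical helper names for illustration, not part of the Supertone SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhonemeEvent:
    """One phoneme with its timing (hypothetical helper, not an SDK type)."""
    symbol: str
    start: float
    duration: float

    @property
    def end(self) -> float:
        # End time is simply start plus duration, in seconds.
        return self.start + self.duration

    @property
    def is_silence(self) -> bool:
        # Empty symbols represent silences/pauses.
        return self.symbol == ""


def to_events(symbols, start_times_seconds, durations_seconds):
    """Combine the three aligned arrays into a list of PhonemeEvent records."""
    if not (len(symbols) == len(start_times_seconds) == len(durations_seconds)):
        raise ValueError("phoneme arrays must have the same length")
    return [
        PhonemeEvent(s, float(t), float(d))
        for s, t, d in zip(symbols, start_times_seconds, durations_seconds)
    ]


# Sample values copied from the first line of the streaming example on this page.
events = to_events(["", "h"], [0, 0.05], [0.05, 0.08])
spoken = [e.symbol for e in events if not e.is_silence]
```

With a real response, the same call would take result.phonemes.symbols, result.phonemes.start_times_seconds, and result.phonemes.durations_seconds as its arguments.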

Streaming with phonemes

When you call stream_speech with include_phonemes: true, the response becomes NDJSON (newline-delimited JSON). Each line is a chunk with its own audio_base64 and phonemes data:
{"audio_base64":"...","phonemes":{"symbols":["","h"],"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}
{"audio_base64":"...","phonemes":{"symbols":["ɐ","ɡ"],"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}
Parse each line as it arrives to drive your lip-sync renderer in real time.
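
As a rough sketch of that parsing step, the function below takes any iterable of text lines and yields decoded audio bytes plus phoneme tuples for each chunk. How you obtain the lines from the streaming response is not shown and depends on your client; iter_chunks and the sample data are illustrative assumptions, not SDK calls.

```python
import base64
import json
from typing import Iterable, Iterator, List, Tuple

Phoneme = Tuple[str, float, float]  # (symbol, start_seconds, duration_seconds)


def iter_chunks(lines: Iterable[str]) -> Iterator[Tuple[bytes, List[Phoneme]]]:
    """Yield (audio_bytes, phonemes) for each NDJSON line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        audio = base64.b64decode(chunk["audio_base64"])
        p = chunk.get("phonemes") or {}
        phonemes = list(zip(
            p.get("symbols", []),
            p.get("start_times_seconds", []),
            p.get("durations_seconds", []),
        ))
        yield audio, phonemes


def make_line(audio: bytes, symbols, starts, durations) -> str:
    """Build one NDJSON line shaped like the example above (test data only)."""
    return json.dumps({
        "audio_base64": base64.b64encode(audio).decode("ascii"),
        "phonemes": {
            "symbols": symbols,
            "start_times_seconds": starts,
            "durations_seconds": durations,
        },
    })


# Stand-in for a streamed body: placeholder bytes instead of real audio.
sample_lines = [
    make_line(b"\x00\x01", ["", "h"], [0, 0.05], [0.05, 0.08]),
    "",
    make_line(b"\x02\x03", ["ɐ", "ɡ"], [0.13, 0.19], [0.06, 0.04]),
]

audio_chunks = []   # decoded audio, one entry per NDJSON line
timeline = []       # phoneme tuples in arrival order
for audio, phonemes in iter_chunks(sample_lines):
    audio_chunks.append(audio)
    timeline.extend(phonemes)
```

Because iter_chunks is a generator, each chunk can be handed to the audio player and the lip-sync renderer as soon as its line has been read, without waiting for the rest of the stream.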

Use cases

  • Lip-sync in games and animation. Map each phoneme to a viseme (mouth shape) and play visemes in sync with the audio. Most engines come with a default phoneme-to-viseme table — Supertone’s symbols are standard IPA-style and compatible with most rigs.
  • Karaoke / word highlighting. Use phoneme start times to highlight words as they’re spoken.
  • Pronunciation analysis. Compare actual phonemes against an expected sequence to check pronunciation in language-learning apps.
For an end-to-end example, see Generate phonemes for lip sync.
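
To make the lip-sync mapping concrete, here is a small sketch that converts phoneme symbols to visemes and looks up the viseme active at a given playback time. The viseme names and the phoneme groupings in the table are assumptions made for illustration; substitute the table your engine or rig provides. The sample data joins the two lines of the streaming example and assumes start times are sorted and measured from the start of the clip.

```python
import bisect

# Illustrative phoneme-to-viseme table. Names and groupings are assumptions
# for this sketch, not a table published by Supertone or any engine.
PHONEME_TO_VISEME = {
    "": "rest",  # empty symbol = silence/pause
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "ɐ": "open", "a": "open", "ɑ": "open",
    "o": "round", "u": "round", "w": "round",
    "i": "wide", "e": "wide",
}
DEFAULT_VISEME = "neutral"  # fallback for symbols missing from the table


def build_track(symbols, start_times_seconds):
    """Return parallel lists (starts, visemes) for time-based lookup."""
    starts = [float(t) for t in start_times_seconds]
    visemes = [PHONEME_TO_VISEME.get(s, DEFAULT_VISEME) for s in symbols]
    return starts, visemes


def viseme_at(starts, visemes, t):
    """Return the viseme active at playback time t (seconds).

    Assumes starts is sorted ascending. Times before the first phoneme
    fall back to the rest pose.
    """
    i = bisect.bisect_right(starts, t) - 1
    return visemes[i] if i >= 0 else "rest"


starts, visemes = build_track(["", "h", "ɐ", "ɡ"], [0, 0.05, 0.13, 0.19])
current = viseme_at(starts, visemes, 0.15)  # falls inside the "ɐ" phoneme
```

A render loop would call viseme_at once per frame with the audio clock's current position. Binary search keeps each lookup cheap even for long clips.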

Lip sync example

Build a phoneme → viseme pipeline.

Normalized text

Improve pronunciation for ambiguous inputs.