The Supertone API can return phoneme data alongside the audio: the individual sound units the model spoke, with their start times and durations. This is the data you need to drive lip-sync in games and animation, build karaoke-style word highlighting, or analyze pronunciation.

To turn it on, set include_phonemes: true on a TTS request.
Supported on sona_speech_2, sona_speech_2_flash, and sona_speech_1. Not supported on supertonic_api_3 or supertonic_api_1.
```python
import base64
import os

from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Hello, world.",
        language="en",
        include_phonemes=True,
    )
    result = response.result

    # Save the decoded audio.
    with open("speech.wav", "wb") as f:
        f.write(base64.b64decode(result.audio_base64))

    # Phoneme data arrives as three parallel lists.
    for symbol, start, duration in zip(
        result.phonemes.symbols,
        result.phonemes.start_times_seconds,
        result.phonemes.durations_seconds,
    ):
        print(f"{symbol!r} at {start:.3f}s for {duration:.3f}s")
```
```typescript
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Hello, world.",
    language: "en",
    includePhonemes: true,
  },
});

const result = response.result as {
  audioBase64: string;
  phonemes?: {
    symbols?: string[];
    startTimesSeconds?: number[];
    durationsSeconds?: number[];
  };
};

// Save the decoded audio.
fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

// Phoneme data arrives as three parallel arrays.
const symbols = result.phonemes?.symbols ?? [];
const starts = result.phonemes?.startTimesSeconds ?? [];
const durations = result.phonemes?.durationsSeconds ?? [];

for (let i = 0; i < symbols.length; i++) {
  console.log(`${symbols[i]} at ${starts[i].toFixed(3)}s for ${durations[i].toFixed(3)}s`);
}
```
When you call stream_speech with include_phonemes: true, the response becomes NDJSON (newline-delimited JSON). Each line is a chunk with its own audio_base64 and phonemes data.
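A minimal sketch of consuming such a stream in Python, assuming each line is a JSON object whose audio_base64 and phonemes fields mirror the non-streaming result (with phonemes holding symbols, start_times_seconds, and durations_seconds). The inner key names and the helper parse_phoneme_stream are assumptions for illustration, not part of the SDK; check a real response before relying on them.

```python
import base64
import json
from typing import Iterable, Iterator, List, NamedTuple, Tuple, Union


class StreamChunk(NamedTuple):
    audio: bytes                                # decoded audio bytes for this chunk
    phonemes: List[Tuple[str, float, float]]    # (symbol, start, duration)


def parse_phoneme_stream(lines: Iterable[Union[str, bytes]]) -> Iterator[StreamChunk]:
    """Yield one StreamChunk per non-empty NDJSON line (hypothetical helper)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        obj = json.loads(line)
        audio = base64.b64decode(obj.get("audio_base64") or "")
        ph = obj.get("phonemes") or {}
        # Assumed key names; they mirror the attributes used in the Python example.
        phonemes = list(zip(
            ph.get("symbols") or [],
            ph.get("start_times_seconds") or [],
            ph.get("durations_seconds") or [],
        ))
        yield StreamChunk(audio, phonemes)
```

Pass it any iterator of lines, for example the line iterator of your HTTP client's streaming response. The sketch does not adjust timestamps: whether start times are relative to each chunk or to the whole utterance should be confirmed against real output.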
Lip-sync in games and animation. Map each phoneme to a viseme (mouth shape) and play visemes in sync with the audio. Most engines come with a default phoneme-to-viseme table — Supertone’s symbols are standard IPA-style and compatible with most rigs.
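As a rough illustration of the mapping step, the sketch below turns the three phoneme lists into a viseme track and looks up the mouth shape at a given playback time. The table and viseme names are hypothetical placeholders, not Supertone's or any engine's actual set; substitute your rig's table.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence, Tuple

# Hypothetical phoneme-to-viseme table; replace with your engine's own.
PHONEME_TO_VISEME: Dict[str, str] = {
    "p": "closed", "b": "closed", "m": "closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "o": "round", "u": "round", "w": "round",
    "a": "open", "e": "wide", "i": "wide",
}
DEFAULT_VISEME = "neutral"

Segment = Tuple[float, float, str]  # (start, end, viseme)


def build_viseme_track(
    symbols: Sequence[str],
    starts: Sequence[float],
    durations: Sequence[float],
) -> List[Segment]:
    """Map phonemes to visemes, merging back-to-back identical shapes."""
    track: List[Segment] = []
    for symbol, start, duration in zip(symbols, starts, durations):
        viseme = PHONEME_TO_VISEME.get(symbol, DEFAULT_VISEME)
        end = start + duration
        if track and track[-1][2] == viseme and abs(track[-1][1] - start) < 1e-6:
            track[-1] = (track[-1][0], end, viseme)  # extend previous segment
        else:
            track.append((start, end, viseme))
    return track


def viseme_at(track: Sequence[Segment], t: float) -> str:
    """Return the viseme active at time t; assumes track is sorted by start."""
    seg_starts = [seg[0] for seg in track]
    i = bisect_right(seg_starts, t) - 1
    if i >= 0 and t < track[i][1]:
        return track[i][2]
    return DEFAULT_VISEME  # before, between, or after segments
```

Call viseme_at once per frame with the audio playback position. Unmapped symbols and gaps fall back to the neutral shape, which keeps the mouth from freezing on the last phoneme.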
Karaoke / word highlighting. Use phoneme start times to highlight words as they’re spoken.
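One way to sketch this: given which phonemes belong to which word, a word starts when its first phoneme starts and ends when its last phoneme ends. How you obtain that word-to-phoneme alignment is outside this sketch; the word_spans argument and both helper names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

WordSpan = Tuple[str, int, int]        # (word, first phoneme index, one past last)
WordTiming = Tuple[str, float, float]  # (word, start seconds, end seconds)


def word_timings(
    starts: Sequence[float],
    durations: Sequence[float],
    word_spans: Sequence[WordSpan],
) -> List[WordTiming]:
    """Derive per-word start/end times from phoneme timings."""
    timings: List[WordTiming] = []
    for word, first, stop in word_spans:
        if not 0 <= first < stop <= len(starts):
            raise ValueError(f"bad phoneme span for {word!r}: {first}..{stop}")
        begin = starts[first]
        end = starts[stop - 1] + durations[stop - 1]
        timings.append((word, begin, end))
    return timings


def active_word(timings: Sequence[WordTiming], t: float) -> Optional[int]:
    """Index of the word being spoken at time t, or None during silence."""
    for i, (_, begin, end) in enumerate(timings):
        if begin <= t < end:
            return i
    return None
```

On each UI tick, pass the current playback position to active_word and highlight the returned index. The linear scan is adequate for a sentence or two; for long passages a binary search over start times would be a natural swap.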
Pronunciation analysis. Compare actual phonemes against an expected sequence to check pronunciation in language-learning apps.
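A simple way to sketch the comparison is a sequence diff over the symbol lists, here using Python's standard difflib. The helper name compare_phonemes and the sample symbols in the usage note are illustrative only, and this approach ignores timing; treat it as a starting point rather than a scoring method.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple

Diff = Tuple[str, List[str], List[str]]  # (opcode tag, expected slice, actual slice)


def compare_phonemes(
    expected: Sequence[str],
    actual: Sequence[str],
) -> Tuple[float, List[Diff]]:
    """Return a similarity ratio in [0, 1] and the non-matching regions."""
    matcher = SequenceMatcher(a=expected, b=actual, autojunk=False)
    diffs: List[Diff] = [
        (tag, list(expected[i1:i2]), list(actual[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replace / delete / insert regions
    ]
    return matcher.ratio(), diffs
```

For example, comparing an expected ["h", "ə", "l", "oʊ"] against an actual ["h", "ɛ", "l", "oʊ"] reports one "replace" region for the second symbol, which an app could surface as the sound to practice.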