- Stream tokens from your LLM.
- Group them into sentence-sized chunks.
- Send each chunk to Supertone TTS and forward the audio.
voice_id, and you’re done.
You may not need streaming
stream_speech is supported on sona_speech_1 only. If your priority is overall time-to-first-audio, you’ll often get there faster by picking a non-streaming model that simply finishes each request quickly:
- supertonic_api_3: fastest inference, lowest latency, with significantly improved speech stability. Best for voice agents where time-to-first-audio matters most.
- sona_speech_2_flash: balanced; lower latency than sona_speech_2 with similar quality.
- sona_speech_1 with stream_speech: only useful when a single chunk of text is long enough that chunked streaming meaningfully starts playback earlier.
sona_speech_1 even starts emitting chunks. The examples default to supertonic_api_3; switch the model string to try the others.
Recipes
Pick your LLM and language stack below. All four recipes follow the same sentence-batching pattern; only the LLM streaming bit differs.
- Python · Anthropic
- Python · OpenAI
- TypeScript · Anthropic
- TypeScript · OpenAI
Design notes
- Sentence batching matters. Sending one token at a time produces choppy, unnatural speech. The sentence splitter above flushes on ., !, ?, 。, !, and ?. For lower latency, you can also flush on a comma once the buffer exceeds ~60 characters.
- Strip markdown before sending. Instruction-tuned models (Claude especially) often wrap their answers in markdown: headings like # Title, bold **text**, code spans, etc. Supertone TTS rejects text containing # (it’s a reserved character), so the snippets above pipe every sentence through a small for_tts / forTts helper that removes #, *, _, and backticks. Without it, the first sentence of a Claude response will commonly fail with a 400.
- Model choice for latency. Reserve sona_speech_2 for offline / high-quality use cases where the user can wait. sona_speech_2_flash is a good balance of quality and speed. supertonic_api_3 gives the fastest time-to-first-audio with high speech stability. sona_speech_1 is the only model that supports stream_speech chunked streaming, which is useful if a single sentence is long and you want to start playing before it finishes.
- Saving vs playing. The examples append every audio chunk to response.wav. In a real agent you’d pipe each clip into your audio output (Web Audio, PortAudio, etc.) instead of (or in addition to) writing to disk.
- Connection reuse. Reuse the Supertone client across requests; don’t recreate it per sentence.
- Long sentences. If a single sentence exceeds 300 characters, the SDK auto-chunks it internally, so you don’t need to split further.
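The sentence-batching and markdown-stripping notes above can be sketched in plain Python, independent of any LLM or TTS SDK. This is a minimal illustration rather than the code from the recipes: for_tts borrows the helper name mentioned above, while iter_sentences, the constants, and the whitespace collapsing are assumptions made for this example.

```python
import re
from typing import Iterable, Iterator

# Characters that end a sentence and trigger a flush (from the design notes).
SENTENCE_END = ".!?。!?"
# Assumed threshold: also flush on a comma once the buffer is longer than this.
SOFT_FLUSH_MIN_CHARS = 60


def for_tts(text: str) -> str:
    """Remove #, *, _ and backticks, then collapse whitespace.

    The whitespace collapsing is an extra step added in this sketch.
    """
    cleaned = re.sub(r"[#*_`]", "", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def iter_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into cleaned, sentence-sized chunks.

    Hypothetical helper. It splits naively on every sentence-ending
    character, so decimals and abbreviations ("3.14", "e.g.") are
    split as well.
    """
    buffer = ""
    for token in tokens:
        for ch in token:
            buffer += ch
            hard_break = ch in SENTENCE_END
            soft_break = ch == "," and len(buffer) > SOFT_FLUSH_MIN_CHARS
            if hard_break or soft_break:
                sentence = for_tts(buffer)
                buffer = ""
                if sentence:  # skip chunks that were only markdown or spaces
                    yield sentence
    # Flush whatever is left once the token stream ends.
    tail = for_tts(buffer)
    if tail:
        yield tail


if __name__ == "__main__":
    # Stand-in for an LLM token stream.
    demo_tokens = ["# Hel", "lo **wor", "ld**. How", " are you", "? Fine"]
    for sentence in iter_sentences(demo_tokens):
        print(sentence)
```

In a real agent, each string yielded by iter_sentences would be passed to your TTS request in place of the print call, using a single client created once outside the loop.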
Related
Models
Pick the right model for your latency budget.
Latency optimization
More tips for reducing time-to-audio.