stream_speech returns audio chunk-by-chunk so you can start playback or forwarding before the full clip is finished. The path is /v1/text-to-speech/{voice_id}/stream.
Streaming is currently supported on sona_speech_1 only.
When to use streaming
Streaming is most useful when a single TTS clip is long enough that waiting for the whole thing to finish would be noticeable, for example a multi-sentence paragraph synthesized as one call. For interactive agents and chatbots, where each utterance is a short sentence, you'll usually get better total latency by using a fast non-streaming model:
- sona_speech_2_flash: balanced speed and quality.
- supertonic_api_3: fastest inference with high speech stability. Use when time-to-first-audio is the priority.
The LLM streaming TTS pattern does not use stream_speech at all; it relies on fast non-streaming models firing per sentence.
Basic streaming
- Python (sync)
- Python (async)
- TypeScript
- cURL
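As a rough illustration of what a streaming call looks like without an SDK, the sketch below builds a request for /v1/text-to-speech/{voice_id}/stream and reads the body in chunks using only the Python standard library. Several details are assumptions rather than facts from this page: the base URL and auth header name are placeholders, the POST method and the "text" field name are guesses, and the helper names build_stream_request and iter_audio_chunks are hypothetical. Check the API reference before relying on any of them.

```python
import json
import urllib.request
from typing import Iterator

# Assumptions, not confirmed by this page: replace with the real values.
BASE_URL = "https://api.example.com"  # placeholder base URL
API_KEY_HEADER = "x-api-key"          # placeholder auth header name


def build_stream_request(voice_id: str, text: str, api_key: str,
                         output_format: str = "wav") -> urllib.request.Request:
    """Build a request for /v1/text-to-speech/{voice_id}/stream."""
    body = json.dumps({
        "text": text,                    # assumed field name
        "model": "sona_speech_1",        # the only streaming model per this page
        "output_format": output_format,  # field named on this page
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/text-to-speech/{voice_id}/stream",
        data=body,
        method="POST",  # assumed HTTP method
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
    )


def iter_audio_chunks(request: urllib.request.Request,
                      chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield audio bytes as they arrive instead of waiting for the full clip."""
    with urllib.request.urlopen(request) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

A caller would typically loop over iter_audio_chunks(build_stream_request(...)) and write each chunk to a file or forward it to a player as it arrives.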
Request fields
Same as Create speech, with model fixed at sona_speech_1 (the only model that supports streaming today). The path is /v1/text-to-speech/{voice_id}/stream.
Response
By default, the response body is a binary audio stream with Content-Type matching output_format:
- audio/wav: chunks of the WAV file (the first chunk includes the WAV header).
- audio/mpeg: chunks of the MP3 file.
With include_phonemes=true, the response switches to NDJSON: one JSON object per line, each with a base64 audio chunk and the matching phoneme data.
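Reading the NDJSON variant could look roughly like the sketch below: each line is parsed as JSON and its base64 audio is decoded back to bytes. The field names "audio_base64" and "phonemes" are hypothetical, since this page does not name them, and iter_ndjson_audio is an illustrative helper, not part of any SDK.

```python
import base64
import json
from typing import Iterable, Iterator, Tuple

# Hypothetical field names; check the API reference for the real ones.
AUDIO_FIELD = "audio_base64"
PHONEME_FIELD = "phonemes"


def iter_ndjson_audio(lines: Iterable[bytes]) -> Iterator[Tuple[bytes, object]]:
    """Decode each NDJSON line into (audio_bytes, phoneme_data)."""
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        audio = base64.b64decode(record[AUDIO_FIELD])
        # Phoneme data is passed through untouched; None if the field is absent.
        yield audio, record.get(PHONEME_FIELD)
```

Because an HTTP response object from urllib.request.urlopen can be iterated line by line, it can be passed to this function directly.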
Streaming a long input
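When calling the endpoint without an SDK, text longer than 300 characters would need to be split by the caller. The sketch below is one way to do that, packing whole sentences into pieces of at most 300 characters. It assumes the 300-character threshold is a per-request limit, and it is an illustration only: split_text is a hypothetical helper, and the SDKs' actual splitting rules may differ.

```python
import re
from typing import List

MAX_CHARS = 300  # threshold mentioned in this section


def split_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Greedily pack whole sentences into pieces of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        # A single over-long sentence is hard-wrapped (this may cut mid-word).
        while len(sentence) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Each piece can then be sent as its own streaming request, in order, with the resulting chunks forwarded to a single reading loop.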
The SDKs auto-chunk text past 300 characters even when streaming. Internally they split the text, send sequential streaming requests, and forward chunks to the caller's iterator, so your reading loop stays the same. See Long text for details.
Tips
- Player buffering. Most players need an initial buffer before playback starts. Buffering 1–2 seconds of audio before play tends to feel smoother than playing the first chunk immediately.
- WAV vs MP3. WAV chunks are larger but easier to concatenate; MP3 streams are smaller and friendlier for delivery over slow networks.
- Error handling. Stream errors can surface mid-read, so wrap iteration in your usual error handler and be prepared to retry, especially for transient 429 or 5xx responses. See Retries and backoff.
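The player-buffering tip can be sketched as a small generator that holds back the first chunks until a minimum number of bytes has arrived, then forwards everything else immediately. The helper names are hypothetical, and the PCM defaults (44100 Hz, mono, 16-bit) are assumptions for sizing a WAV buffer; MP3 streams would need a byte estimate based on bitrate instead.

```python
from typing import Iterable, Iterator


def pcm_bytes_for(seconds: float, sample_rate: int = 44100,
                  channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Bytes of uncompressed PCM for a duration (defaults are assumptions)."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


def prebuffer(chunks: Iterable[bytes], min_bytes: int) -> Iterator[bytes]:
    """Hold back chunks until min_bytes have arrived, then pass the rest through."""
    held = []
    held_size = 0
    iterator = iter(chunks)
    for chunk in iterator:
        held.append(chunk)
        held_size += len(chunk)
        if held_size >= min_bytes:
            break
    # Release the initial buffer as one block; short streams are flushed as-is.
    if held:
        yield b"".join(held)
    # After the initial buffer, forward chunks as soon as they arrive.
    yield from iterator
```

For example, prebuffer(audio_chunks, pcm_bytes_for(1.5)) would delay playback until about 1.5 seconds of audio is available under the assumed PCM settings.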
Related
- LLM streaming TTS: sentence-by-sentence pattern for voice agents.
- Latency optimization: choose the right model and pattern for low latency.