Stream speech
Convert text to speech and return the output as a chunked audio stream.
Streams generated speech back chunk-by-chunk so you can start playback before the full clip is ready. For when to use streaming versus a fast non-streaming model, see Docs: Stream speech and Latency optimization.
sona_speech_1 only.
Endpoint
Path parameters
| Name | Required | Description |
|---|---|---|
| voice_id | ✅ | The ID of the target voice. |
Request body
Content-Type: application/json
| Name | Required | Description |
|---|---|---|
| text | ✅ | The text to convert. Max 300 characters. |
| language | ✅ | Language code. Supported: en, ko, ja. |
| style | — | Emotional style (e.g. neutral, happy). If omitted, the voice’s default style is used. |
| model | — | Must be sona_speech_1 (the only model that supports streaming). |
| output_format | — | wav (default) or mp3. |
| voice_settings | — | Advanced voice parameters (same fields and ranges as Create speech). |
| include_phonemes | — | If true, response is NDJSON with phoneme data per chunk. Default: false. |
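The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_stream_body` is a hypothetical helper name, and it assumes the 300-character limit can be approximated with Python's `len()`. The endpoint URL and authentication header are not covered here, so the sketch only builds the JSON body.

```python
MAX_TEXT_CHARS = 300                      # limit from the request body table
STREAMING_LANGUAGES = {"en", "ko", "ja"}  # languages listed for this endpoint
OUTPUT_FORMATS = {"wav", "mp3"}


def build_stream_body(text, language, style=None, output_format="wav",
                      include_phonemes=False):
    """Validate inputs against the documented limits and return the JSON body.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        # Assumption: the server-side limit is close to Python's len() count.
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if language not in STREAMING_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")

    body = {
        "text": text,
        "language": language,
        "model": "sona_speech_1",  # the only model that supports streaming
        "output_format": output_format,
        "include_phonemes": include_phonemes,
    }
    if style is not None:
        # Omitting style lets the voice's default style apply.
        body["style"] = style
    return body
```

Leaving `style` out of the body, rather than sending an empty value, matches the documented behavior of falling back to the voice's default style.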
Response
Default (include_phonemes=false): Binary audio stream.
- Content-Type: audio/wav or audio/mpeg (matches output_format).
- The first chunk includes the audio file header; subsequent chunks are raw audio data.
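Because the header arrives in the first chunk, a client can read the audio format from it and start playback before later chunks arrive. The sketch below applies to output_format=wav only and assumes a canonical RIFF/WAVE layout with the whole header inside the first chunk. `parse_wav_header` is a hypothetical helper, and it ignores the header's size fields on the assumption that they may not be meaningful for a stream.

```python
import struct


def parse_wav_header(first_chunk: bytes):
    """Return (format_info, leading_audio_bytes) from the first streamed chunk.

    Hypothetical helper; assumes a standard RIFF/WAVE header.
    """
    if first_chunk[:4] != b"RIFF" or first_chunk[8:12] != b"WAVE":
        raise ValueError("first chunk does not start with a RIFF/WAVE header")

    pos = 12  # skip "RIFF", the overall size field, and "WAVE"
    fmt = None
    while pos + 8 <= len(first_chunk):
        chunk_id, size = struct.unpack_from("<4sI", first_chunk, pos)
        pos += 8
        if chunk_id == b"fmt ":
            (audio_format, channels, sample_rate,
             _byte_rate, _block_align, bits) = struct.unpack_from(
                "<HHIIHH", first_chunk, pos)
            fmt = {
                "audio_format": audio_format,  # 1 means PCM
                "channels": channels,
                "sample_rate": sample_rate,
                "bits_per_sample": bits,
            }
        elif chunk_id == b"data":
            if fmt is None:
                raise ValueError("data chunk appeared before fmt chunk")
            # Everything after the data chunk header is raw audio.
            return fmt, first_chunk[pos:]
        pos += size + (size & 1)  # RIFF sub-chunks are padded to even length
    raise ValueError("no data chunk found in the first chunk")
```

Audio bytes returned here, followed by every later chunk as-is, can be fed to a player configured with the parsed sample rate, channel count, and bit depth.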
include_phonemes=true: Newline-delimited JSON (NDJSON), one object per chunk:
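Network chunks do not have to line up with NDJSON line boundaries, so a client typically buffers bytes and splits on newlines before decoding. The following is a minimal sketch of that buffering. `iter_ndjson` is a hypothetical helper, and it makes no assumption about which fields each object contains.

```python
import json


def iter_ndjson(byte_chunks):
    """Yield one decoded JSON object per NDJSON line.

    byte_chunks is any iterable of bytes, for example the chunk iterator
    of an HTTP client's streaming response.
    """
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Emit every complete line; keep the unfinished tail in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # A final object may arrive without a trailing newline.
    if buffer.strip():
        yield json.loads(buffer)
```

Each yielded object can then be handled as soon as it is complete, which keeps the latency benefit of streaming.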
Notes
- Stream speech is currently in beta and supports only sona_speech_1.
- text over 300 characters returns 400. SDKs auto-chunk longer input and forward chunks to your iterator.
- speed applies after duration (e.g. duration=5 + speed=2 ≈ 10 seconds).
- When style is omitted, the voice’s default style is used. Use Get voice to inspect defaults.
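When calling the endpoint directly rather than through an SDK, longer input has to be split into pieces of at most 300 characters first. The sketch below shows one possible approach, not the SDKs' actual chunking logic: it prefers sentence boundaries, falls back to whitespace, and hard-cuts only as a last resort. `split_text` is a hypothetical helper.

```python
import re


def split_text(text, limit=300):
    """Split text into chunks of at most `limit` characters.

    Illustrative only; this is not the algorithm the SDKs use.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?。！？])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is wrapped at whitespace,
        # or cut at the limit when it contains no usable space.
        while len(sentence) > limit:
            cut = sentence.rfind(" ", 0, limit + 1)
            if cut <= 0:
                cut = limit
            piece, sentence = sentence[:cut].strip(), sentence[cut:].strip()
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece)
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can be sent as its own request, with the audio played back in order.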
See also
Docs: Stream speech
LLM streaming TTS
Authorizations
Path Parameters
Body
The text to convert to speech. Maximum length: 300.
The language code of the text. Available options: en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi, hr, lt, lv, sk, sl, sv, tr, uk.
The style of character to use for the text-to-speech conversion.
The model type to use for the text-to-speech conversion. Available options: sona_speech_1, sona_speech_2, sona_speech_2_flash, supertonic_api_1, supertonic_api_3.
The desired output format of the audio file (wav, mp3). Default is wav. Available options: wav, mp3.
Return phoneme timing data with the audio.
Pre-normalized text for TTS. Only used with sona_speech_2 and sona_speech_2_flash models.
Response
Streaming audio data in binary format, or in NDJSON format with phoneme data, depending on the include_phonemes parameter.
Binary audio stream (when include_phonemes=false or omitted).