Convert text to speech with streaming response
Stream speech
Convert text to speech and return the output as a chunked audio stream.
POST
Streams generated speech back chunk by chunk, so you can start playback before the full clip is ready. To decide between streaming and a fast non-streaming model, see Docs: Stream speech and Latency optimization.
Streaming is currently supported on `sona_speech_1` only.
Endpoint
Path parameters
| Name | Required | Description |
|---|---|---|
| `voice_id` | ✅ | The ID of the target voice. |
Request body
Content-Type: application/json
| Name | Required | Description |
|---|---|---|
| `text` | ✅ | The text to convert. Max 300 characters. |
| `language` | ✅ | Language code. Supported: `en`, `ko`, `ja`. |
| `style` | — | Emotional style (e.g. `neutral`, `happy`). If omitted, the voice’s default style is used. |
| `model` | — | Must be `sona_speech_1` (the only model that supports streaming). |
| `output_format` | — | `wav` (default) or `mp3`. |
| `voice_settings` | — | Advanced voice parameters, with the same fields and ranges as Create speech. |
| `include_phonemes` | — | If `true`, the response is NDJSON with phoneme data per chunk. Default: `false`. |
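As a rough sketch of how the fields above fit together, the following Python helper assembles and validates a request body before it is sent. The name `build_stream_body` is hypothetical, the checks only mirror the constraints listed in the table, and the character limit is assumed to count Unicode code points. It does not send anything; the endpoint URL, authentication, and `voice_settings` are deliberately left out.

```python
import json

MAX_TEXT_LENGTH = 300              # limit stated in the table above
STREAMING_MODEL = "sona_speech_1"  # only model listed as supporting streaming
LANGUAGES = {"en", "ko", "ja"}     # languages listed in the table above
OUTPUT_FORMATS = {"wav", "mp3"}


def build_stream_body(text, language, style=None, output_format="wav",
                      include_phonemes=False):
    """Hypothetical helper: validate inputs and return the JSON request body."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")

    body = {
        "text": text,
        "language": language,
        "model": STREAMING_MODEL,
        "output_format": output_format,
        "include_phonemes": include_phonemes,
    }
    if style is not None:  # omit the key so the voice's default style applies
        body["style"] = style
    return body


if __name__ == "__main__":
    print(json.dumps(build_stream_body("Hello there.", "en", style="neutral")))
```

Validating on the client side avoids a round trip that would end in a `400` for over-long text.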
Response
Default (`include_phonemes=false`): Binary audio stream.
- Content-Type: `audio/wav` or `audio/mpeg` (matches `output_format`).
- The first chunk includes the audio file header; subsequent chunks are raw audio data.
`include_phonemes=true`: Newline-delimited JSON (NDJSON), one object per chunk.
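For the NDJSON case, a client has to reassemble lines itself, because network chunks are not guaranteed to end on a newline. The sketch below shows one way to do that in Python. The function name `iter_ndjson` is hypothetical, and the payloads in the demo are placeholders, since the per-chunk object fields are not shown on this page.

```python
import json


def iter_ndjson(chunks):
    """Yield one parsed JSON object per line from an iterable of byte chunks.

    Bytes are buffered until a newline arrives, so an object (or a
    multi-byte character) split across two chunks is still parsed whole.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():  # skip blank lines
                yield json.loads(line)
    if buffer.strip():  # final object without a trailing newline
        yield json.loads(buffer)


if __name__ == "__main__":
    # Placeholder payloads; the real per-chunk fields are not documented here.
    fake_stream = [b'{"seq": 1}\n{"se', b'q": 2}\n', b'{"seq": 3}']
    for obj in iter_ndjson(fake_stream):
        print(obj)
```

Any iterable of `bytes` works as input, so the same function can wrap whatever chunk iterator your HTTP client exposes.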
Notes
- Stream speech is currently in beta and supports only `sona_speech_1`.
- `text` over 300 characters returns `400`. SDKs auto-chunk longer input and forward chunks to your iterator.
- `speed` applies after `duration` (e.g. `duration=5` + `speed=2` ≈ 10 seconds).
- When `style` is omitted, the voice’s default style is used. Use Get voice to inspect defaults.
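If you call the endpoint directly rather than through an SDK, you need to split long input yourself to stay under the 300-character limit. The following is a minimal sketch of one possible splitting strategy; it is not the SDKs' actual algorithm, and the name `split_text` is hypothetical. It prefers sentence boundaries, falls back to word boundaries, and hard-cuts only words longer than the limit.

```python
import re

MAX_TEXT_LENGTH = 300  # per-request limit stated in the notes above


def split_text(text, limit=MAX_TEXT_LENGTH):
    """Split text into pieces of at most `limit` characters."""
    # Step 1: break into units that each fit within the limit.
    units = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if len(sentence) <= limit:
            units.append(sentence)
        else:
            for word in sentence.split():
                # Hard-cut a single word only if it exceeds the limit.
                units.extend(word[i:i + limit]
                             for i in range(0, len(word), limit))

    # Step 2: greedily pack units into pieces without exceeding the limit.
    pieces, current = [], ""
    for unit in units:
        if not unit:
            continue
        candidate = f"{current} {unit}" if current else unit
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = unit
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    for piece in split_text("One. Two! Three?", limit=10):
        print(repr(piece))
```

Splitting at sentence boundaries tends to keep prosody natural across requests, which is why the sketch tries that before falling back to words.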
See also
Docs: Stream speech
When to stream and how to consume chunks in each SDK.
LLM streaming TTS
End-to-end recipes with OpenAI and Anthropic.
Body
application/json
The text to convert to speech
Maximum string length: 300
The language code of the text
Available options: en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi, hr, lt, lv, sk, sl, sv, tr, uk
The style of character to use for the text-to-speech conversion
The model type to use for the text-to-speech conversion
Available options: sona_speech_1, sona_speech_2, sona_speech_2_flash, supertonic_api_1, supertonic_api_3
The desired output format of the audio file (wav, mp3). Default is wav.
Available options: wav, mp3
Return phoneme timing data with the audio
Pre-normalized text for TTS. Only used with sona_speech_2 and sona_speech_2_flash models.
Response
Streaming audio data in binary format, or in NDJSON format with phoneme data, depending on the includePhonemes parameter
Binary audio stream (when includePhonemes=false or omitted)
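Because the first binary chunk carries the audio file header, a consumer can identify the container from that chunk and then write every chunk through as it arrives. The sketch below illustrates this in Python, assuming the first chunk is at least 12 bytes long. The names `sniff_audio_format` and `save_stream` are hypothetical; the byte signatures used are the standard RIFF/WAVE header, the ID3v2 tag marker, and the MPEG audio frame sync.

```python
import io


def sniff_audio_format(first_chunk):
    """Guess the container from the first chunk, which carries the header."""
    if first_chunk[:4] == b"RIFF" and first_chunk[8:12] == b"WAVE":
        return "wav"
    if first_chunk[:3] == b"ID3":
        return "mp3"  # an ID3v2 tag precedes the audio frames
    if (len(first_chunk) >= 2 and first_chunk[0] == 0xFF
            and (first_chunk[1] & 0xE0) == 0xE0):
        return "mp3"  # MPEG audio frame sync (11 set bits)
    return None


def save_stream(chunks, sink):
    """Write chunks to `sink` as they arrive; return (format, total_bytes)."""
    audio_format, total = None, 0
    for index, chunk in enumerate(chunks):
        if index == 0:
            audio_format = sniff_audio_format(chunk)
        sink.write(chunk)
        total += len(chunk)
    return audio_format, total


if __name__ == "__main__":
    # Synthetic stand-in for a stream: a minimal RIFF/WAVE prefix plus data.
    header = b"RIFF" + b"\x00" * 4 + b"WAVE" + b"fmt "
    sink = io.BytesIO()
    print(save_stream([header, b"\x01\x02"], sink))
```

In a real player, `sink` would be an audio output buffer rather than an in-memory file, so playback can begin after the first chunk.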