POST /v1/text-to-speech/{voice_id}/stream
Convert text to speech with streaming response
curl --request POST \
  --url https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '
{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  },
  "include_phonemes": false,
  "normalized_text": "<string>"
}
'

Streams generated speech back chunk-by-chunk so you can start playback before the full clip is ready. For when to use streaming versus a fast non-streaming model, see Docs: Stream speech and Latency optimization.
Streaming is currently supported on sona_speech_1 only.

Endpoint

POST https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream

Path parameters

Name      Required  Description
voice_id  Yes       The ID of the target voice.

Request body

Content-Type: application/json
Name              Required  Description
text              Yes       The text to convert. Max 300 characters.
language          Yes       Language code. Supported: en, ko, ja.
style             No        Emotional style (e.g. neutral, happy). If omitted, the voice’s default style is used.
model             No        Must be sona_speech_1 (the only model that supports streaming).
output_format     No        wav (default) or mp3.
voice_settings    No        Advanced voice parameters; same fields and ranges as Create speech.
include_phonemes  No        If true, the response is NDJSON with phoneme data per chunk. Default: false.
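
As a rough illustration of how the fields above fit together, here is a minimal Python sketch that assembles the request with the standard library only. The function name build_stream_request and the constants are hypothetical, not part of any Supertone SDK; the URL, header name, and field names are taken from this page. Nothing is sent until the returned object is passed to urllib.request.urlopen.

```python
import json
import urllib.request

API_URL = "https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream"
MAX_TEXT_CHARS = 300  # limit stated in the request body table


def build_stream_request(voice_id, text, language, api_key,
                         output_format="wav", include_phonemes=False, style=None):
    """Return a urllib Request for the streaming endpoint (hypothetical helper)."""
    # Fail early instead of waiting for the API to reject the request.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_TEXT_CHARS}")
    if output_format not in ("wav", "mp3"):
        raise ValueError("output_format must be 'wav' or 'mp3'")
    payload = {
        "text": text,
        "language": language,
        "model": "sona_speech_1",  # the only model that supports streaming
        "output_format": output_format,
        "include_phonemes": include_phonemes,
    }
    if style is not None:
        payload["style"] = style  # omitted: the voice's default style is used
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-sup-api-key": api_key},
        method="POST",
    )
```

Validating the length and format locally is a design choice in this sketch; the API performs its own validation regardless.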

Response

Default (include_phonemes=false): Binary audio stream.
  • Content-Type: audio/wav or audio/mpeg (matches output_format).
  • The first chunk includes the audio file header; subsequent chunks are raw audio data.
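
Because the header arrives in the first chunk and later chunks are raw audio, writing chunks to a file in arrival order is one simple way to consume the binary stream. The sketch below assumes only that the response object exposes read(n), as the object returned by urllib.request.urlopen does; save_audio_stream is a hypothetical helper, and an in-memory buffer stands in for the network response.

```python
import io


def save_audio_stream(response, out_file, chunk_size=4096):
    """Copy a binary audio stream to out_file chunk by chunk; return bytes written."""
    total = 0
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # empty read means the stream has ended
            break
        out_file.write(chunk)
        total += len(chunk)
    return total


# Stand-in for an HTTP response body: any object with .read(n) works.
fake_response = io.BytesIO(b"RIFF" + bytes(10000))
buffer = io.BytesIO()
written = save_audio_stream(fake_response, buffer)
```

With a real request, the same function could be called as save_audio_stream(resp, f) inside `with urllib.request.urlopen(req) as resp, open("speech.wav", "wb") as f:`. For immediate playback, each chunk would go to an audio player instead of a file.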
When include_phonemes=true: Newline-delimited JSON (NDJSON), one object per chunk:
{"audio_base64":"...","phonemes":{"symbols":["","h"],"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}
{"audio_base64":"...","phonemes":{"symbols":["ɐ","ɡ"],"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}
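
A minimal sketch of reading this NDJSON format in Python: each line is parsed as JSON, the audio_base64 field is decoded to bytes, and the three phoneme arrays are zipped into (symbol, start, duration) tuples. The helper name decode_phoneme_stream and the two-line sample are made up for illustration; only the field names come from the example above.

```python
import base64
import json


def decode_phoneme_stream(lines):
    """Yield (audio_bytes, phonemes) for each non-empty NDJSON line."""
    for raw in lines:
        if isinstance(raw, bytes):  # HTTP responses yield bytes lines
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        obj = json.loads(raw)
        yield base64.b64decode(obj["audio_base64"]), obj["phonemes"]


# Simulated stream with placeholder audio bytes (not real audio).
sample = [
    json.dumps({"audio_base64": base64.b64encode(b"\x01\x02").decode("ascii"),
                "phonemes": {"symbols": ["", "h"],
                             "start_times_seconds": [0, 0.05],
                             "durations_seconds": [0.05, 0.08]}}),
    json.dumps({"audio_base64": base64.b64encode(b"\x03").decode("ascii"),
                "phonemes": {"symbols": ["ɐ", "ɡ"],
                             "start_times_seconds": [0.13, 0.19],
                             "durations_seconds": [0.06, 0.04]}}),
]

audio = bytearray()
timeline = []
for chunk, ph in decode_phoneme_stream(sample):
    audio += chunk
    timeline += list(zip(ph["symbols"], ph["start_times_seconds"], ph["durations_seconds"]))
```

A response object from urllib.request.urlopen can be passed directly as lines, since iterating over it yields one line at a time.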

Notes

  • Stream speech is currently in beta and supports only sona_speech_1.
  • A text value longer than 300 characters returns HTTP 400. SDKs auto-chunk longer input and forward chunks to your iterator.
  • speed applies after duration (e.g. duration=5 + speed=2 ≈ 10 seconds).
  • When style is omitted, the voice’s default style is used. Use Get voice to inspect defaults.
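
When calling the endpoint over raw HTTP rather than through an SDK, longer input has to be split before sending. The following is one possible approach, not the SDKs' actual chunking algorithm: it packs whole sentences into pieces of at most 300 characters and hard-wraps any single sentence that exceeds the limit. The name split_text is hypothetical.

```python
import re

MAX_TEXT_CHARS = 300  # per-request limit from the notes above


def split_text(text, limit=MAX_TEXT_CHARS):
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate  # sentence still fits in the current piece
        else:
            chunks.append(current)  # flush and start a new piece
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned piece can then be sent as its own request. The sentence-boundary regex is tuned for English-style punctuation and would need adjusting for other languages.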

See also

Docs: Stream speech

When to stream and how to consume chunks in each SDK.

LLM streaming TTS

End-to-end recipes with OpenAI and Anthropic.

Authorizations

x-sup-api-key (string, header, required)

Path Parameters

voice_id (string, required)

Body

application/json
text (string, required)

The text to convert to speech. Maximum string length: 300.

language (enum<string>, required)

The language code of the text.

Available options: en, ko, ja, bg, cs, da, el, es, et, fi, hu, it, nl, pl, pt, ro, ar, de, fr, hi, id, ru, vi, hr, lt, lv, sk, sl, sv, tr, uk

style (string)

The style of character to use for the text-to-speech conversion.

model (enum<string>, default: sona_speech_1)

The model type to use for the text-to-speech conversion.

Available options: sona_speech_1, sona_speech_2, sona_speech_2_flash, supertonic_api_1, supertonic_api_3

output_format (enum<string>, default: wav)

The desired output format of the audio file.

Available options: wav, mp3

voice_settings (object)

include_phonemes (boolean, default: false)

Return phoneme timing data with the audio.

normalized_text (string)

Pre-normalized text for TTS. Only used with sona_speech_2 and sona_speech_2_flash models.

Response

Streaming audio data, returned either as a binary stream or as NDJSON with phoneme data, depending on the include_phonemes parameter.

Binary audio stream (when include_phonemes=false or omitted)