stream_speech returns audio chunk-by-chunk so you can start playback or forwarding before the full clip is finished. The path is /v1/text-to-speech/{voice_id}/stream.
Streaming is currently supported on sona_speech_1 only.

When to use streaming

Streaming is most useful when a single TTS clip is long enough that waiting for the whole thing to finish would be noticeable — for example, a multi-sentence paragraph synthesized as one call. For interactive agents and chatbots, where each utterance is a short sentence, you’ll usually get better total latency by using a fast non-streaming model:
  • sona_speech_2_flash — balanced speed and quality.
  • supertonic_api_3 — fastest inference with high speech stability. Use when time-to-first-audio is the priority.
See Latency optimization for the full discussion. The sentence-by-sentence pattern in Stream TTS from an LLM response doesn’t use stream_speech at all; it makes one call to a fast non-streaming model for each sentence.

Basic streaming

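from supertone import Supertone

API_KEY = "YOUR_API_KEY"  # replace with your API key
VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=API_KEY) as client:
    # Request a streamed response; sona_speech_1 is the only model that supports streaming.
    response = client.text_to_speech.stream_speech(
        voice_id=VOICE_ID,
        text="This response is streamed chunk by chunk.",
        language="en",
        model="sona_speech_1",
        output_format="wav",
    )

    # Write each chunk to disk as it arrives instead of waiting for the full clip.
    with open("streamed.wav", "wb") as f:
        for chunk in response.result.iter_bytes():
            f.write(chunk)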

Request fields

The request fields are the same as for Create speech, except that model must be sona_speech_1 (the only model that supports streaming today). The request path is /v1/text-to-speech/{voice_id}/stream.

Response

By default, the response body is a binary audio stream with Content-Type matching output_format:
  • audio/wav — chunks of the WAV file (the first chunk includes the WAV header).
  • audio/mpeg — chunks of the MP3 file.
When include_phonemes=true, the response switches to NDJSON — one JSON object per line, each with a base64 audio chunk and the matching phoneme data.
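
As a rough sketch of how a client might consume the NDJSON form of the response: network chunks rarely align with line boundaries, so the bytes need to be buffered until a full line is available before parsing. The field names audio_base64 and phonemes below are assumptions made for illustration, not confirmed by this page; check the API reference for the actual keys. The helper takes any iterable of byte chunks, such as the one returned by iter_bytes().

```python
import base64
import json
from typing import Iterable, Iterator, Tuple

# Hypothetical field names: replace with the keys the API actually returns.
AUDIO_KEY = "audio_base64"
PHONEMES_KEY = "phonemes"


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON object per NDJSON line.

    Chunks may end in the middle of a line, so bytes are buffered
    until a newline arrives.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():  # skip blank lines
                yield json.loads(line)
    if buffer.strip():  # final line without a trailing newline
        yield json.loads(buffer)


def iter_audio_and_phonemes(chunks: Iterable[bytes]) -> Iterator[Tuple[bytes, object]]:
    """Decode the base64 audio in each record and pair it with its phoneme data."""
    for record in iter_ndjson(chunks):
        yield base64.b64decode(record[AUDIO_KEY]), record.get(PHONEMES_KEY)
```

Keeping the line buffering separate from the record decoding means only the two constants need to change once the real field names are known.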

Streaming a long input

The SDKs auto-chunk text past 300 characters even when streaming. Internally they split the text, send sequential streaming requests, and forward chunks to the caller’s iterator — so your reading loop stays the same. See Long text for details.
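
To make the idea of auto-chunking concrete, here is a minimal sketch of one way to split text into pieces of at most 300 characters. It is not the SDKs' actual algorithm, which this page does not describe; the sentence-boundary heuristic and the name split_text are assumptions chosen for illustration.

```python
import re
from typing import List

MAX_CHARS = 300  # the limit mentioned above


def split_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most max_chars, preferring sentence ends."""
    # Break after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is wrapped on a space
        # where possible, otherwise cut at exactly max_chars.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                pieces.append(current)
                current = ""
            pieces.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        if not sentence:
            continue
        # Pack as many whole sentences into one piece as will fit.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            pieces.append(current)
            current = sentence
    if current:
        pieces.append(current)
    return pieces
```

Splitting on sentence boundaries tends to keep prosody natural at the joins, which is why this sketch avoids cutting mid-sentence unless a sentence exceeds the limit on its own.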

Tips

  • Player buffering. Most players need an initial buffer before playback starts. Buffering 1–2 seconds of audio before starting playback tends to feel smoother than playing the first chunk immediately.
  • WAV vs MP3. WAV chunks are larger but easier to concatenate; MP3 streams are smaller and friendlier for delivery over slow networks.
  • Error handling. Stream errors can surface mid-read — wrap iteration in your usual error handler and be prepared to retry, especially for transient 429 or 5xx responses. See Retries and backoff.
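
The player-buffering tip can be sketched as a small generator that holds back chunks until a minimum number of bytes has arrived and then forwards the rest as they come. The helper names are hypothetical, and the byte calculation assumes uncompressed 16-bit mono PCM at 44.1 kHz, which this page does not specify; adjust the numbers to your actual output format, and treat the result as approximate since it ignores the WAV header.

```python
from typing import Iterable, Iterator


def pcm_bytes_for(seconds: float, sample_rate: int,
                  channels: int = 1, sample_width: int = 2) -> int:
    """Approximate byte count for a duration of uncompressed PCM audio."""
    return int(seconds * sample_rate * channels * sample_width)


def prebuffer(chunks: Iterable[bytes], min_bytes: int) -> Iterator[bytes]:
    """Hold back chunks until min_bytes have arrived, then pass the rest through."""
    iterator = iter(chunks)
    held = []
    held_size = 0
    for chunk in iterator:
        held.append(chunk)
        held_size += len(chunk)
        if held_size >= min_bytes:
            break
    # Release the initial buffer (or the whole clip if it was shorter than min_bytes).
    if held:
        yield b"".join(held)
    # From here on, forward chunks as soon as they arrive.
    yield from iterator


# Assumed format: 44.1 kHz, mono, 16-bit samples.
INITIAL_BUFFER = pcm_bytes_for(1.5, sample_rate=44100)
```

Wrapping the chunk iterator, for example prebuffer(response.result.iter_bytes(), INITIAL_BUFFER), leaves the reading loop unchanged while delaying the first write to the player.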

LLM streaming TTS

Sentence-by-sentence pattern for voice agents.

Latency optimization

Choose the right model and pattern for low latency.