For voice agents and chatbots, the user should hear the answer as the LLM is producing it — not after the full response is done. The pattern is:
  1. Stream tokens from your LLM.
  2. Group them into sentence-sized chunks.
  3. Send each chunk to Supertone TTS and forward the audio.
Below are end-to-end recipes you can paste into a fresh project and run. Set the two API keys, swap in a voice_id, and you’re done.

You may not need streaming

stream_speech is supported on sona_speech_1 only. If your priority is overall time-to-first-audio, you’ll often get there faster by picking a non-streaming model that simply finishes each request quickly:
  • supertonic_api_3 — fastest inference, lowest latency, with significantly improved speech stability. Best for voice agents where time-to-first-audio matters most.
  • sona_speech_2_flash — balanced; lower latency than sona_speech_2 with similar quality.
  • sona_speech_1 with stream_speech — only useful when a single chunk of text is long enough that chunked streaming meaningfully starts playback earlier.
For the sentence-by-sentence LLM pattern below, each TTS call covers one short sentence — and a non-streaming call on a fast model usually returns before streaming on sona_speech_1 even starts emitting chunks. The examples default to supertonic_api_3; switch the model string to try the others.
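
To compare models on your own sentences rather than rely on general guidance, a small timing harness is usually enough. The sketch below is illustrative only: `time_call` and `fake_synthesize` are hypothetical names, and `fake_synthesize` stands in for a real TTS call so the snippet runs without API keys.

```python
import time
from typing import Callable

def time_call(synthesize: Callable[[str], bytes], text: str) -> tuple[float, int]:
    """Return (elapsed seconds, audio size in bytes) for one synthesis call."""
    start = time.perf_counter()
    audio = synthesize(text)
    return time.perf_counter() - start, len(audio)

def fake_synthesize(text: str) -> bytes:
    # Hypothetical stand-in for a TTS request: sleeps briefly, returns dummy bytes.
    time.sleep(0.01)
    return b"\x00" * (len(text) * 100)

elapsed, size = time_call(fake_synthesize, "Hello there.")
print(f"{elapsed * 1000:.1f} ms, {size} bytes")

# With a real client, pass a wrapper around the recipe's create_speech call
# instead, for example:
#   lambda t: supertone.text_to_speech.create_speech(
#       voice_id=VOICE_ID, text=t, language="en", model=MODEL
#   ).result.read()
```

Run it with a few representative sentences per model and compare the timings; a single call is too noisy to draw conclusions from.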

Recipes

Pick your LLM and language stack. All four recipes follow the same sentence-batching pattern; only the LLM streaming part differs. The recipe reproduced here uses Python with Anthropic's Claude.
pip install supertone anthropic
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."
import os
import re
from anthropic import Anthropic
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    """Yield sentence-sized strings from an iterable of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_claude_tokens(prompt: str):
    anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from env
    with anthropic.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text

def play_or_save(audio_bytes: bytes, path: str):
    """Replace with your audio player. Here we just append to a file."""
    with open(path, "ab") as f:
        f.write(audio_bytes)

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_claude_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            play_or_save(response.result.read(), out_path)

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()
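
One caveat about `play_or_save` above: if each response is a complete WAV file with its own header (an assumption here; check the output format your requests return), appending raw bytes leaves extra headers in the middle of `response.wav`, and players that honor the first header's length field will typically stop after the first clip. A minimal sketch of merging clips into one valid file with Python's standard `wave` module follows. `merge_wav_clips` and `make_silent_clip` are hypothetical helpers, and the code assumes PCM audio with the same sample rate, channel count, and sample width in every clip.

```python
import io
import wave

def merge_wav_clips(clips: list[bytes], out_path: str) -> None:
    """Merge complete PCM WAV clips into one valid WAV file at out_path."""
    if not clips:
        raise ValueError("no clips to merge")
    with wave.open(out_path, "wb") as out:
        for index, clip in enumerate(clips):
            with wave.open(io.BytesIO(clip), "rb") as src:
                if index == 0:
                    # Copy the audio format from the first clip; later clips
                    # are assumed to match it.
                    out.setnchannels(src.getnchannels())
                    out.setsampwidth(src.getsampwidth())
                    out.setframerate(src.getframerate())
                # Write only the audio frames, dropping each clip's own header.
                out.writeframes(src.readframes(src.getnframes()))

def make_silent_clip(n_frames: int, rate: int = 16000) -> bytes:
    """Hypothetical stand-in for one TTS response: mono 16-bit silence."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()
```

In the recipe, this would mean collecting each `response.result.read()` in a list and calling `merge_wav_clips(clips, "response.wav")` once the LLM stream ends.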

Design notes

  • Sentence batching matters. Sending one token at a time produces choppy, unnatural speech. The sentence splitter above flushes on ., !, ?, 。, ！, and ？ when they are followed by whitespace. For lower latency, you can also flush on a comma once the buffer exceeds ~60 characters.
  • Strip markdown before sending. Instruction-tuned models (Claude especially) often wrap their answers in markdown — headings like # Title, bold **text**, code spans, etc. Supertone TTS rejects text containing # (it’s a reserved character), so the snippets above pipe every sentence through a small for_tts / forTts helper that removes #, *, _, and backticks. Without it, the first sentence of a Claude response will commonly fail with a 400.
  • Model choice for latency. Reserve sona_speech_2 for offline / high-quality use cases where the user can wait. sona_speech_2_flash is a good balance of quality and speed. supertonic_api_3 gives the fastest time-to-first-audio with high speech stability. sona_speech_1 is the only model that supports stream_speech chunked streaming — useful if a single sentence is long and you want to start playing before it finishes.
  • Saving vs playing. The examples append every audio chunk to response.wav. In a real agent you’d pipe each clip into your audio output (Web Audio, PortAudio, etc.) instead of (or in addition to) writing to disk.
  • Connection reuse. Reuse the Supertone client across requests — don’t recreate it per sentence.
  • Long sentences. If a single sentence exceeds 300 characters, the SDK auto-chunks it internally, so you don’t need to split further.
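
The comma flush mentioned in the first note could be sketched as follows. This is one possible reading of that suggestion, not part of the Supertone SDK: `pop_chunk`, `chunks_from_stream`, and `SOFT_LIMIT` are hypothetical names, and breaking at the last comma (rather than the first) is a design choice made here to avoid very short clips.

```python
import re

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
COMMA_BREAK = re.compile(r",\s+")
SOFT_LIMIT = 60  # characters buffered before a comma is allowed to flush

def pop_chunk(buffer: str):
    """Return (chunk, rest); chunk is None when nothing is ready to send."""
    match = SENTENCE_END.search(buffer)
    if match is None and len(buffer) > SOFT_LIMIT:
        # No sentence end yet and the buffer is long: break at the last comma.
        commas = list(COMMA_BREAK.finditer(buffer))
        match = commas[-1] if commas else None
    if match is None:
        return None, buffer
    return buffer[: match.end()].strip(), buffer[match.end():]

def chunks_from_stream(token_stream):
    """Yield sentence- or clause-sized strings from an iterable of tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            chunk, buffer = pop_chunk(buffer)
            if chunk is None:
                break
            yield chunk
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends
```

In real use each chunk would still go through the markdown-stripping helper before being sent to TTS. Clause-level clips lose some sentence-wide intonation, so it is worth listening to the result before lowering the threshold further.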

Models

Pick the right model for your latency budget.

Latency optimization

More tips for reducing time-to-audio.