stream_speech is supported on sona_speech_1 only. If your priority is overall time-to-first-audio, you’ll often get there faster by picking a non-streaming model that simply finishes each request quickly:
supertonic_api_3 — fastest inference, lowest latency, with significantly improved speech stability. Best for voice agents where time-to-first-audio matters most.
sona_speech_2_flash — balanced; lower latency than sona_speech_2 with similar quality.
sona_speech_1 with stream_speech — only useful when a single chunk of text is long enough that chunked streaming meaningfully starts playback earlier.
For the sentence-by-sentence LLM pattern below, each TTS call covers one short sentence — and a non-streaming call on a fast model usually returns before streaming on sona_speech_1 even starts emitting chunks. The examples default to supertonic_api_3; switch the model string to try the others.
```python
import os
import re

from anthropic import Anthropic
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"  # try sona_speech_2_flash for higher quality

SENTENCE_END = re.compile(r"[.!?。！？]\s+")

# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown; strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")


def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()


def sentences_from_stream(token_stream):
    """Yield sentence-sized strings from an iterable of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail


def stream_claude_tokens(prompt: str):
    anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from env
    with anthropic.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text


def play_or_save(audio_bytes: bytes, path: str):
    """Replace with your audio player. Here we just append to a file."""
    with open(path, "ab") as f:
        f.write(audio_bytes)


def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate
    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_claude_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            play_or_save(response.result.read(), out_path)
    print(f"Saved {out_path}")


if __name__ == "__main__":
    main()
```
Sentence batching matters. Sending one token at a time produces choppy, unnatural speech. The sentence splitter above flushes when ., !, ?, 。, ！, or ？ is followed by whitespace, and sends whatever remains once the token stream ends. For lower latency, you can also flush on a comma once the buffer exceeds ~60 characters.
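The comma fallback could look roughly like the sketch below. It is not part of the SDK: the name chunks_from_stream, the CLAUSE_END pattern, and the choice to cut at the last comma in the buffer (rather than the first) are assumptions made for illustration, and the markdown-stripping step is left out to keep it short.

```python
import re

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
CLAUSE_END = re.compile(r",\s+")  # assumption: only commas count as clause breaks
SOFT_LIMIT = 60  # characters; the ~60 figure mentioned above


def chunks_from_stream(token_stream, soft_limit=SOFT_LIMIT):
    """Yield full sentences, or comma-terminated clauses once the buffer is long."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None and len(buffer) > soft_limit:
                # No sentence end yet and the buffer is long: cut at the
                # last comma seen so far, if there is one.
                clause_breaks = list(CLAUSE_END.finditer(buffer))
                match = clause_breaks[-1] if clause_breaks else None
            if match is None:
                break
            chunk = buffer[: match.end()].strip()
            if chunk:
                yield chunk
            buffer = buffer[match.end():]
    tail = buffer.strip()
    if tail:
        yield tail
```

Short sentences still go out whole; the comma cut only applies once a sentence has run past the soft limit without ending. A lower limit gives earlier audio at the cost of more, shorter clips.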
Strip markdown before sending. Instruction-tuned models (Claude especially) often wrap their answers in markdown — headings like # Title, bold **text**, code spans, etc. Supertone TTS rejects text containing # (it’s a reserved character), so the snippets above pipe every sentence through a small for_tts / forTts helper that removes #, *, _, and backticks. Without it, the first sentence of a Claude response will commonly fail with a 400.
Model choice for latency. Reserve sona_speech_2 for offline / high-quality use cases where the user can wait. sona_speech_2_flash is a good balance of quality and speed. supertonic_api_3 gives the fastest time-to-first-audio with high speech stability. sona_speech_1 is the only model that supports stream_speech chunked streaming — useful if a single sentence is long and you want to start playing before it finishes.
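To compare models on your own text, a small timing harness is usually enough. The sketch below is deliberately SDK-agnostic: median_latency and the synthesize callable are hypothetical names, and you would supply a function that wraps your own create_speech call and returns the audio bytes.

```python
import time
from statistics import median
from typing import Callable, Iterable


def median_latency(
    synthesize: Callable[[str, str], bytes],
    models: Iterable[str],
    sentence: str,
    runs: int = 3,
) -> dict[str, float]:
    """Return the median wall-clock seconds per model for one sentence.

    `synthesize(model, text)` is a caller-supplied function (an assumption
    of this sketch) that performs one complete TTS request.
    """
    results = {}
    for model in models:
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            synthesize(model, sentence)
            timings.append(time.perf_counter() - start)
        results[model] = median(timings)
    return results
```

Measure with a sentence of typical length for your agent, since the ranking for short sentences may differ from the ranking for long passages.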
Saving vs playing. The examples append every audio chunk to response.wav. In a real agent you’d pipe each clip into your audio output (Web Audio, PortAudio, etc.) instead of (or in addition to) writing to disk.
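If you do want a single file on disk, note that appending complete WAV files back to back typically leaves a file whose header describes only the first clip. A minimal sketch of merging clips with Python's standard wave module follows; it assumes each response is a complete PCM WAV file and that all clips share the same channel count, sample width, and sample rate. merge_wav_clips is a hypothetical helper, not an SDK function.

```python
import io
import wave


def merge_wav_clips(clips: list[bytes]) -> bytes:
    """Combine complete WAV files (as bytes) into one WAV with a single header."""
    if not clips:
        raise ValueError("no clips to merge")
    out = io.BytesIO()
    expected = None
    with wave.open(out, "wb") as writer:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as reader:
                fmt = (
                    reader.getnchannels(),
                    reader.getsampwidth(),
                    reader.getframerate(),
                )
                if expected is None:
                    # The first clip defines the output format.
                    expected = fmt
                    writer.setnchannels(fmt[0])
                    writer.setsampwidth(fmt[1])
                    writer.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError("clips have different audio formats")
                writer.writeframes(reader.readframes(reader.getnframes()))
    return out.getvalue()
```

In the example above you would collect each response.result.read() in a list and write merge_wav_clips(clips) to response.wav once the loop finishes.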
Connection reuse. Reuse the Supertone client across requests — don’t recreate it per sentence.
Long sentences. If a single sentence exceeds 300 characters, the SDK auto-chunks it internally, so you don’t need to split further.