Stream TTS from an LLM response

For voice agents and chatbots, the user should hear the answer as the LLM is producing it — not after the full response is done. The pattern is:

Stream tokens from your LLM.
Group them into sentence-sized chunks.
Send each chunk to Supertone TTS and forward the audio.

Below are end-to-end recipes you can paste into a fresh project and run. Set the two API keys, swap in a voice_id, and you’re done.

You may not need streaming

stream_speech is supported on sona_speech_1 only. If your priority is overall time-to-first-audio, you’ll often get there faster by picking a non-streaming model that simply finishes each request quickly:

supertonic_api_3 — fastest inference, lowest latency, with significantly improved speech stability. Best for voice agents where time-to-first-audio matters most.
sona_speech_2_flash — balanced; lower latency than sona_speech_2 with similar quality.
sona_speech_1 with stream_speech — only useful when a single chunk of text is long enough that chunked streaming meaningfully starts playback earlier.

For the sentence-by-sentence LLM pattern below, each TTS call covers one short sentence — and a non-streaming call on a fast model usually returns before streaming on sona_speech_1 even starts emitting chunks. The examples default to supertonic_api_3; switch the model string to try the others.
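
If you want to check this on your own account and network rather than take it on faith, you can time how long each option takes to hand back its first audio bytes. The sketch below is a generic helper, not part of the Supertone SDK: time_to_first_audio and fake_tts are hypothetical names, and fake_tts only simulates a TTS call. In practice you would swap in a function that calls the SDK for a given model and yields the audio bytes it returns.

```python
import time
from typing import Callable, Iterable


def time_to_first_audio(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Return the seconds elapsed until synthesize(text) yields its first non-empty chunk."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("synthesize() produced no audio")


def fake_tts(text: str):
    """Hypothetical stand-in for a real TTS call: waits briefly, then yields bytes."""
    time.sleep(0.05)
    yield b"\x00" * 16


if __name__ == "__main__":
    latency = time_to_first_audio(fake_tts, "Hello there.")
    print(f"first audio after {latency * 1000:.0f} ms")
```

A non-streaming call fits the same interface if you wrap its full response body as a single chunk, so both styles can be compared with one helper.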

Recipes

Pick your LLM and language stack below. All four recipes follow the same sentence-batching pattern — only the LLM streaming bit differs.

Python · Anthropic
Python · OpenAI
TypeScript · Anthropic
TypeScript · OpenAI

Python · Anthropic

pip install supertone anthropic
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."

import os
import re
from anthropic import Anthropic
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

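# ASCII terminators need trailing whitespace (so "3.14" is not split);
# CJK terminators are usually not followed by a space, so they match alone.
SENTENCE_END = re.compile(r"[.!?]\s+|[。！？]+\s*")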
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    """Yield sentence-sized strings from an iterable of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_claude_tokens(prompt: str):
    anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from env
    with anthropic.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text

def play_or_save(audio_bytes: bytes, path: str):
    """Replace with your audio player. Here we just append to a file."""
    with open(path, "ab") as f:
        f.write(audio_bytes)

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_claude_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            play_or_save(response.result.read(), out_path)

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()

Python · OpenAI

pip install supertone openai
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export OPENAI_API_KEY="sk-..."

import os
import re
from openai import OpenAI
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

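# ASCII terminators need trailing whitespace (so "3.14" is not split);
# CJK terminators are usually not followed by a space, so they match alone.
SENTENCE_END = re.compile(r"[.!?]\s+|[。！？]+\s*")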
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_openai_tokens(prompt: str):
    openai = OpenAI()  # reads OPENAI_API_KEY from env
    stream = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_openai_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            with open(out_path, "ab") as f:
                f.write(response.result.read())

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()

TypeScript · Anthropic

npm add @supertone/supertone @anthropic-ai/sdk
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."

import Anthropic from "@anthropic-ai/sdk";
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID
const MODEL = "supertonic_api_3";          // try sona_speech_2_flash for higher quality

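// ASCII terminators need trailing whitespace (so "3.14" is not split);
// CJK terminators are usually not followed by a space, so they match alone.
const SENTENCE_END = /[.!?]\s+|[。！？]+\s*/;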
// Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
// LLMs often emit markdown — strip the common inline markers before sending.
const MARKDOWN_MARKERS = /[#*_`]+/g;
const forTts = (text: string) => text.replace(MARKDOWN_MARKERS, "").trim();

async function* sentencesFromStream(tokenStream: AsyncIterable<string>) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    while (true) {
      const match = SENTENCE_END.exec(buffer);
      if (!match) break;
      const sentence = forTts(buffer.slice(0, match.index + match[0].length));
      if (sentence) yield sentence;
      buffer = buffer.slice(match.index + match[0].length);
    }
  }
  const tail = forTts(buffer);
  if (tail) yield tail;
}

async function* streamClaudeTokens(prompt: string) {
  const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
  const stream = anthropic.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      yield event.delta.text;
    }
  }
}

async function main() {
  const prompt = "Tell me a short story about a curious robot in three sentences.";
  const outPath = "response.wav";
  fs.writeFileSync(outPath, Buffer.alloc(0));

  const supertone = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

  for await (const sentence of sentencesFromStream(streamClaudeTokens(prompt))) {
    console.log(`→ ${sentence}`);
    const response = await supertone.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: sentence,
        language: "en",
        model: MODEL,
      },
    });

    if (response.result instanceof Uint8Array) {
      fs.appendFileSync(outPath, response.result);
    } else if (response.result && "getReader" in response.result) {
      const reader = (response.result as ReadableStream<Uint8Array>).getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (value) fs.appendFileSync(outPath, value);
      }
    }
  }

  console.log(`Saved ${outPath}`);
}

main();

TypeScript · OpenAI

npm add @supertone/supertone openai
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export OPENAI_API_KEY="sk-..."

import OpenAI from "openai";
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID
const MODEL = "supertonic_api_3";          // try sona_speech_2_flash for higher quality

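// ASCII terminators need trailing whitespace (so "3.14" is not split);
// CJK terminators are usually not followed by a space, so they match alone.
const SENTENCE_END = /[.!?]\s+|[。！？]+\s*/;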
// Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
// LLMs often emit markdown — strip the common inline markers before sending.
const MARKDOWN_MARKERS = /[#*_`]+/g;
const forTts = (text: string) => text.replace(MARKDOWN_MARKERS, "").trim();

async function* sentencesFromStream(tokenStream: AsyncIterable<string>) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    while (true) {
      const match = SENTENCE_END.exec(buffer);
      if (!match) break;
      const sentence = forTts(buffer.slice(0, match.index + match[0].length));
      if (sentence) yield sentence;
      buffer = buffer.slice(match.index + match[0].length);
    }
  }
  const tail = forTts(buffer);
  if (tail) yield tail;
}

async function* streamOpenAITokens(prompt: string) {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from env
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) yield delta;
  }
}

async function main() {
  const prompt = "Tell me a short story about a curious robot in three sentences.";
  const outPath = "response.wav";
  fs.writeFileSync(outPath, Buffer.alloc(0));

  const supertone = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

  for await (const sentence of sentencesFromStream(streamOpenAITokens(prompt))) {
    console.log(`→ ${sentence}`);
    const response = await supertone.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: sentence,
        language: "en",
        model: MODEL,
      },
    });

    if (response.result instanceof Uint8Array) {
      fs.appendFileSync(outPath, response.result);
    } else if (response.result && "getReader" in response.result) {
      const reader = (response.result as ReadableStream<Uint8Array>).getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (value) fs.appendFileSync(outPath, value);
      }
    }
  }

  console.log(`Saved ${outPath}`);
}

main();

Design notes

Sentence batching matters. Sending one token at a time produces choppy, unnatural speech. The sentence splitter above flushes on ., !, ?, 。, ！, ？. For lower latency, you can also flush on a comma once the buffer exceeds ~60 characters.
Strip markdown before sending. Instruction-tuned models (Claude especially) often wrap their answers in markdown — headings like # Title, bold **text**, code spans, etc. Supertone TTS rejects text containing # (it’s a reserved character), so the snippets above pipe every sentence through a small for_tts / forTts helper that removes #, *, _, and backticks. Without it, the first sentence of a Claude response will commonly fail with a 400.
Model choice for latency. Reserve sona_speech_2 for offline / high-quality use cases where the user can wait. sona_speech_2_flash is a good balance of quality and speed. supertonic_api_3 gives the fastest time-to-first-audio with high speech stability. sona_speech_1 is the only model that supports stream_speech chunked streaming — useful if a single sentence is long and you want to start playing before it finishes.
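Saving vs playing. The examples append every audio clip to response.wav, which is enough to confirm the pipeline runs. If each response is a complete WAV file with its own header, though, the concatenated bytes are not a single valid WAV, and many players will stop after the first clip. In a real agent you’d pipe each clip into your audio output (Web Audio, PortAudio, etc.) instead of (or in addition to) writing to disk.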
Connection reuse. Reuse the Supertone client across requests — don’t recreate it per sentence.
Long sentences. If a single sentence exceeds 300 characters, the SDK auto-chunks it internally, so you don’t need to split further.
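
The comma flush mentioned under sentence batching could look like the sketch below. It keeps a sentence rule like the one in the recipes and adds a softer break at a comma once the buffer grows past a threshold. SOFT_BREAK, SOFT_LIMIT, and chunks_from_stream are illustrative names, and 60 is the rough figure from the note rather than a tuned value. Markdown stripping is left out to keep the example short; in a real pipeline each chunk would still go through for_tts.

```python
import re

# Sentence terminators; ASCII ones need trailing whitespace so "3.14" stays whole.
SENTENCE_END = re.compile(r"[.!?]\s+|[。！？]+\s*")
# Hypothetical soft break: a comma followed by whitespace (so "1,000" stays whole).
SOFT_BREAK = re.compile(r",\s+")
SOFT_LIMIT = 60  # only consider comma breaks once the buffer is longer than this


def chunks_from_stream(token_stream, soft_limit=SOFT_LIMIT):
    """Yield full sentences, plus comma-delimited clauses once the buffer gets long."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match and len(buffer) > soft_limit:
                # No sentence end yet: cut at the last comma seen so far, if any.
                commas = list(SOFT_BREAK.finditer(buffer))
                match = commas[-1] if commas else None
            if not match:
                break
            chunk = buffer[: match.end()].strip()
            if chunk:
                yield chunk
            buffer = buffer[match.end():]
    tail = buffer.strip()
    if tail:
        yield tail
```

A lower soft_limit starts playback sooner but splits speech into shorter clauses, which can affect prosody, so it is worth listening to the result before settling on a value.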
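
If you want one playable file rather than a quick smoke test, the clips need to be merged at the audio level instead of byte-appended. Here is a minimal sketch using Python's standard wave module. It assumes every response is a complete PCM WAV file with the same sample rate, sample width, and channel count; merge_wav_clips is a hypothetical helper name, not an SDK function.

```python
import io
import wave


def merge_wav_clips(clips, out_path):
    """Combine complete WAV files (given as bytes) into one valid WAV file."""
    params = None
    frames = []
    for clip in clips:
        with wave.open(io.BytesIO(clip), "rb") as src:
            fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = fmt  # the first clip defines the output format
            elif fmt != params:
                raise ValueError("clips have different audio formats")
            frames.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no clips to merge")
    with wave.open(out_path, "wb") as out:
        out.setnchannels(params[0])
        out.setsampwidth(params[1])
        out.setframerate(params[2])
        out.writeframes(b"".join(frames))  # header sizes are fixed up on close
```

In the Python recipes, this would mean collecting response.result.read() for each sentence into a list and calling merge_wav_clips(clips, "response.wav") once the loop finishes.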

Related

Models: Pick the right model for your latency budget.
Latency optimization: More tips for reducing time-to-audio.
