Streaming LLM Responses to TTS

This page was published as a machine translation of the English original, so some phrasing may be unnatural. Check the English original for the authoritative wording.

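Voice agents and chatbots need to deliver speech to the user while the LLM is still generating its answer. Ideally, playback starts during generation rather than after the full response has completed. The pattern is:

1. Stream tokens from the LLM.
2. Group them into sentence-sized chunks.
3. Send each chunk to Supertone TTS and play the audio in order.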

Below are end-to-end recipes that you can paste into a project and run as-is. Set the two API keys and swap in your own voice_id.

You may not need streaming

stream_speech is supported only on sona_speech_1. If your priority is the total time until the first audio comes back (time-to-first-audio), a non-streaming model that completes each request quickly is often faster overall.

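supertonic_api_3: the fastest inference and the lowest latency. Speech stability is also greatly improved, which makes it the best fit for voice agents that prioritize time-to-first-audio above all.
sona_speech_2_flash: the balanced option. It keeps quality on par with sona_speech_2 at lower latency.
sona_speech_1 + stream_speech: useful only when a single text chunk is long enough that chunk-level streaming meaningfully brings playback forward.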

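In the sentence-by-sentence pattern below, each TTS call covers one short sentence. In that case, a non-streaming call to a fast model almost always finishes before sona_speech_1 streaming emits its first chunk. The samples default to supertonic_api_3; swap the model string to try the other models.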
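One way to check this for your own voice and text is to time each option. The sketch below is a hypothetical harness, not part of the Supertone SDK: `time_to_first_audio` accepts any callable that returns an iterable of audio byte chunks and reports the seconds elapsed until the first non-empty chunk. The two `fake_*` functions are stand-ins that only simulate latency with `time.sleep`, and their delays are made up for illustration. Replace them with wrappers around your real TTS calls.

```python
import time
from typing import Callable, Iterable


def time_to_first_audio(synthesize: Callable[[str], Iterable[bytes]], text: str) -> float:
    """Return seconds from the call until the first non-empty audio chunk."""
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("synthesize() produced no audio")


# Stand-ins for real TTS calls (assumptions, for illustration only).
def fake_fast_non_streaming(text: str):
    time.sleep(0.05)           # the whole clip arrives at once, quickly
    yield b"\x00" * 1000


def fake_slow_streaming(text: str):
    time.sleep(0.15)           # the first streamed chunk arrives later
    yield b"\x00" * 100
    time.sleep(0.15)
    yield b"\x00" * 100


if __name__ == "__main__":
    sentence = "A short sentence."
    candidates = [
        ("non-streaming", fake_fast_non_streaming),
        ("streaming", fake_slow_streaming),
    ]
    for name, fn in candidates:
        elapsed_ms = time_to_first_audio(fn, sentence) * 1000
        print(f"{name}: {elapsed_ms:.0f} ms to first audio")
```

Measuring with sentences of the length your LLM actually produces matters here, since the trade-off depends on chunk length.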

Recipes

Pick your LLM and language stack below. All four recipes follow the same sentence-batching pattern; only the LLM streaming part differs.

Python · Anthropic
Python · OpenAI
TypeScript · Anthropic
TypeScript · OpenAI

Python · Anthropic

pip install supertone anthropic
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."

import os
import re
from anthropic import Anthropic
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

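# ASCII terminators must be followed by whitespace (so "3.14" is not split);
# CJK terminators usually have no trailing space, so they match on their own.
SENTENCE_END = re.compile(r"[.!?]+\s+|[。！？]+\s*")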
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    """Yield sentence-sized strings from an iterable of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_claude_tokens(prompt: str):
    anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from env
    with anthropic.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text

def play_or_save(audio_bytes: bytes, path: str):
    """Replace with your audio player. Here we just append to a file."""
    with open(path, "ab") as f:
        f.write(audio_bytes)

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_claude_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            play_or_save(response.result.read(), out_path)

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()

Python · OpenAI

pip install supertone openai
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export OPENAI_API_KEY="sk-..."

import os
import re
from openai import OpenAI
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

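# ASCII terminators must be followed by whitespace (so "3.14" is not split);
# CJK terminators usually have no trailing space, so they match on their own.
SENTENCE_END = re.compile(r"[.!?]+\s+|[。！？]+\s*")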
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    """Yield sentence-sized strings from an iterable of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_openai_tokens(prompt: str):
    openai = OpenAI()  # reads OPENAI_API_KEY from env
    stream = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (for example a usage-only chunk) carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_openai_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            with open(out_path, "ab") as f:
                f.write(response.result.read())

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()

TypeScript · Anthropic

npm add @supertone/supertone @anthropic-ai/sdk
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."

import Anthropic from "@anthropic-ai/sdk";
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID
const MODEL = "supertonic_api_3";          // try sona_speech_2_flash for higher quality

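// ASCII terminators must be followed by whitespace (so "3.14" is not split);
// CJK terminators usually have no trailing space, so they match on their own.
const SENTENCE_END = /[.!?]+\s+|[。！？]+\s*/;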
// Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
// LLMs often emit markdown — strip the common inline markers before sending.
const MARKDOWN_MARKERS = /[#*_`]+/g;
const forTts = (text: string) => text.replace(MARKDOWN_MARKERS, "").trim();

async function* sentencesFromStream(tokenStream: AsyncIterable<string>) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    while (true) {
      const match = SENTENCE_END.exec(buffer);
      if (!match) break;
      const sentence = forTts(buffer.slice(0, match.index + match[0].length));
      if (sentence) yield sentence;
      buffer = buffer.slice(match.index + match[0].length);
    }
  }
  const tail = forTts(buffer);
  if (tail) yield tail;
}

async function* streamClaudeTokens(prompt: string) {
  const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
  const stream = anthropic.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      yield event.delta.text;
    }
  }
}

async function main() {
  const prompt = "Tell me a short story about a curious robot in three sentences.";
  const outPath = "response.wav";
  fs.writeFileSync(outPath, Buffer.alloc(0));

  const supertone = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

  for await (const sentence of sentencesFromStream(streamClaudeTokens(prompt))) {
    console.log(`→ ${sentence}`);
    const response = await supertone.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: sentence,
        language: "en",
        model: MODEL,
      },
    });

    if (response.result instanceof Uint8Array) {
      fs.appendFileSync(outPath, response.result);
    } else if (response.result && "getReader" in response.result) {
      const reader = (response.result as ReadableStream<Uint8Array>).getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (value) fs.appendFileSync(outPath, value);
      }
    }
  }

  console.log(`Saved ${outPath}`);
}

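// Surface failures (bad API key, rejected text) instead of an unhandled rejection.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});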

TypeScript · OpenAI

npm add @supertone/supertone openai
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export OPENAI_API_KEY="sk-..."

import OpenAI from "openai";
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID
const MODEL = "supertonic_api_3";          // try sona_speech_2_flash for higher quality

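// ASCII terminators must be followed by whitespace (so "3.14" is not split);
// CJK terminators usually have no trailing space, so they match on their own.
const SENTENCE_END = /[.!?]+\s+|[。！？]+\s*/;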
// Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
// LLMs often emit markdown — strip the common inline markers before sending.
const MARKDOWN_MARKERS = /[#*_`]+/g;
const forTts = (text: string) => text.replace(MARKDOWN_MARKERS, "").trim();

async function* sentencesFromStream(tokenStream: AsyncIterable<string>) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    while (true) {
      const match = SENTENCE_END.exec(buffer);
      if (!match) break;
      const sentence = forTts(buffer.slice(0, match.index + match[0].length));
      if (sentence) yield sentence;
      buffer = buffer.slice(match.index + match[0].length);
    }
  }
  const tail = forTts(buffer);
  if (tail) yield tail;
}

async function* streamOpenAITokens(prompt: string) {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from env
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) yield delta;
  }
}

async function main() {
  const prompt = "Tell me a short story about a curious robot in three sentences.";
  const outPath = "response.wav";
  fs.writeFileSync(outPath, Buffer.alloc(0));

  const supertone = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

  for await (const sentence of sentencesFromStream(streamOpenAITokens(prompt))) {
    console.log(`→ ${sentence}`);
    const response = await supertone.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: sentence,
        language: "en",
        model: MODEL,
      },
    });

    if (response.result instanceof Uint8Array) {
      fs.appendFileSync(outPath, response.result);
    } else if (response.result && "getReader" in response.result) {
      const reader = (response.result as ReadableStream<Uint8Array>).getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (value) fs.appendFileSync(outPath, value);
      }
    }
  }

  console.log(`Saved ${outPath}`);
}

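// Surface failures (bad API key, rejected text) instead of an unhandled rejection.
main().catch((err) => {
  console.error(err);
  process.exit(1);
});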

Design notes

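Sentence batching matters. Sending tokens one at a time produces choppy, unnatural speech. The sentence splitter above breaks on ., !, ?, 。, ！ and ？. To cut latency further, you can also flush on a comma once the buffer exceeds about 60 characters.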
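Strip markdown before sending. Instruction-tuned models (Claude in particular) often wrap answers in markdown: headings such as # Title, bold such as **text**, code spans, and so on. Supertone TTS rejects text containing # because it is a reserved character. The samples above therefore pass every sentence through a small helper, for_tts / forTts, which removes #, *, _ and backticks. Without it, the first sentence of a Claude response often fails with a 400.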
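Choose the model by latency. Reserve sona_speech_2 for offline or quality-first use cases where the user can wait. sona_speech_2_flash balances quality and speed. supertonic_api_3 gives the shortest time-to-first-audio along with high speech stability. sona_speech_1 is the only model that supports chunk-level streaming through stream_speech, which helps when a sentence is long and you want playback to start before the sentence has finished generating.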
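Saving and playback. The samples append every audio clip to response.wav. If each clip is a complete WAV file, the appended file carries one header per clip, and many players stop after the first clip. In a real agent you would pipe each clip to an audio output such as Web Audio or PortAudio, in addition to (or instead of) writing to disk.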
Reuse connections. Share one Supertone client across requests; do not recreate it for every sentence.
Long sentences. When a sentence exceeds 300 characters, the SDK chunks it automatically, so you do not need to split it yourself.
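The comma-flush idea from the sentence-batching note could look like the sketch below. It is an illustration under assumptions, not part of the recipes or the Supertone SDK: `chunks_from_stream`, `SOFT_BREAK` and `soft_limit` are hypothetical names, and the 60-character threshold is the rough figure from the note. A sentence end always wins. Only when none is present and the buffer is longer than `soft_limit` does the splitter cut at the last comma seen so far.

```python
import re

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
# An ASCII comma must be followed by whitespace so "1,000" is not split.
SOFT_BREAK = re.compile(r",\s+|、\s*")
SOFT_FLUSH_CHARS = 60  # rough threshold; tune it for your voice and model


def chunks_from_stream(token_stream, soft_limit=SOFT_FLUSH_CHARS):
    """Yield sentences, plus comma-delimited clauses once the buffer grows long."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None and len(buffer) > soft_limit:
                # No sentence end yet: fall back to the last comma in the buffer.
                soft_breaks = list(SOFT_BREAK.finditer(buffer))
                match = soft_breaks[-1] if soft_breaks else None
            if match is None:
                break
            chunk = buffer[: match.end()].strip()
            if chunk:
                yield chunk
            buffer = buffer[match.end():]
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    tokens = [
        "When the robot first opened its eyes, ",
        "it saw a workshop full of tools, ",
        "and it wondered what each one did. ",
        "Then it smiled.",
    ]
    for chunk in chunks_from_stream(tokens):
        print(repr(chunk))
```

Cutting at the last comma rather than the first keeps clauses as long as possible, which tends to preserve more natural prosody per TTS call. The trade-off is that a flushed chunk can exceed the threshold.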
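For the saving note, one way to get a single playable file is to merge clips frame by frame instead of appending raw bytes. The sketch below uses Python's standard `wave` module. It assumes each clip is a complete PCM WAV file and that all clips share the same channel count, sample width and sample rate; whether that holds depends on the output format you request. `merge_wav_clips` is a hypothetical helper, not a Supertone SDK function.

```python
import io
import wave


def merge_wav_clips(clips, out_path):
    """Concatenate complete WAV byte strings into one valid WAV file.

    Assumes every clip is PCM WAV with the same channels, sample width and rate.
    """
    clips = list(clips)
    if not clips:
        raise ValueError("no clips to merge")

    expected = None
    with wave.open(out_path, "wb") as out:
        for clip in clips:
            with wave.open(io.BytesIO(clip), "rb") as src:
                fmt = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if expected is None:
                    # The first clip defines the output format.
                    expected = fmt
                    out.setnchannels(fmt[0])
                    out.setsampwidth(fmt[1])
                    out.setframerate(fmt[2])
                elif fmt != expected:
                    raise ValueError(f"clip format {fmt} differs from {expected}")
                # Copy only the audio frames; the header is written once on close.
                out.writeframes(src.readframes(src.getnframes()))
```

In the Python recipes this would mean collecting each `response.result.read()` into a list and calling `merge_wav_clips(clips, "response.wav")` once the loop ends, assuming those bytes are WAV data.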

Related

Models

Pick the right model for your latency budget.

Latency optimization

More tips for reducing time-to-audio.