Streaming LLM responses to TTS


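In voice agents and chatbots, users should hear the answer while the LLM is still generating it, not after the full response is complete. The pattern is:

Stream tokens from the LLM.
Batch them into sentence-sized chunks.
Send each chunk to Supertone TTS and deliver the audio.

Below is an end-to-end recipe you can paste into a fresh project and run as-is. Set two API keys, swap in your own voice_id, and you're done.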

You may not need streaming

stream_speech is supported only on sona_speech_1. If your priority is overall time-to-first-audio, a non-streaming model that completes each request quickly is often the faster choice.

supertonic_api_3: the fastest inference and the lowest latency, with greatly improved voice stability. Best suited to voice agents where time-to-first-audio matters most.
sona_speech_2_flash: the balanced option, with lower latency at quality similar to sona_speech_2.
sona_speech_1 with stream_speech: useful only when a single text chunk is long enough for chunked streaming to meaningfully bring forward the start of playback.

In the sentence-by-sentence LLM pattern below, each TTS call handles only one short sentence, so a non-streaming call on a fast model usually finishes before sona_speech_1 streaming has even started emitting chunks. The recipes below default to supertonic_api_3; change the model string to try other models.
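To see which option reaches first audio sooner for your own sentences, it can help to measure rather than assume. The following is a minimal sketch, not part of the Supertone SDK: `time_to_first_audio` is an illustrative name, and `synthesize` is a hypothetical callable that you write yourself to wrap whichever TTS call you are comparing. It is assumed to return an iterable of audio byte chunks (a single-element list for a non-streaming call).

```python
import time
from typing import Callable, Iterable


def time_to_first_audio(
    synthesize: Callable[[str], Iterable[bytes]], text: str
) -> float:
    """Return the seconds elapsed until the first non-empty audio chunk arrives.

    `synthesize` is a hypothetical wrapper around your TTS call. For a
    non-streaming call it can return a one-element list holding the full clip.
    """
    start = time.perf_counter()
    for chunk in synthesize(text):
        if chunk:  # ignore empty keep-alive chunks
            return time.perf_counter() - start
    raise RuntimeError("synthesize() produced no audio")


if __name__ == "__main__":
    # Stand-in backend with a simulated delay; replace with a real wrapper.
    def fake_synthesize(text: str):
        time.sleep(0.05)  # simulated network + inference time, not a real figure
        yield b"\x00" * 1024

    elapsed = time_to_first_audio(fake_synthesize, "Hello there.")
    print(f"time to first audio: {elapsed * 1000:.0f} ms")
```

Run it several times per model with sentences typical of your agent, since a single measurement is dominated by network jitter.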

Recipes

Pick your LLM and language stack below. All four recipes follow the same sentence-batching pattern; only the LLM streaming part differs.

Python · Anthropic
Python · OpenAI
TypeScript · Anthropic
TypeScript · OpenAI

pip install supertone anthropic
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."

import os
import re
from anthropic import Anthropic
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    """Yield sentence-sized strings from an iterable of text tokens."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_claude_tokens(prompt: str):
    anthropic = Anthropic()  # reads ANTHROPIC_API_KEY from env
    with anthropic.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text

def play_or_save(audio_bytes: bytes, path: str):
    """Replace with your audio player. Here we just append to a file."""
    with open(path, "ab") as f:
        f.write(audio_bytes)

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_claude_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            play_or_save(response.result.read(), out_path)

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()

pip install supertone openai
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export OPENAI_API_KEY="sk-..."

import os
import re
from openai import OpenAI
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID
MODEL = "supertonic_api_3"            # try sona_speech_2_flash for higher quality

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
# Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
# LLMs often emit markdown — strip the common inline markers before sending.
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")

def for_tts(text: str) -> str:
    return MARKDOWN_MARKERS.sub("", text).strip()

def sentences_from_stream(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        while True:
            match = SENTENCE_END.search(buffer)
            if not match:
                break
            sentence = for_tts(buffer[: match.end()])
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    tail = for_tts(buffer)
    if tail:
        yield tail

def stream_openai_tokens(prompt: str):
    openai = OpenAI()  # reads OPENAI_API_KEY from env
    stream = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:  # some chunks (e.g. a trailing usage chunk) carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta

def main():
    prompt = "Tell me a short story about a curious robot in three sentences."
    out_path = "response.wav"
    open(out_path, "wb").close()  # truncate

    with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as supertone:
        for sentence in sentences_from_stream(stream_openai_tokens(prompt)):
            print(f"→ {sentence}")
            response = supertone.text_to_speech.create_speech(
                voice_id=VOICE_ID,
                text=sentence,
                language="en",
                model=MODEL,
            )
            with open(out_path, "ab") as f:
                f.write(response.result.read())

    print(f"Saved {out_path}")

if __name__ == "__main__":
    main()

npm add @supertone/supertone @anthropic-ai/sdk
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export ANTHROPIC_API_KEY="sk-ant-..."

import Anthropic from "@anthropic-ai/sdk";
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID
const MODEL = "supertonic_api_3";          // try sona_speech_2_flash for higher quality

const SENTENCE_END = /[.!?。！？]\s+/;
// Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
// LLMs often emit markdown — strip the common inline markers before sending.
const MARKDOWN_MARKERS = /[#*_`]+/g;
const forTts = (text: string) => text.replace(MARKDOWN_MARKERS, "").trim();

async function* sentencesFromStream(tokenStream: AsyncIterable<string>) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    while (true) {
      const match = SENTENCE_END.exec(buffer);
      if (!match) break;
      const sentence = forTts(buffer.slice(0, match.index + match[0].length));
      if (sentence) yield sentence;
      buffer = buffer.slice(match.index + match[0].length);
    }
  }
  const tail = forTts(buffer);
  if (tail) yield tail;
}

async function* streamClaudeTokens(prompt: string) {
  const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from env
  const stream = anthropic.messages.stream({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      yield event.delta.text;
    }
  }
}

async function main() {
  const prompt = "Tell me a short story about a curious robot in three sentences.";
  const outPath = "response.wav";
  fs.writeFileSync(outPath, Buffer.alloc(0));

  const supertone = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

  for await (const sentence of sentencesFromStream(streamClaudeTokens(prompt))) {
    console.log(`→ ${sentence}`);
    const response = await supertone.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: sentence,
        language: "en",
        model: MODEL,
      },
    });

    if (response.result instanceof Uint8Array) {
      fs.appendFileSync(outPath, response.result);
    } else if (response.result && "getReader" in response.result) {
      const reader = (response.result as ReadableStream<Uint8Array>).getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (value) fs.appendFileSync(outPath, value);
      }
    }
  }

  console.log(`Saved ${outPath}`);
}

main();

npm add @supertone/supertone openai
export SUPERTONE_API_KEY="Kp9mZ3xQ7v..."
export OPENAI_API_KEY="sk-..."

import OpenAI from "openai";
import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID
const MODEL = "supertonic_api_3";          // try sona_speech_2_flash for higher quality

const SENTENCE_END = /[.!?。！？]\s+/;
// Supertone TTS rejects text containing '#' (reserved). Instruction-tuned
// LLMs often emit markdown — strip the common inline markers before sending.
const MARKDOWN_MARKERS = /[#*_`]+/g;
const forTts = (text: string) => text.replace(MARKDOWN_MARKERS, "").trim();

async function* sentencesFromStream(tokenStream: AsyncIterable<string>) {
  let buffer = "";
  for await (const token of tokenStream) {
    buffer += token;
    while (true) {
      const match = SENTENCE_END.exec(buffer);
      if (!match) break;
      const sentence = forTts(buffer.slice(0, match.index + match[0].length));
      if (sentence) yield sentence;
      buffer = buffer.slice(match.index + match[0].length);
    }
  }
  const tail = forTts(buffer);
  if (tail) yield tail;
}

async function* streamOpenAITokens(prompt: string) {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from env
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    const delta = chunk.choices[0]?.delta?.content;
    if (delta) yield delta;
  }
}

async function main() {
  const prompt = "Tell me a short story about a curious robot in three sentences.";
  const outPath = "response.wav";
  fs.writeFileSync(outPath, Buffer.alloc(0));

  const supertone = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

  for await (const sentence of sentencesFromStream(streamOpenAITokens(prompt))) {
    console.log(`→ ${sentence}`);
    const response = await supertone.textToSpeech.createSpeech({
      voiceId: VOICE_ID,
      apiConvertTextToSpeechUsingCharacterRequest: {
        text: sentence,
        language: "en",
        model: MODEL,
      },
    });

    if (response.result instanceof Uint8Array) {
      fs.appendFileSync(outPath, response.result);
    } else if (response.result && "getReader" in response.result) {
      const reader = (response.result as ReadableStream<Uint8Array>).getReader();
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        if (value) fs.appendFileSync(outPath, value);
      }
    }
  }

  console.log(`Saved ${outPath}`);
}

main();

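Design notes

Sentence batching matters. Sending tokens one at a time produces choppy, unnatural speech. The sentence splitter above flushes on ., !, ?, 。, ！, and ？ when they are followed by whitespace. To cut latency further, you can also flush on commas once the buffer exceeds about 60 characters.
Strip markdown before sending. Instruction-tuned models (Claude in particular) often wrap answers in markdown: headings like # Title, bold like **text**, code spans, and so on. Supertone TTS rejects text containing # (a reserved character), so the snippets above pass every sentence through a small helper, for_tts / forTts, that removes #, *, _, and backticks. Without this step, the first sentence of a Claude response typically fails with a 400 error.
Choose a model with latency in mind. Use sona_speech_2 only for offline or high-quality use cases where the user can wait. sona_speech_2_flash offers a good balance of quality and speed. supertonic_api_3 gives the fastest time-to-first-audio along with high voice stability. sona_speech_1 is the only model that supports stream_speech chunked streaming, which is useful when a single sentence is long and you want playback to start before that sentence has finished synthesizing.
Saving vs. playing. The examples above append every audio chunk to response.wav. In a real agent, send each clip to an audio output (Web Audio, PortAudio, etc.) instead of, or in addition to, writing it to disk.
Reuse connections. Reuse the Supertone client across requests; don't create a new one for every sentence.
Handling long sentences. Even if a single sentence exceeds 300 characters, the SDK splits it automatically under the hood, so you don't need to split it yourself.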
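The comma-flush idea from the sentence-batching note could be sketched as follows. This is one possible variant of the recipe's `sentences_from_stream`, not the recipe's own code: the names `chunks_from_stream`, `COMMA_BREAK`, and `SOFT_LIMIT` are illustrative, and cutting at the last comma in the buffer (rather than the first) is a design choice made here to avoid very short chunks.

```python
import re
from typing import Iterable, Iterator, Optional

SENTENCE_END = re.compile(r"[.!?。！？]\s+")
COMMA_BREAK = re.compile(r",\s+")  # requires whitespace, so "1,000" is not split
MARKDOWN_MARKERS = re.compile(r"[#*_`]+")
SOFT_LIMIT = 60  # characters; an assumed starting point, tune for your agent


def for_tts(text: str) -> str:
    """Strip inline markdown markers, as in the recipes above."""
    return MARKDOWN_MARKERS.sub("", text).strip()


def _next_cut(buffer: str, soft_limit: int) -> Optional[int]:
    """Return the index to cut at, or None if the buffer should keep growing."""
    match = SENTENCE_END.search(buffer)
    if match:
        return match.end()
    if len(buffer) > soft_limit:
        commas = list(COMMA_BREAK.finditer(buffer))
        if commas:
            return commas[-1].end()  # last comma keeps chunks as long as possible
    return None


def chunks_from_stream(
    token_stream: Iterable[str], soft_limit: int = SOFT_LIMIT
) -> Iterator[str]:
    """Yield sentences, plus comma-delimited clauses once the buffer runs long."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while (cut := _next_cut(buffer, soft_limit)) is not None:
            chunk = for_tts(buffer[:cut])
            if chunk:
                yield chunk
            buffer = buffer[cut:]
    tail = for_tts(buffer)
    if tail:
        yield tail


if __name__ == "__main__":
    text = (
        "When the robot finally reached the top of the hill after a long climb, "
        "it paused to look at the stars. Then it went home."
    )
    tokens = re.findall(r"\S+\s*", text)  # word-sized stand-ins for LLM tokens
    for chunk in chunks_from_stream(tokens):
        print(chunk)
```

Short sentences with commas, such as "Hi, there.", still come out whole because the comma rule only applies once the buffer is past the soft limit. Clause-level chunks start playback sooner but give the TTS model less context for intonation, so it is worth listening to the result before lowering the limit.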

Related docs

Models
Pick the model that fits your latency budget.

Latency optimization
More tips for reducing time to audio output.
