Generating phonemes for lip sync - Supertone API Documentation


To lip-sync a character to generated speech, three things must stay in sync:

The audio file.
The phoneme symbols that were actually spoken.
The start time and duration of each phoneme.

Pass include_phonemes: true in a TTS request and Supertone returns all three.

Python: request audio + phonemes

import base64
import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Welcome to the workshop.",
        language="en",
        model="sona_speech_2",
        include_phonemes=True,
    )

    audio_bytes = base64.b64decode(response.result.audio_base64)
    with open("speech.wav", "wb") as f:
        f.write(audio_bytes)

    phonemes = response.result.phonemes
    for symbol, start, duration in zip(
        phonemes.symbols,
        phonemes.start_times_seconds,
        phonemes.durations_seconds,
    ):
        print(f"{start:7.3f}s  {duration:5.3f}s  {symbol!r}")

TypeScript: request audio + phonemes

import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Welcome to the workshop.",
    language: "en",
    model: "sona_speech_2",
    includePhonemes: true,
  },
});

const result = response.result as {
  audioBase64: string;
  phonemes?: { symbols?: string[]; startTimesSeconds?: number[]; durationsSeconds?: number[] };
};

fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

const { symbols = [], startTimesSeconds = [], durationsSeconds = [] } = result.phonemes ?? {};
for (let i = 0; i < symbols.length; i++) {
  console.log(
    `${startTimesSeconds[i].toFixed(3)}s  ${durationsSeconds[i].toFixed(3)}s  ${symbols[i]}`,
  );
}

Mapping phonemes to visemes

A typical rendering pipeline maps each IPA-style symbol to a small set of mouth shapes (visemes), then interpolates between them to drive a 3D rig or 2D sprites.

// Minimal English IPA → viseme mapping (extend for your rig)
const PHONEME_TO_VISEME: Record<string, string> = {
  // Closed lip
  "p": "BMP",
  "b": "BMP",
  "m": "BMP",
  // Open vowel
  "ɑ": "Aa",
  "ʌ": "Aa",
  "ɐ": "Aa",
  // Wide smile
  "iː": "Ee",
  "i": "Ee",
  // Rounded
  "uː": "Oo",
  "u": "Oo",
  "o": "Oo",
  // Fricative
  "f": "FV",
  "v": "FV",
  "θ": "Th",
  // Silence
  "": "Rest",
};

interface VisemeKeyframe {
  time: number;
  duration: number;
  viseme: string;
}

function buildVisemeTrack(
  symbols: string[],
  starts: number[],
  durations: number[],
): VisemeKeyframe[] {
  return symbols.map((symbol, i) => ({
    time: starts[i],
    duration: durations[i],
    viseme: PHONEME_TO_VISEME[symbol] ?? "Rest",
  }));
}

In the render loop, advance the current audio time and look up the viseme that is active at that moment. Tween the viseme weights so the mouth shape does not snap from one pose to the next.
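The lookup step can be sketched in Python as follows. This is a minimal illustration, not part of the Supertone SDK: VisemeKeyframe mirrors the TypeScript interface above, active_viseme is a hypothetical helper name, and the track is assumed to be sorted by start time. The sample track values are made up.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class VisemeKeyframe:
    time: float      # start time in seconds
    duration: float  # length in seconds
    viseme: str


def active_viseme(track, audio_time, rest="Rest"):
    """Return the viseme whose [time, time + duration) window contains audio_time.

    Assumes `track` is sorted by `time`. Falls back to `rest` before the
    first keyframe and in any gap between keyframes.
    """
    # In a real render loop, build `starts` once rather than on every frame.
    starts = [k.time for k in track]
    i = bisect_right(starts, audio_time) - 1  # last keyframe starting at or before audio_time
    if i < 0:
        return rest
    k = track[i]
    if audio_time < k.time + k.duration:
        return k.viseme
    return rest


# Made-up track for illustration only.
track = [
    VisemeKeyframe(0.00, 0.10, "BMP"),
    VisemeKeyframe(0.10, 0.20, "Aa"),
    VisemeKeyframe(0.40, 0.10, "Ee"),
]
current = active_viseme(track, 0.05)
```

Binary search keeps the per-frame cost low on long utterances. Gaps between keyframes resolve to the rest pose, which matches the fallback used by buildVisemeTrack for unmapped symbols.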

Streaming phonemes in real time

Calling stream_speech with include_phonemes: true switches the response to NDJSON. Parse each line as it arrives to drive real-time lip sync.

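import base64
import json

# Continues from the Python example above: `client` and VOICE_ID are already defined.
response = client.text_to_speech.stream_speech(
    voice_id=VOICE_ID,
    text="Streaming lip sync in real time.",
    language="en",
    model="sona_speech_1",
    include_phonemes=True,
)

# Each non-empty line is one JSON object carrying an audio chunk and its phonemes.
for line in response.result.iter_lines():
    if not line:
        continue  # skip blank lines
    payload = json.loads(line)
    audio_chunk = base64.b64decode(payload["audio_base64"])
    # schedule_audio and schedule_phonemes are placeholders for your own
    # playback and animation code.
    schedule_audio(audio_chunk)
    schedule_phonemes(payload["phonemes"])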
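The per-line parsing can be isolated into a small function that is easy to test without a network call. The sketch below makes two assumptions that the text above does not confirm: that the streamed phonemes object uses the same field names as the non-streaming response (symbols, start_times_seconds, durations_seconds), and that a line may omit phonemes. parse_ndjson_line is a hypothetical helper name and the sample line is made up. Whether start times are relative to the chunk or to the whole utterance is not stated here, so check the Stream speech reference before adding any offset.

```python
import base64
import json


def parse_ndjson_line(line):
    """Decode one NDJSON line into (audio_bytes, [(symbol, start, duration), ...]).

    Field names inside "phonemes" are assumed to match the non-streaming
    response; adjust them if the streaming payload differs.
    """
    payload = json.loads(line)
    audio = base64.b64decode(payload["audio_base64"])
    phonemes = payload.get("phonemes") or {}
    events = list(
        zip(
            phonemes.get("symbols", []),
            phonemes.get("start_times_seconds", []),
            phonemes.get("durations_seconds", []),
        )
    )
    return audio, events


# Made-up sample line for illustration only.
sample = json.dumps(
    {
        "audio_base64": base64.b64encode(b"\x00\x01").decode("ascii"),
        "phonemes": {
            "symbols": ["h", "i"],
            "start_times_seconds": [0.0, 0.08],
            "durations_seconds": [0.08, 0.12],
        },
    }
)
audio, events = parse_ndjson_line(sample)
```

Returning plain tuples keeps the scheduling code independent of the wire format, so a change in the payload shape only touches this one function.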

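Tips

Use a model that supports phonemes. sona_speech_2, sona_speech_2_flash, and sona_speech_1 all support phonemes. supertonic_api_3 and supertonic_api_1 do not.
Smooth transitions. Real mouths do not snap between shapes. Most engines interpolate viseme weights over 50 to 80 ms. The phoneme durations returned by the API are a good starting point for these tweens.
Stress and pauses. An empty symbol value indicates silence or a pause. Return the mouth to its rest pose during these intervals.
Localize the mapping. Phoneme-to-viseme tables differ between languages. If you ship multilingual content, adjust the mapping for Korean and Japanese.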
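The smooth-transition tip can be sketched as a linear crossfade between the previous and the next viseme. This is one simple option under stated assumptions, not a prescribed method: crossfade is a hypothetical helper, the 0.06 s default is just a value inside the 50 to 80 ms range mentioned above, and capping the blend at the phoneme duration is a design choice made for this example so that very short phonemes still reach full weight.

```python
def crossfade(prev_viseme, next_viseme, elapsed, phoneme_duration, max_blend=0.06):
    """Return viseme weights `elapsed` seconds after `next_viseme` started.

    The blend window is the smaller of `max_blend` and the phoneme's own
    duration, so short phonemes finish blending before they end.
    """
    if prev_viseme == next_viseme:
        return {next_viseme: 1.0}
    blend = min(max_blend, phoneme_duration)
    if blend <= 0:
        t = 1.0  # zero-length phoneme: jump straight to the target pose
    else:
        t = min(max(elapsed / blend, 0.0), 1.0)  # clamp progress to [0, 1]
    return {prev_viseme: 1.0 - t, next_viseme: t}


# Halfway through a 60 ms blend from the rest pose into a closed-lip shape.
weights = crossfade("Rest", "BMP", elapsed=0.03, phoneme_duration=0.2)
```

The returned weights always sum to 1, so they can be fed directly to blend shapes or used as sprite opacities. An easing curve can replace the linear ramp if the motion looks too mechanical.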

Related documentation

Pronunciation and phonemes

Reference for include_phonemes and the response shape.

Stream speech

NDJSON streaming for real-time lip sync.
