Pronunciation and Phonemes - Supertone API Documentation

This document was automatically translated from the English original. Some phrasing may be awkward or ambiguous, so check the English original for the exact wording.

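The Supertone API can return phoneme data along with the audio. A phoneme is an individual unit of sound spoken by the model, and each one comes with its start time and duration. You can use this data to drive lip-sync in games and animation, karaoke-style word highlighting, pronunciation analysis, and similar tasks. To enable it, set include_phonemes: true in your TTS request.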

Supported on sona_speech_2, sona_speech_2_flash, and sona_speech_1. Not supported on supertonic_api_3 or supertonic_api_1.

Usage

Python
TypeScript
cURL

import base64
import os
from supertone import Supertone

VOICE_ID = "20160a4c5ba38967330c84"  # replace with your voice ID

with Supertone(api_key=os.environ["SUPERTONE_API_KEY"]) as client:
    response = client.text_to_speech.create_speech(
        voice_id=VOICE_ID,
        text="Hello, world.",
        language="en",
        include_phonemes=True,
    )

    result = response.result
    with open("speech.wav", "wb") as f:
        f.write(base64.b64decode(result.audio_base64))

    for symbol, start, duration in zip(
        result.phonemes.symbols,
        result.phonemes.start_times_seconds,
        result.phonemes.durations_seconds,
    ):
        print(f"{symbol!r} at {start:.3f}s for {duration:.3f}s")

import { Supertone } from "@supertone/supertone";
import * as fs from "node:fs";

const VOICE_ID = "20160a4c5ba38967330c84"; // replace with your voice ID

const client = new Supertone({ apiKey: process.env.SUPERTONE_API_KEY });

const response = await client.textToSpeech.createSpeech({
  voiceId: VOICE_ID,
  apiConvertTextToSpeechUsingCharacterRequest: {
    text: "Hello, world.",
    language: "en",
    includePhonemes: true,
  },
});

const result = response.result as {
  audioBase64: string;
  phonemes?: {
    symbols?: string[];
    startTimesSeconds?: number[];
    durationsSeconds?: number[];
  };
};

fs.writeFileSync("speech.wav", Buffer.from(result.audioBase64, "base64"));

const symbols = result.phonemes?.symbols ?? [];
const starts = result.phonemes?.startTimesSeconds ?? [];
const durations = result.phonemes?.durationsSeconds ?? [];

for (let i = 0; i < symbols.length; i++) {
  console.log(`${symbols[i]} at ${starts[i].toFixed(3)}s for ${durations[i].toFixed(3)}s`);
}

VOICE_ID="20160a4c5ba38967330c84"

curl -X POST "https://supertoneapi.com/v1/text-to-speech/$VOICE_ID" \
  -H "x-sup-api-key: $SUPERTONE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world.",
    "language": "en",
    "include_phonemes": true
  }'

The response is JSON, not binary audio.

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}

Response structure

Field	Description
`audio_base64`	Base64-encoded audio in the requested `output_format` (`wav` or `mp3`).
`phonemes.symbols`	Phoneme symbols in IPA-style notation. An empty string represents silence or a pause.
`phonemes.start_times_seconds`	Start time of each symbol within the clip, in seconds.
`phonemes.durations_seconds`	Duration of each symbol, in seconds.

The three phoneme arrays are aligned by index: symbols[i], start_times_seconds[i], and durations_seconds[i] all describe the same phoneme.
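Because the arrays are parallel, it is often convenient to zip them into one record per phoneme before further processing. The following is a minimal sketch, not part of the Supertone SDK: `PhonemeEvent` and `to_events` are hypothetical helper names, and the input is the `phonemes` object from the sample response shown earlier, already parsed into a plain dict.

```python
from dataclasses import dataclass


@dataclass
class PhonemeEvent:
    """One phoneme record (hypothetical helper type, not an SDK class)."""
    symbol: str      # IPA-style symbol; "" means silence or a pause
    start: float     # seconds from the start of the clip
    duration: float  # seconds

    @property
    def end(self) -> float:
        return self.start + self.duration


def to_events(phonemes: dict) -> list[PhonemeEvent]:
    """Zip the three parallel arrays into one record per phoneme."""
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    # The arrays are documented as aligned; fail loudly if they are not.
    if not (len(symbols) == len(starts) == len(durations)):
        raise ValueError("phoneme arrays have different lengths")
    return [PhonemeEvent(s, t, d) for s, t, d in zip(symbols, starts, durations)]


# The "phonemes" object from the sample response.
sample = {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162],
}

events = to_events(sample)
spoken = [e for e in events if e.symbol]  # drop silences and pauses
for e in spoken:
    print(f"{e.symbol!r}: {e.start:.3f}s to {e.end:.3f}s")
```

Filtering on the empty string is a simple way to skip silences when only spoken sounds matter, for example when counting phonemes.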

Streaming with phonemes

Calling stream_speech with include_phonemes: true switches the response to NDJSON (newline-delimited JSON). Each line is a chunk that carries its own audio_base64 and phonemes data.

{"audio_base64":"...","phonemes":{"symbols":["","h"],"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}
{"audio_base64":"...","phonemes":{"symbols":["ɐ","ɡ"],"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}

You can parse each line as it arrives to drive a lip-sync renderer in real time.
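Line-by-line parsing can be sketched as follows. This is an illustration under stated assumptions rather than SDK code: `iter_chunks` is a hypothetical helper that accepts any iterable of text lines (for example, lines read from a streaming HTTP response), and the sample stream uses placeholder audio bytes instead of real audio. In the sample chunks above, the second chunk's first start time (0.13) equals the end of the first chunk's last phoneme (0.05 + 0.08), which suggests that times are measured from the start of the stream, although the text does not state this explicitly.

```python
import base64
import json
from typing import Iterable, Iterator

Event = tuple[str, float, float]  # (symbol, start_seconds, duration_seconds)


def iter_chunks(lines: Iterable[str]) -> Iterator[tuple[bytes, list[Event]]]:
    """Yield (audio_bytes, phoneme_events) for each NDJSON line."""
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines between records
            continue
        chunk = json.loads(line)
        audio = base64.b64decode(chunk["audio_base64"])
        phonemes = chunk.get("phonemes") or {}
        events = list(zip(
            phonemes.get("symbols", []),
            phonemes.get("start_times_seconds", []),
            phonemes.get("durations_seconds", []),
        ))
        yield audio, events


# Made-up sample stream: "AAAA" decodes to three placeholder bytes, not real audio.
sample_stream = [
    '{"audio_base64":"AAAA","phonemes":{"symbols":["","h"],'
    '"start_times_seconds":[0,0.05],"durations_seconds":[0.05,0.08]}}',
    '{"audio_base64":"AAAA","phonemes":{"symbols":["ɐ","ɡ"],'
    '"start_times_seconds":[0.13,0.19],"durations_seconds":[0.06,0.04]}}',
]

received = []
for audio, events in iter_chunks(sample_stream):
    received.append((len(audio), events))
    for symbol, start, duration in events:
        print(f"{symbol!r} at {start:.2f}s for {duration:.2f}s")
```

Writing the helper as a generator means each chunk can be handed to the audio player and the renderer as soon as its line is complete, without waiting for the rest of the stream.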

Use cases

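Lip-sync for games and animation. Map each phoneme to a viseme (mouth shape) and play the visemes back in time with the audio. Most engines ship a default phoneme-to-viseme mapping table, and because Supertone's symbols follow a standard IPA style, they are compatible with most rigs.
Karaoke / word highlighting. Use the phoneme start times to highlight each word at the moment it is spoken.
Pronunciation analysis. Compare the actual phonemes against an expected sequence to check pronunciation in language-learning apps.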
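As an illustration of the lip-sync case, the sketch below turns the phoneme arrays into a viseme track and looks up the active mouth shape at a given playback time. Everything here is an assumption made for the example: the `VISEME_BY_PHONEME` table and the viseme names are invented, and a real project would use the mapping table that ships with its engine or rig.

```python
# Hypothetical phoneme-to-viseme table; real engines provide their own.
VISEME_BY_PHONEME = {
    "": "rest",          # silence or pause
    "h": "open",
    "ɐ": "open",
    "ʌ": "open",
    "ɡ": "closed_back",
}


def to_viseme_track(symbols, starts, durations, default="rest"):
    """Return [(viseme, start, end)], merging consecutive identical visemes."""
    track = []
    for symbol, start, duration in zip(symbols, starts, durations):
        viseme = VISEME_BY_PHONEME.get(symbol, default)
        end = start + duration
        if track and track[-1][0] == viseme:
            # Same mouth shape as the previous segment: extend it.
            track[-1] = (viseme, track[-1][1], end)
        else:
            track.append((viseme, start, end))
    return track


def viseme_at(track, t, default="rest"):
    """Return the viseme active at playback time t (seconds)."""
    for viseme, start, end in track:
        if start <= t < end:
            return viseme
    return default


# Phoneme arrays from the sample response.
symbols = ["", "h", "ɐ", "ɡ", "ʌ", ""]
starts = [0, 0.092, 0.197, 0.255, 0.29, 0.58]
durations = [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]

track = to_viseme_track(symbols, starts, durations)
for viseme, start, end in track:
    print(f"{viseme:12s} {start:.3f}s to {end:.3f}s")
```

Merging adjacent identical visemes keeps the track short and avoids retriggering the same mouth shape; a renderer would call `viseme_at` once per frame with the current audio position.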

For an end-to-end example, see Generating Phonemes for Lip-Sync.

Related documentation

Lip-sync example

Build a phoneme-to-viseme pipeline.

Normalized text

Improve the pronunciation of ambiguous input.