Stream speech

Endpoint

https://supertoneapi.com/v1/text-to-speech/{voice_id}/stream

Path parameters

Name	Required	Description
`voice_id`	Yes	ID of the target voice.

Request body

Name	Required	Description
`text`	Yes	Text to convert (maximum 300 characters).
`language`	Yes	Language code. Supported: `en`, `ko`, `ja`.
`style`	No	Emotional style, for example `neutral`, `happy`, or `sad`. If omitted, the character's default style is applied.
`model`	No	TTS model. Default: `sona_speech_1`.
`output_format`	No	Output format. Options: `wav`, `mp3`. Default: `wav`.
`voice_settings`	No	Advanced voice parameters (see below).
`include_phonemes`	No	If `true`, returns phoneme timing data along with the audio (Base64-encoded). Default: `false`.
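
As a rough illustration of how the fields above fit together, the following Python sketch builds a request for the stream endpoint and writes the streamed audio to a file. It assumes the endpoint accepts `POST` with a JSON body (the HTTP method is not stated on this page) and that the API key is sent in the `x-sup-api-key` header. The helper names `build_stream_request` and `save_stream` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://supertoneapi.com/v1/text-to-speech"
MAX_TEXT_LENGTH = 300  # limit stated in the request body table


def build_stream_request(voice_id, api_key, text, language, **optional):
    """Build (but do not send) a request for the stream endpoint.

    Assumption: the endpoint accepts POST with a JSON body.
    """
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    body = {"text": text, "language": language}
    # Optional fields (style, model, output_format, ...) are sent only when set.
    body.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=f"{BASE_URL}/{voice_id}/stream",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-sup-api-key": api_key},
        method="POST",
    )


def save_stream(request, path, chunk_size=4096):
    """Send the request and write the audio stream to a file chunk by chunk."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(chunk_size):
            out.write(chunk)
```

A call such as `save_stream(build_stream_request(voice_id, api_key, "Hello", "en"), "out.wav")` would then store the default `wav` output, provided the assumptions above hold.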

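Voice settings (optional)

Name	Range	Default	Description
`pitch_shift`	-24 → 24	0	Pitch adjustment in semitones.
`pitch_variance`	0 → 2	1	Degree of pitch variation.
`speed`	0.5 → 2	1	Uniformly speeds up or slows down the generated audio (ratio).
`duration`	0 → 60	0	When set, speech is generated to fit the given length (seconds).
`similarity`	1 → 5	3	Controls how closely the generated speech resembles the original character voice.
`text_guidance`	0 → 4	1	Controls how sensitively the speech characteristics adapt to the text content.
`subharmonic_amplitude_control`	0 → 2	1	Controls the amount of subharmonic amplitude in the generated speech.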
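
A client may want to check `voice_settings` before sending a request. The sketch below copies the ranges from the table above and assumes the bounds are inclusive, which this page does not state explicitly. The function name `validate_voice_settings` is hypothetical.

```python
# Ranges copied from the voice settings table; bounds assumed inclusive.
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def validate_voice_settings(settings):
    """Return a copy of settings; raise ValueError on unknown or out-of-range values."""
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            raise ValueError(f"unknown voice setting: {name}")
        low, high = VOICE_SETTING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
    return dict(settings)
```

The returned dictionary can be passed as the `voice_settings` field of the request body.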

Response

Returns one of the following, depending on the value of `include_phonemes`.

Audio stream (default, and when `include_phonemes=false`)
`audio/wav`: binary audio stream.
`audio/mpeg`: binary audio stream.

NDJSON stream with phoneme data (when `include_phonemes=true`)
The response is streamed as newline-delimited JSON (NDJSON).

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
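
One way to consume the NDJSON variant is sketched below in Python. It assumes each line of the stream is one complete JSON object shaped like the example above, with `audio_base64` holding Base64-encoded audio and the three `phonemes` arrays having equal length. The function name `parse_phoneme_stream` is hypothetical.

```python
import base64
import json


def parse_phoneme_stream(lines):
    """Yield (audio_bytes, timings) for each NDJSON line.

    timings is a list of (symbol, start_seconds, duration_seconds) tuples.
    """
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between objects
        obj = json.loads(raw)
        audio = base64.b64decode(obj["audio_base64"])
        p = obj["phonemes"]
        timings = list(
            zip(p["symbols"], p["start_times_seconds"], p["durations_seconds"])
        )
        yield audio, timings
```

Because the function accepts any iterable of `str` or `bytes` lines, a response object returned by `urllib.request.urlopen` can be passed to it directly.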

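Notes

Stream speech is a beta feature and is currently supported only with the `sona_speech_1` model.
If `text` exceeds 300 characters, a 400 error is returned.
`speed` is applied after `duration`. (Example: `duration` = 5 seconds, `speed` = 2× → final audio ≈ 10 seconds)
You can call the API without specifying `style`, but the default style may differ from character to character. Check the default in the Get Voices API (the first value in the styles array is the default).
The returned audio can be saved to a file or played immediately. (Appropriate handling may be required depending on the client.)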
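
Since text longer than 300 characters is rejected, a client might split long input before calling the endpoint. The sketch below is one possible approach, not part of the API. It breaks on whitespace, hard-cuts any single word longer than the limit, and assumes the limit is counted in Python string characters (Unicode code points). The function name `split_text` is hypothetical.

```python
def split_text(text, limit=300):
    """Split text into pieces of at most `limit` characters, breaking on whitespace."""
    pieces, current = [], ""
    for word in text.split():
        # A single word longer than the limit is hard-cut into limit-sized parts.
        while len(word) > limit:
            if current:
                pieces.append(current)
                current = ""
            pieces.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces
```

Text written without spaces, as is common in Japanese, falls into the hard-cut path, so splitting on sentence punctuation would likely give better results there.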

Authorizations

Name	Type	In	Required
`x-sup-api-key`	string	header	Yes

Path Parameters

Name	Type	Required
`voice_id`	string	Yes

Body

application/json

Name	Type	Required	Default	Description
`text`	string	Yes	-	The text to convert to speech. Maximum string length: 300.
`language`	enum<string>	Yes	-	The language code of the text. Available options: `en`, `ko`, `ja`, `bg`, `cs`, `da`, `el`, `es`, `et`, `fi`, `hu`, `it`, `nl`, `pl`, `pt`, `ro`, `ar`, `de`, `fr`, `hi`, `id`, `ru`, `vi`.
`style`	string	No	-	The style of character to use for the text-to-speech conversion.
`model`	enum<string>	No	`sona_speech_1`	The model type to use for the text-to-speech conversion. Available options: `sona_speech_1`, `sona_speech_2`, `sona_speech_2_flash`, `sona_speech_2t`, `supertonic_api_1`.
`output_format`	enum<string>	No	`wav`	The desired output format of the audio file. Available options: `wav`, `mp3`.
`voice_settings`	object	No	-	Advanced voice parameters; see the voice settings table above.
`include_phonemes`	boolean	No	`false`	Return phoneme timing data with the audio.
`normalized_text`	string	No	-	Pre-normalized text for TTS. Only used with the `sona_speech_2` and `sona_speech_2_flash` models.

Response

Streaming audio data in binary format, or in NDJSON format with phoneme data, depending on the `include_phonemes` parameter.

Binary audio stream (when `include_phonemes=false` or omitted)
