Create speech

Endpoint

https://supertoneapi.com/v1/text-to-speech/{voice_id}

Path parameters

Name	Required	Description
`voice_id`	Yes	ID of the target voice.

Request body

Name	Required	Description
`text`	Yes	Text to convert (maximum 300 characters).
`language`	Yes	Language code. Supported: `en`, `ko`, `ja`.
`style`	No	Emotional style, e.g. `neutral`, `happy`, `sad`. If omitted, the character's default style is used.
`model`	No	TTS model. Default: `sona_speech_1`.
`output_format`	No	Output format. Options: `wav`, `mp3`. Default: `wav`.
`voice_settings`	No	Advanced voice parameters (see below).
`include_phonemes`	No	If `true`, returns phoneme timing data along with the audio (Base64-encoded). Default: `false`.
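The request body fields above can be assembled as in the following minimal sketch. It only builds the request and does not send it. The helper name `build_tts_request` is hypothetical, the `x-sup-api-key` header is the one listed under Authorizations, and the use of `POST` is an assumption, since this page does not state the HTTP method.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://supertoneapi.com/v1/text-to-speech"


def build_tts_request(voice_id, api_key, text, language, **optional):
    """Build (but do not send) a Create speech request.

    `optional` may carry style, model, output_format, voice_settings,
    and include_phonemes; fields set to None are left out of the body.
    """
    if len(text) > 300:
        raise ValueError("text must be at most 300 characters")
    if language not in ("en", "ko", "ja"):
        raise ValueError("language must be one of: en, ko, ja")

    body = {"text": text, "language": language}
    body.update({k: v for k, v in optional.items() if v is not None})

    return urllib.request.Request(
        url=f"{BASE_URL}/{urllib.parse.quote(voice_id, safe='')}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
        method="POST",  # assumption: the method is not stated on this page
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; with `include_phonemes` left at its default, the response body should be the raw audio bytes.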

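Voice settings (optional)

Name	Range	Default	Description
`pitch_shift`	-24 → 24	0	Pitch adjustment in semitones.
`pitch_variance`	0 → 2	1	Degree of pitch variation.
`speed`	0.5 → 2	1	Uniformly speeds up or slows down the generated audio (ratio).
`duration`	0 → 60	0	If set, speech is generated to fit this length (seconds).
`similarity`	1 → 5	3	Controls how closely the generated speech resembles the original character voice.
`text_guidance`	0 → 4	1	Controls how sensitively the speech characteristics adapt to the text content.
`subharmonic_amplitude_control`	0 → 2	1	Controls the amount of subharmonic amplitude in the generated speech.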
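A client may want to check `voice_settings` against the ranges in the table above before sending a request. The sketch below is one way to do that; the helper name `validate_voice_settings` is hypothetical, and treating both ends of each range as inclusive is an assumption.

```python
# Ranges copied from the voice settings table: name -> (min, max).
VOICE_SETTING_RANGES = {
    "pitch_shift": (-24, 24),
    "pitch_variance": (0, 2),
    "speed": (0.5, 2),
    "duration": (0, 60),
    "similarity": (1, 5),
    "text_guidance": (0, 4),
    "subharmonic_amplitude_control": (0, 2),
}


def validate_voice_settings(settings):
    """Return a list of problems; an empty list means every value is in range."""
    problems = []
    for name, value in settings.items():
        if name not in VOICE_SETTING_RANGES:
            problems.append(f"{name}: unknown setting")
            continue
        low, high = VOICE_SETTING_RANGES[name]
        # Assumption: the documented bounds are inclusive.
        if not low <= value <= high:
            problems.append(f"{name}: {value} is outside {low}..{high}")
    return problems
```

For example, `validate_voice_settings({"speed": 3})` reports that `speed` is outside `0.5..2`, while `{"speed": 1.5, "pitch_shift": -3}` produces no problems.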

Response

Returns one of the following, depending on the value of `include_phonemes`.

Binary audio (default, and when `include_phonemes=false`)
`audio/wav`: raw WAV file.
`audio/mpeg`: raw MP3 file.

JSON with phoneme data (when `include_phonemes=true`)

{
  "audio_base64": "UklGRnoGAABXQVZF...",
  "phonemes": {
    "symbols": ["", "h", "ɐ", "ɡ", "ʌ", ""],
    "start_times_seconds": [0, 0.092, 0.197, 0.255, 0.29, 0.58],
    "durations_seconds": [0.092, 0.104, 0.058, 0.034, 0.29, 0.162]
  }
}
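The JSON shape above can be unpacked as in the following sketch, which decodes the Base64 audio and pairs each phoneme symbol with its start time and duration. The function name `parse_phoneme_response` is hypothetical, and the check that the three arrays have equal length is an added safeguard rather than a documented guarantee.

```python
import base64
import json


def parse_phoneme_response(raw_json):
    """Return (audio_bytes, timeline) from a phoneme-enabled response body.

    timeline is a list of (symbol, start_seconds, duration_seconds) tuples.
    """
    payload = json.loads(raw_json)
    audio = base64.b64decode(payload["audio_base64"])

    phonemes = payload["phonemes"]
    symbols = phonemes["symbols"]
    starts = phonemes["start_times_seconds"]
    durations = phonemes["durations_seconds"]
    if not len(symbols) == len(starts) == len(durations):
        raise ValueError("phoneme arrays have different lengths")

    return audio, list(zip(symbols, starts, durations))
```

The decoded bytes can then be written to a `.wav` or `.mp3` file according to the requested `output_format`.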

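Notes

A `text` longer than 300 characters results in a 400 error.
`speed` is applied after `duration` (e.g. `duration` = 5 seconds, `speed` = 2× → final audio ≈ 10 seconds).
`style` may be omitted, but the default style can differ from character to character. Check the default style with the Get Voices API (the first value in the styles array is the default).
The audio in the response can be saved or played directly (some clients may need additional handling).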
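Because longer input is rejected with a 400 error, a client with more than 300 characters of text has to split it and call the endpoint once per piece. Splitting is not a feature of the API itself; the sketch below is one possible client-side approach. The helper name `split_text` is hypothetical, and counting characters with Python's `len` is an assumption about how the service measures length.

```python
def split_text(text, limit=300):
    """Split text on whitespace into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size pieces.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Text without spaces, which is common in Japanese, falls through to the fixed-size cut, so splitting on sentence punctuation may give more natural results for such input.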

Authorizations

Name	Type	In	Required
`x-sup-api-key`	string	header	Yes

Path Parameters

Name	Type	Required
`voice_id`	string	Yes

Body

Content type: application/json

Name	Type	Required	Default	Description
`text`	string	Yes	-	The text to convert to speech. Maximum length: 300.
`language`	enum<string>	Yes	-	The language code of the text. Available options: `en`, `ko`, `ja`.
`style`	string	No	-	The style of character to use for the text-to-speech conversion.
`model`	string	No	`sona_speech_1`	The model type to use for the text-to-speech conversion.
`output_format`	enum<string>	No	`wav`	The desired output format of the audio file. Available options: `wav`, `mp3`.
`voice_settings`	object	No	-	-
`include_phonemes`	boolean	No	`false`	Return phoneme timing data with the audio.

Response

Returns either binary audio or JSON with phoneme data based on include_phonemes parameter

Binary audio file (when include_phonemes=false or omitted)
