POST /v1/predict-duration/{voice_id}
Predict text-to-speech duration
curl --request POST \
  --url https://supertoneapi.com/v1/predict-duration/{voice_id} \
  --header 'Content-Type: application/json' \
  --header 'x-sup-api-key: <api-key>' \
  --data '{
  "text": "<string>",
  "language": "en",
  "style": "<string>",
  "model": "sona_speech_1",
  "output_format": "wav",
  "voice_settings": {
    "pitch_shift": 0,
    "pitch_variance": 1,
    "speed": 1,
    "duration": 0,
    "similarity": 3,
    "text_guidance": 1,
    "subharmonic_amplitude_control": 1
  }
}'
{
  "duration": 123
}
This API does not generate speech. It returns only the predicted length of the audio, in seconds, for the input text.
It is useful for estimating credit consumption or for adjusting text length before making TTS calls.
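One way to adjust text length before a TTS call is to split long input into pieces that fit the 300-character limit and sum the predicted durations. The following is a minimal sketch, not part of the API: `split_text` and `estimate_total_duration` are hypothetical helper names, sentence splitting on `.`, `!`, `?` followed by whitespace is an assumption that suits English better than Korean or Japanese, and `predict` stands for any function that returns the predicted duration of one chunk. Whether per-chunk predictions add up exactly to the length of the combined audio is not stated in this document, so treat the sum as an estimate.

```python
import re

MAX_TEXT_LENGTH = 300  # limit stated for the "text" field


def split_text(text, limit=MAX_TEXT_LENGTH):
    """Split text into chunks of at most `limit` characters.

    Prefers sentence boundaries; a single sentence longer than
    `limit` is cut at the limit. Sentences packed into one chunk
    are joined with a single space.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split any sentence that cannot fit in one chunk.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_total_duration(chunks, predict):
    """Sum the predicted duration (seconds) of each chunk.

    `predict` is any callable taking a chunk of text and returning
    a number, for example a wrapper around the Predict Duration API.
    """
    return sum(predict(chunk) for chunk in chunks)
```

Passing the prediction function in as an argument keeps the chunking logic testable without network access.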

Endpoint

https://supertoneapi.com/v1/predict-duration/{voice_id}

Notes

  • The calling method and request body are almost identical to those of the text-to-speech API.
  • Only the duration value is returned, not audio.
  • No credits are consumed when calling the Predict Duration API, because no speech is generated.
  • The predicted duration is very close to the length of the audio produced by an actual text-to-speech call with the same text.
  • Because voice_settings.speed changes the length, keep the speech speed fixed when comparing predictions.
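The last note can be illustrated with a small sketch that pins voice_settings.speed before comparing predictions for different texts. `with_fixed_speed` is a hypothetical helper, not part of the API. The value 1 is taken from the sample request at the top of this page; whether it is also the API default is an assumption.

```python
import copy


def with_fixed_speed(body, speed=1):
    """Return a copy of a request body with voice_settings.speed pinned.

    The input dict is left unchanged so the same base body can be
    reused for several candidate texts.
    """
    pinned = copy.deepcopy(body)
    pinned.setdefault("voice_settings", {})["speed"] = speed
    return pinned


# Two candidate texts compared under the same speed setting.
base = {"language": "en", "voice_settings": {"pitch_shift": 0, "speed": 1.3}}
candidates = [
    with_fixed_speed({**base, "text": "Short version."}),
    with_fixed_speed({**base, "text": "A somewhat longer version of the line."}),
]
```

With the speed pinned, any difference between the predicted durations comes from the text rather than from the speed setting.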

Request Body

Item           | Required | Description
text           | Yes      | Text to analyze. Maximum 300 characters
language       | Yes      | Text language. One of ko, en, ja
style          | No       | Emotional style. Default style is used if not specified
model          | No       | Default is sona_speech_1. Currently only this model is available
voice_settings | No       | Speech speed or pitch adjustment values. May affect result length
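The constraints in the table can be checked on the client before sending a request. The sketch below is an illustration under assumptions: `build_predict_body` is a hypothetical helper, and `len(text)` counts Unicode code points, which may differ from how the server counts characters.

```python
ALLOWED_LANGUAGES = {"ko", "en", "ja"}
MAX_TEXT_LENGTH = 300


def build_predict_body(text, language, style=None, model=None, voice_settings=None):
    """Validate the required fields and return a JSON-ready dict.

    Optional fields are included only when given, so the server
    applies its own defaults for anything omitted.
    """
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters: {len(text)}")
    if language not in ALLOWED_LANGUAGES:
        raise ValueError(f"language must be one of {sorted(ALLOWED_LANGUAGES)}")

    body = {"text": text, "language": language}
    if style is not None:
        body["style"] = style
    if model is not None:
        body["model"] = model
    if voice_settings is not None:
        body["voice_settings"] = voice_settings
    return body
```

Failing early on the client avoids a round trip for input the API is documented to reject.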

Request Example

POST /v1/predict-duration/{voice_id}
Content-Type: application/json
x-sup-api-key: [YOUR_API_KEY]

{
  "text": "This is a long-form sentence for duration prediction.",
  "language": "en",
  "style": "neutral"
}

Response Example

{
  "duration": 3.57381983
}
This means that generating speech from this text would produce approximately 3.57 seconds of audio.
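The request and response above could be handled from Python roughly as follows. This is a sketch using only the standard library. The function names are hypothetical, the 10-second timeout is an arbitrary choice, and error responses are not handled because their format is not described here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://supertoneapi.com/v1/predict-duration"


def make_request(voice_id, api_key, body):
    """Build (but do not send) the POST request for one prediction."""
    url = f"{BASE_URL}/{urllib.parse.quote(voice_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-sup-api-key": api_key,
        },
    )


def parse_duration(raw):
    """Extract the predicted duration in seconds from a response body."""
    return float(json.loads(raw)["duration"])


def predict_duration(voice_id, api_key, body, timeout=10.0):
    """Send the request and return the predicted duration in seconds."""
    request = make_request(voice_id, api_key, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_duration(response.read())
```

A call such as `predict_duration(voice_id, api_key, {"text": "...", "language": "en"})` needs a real voice ID and API key; both are placeholders here.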

Authorizations

x-sup-api-key
string
header
required

Path Parameters

voice_id
string
required

Body

application/json
text
string
required

The text to convert to speech. Max length is 300 characters.

Maximum length: 300
language
enum<string>
required

Language code of the voice

Available options:
en,
ko,
ja
style
string

The style of character to use for the text-to-speech conversion

model
string
default:sona_speech_1

The model type to use for the text-to-speech conversion

output_format
enum<string>
default:wav

The desired output format of the audio file (wav, mp3). Default is wav.

Available options:
wav,
mp3
voice_settings
object

Response

Returns predicted duration of the audio in seconds

duration
number