Text-to-speech Guide
Learn how to generate voice from text using Supertone’s voice AI.
Overview
This document explains how to use Supertone’s Text-to-speech API to convert text into natural-sounding speech.
Supported Languages
The Supertone API provides voice models optimized for each language. It currently supports the following languages:
- Korean (ko)
- Japanese (ja)
- English (en)
Voice Selection Guide
1. Checking Available Voices
You can check available voice information in two ways.
Check by calling the Get Voices API
You can check the list of available voices by calling the Get Voices API, which returns a JSON response. Use the voice_id included in that response as a parameter when calling the Text-to-speech API.
Explore voices on Supertone Play
You can explore voices in Supertone Play and copy the settings of the ones you like.
Create content in Supertone Play by applying different voices and detailed settings, then pick a line you like. To copy its voice ID and voice settings, click the button at the top right of the line panel on the right side of the screen.
2. Understanding Voice Properties
Each voice has the following properties:
Age
A tag indicating the age range of the voice. Supertone API provides 4 age range tags.
child, young-adult, middle-aged, elder
Gender
A tag indicating the gender of the voice. Supertone API provides 2 gender tags.
male, female
Use Case
A tag indicating the recommended use case for the voice. Supertone API provides 6 use case tags.
advertisement, announcement, audiobook, documentary, education, game
Language
A tag indicating the optimized language for the voice. Supertone API provides 3 language tags.
ko, ja, en
Style
A tag indicating the unique emotion and tone of each voice. Since each voice has its own unique style value, it is recommended to check directly by calling the Get Voices API.
3. Selecting a Voice
- Filter voices by desired language
- Select voices whose use_case matches the purpose of your project
- Select voices with the desired age range and gender
- Copy the voice_id of the selected voice and use it when calling the API
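The selection steps above can be sketched as a small filter over voice records. This is only an illustration: apart from voice_id and use_case, the field names (`language`, `age`, `gender`), the record shape, and the sample IDs are assumptions, not the documented Get Voices response.

```python
from typing import Any, Dict, List, Optional

# Hypothetical records: field names and values are assumptions for
# illustration, not the documented Get Voices response.
SAMPLE_VOICES: List[Dict[str, Any]] = [
    {"voice_id": "voice-a", "language": ["ko", "en"], "use_case": "audiobook",
     "age": "young-adult", "gender": "female"},
    {"voice_id": "voice-b", "language": ["ja"], "use_case": "game",
     "age": "middle-aged", "gender": "male"},
    {"voice_id": "voice-c", "language": ["en"], "use_case": "audiobook",
     "age": "elder", "gender": "male"},
]


def _matches(value: Any, wanted: Optional[str]) -> bool:
    """Return True when no filter is set or the field equals/contains the tag."""
    if wanted is None:
        return True
    if isinstance(value, (list, tuple, set)):
        return wanted in value
    return value == wanted


def select_voice_ids(
    voices: List[Dict[str, Any]],
    language: Optional[str] = None,
    use_case: Optional[str] = None,
    age: Optional[str] = None,
    gender: Optional[str] = None,
) -> List[str]:
    """Return the voice_id of every voice that passes all the given filters."""
    wanted = {"language": language, "use_case": use_case,
              "age": age, "gender": gender}
    return [
        voice["voice_id"]
        for voice in voices
        if all(_matches(voice.get(field), tag) for field, tag in wanted.items())
    ]
```

For example, `select_voice_ids(SAMPLE_VOICES, language="en", use_case="audiobook")` keeps only the sample voices tagged for English audiobooks. Adjust the field names to match the actual response.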
API Call Guide
1. Text Input Requirements
- Maximum length: 200 characters including spaces
- Text requirements:
  - Text consisting only of spaces or punctuation marks cannot be used
  - Correct grammar and notation must be used for accurate pronunciation
  - Special characters like “\n” should not be included
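A client-side pre-check for these requirements might look like the sketch below. It makes three assumptions: the length limit counts Unicode code points, "punctuation" means the Unicode punctuation categories, and "special characters like \n" means control characters. The API may apply different rules, and the grammar requirement is not checked here.

```python
import unicodedata
from typing import List

MAX_CHARS = 200  # documented limit, spaces included


def validate_text(text: str) -> List[str]:
    """Return a list of problems found in the text; an empty list means it passed."""
    problems: List[str] = []

    # Assumption: the limit counts Unicode code points, as len() does.
    if len(text) > MAX_CHARS:
        problems.append(f"text is {len(text)} characters; the maximum is {MAX_CHARS}")

    # Empty text, or text made up only of whitespace and punctuation.
    if all(ch.isspace() or unicodedata.category(ch).startswith("P") for ch in text):
        problems.append("text contains only spaces or punctuation")

    # Assumption: "special characters like \n" means control characters
    # (Unicode category Cc), which covers newline, tab, and carriage return.
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        problems.append("text contains control characters such as a newline")

    return problems
```

Calling `validate_text` before each request lets a client reject bad input without spending an API call.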
2. Audio Output Format
- Specified by the query parameter output_format
- Supported formats: wav (default), mp3
3. Model Selection
Choose one of the following models using the body parameter model:
turbo
- Features: Medium quality, very fast latency
- Recommended use: Real-time conversation
pro
- Features: High quality, medium latency
- Recommended use: Content creation, when high-quality voice is needed
Key Metrics
Metric | turbo | pro |
---|---|---|
Voice Quality Score (NISQA) | 4.15 | 4.20 |
Average Response Time (Latency) | 50 characters: 820 ms; 100 characters: 1,000 ms | 50 characters: 1,500 ms; 100 characters: 2,300 ms |
Supported Languages | ko, ja, en | ko, ja, en |
Recommended Use Cases | Voice conversation services with AI | Content creation such as audiobooks and videos |
4. Voice Detailed Settings
pitch_shift
- Range: -24 ~ 24 (Default: 0.0)
- Description:
  - 1 unit = 1 semitone
  - Positive: Pitch increase (e.g., 10 = 5 tones up)
  - Negative: Pitch decrease (e.g., -10 = 5 tones down)
  - ±24 = 2 octave change
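To relate semitone values to actual pitch, the standard equal-temperament formula gives a frequency ratio of 2^(n/12) for a shift of n semitones. The helper below is a sketch of that arithmetic only. It assumes pitch_shift follows the usual semitone definition and says nothing about how the shift is implemented internally.

```python
def pitch_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of the given number of semitones.

    Uses the equal-temperament definition: 12 semitones double the frequency.
    """
    return 2.0 ** (semitones / 12.0)


# 12 semitones = 1 octave, 24 semitones = 2 octaves.
one_octave_up = pitch_ratio(12)      # 2.0
two_octaves_up = pitch_ratio(24)     # 4.0
two_octaves_down = pitch_ratio(-24)  # 0.25
```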
pitch_variance
- Range: 0 ~ 2 (Default: 1.0)
- Description:
  - 0: Minimum variation
  - 2: Maximum variation
  - A higher value means a more dynamic voice
speed
- Range: 0.5 ~ 2.0 (Default: 1.0)
- Description:
  - 1.0: Default speed
  - 2.0: Double speed
  - 0.5: Half speed
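The parameters described in this section can be checked against their documented ranges before a request is sent. The sketch below only assembles and validates the values and makes no network call. The body field names `text` and `voice_settings`, and the nesting of the three settings under `voice_settings`, are assumptions; only `model`, `output_format`, and the setting names and ranges come from this guide.

```python
from typing import Any, Dict, Tuple

# Ranges and defaults as documented in this guide.
SETTING_RANGES = {
    "pitch_shift": (-24.0, 24.0),
    "pitch_variance": (0.0, 2.0),
    "speed": (0.5, 2.0),
}
SETTING_DEFAULTS = {"pitch_shift": 0.0, "pitch_variance": 1.0, "speed": 1.0}
MODELS = ("turbo", "pro")
OUTPUT_FORMATS = ("wav", "mp3")


def build_tts_request(
    text: str, model: str, output_format: str = "wav", **settings: float
) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Validate the inputs and return (query parameters, request body)."""
    if model not in MODELS:
        raise ValueError(f"model must be one of {MODELS}, got {model!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"output_format must be one of {OUTPUT_FORMATS}")

    resolved = dict(SETTING_DEFAULTS)
    for name, value in settings.items():
        if name not in SETTING_RANGES:
            raise ValueError(f"unknown setting: {name}")
        low, high = SETTING_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
        resolved[name] = float(value)

    query = {"output_format": output_format}
    # Assumption: "text" and "voice_settings" are hypothetical field names.
    body = {"text": text, "model": model, "voice_settings": resolved}
    return query, body
```

A call such as `build_tts_request("Hello", model="pro", speed=1.2)` returns the query parameters and body with unspecified settings filled in from the documented defaults. Out-of-range values raise `ValueError` instead of being silently clamped.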
Generated Results Guide
1. File Name Structure
Example
2024-08-28_05-02-53_arin_ko_happy_gv0_av15_ps0_pv100_s100.mp3
Components
- Generation time: 2024-08-28_05-02-53 (YYYY-MM-DD_HH-mm-ss)
- Voice name: arin
- Language: ko
- Style: happy
- Pitch shift: ps0
- Pitch variance range: pv100 (100 = 1.0)
- Speech speed: s100 (100 = 1.0)
- File format: .mp3
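The file name structure can be parsed back into its components, as in the sketch below. It assumes that voice names, language codes, and styles never contain underscores, and it uses a timezone-naive timestamp. The gv and av segments appear in the example but are not described above, so they are returned as raw strings. The pitch shift number is returned as written because its scale is not stated.

```python
import re
from datetime import datetime
from typing import Any, Dict

_FILENAME = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})"
    r"_(?P<voice>[^_]+)_(?P<language>[^_]+)_(?P<style>[^_]+)"
    r"_(?P<gv>gv[^_]+)_(?P<av>av[^_]+)"
    r"_ps(?P<ps>-?\d+)_pv(?P<pv>\d+)_s(?P<s>\d+)"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_generated_filename(filename: str) -> Dict[str, Any]:
    """Split a generated file name into its documented components."""
    match = _FILENAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename!r}")
    return {
        "generated_at": datetime.strptime(match["stamp"], "%Y-%m-%d_%H-%M-%S"),
        "voice": match["voice"],
        "language": match["language"],
        "style": match["style"],
        "gv": match["gv"],  # undocumented segment, kept as-is
        "av": match["av"],  # undocumented segment, kept as-is
        "pitch_shift_raw": int(match["ps"]),       # scale not stated in the guide
        "pitch_variance": int(match["pv"]) / 100,  # 100 = 1.0
        "speed": int(match["s"]) / 100,            # 100 = 1.0
        "format": match["ext"],
    }


info = parse_generated_filename(
    "2024-08-28_05-02-53_arin_ko_happy_gv0_av15_ps0_pv100_s100.mp3"
)
```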
2. Notes
Because the voice is generated with machine learning technology, the same settings may not produce completely identical results, and quality may vary between generations.