Overview

This document explains how to use Supertone’s Text-to-Speech API, which converts text into natural-sounding speech.

Supported Languages

The Supertone API provides voice models optimized for each language. It currently supports the following languages:

  1. Korean (ko)
  2. Japanese (ja)
  3. English (en)

Voice Selection Guide

1. Checking Available Voices

You can check available voice information in two ways.

Check by calling the Get Voices API

Calling the Get Voices API returns the list of available voices as a JSON response in the following format. Pass the voice_id from this response as a parameter when you call the Text-to-Speech API.

{
    "voices": [
        {
            "voice_id": "54CyP2zU9HCeLVCpzDRFPi",
            "name": "Yoonho",
            "description": "Yoonho is a sarcastic and indifferent teenager. He doesn't express his emotions well.",
            "age": "young-adult",
            "gender": "male",
            "use_case": "game",
            "language": "ko",
            "style": "blank_high"
        }
    ]
}
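As a minimal sketch of reading this response, the Python snippet below parses the sample payload shown above and builds a lookup from voice name to voice_id. The function name index_voices_by_name is hypothetical, and the snippet works on the sample text only; it does not call the API.

```python
import json

# Sample payload copied from the response format shown above.
SAMPLE_RESPONSE = """
{
    "voices": [
        {
            "voice_id": "54CyP2zU9HCeLVCpzDRFPi",
            "name": "Yoonho",
            "description": "Yoonho is a sarcastic and indifferent teenager. He doesn't express his emotions well.",
            "age": "young-adult",
            "gender": "male",
            "use_case": "game",
            "language": "ko",
            "style": "blank_high"
        }
    ]
}
"""


def index_voices_by_name(response_text: str) -> dict:
    """Map each voice name in a Get Voices response to its voice_id.

    Hypothetical helper for illustration; not part of the Supertone API.
    """
    payload = json.loads(response_text)
    return {voice["name"]: voice["voice_id"] for voice in payload["voices"]}


voice_ids = index_voices_by_name(SAMPLE_RESPONSE)
print(voice_ids["Yoonho"])  # 54CyP2zU9HCeLVCpzDRFPi
```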

Explore voices on Supertone Play

You can explore voices in Supertone Play and reuse the settings of the ones you like.

Create content in Supertone Play with different voices and detailed settings, then pick the line you want to reuse. Clicking the button at the top right of the line panel, on the right side of the screen, copies the voice ID and the voice settings.

2. Understanding Voice Properties

Each voice has the following properties:

Age

A tag indicating the age range of the voice. The Supertone API provides 4 age range tags:

child, young-adult, middle-aged, elder

Gender

A tag indicating the gender of the voice. The Supertone API provides 2 gender tags:

male, female

Use Case

A tag indicating the recommended use case for the voice. The Supertone API provides 6 use case tags:

advertisement, announcement, audiobook, documentary, education, game

Language

A tag indicating the language the voice is optimized for. The Supertone API provides 3 language tags:

ko, ja, en

Style

A tag indicating the emotion and tone of the voice. Style values are specific to each voice, so check them by calling the Get Voices API.

3. Selecting a Voice

  1. Filter voices by the language you need
  2. Select voices whose use_case matches the purpose of your project
  3. Select voices with the age range and gender you want
  4. Copy the voice_id of the selected voice and use it when calling the API
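The selection steps above can be sketched as a simple filter over the "voices" list from a Get Voices response. This is an illustration only: select_voices is a hypothetical helper, and the second sample entry is invented to give the filter something to exclude.

```python
SAMPLE_VOICES = [
    {   # Entry taken from the Get Voices example in this document.
        "voice_id": "54CyP2zU9HCeLVCpzDRFPi",
        "name": "Yoonho",
        "age": "young-adult",
        "gender": "male",
        "use_case": "game",
        "language": "ko",
        "style": "blank_high",
    },
    {   # Hypothetical entry, invented purely for this example.
        "voice_id": "hypothetical-voice-id",
        "name": "Example",
        "age": "middle-aged",
        "gender": "female",
        "use_case": "audiobook",
        "language": "en",
        "style": "neutral",
    },
]


def select_voices(voices, language=None, use_case=None, age=None, gender=None):
    """Return the voices matching every criterion that was supplied.

    Criteria left as None are ignored, so the steps can be applied one by one.
    """
    criteria = {
        "language": language,
        "use_case": use_case,
        "age": age,
        "gender": gender,
    }
    active = {key: value for key, value in criteria.items() if value is not None}
    return [
        voice
        for voice in voices
        if all(voice.get(key) == value for key, value in active.items())
    ]


matches = select_voices(SAMPLE_VOICES, language="ko", use_case="game", gender="male")
print([voice["voice_id"] for voice in matches])  # ['54CyP2zU9HCeLVCpzDRFPi']
```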

API Call Guide

1. Text Input Requirements

  • Maximum length: 200 characters, including spaces
  • Text requirements:
    • Text consisting only of spaces or punctuation marks is not accepted
    • Use correct grammar and spelling so that the text is pronounced accurately
    • Do not include special characters such as “\n”
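A client-side check along these lines can catch invalid text before a request is sent. The sketch below rests on a few assumptions: "characters" is counted as Unicode code points, "special characters" is read as control characters such as a newline, and validate_text is a hypothetical helper. The API's own validation may differ.

```python
import unicodedata

MAX_TEXT_LENGTH = 200  # Limit stated in the requirements above, spaces included.


def validate_text(text: str) -> list:
    """Return a list of problems found in the text; an empty list means it passed."""
    problems = []

    # Assumption: length is measured in Unicode code points.
    if len(text) > MAX_TEXT_LENGTH:
        problems.append(f"text is {len(text)} characters; the limit is {MAX_TEXT_LENGTH}")

    # Assumption: "special characters" means control characters (category Cc),
    # which covers newline, tab, and carriage return.
    if any(unicodedata.category(ch) == "Cc" for ch in text):
        problems.append("text contains control characters such as a newline")

    # Text must contain at least one letter or digit, in any script.
    if not any(ch.isalnum() for ch in text):
        problems.append("text contains only spaces or punctuation")

    return problems


print(validate_text("Hello, world."))  # []
```

Returning a list rather than raising on the first failure lets a caller report every problem at once.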

2. Audio Output Format

  • Specified by the query parameter output_format
  • Supported formats:
    • wav (default)
    • mp3
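Because output_format is a query parameter, it ends up in the query string of the request URL. A minimal sketch of building that query string with the Python standard library follows; output_format_query is a hypothetical helper, and the endpoint URL is deliberately left out because it is not given here.

```python
from urllib.parse import urlencode

SUPPORTED_OUTPUT_FORMATS = ("wav", "mp3")  # Formats listed above; wav is the default.


def output_format_query(output_format: str = "wav") -> str:
    """Return the query string that selects the audio output format."""
    if output_format not in SUPPORTED_OUTPUT_FORMATS:
        raise ValueError(
            f"unsupported output_format {output_format!r}; "
            f"expected one of {SUPPORTED_OUTPUT_FORMATS}"
        )
    return urlencode({"output_format": output_format})


print(output_format_query("mp3"))  # output_format=mp3
```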

3. Model Selection

Choose from the following using the body parameter model:

turbo

  • Features: Medium quality, very low latency
  • Recommended use: Real-time conversation

pro

  • Features: High quality, medium latency
  • Recommended use: Content creation, when high-quality voice is needed

Key Metrics

turbo

  • Voice Quality Score (NISQA): 4.15
  • Average Response Time (Latency):
    • 50 characters: 820 ms
    • 100 characters: 1,000 ms
  • Supported Languages: ko, ja, en
  • Recommended Use Cases: Voice conversation services with AI

pro

  • Voice Quality Score (NISQA): 4.20
  • Average Response Time (Latency):
    • 50 characters: 1,500 ms
    • 100 characters: 2,300 ms
  • Supported Languages: ko, ja, en
  • Recommended Use Cases: Content creation such as audiobooks and videos
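One way to use the average response times listed above is to choose a model from a latency budget. The sketch below prefers pro for its higher quality score and falls back to turbo when pro's published average does not fit. pick_model is a hypothetical helper, the figures are averages rather than guarantees, and rounding the text length up to the nearest published bucket is a simplification.

```python
from typing import Optional

# Average response times in milliseconds, keyed by text length in characters,
# copied from the key metrics above.
PUBLISHED_LATENCY_MS = {
    "turbo": {50: 820, 100: 1000},
    "pro": {50: 1500, 100: 2300},
}


def pick_model(latency_budget_ms: int, text_length: int = 100) -> Optional[str]:
    """Return the highest-quality model whose average latency fits the budget.

    Returns None when neither published average fits.
    """
    # Simplification: use the 50-character figure for short text, else the 100 one.
    bucket = 50 if text_length <= 50 else 100
    for model in ("pro", "turbo"):  # pro first: higher quality score
        if PUBLISHED_LATENCY_MS[model][bucket] <= latency_budget_ms:
            return model
    return None


print(pick_model(2500))  # pro
print(pick_model(1200))  # turbo
```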

4. Voice Detailed Settings

  1. pitch_shift
    • Range: -24 ~ 24 (Default: 0.0)
    • Description:
      • 1 unit = 1 semitone
      • Positive: Pitch increase (e.g., 10 = 10 semitones, or 5 whole tones, up)
      • Negative: Pitch decrease (e.g., -10 = 10 semitones, or 5 whole tones, down)
      • ±24 = a change of 2 octaves
  2. pitch_variance
    • Range: 0 ~ 2 (Default: 1.0)
    • Description:
      • 0: Minimum variation
      • 2: Maximum variation
      • Higher value means more dynamic voice
  3. speed
    • Range: 0.5 ~ 2.0 (Default: 1.0)
    • Description:
      • 1.0: Default speed
      • 2.0: Double speed
      • 0.5: Half speed
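The ranges and defaults above can be captured in a small settings object that rejects out-of-range values before a request is made. This is a sketch under stated assumptions: VoiceSettings and pitch_ratio are hypothetical names, the range endpoints are treated as inclusive, and the frequency ratio uses the standard equal-temperament relation (one semitone is a factor of 2^(1/12)), which matches the statement that ±24 equals 2 octaves.

```python
from dataclasses import dataclass


def _check_range(name: str, value: float, low: float, high: float) -> None:
    """Raise ValueError when value lies outside [low, high] (inclusive, by assumption)."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the three voice settings and their defaults."""

    pitch_shift: float = 0.0     # semitones, -24 to 24
    pitch_variance: float = 1.0  # 0 to 2
    speed: float = 1.0           # 0.5 to 2.0

    def __post_init__(self) -> None:
        _check_range("pitch_shift", self.pitch_shift, -24, 24)
        _check_range("pitch_variance", self.pitch_variance, 0, 2)
        _check_range("speed", self.speed, 0.5, 2.0)


def pitch_ratio(semitones: float) -> float:
    """Frequency ratio for a shift in semitones under equal temperament."""
    return 2.0 ** (semitones / 12.0)


settings = VoiceSettings(pitch_shift=-10, speed=1.5)
print(pitch_ratio(24))  # 4.0
```

For example, a pitch_shift of 24 corresponds to a frequency ratio of 4, which is two doublings, or 2 octaves.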

Generated Results Guide

1. File Name Structure

Example

2024-08-28_05-02-53_arin_ko_happy_gv0_av15_ps0_pv100_s100.mp3

Components

  • Generation time: 2024-08-28_05-02-53 (YYYY-MM-DD_HH-mm-ss)
  • Voice name: arin
  • Language: ko
  • Style: happy
  • Pitch shift: ps0
  • Pitch variance range: pv100 (100 = 1.0)
  • Speech speed: s100 (100 = 1.0)
  • File format: .mp3
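The components above can be recovered from a file name by splitting on underscores. The sketch below is an assumption-laden illustration, not an official parser: parse_result_filename is a hypothetical helper, the tokens gv0 and av15 in the example are not described in the component list and are kept verbatim, and the pitch shift token is returned unparsed because only ps0 is documented.

```python
import re
from datetime import datetime


def parse_result_filename(filename: str) -> dict:
    """Split a generated file name into the components listed above."""
    stem, _, extension = filename.rpartition(".")
    tokens = stem.split("_")
    if len(tokens) < 8:
        raise ValueError(f"unexpected file name structure: {filename!r}")

    date_part, time_part, voice_name, language = tokens[:4]
    ps_token, pv_token, s_token = tokens[-3:]
    if not (ps_token.startswith("ps") and pv_token.startswith("pv") and s_token.startswith("s")):
        raise ValueError(f"unexpected settings tokens in: {filename!r}")

    # Everything between the language and the ps token: the style (which may
    # itself contain underscores, e.g. blank_high) followed by tokens such as
    # gv0 and av15 whose meaning is not documented here.
    middle = tokens[4:-3]
    other_tokens = []
    while middle and re.fullmatch(r"[a-z]+\d+", middle[-1]):
        other_tokens.insert(0, middle.pop())

    return {
        "generated_at": datetime.strptime(f"{date_part}_{time_part}", "%Y-%m-%d_%H-%M-%S"),
        "voice_name": voice_name,
        "language": language,
        "style": "_".join(middle),
        "pitch_shift_token": ps_token,             # kept raw; scaling is not documented
        "pitch_variance": int(pv_token[2:]) / 100,  # 100 = 1.0
        "speed": int(s_token[1:]) / 100,            # 100 = 1.0
        "other_tokens": other_tokens,
        "file_format": extension,
    }


info = parse_result_filename(
    "2024-08-28_05-02-53_arin_ko_happy_gv0_av15_ps0_pv100_s100.mp3"
)
print(info["voice_name"], info["style"], info["speed"])  # arin happy 1.0
```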

2. Notes

Because the generated speech is produced by machine learning technology, results may differ between requests and quality may vary, even when the settings are identical.