Integrate Supertone TTS with Claude, Cursor, and other MCP-compatible clients. The Supertone MCP server exposes the Text-to-Speech API as a set of composable Model Context Protocol tools, so an AI agent can discover voices, preview samples, estimate cost, clone voices, and synthesize speech — and chain those steps into multi-step workflows on its own. Source: supertone-inc/supertone-mcp.

What you can do

  • Synthesize speech with control over voice, language, speed, pitch, and emotion style.
  • Discover voices by language, gender, age, use case, or style — and preview samples before committing.
  • Clone and manage custom voices from a local audio file.
  • Track usage — check credit balance and usage history.
  • Stitch audio — merge multiple clips into one file, with optional silence gaps or crossfades.

Prerequisites

  • uv installed (provides uvx), or Python with pip.
  • A Supertone API key from the Developer Console.

Install

Every client runs the same server, uvx supertone-mcp, with your API key passed as an environment variable. The Cursor configuration is shown here.
For Cursor, add the following to ~/.cursor/mcp.json (global) or .cursor/mcp.json (per-project), then fill in your API key:
{
  "mcpServers": {
    "supertone-tts": {
      "command": "uvx",
      "args": ["supertone-mcp"],
      "env": { "SUPERTONE_API_KEY": "your-api-key-here" }
    }
  }
}
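
If the config file already lists other MCP servers, the entry has to be merged in rather than pasted over them. The following is a minimal sketch of that merge, not part of the Supertone tooling; the helper name add_supertone_server is hypothetical, and the server entry is copied from the config above.

```python
import json
from pathlib import Path


def add_supertone_server(config_path: Path, api_key: str) -> dict:
    """Merge the supertone-tts entry into an MCP config file, keeping other servers."""
    config = {}
    if config_path.exists():
        # Invalid JSON raises here, so a broken file is never silently overwritten.
        config = json.loads(config_path.read_text(encoding="utf-8"))

    servers = config.setdefault("mcpServers", {})
    servers["supertone-tts"] = {
        "command": "uvx",
        "args": ["supertone-mcp"],
        "env": {"SUPERTONE_API_KEY": api_key},
    }

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the global Cursor config, this could be called as add_supertone_server(Path.home() / ".cursor" / "mcp.json", "your-api-key-here"). Restart the client afterwards so it picks up the new server.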

Environment variables

| Variable | Required | Default | Purpose |
| SUPERTONE_API_KEY | Yes | (none) | Authentication |
| SUPERTONE_MCP_VOICE_ID | No | Aiden (multilingual) | Default voice_id for text_to_speech |
| SUPERTONE_OUTPUT_DIR | No | ~/supertone-tts-output/ | Where generated audio files are saved |
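
As an illustration of how these variables fit together, the sketch below builds the env block for the client config and works out where audio would be saved. The function names are hypothetical, and expanding "~" to the home directory is an assumption about how the server interprets the default path.

```python
from pathlib import Path

# Documented default for SUPERTONE_OUTPUT_DIR.
DEFAULT_OUTPUT_DIR = "~/supertone-tts-output/"


def build_server_env(api_key, voice_id=None, output_dir=None):
    """Build the env block for the MCP config, leaving out optional variables that are unset."""
    if not api_key:
        raise ValueError("SUPERTONE_API_KEY is required")
    env = {"SUPERTONE_API_KEY": api_key}
    if voice_id:
        env["SUPERTONE_MCP_VOICE_ID"] = voice_id
    if output_dir:
        env["SUPERTONE_OUTPUT_DIR"] = output_dir
    return env


def effective_output_dir(env):
    """Directory where audio would be saved (assumes "~" is expanded to the home directory)."""
    return Path(env.get("SUPERTONE_OUTPUT_DIR", DEFAULT_OUTPUT_DIR)).expanduser()
```

Leaving the optional variables out of the env block lets the server fall back to its own defaults.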

Tools

The server exposes its capabilities as composable building blocks the agent can chain.
| Tool | Description |
| text_to_speech | Generate audio with control over speed, pitch, emotion style, and output format. |
| predict_duration | Estimate synthesis duration and credit cost before generating. |

| Tool | Description |
| search_voice | Filter preset voices by language, gender, age, use case, or style. |
| get_voice | Retrieve full details for a voice. |
| preview_voice | Fetch sample audio URLs to evaluate a voice. |

| Tool | Description |
| clone_voice | Create a cloned voice from a local WAV/MP3 (≤ 3 MB). |
| search_custom_voice | List and filter your cloned voices. |
| get_custom_voice | Fetch details for a cloned voice. |
| edit_custom_voice | Update a cloned voice’s name or description. |
| delete_custom_voice | Permanently remove a cloned voice (irreversible). |

| Tool | Description |
| get_credit_balance | Check remaining credits. |
| get_usage_history | View usage over a time window. |
| get_voice_usage | Usage metrics for a specific voice. |

| Tool | Description |
| merge_audio_files | Merge two or more local audio files into one: plain concatenation, silence gaps (gap_ms), or crossfade blending (crossfade_ms). Useful for stitching multiple text_to_speech outputs. |

Key text_to_speech parameters

  • text (required), voice_id, language, output_format (mp3 / wav)
  • model — e.g. sona_speech_2_flash, sona_speech_1
  • speed (0.5–2.0), pitch_shift (−24 to +24 semitones), style
  • output_mode (files / resources / both), autoplay (default false), streaming (sona_speech_1 only)
These are per-call parameters, so the agent controls output mode, autoplay, and model on each invocation.
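
The ranges and allowed values above can be checked before a call is made. The following sketch is a hypothetical client-side helper, not the server's own validation; it assumes that streaming requires model to be set explicitly to sona_speech_1, and it omits any option left as None so the server applies its own defaults.

```python
ALLOWED_FORMATS = {"mp3", "wav"}
ALLOWED_OUTPUT_MODES = {"files", "resources", "both"}


def build_tts_arguments(text, **options):
    """Check text_to_speech options against the documented limits and return the arguments dict."""
    if not text:
        raise ValueError("text is required")

    args = {"text": text}
    # Options left as None are dropped so the server falls back to its defaults.
    args.update({key: value for key, value in options.items() if value is not None})

    if "output_format" in args and args["output_format"] not in ALLOWED_FORMATS:
        raise ValueError("output_format must be mp3 or wav")
    if "output_mode" in args and args["output_mode"] not in ALLOWED_OUTPUT_MODES:
        raise ValueError("output_mode must be files, resources, or both")
    if "speed" in args and not 0.5 <= args["speed"] <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if "pitch_shift" in args and not -24 <= args["pitch_shift"] <= 24:
        raise ValueError("pitch_shift must be between -24 and +24 semitones")
    # Assumption: streaming needs the model named explicitly as sona_speech_1.
    if args.get("streaming") and args.get("model") != "sona_speech_1":
        raise ValueError("streaming is only available with sona_speech_1")
    return args
```

For example, build_tts_arguments("Hello", speed=1.2, output_format="wav") returns a dict that can be passed as the arguments of a text_to_speech call.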

Key merge_audio_files parameters

  • input_paths (required) — two or more local audio file paths, in order. (A single path is returned unchanged.)
  • gap_ms — silence inserted between clips, in milliseconds.
  • crossfade_ms — crossfade blend between clips, in milliseconds. Mutually exclusive with gap_ms.
  • output_format — override the output format. By default it’s auto-detected: all inputs sharing an extension → that extension; mixed → mp3. Differing sample rates or channel counts are normalized automatically before merging.
ffmpeg is bundled via imageio-ffmpeg, so merging works out of the box with uvx supertone-mcp — no system ffmpeg install required.
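
The format auto-detection and the gap_ms / crossfade_ms exclusivity can be mirrored on the client side. This is a sketch of the documented rules, not the server's implementation; the function names are hypothetical, and treating extensions case-insensitively and falling back to mp3 for extensionless files are assumptions.

```python
from pathlib import Path


def predict_output_format(input_paths):
    """Predict the auto-detected format: a shared extension wins, mixed inputs give mp3."""
    extensions = {Path(str(p)).suffix.lower().lstrip(".") for p in input_paths}
    shared = extensions.pop() if len(extensions) == 1 else ""
    # Assumption: files without an extension are treated like the mixed case.
    return shared or "mp3"


def build_merge_arguments(input_paths, gap_ms=None, crossfade_ms=None, output_format=None):
    """Build merge_audio_files arguments, rejecting gap_ms combined with crossfade_ms."""
    paths = [str(p) for p in input_paths]
    if not paths:
        raise ValueError("input_paths is required")
    if gap_ms is not None and crossfade_ms is not None:
        raise ValueError("gap_ms and crossfade_ms are mutually exclusive")

    arguments = {"input_paths": paths}
    if gap_ms is not None:
        arguments["gap_ms"] = gap_ms
    if crossfade_ms is not None:
        arguments["crossfade_ms"] = crossfade_ms
    if output_format is not None:
        # Only sent when the caller wants to override auto-detection.
        arguments["output_format"] = output_format
    return arguments
```

Leaving output_format out of the arguments keeps the server's auto-detection in charge; predict_output_format is only there to tell the caller what to expect.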

Example workflows

Discover → preview → estimate → synthesize

“Find a calm Korean female voice, let me hear a sample, check the cost, then make this announcement as mp3.”
Chains search_voice() → preview_voice() → predict_duration() + get_credit_balance() → text_to_speech().

Clone and use immediately

“Create a cloned voice from ~/recordings/sample.wav named MyVoice, then read this greeting with it and play it.”
Chains clone_voice() → get_custom_voice() → text_to_speech(autoplay=true).

Narrate a script and stitch it together

“Generate each paragraph of this script, then merge them into one mp3 with a short pause between each.”
Chains text_to_speech() per segment → merge_audio_files(gap_ms=...).
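
Under the hood, each step in a chain like this reaches the server as an MCP tools/call request. The sketch below shows roughly what the narrate-and-stitch chain could look like as JSON-RPC messages. The helper names are hypothetical, splitting the script on blank lines is an assumption, and the clip paths passed to merge_request must come from the text_to_speech results, whose exact shape is not described here.

```python
import json


def tool_call(request_id, name, arguments):
    """Wrap a tool invocation in an MCP tools/call JSON-RPC request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def narration_requests(script, output_format="mp3"):
    """One text_to_speech request per paragraph (paragraphs split on blank lines)."""
    paragraphs = [part.strip() for part in script.split("\n\n") if part.strip()]
    return [
        tool_call(index, "text_to_speech", {"text": paragraph, "output_format": output_format})
        for index, paragraph in enumerate(paragraphs, start=1)
    ]


def merge_request(request_id, clip_paths, gap_ms):
    """Final merge_audio_files request; clip_paths come from the earlier results."""
    return tool_call(
        request_id,
        "merge_audio_files",
        {"input_paths": list(clip_paths), "gap_ms": gap_ms},
    )
```

In practice the agent issues these calls itself; printing json.dumps(request, indent=2) for each request is mainly useful for debugging or for driving the server from a script.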

Troubleshooting

  • Server does not appear in the client: make sure the config file is valid JSON and the client was fully restarted. Most clients only load MCP servers at startup.
  • uvx is not found: install uv (which provides uvx); see the uv install guide. Alternatively, run pip install supertone-mcp and set the command to supertone-mcp.
  • Authentication fails: confirm SUPERTONE_API_KEY is set in the server’s env block (not just your shell) and is valid. Get a key from the Developer Console.
  • Generated audio cannot be found: with output_mode: files, audio is written to SUPERTONE_OUTPUT_DIR (default ~/supertone-tts-output/). Set autoplay: true to also play it immediately.
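
Several of these checks can be automated. The following sketch is a hypothetical diagnostic script, not part of supertone-mcp; it checks that the config file parses, that the configured command is on PATH, and that the API key sits in the server's env block. It cannot tell whether the key itself is valid.

```python
import json
import shutil
from pathlib import Path


def diagnose(config_path, server_name="supertone-tts"):
    """Return a list of problems found in an MCP client config (empty if none)."""
    try:
        config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return [f"{config_path} does not exist"]
    except json.JSONDecodeError as exc:
        return [f"{config_path} is not valid JSON: {exc}"]

    if not isinstance(config, dict):
        return [f"{config_path} must contain a JSON object"]
    server = config.get("mcpServers", {}).get(server_name)
    if not isinstance(server, dict):
        return [f"no '{server_name}' entry under mcpServers"]

    problems = []
    command = server.get("command", "")
    if shutil.which(command) is None:
        problems.append(f"command '{command}' not found on PATH")
    # The key must be in the server's env block, not only in the shell.
    if not server.get("env", {}).get("SUPERTONE_API_KEY"):
        problems.append("SUPERTONE_API_KEY missing from the server's env block")
    return problems
```

An empty list means none of these problems were found; the client still needs a full restart before it loads a changed config.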
