Comparison · April 20, 2026 · 7 min read

Voice AI APIs in 2026: what Gemini TTS, Voxtral TTS, and OpenAI TTS actually cost per hour

Per-token pricing makes TTS costs hard to reason about. We converted all of them to cost per hour of audio. The range is narrower than you might expect - and ElevenLabs overage pricing is much higher than the per-plan numbers suggest.

Sound wave visualization representing voice AI and TTS API pricing comparison

Image source: Google / The Keyword

Google launched Gemini 3.1 Flash TTS on April 15, 2026 at $20.00/1M audio output tokens. Mistral shipped Voxtral TTS on March 23 at $0.016/1k characters. OpenAI's TTS-1 is still character-priced at $15/1M chars. At 150 words per minute (typical speech), the per-hour costs are:

  • OpenAI TTS-1: $0.74/hr
  • Mistral Voxtral TTS: $0.79/hr (or free if self-hosted)
  • Gemini 3.1 Flash TTS (batch): $0.91/hr
  • OpenAI TTS-1 HD: $1.49/hr
  • Gemini 3.1 Flash TTS (standard): $1.81/hr

ElevenLabs overage on the Pro plan runs to ~$10.20/hr. Within-plan effective cost on Pro ($99/month, 600k chars) is ~$8.18/hr if you use the full allocation. API access requires at minimum the Starter plan ($5/month).

Why audio output costs 6-7x more per token than text

Looking at Gemini 3.1 Flash TTS at $20/1M audio tokens next to Gemini 3 Flash text at $3/1M output tokens, the first instinct is that voice AI is expensive. It is, on a per-token basis. But audio tokens and text tokens pack in very different amounts of time.

Google's audio token rate is 25 tokens per second of speech. One minute of audio uses 1,500 audio output tokens; one hour uses 90,000. At $20/1M, that is $1.80 per hour of audio (the $1.81 figure used elsewhere in this article is consistent with adding about a cent of text input at $1.00/1M). For comparison, a text output of the same transcript (roughly 12,000 tokens at 200 tokens/minute) would cost $0.036 at $3/1M. So audio output costs about 50x more than text output for the same words - but the audio is also playable, which is the whole point.

OpenAI and Mistral use character-based pricing rather than audio tokens. This sidesteps the per-second conversion problem but creates a different comparison challenge. We used 825 characters per minute (150 words at 5.5 chars/word average) to convert everything to cost per hour.
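The conversion can be sketched in a few lines of Python. This is a minimal sketch under the article's stated assumptions (150 words per minute, 5.5 characters per word, 25 Gemini audio tokens per second); the function and constant names are illustrative and not part of any vendor SDK.

```python
# Assumptions taken from the article: 150 wpm, 5.5 chars/word, 25 audio tokens/sec.
WORDS_PER_MINUTE = 150
CHARS_PER_WORD = 5.5
GEMINI_AUDIO_TOKENS_PER_SECOND = 25

CHARS_PER_HOUR = WORDS_PER_MINUTE * CHARS_PER_WORD * 60               # 49,500
GEMINI_AUDIO_TOKENS_PER_HOUR = GEMINI_AUDIO_TOKENS_PER_SECOND * 3600  # 90,000


def per_hour_from_chars(price_per_million_chars: float) -> float:
    """Cost of one hour of speech for a character-priced API."""
    return CHARS_PER_HOUR * price_per_million_chars / 1_000_000


def per_hour_from_audio_tokens(
    price_per_million_tokens: float,
    tokens_per_hour: float = GEMINI_AUDIO_TOKENS_PER_HOUR,
) -> float:
    """Cost of one hour of speech for an audio-token-priced API (audio output only)."""
    return tokens_per_hour * price_per_million_tokens / 1_000_000


if __name__ == "__main__":
    print(f"OpenAI TTS-1 ($15/1M chars):         {per_hour_from_chars(15):.4f} $/hr")
    print(f"Voxtral TTS ($16/1M chars):          {per_hour_from_chars(16):.4f} $/hr")
    print(f"OpenAI TTS-1 HD ($30/1M chars):      {per_hour_from_chars(30):.4f} $/hr")
    print(f"Gemini audio ($20/1M audio tokens):  {per_hour_from_audio_tokens(20):.4f} $/hr")
```

The token-based helper covers audio output only; any separate text input charge has to be added on top.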

TTS API pricing: cost per hour of audio

All figures assume 150 words/minute (825 chars/minute, 90,000 Gemini audio tokens/hour). Prices retrieved April 20, 2026.

| Service | Model | Headline price | $/hr audio |
| --- | --- | --- | --- |
| OpenAI | TTS-1 | $15/1M chars | $0.74 |
| Mistral | Voxtral TTS (open weights, non-commercial) | $16/1M chars | $0.79 |
| OpenAI | gpt-4o-mini-tts | $12/1M audio tokens | ~$0.90 |
| Google | Gemini 3.1 Flash TTS (batch API only) | $10/1M audio tokens | $0.91 |
| OpenAI | TTS-1 HD | $30/1M chars | $1.49 |
| Google | Gemini 3.1 Flash TTS (standard rate) | $20/1M audio tokens | $1.81 |
| ElevenLabs | v2 Multilingual (overage; Pro plan, past allocation) | ~$0.17/min overage | ~$10.20 |

ElevenLabs API access requires Starter plan ($5/month) minimum. The overage rate applies after the plan's included characters are used. Gemini 3.1 Flash TTS is in preview; prices may change at GA.

Gemini 3.1 Flash TTS: the April 2026 entry

Google announced Gemini 3.1 Flash TTS Preview on April 15, 2026. The pricing structure is different from every other TTS API: audio output is billed in tokens at 25 tokens per second, with a separate text input charge. Standard rates are $1.00/1M text input and $20.00/1M audio output. Batch API halves both.
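As a rough sketch of how the two charges combine into a per-hour figure: the audio side is fixed by the 25 tokens/second rate, while the text side depends on how many input tokens the script uses. The 200 text tokens per minute below is an assumption carried over from the transcript estimate earlier in this article, and the function name is illustrative rather than anything from Google's SDK.

```python
def gemini_tts_cost_per_hour(
    text_input_price: float = 1.00,     # $ per 1M text input tokens (standard)
    audio_output_price: float = 20.00,  # $ per 1M audio output tokens (standard)
    batch: bool = False,
    audio_tokens_per_second: int = 25,
    text_tokens_per_minute: int = 200,  # assumption: ~12,000 input tokens per hour
) -> float:
    """Estimate the cost of one hour of Gemini TTS audio, text input included."""
    audio_tokens = audio_tokens_per_second * 3600
    text_tokens = text_tokens_per_minute * 60
    cost = (audio_tokens * audio_output_price + text_tokens * text_input_price) / 1_000_000
    # The Batch API halves both the input and the output rate.
    return cost / 2 if batch else cost


if __name__ == "__main__":
    print(f"standard: {gemini_tts_cost_per_hour():.3f} $/hr")
    print(f"batch:    {gemini_tts_cost_per_hour(batch=True):.3f} $/hr")
```

Under these assumptions the standard rate comes to about $1.81/hr and the batch rate to about $0.91/hr, with the text input charge contributing roughly one cent per hour at standard rates.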

On quality: Gemini 3.1 Flash TTS scores 1,211 Elo on the Artificial Analysis TTS leaderboard (blind human preference testing across 70+ models), putting it in the top tier for quality-to-cost ratio. We tested it on a sample documentation narration, and the multi-speaker dialogue feature is genuinely useful: it generates a two-voice explainer from a single prompt without any stitching or post-processing. Google claims 70+ language support and expressive audio tags for controlling emotion and delivery.

There is a free tier during preview, though usage is subject to rate limits and Google uses that data to improve their products. The 32,000 token context window per TTS session is the meaningful constraint for long content - documents over roughly 24,000 words need to be chunked. SynthID watermarking ships on all outputs.
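One simple way to stay under that limit is to split long scripts on paragraph boundaries before sending them. The sketch below is a generic word-budget chunker, not a method documented by Google; the default budget of 20,000 words is an arbitrary safety margin below the ~24,000-word estimate above, and the function name is illustrative.

```python
def chunk_by_words(text: str, max_words: int = 20_000) -> list[str]:
    """Split text into chunks of at most max_words words, breaking on blank lines.

    A single paragraph longer than max_words is split on word boundaries.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Oversized paragraph: flush what we have, then hard-split it.
        if len(words) > max_words:
            if current:
                chunks.append("\n\n".join(current))
                current, current_words = [], 0
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
            continue
        # Start a new chunk if this paragraph would exceed the budget.
        if current_words + len(words) > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(paragraph.strip())
        current_words += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be synthesized as its own request and the resulting audio files concatenated in order. Word counts are only a proxy for tokens, so leave headroom rather than targeting the limit exactly.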

| Gemini TTS model | Text input / 1M tokens | Audio output / 1M tokens | $/hr audio |
| --- | --- | --- | --- |
| Gemini 3.1 Flash TTS Preview | $1.00 | $20.00 | $1.81 |
| Gemini 3.1 Flash TTS Preview (batch) | $0.50 | $10.00 | $0.91 |
| Gemini 2.5 Flash TTS Preview | $0.50 | $10.00 | $0.91 |
| Gemini 2.5 Pro TTS Preview | $1.00 | $20.00 | $1.81 |

Source: Google AI pricing, April 2026. All Gemini TTS models are currently in preview.

Mistral Voxtral TTS: character-priced with open weights

Voxtral TTS (model ID: voxtral-mini-tts-2603) is a 4B parameter model announced March 23, 2026. The API uses character-based pricing at $0.016 per 1,000 characters, working out to $0.79/hr at 150 words per minute.

Two things make Voxtral TTS different from everything else on this list. Open weights are available on Hugging Face under CC BY-NC 4.0, meaning non-commercial self-hosting carries no license or API fees (you still pay for your own compute). And zero-shot voice cloning works from as little as 3 seconds of reference audio - a feature ElevenLabs charges separately for at scale. Mistral's own human evaluations put quality at parity with ElevenLabs v3 on naturalness.

The limitations: 9 supported languages versus Gemini's 70+, and the non-commercial license on open weights rules out self-hosting for most commercial deployments.

One note on naming: Mistral also has models called Voxtral Small 24B and Voxtral Mini 3B. Those are speech-to-text transcription models, not TTS. The text-to-speech model is specifically voxtral-mini-tts-2603.

OpenAI's three TTS options

OpenAI has three TTS models at different price points. TTS-1 and TTS-1 HD have been available since late 2023 and use character-based pricing. The newer gpt-4o-mini-tts uses audio token pricing, similar to Gemini's approach, at $12/1M audio tokens versus Gemini's $20/1M standard rate. That works out to roughly half of Gemini's standard per-hour cost.

| Model | Pricing unit | Rate | $/hr audio |
| --- | --- | --- | --- |
| TTS-1 | characters (output) | $15/1M chars | $0.74 |
| gpt-4o-mini-tts | audio tokens (output) | $12/1M audio tokens | ~$0.90 |
| TTS-1 HD | characters (output) | $30/1M chars | $1.49 |

TTS-1 is fast and cheap. TTS-1 HD is higher quality but at $1.49/hr it is approaching Gemini standard pricing without any of the multi-speaker features. The gpt-4o-mini-tts uses a neural audio approach closer to modern voice models; on expressive or conversational content, the quality gap over TTS-1 is noticeable.

ElevenLabs: the overage cost is what catches you

ElevenLabs is the quality benchmark everyone compares against. Voice cloning, emotional range, naturalness at varying pace - for anything where the listening experience is the whole product, it is still genuinely better than the API-native models. The pricing model is plan-based with character credits; API access starts at Starter ($5/month).

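| Plan | Monthly cost | Chars included | Audio hrs in plan | Effective $/hr |
| --- | --- | --- | --- | --- |
| Starter | $5 | 30K | ~0.6 hrs | ~$8/hr |
| Creator | $22 | 121K | ~2.4 hrs | ~$9.20/hr |
| Pro | $99 | 600K | ~12 hrs | ~$8.18/hr |
| Scale | $330 | 1.8M | ~36 hrs | ~$9.28/hr |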

The table above divides plan cost by hours of audio in the plan. Past the monthly limit, overage is roughly $0.17-$0.20 per minute depending on plan - that is $10-12 per hour. Compared to Voxtral TTS ($0.79/hr) or OpenAI TTS-1 ($0.74/hr), the $10.20/hr Pro overage rate is roughly 13x higher.
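The division behind the plan table can be sketched as follows. This is an illustrative calculation using the plan prices and character allocations listed above and the article's 825 characters per minute assumption; the names are hypothetical and nothing here calls an ElevenLabs API.

```python
CHARS_PER_HOUR = 825 * 60  # 150 wpm at 5.5 chars/word, as assumed throughout

# (monthly price in $, included characters) from the plan table above
PLANS = {
    "Starter": (5, 30_000),
    "Creator": (22, 121_000),
    "Pro": (99, 600_000),
    "Scale": (330, 1_800_000),
}


def effective_cost_per_hour(monthly_price: float, included_chars: int) -> float:
    """Plan price divided by the hours of audio its character allocation covers."""
    hours_in_plan = included_chars / CHARS_PER_HOUR
    return monthly_price / hours_in_plan


def overage_cost_per_hour(rate_per_minute: float) -> float:
    """Convert a per-minute overage rate into a per-hour figure."""
    return rate_per_minute * 60


if __name__ == "__main__":
    for name, (price, chars) in PLANS.items():
        hours = chars / CHARS_PER_HOUR
        print(f"{name}: {hours:.1f} hrs in plan, "
              f"{effective_cost_per_hour(price, chars):.2f} $/hr if fully used")
    print(f"Overage at $0.17/min: {overage_cost_per_hour(0.17):.2f} $/hr")
```

Unrounded results can differ from the table by a few percent. The effective rate also only holds if the full allocation is used; a half-used plan doubles the real cost per hour.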

PCM audio output (lossless, required for some voice agent pipelines) is locked to Pro plan ($99/month) and above. If you need raw PCM and are currently on a lower tier, that changes your effective floor.

When the API-native options win on cost

At low volume, ElevenLabs Pro at $99/month is defensible if you use most of the allocation. The math only breaks badly once you overshoot.

Notification system: 5 hrs/month of short audio clips
  • ElevenLabs Pro: ~$99/mo (Pro plan, within allocation)
  • Voxtral TTS: ~$3.97/mo
  • Gemini 3.1 Flash TTS (standard): ~$9.06/mo

On cost alone, Voxtral and Gemini are already cheaper at this volume; ElevenLabs is a quality choice that stays predictable only while you remain within the plan. Past roughly 12 hrs/month, overage rates apply and the gap widens quickly.

Audiobook pipeline: 200 hrs/month
  • ElevenLabs Pro: ~$2,017/mo ($99 Pro plan + 188 hrs overage at $0.17/min)
  • Voxtral TTS: ~$158/mo
  • Gemini 3.1 Flash TTS (standard): ~$362/mo

At scale, Voxtral TTS is 13x cheaper than ElevenLabs overage. Gemini batch ($0.91/hr) drops the Gemini line to ~$182/mo.

Voxtral TTS spend does not reach the Pro plan's flat $99 fee until around 125 hours of audio per month, roughly ten times the ~12 hours the plan includes. Run your specific numbers in the cost calculator.
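The scenarios above can be approximated with a small model that treats the API-native options as linear in hours and ElevenLabs as a flat fee plus overage. This is a sketch under the article's approximate Pro figures ($99/month, about 12 included hours, about $0.17/min overage); the function names are hypothetical and real invoices will depend on actual character counts.

```python
def flat_rate_monthly_cost(hours: float, cost_per_hour: float) -> float:
    """Pay-as-you-go APIs: cost scales linearly with hours of audio."""
    return hours * cost_per_hour


def plan_with_overage_monthly_cost(
    hours: float,
    plan_price: float = 99.0,          # ElevenLabs Pro, per the plan table
    included_hours: float = 12.0,      # approximate hours covered by 600K chars
    overage_per_minute: float = 0.17,  # approximate Pro overage rate
) -> float:
    """Plan-based pricing: flat fee, plus per-minute overage past the allocation."""
    overage_hours = max(0.0, hours - included_hours)
    return plan_price + overage_hours * 60 * overage_per_minute


if __name__ == "__main__":
    for hours in (5, 12, 50, 200):
        print(
            f"{hours:>4} hrs/mo | "
            f"ElevenLabs Pro: {plan_with_overage_monthly_cost(hours):8.2f} | "
            f"Voxtral: {flat_rate_monthly_cost(hours, 0.79):7.2f} | "
            f"Gemini standard: {flat_rate_monthly_cost(hours, 1.81):7.2f} | "
            f"Gemini batch: {flat_rate_monthly_cost(hours, 0.91):7.2f}"
        )
```

Swapping in your own plan price, included hours, and overage rate shows how quickly the plan-based line diverges once the allocation is exhausted.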

Picking one

If voice quality is what your users notice first - a branded podcast, a consumer app, anything where someone is genuinely listening - ElevenLabs is still the benchmark. You pay for it, but under 12 hrs/month the Pro plan is not absurd, and the emotional range is real.

For high-volume pipelines where cost is the primary constraint, the decision comes down to ecosystem fit. Already on OpenAI? TTS-1 at $0.74/hr is the default choice - stable, well-documented, nothing to rethink. Already on Google? Gemini 3.1 Flash TTS batch mode at $0.91/hr gets you 70+ language coverage and multi-speaker output that is genuinely hard to replicate otherwise.

Voxtral TTS is the interesting wildcard. $0.79/hr, zero-shot voice cloning from 3 seconds of audio, and an open-weights license that lets you self-host for non-commercial work. If you need voice cloning without committing to a platform, it is the only API-tier option that does it at this price. The 9-language limit is real though - if you need Southeast Asian or East Asian language support, you are looking at Gemini or ElevenLabs.

The pricing picture

The per-token TTS headline prices look alarming before you work out how much audio time they represent. At Gemini's $20/1M audio tokens, an hour of speech is 90,000 tokens, or about $1.80. Once you convert everything to cost per hour, all the paid API options land between $0.74 and $1.81 - much tighter than the per-token headlines suggest, and far below ElevenLabs at overage rates.

Gemini 3.1 Flash TTS is the most capable new entry - 70+ languages, multi-speaker, Elo 1,211 - but its standard rate is the most expensive among the API-native options. The batch rate ($0.91/hr) brings it in line with the rest. Voxtral TTS sits near the bottom of the range at $0.79/hr, just above OpenAI TTS-1 ($0.74/hr), with a self-hosting path nobody else offers.

See current TTS and text model pricing on the pricing page, or estimate your monthly voice AI costs in the cost calculator.

Sources