Model Release · May 12, 2026 · 10 min read

OpenAI's three new voice models price three different ways. Here is what an hour actually costs.

GPT-Realtime-2 bills per million tokens, split between audio and text. Realtime-Translate bills a flat $0.034 per audio minute. The Whisper successor bills $0.017 per minute, 2.8 times what legacy whisper-1 still costs. Treating these as one stack is the fastest way to get the unit economics wrong, and the five-day-old launch left plenty of coverage doing exactly that.


Photo by Sindre N. Aalberg on Unsplash

On May 7, 2026, OpenAI dropped three voice models into the Realtime API at once. GPT-Realtime-2 is the conversational successor with GPT-5-class reasoning running mid-turn. Realtime-Translate is a new endpoint for live multilingual audio at a flat per-minute price. Realtime-Whisper is the streaming follow-up to the four-year-old whisper-1, and it is the most expensive of the three on a per-minute basis. The pricing surfaces are not consistent across the three, which makes the cost decision for any voice product harder than it looks at first glance.

Three models, three meters

The first thing to do is map each model to its actual billing dimension. Token rates, per-minute rates, and cached prefix rates all coexist in the same product family.

| Model | Billing unit | Headline rate | What it is for |
| --- | --- | --- | --- |
| GPT-Realtime-2 | Per 1M tokens (audio + text split) | $32 in / $64 out (audio); $0.40 cached | Two-way conversation with reasoning |
| GPT-Realtime-Translate | Per audio minute (flat) | $0.034/min | Live speech-to-speech translation |
| GPT-Realtime-Whisper | Per audio minute (streaming) | $0.017/min | Live transcription, captions, notes |
| whisper-1 (legacy) | Per audio minute (batch) | $0.006/min | File-based offline transcription |

The token math on Realtime-2 is the source of most confusion. $32 per million audio input tokens sounds like text pricing, but audio tokens are dense. Roughly 600 audio tokens encode one minute of input speech, so the bare audio-in rate is about $0.019 per minute. Generated audio is denser still, roughly 1,200 tokens per minute, so audio-out at $64 per million works out to about $0.077 per minute. OpenAI's quoted "$0.30 per minute typical" figure bundles those audio rates with the text-mode reasoning passes that fire in the middle of a turn. The text pass is where most of the dollars actually go.
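Under those density assumptions (roughly 600 tokens per minute of input audio and 1,200 per minute of generated audio; the density figures are estimates, not published constants), the per-minute rates fall out of simple arithmetic:

```python
# Back-of-envelope per-minute audio rates for GPT-Realtime-2.
# Token densities are estimates, not published constants.
AUDIO_IN_PER_1M = 32.00    # USD per 1M audio input tokens
AUDIO_OUT_PER_1M = 64.00   # USD per 1M audio output tokens
TOKENS_PER_MIN_IN = 600    # est. tokens per minute of input speech
TOKENS_PER_MIN_OUT = 1200  # est. tokens per minute of generated audio

audio_in_per_min = AUDIO_IN_PER_1M * TOKENS_PER_MIN_IN / 1_000_000
audio_out_per_min = AUDIO_OUT_PER_1M * TOKENS_PER_MIN_OUT / 1_000_000

print(f"audio in:  ${audio_in_per_min:.4f}/min")   # ~$0.019
print(f"audio out: ${audio_out_per_min:.4f}/min")  # ~$0.077
```

The gap between these bare rates and the ~$0.30 per minute typical figure is the text-mode reasoning spend.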

One hour of voice, priced four ways

Here is what a single hour of voice work costs across the four OpenAI surfaces. The workloads are deliberately picked to match common product shapes: an agent that holds a real conversation, a translator inserted between two speakers, a transcription service that streams captions, and a transcription job that processes recordings overnight.

| Workload (1 hour) | Best OpenAI model | Hourly cost | Notes |
| --- | --- | --- | --- |
| Conversational voice agent | GPT-Realtime-2 | ~$18.00 | Typical mix; can drop to $6-8 with caching |
| Live two-way translation | Realtime-Translate | $2.04 | Flat rate, regardless of language pair |
| Streaming captions, meeting notes | Realtime-Whisper | $1.02 | Low latency, words appear while spoken |
| Offline transcription (recordings) | whisper-1 (legacy) | $0.36 | Still GA; cheapest OpenAI option |

The 50x spread between conversational agent ($18) and offline transcription ($0.36) is the kind of math that breaks naive forecasts. A voice product that bundles "all of the above" on Realtime-2 will burn cash on transcription that could have run on whisper-1 for a fraction of the price. Picking the right model per task matters more than picking the right family.
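As a sanity check, the hourly figures above are just the per-minute rates times 60 (the Realtime-2 row uses OpenAI's ~$0.30 per minute typical figure):

```python
# One hour of audio priced on each OpenAI surface,
# using the per-minute rates quoted in the text.
rates_per_min = {
    "GPT-Realtime-2 (typical)": 0.30,
    "Realtime-Translate":       0.034,
    "Realtime-Whisper":         0.017,
    "whisper-1 (legacy)":       0.006,
}
for model, rate in rates_per_min.items():
    print(f"{model:26s} ${rate * 60:6.2f}/hour")
```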

Why the new Whisper costs 2.8x what whisper-1 still costs

At $0.017 per minute, GPT-Realtime-Whisper is 2.83 times more expensive than the four-year-old whisper-1 at $0.006 per minute. OpenAI did not retire whisper-1; both are live in the API simultaneously. The price gap is entirely the streaming premium. The new model produces tokens as they are spoken; the legacy model produces tokens after the file finishes processing.

For products that show captions in real time (video calls, accessibility overlays, voice memos that transcribe while the user is still talking), the new model is the only OpenAI option that meets the latency target. For products that ingest finished audio files and produce a transcript a minute later (podcast transcription, meeting recordings, batch voicemail), the legacy model is still the rational choice. We are in the unusual position of recommending the four-year-old endpoint as the cost-optimal pick for a real workload.

For comparison, Deepgram Nova-3 streams at $0.0048 per minute (English PAYG) and Groq runs Whisper Large v3 Turbo in batch mode at roughly $0.000667 per minute. Both are cheaper than either OpenAI option for the workloads they each cover. Streaming transcription on a budget belongs on Deepgram. Bulk batch transcription belongs on Groq. The OpenAI options make sense when you want everything in one vendor or when Whisper's specific accuracy profile is what your eval needs.
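At fleet scale those per-minute gaps compound into real money. A sketch comparing the cited rates at a hypothetical 10,000 audio-hours per month (the volume is illustrative):

```python
# Monthly transcription bill at a hypothetical 10,000 audio-hours/month,
# using the per-minute rates cited in the text.
providers = {
    "GPT-Realtime-Whisper (stream)": 0.017,
    "whisper-1 (batch)":             0.006,
    "Deepgram Nova-3 (stream)":      0.0048,
    "Groq Whisper v3 Turbo (batch)": 0.000667,
}
hours = 10_000
for name, per_min in providers.items():
    monthly = per_min * 60 * hours
    print(f"{name:31s} ${monthly:10,.2f}/month")
```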

Translate at $0.034 per minute is the most aggressive pricing of the three

Realtime-Translate is the one product in this launch where the price is sharper than any DIY assembly. Translating a 10-minute call costs 34 cents. The endpoint accepts 70+ source languages and produces 13 target languages, with the round-trip designed for back-and-forth conversation rather than batch dubbing. Deutsche Telekom is named in OpenAI's launch coverage as an early adopter for multilingual voice work, though OpenAI did not pin them specifically to the Translate endpoint.

The DIY alternative looks like: Deepgram or AssemblyAI STT in the source language, GPT-5.5 or Claude for translation, ElevenLabs Multilingual TTS for the output voice. That stack costs somewhere between $0.30 and $0.60 per minute on real workloads, depending on output verbosity and which TTS voice you pick. Translate is roughly an order of magnitude cheaper for the live use case, and the latency profile is different too: one round-trip to one endpoint, not three serial network calls.

| Translation stack | Cost / 10 min | Cost / hour | Latency profile |
| --- | --- | --- | --- |
| Realtime-Translate | $0.34 | $2.04 | Single round-trip, sub-second |
| Deepgram + GPT-5.5 + ElevenLabs | $3.00-$6.00 | $18-$36 | Three serial calls, 1-3 s |
| Gemini Live API (translation prompt) | ~$0.23 | ~$1.38 | Single round-trip |

Gemini Live actually edges out Translate on raw cost, and is the right pick if you already have Vertex AI procurement in place. Translate wins on language coverage (70 input languages versus Gemini Live's narrower set), and on the dedicated routing tuned for back-and-forth dialogue rather than open-ended generation.
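The hourly column in the table is again just per-minute rates times 60. A small sketch makes the DIY band explicit (per-minute figures taken from the text; Gemini Live's ~$0.023/min is implied by its ~$1.38/hour):

```python
# Hourly cost of the three live-translation routes, from per-minute rates.
# The DIY stack is shown at both ends of its quoted range.
stacks_per_min = {
    "Realtime-Translate":      (0.034, 0.034),
    "DIY (STT + LLM + TTS)":   (0.30, 0.60),
    "Gemini Live (translate)": (0.023, 0.023),
}
for name, (lo, hi) in stacks_per_min.items():
    lo_h, hi_h = lo * 60, hi * 60
    band = f"${lo_h:.2f}" if lo == hi else f"${lo_h:.2f} - ${hi_h:.2f}"
    print(f"{name:24s} {band}/hour")
```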

How Realtime-2 stacks up against the voice-agent vendors

The conversational tier ($18 per hour typical, less with caching) sits in the middle of the voice-agent vendor pricing band, but Realtime-2 is doing more work per minute than the alternatives. ElevenLabs Agents, Vapi, and Retell route audio through a speech-to-text model, then a language model, then a text-to-speech model. Realtime-2 holds all three in a single forward pass and runs GPT-5-class reasoning between turns. The unit comparison is not apples-to-apples.

| Provider / tier | Cost / hour | Architecture | Reasoning depth |
| --- | --- | --- | --- |
| GPT-Realtime-2 (cached) | $6-$8 | Unified speech-to-speech | GPT-5 class, mid-turn |
| GPT-Realtime-2 (typical) | ~$18 | Unified speech-to-speech | GPT-5 class, mid-turn |
| ElevenLabs Agents Premium | $7.20 + LLM fees | STT + LLM + TTS pipeline | Depends on chosen LLM |
| ElevenLabs Agents Standard | $4.80 + LLM fees | STT + LLM + TTS pipeline | Depends on chosen LLM |
| Gemini Live API | ~$1.38 | Unified speech-to-speech | Gemini 2.5/3.x native |
| Anthropic Claude | No native voice API | External STT/TTS only | n/a |

Gemini Live is the cheapest unified speech-to-speech option by a wide margin, and on most general-purpose conversational tasks the gap to Realtime-2 narrows fast. Realtime-2 earns its premium when the use case actually exercises GPT-5-class reasoning during the conversation: financial advisors, complex troubleshooting, interview-style intake. Anthropic remains absent from this category, which is the most surprising structural fact in the voice market right now.

Caching is the dial that decides if Realtime-2 is affordable

Cached input on Realtime-2 sits at $0.40 per million tokens, against $32 uncached. That is an 80x reduction for prefix content that fires every turn: system prompt, instructions, persona, tool definitions, knowledge base. On a real conversation, the prefix accounts for 60-80% of input tokens, so cache hit rate alone moves the per-hour cost from $18 to $6-8 territory.

Cache hits require the prefix to be byte-identical and to be reused within OpenAI's cache window. Voice agents that hot-swap personas, randomize ordering of tool schemas, or paste user-specific context above the system prompt will see cache miss rates close to 100% and pay the headline $18 rate. Voice agents that lock the prefix down and reuse it across users will see hit rates above 80% and pay closer to $6.
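One way to see the caching dial: blend the cached and uncached input prices by prefix share and hit rate. A minimal sketch, assuming a 70% prefix share (inside the 60-80% range above); `effective_input_price` is an illustrative helper, not an API:

```python
# Blended Realtime-2 input token price as a function of cache hit rate.
UNCACHED = 32.00  # USD per 1M uncached input tokens
CACHED = 0.40     # USD per 1M cached prefix tokens

def effective_input_price(prefix_share: float, hit_rate: float) -> float:
    """Blended $/1M input tokens when prefix tokens hit the cache at hit_rate."""
    cached_fraction = prefix_share * hit_rate
    return cached_fraction * CACHED + (1 - cached_fraction) * UNCACHED

for hit in (0.0, 0.5, 0.8, 0.95):
    price = effective_input_price(0.7, hit)
    print(f"hit rate {hit:4.0%}: ${price:6.2f}/1M input tokens")
```

At an 80% hit rate on a 70% prefix, the blended input rate is already less than half the headline $32, which is where the $18-to-$6 swing comes from.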

If you have followed our notes on tokenizer-driven real-world cost, this is a familiar lesson in a new domain. The published per-token price is the ceiling. The realised cost is whatever your engineering team is willing to do to collapse it.

So what should you actually pick?

For conversational voice agents that need real reasoning, Realtime-2 is the right pick, but only if your prefix caches. A voice tutor or a support agent that solves the issue (rather than reading scripts) will exercise GPT-5-class logic between turns, and that is what you pay for. Without prefix caching, expect the $18 per hour line on your invoice; with aggressive caching, $6 to $8 is realistic. Voice products that hot-swap personas per call are paying full freight.

For live multilingual audio, Realtime-Translate wins on language coverage and ties with Gemini Live on cost. If your workload routes language pairs Gemini does not handle, or you want the dedicated translation latency profile, the choice is easy. If you sit on Vertex AI already and your top five language pairs are well-covered, Gemini Live ends up cheaper.

For transcription, Realtime-Whisper only earns its 2.8x markup when streaming latency is non-negotiable. Offline transcription still belongs on whisper-1 at $0.006 per minute, or on Groq Whisper Large v3 Turbo at $0.000667 per minute if you do not need the OpenAI accuracy profile. Artificial Analysis maintains a third-party benchmark that runs the same audio against multiple Whisper-Turbo hosts, which is the best single page for sanity-checking which budget transcription provider matches your accuracy floor.
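The routing logic of this section compresses to a few branches. A sketch with illustrative model identifiers (the lowercase names are assumptions, not confirmed API model strings):

```python
# Map a voice workload to the cost-rational OpenAI endpoint.
# Model name strings are illustrative, not confirmed API identifiers.
def pick_model(needs_reasoning: bool,
               needs_translation: bool,
               needs_streaming: bool) -> str:
    if needs_translation:
        return "gpt-realtime-translate"  # flat $0.034/min, 70+ source languages
    if needs_reasoning:
        return "gpt-realtime-2"          # cache the prefix or pay full freight
    if needs_streaming:
        return "gpt-realtime-whisper"    # streaming premium: $0.017/min
    return "whisper-1"                   # batch transcription, $0.006/min

print(pick_model(False, False, False))  # whisper-1: offline recordings
```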

Sources