Model Release · May 12, 2026 · 10 min read

OpenAI's three new voice models price three different ways. Here is what an hour actually costs.

GPT-Realtime-2 bills per million tokens, split between audio and text. Realtime-Translate bills a flat $0.034 per audio minute. The Whisper successor bills $0.017 per minute, 2.8 times what legacy whisper-1 still costs. Treating these as one stack is the fastest way to get the unit economics wrong, and the five-day-old launch left plenty of coverage doing exactly that.


Photo by Sindre N. Aalberg on Unsplash

On May 7, 2026, OpenAI dropped three voice models into the Realtime API at once. GPT-Realtime-2 is the conversational successor with GPT-5-class reasoning running mid-turn. Realtime-Translate is a new endpoint for live multilingual audio at a flat per-minute price. Realtime-Whisper is the streaming follow-up to the four-year-old whisper-1, and it is the most expensive of the three on a per-minute basis. The pricing surfaces are not consistent across the three, which makes the cost decision for any voice product harder than it looks at first glance.

Three models, three meters

The first thing to do is map each model to its actual billing dimension. Token rates, per-minute rates, and cached prefix rates all coexist in the same product family.

| Model | Billing unit | Headline rate | What it is for |
| --- | --- | --- | --- |
| GPT-Realtime-2 | Per 1M tokens (audio + text split) | $32 in / $64 out (audio); $0.40 cached | Two-way conversation with reasoning |
| GPT-Realtime-Translate | Per audio minute (flat) | $0.034/min | Live speech-to-speech translation |
| GPT-Realtime-Whisper | Per audio minute (streaming) | $0.017/min | Live transcription, captions, notes |
| whisper-1 (legacy) | Per audio minute (batch) | $0.006/min | File-based offline transcription |

The token math on Realtime-2 is the source of most confusion. $32 per million audio input tokens sounds like text pricing, but audio tokens are dense. Roughly 600 audio tokens encode one minute of input speech, so the bare audio-in rate is about $0.019 per minute. Generated audio is denser still, roughly 1,200 tokens per minute, so audio-out at $64 per million works out to about $0.077 per minute. OpenAI's quoted "$0.30 per minute typical" figure bundles those audio rates with the text-mode reasoning passes that fire in the middle of a turn. The text pass is where most of the dollars actually go.
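Under those density assumptions (roughly 600 tokens per minute of input audio and 1,200 per minute of generated audio; the density figures are estimates, not published constants), the per-minute rates fall out of simple arithmetic:

```python
# Back-of-envelope per-minute audio rates for GPT-Realtime-2.
# Token densities are estimates, not published constants.
AUDIO_IN_PER_1M = 32.00    # USD per 1M audio input tokens
AUDIO_OUT_PER_1M = 64.00   # USD per 1M audio output tokens
TOKENS_PER_MIN_IN = 600    # est. tokens per minute of input speech
TOKENS_PER_MIN_OUT = 1200  # est. tokens per minute of generated audio

audio_in_per_min = AUDIO_IN_PER_1M * TOKENS_PER_MIN_IN / 1_000_000
audio_out_per_min = AUDIO_OUT_PER_1M * TOKENS_PER_MIN_OUT / 1_000_000

print(f"audio in:  ${audio_in_per_min:.4f}/min")   # ~$0.019
print(f"audio out: ${audio_out_per_min:.4f}/min")  # ~$0.077
```

The gap between these bare rates and the ~$0.30 per minute typical figure is the text-mode reasoning spend.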

One hour of voice, priced four ways

Here is what a single hour of voice work costs across the four OpenAI surfaces. The workloads are deliberately picked to match common product shapes: an agent that holds a real conversation, a translator inserted between two speakers, a transcription service that streams captions, and a transcription job that processes recordings overnight.

| Workload (1 hour) | Best OpenAI model | Hourly cost | Notes |
| --- | --- | --- | --- |
| Conversational voice agent | GPT-Realtime-2 | ~$18.00 | Typical mix; can drop to $6-8 with caching |
| Live two-way translation | Realtime-Translate | $2.04 | Flat rate, regardless of language pair |
| Streaming captions, meeting notes | Realtime-Whisper | $1.02 | Low latency, words appear while spoken |
| Offline transcription (recordings) | whisper-1 (legacy) | $0.36 | Still GA; cheapest OpenAI option |

The 50x spread between conversational agent ($18) and offline transcription ($0.36) is the kind of math that breaks naive forecasts. A voice product that bundles "all of the above" on Realtime-2 will burn cash on transcription that could have run on whisper-1 for a fraction of the price. Picking the right model per task matters more than picking the right family.
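As a sanity check, the hourly figures above are just the per-minute rates times 60 (the Realtime-2 row uses OpenAI's ~$0.30 per minute typical figure):

```python
# One hour of audio priced on each OpenAI surface,
# using the per-minute rates quoted in the text.
rates_per_min = {
    "GPT-Realtime-2 (typical)": 0.30,
    "Realtime-Translate":       0.034,
    "Realtime-Whisper":         0.017,
    "whisper-1 (legacy)":       0.006,
}
for model, rate in rates_per_min.items():
    print(f"{model:26s} ${rate * 60:6.2f}/hour")
```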

Why the new Whisper costs 2.8x what whisper-1 still costs

At $0.017 per minute, GPT-Realtime-Whisper is 2.83 times more expensive than the four-year-old whisper-1 at $0.006 per minute. OpenAI did not retire whisper-1; both are live in the API simultaneously. The price gap is entirely the streaming premium. The new model produces tokens as they are spoken; the legacy model produces tokens after the file finishes processing.

For products that show captions in real time (video calls, accessibility overlays, voice memos that transcribe while the user is still talking), the new model is the only OpenAI option that meets the latency target. For products that ingest finished audio files and produce a transcript a minute later (podcast transcription, meeting recordings, batch voicemail), the legacy model is still the rational choice. We are in the unusual position of recommending the four-year-old endpoint as the cost-optimal pick for a real workload.

For comparison, Deepgram Nova-3 streams at $0.0048 per minute (English PAYG) and Groq runs Whisper Large v3 Turbo in batch mode at roughly $0.000667 per minute. Both are cheaper than either OpenAI option for the workloads they each cover. Streaming transcription on a budget belongs on Deepgram. Bulk batch transcription belongs on Groq. The OpenAI options make sense when you want everything in one vendor or when Whisper's specific accuracy profile is what your eval needs.
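At fleet scale those per-minute gaps compound into real money. A sketch comparing the cited rates at a hypothetical 10,000 audio-hours per month (the volume is illustrative):

```python
# Monthly transcription bill at a hypothetical 10,000 audio-hours/month,
# using the per-minute rates cited in the text.
providers = {
    "GPT-Realtime-Whisper (stream)": 0.017,
    "whisper-1 (batch)":             0.006,
    "Deepgram Nova-3 (stream)":      0.0048,
    "Groq Whisper v3 Turbo (batch)": 0.000667,
}
hours = 10_000
for name, per_min in providers.items():
    monthly = per_min * 60 * hours
    print(f"{name:31s} ${monthly:10,.2f}/month")
```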

Translate at $0.034 per minute is the most aggressive pricing of the three

Realtime-Translate is the one product in this launch where the price is sharper than any DIY assembly. Translating a 10-minute call costs 34 cents. The endpoint accepts 70+ source languages and produces 13 target languages, with the round-trip designed for back-and-forth conversation rather than batch dubbing. Deutsche Telekom is named in OpenAI's launch coverage as an early adopter for multilingual voice work, though OpenAI did not pin them specifically to the Translate endpoint.

The DIY alternative looks like: Deepgram or AssemblyAI STT in the source language, GPT-5.5 or Claude for translation, ElevenLabs Multilingual TTS for the output voice. That stack costs somewhere between $0.30 and $0.60 per minute on real workloads, depending on output verbosity and which TTS voice you pick. Translate is roughly an order of magnitude cheaper for the live use case, and the latency profile is different too: one round-trip to one endpoint, not three serial network calls.

| Translation stack | Cost / 10 min | Cost / hour | Latency profile |
| --- | --- | --- | --- |
| Realtime-Translate | $0.34 | $2.04 | Single round-trip, sub-second |
| Deepgram + GPT-5.5 + ElevenLabs | $3.00-$6.00 | $18-$36 | Three serial calls, 1-3 s |
| Gemini Live API (translation prompt) | ~$0.23 | ~$1.38 | Single round-trip |

Gemini Live actually edges out Translate on raw cost, and is the right pick if you already have Vertex AI procurement in place. Translate wins on language coverage (70 input languages versus Gemini Live's narrower set), and on the dedicated routing tuned for back-and-forth dialogue rather than open-ended generation.
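The hourly column in the table is again just per-minute rates times 60. A small sketch makes the DIY band explicit (per-minute figures taken from the text; Gemini Live's ~$0.023/min is implied by its ~$1.38/hour):

```python
# Hourly cost of the three live-translation routes, from per-minute rates.
# The DIY stack is shown at both ends of its quoted range.
stacks_per_min = {
    "Realtime-Translate":      (0.034, 0.034),
    "DIY (STT + LLM + TTS)":   (0.30, 0.60),
    "Gemini Live (translate)": (0.023, 0.023),
}
for name, (lo, hi) in stacks_per_min.items():
    lo_h, hi_h = lo * 60, hi * 60
    band = f"${lo_h:.2f}" if lo == hi else f"${lo_h:.2f} - ${hi_h:.2f}"
    print(f"{name:24s} {band}/hour")
```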

How Realtime-2 stacks up against the voice-agent vendors

The conversational tier ($18 per hour typical, less with caching) sits in the middle of the voice-agent vendor pricing band, but Realtime-2 is doing more work per minute than the alternatives. ElevenLabs Agents, Vapi, and Retell route audio through a speech-to-text model, then a language model, then a text-to-speech model. Realtime-2 holds all three in a single forward pass and runs GPT-5-class reasoning between turns. The unit comparison is not apples-to-apples.

| Provider / tier | Cost / hour | Architecture | Reasoning depth |
| --- | --- | --- | --- |
| GPT-Realtime-2 (cached) | $6-$8 | Unified speech-to-speech | GPT-5 class, mid-turn |
| GPT-Realtime-2 (typical) | ~$18 | Unified speech-to-speech | GPT-5 class, mid-turn |
| ElevenLabs Agents Premium | $7.20 + LLM fees | STT + LLM + TTS pipeline | Depends on chosen LLM |
| ElevenLabs Agents Standard | $4.80 + LLM fees | STT + LLM + TTS pipeline | Depends on chosen LLM |
| Gemini Live API | ~$1.38 | Unified speech-to-speech | Gemini 2.5/3.x native |
| Anthropic Claude | No native voice API | External STT/TTS only | n/a |

Gemini Live is the cheapest unified speech-to-speech option by a wide margin, and on most general-purpose conversational tasks the gap to Realtime-2 narrows fast. Realtime-2 earns its premium when the use case actually exercises GPT-5-class reasoning during the conversation: financial advisors, complex troubleshooting, interview-style intake. Anthropic remains absent from this category, which is the most surprising structural fact in the voice market right now.

Caching is the dial that decides if Realtime-2 is affordable

Cached input on Realtime-2 sits at $0.40 per million tokens, against $32 uncached. That is an 80x reduction for prefix content that fires every turn: system prompt, instructions, persona, tool definitions, knowledge base. On a real conversation, the prefix accounts for 60-80% of input tokens, so cache hit rate alone moves the per-hour cost from $18 to $6-8 territory.

Cache hits require the prefix to be byte-identical and to be reused within OpenAI's cache window. Voice agents that hot-swap personas, randomize ordering of tool schemas, or paste user-specific context above the system prompt will see cache miss rates close to 100% and pay the headline $18 rate. Voice agents that lock the prefix down and reuse it across users will see hit rates above 80% and pay closer to $6.
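One way to see the caching dial: blend the cached and uncached input prices by prefix share and hit rate. A minimal sketch, assuming a 70% prefix share (inside the 60-80% range above); `effective_input_price` is an illustrative helper, not an API:

```python
# Blended Realtime-2 input token price as a function of cache hit rate.
UNCACHED = 32.00  # USD per 1M uncached input tokens
CACHED = 0.40     # USD per 1M cached prefix tokens

def effective_input_price(prefix_share: float, hit_rate: float) -> float:
    """Blended $/1M input tokens when prefix tokens hit the cache at hit_rate."""
    cached_fraction = prefix_share * hit_rate
    return cached_fraction * CACHED + (1 - cached_fraction) * UNCACHED

for hit in (0.0, 0.5, 0.8, 0.95):
    price = effective_input_price(0.7, hit)
    print(f"hit rate {hit:4.0%}: ${price:6.2f}/1M input tokens")
```

At an 80% hit rate on a 70% prefix, the blended input rate is already less than half the headline $32, which is where the $18-to-$6 swing comes from.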

If you have followed our notes on tokenizer-driven real-world cost, this is a familiar lesson in a new domain. The published per-token price is the ceiling. The realised cost is whatever your engineering team is willing to do to collapse it.

So what should you actually pick?

For conversational voice agents that need real reasoning, Realtime-2 is the right pick, but only if your prefix caches. A voice tutor or a support agent that solves the issue (rather than reading scripts) will exercise GPT-5-class logic between turns, and that is what you pay for. Without prefix caching, expect the $18 per hour line on your invoice; with aggressive caching, $6 to $8 is realistic. Voice products that hot-swap personas per call are paying full freight.

For live multilingual audio, Realtime-Translate wins on language coverage and ties with Gemini Live on cost. If your workload routes language pairs Gemini does not handle, or you want the dedicated translation latency profile, the choice is easy. If you sit on Vertex AI already and your top five language pairs are well-covered, Gemini Live ends up cheaper.

For transcription, Realtime-Whisper only earns its 2.8x markup when streaming latency is non-negotiable. Offline transcription still belongs on whisper-1 at $0.006 per minute, or on Groq Whisper Large v3 Turbo at $0.000667 per minute if you do not need the OpenAI accuracy profile. Artificial Analysis maintains a third-party benchmark that runs the same audio against multiple Whisper-Turbo hosts, which is the best single page for sanity-checking which budget transcription provider matches your accuracy floor.
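The routing logic of this section compresses to a few branches. A sketch with illustrative model identifiers (the lowercase names are assumptions, not confirmed API model strings):

```python
# Map a voice workload to the cost-rational OpenAI endpoint.
# Model name strings are illustrative, not confirmed API identifiers.
def pick_model(needs_reasoning: bool,
               needs_translation: bool,
               needs_streaming: bool) -> str:
    if needs_translation:
        return "gpt-realtime-translate"  # flat $0.034/min, 70+ source languages
    if needs_reasoning:
        return "gpt-realtime-2"          # cache the prefix or pay full freight
    if needs_streaming:
        return "gpt-realtime-whisper"    # streaming premium: $0.017/min
    return "whisper-1"                   # batch transcription, $0.006/min

print(pick_model(False, False, False))  # whisper-1: offline recordings
```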

Sources