Model Release · April 7, 2026 · 7 min read

Microsoft MAI models: what MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 actually cost

Microsoft released three proprietary foundation models on April 2. They're the first built entirely inside Microsoft rather than licensed from OpenAI. The pricing is more interesting than the benchmarks suggest.

[Image: Microsoft MAI models announcement, April 2026. Image source: Microsoft AI]

TL;DR

MAI-Transcribe-1 costs $0.36/hr batch (matches Whisper, not cheaper), hits 3.88% WER across 25 languages, and has no real-time streaming yet. MAI-Voice-1 is $22/1M characters - between OpenAI's TTS Standard ($15) and HD ($30), with a broad voice catalog and cloning from a short audio sample. MAI-Image-2 runs $5/$33 per million tokens (roughly $0.034 per 1024x1024 image) and ranked in the top tier of the Arena.ai image leaderboard at launch. None of them are the cheapest option in their category. What's new is Microsoft building these in-house at all.

Why these models exist

Microsoft has spent years reselling OpenAI models through Azure. That arrangement made sense when OpenAI had a clear capability lead, but the gap has narrowed. The MAI (Microsoft AI) group was formed in 2025 specifically to build proprietary models. These three are the group's first public output, produced within its first six months.

All three target categories where OpenAI already has established products: Whisper for transcription, TTS for voice, DALL-E and GPT Image for images. Microsoft is building parity, not a leapfrog. The point is reducing dependence on a supplier they also compete with.

They went live in Microsoft Foundry (Azure AI Foundry) and the MAI Playground at launch. MAI-Transcribe-1 already powers Copilot voice mode. MAI-Voice-1 runs Copilot Audio Expressions and Copilot Podcasts. MAI-Image-2 is in Copilot now and rolling out to Bing Image Creator and PowerPoint.

MAI-Transcribe-1: accuracy vs language coverage

The accuracy numbers are solid. MAI-Transcribe-1 hits a 3.88% word error rate (WER) on the FLEURS benchmark across 25 languages. That beats Whisper-large-v3 (7.6% WER), OpenAI GPT-Transcribe (4.2% WER), and Scribe v2 on the same benchmark. If you're working in those 25 languages and accuracy is the main concern, these numbers make it the most accurate batch transcription option currently available.

25 languages is the catch. Whisper supports 99. Azure's own existing Speech service supports 140+. Real-time streaming is not available at launch - batch only. No speaker diarization, no contextual biasing for domain-specific terminology. Microsoft says those are coming.

| Service | Price / hr | WER (FLEURS) | Languages | Real-time |
|---|---|---|---|---|
| MAI-Transcribe-1 | $0.36 | 3.88% | 25 | No (coming) |
| OpenAI Whisper | $0.36 | 7.6% | 99 | No |
| OpenAI GPT-Transcribe | $0.36 | 4.2% | 99 | Yes |
| OpenAI GPT-4o Mini Transcribe | $0.18 | - | 57 | Yes |
| Google Cloud STT (Batch) | $0.24 | - | 125+ | No |
| Google Cloud STT (Standard) | $0.96 | - | 125+ | Yes |
| Azure Speech (Batch) | $0.36 | - | 140+ | No |

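WER figures are Microsoft's reported results on the FLEURS benchmark (25 languages); a dash means no comparable figure was reported. GPT-4o Mini Transcribe is billed at $0.003/min, which works out to $0.18/hr. Prices from provider pages, retrieved April 7, 2026.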
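The per-hour prices in the table can be turned into a monthly estimate with a few lines of Python. This is a rough sketch: the prices are the ones listed above, while the function name `monthly_transcription_cost` and the 5,000-hour workload are illustrative assumptions, not part of any provider's SDK.

```python
# Per-hour prices from the table above (USD, retrieved April 7, 2026).
PRICE_PER_HOUR = {
    "MAI-Transcribe-1": 0.36,
    "OpenAI Whisper": 0.36,
    "OpenAI GPT-4o Mini Transcribe": 0.003 * 60,  # billed at $0.003/min
    "Google Cloud STT (Batch)": 0.24,
}


def monthly_transcription_cost(service: str, audio_hours: float) -> float:
    """Estimate the monthly cost in USD for a given volume of audio."""
    return round(PRICE_PER_HOUR[service] * audio_hours, 2)


if __name__ == "__main__":
    hours = 5_000  # hypothetical monthly workload
    for name in PRICE_PER_HOUR:
        print(f"{name}: ${monthly_transcription_cost(name, hours):,.2f}")
```

At equal per-hour prices, the choice between MAI-Transcribe-1 and Whisper comes down to accuracy and language coverage rather than cost.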

MAI-Voice-1: $22 per million characters

At $22 per million characters, MAI-Voice-1 sits between OpenAI TTS Standard ($15) and HD ($30). You're not getting the cheapest option or the most established one. What you do get: a curated library of preset voices, voice cloning from a short audio sample, and generation fast enough to produce 60 seconds of audio in under 1 second.

The voice cloning gate matters. Creating custom synthetic voices requires going through Microsoft's responsible AI review process. It's not a self-serve API call. For pre-built voices from the catalog it's standard API access, but anything that clones a real speaker identity requires approval.

| Service | Price / 1M chars | Notes |
|---|---|---|
| OpenAI TTS Standard | $15 | 6 voices |
| Azure Neural TTS | $16 | 140+ languages, existing Azure service |
| Google Cloud TTS (WaveNet) | $16 | 380+ voices |
| MAI-Voice-1 | $22 | Curated voice library, voice cloning |
| OpenAI TTS HD | $30 | 6 voices, higher quality |
| ElevenLabs Flash (API) | $60 | ~3,000 voices, voice cloning |
| ElevenLabs Multilingual v2/v3 | $120 | Highest quality tier |

If you need voice cloning without ElevenLabs pricing and you're already in Azure, MAI-Voice-1 is a reasonable choice. If minimizing cost per character is the priority, OpenAI TTS Standard at $15 is about 32% cheaper.

MAI-Image-2: token pricing, about $0.034 per image

MAI-Image-2 uses token-based pricing: $5 per million input tokens (your prompt) plus $33 per million output tokens (the image). A 1024x1024 image is roughly 1,024 output tokens, which puts a single image at around $0.034. Longer prompts add minimal cost on the input side.
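The per-image figure follows directly from the two token prices. Here is a minimal sketch of that arithmetic, assuming the article's rough figure of 1,024 output tokens for a 1024x1024 image; the function name `image_cost` and the 50-token prompt are illustrative assumptions, not part of Microsoft's API.

```python
INPUT_PRICE_PER_M = 5.00    # USD per 1M input (prompt) tokens
OUTPUT_PRICE_PER_M = 33.00  # USD per 1M output (image) tokens


def image_cost(prompt_tokens: int, output_tokens: int = 1024) -> float:
    """Estimate the USD cost of one generated image."""
    input_cost = prompt_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical 50-token prompt: the input side adds $0.00025.
    print(f"${image_cost(prompt_tokens=50):.4f} per image")
```

The output tokens dominate: even a prompt ten times longer would add only a fraction of a cent per image.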

| Service | Per 1024x1024 | Arena.ai rank | Notes |
|---|---|---|---|
| OpenAI GPT Image 1 (Low) | $0.011 | #1 (family) | Lower quality tier |
| MAI-Image-2 | ~$0.034 | Top-5 at launch | Microsoft AI Foundry |
| DALL-E 3 Standard | $0.040 | - | Legacy, being phased out |
| Stability AI Large Turbo | $0.040 | - | Fastest Stability model |
| OpenAI GPT Image 1 (Medium) | $0.042 | #1 (family) | Mid quality tier |
| DALL-E 3 HD | $0.080 | - | Larger resolution |
| OpenAI GPT Image 1 (High) | $0.167 | #1 (family) | Highest quality tier |

The launch rankings on the Arena.ai image leaderboard placed MAI-Image-2 in the top five among commercial image models, though leaderboard positions shift as more votes come in. The models above it (GPT Image 1 family, Gemini 3.1 Flash) all support multiple aspect ratios, while MAI-Image-2 is limited to square outputs for now, which gives them a wider range of practical uses.

Where it does well: in-image text rendering scored +115 points over its predecessor on Arena's text rendering sub-category. If your use case involves generating images with readable text in them - product mockups, infographics, slides - that's worth testing. Readable text in generated images has been one of the harder unsolved problems in this space.

What this costs vs alternatives

Voice generation: 100M characters / month
MAI: $2,200/mo
Alternatives: $1,500/mo with OpenAI TTS Standard
OpenAI Standard is about a third cheaper. ElevenLabs would run $6,000-12,000/mo for the same volume.

Image generation: 100,000 images / month
MAI: $3,400/mo (~$0.034/image)
Alternatives: $4,200/mo with GPT Image 1 Medium; $1,100/mo with GPT Image 1 Low
MAI-Image-2 is about 19% cheaper than GPT Image 1 Medium, but GPT Image 1 Low costs about a third of what MAI-Image-2 does.
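The monthly figures and percentages above can be reproduced with a short script. This is a sketch using the unit prices quoted in this article; the helper names `monthly_cost` and `percent_cheaper` are hypothetical, and the volumes are the two example workloads, so swap in your own.

```python
def monthly_cost(unit_price: float, units: float) -> float:
    """Cost in USD for `units` billed at `unit_price` each."""
    return unit_price * units


def percent_cheaper(cheaper: float, pricier: float) -> float:
    """How much cheaper the first cost is, as a percentage of the second."""
    return (pricier - cheaper) / pricier * 100


if __name__ == "__main__":
    # Voice: 100M characters/month, priced per 1M characters.
    mai_voice = monthly_cost(22, 100)
    openai_std = monthly_cost(15, 100)
    print(f"Voice: MAI ${mai_voice:,.0f} vs OpenAI Standard ${openai_std:,.0f} "
          f"({percent_cheaper(openai_std, mai_voice):.0f}% cheaper)")

    # Images: 100,000 images/month, priced per 1024x1024 image.
    mai_image = monthly_cost(0.034, 100_000)
    gpt_medium = monthly_cost(0.042, 100_000)
    print(f"Images: MAI ${mai_image:,.0f} vs GPT Image 1 Medium ${gpt_medium:,.0f} "
          f"({percent_cheaper(mai_image, gpt_medium):.0f}% cheaper)")
```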

Run your own workload through the cost calculator.

When these make sense (and when they don't)

Use them if you're already on Azure Foundry, need transcription accuracy in supported languages rather than breadth, want voice cloning without ElevenLabs pricing, or are generating images that need readable text in the output. Avoiding single-vendor lock-in to OpenAI is also a legitimate reason.

Skip them if you need transcription in more than 25 languages (use Whisper), real-time streaming (not available yet), the lowest TTS cost per character (OpenAI Standard at $15 beats $22), non-square image outputs (every competitor has them), or if you're not running on Azure.

The broader point: these models are not disruptive on price. They're Microsoft's proof of concept for building foundation models in-house. That changes the supply chain situation for Azure customers long-term. Six months from now they'll likely have real-time transcription, speaker diarization, rectangular image outputs, and pricing shaped by the competition.

Compare all speech, image, and text model pricing on our pricing page.
