Microsoft MAI models: what MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 actually cost
Microsoft released three proprietary foundation models on April 2, 2026. They're the first built entirely inside Microsoft rather than licensed from OpenAI. The pricing is more interesting than the benchmarks suggest.

TL;DR
MAI-Transcribe-1 costs $0.36/hr batch (matches Whisper, not cheaper), hits 3.88% WER across 25 languages, and has no real-time streaming yet. MAI-Voice-1 is $22/1M characters - between OpenAI's TTS Standard ($15) and HD ($30), with a broad voice catalog and cloning from a short audio sample. MAI-Image-2 runs $5/$33 per million tokens (roughly $0.034 per 1024x1024 image) and ranked in the top tier of the Arena.ai image leaderboard at launch. None of them are the cheapest option in their category. What's new is Microsoft building these in-house at all.
Why these models exist
Microsoft has spent years reselling OpenAI models through Azure. That arrangement made sense when OpenAI had a clear capability lead, but the gap has narrowed. The MAI (Microsoft AI) group was formed in 2025 specifically to build proprietary models. These three are that group's first public output, shipped within its first six months.
All three target categories where OpenAI already has established products: Whisper for transcription, TTS for voice, DALL-E and GPT Image for images. Microsoft is aiming for parity here, not a leapfrog. The point is to reduce dependence on a supplier it also competes with.
They went live in Microsoft Foundry (Azure AI Foundry) and the MAI Playground at launch. MAI-Transcribe-1 already powers Copilot voice mode. MAI-Voice-1 runs Copilot Audio Expressions and Copilot Podcasts. MAI-Image-2 is in Copilot now and rolling out to Bing Image Creator and PowerPoint.
MAI-Transcribe-1: accuracy vs language coverage
The accuracy numbers are solid. MAI-Transcribe-1 hits a 3.88% word error rate (WER) on the FLEURS benchmark across 25 languages. That beats Whisper-large-v3 (7.6% WER), OpenAI GPT-Transcribe (4.2% WER), and Scribe v2 on the same benchmark. If you're working in those 25 languages and accuracy is the main concern, this is, by Microsoft's reported numbers, the most accurate batch transcription option among the services compared here.
The catch is the 25-language limit. Whisper supports 99. Azure's existing Speech service supports 140+. Real-time streaming is not available at launch - batch only. There is also no speaker diarization and no contextual biasing for domain-specific terminology. Microsoft says those are coming.
| Service | Price / hr | WER (FLEURS) | Languages | Real-time |
|---|---|---|---|---|
| MAI-Transcribe-1 | $0.36 | 3.88% | 25 | No (coming) |
| OpenAI Whisper | $0.36 | 7.6% | 99 | No |
| OpenAI GPT-Transcribe | $0.36 | 4.2% | 99 | Yes |
| OpenAI GPT-4o Mini Transcribe | $0.18 | - | 57 | Yes |
| Google Cloud STT (Batch) | $0.24 | - | 125+ | No |
| Google Cloud STT (Standard) | $0.96 | - | 125+ | Yes |
| Azure Speech (Batch) | $0.36 | - | 140+ | No |
WER from Microsoft's FLEURS benchmark (25 languages). GPT-4o Mini Transcribe is $0.003/min = $0.18/hr. Prices from provider pages, retrieved April 7, 2026.
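The per-hour figures above are easy to sanity-check and to scale to a workload. The following is a minimal sketch, not a billing tool: the prices are copied from the table, the function names are made up for illustration, and the 1,000-hour workload is a hypothetical. It ignores free tiers, volume discounts, and minimum billing increments, any of which may change the real bill.

```python
# Hourly batch prices in USD, copied from the comparison table above.
PRICE_PER_HOUR = {
    "MAI-Transcribe-1": 0.36,
    "OpenAI Whisper": 0.36,
    "OpenAI GPT-Transcribe": 0.36,
    "OpenAI GPT-4o Mini Transcribe": 0.18,
    "Google Cloud STT (Batch)": 0.24,
    "Google Cloud STT (Standard)": 0.96,
    "Azure Speech (Batch)": 0.36,
}


def per_minute_to_per_hour(price_per_minute: float) -> float:
    """Convert a per-minute list price to the per-hour figure used in the table."""
    return price_per_minute * 60


def transcription_cost(service: str, audio_hours: float) -> float:
    """Estimated list-price cost (USD) of transcribing `audio_hours` of audio."""
    return PRICE_PER_HOUR[service] * audio_hours


if __name__ == "__main__":
    # Hypothetical workload: 1,000 hours of audio per month.
    for name in PRICE_PER_HOUR:
        print(f"{name}: ${transcription_cost(name, 1000):,.2f}")
```

At that volume the spread between the cheapest and most expensive row is a few hundred dollars a month, so for many teams accuracy and language coverage will likely matter more than the hourly rate.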
MAI-Voice-1: $22 per million characters
At $22 per million characters, MAI-Voice-1 is between OpenAI TTS Standard ($15) and HD ($30). You're not getting the cheapest option or the most established one. What you do get: a curated library of preset voices, voice cloning from a short audio sample, and generation speed under 1 second per 60 seconds of audio.
The voice cloning gate matters. Creating custom synthetic voices requires going through Microsoft's responsible AI review process. It's not a self-serve API call. For pre-built voices from the catalog it's standard API access, but anything that clones a real speaker identity requires approval.
| Service | Price / 1M chars | Notes |
|---|---|---|
| OpenAI TTS Standard | $15 | 6 voices |
| Azure Neural TTS | $16 | 140+ languages, existing Azure service |
| Google Cloud TTS (WaveNet) | $16 | 380+ voices |
| MAI-Voice-1 | $22 | Curated voice library, voice cloning |
| OpenAI TTS HD | $30 | 6 voices, higher quality |
| ElevenLabs Flash (API) | $60 | ~3,000 voices, voice cloning |
| ElevenLabs Multilingual v2/v3 | $120 | Highest quality tier |
If you need voice cloning without ElevenLabs pricing and you're already in Azure, MAI-Voice-1 is a reasonable choice. If minimizing cost per character is the priority, OpenAI Standard at $15 is about 32% cheaper per character.
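The price gap between any two TTS rows is simple arithmetic. Here is a short sketch using the per-million-character prices from the table; the helper names and the 500,000-character job are hypothetical, and actual invoices may differ with discounts or minimums.

```python
def tts_cost(price_per_million_chars: float, characters: int) -> float:
    """List-price cost (USD) of synthesizing `characters` characters."""
    return price_per_million_chars * characters / 1_000_000


def percent_cheaper(cheaper_price: float, pricier_price: float) -> float:
    """How much cheaper the first price is, as a percentage of the second."""
    return (pricier_price - cheaper_price) / pricier_price * 100


if __name__ == "__main__":
    chars = 500_000  # hypothetical job size
    print(f"MAI-Voice-1:         ${tts_cost(22, chars):.2f}")
    print(f"OpenAI TTS Standard: ${tts_cost(15, chars):.2f}")
    print(f"Savings: {percent_cheaper(15, 22):.1f}%")
```

The savings figure is measured against the higher price: (22 - 15) / 22 is about 31.8%, which rounds to 32%.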
MAI-Image-2: token pricing, about $0.034 per image
MAI-Image-2 uses token-based pricing: $5 per million input tokens (your prompt) plus $33 per million output tokens (the image). A 1024x1024 image is roughly 1,024 output tokens, which puts a single image at around $0.034. Longer prompts add minimal cost on the input side.
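The per-image estimate can be reproduced from the two token prices. This is a sketch under the article's own assumption that a 1024x1024 image is roughly 1,024 output tokens; the function name and the 50-token prompt are illustrative, and real token counts may differ.

```python
INPUT_PRICE_PER_MILLION = 5.0    # USD per 1M input (prompt) tokens
OUTPUT_PRICE_PER_MILLION = 33.0  # USD per 1M output (image) tokens
OUTPUT_TOKENS_1024 = 1024        # rough figure for one 1024x1024 image


def image_cost(prompt_tokens: int, output_tokens: int = OUTPUT_TOKENS_1024) -> float:
    """Estimated cost (USD) of one image: prompt tokens plus image tokens."""
    input_cost = prompt_tokens * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION
    return (input_cost + output_cost) / 1_000_000


if __name__ == "__main__":
    print(f"Image only:       ${image_cost(0):.4f}")
    print(f"50-token prompt:  ${image_cost(50):.4f}")
```

With these numbers the output side is about $0.0338 and a 50-token prompt adds roughly $0.00025, which is why prompt length barely moves the per-image price.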
| Service | Per 1024x1024 | Arena.ai rank | Notes |
|---|---|---|---|
| OpenAI GPT Image 1 (Low) | $0.011 | #1 (family) | Lower quality tier |
| MAI-Image-2 | ~$0.034 | Top-5 at launch | Available in Microsoft Foundry |
| DALL-E 3 Standard | $0.040 | - | Legacy, being phased out |
| Stability AI Large Turbo | $0.040 | - | Fastest Stability model |
| OpenAI GPT Image 1 (Medium) | $0.042 | #1 (family) | Mid quality tier |
| DALL-E 3 HD | $0.080 | - | Higher-detail quality tier |
| OpenAI GPT Image 1 (High) | $0.167 | #1 (family) | Highest quality tier |
The launch rankings on the Arena.ai image leaderboard placed MAI-Image-2 in the top five among commercial image models, though leaderboard positions shift as more votes come in. The models above it (GPT Image 1 family, Gemini 3.1 Flash) all support multiple aspect ratios, while MAI-Image-2 is limited to square output for now, which gives those models a wider range of practical uses.
Where it does well: in-image text rendering scored +115 points over its predecessor on Arena's text rendering sub-category. If your use case involves generating images with readable text in them - product mockups, infographics, slides - that's worth testing. Readable text in generated images has been one of the harder unsolved problems in this space.
When these make sense (and when they don't)
Use them if you're already on Microsoft Foundry, need transcription accuracy in the supported languages rather than breadth, want voice cloning without ElevenLabs pricing, or are generating images that need readable text in the output. Avoiding single-vendor lock-in with OpenAI is also a legitimate reason.
Skip them if you need transcription in languages outside the supported 25 (use Whisper), real-time streaming (not available yet), the lowest TTS cost per character (OpenAI Standard at $15 beats $22), or non-square image outputs (every competitor has them), or if you're not running on Azure.
The broader point: these models are not disruptive on price. They're Microsoft's proof of concept for building foundation models in-house. That changes the supply chain situation for Azure customers long-term. Six months from now they'll likely have real-time transcription, speaker diarization, rectangular image outputs, and pricing shaped by the competition.
Sources
- Microsoft AI: Three new MAI models in Foundry (April 2, 2026)
- Microsoft AI: MAI-Transcribe-1 announcement (April 2, 2026)
- Azure AI Foundry blog: MAI model technical details
- TechCrunch: Microsoft takes on AI rivals with three new models (April 2, 2026)
- OpenAI API pricing
- Google Cloud Speech-to-Text pricing
- ElevenLabs pricing