Qwen3.5-Omni: the pricing, the audio benchmarks, and whether the architecture hype is real
Released March 30, 2026. Three variants: Plus (30B-A3B MoE), Flash, Light (open weights). Currently free on DashScope during preview. When that ends, expect roughly $0.065/M input for Flash and $0.40/M for Plus - provisional figures from live API measurement, not official yet. It handles real-time voice, video, and images natively, and beats Gemini 3.1 Pro on every audio benchmark Alibaba published.

Image source: Qwen Blog
Three variants, three price points
Plus is the benchmark driver - 30 billion total parameters with 3 billion active per token (MoE architecture, labeled 30B-A3B). Flash is the production default: lighter, lower latency, same 256K context window. Light has open weights on HuggingFace, which is the exception here - Plus and Flash are API-only, a departure from the Apache 2.0 releases that made Qwen2.5-Omni and Qwen3-Omni popular.
The pricing below is provisional. DashScope's official rate card shows the billing structure but has not published final numbers. These figures come from Artificial Analysis measuring live API traffic during free preview.
| Model | Input / 1M | Output / 1M | Context | Weights |
|---|---|---|---|---|
| Qwen3.5-Omni Flash | ~$0.065* | ~$0.26* | 256K | Closed |
| Qwen3.5-Omni Plus | ~$0.40* | ~$4.80* | 256K | Closed |
| Qwen3.5-Omni Light | Self-hosted | Self-hosted | 256K | Open |
| GPT-4o mini | $0.15 | $0.60 | 128K | Closed |
| Gemini 2.5 Flash | $0.15 | $0.60 | 1M | Closed |
| GPT-4o | $2.50 | $10.00 | 128K | Closed |
| Gemini 3.1 Pro | $1.25 | $5.00 | 1M | Closed |
* Provisional figures from Artificial Analysis live measurement during free preview. Official DashScope pricing not yet published. All variants currently free (Singapore endpoint, 1M token quota for new accounts, 90-day validity).
What the audio benchmarks actually show
Alibaba claims state-of-the-art on 215 audio and audio-visual benchmarks. That's a marketing aggregate - not useful for comparison. The individually verifiable scores against Gemini 3.1 Pro tell a clearer story: Plus wins every audio-specific benchmark measured.
| Benchmark | Qwen3.5-Omni Plus | Gemini 3.1 Pro | Winner |
|---|---|---|---|
| MMAU (audio understanding) | 82.2 | 81.1 | Qwen |
| VoiceBench | 93.1 | 88.9 | Qwen |
| LibriSpeech WER clean (%) | 1.11 | 3.36 | Qwen |
| LibriSpeech WER other (%) | 2.23 | 4.41 | Qwen |
| Fleurs top-60 WER avg (%) | 6.55 | 7.32 | Qwen |
| Cantonese ASR WER (%) | 1.95 | 13.40 | Qwen |
| MuchoMusic | 72.4 | 59.6 | Qwen |
The Cantonese number is the one that stands out. A 1.95% word error rate versus Gemini's 13.40% is not a rounding-level difference - it is a nearly 7x gap. Qwen trained on 20 million hours of audio data across 113 languages, and it shows on lower-resource language benchmarks where other models tend to degrade. Speech translation coverage extends to 156 language pairs for speech-to-text translation.
For English-centric transcription, the gap over Gemini is modest. Where it opens up is non-English languages, music-heavy audio (MuchoMusic +21%), and noisy environments (LibriSpeech “other” set). If your application is purely English and low-noise, the difference may not justify a DashScope integration. If you are working across Asian languages or need robust music and noise handling, the benchmark advantage is real.
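For context on what those WER rows measure: word error rate is the standard ASR metric behind LibriSpeech and Fleurs - word-level edit distance (substitutions + deletions + insertions) divided by reference length. A minimal self-contained sketch, not any benchmark's official scorer:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

A 1.95% WER means roughly one word-level error per 51 reference words; 13.40% is one per 7-8 words, which is the difference between usable and unusable transcripts.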
Why the “native” architecture actually matters
Most multimodal models are text models with audio and vision modules grafted on. Input goes through a separate encoder, gets tokenized, passes through the language model, then gets synthesized. A voice-in, voice-out interaction typically crosses 2-3 model hops with latency stacking at each boundary.
The Thinker-Talker architecture works differently. The Thinker is a MoE reasoning core that processes text, image, audio, and video together - not sequentially. It generates reasoning tokens and latent speech tokens in parallel. The Talker is a separate streaming MoE that converts those latent tokens into audio in real time, without waiting for the full text response to finish.
Two practical consequences. First, latency: DashScope reports ~234ms for Flash in streaming tests, competitive with dedicated real-time voice APIs. Second, the decoupled design means you can insert safety filters, RAG lookups, or function calls between the Thinker output and the Talker without stalling the response pipeline. That is harder to do when voice synthesis is end-to-end with the language model.
The ARIA system (Adaptive Rate Interleave Alignment) handles synchronization between text and speech token streams, preventing word drops and mispronunciations in fast streaming. Semantic interruption detection - distinguishing genuine turn-taking from background noise or backchanneling - is also built into this layer.
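The decoupled design can be pictured as a chain of streaming stages. This sketch is purely illustrative - none of the function names below correspond to a public API, and the real Thinker emits latent speech tokens rather than words - but it shows where the insertable middle stage sits:

```python
from typing import Iterator

def thinker(prompt: str) -> Iterator[str]:
    # Stand-in for the Thinker's streaming output (reasoning + latent speech tokens).
    yield from prompt.split()

def safety_filter(tokens: Iterator[str]) -> Iterator[str]:
    # The insertable stage: redaction, RAG lookups, or function calls
    # can run here token-by-token without stalling the pipeline.
    for tok in tokens:
        yield "[redacted]" if tok.lower() == "password" else tok

def talker(tokens: Iterator[str]) -> Iterator[bytes]:
    # Stand-in for the streaming Talker: emits audio chunks as tokens arrive,
    # without waiting for the full response to finish.
    for tok in tokens:
        yield tok.encode()

audio = list(talker(safety_filter(thinker("say the password aloud"))))
```

Because every stage is a generator, audio for the first tokens can be produced while later tokens are still being reasoned about - which is the latency property the end-to-end designs give up.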
Cost at scale vs GPT-4o
These use text token pricing only - the provisional Flash rate ($0.065/$0.26) versus the main alternatives. Audio input billing is separate and final rates have not been published. Use these for relative scale; once official DashScope pricing is live, verify against the rate card.
| Scenario | Volume | Qwen3.5 Flash* | GPT-4o mini | GPT-4o |
|---|---|---|---|---|
| Transcription pipeline | 50M in / 10M out | $5.85 | $13.50 | $225 |
| Document intelligence | 200M in / 40M out | $23.40 | $54 | $900 |
| Multilingual voice app | 500M in / 100M out | $58.50 | $135 | $2,250 |
| Voice bot at scale | 2B in / 400M out | $234 | $540 | $9,000 |
* Qwen3.5-Omni Flash provisional: $0.065/M input + $0.26/M output. GPT-4o mini: $0.15/$0.60. GPT-4o: $2.50/$10.00. Text tokens only - audio tokenization billed separately. Use the TokenCost calculator for your exact numbers.
At the $2B/400M scale, Flash comes in at $234 versus $540 for GPT-4o mini and $9,000 for GPT-4o - an $8,766/month gap against GPT-4o at that volume. The ratio holds at every scale. If the provisional pricing holds close to the official rates, Flash is the cheapest option for multimodal text pipelines among the models compared here.
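The table arithmetic is just rate times volume. A minimal sketch for re-running the comparison with your own volumes, using the provisional rates quoted above (not official pricing):

```python
# Per-million-token rates in USD (input, output).
# Qwen figures are Artificial Analysis estimates, pending official DashScope pricing.
RATES = {
    "qwen3.5-omni-flash": (0.065, 0.26),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Text-token cost in USD for a monthly volume. Audio billing is separate."""
    rate_in, rate_out = RATES[model]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out

# The voice-bot-at-scale row: 2B input / 400M output per month.
print(monthly_cost("qwen3.5-omni-flash", 2_000_000_000, 400_000_000))  # 234.0
print(monthly_cost("gpt-4o", 2_000_000_000, 400_000_000))              # 9000.0
```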
What to know before building on it
The pricing uncertainty is the main reason to test now but commit later. The free preview gives you eval runway, but the official rate card could land anywhere. Artificial Analysis is measuring live traffic at ~$0.40/$4.80 for Plus and ~$0.065/$0.26 for Flash - treat these as estimates until DashScope publishes final numbers.
The multimodal API is not on OpenRouter. If you route Qwen calls through OpenRouter today, that covers text-only Qwen models. Omni requires a direct DashScope integration. Available endpoints are Singapore (with 1M free-tier quota), US Virginia, Beijing, and Hong Kong. The model IDs are qwen3.5-omni-plus and qwen3.5-omni-flash.
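A direct integration would look roughly like this. The model ID is from the release; the base URL and the multimodal message shape are assumptions carried over from how earlier Qwen API models expose an OpenAI-compatible endpoint - verify both against the DashScope docs before shipping:

```python
import json

# Assumed Singapore OpenAI-compatible endpoint, per earlier Qwen API models.
SINGAPORE_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"

def build_omni_request(text: str, audio_url: str) -> dict:
    """Chat-completions payload mixing a text part and an audio input part."""
    return {
        "model": "qwen3.5-omni-flash",  # or qwen3.5-omni-plus
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                # Message-part format assumed from OpenAI-compatible conventions.
                {"type": "input_audio", "input_audio": {"data": audio_url, "format": "wav"}},
            ],
        }],
        "stream": True,  # streaming is the expected mode for voice workloads
    }

payload = build_omni_request("Transcribe this clip.", "https://example.com/clip.wav")
print(json.dumps(payload, indent=2))
```

POST this to `{SINGAPORE_BASE_URL}/chat/completions` with your DashScope API key as a bearer token, or pass the same base URL and model ID to an OpenAI-compatible client library.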
Worth noting: previous generations - Qwen2.5-Omni and Qwen3-Omni - were Apache 2.0 with full weights on HuggingFace. This time, Plus and Flash are API-only. Only Light gets open weights. If self-hosting is part of your deployment requirements, the Qwen3-Omni-30B-A3B-Instruct model is the most capable Apache 2.0 option in this family right now.
Voice cloning works with a 10-60 second sample. Tool calling and JSON mode are both confirmed. Real-time streaming voice uses the WebSocket Realtime API on DashScope - separate from the standard completions endpoint. Function calls work through the normal chat completions path with text output.
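Since tool calling goes through the standard chat completions path, the tool definitions should follow the familiar OpenAI-style schema that Qwen's compatible mode already accepts for text models. A sketch with a hypothetical tool (the weather function is illustrative, not from the release):

```python
# OpenAI-style tools array; pass alongside messages in the chat completions request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
```

The function-call result comes back as text output, per the article - so the same tools array should work whether the request originated as typed text or transcribed speech.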
The short version
Qwen3.5-Omni Plus beats Gemini 3.1 Pro on every audio benchmark measured, with the most dramatic gap in low-resource languages. The Thinker-Talker architecture is not just marketing - the decoupled design produces lower latency and gives you a place to insert logic between reasoning and speech. At provisional Flash pricing, it would be the cheapest option for multimodal text pipelines among the models compared here.
What gives us pause: pricing is officially unknown, the multimodal API is not on OpenRouter, and Plus and Flash are closed-source for the first time in the Qwen Omni line. For multilingual voice at scale, the eval results are strong enough to test seriously. For pure text workloads with no audio requirement, there are options with clearer pricing and better Western cloud availability.
Sources
- Qwen Blog: Qwen3.5-Omni - Scaling Up, Toward Native Omni-Modal AGI
- Artificial Analysis: Qwen3.5-Omni Plus providers and pricing
- Alibaba Cloud Model Studio: pricing
- MarkTechPost: Alibaba Qwen Team Releases Qwen3.5-Omni
- The Decoder: Qwen3.5-Omni learned to write code from spoken instructions and video
- Winbuzzer: Alibaba keeps Qwen3.5-Omni AI models closed
- Analytics Vidhya: Qwen3.5-Omni benchmarks