Qwen3.5-Omni: the pricing, the audio benchmarks, and whether the architecture hype is real
Released March 30, 2026. Three variants: Plus (30B-A3B MoE), Flash, Light (open weights). Currently free on DashScope during preview. When that ends, expect roughly $0.065/M input for Flash and $0.40/M for Plus - provisional figures from live API measurement, not official yet. It handles real-time voice, video, and images natively, and beats Gemini 3.1 Pro on every audio benchmark Alibaba published.

Image source: Qwen Blog
Three variants, three price points
Plus is the benchmark driver - 30 billion total parameters with 3 billion active per token (MoE architecture, labeled 30B-A3B). Flash is the production default: lighter, lower latency, same 256K context window. Light has open weights on HuggingFace, which is the exception here - Plus and Flash are API-only, a departure from the Apache 2.0 releases that made Qwen2.5-Omni and Qwen3-Omni popular.
The pricing below is provisional. DashScope's official rate card shows the billing structure but has not published final numbers. These figures come from Artificial Analysis measuring live API traffic during free preview.
| Model | Input / 1M | Output / 1M | Context | Weights |
|---|---|---|---|---|
| Qwen3.5-Omni Flash | ~$0.065* | ~$0.26* | 256K | Closed |
| Qwen3.5-Omni Plus | ~$0.40* | ~$4.80* | 256K | Closed |
| Qwen3.5-Omni Light | Self-hosted | Self-hosted | 256K | Open |
| GPT-4o mini | $0.15 | $0.60 | 128K | Closed |
| Gemini 2.5 Flash | $0.15 | $0.60 | 1M | Closed |
| GPT-4o | $2.50 | $10.00 | 128K | Closed |
| Gemini 3.1 Pro | $1.25 | $5.00 | 1M | Closed |
* Provisional figures from Artificial Analysis live measurement during free preview. Official DashScope pricing not yet published. All variants currently free (Singapore endpoint, 1M token quota for new accounts, 90-day validity).
What the audio benchmarks actually show
Alibaba claims state-of-the-art on 215 audio and audio-visual benchmarks. That's a marketing aggregate - not useful for comparison. The individually verifiable scores against Gemini 3.1 Pro tell a clearer story: Plus wins every audio-specific benchmark measured.
| Benchmark | Qwen3.5-Omni Plus | Gemini 3.1 Pro | Winner |
|---|---|---|---|
| MMAU (audio understanding) | 82.2 | 81.1 | Qwen |
| VoiceBench | 93.1 | 88.9 | Qwen |
| LibriSpeech WER clean (%) | 1.11 | 3.36 | Qwen |
| LibriSpeech WER other (%) | 2.23 | 4.41 | Qwen |
| Fleurs top-60 WER avg (%) | 6.55 | 7.32 | Qwen |
| Cantonese ASR WER (%) | 1.95 | 13.40 | Qwen |
| MuchoMusic | 72.4 | 59.6 | Qwen |
The Cantonese number is the one that stands out. A 1.95% word error rate versus Gemini's 13.40% is not a rounding-level difference - it is a nearly 7x gap. Qwen trained on 20 million hours of audio data across 113 languages, and it shows on lower-resource language benchmarks where other models tend to degrade. Speech translation coverage extends to 156 language pairs for speech-to-text translation.
For English-centric transcription, the gap over Gemini is modest. Where it opens up is non-English languages, music-heavy audio (MuchoMusic +21%), and noisy environments (LibriSpeech “other” set). If your application is purely English and low-noise, the difference may not justify a DashScope integration. If you are working across Asian languages or need robust music and noise handling, the benchmark advantage is real.
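For context on what those WER rows measure: word error rate is the standard ASR metric behind LibriSpeech and Fleurs - word-level edit distance (substitutions + deletions + insertions) divided by reference length. A minimal self-contained sketch, not any benchmark's official scorer:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

A 1.95% WER means roughly one word-level error per 51 reference words; 13.40% is one per 7-8 words, which is the difference between usable and unusable transcripts.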
Why the “native” architecture actually matters
Most multimodal models are text models with audio and vision modules grafted on. Input goes through a separate encoder, gets tokenized, passes through the language model, then gets synthesized. A voice-in, voice-out interaction typically crosses 2-3 model hops with latency stacking at each boundary.
The Thinker-Talker architecture works differently. The Thinker is a MoE reasoning core that processes text, image, audio, and video together - not sequentially. It generates reasoning tokens and latent speech tokens in parallel. The Talker is a separate streaming MoE that converts those latent tokens into audio in real time, without waiting for the full text response to finish.
Two practical consequences. First, latency: DashScope reports ~234ms for Flash in streaming tests, competitive with dedicated real-time voice APIs. Second, the decoupled design means you can insert safety filters, RAG lookups, or function calls between the Thinker output and the Talker without stalling the response pipeline. That is harder to do when voice synthesis is end-to-end with the language model.
The ARIA system (Adaptive Rate Interleave Alignment) handles synchronization between text and speech token streams, preventing word drops and mispronunciations in fast streaming. Semantic interruption detection - distinguishing genuine turn-taking from background noise or backchanneling - is also built into this layer.
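The decoupled design can be pictured as a chain of streaming stages. This sketch is purely illustrative - none of the function names below correspond to a public API, and the real Thinker emits latent speech tokens rather than words - but it shows where the insertable middle stage sits:

```python
from typing import Iterator

def thinker(prompt: str) -> Iterator[str]:
    # Stand-in for the Thinker's streaming output (reasoning + latent speech tokens).
    yield from prompt.split()

def safety_filter(tokens: Iterator[str]) -> Iterator[str]:
    # The insertable stage: redaction, RAG lookups, or function calls
    # can run here token-by-token without stalling the pipeline.
    for tok in tokens:
        yield "[redacted]" if tok.lower() == "password" else tok

def talker(tokens: Iterator[str]) -> Iterator[bytes]:
    # Stand-in for the streaming Talker: emits audio chunks as tokens arrive,
    # without waiting for the full response to finish.
    for tok in tokens:
        yield tok.encode()

audio = list(talker(safety_filter(thinker("say the password aloud"))))
```

Because every stage is a generator, audio for the first tokens can be produced while later tokens are still being reasoned about - which is the latency property the end-to-end designs give up.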
Cost at scale vs GPT-4o
These use text token pricing only - the provisional Flash rate ($0.065/$0.26) versus the main alternatives. Audio input billing is separate and final rates have not been published. Use these for relative scale; once official DashScope pricing is live, verify against the rate card.
| Scenario | Volume | Qwen3.5 Flash* | GPT-4o mini | GPT-4o |
|---|---|---|---|---|
| Transcription pipeline | 50M in / 10M out | $5.85 | $13.50 | $225 |
| Document intelligence | 200M in / 40M out | $23.40 | $54 | $900 |
| Multilingual voice app | 500M in / 100M out | $58.50 | $135 | $2,250 |
| Voice bot at scale | 2B in / 400M out | $234 | $540 | $9,000 |
* Qwen3.5-Omni Flash provisional: $0.065/M input + $0.26/M output. GPT-4o mini: $0.15/$0.60. GPT-4o: $2.50/$10.00. Text tokens only - audio tokenization billed separately. Use the TokenCost calculator for your exact numbers.
At the $2B/400M scale, Flash comes in at $234 versus $540 for GPT-4o mini and $9,000 for GPT-4o - an $8,766/month gap against GPT-4o at that volume. The ratio holds at every scale. If the provisional pricing holds close to the official rates, Flash is the cheapest option for multimodal text pipelines among the models compared here.
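The table arithmetic is just rate times volume. A minimal sketch for re-running the comparison with your own volumes, using the provisional rates quoted above (not official pricing):

```python
# Per-million-token rates in USD (input, output).
# Qwen figures are Artificial Analysis estimates, pending official DashScope pricing.
RATES = {
    "qwen3.5-omni-flash": (0.065, 0.26),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Text-token cost in USD for a monthly volume. Audio billing is separate."""
    rate_in, rate_out = RATES[model]
    return input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out

# The voice-bot-at-scale row: 2B input / 400M output per month.
print(monthly_cost("qwen3.5-omni-flash", 2_000_000_000, 400_000_000))  # 234.0
print(monthly_cost("gpt-4o", 2_000_000_000, 400_000_000))              # 9000.0
```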
What to know before building on it
The pricing uncertainty is the main reason to test now but commit later. The free preview gives you eval runway, but the official rate card could land anywhere. Artificial Analysis is measuring live traffic at ~$0.40/$4.80 for Plus and ~$0.065/$0.26 for Flash - treat these as estimates until DashScope publishes final numbers.
The multimodal API is not on OpenRouter. If you route Qwen calls through OpenRouter today, that covers text-only Qwen models. Omni requires a direct DashScope integration. Available endpoints are Singapore (with 1M free-tier quota), US Virginia, Beijing, and Hong Kong. The model IDs are qwen3.5-omni-plus and qwen3.5-omni-flash.
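A direct integration would look roughly like this. The model ID is from the release; the base URL and the multimodal message shape are assumptions carried over from how earlier Qwen API models expose an OpenAI-compatible endpoint - verify both against the DashScope docs before shipping:

```python
import json

# Assumed Singapore OpenAI-compatible endpoint, per earlier Qwen API models.
SINGAPORE_BASE_URL = "https://dashscope-intl.aliyuncs.com/compatible-mode/v1"

def build_omni_request(text: str, audio_url: str) -> dict:
    """Chat-completions payload mixing a text part and an audio input part."""
    return {
        "model": "qwen3.5-omni-flash",  # or qwen3.5-omni-plus
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                # Message-part format assumed from OpenAI-compatible conventions.
                {"type": "input_audio", "input_audio": {"data": audio_url, "format": "wav"}},
            ],
        }],
        "stream": True,  # streaming is the expected mode for voice workloads
    }

payload = build_omni_request("Transcribe this clip.", "https://example.com/clip.wav")
print(json.dumps(payload, indent=2))
```

POST this to `{SINGAPORE_BASE_URL}/chat/completions` with your DashScope API key as a bearer token, or pass the same base URL and model ID to an OpenAI-compatible client library.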
Worth noting: previous generations - Qwen2.5-Omni and Qwen3-Omni - were Apache 2.0 with full weights on HuggingFace. This time, Plus and Flash are API-only. Only Light gets open weights. If self-hosting is part of your deployment requirements, the Qwen3-Omni-30B-A3B-Instruct model is the most capable Apache 2.0 option in this family right now.
Voice cloning works with a 10-60 second sample. Tool calling and JSON mode are both confirmed. Real-time streaming voice uses the WebSocket Realtime API on DashScope - separate from the standard completions endpoint. Function calls work through the normal chat completions path with text output.
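Since tool calling goes through the standard chat completions path, the tool definitions should follow the familiar OpenAI-style schema that Qwen's compatible mode already accepts for text models. A sketch with a hypothetical tool (the weather function is illustrative, not from the release):

```python
# OpenAI-style tools array; pass alongside messages in the chat completions request.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
```

The function-call result comes back as text output, per the article - so the same tools array should work whether the request originated as typed text or transcribed speech.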
The short version
Qwen3.5-Omni Plus beats Gemini 3.1 Pro on every audio benchmark measured, with the most dramatic gap in low-resource languages. The Thinker-Talker architecture is not just marketing - the decoupled design produces lower latency and gives you a place to insert logic between reasoning and speech. At provisional Flash pricing, it would be the cheapest option for multimodal text pipelines among the models compared here.
What gives us pause: pricing is officially unknown, the multimodal API is not on OpenRouter, and Plus and Flash are closed-source for the first time in the Qwen Omni line. For multilingual voice at scale, the eval results are strong enough to test seriously. For pure text workloads with no audio requirement, there are options with clearer pricing and better Western cloud availability.
Sources
- Qwen Blog: Qwen3.5-Omni - Scaling Up, Toward Native Omni-Modal AGI
- Artificial Analysis: Qwen3.5-Omni Plus providers and pricing
- Alibaba Cloud Model Studio: pricing
- MarkTechPost: Alibaba Qwen Team Releases Qwen3.5-Omni
- The Decoder: Qwen3.5-Omni learned to write code from spoken instructions and video
- Winbuzzer: Alibaba keeps Qwen3.5-Omni AI models closed
- Analytics Vidhya: Qwen3.5-Omni benchmarks