TokenCost
Model Release · March 23, 2026 · 7 min read

Mistral Small 4: $0.15 per million input tokens for a multimodal MoE model

Mistral shipped a 119B-parameter model on March 16 that activates only ~6.5B parameters per token. It handles text and images, offers configurable reasoning, and costs 5x less than GPT-5.4 Mini on input. We dug into the pricing, the architecture, and whether the trade-offs make sense.

Mistral Small 4 model announcement

Image source: Mistral AI

TL;DR

  • Pricing: $0.15 / 1M input, $0.60 / 1M output. That's 5x cheaper than GPT-5.4 Mini on input and 7.5x cheaper on output.
  • Architecture: Mixture of Experts with 119B total parameters, ~6.5B active per token (128 experts, 4 active). 256K context window.
  • Multimodal: Native text + image input. Handles OCR, document parsing, visual analysis out of the box.
  • Reasoning: Configurable via reasoning_effort parameter. Set to "none" for fast chat, "high" for step-by-step thinking.
  • Open source: Apache 2.0 license. Self-host with 4x H100 GPUs or use the Mistral API.

What Mistral actually built

Mistral Small 4 is three models jammed into one. It merges Pixtral (their multimodal model), Magistral (reasoning), and Devstral (coding) into a single 119B-parameter MoE model. The API ID is mistral-small-2603.

Only ~6.5B parameters activate on any given token: 128 total experts with 4 active at a time. The result: you get 119B-class capability at roughly 6B-class inference costs. This is the same MoE trick NVIDIA used with Nemotron 3 Super, but Mistral pushes it harder, with a roughly 18:1 ratio of total to active parameters.

Mistral claims 40% faster end-to-end completion time and 3x more requests per second compared to Mistral Small 3. Those are throughput numbers, not quality numbers, but they matter if you're running this at scale.

Pricing breakdown

At $0.15 per million input tokens, Mistral Small 4 is the cheapest multimodal model from a major provider. The only models cheaper on input are Gemini 2.0 Flash-Lite ($0.075) and Mistral's own older Small 3.2 ($0.10), neither of which combines multimodal input with configurable reasoning.

| Model | Input / 1M | Output / 1M | Context | Multimodal |
|---|---|---|---|---|
| Mistral Small 4 | $0.15 | $0.60 | 256K | Yes |
| GPT-5.4 Nano | $0.20 | $1.25 | 400K | Yes |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 1M | Yes |
| DeepSeek V3.2 | $0.28 | $0.42 | 128K | No |
| GPT-5.4 Mini | $0.75 | $4.50 | 400K | Yes |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K | Yes |

Prices from Mistral docs and official provider pricing pages, retrieved March 23, 2026. Check our pricing page for the full list.

Why the MoE architecture matters for your bill

119B parameters sounds expensive to run. 6.5B active parameters doesn't. That gap is the whole point of Mixture of Experts. You get the knowledge of a large model with the inference cost of a small one.

Mistral runs 128 expert networks, but routes each token through only 4 of them. A learned routing layer picks which experts matter for each token: a coding token goes to coding-heavy experts, a French token to language-heavy ones. The other 124 experts sit idle for that token and cost no compute, though their weights still occupy memory.
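Top-k routing is simple enough to sketch. The toy router below (illustrative only, not Mistral's actual implementation; the hidden dimension is made up) scores all 128 experts for one token and keeps the 4 best:

```python
import numpy as np

def route_token(hidden, router_weights, top_k=4):
    """Sketch of top-k MoE routing for a single token.

    hidden:         (d_model,) token representation
    router_weights: (d_model, n_experts) learned routing matrix
    Returns indices of the chosen experts and their softmax mixing weights.
    """
    logits = hidden @ router_weights           # score all 128 experts
    top = np.argsort(logits)[-top_k:]          # keep only the 4 highest-scoring
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # renormalize over the chosen 4
    return top, w

rng = np.random.default_rng(0)
experts, mix = route_token(rng.standard_normal(64),
                           rng.standard_normal((64, 128)))
# Only the 4 experts in `experts` run; the other 124 are skipped entirely.
```

The token's output is then the mix-weighted sum of just those 4 expert outputs, which is where the compute savings come from.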

For self-hosting, that means you need 4x H100 GPUs minimum (the full model is ~242GB in BF16). Not cheap, but cheaper than hosting a 119B dense model, which would need 8-16 H100s. Most people will just use the API at $0.15/1M though.

The reasoning toggle

This is the part that caught our attention. Mistral Small 4 has a reasoning_effort parameter you can set per request.

reasoning_effort="none"

Fast chat mode. No chain-of-thought. Equivalent to Mistral Small 3.2 behavior. Use for classification, extraction, simple Q&A.

reasoning_effort="high"

Deep reasoning mode. Step-by-step thinking like Magistral. Use for math, complex analysis, multi-step problems. Costs more tokens but gets harder questions right.

This is useful because you don't pay for reasoning tokens on simple requests. A classification call with reasoning_effort="none" is pure speed. A math problem with reasoning_effort="high" takes longer but gets the answer right. Same model, same API, you just flip a parameter. OpenAI has a similar concept with their effort levels, but Mistral's is baked into a $0.15 model instead of a $2.50 one.
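As a sketch, a request might look like the following. The model ID and the reasoning_effort values come from Mistral's announcement; the exact payload shape is an assumption based on OpenAI-compatible chat APIs, so check Mistral's docs before copying:

```python
def build_request(prompt, effort="none"):
    """Assemble a chat request body. Payload shape is assumed (OpenAI-style);
    only the model ID and reasoning_effort values come from the announcement."""
    assert effort in ("none", "high")
    return {
        "model": "mistral-small-2603",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }

# Speed path: classification with no chain-of-thought tokens billed.
fast = build_request("Classify this ticket: 'refund not received'")
# Reasoning path: pay for thinking tokens only where they help.
deep = build_request("Solve: how many primes are below 100?", effort="high")
```

The point of the sketch: the routing decision lives in your application code, per request, not in a model swap.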

What the benchmarks say

Mistral published chart-based comparisons rather than clean tables for most benchmarks. Here's what we could extract. The GPQA Diamond and MMLU-Pro scores come from a third-party review, not Mistral's official numbers, so take them with appropriate skepticism.

| Benchmark | Score | Source |
|---|---|---|
| AA LCR | 0.72 | Official |
| GPQA Diamond | 71.2% | Third-party |
| MMLU-Pro | 78.0% | Third-party |
| LiveCodeBench | > GPT-OSS 120B | Official (chart) |
| AIME 2025 | > GPT-OSS 120B | Official (chart) |

The LCR score of 0.72 is interesting because Mistral notes the model achieved it with only 1,600 characters of output, while comparable Qwen models needed 5,800-6,100 characters. That means lower output token costs for the same result. For a $0.60/1M output model, that efficiency compounds.
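To see roughly what that verbosity gap is worth, here's a quick conversion (assuming ~4 characters per token, a common English-text heuristic, and the character counts above):

```python
def output_cost(chars, price_per_million=0.60, chars_per_token=4):
    """Approximate output cost per response in dollars, assuming ~4 chars/token."""
    tokens = chars / chars_per_token
    return tokens * price_per_million / 1e6

mistral_resp = output_cost(1_600)  # Mistral Small 4's terse answers
qwen_resp = output_cost(5_800)     # ~3.6x more output spend per response
```

Fractions of a cent per response, but at millions of responses the 3.6x verbosity multiplier becomes real money.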

What this costs vs GPT-5.4 Mini

We ran the same workloads from our GPT-5.4 Mini vs Nano post through Mistral Small 4 pricing. The differences are large.

| Workload | Volume | Mistral Small 4 | GPT-5.4 Mini | Savings |
|---|---|---|---|---|
| Customer support | 2K input + 500 output, 5,000/day | $99/mo | $563/mo | 5.7x cheaper |
| Code review | 15K input + 3K output, 200/day | $24/mo | $149/mo | 6.2x cheaper |
| Data extraction | 5K input + 1K output, 10,000/day | $405/mo | $2,475/mo | 6.1x cheaper |

Math: Customer support = [(10M tokens × $0.15/1M) + (2.5M tokens × $0.60/1M)] × 30 days = ($1.50 + $1.50) × 30 = $90/mo. We add roughly 10% for reasoning token overhead, landing at $99. Run your own numbers with our cost calculator.
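Here's that math as a reusable function if you want to plug in your own workloads (default prices are Mistral Small 4's; the ~10% reasoning overhead buffer is not included):

```python
def monthly_cost(in_tokens, out_tokens, calls_per_day,
                 in_price=0.15, out_price=0.60, days=30):
    """Monthly API bill in dollars. Prices are $ per 1M tokens."""
    daily = (in_tokens * in_price + out_tokens * out_price) * calls_per_day / 1e6
    return daily * days

support = monthly_cost(2_000, 500, 5_000)                   # -> 90.0 (Mistral)
support_mini = monthly_cost(2_000, 500, 5_000, 0.75, 4.50)  # -> 562.5 (GPT-5.4 Mini)
```

Swap in the GPT-5.4 Mini prices from the table above to reproduce the comparison column.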

When Mistral Small 4 makes sense (and when it doesn't)

Good fit
  • High-volume document parsing with OCR
  • Budget multimodal pipelines (text + image)
  • Coding tasks where open source matters
  • Workloads that need both fast and deep modes
  • Self-hosted deployments (Apache 2.0)
Not a good fit
  • Computer use or browser automation (use GPT-5.4 Mini)
  • Tasks that need frontier-level reasoning (use GPT-5.4 or Claude)
  • Long context beyond 256K (GPT-5.4 Mini does 400K, Gemini does 1M)
  • Workloads where DeepSeek V3.2's $0.42 output beats $0.60

Honest take: if you don't need computer use or massive context windows, Mistral Small 4 undercuts basically everything at this quality level. The 256K context limit and lack of computer use support are the main reasons to pay more for GPT-5.4 Mini instead.

Bottom line

Mistral Small 4 at $0.15 per million input tokens is the cheapest way to get multimodal input and configurable reasoning in one model. The MoE architecture keeps inference fast despite the 119B parameter count. Apache 2.0 means you can self-host if you have the GPUs.

The limits are real: 256K context (not 400K or 1M), no computer use, and benchmark scores that sit below GPT-5.4 and Claude. But at 5-7x less than GPT-5.4 Mini, you're paying for a different tier of model and getting surprisingly close performance.

Compare it against everything else on our pricing page, or plug in your workload with the cost calculator.
