Chatbot Arena April 2026: Claude leads everything, Grok 4.20 has the cheapest output
The April leaderboard has a clear answer at the top. Claude Opus 4.6 Thinking holds #1 across every major category - text, coding, math, creative writing, instruction following. Below it, the rankings get more interesting when you factor in what each model actually costs.

Claude Opus 4.6 Thinking is #1 across every Arena category - coding, math, creative writing, instruction following - and it costs exactly the same as the non-thinking variant. Muse Spark holds #3 with no API. Below Claude, the decision is really between Gemini 3.1 Pro ($2/$12) for near-frontier quality at roughly half the price, and Grok 4.20 ($2/$6) if you want the cheapest output in the top 10 and can live with a newer API ecosystem.
The April 2026 Arena top 10
Data from lmarena.ai, updated April 11-12. Rankings are based on human preference votes - someone sees two model responses side-by-side and picks one. The Elo scores reflect millions of those comparisons, not automated benchmarks.
| Rank | Model | Elo | Provider | Input $/1M | Output $/1M |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 Thinking | 1504 | Anthropic | $5.00 | $25.00 |
| 2 | Claude Opus 4.6 | 1496 | Anthropic | $5.00 | $25.00 |
| 3 | Muse Spark | 1493 | Meta | - | - |
| 4 | Gemini 3.1 Pro Preview | 1492 | Google | $2.00 | $12.00 |
| 5 | Gemini 3 Pro | 1486 | Google | $2.00 | $12.00 |
| 6 | Grok 4.20 Beta | ~1484 | xAI | $2.00 | $6.00 |
| 7 | GPT-5.4 High | 1484 | OpenAI | $2.50 | $15.00 |
| 8 | Grok 4.20 Beta Reasoning | 1478 | xAI | $2.00 | $6.00 |
| 9 | GPT-5.2 Chat | 1477 | OpenAI | - | - |
| 10 | Grok 4.20 Multi-Agent | 1476 | xAI | $2.00 | $6.00 |
Pricing from official provider pages, verified April 12, 2026. Grok 4.20 Beta Elo is approximate based on surrounding ranks. GPT-5.2 Chat and Muse Spark have no public API pricing.
Thinking mode costs nothing extra
Claude Opus 4.6 Thinking holds an 8-point Elo edge over the non-thinking version and leads every single category the Arena tracks. Both are $5 per million input tokens and $25 per million output tokens. No thinking surcharge.
This is different from how OpenAI handles extended reasoning. With OpenAI's reasoning models, reasoning tokens are billed as output tokens and appear explicitly in your usage details. With Claude, you enable thinking at the API level and the cost structure stays the same. Anthropic absorbs the compute difference.
The practical implication: if you're already paying for Claude Opus 4.6, you're already paying for the #1 ranked model on the current leaderboard. The Arena lists them as separate entries at the same price.
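For reference, here's a minimal sketch of what enabling thinking looks like against the Anthropic Messages API. The model id and the thinking budget below are illustrative assumptions, not values taken from the leaderboard or the pricing page:

```python
# Minimal sketch: extended thinking via the Anthropic Messages API.
# The model id "claude-opus-4-6" is assumed from the article's naming;
# check the provider's model list for the real identifier.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",          # hypothetical id for Claude Opus 4.6
    max_tokens=8000,                  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)

# Per-token rates ($5/$25 per million) are the same with or without thinking.
print(response.usage.input_tokens, response.usage.output_tokens)
```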
Cost per Elo point
This metric has real limits - Elo points aren't linear, different workloads favor different models - but it puts the pricing picture in perspective. Using output pricing since that tends to dominate costs for text generation:
| Model | Elo | Output $/1M | $/Elo point |
|---|---|---|---|
| Grok 4.20 Beta | 1484 | $6.00 | $0.00404 |
| Gemini 3.1 Pro Preview | 1492 | $12.00 | $0.00804 |
| GPT-5.4 High | 1484 | $15.00 | $0.01011 |
| Claude Opus 4.6 Thinking | 1504 | $25.00 | $0.01662 |
Grok 4.20 looks good here. It sits around Elo 1476-1484 depending on the variant, and every variant costs $2/$6. On pure cost-per-Elo, it's the cheapest way into the top 10 by a noticeable margin.
The thing to keep in mind: Elo points are not linear. A model at Elo 1504 beats one at Elo 1476 in roughly 54% of head-to-head comparisons. Not a dramatic gap, but real and consistent across millions of votes. Whether that 4-point edge over a coin flip justifies paying 4x more on output depends entirely on what you're building.
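Both numbers in this section fall out of a couple of lines of arithmetic. A quick sketch, using the Elo scores and output prices from the tables above and the standard Elo expected-score formula:

```python
# Cost per Elo point and the expected win rate implied by an Elo gap.
# Elo scores and output prices are taken from the tables above.

def expected_win_rate(elo_a: float, elo_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

models = {
    # name: (Elo, output $ per 1M tokens)
    "Grok 4.20 Beta": (1484, 6.00),
    "Gemini 3.1 Pro Preview": (1492, 12.00),
    "GPT-5.4 High": (1484, 15.00),
    "Claude Opus 4.6 Thinking": (1504, 25.00),
}

for name, (elo, out_price) in models.items():
    print(f"{name}: ${out_price / elo:.5f} per Elo point")

# 1504 vs 1476 (Claude Thinking vs the lowest Grok variant) -> ~0.540, i.e. ~54%
print(expected_win_rate(1504, 1476))
```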
Muse Spark: third place, no API
Meta's Muse Spark holds Elo 1493 - third on the leaderboard, between Claude Opus 4.6 and Gemini 3.1 Pro. It performs particularly well on vision tasks, appearing near the top of multimodal comparisons.
There is no developer API. Muse Spark is Meta's first proprietary model (the Llama series remains open-weight), and no pricing or access program has been announced. For anyone building on APIs, it might as well not exist yet.
Where each model actually makes sense
Claude Opus 4.6 Thinking: best overall, highest cost
Holds #1 in coding, math, creative writing, and instruction following simultaneously. That's unusual - models typically trade off between categories. Right now, Anthropic has a clean lead across the board that the other providers haven't matched.
The price is $5/$25 per million, which makes it the most expensive accessible model in the top 10. If you need the best and cost isn't the binding constraint, there's no real debate in April 2026.
Gemini 3.1 Pro: serious value at near-frontier quality
Running at $2/$12, Gemini 3.1 Pro delivers 99.2% of Claude Opus 4.6 Thinking's Elo for 40% of the input cost and 48% of the output cost. The 2M token context window is also the largest at this quality tier, with no pricing penalty for long prompts.
For high-volume workloads or batch processing, the math compounds quickly. Routing everything to Claude when Gemini is 12 Elo points lower at half the output cost is hard to justify for most use cases.
Grok 4.20: cheapest output, newer ecosystem
xAI has three variants in the top 10 - standard, reasoning, and multi-agent - all priced at $2/$6. The $6/M output price is roughly half what Gemini charges and a quarter of Claude. If you have output-heavy workloads and want near-frontier quality, Grok 4.20 is worth a proper evaluation.
The practical concern is maturity. xAI's API is newer than Anthropic's or Google's, developer tooling is less established, and the uptime track record is shorter. For production systems, those things matter more than Elo scores.
GPT-5.4: hard to recommend on value alone
GPT-5.4 High runs $2.50/$15 - more than Gemini and Grok, yet ranks 7th behind both. The case for it is ecosystem: if you're embedded in OpenAI tooling, Codex workflows, or need computer use and long document generation, the switch cost probably outweighs the pricing difference. Otherwise, the value argument is difficult to make.
What these differences look like at scale
We ran the numbers for a few typical API workload sizes, assuming a 3:1 input-to-output token ratio (7.5M input and 2.5M output tokens per 10M total), to see where the gaps compound. A sketch of the blended math follows below.
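Here is that blended math as a short sketch, using the prices from the top-10 table. The 3:1 split is the assumption stated above, and the monthly volumes are illustrative:

```python
# Blended monthly cost per model, assuming 75% input / 25% output tokens.
# Prices are $ per 1M tokens, from the top-10 table above.
PRICES = {
    "Claude Opus 4.6 Thinking": (5.00, 25.00),
    "Gemini 3.1 Pro Preview": (2.00, 12.00),
    "GPT-5.4 High": (2.50, 15.00),
    "Grok 4.20 Beta": (2.00, 6.00),
}

def monthly_cost(total_tokens_m: float, in_price: float, out_price: float) -> float:
    """Dollar cost for a month of total_tokens_m million tokens at a 3:1 split."""
    return 0.75 * total_tokens_m * in_price + 0.25 * total_tokens_m * out_price

for volume_m in (10, 100, 1000):  # 10M, 100M, and 1B total tokens per month
    for name, (inp, out) in PRICES.items():
        print(f"{volume_m:>5}M tokens  {name:<26} ${monthly_cost(volume_m, inp, out):>10,.2f}")

# At 1B tokens/month: Claude ~ $10,000, Gemini ~ $4,500, Grok ~ $3,000,
# so the Claude-vs-Grok gap is about $7,000/month, as noted below.
```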
The gap between Claude and Gemini runs about 2.2x at every scale. Claude versus Grok is over 3x. Running a high-volume agent loop at a billion tokens per month means paying $7,000 more per month for Claude compared to Grok - for models separated by 20-28 Elo points. Whether those points show up in your users' experience is the actual question worth spending time on.
The call
Claude Opus 4.6 Thinking is the best model on the leaderboard right now. If you need the best and you're running tasks where quality actually compounds - agentic workflows, complex code generation, research synthesis - the $5/$25 rate is defensible.
For most applications, Gemini 3.1 Pro at $2/$12 is the more honest call. Top four globally, 2M context window, half the output cost of Claude. A 12-point Elo gap is real but probably won't be visible to your users on the majority of requests.
Grok 4.20 deserves an actual evaluation if you have output-heavy workloads and can tolerate some API ecosystem risk. $6/M output for top-10 Arena performance is a genuinely interesting number. Run your actual queries on it before dismissing it.
Sources
- LMSYS Chatbot Arena leaderboard - April 11-12, 2026
- Anthropic pricing - verified April 12, 2026
- Google AI pricing - verified April 12, 2026
- xAI Grok pricing - verified April 12, 2026
- OpenAI API pricing - verified April 12, 2026