TokenCost
Model Release · May 31, 2026 · 9 min read

Cohere shipped a 218B Apache 2.0 model that runs on two H100s. The hosted rate matches Command A; the license is the actual news.

Command A+ landed on May 20 as the first Cohere flagship under unrestricted Apache 2.0. The hosted API stays at $2.50 input and $10 output per million tokens, the same tier Cohere has held on Command A since release. The W4A4 build runs on two H100s at 375 tokens per second, which is what drags the self-host math into range for anyone running steady volume. τ²-Bench Telecom jumps from 37 to 85, Terminal-Bench Hard from 3 to 25, AIME 2025 from 57 to 90. The boards Cohere did not publish are the ones to look for.

[Image: Cohere Command A+ official launch banner, 218B parameter Apache 2.0 open-weight MoE model. Source: Cohere]

Three facts before anything else

  • Apache 2.0 weights, no revenue cap. This is the first time Cohere has shipped a frontier model without the CC-BY-NC community license.
  • 218B total / 25B active sparse MoE with a W4A4 build that runs on two H100s. Cohere published 375 tokens per second of output and a 113ms time-to-first-token on that footprint.
  • Hosted API is still $2.50/$10 per 1M, the existing Command A tier. Cohere has not put a separate Command A+ rate card on the pricing page.

The rate card, and the license clause that goes with it

Two surfaces, two pricing models. The hosted Cohere API charges $2.50 input and $10 output per million tokens, lifted straight off the existing Command A line on cohere.com. OpenRouter mirrors that rate. The other surface is the Hugging Face weights, which carry no token rate at all because the Apache 2.0 license lets you host the model yourself with no royalty, no cap, and no acceptable-use clause beyond the patent grant. That is a meaningful shift from the CC-BY-NC license that covered Command A, Aya, and the original Command R+.

| Surface | Input / 1M | Output / 1M | Context / Output cap | Notes |
|---|---|---|---|---|
| Cohere API | $2.50 | $10.00 | 128K / 64K | Matches Command A tier. No published cache or batch discount. |
| OpenRouter | $2.50 | $10.00 | 128K / 64K | Routes to Cohere as the only provider at launch. |
| Hugging Face weights | $0 | $0 | 128K / 64K | Apache 2.0. BF16, FP8, and W4A4 NVFP4 builds. |
| Azure AI Foundry | $2.50 | $10.00 | 128K / 64K | Same list rate. Foundry availability confirmed at launch. |

One caveat on the hosted rate. Cohere has not pushed a Command A+ row to the public pricing page as of this week. The $2.50/$10 figures are confirmed by OpenRouter, Azure Foundry, and third-party trackers, but treat the absence of a published Cohere line as the kind of detail that can move on short notice. If a separate A+ rate appears later, it will almost certainly be higher than the existing Command A tier rather than lower.

Benchmark jumps, and the boards Cohere left out

The numbers Cohere chose to publish are large and verifiable, and they read as a response to the Command A line being treated as enterprise-RAG-and-not-much-else. The agentic and math jumps are the headline.

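| Benchmark | Command A+ | Command A (Reasoning / Vision) | Delta |
|---|---|---|---|
| τ²-Bench Telecom | 85 | 37 | +48 |
| Terminal-Bench Hard | 25 | 3 | +22 |
| IFBench | 74 | 36 | +38 |
| AIME 2025 | 90 | 57 | +33 |
| MMMU (multimodal) | 75.1 | 65.3 | +9.8 |
| SciCode | 38 | 30 | +8 |
| AA-Omniscience Non-Hallucination | 86 | n/p | #1 board |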

Three boards are conspicuously absent from the launch deck: SWE-Bench Verified, MMLU, and HumanEval. Those are the three industry-standard coding and knowledge benches every other lab reports against on launch day, and Cohere chose not to. The most generous read is that the τ²-Bench and Terminal-Bench numbers are how Cohere wants Command A+ evaluated for agentic work. The less generous read is that SWE-Bench would put the model behind Mistral Medium 3.5 (77.6) and well behind Claude Sonnet 4.6 (79.6), so the table that would have included it does not exist.

Artificial Analysis filed an Intelligence Index of 37 for Command A+ on launch day, which puts it in roughly the same tier as Claude Haiku 4.5 and a step above NVIDIA Nemotron 3 Super and Gemini 3.1 Flash-Lite. Top of the AA-Omniscience non-hallucination board is the genuinely interesting result. Cohere's native citation grounding, which embeds source-span tags directly in generated text, is what pushes that number.

Where Command A+ sits on a comparison table

Mid-tier hosted models. List rates only, no batch and no cache, since Cohere has not published cache pricing and most of the competing models use very different cache schemes anyway. The point of this table is to read the dollars against the boards Cohere does report.

| Workload | Command A+ | Haiku 4.5 | Gemini 3.5 Flash | Mistral Medium 3.5 |
|---|---|---|---|---|
| Long-doc RAG read (200K in / 4K out) | $0.540 | $0.220 | $0.336 | $0.330 |
| Citation-grounded answer (30K in / 2K out) | $0.095 | $0.040 | $0.063 | $0.060 |
| Terminal-style agent run (80K in / 8K out) | $0.280 | $0.120 | $0.192 | $0.180 |
| 500M monthly tokens (70/30 in/out) | $2,375 | $1,100 | $1,875 | $1,650 |

Command A+ is the most expensive line on this table at the API. Haiku 4.5 lands at roughly 41 to 46 percent of the Command A+ bill depending on the workload, Mistral Medium 3.5 at about 61 to 70 percent, and Gemini 3.5 Flash just above Mistral at 62 to 79 percent. The hosted dollar comparison is not where Cohere wants this fight, which is the cue to read the next section.
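The Command A+ column above can be reproduced from the list rate alone. The following is a minimal sketch, assuming flat per-token billing with no cache or batch discount, as the article does; the function names are illustrative and are not part of any Cohere SDK.

```python
# List rates in USD per 1M tokens, as quoted in the article for Command A+.
INPUT_RATE = 2.50
OUTPUT_RATE = 10.00


def request_cost(input_tokens, output_tokens,
                 input_rate=INPUT_RATE, output_rate=OUTPUT_RATE):
    """List-price cost in USD for one request (no cache, no batch)."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def monthly_cost(total_tokens, input_share,
                 input_rate=INPUT_RATE, output_rate=OUTPUT_RATE):
    """Monthly bill in USD for a token volume split by input share (0..1)."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens * (1 - input_share)
    return request_cost(input_tokens, output_tokens, input_rate, output_rate)


if __name__ == "__main__":
    # Workload shapes taken from the comparison table.
    workloads = {
        "Long-doc RAG read": (200_000, 4_000),
        "Citation-grounded answer": (30_000, 2_000),
        "Terminal-style agent run": (80_000, 8_000),
    }
    for name, (tokens_in, tokens_out) in workloads.items():
        print(f"{name}: ${request_cost(tokens_in, tokens_out):.3f}")
    print(f"500M monthly tokens (70/30): ${monthly_cost(500_000_000, 0.70):,.0f}")
```

Passing another model's input and output rates as `input_rate` and `output_rate` gives the matching column, provided that model bills the same way.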

Self-host math: when 2 rented H100s beat the API

This is the part of the launch that actually moves the cost surface. Cohere published 375 output tokens per second on a 2x H100 footprint running the W4A4 NVFP4 build, with a 113 millisecond time-to-first-token. That works out to about 1.35 million output tokens per hour at full GPU utilization. Set that against the API rate of $10 per million output tokens, and the breakeven hinges on hourly GPU cost and how full you keep the boxes.

| Provider | 2x H100 / hr | Cost / 1M output (100% util) | Utilisation for $10/1M parity |
|---|---|---|---|
| RunPod (PCIe) | $4.00 | $2.96 | ~30% |
| Lambda Labs | $5.98 | $4.43 | ~44% |
| Spheron (SXM5 spot) | $2.06 | $1.53 | ~15% |
| AWS p5 on-demand | $13.76 | $10.19 | ~100% (API wins) |

The bands tell the story. On RunPod or Spheron you can break even at light utilization and start clearing the API rate as soon as the box runs more than a third full. Lambda Labs and similar mid-tier providers want about half the day's capacity in use before the math flips. AWS on-demand is more or less a wash and would only make sense for compliance-bound workloads that have to run there. None of this counts the time spent operating the deployment, which is the actual ceiling for most teams.
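The breakeven table follows from two numbers: the hourly price of the box and the tokens it emits per hour. Here is a rough sketch of that arithmetic, assuming the published 375 tok/s peak and, like the table, pricing output tokens only (input tokens and operating time are ignored). The helper names are made up for illustration.

```python
API_OUTPUT_RATE = 10.00        # USD per 1M output tokens on the hosted API
PEAK_TOKENS_PER_SECOND = 375   # Cohere's published figure for 2x H100, W4A4

# Hourly prices for a 2x H100 footprint, copied from the table above.
HOURLY_USD = {
    "RunPod (PCIe)": 4.00,
    "Lambda Labs": 5.98,
    "Spheron (SXM5 spot)": 2.06,
    "AWS p5 on-demand": 13.76,
}


def cost_per_million_output(hourly_usd, tokens_per_second=PEAK_TOKENS_PER_SECOND):
    """USD per 1M output tokens when the box is busy 100% of the time."""
    millions_per_hour = tokens_per_second * 3600 / 1_000_000
    return hourly_usd / millions_per_hour


def parity_utilisation(hourly_usd, api_rate=API_OUTPUT_RATE,
                       tokens_per_second=PEAK_TOKENS_PER_SECOND):
    """Fraction of each hour the box must be busy to match the API rate.

    A value above 1.0 means the API is cheaper even at full utilisation.
    """
    return cost_per_million_output(hourly_usd, tokens_per_second) / api_rate


if __name__ == "__main__":
    for provider, hourly in HOURLY_USD.items():
        print(f"{provider}: ${cost_per_million_output(hourly):.2f} per 1M, "
              f"parity at {parity_utilisation(hourly):.0%} utilisation")
```

Because utilisation enters as a simple ratio, a cheaper GPU hour lowers the breakeven linearly, which is why the spot row sits so far below the others.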

One nuance the table flattens: 375 tok/s is the peak Cohere published, not the sustained throughput under realistic concurrency. Plan for somewhere between 60 and 80 percent of that as the steady-state number on a generic vLLM deployment, which multiplies every breakeven figure by roughly 1.25 to 1.67: RunPod's ~30% becomes about 37 to 49 percent, and Lambda Labs' ~44% becomes about 55 to 74 percent. The shape of the conclusion does not change.
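To see how far a derated throughput moves the breakeven, the same calculation can take a sustained-throughput fraction. This is a sketch under the article's assumptions (375 tok/s peak, $10 per 1M output tokens); `sustained_fraction` is a hypothetical parameter standing in for whatever a real deployment measures.

```python
def parity_utilisation(hourly_usd, sustained_fraction=1.0,
                       peak_tokens_per_second=375, api_rate=10.00):
    """Utilisation needed to match the API rate at a derated throughput.

    sustained_fraction is the share of peak throughput actually achieved,
    for example 0.6 to 0.8 for the range suggested in the text.
    """
    if not 0 < sustained_fraction <= 1:
        raise ValueError("sustained_fraction must be in (0, 1]")
    tokens_per_second = peak_tokens_per_second * sustained_fraction
    millions_per_hour = tokens_per_second * 3600 / 1_000_000
    return (hourly_usd / millions_per_hour) / api_rate


if __name__ == "__main__":
    # Hourly prices copied from the breakeven table.
    for provider, hourly in [("RunPod (PCIe)", 4.00), ("Lambda Labs", 5.98),
                             ("Spheron (SXM5 spot)", 2.06)]:
        low = parity_utilisation(hourly, sustained_fraction=0.8)
        high = parity_utilisation(hourly, sustained_fraction=0.6)
        print(f"{provider}: parity at {low:.0%} to {high:.0%} utilisation")
```

The derating divides straight into the breakeven, so the cheapest providers stay comfortably under full utilisation while mid-priced ones lose most of their margin.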

Architecture in one paragraph

218 billion total parameters arranged as a sparse Mixture-of-Experts. 128 experts per layer plus one shared expert, 8 active per token, for an effective active-parameter count of about 25 billion. 128K input context, 64K maximum output, 48 supported languages. The launch builds are BF16, FP8, and a W4A4 NVFP4 quantization that uses Quantization-Aware Distillation to keep the attention path, Q/K/V/O projections, and KV cache at full precision while compressing the experts. That is what gets it down to a single B200 or a pair of H100s without the usual quality cliff. Cohere also lists speculative decoding and a tokenizer refresh that picks up 16 to 20 percent on Japanese, Korean, and Arabic versus Command A.
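A back-of-envelope weight-memory estimate shows why the 4-bit build is the one that fits on two H100s. The sketch below assumes every parameter is stored at the nominal bit width and that each H100 has 80 GB of memory; both are simplifying assumptions. Since the W4A4 build keeps the attention path and KV cache at full precision, the real footprint is somewhat higher than this lower bound.

```python
TOTAL_PARAMS = 218e9    # total parameters, from the article
ACTIVE_PARAMS = 25e9    # active parameters per token, from the article
H100_MEMORY_GB = 80     # assumption: the 80 GB H100 variant

# Nominal bits per weight for each launch build.
BUILD_BITS = {"BF16": 16, "FP8": 8, "W4A4 NVFP4": 4}


def weight_gigabytes(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    budget = 2 * H100_MEMORY_GB
    for build, bits in BUILD_BITS.items():
        size = weight_gigabytes(TOTAL_PARAMS, bits)
        verdict = "fits" if size < budget else "does not fit"
        print(f"{build}: ~{size:.0f} GB of weights, {verdict} in {budget} GB")
    print(f"Active share per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The estimate leaves out the KV cache, activations, and quantization scale factors, so the headroom left after the weights is what actually bounds batch size and context length.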

What this actually changes

For teams already paying a Command A hosted bill, Command A+ is a drop-in upgrade at the same rate. The model ID changes and the benchmark headroom climbs, but the line on your invoice does not move. That is the easy decision.

For teams choosing between hosted APIs in the $1 to $3 input band, Command A+ is the most expensive option on the table and the slowest to commit to a published cache discount. Haiku 4.5 is the obvious price-led pick. Mistral Medium 3.5 and Gemini 3.5 Flash sit in between, and Mistral comes with open weights that have been around long enough for inference stacks to be optimized. The case for Command A+ on the hosted surface is mostly the citation grounding feature and the AA-Omniscience non-hallucination top spot, which matter for compliance-bound RAG more than they do for general chat.

The case for Command A+ on self-host is the actual news. Apache 2.0 on a 218B/25B MoE that runs on two H100s lands in a different category from anything else released in May. Llama 3.x weights still carry a 700-million-monthly-active-users clause; Mistral Medium 3.5 is under a modified MIT that prohibits some commercial categories. Cohere put a clause-free open-source license on a model that runs on rentable hardware, and that is the move worth weighing against the hosted dollar comparison. Compare it on the full pricing table or model the spend on the calculator.
