TokenCost
Model Release · March 26, 2026 · 8 min read

Grok 4.20 Beta: $2 per million tokens, 2M context, and the lowest hallucination rate measured so far

xAI's latest model costs less than Grok 4 and less than Claude Sonnet 4.6, while expanding context to 2 million tokens. It ranks #1 on instruction following and sets a record on non-hallucination. It also scores 9 points below GPT-5.4 on overall benchmarks. Here is what that trade-off looks like in practice.

Image: xAI Grok 4.20 Beta model release, showing API documentation and the 2M context window (official Grok branding)

Grok 4.20 costs $2 input / $6 output per million tokens -- 40% of the output price GPT-5.4 and Claude Sonnet 4.6 charge. It expanded context to 2M tokens at flat pricing, holds the #1 spot on IFBench instruction following (83%), and hit a record 78% non-hallucination rate on Omniscience. The catch: Artificial Analysis puts it at 48 on their Intelligence Index vs 57 for GPT-5.4 and Gemini 3.1 Pro, so for hard reasoning and novel math those two are still the better call.

What changed from Grok 4

The original Grok 4 launched in July 2025 at $3 input / $15 output with a 256K context window. Grok 4.20 drops both prices and expands context 8x. Input is now $2 (down 33%) and output is $6 (down 60%). That output cut is the bigger deal -- $6 puts it well below Claude Sonnet 4.6 ($15) and GPT-5.4 ($15) on the number that most production workloads actually feel.

On the Intelligence Index, the score went from 42 to 48 -- a 14% gain, though still behind the top-tier models. What Grok 4.20 added that Grok 4 lacked: a #1 ranking on instruction following and a record-low hallucination rate. Neither was a strength of the original.

Version numbering follows xAI's internal sequence: Grok 4 in July 2025, Grok 4.1 in November 2025, then Grok 4.20 in March 2026. The API model ID is grok-4.20-beta-0309 -- the "0309" is an internal date stamp (March 9). Official docs went live March 24.

Pricing vs the competition

At $6 output, Grok 4.20 undercuts every other frontier model on the number that scales hardest. GPT-5.4 and Claude Sonnet 4.6 both charge $15. Gemini 3.1 Pro is $12. The input prices are all within $1 of each other, so output is where the cost difference compounds.

| Model | Input / 1M | Output / 1M | Context | Intel. Index |
| --- | --- | --- | --- | --- |
| Grok 4.20 Beta | $2.00 | $6.00 | 2M | 48 |
| GPT-5.4 | $2.50 | $15.00 | 1.1M | 57 |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M | 57 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | 44 |
| Grok 4 (prev) | $3.00 | $15.00 | 256K | 42 |

Intelligence Index v4.0 from Artificial Analysis. Pricing from xAI API docs, retrieved March 26, 2026.

Cached and batch pricing

Prompt caching drops input to $0.20 per million -- 10% of the standard rate. For pipelines with a stable system prompt or a fixed knowledge base, that stacks well. Batch API takes 50% off everything.

| Tier | Input / 1M | Output / 1M |
| --- | --- | --- |
| Standard | $2.00 | $6.00 |
| Cached input | $0.20 | -- |
| Batch (50% off) | $1.00 | $3.00 |
| Batch + cached input | $0.10 | -- |
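Because the cache discount applies only to input tokens and the batch discount applies on top, the blended input rate for a pipeline falls out of a one-line weighted average. A small sketch (the function and parameter names are ours, not xAI's):

```python
def effective_input_rate(hit_rate: float, batch: bool = False) -> float:
    """Blended $/1M input tokens under the published Grok 4.20 tiers.

    hit_rate: fraction of input tokens served from the prompt cache.
    batch:    apply the 50% Batch API discount on top.
    """
    STANDARD, CACHED = 2.00, 0.20  # $/1M tokens, from the tier table above
    rate = (1 - hit_rate) * STANDARD + hit_rate * CACHED
    return rate * (0.5 if batch else 1.0)

# A 60% cache hit rate blends to $0.92/1M; batching halves that to $0.46/1M.
print(round(effective_input_rate(0.60), 2))              # 0.92
print(round(effective_input_rate(0.60, batch=True), 2))  # 0.46
```

At a 60% hit rate the input bill drops by more than half before batching even enters the picture, which is why the article calls the stacking "a real combination" for stable-prompt pipelines.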

Rate limits: 607 requests/minute, 4M tokens/minute. Provisioned throughput available in US East and EU West.
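Which limit binds depends on request size. A quick feasibility check (illustrative; names are ours):

```python
def max_requests_per_minute(avg_tokens_per_request: int,
                            rpm_limit: int = 607,
                            tpm_limit: int = 4_000_000) -> int:
    """Sustainable requests/minute given both the request and token ceilings."""
    return min(rpm_limit, tpm_limit // avg_tokens_per_request)

# Small requests hit the request ceiling; large contexts hit the token ceiling.
print(max_requests_per_minute(2_000))    # 607 (request-limited)
print(max_requests_per_minute(500_000))  # 8   (token-limited)
```

At the full 2M window, the 4M tokens/minute ceiling allows only two requests per minute, so heavy long-context use is where provisioned throughput starts to matter.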

Where it leads, where it trails

The overall Intelligence Index score of 48 puts Grok 4.20 at #8 overall -- a 9-point gap behind GPT-5.4 and Gemini 3.1 Pro, both at 57. That gap matters for hard reasoning, novel math, and multi-step problem solving. For those tasks, paying more for GPT-5.4 is probably right.

Where Grok 4.20 actually leads:

| Benchmark | Grok 4.20 | Rank | What it measures |
| --- | --- | --- | --- |
| IFBench | 83% | #1 overall | Instruction following |
| τ²-Bench Telecom | 97% | #2 overall | Agentic tool use |
| AA-Omniscience | 78% | Record high | Non-hallucination |
| Output speed | 225.5 t/s | #2 / 124 models | Generation speed |
| Intelligence Index | 48 | #8 overall | Composite (10 evals) |

The 78% non-hallucination rate on AA-Omniscience is the number I keep coming back to. No other model tested has hit it. For production workloads where the model being confidently wrong is a real problem -- customer support, medical summaries, legal document parsing -- that record matters more than an aggregate benchmark score.

Speed is also worth noting. 225.5 tokens/second is #2 out of 124 models. Time-to-first-token sits at 13.21 seconds, which reflects some reasoning overhead before generation begins. If your application can buffer that initial wait, the throughput is genuinely fast.
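Putting the two numbers together, a rough single-response latency estimate is time-to-first-token plus generation time at the streaming rate (a back-of-envelope sketch using the figures above):

```python
def response_latency_s(output_tokens: int,
                       ttft_s: float = 13.21,
                       tokens_per_s: float = 225.5) -> float:
    """Approximate wall-clock seconds for one response: TTFT + generation."""
    return ttft_s + output_tokens / tokens_per_s

# A 2,000-token answer: ~13.2s of reasoning overhead + ~8.9s of generation.
print(round(response_latency_s(2_000), 1))  # 22.1
```

The overhead dominates short responses and amortizes on long ones, which is the buffering trade-off described above.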

What it costs at production scale

Four workload types, monthly estimates:

| Workload | Grok 4.20 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- |
| RAG pipeline (10M in / 2M out) | $32 | $55 | $44 |
| RAG with caching (10M in, 60% hit, 2M out) | $21.20 | $41.50 | $35.00 |
| Agent loop (500K in / 2M out) | $13 | $31.25 | $25 |
| Long-doc analysis (1M ctx + 10K out) | $2.06 | $2.65 | $2.12 |

Cache scenario uses $0.20/M cached input (Grok), $0.25/M (GPT-5.4), $0.50/M (Gemini). Agent loop: 500K input + 2M output; shows the output price gap most clearly. Output stays at standard rates for all scenarios.
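These estimates follow directly from the per-token rates in the pricing table. A quick sketch reproduces the math (the RATES dict and function name are ours, not from any official SDK):

```python
RATES = {  # $/1M tokens, from the pricing table above
    "Grok 4.20":      {"in": 2.00, "out": 6.00},
    "GPT-5.4":        {"in": 2.50, "out": 15.00},
    "Gemini 3.1 Pro": {"in": 2.00, "out": 12.00},
}

def monthly_cost(model: str, m_in: float, m_out: float) -> float:
    """Monthly $ for m_in / m_out million input/output tokens at standard rates."""
    r = RATES[model]
    return round(m_in * r["in"] + m_out * r["out"], 2)

print(monthly_cost("Grok 4.20", 10, 2))  # 32.0  (RAG pipeline row)
print(monthly_cost("GPT-5.4", 0.5, 2))   # 31.25 (agent loop row)
```

Plugging your own monthly volumes into a helper like this is usually more informative than any single benchmark row, since the output multiplier is where the models diverge.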

2M context at flat pricing

Anthropic recently eliminated the 2x long-context surcharge on Claude, which means 1M token Claude requests now cost the same per token as any other. Grok 4.20 goes further: 2M tokens at flat $2/M input, no tiered pricing, confirmed in xAI's API documentation.

For long-context RAG -- stuffing a large knowledge base, analyzing entire codebases, processing book-length documents -- Grok 4.20 pairs the 2M window with a low hallucination rate and $6 output. Gemini 3.1 Pro has 1M context at $2 input but $12 output. The output difference adds up when the model is generating substantive responses after reading a large context.
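To see how the flat pricing plays out on a single big-context request, here is an illustrative per-request comparison (helper is ours; rates from the table above):

```python
def request_cost(in_tokens: int, out_tokens: int,
                 in_rate: float, out_rate: float) -> float:
    """Cost in $ for one request; rates are $/1M tokens, flat (no context tiers)."""
    return round(in_tokens / 1e6 * in_rate + out_tokens / 1e6 * out_rate, 2)

# 1M-token context with a 50K-token response:
print(request_cost(1_000_000, 50_000, 2.00, 6.00))   # Grok 4.20: 2.3
print(request_cost(1_000_000, 50_000, 2.00, 12.00))  # Gemini 3.1 Pro: 2.6
```

With identical input rates, the entire gap on a long-context request comes from the output side, and it widens as the response gets more substantive.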

When Grok 4.20 makes sense

The 9-point Intelligence Index gap is real. For tasks involving hard reasoning, novel problems, or multi-step logic, GPT-5.4 and Gemini 3.1 Pro are the better calls. Grok 4.20 is not trying to win there -- its strengths are elsewhere.

1. Instruction-heavy pipelines. Structured output generation, form filling, data extraction against a spec -- anything where you need the model to follow a detailed prompt exactly rather than interpret it. The #1 IFBench score is the relevant signal here. Lower-scoring models tend to produce creative variants of what you asked for.
2. Production RAG where facts matter. 78% non-hallucination + 2M context + $0.20/M cached input is a real combination for retrieval pipelines. Claude Sonnet 4.6 has the 1M context at $3/$15. Grok 4.20 is $2/$6 with more context and a lower hallucination rate.
3. Output-heavy workloads. Writing, summarization, long-form generation. At a 10:1 output-to-input ratio, Grok 4.20 runs at roughly 40% of GPT-5.4's cost. That gap compounds over millions of requests.
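That 40% figure can be checked directly from the published rates (function name is ours):

```python
def cost_per_bundle(in_rate: float, out_rate: float,
                    out_ratio: float = 10.0) -> float:
    """$ for a bundle of 1M input tokens plus out_ratio M output tokens."""
    return in_rate + out_ratio * out_rate

grok = cost_per_bundle(2.00, 6.00)    # 62.0
gpt = cost_per_bundle(2.50, 15.00)    # 152.5
print(round(grok / gpt, 2))           # 0.41 -> ~40% of GPT-5.4's cost
```

At lower output ratios the gap narrows, since the input prices are nearly identical; the 10:1 case is where the $6 vs $15 output rate dominates.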

xAI also notes multi-agent use as a target: Grok 4.20 supports up to 4 parallel agents in Heavy mode, and the τ²-Bench Telecom score (97%, #2 overall) suggests solid real-world agentic tool use performance.

xAI's position

xAI closed a $20B Series E in January 2026 -- upsized from an initial $15B target -- with NVIDIA and Cisco as strategic investors alongside Valor, Fidelity, and the Qatar Investment Authority. The implied valuation exceeded $230B. Grok 5 is in active training.

The Grok 4.20 price cut reads as deliberate positioning. Grok 4 at $3/$15 was priced identically to Claude Sonnet 4.6 with no strong reason to pick it. Grok 4.20 at $2/$6 is harder to dismiss for output-heavy use cases. Whether the overall intelligence gap closes with Grok 5 is the more interesting question.

Run the cost math for your workload

Grok 4.20 is in our full model comparison. Put in your actual token volumes to see how it stacks up against GPT-5.4, Gemini 3.1 Pro, and 60+ other models.
