ARC-AGI-3: the benchmark no AI can crack, and what running it costs
Launched six days ago at Y Combinator, ARC-AGI-3 is the first interactive AI benchmark where the model gets zero instructions and must figure out the goal itself. Gemini 3.1 Pro leads all frontier models at 0.37%. Humans score 100%. Here are the full eval costs and the API pricing behind them.

Image source: ARC Prize
- Gemini 3.1 Pro scored 0.37% on ARC-AGI-3. That was the best. It cost $2,200 to find out.
- GPT-5.4 got 0.26% for $5,200. Opus 4.6 got 0.20% for $8,900. Grok 4.20 scored 0.00% for $3,800.
- Humans score 100%. The $700,000 grand prize for matching them is still unclaimed. Deadline is November 2, 2026.
What changed from ARC-AGI-2
ARC-AGI-1 and ARC-AGI-2 are static grid puzzles. The AI sees a few input/output pairs and must figure out the transformation rule, then apply it to a new input. Hard, but solvable with pattern matching. Gemini 3 scored 84.6% on ARC-AGI-2 last year. GPT-5.4 Pro got 83.3%.
ARC-AGI-3 strips all of that away. There are no example pairs, no stated rules, no hint about what success looks like. The model is dropped into a 64x64 color grid environment and given five directional keys and an undo button. It needs to explore to figure out what the environment does, then infer what winning means, then plan and execute across progressively harder levels. All without any instructions.
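For a concrete sense of what that setup looks like from the model's side, here is a minimal sketch of the interaction loop in Python. The environment class, method names, and action labels are assumptions for illustration, not the actual ARC-AGI-3 API; the point is that the agent receives nothing but grid states and a small fixed action set.

```python
# Hypothetical sketch of the ARC-AGI-3 interaction loop. The env object,
# its methods, and the action names are illustrative placeholders, not the real API.

ACTIONS = ["key_1", "key_2", "key_3", "key_4", "key_5", "undo"]  # five directional keys plus undo; names assumed

def run_environment(env, agent, max_actions):
    grid = env.reset()                    # 64x64 grid of color indices -- no goal, no rules, no examples
    history = []                          # everything the agent knows, it learned by acting
    for _ in range(max_actions):          # hard cap (5x the human action count per level)
        action = agent.choose(grid, history)          # the model reasons over the full history each turn
        grid, solved_all_levels = env.step(action)    # observe the new grid state
        history.append((action, grid))
        if solved_all_levels:
            return True                   # the agent inferred what "winning" means and did it
    return False                          # ran out of turns before figuring the environment out
```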
Francois Chollet described it as the only unsaturated agentic AI benchmark as of March 2026. The design deliberately avoids anything that might have leaked into training data. Prior benchmarks, including ARC-AGI-2, showed signs of overfitting: Gemini 3 was caught using the ARC color mapping in its reasoning without being told it. The new benchmark uses 25 public demo environments, 55 semi-private environments for the official leaderboard, and 55 fully private environments for the competition only.
The scores at launch
Four frontier models were evaluated on the semi-private set when the benchmark went live on March 25. The contrast with their ARC-AGI-2 performance is hard to ignore.
| Model | ARC-AGI-3 score | ARC-AGI-2 score | Eval cost |
|---|---|---|---|
| Gemini 3.1 Pro (Preview) | 0.37% | 77.1% | $2,200 |
| GPT-5.4 (High) | 0.26% | 83.3%* | $5,200 |
| Claude Opus 4.6 (Max) | 0.20% | 69.2% | $8,900 |
| Grok 4.20 Beta (Reasoning) | 0.00% | 65.1% | $3,800 |
| Human baseline | 100% | 100% | -- |
ARC-AGI-3 scores from the official ARC Prize leaderboard, semi-private set. ARC-AGI-2 figures from the technical paper (Table 2). *GPT-5.4 Pro xHigh mode used for ARC-AGI-2.
That last column is worth sitting with for a moment. Gemini 3.1 Pro was the cheapest to run and got the best score. Opus 4.6 cost four times more and scored half as well. The benchmark caps evaluations at a hard $10,000 per run -- Opus at $8,900 is close to the ceiling.
Why evaluation costs this much
Standard benchmarks are cheap to run. Send a prompt, get a response, score it. ARC-AGI-3 works differently. Each environment has at least six levels. The model must take actions, observe results, and reason about what it learned before taking the next action. The context window grows with every turn as the full interaction history is re-read.
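A toy calculation makes that context growth concrete. If each turn adds a fixed amount of history and the full history is re-sent on every turn, cumulative input tokens grow roughly quadratically with the number of turns. The per-turn figure below is an illustrative guess, not a measured number from the benchmark.

```python
# Toy model: input tokens accumulate quadratically when the full
# interaction history is re-read on every turn. The 2,000-token
# per-turn figure is an illustrative assumption.

def cumulative_input_tokens(turns, tokens_per_turn=2_000):
    total, context = 0, 0
    for _ in range(turns):
        context += tokens_per_turn    # the history grows by one turn's worth
        total += context              # and the whole history is sent again
    return total

print(cumulative_input_tokens(50))    # 2,550,000
print(cumulative_input_tokens(100))   # 10,100,000 -- double the turns, roughly 4x the input tokens
```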
The benchmark also uses extended reasoning modes -- High for GPT-5.4, Max for Opus 4.6, Preview reasoning for Gemini. These modes generate long internal chains of thought before producing each move. Output tokens from that chain-of-thought reasoning, not input tokens, are what drive the cost here.
The ARC-AGI-3 technical paper puts it plainly: running a full evaluation using high-reasoning frontier model APIs "could run in the tens of thousands of dollars as of early 2026." The hard cutoff at 5x human action count per level exists to keep costs in range. Back-of-the-envelope: at $2,200 for 55 environments, Gemini spent roughly $40 per environment. At $2 input / $12 output per million tokens, that implies around 3 million output tokens per environment across six or more levels -- a lot of reasoning about what to do next in a game it has never seen before.
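The same back-of-the-envelope math in code, using only the figures quoted above (Gemini's $2,200 eval cost, the 55 semi-private environments, and its $12-per-million output rate) and assuming, as the paragraph argues, that output tokens dominate the bill:

```python
# Reproduce the article's per-environment estimate for Gemini 3.1 Pro.
# Assumes the bill is dominated by output (chain-of-thought) tokens.

total_cost_usd = 2_200
environments = 55
output_price_per_1m = 12.00

cost_per_env = total_cost_usd / environments                      # ~$40
output_tokens_per_env = cost_per_env / output_price_per_1m * 1e6  # ~3.3 million

print(f"~${cost_per_env:.0f} per environment, ~{output_tokens_per_env / 1e6:.1f}M output tokens")
```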
API pricing for the models tested
These are the standard API rates. The eval runs used extended reasoning modes, which add cost per token or per call, depending on the provider.
| Model | Input / 1M | Output / 1M | Context |
|---|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M |
| GPT-5.4 | $2.50 | $15.00 | 1.1M |
| GPT-5.4 Pro | $30.00 | $180.00 | 1.1M |
| Claude Opus 4.6 | $5.00 | $25.00 | 1M |
| Grok 4.20 Beta | $2.00 | $6.00 | 2M |
Pricing from provider API docs, retrieved March 31, 2026. Compare all models at tokencost.app/pricing.
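For quick comparisons, the table's list rates can be dropped into a small helper like the one below. The model keys are informal shorthand rather than official API identifiers, the token counts are illustrative, and extended reasoning surcharges are not included.

```python
# List rates from the table above, USD per 1M tokens: (input, output).
# Keys are shorthand, not official API model identifiers; extended
# reasoning surcharges are not modeled.
RATES = {
    "gemini-3.1-pro":  (2.00, 12.00),
    "gpt-5.4":         (2.50, 15.00),
    "gpt-5.4-pro":     (30.00, 180.00),
    "claude-opus-4.6": (5.00, 25.00),
    "grok-4.20-beta":  (2.00, 6.00),
}

def request_cost(model, input_tokens, output_tokens):
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

# Illustrative long-context agentic turn: 200k input tokens, 30k reasoning/output tokens.
for model in RATES:
    print(f"{model}: ${request_cost(model, 200_000, 30_000):.2f}")
```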
What this benchmark actually measures
The benchmark is built on Core Knowledge priors -- the kind of reasoning skills a child develops before age seven. Object permanence, simple causality, spatial relationships. No language, no math, no real-world knowledge. Just patterns and cause-and-effect in a visual grid.
Scoring uses RHAE (Relative Human Action Efficiency): per level, the model's action count is measured against the second-best human player's, and the ratio is squared. A model that solves a level but takes 3x as many moves as a human scores lower than one that solves it efficiently. The hard cutoff at 5x the human action count means the model can't brute-force levels -- it either figures out what's happening or runs out of turns.
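Read literally, that description gives a per-level score along these lines: the reference human's action count divided by the agent's, squared, and zeroed out past the 5x cutoff. The exact functional form below is an assumption, not the official scoring rule.

```python
# Hedged sketch of a per-level RHAE-style score, inferred from the
# description above; the exact formula is an assumption, not official.

def rhae_level(agent_actions, human_actions, cutoff=5):
    if agent_actions > cutoff * human_actions:   # hard cutoff: counts as unsolved
        return 0.0
    ratio = human_actions / agent_actions        # 1.0 means human-level efficiency
    return min(ratio, 1.0) ** 2                  # squared; capped so beating the human tops out at 1

print(rhae_level(agent_actions=30, human_actions=10))   # 3x the human's moves -> ~0.11
print(rhae_level(agent_actions=10, human_actions=10))   # matches the human -> 1.0
```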
The failure mode of current frontier models is revealing. They do well on ARC-AGI-2 partly because they've seen grid transformation tasks in training data. ARC-AGI-3 does not give them that. The benchmark was designed to be out-of-distribution from anything publicly available. The result: models that score 65-84% on ARC-AGI-2 collapse to under 0.4% on ARC-AGI-3.
The $700,000 prize no AI can claim yet
The ARC Prize 2026 competition offers $2 million in total prizes. The ARC-AGI-3 track carries $850,000, with a $700,000 grand prize for the first system to hit 100%. Milestone prizes ($75,000 total) go to the highest-scoring open-source submission at each of the June 30 and September 30 checkpoints.
One constraint that matters: solutions must run without internet access during the Kaggle evaluation phase. No calling OpenAI, Anthropic, or Google APIs. Everything has to run locally. Given the current scores, the grand prize will almost certainly roll to 2027 -- but the milestone prizes give smaller research teams something achievable.
Submissions close November 2, 2026. Papers are due November 8, with results announced December 4. The benchmark paper will be presented at ICLR 2026 in April.
What it says about the AGI claims
The timing is pointed. Sam Altman has said OpenAI has "basically built AGI." The ARC-AGI-3 launch at YC HQ, with Altman on stage next to Chollet, puts a number on what that means in practice. The best AI scores 0.37% on a benchmark a seven-year-old can pass.
Chollet's framing is blunt: "When the most advanced AI systems are stumped, but a child can do it, that's a big red flashing light telling you that we're missing something, that something really important is off."
There is a reasonable counterargument. ARC-AGI-3 measures one specific thing: interactive, instruction-free goal inference in visual environments. Current frontier models were trained on text. They are genuinely better than humans at plenty of tasks. Neither point cancels the other. The benchmark does not prove AI is useless -- it shows there is a specific type of reasoning that current training approaches do not produce, even at several thousand dollars per run.
Running multi-turn agentic workloads?
Extended reasoning modes and long-context multi-turn tasks cost more than standard completions -- often by a lot. Use our pricing tables and calculator to estimate what your actual API spend looks like before it surprises you.