# Is Claude Code getting worse? The data says something did change on March 8.
An AMD AI director published 6,852 session logs showing Claude's thinking depth dropped 67% before anyone noticed. The result: costs climbed from $345 to $42,121 in one month as the model started doing more work to produce worse output.

Everyone's noticed it. Most explanations are wrong.
Developer complaints about Claude Code quality have been building since early March 2026. The usual pattern: anecdotal frustration, some Hacker News speculation, community noise that gets dismissed as AI discourse. This time is different, because on April 2, Stella Laurenzo - Senior Director of AI at AMD - filed GitHub issue #42796 with 6,852 session JSONL files, 234,760 tool calls, and a regression analysis precise enough to name the exact date the cliff appeared: March 8, 2026.
Her conclusion: "Claude cannot be trusted to perform complex engineering tasks." AMD switched providers. The issue attracted 128 comments in 6 days and was covered by The Register, InfoWorld, TechRadar, and WinBuzzer. Anthropic closed it on April 8 without explaining what was actually resolved.
## What 6,852 sessions actually showed
Laurenzo ran automated hooks across all her Claude Code sessions from January 30 to April 1, 2026. The hooks tracked tool call patterns - specifically, whether Claude read files before editing them, whether it finished tasks or stopped early, how often users had to interrupt and correct it. This is behavioral data from production workloads on real codebases, not a benchmark.
| Metric | Before Mar 8 | After Mar 8 | Change |
|---|---|---|---|
| Reads per edit | 6.6 | 2.0 | -70% |
| Edits without prior read | 6.2% | 33.7% | +443% |
| Full-file rewrites | 4.9% | 11.1% | +127% |
| Stop-hook violations / day | 0 | 10 | 0 → 173 total |
| User interrupts / 1K tool calls | 0.9 | 11.4 | +1,167% |
| Monthly cost (AWS Bedrock) | $345 | $42,121 | +122x |
| API requests / month | 1,498 | 119,341 | +80x |
| Output tokens / month | 0.97M | 62.60M | +64x |
Data from GitHub issue #42796. Cost measured at AWS Bedrock rates. Ben Vanik independently corroborated the thinking depth measurement methodology with Pearson correlation 0.971 on 7,146 paired samples.
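The first two table metrics can be approximated from session logs directly. Here is a minimal sketch of that computation, assuming a simplified JSONL schema of one tool call per line with `tool` and `file` fields; the real Claude Code session format is richer, and Laurenzo's actual hooks are not public:

```python
import json
from collections import Counter

def session_metrics(lines):
    """Compute reads-per-edit and percent of edits made without a prior
    read, from a simplified session log (hypothetical schema:
    {"tool": "Read"|"Edit", "file": "..."})."""
    counts = Counter()
    read_files = set()
    blind_edits = 0
    for line in lines:
        call = json.loads(line)
        counts[call["tool"]] += 1
        if call["tool"] == "Read":
            read_files.add(call["file"])
        elif call["tool"] == "Edit" and call["file"] not in read_files:
            blind_edits += 1  # edited a file it never read
    edits = counts["Edit"] or 1
    return {
        "reads_per_edit": counts["Read"] / edits,
        "pct_edits_without_read": 100 * blind_edits / edits,
    }

# Toy three-call session: one read, one informed edit, one blind edit.
log = [
    '{"tool": "Read", "file": "a.py"}',
    '{"tool": "Edit", "file": "a.py"}',
    '{"tool": "Edit", "file": "b.py"}',
]
m = session_metrics(log)
```

Run over thousands of sessions and bucketed by date, a step change in these two numbers is exactly the kind of signal the table shows.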
## The thinking depth fell 67% before anyone could see it
The most important finding in Laurenzo's analysis is this: thinking depth dropped by about two-thirds in late February - before the thinking redaction feature rolled out on March 8. The public narrative was that Anthropic just "hid" thinking summaries in Claude Code v2.1.69 for speed. But the behavioral deterioration started weeks earlier, while thinking was still fully visible.
Laurenzo's explanation: when the model skips deep thinking, it defaults to the cheapest available action. Edit without reading. Stop without finishing. Take the simplest fix instead of the correct one. This is why output tokens went up 64x while quality went down - the model generated more words doing less actual reasoning. Every senior engineer on her team reported the same experience independently.
## What Anthropic said, and why the data argues against it
Boris Cherny, head of Claude Code at Anthropic, posted a response in the thread: the thinking redaction is "merely UI-level, only used to hide the thinking process to improve response speed, and does not affect the model's internal actual reasoning logic, thinking budget, or underlying mechanisms." He noted that users can restore thinking summaries by adding showThinkingSummaries: true to settings.json.
Laurenzo rejected this directly. Her proxy metric for thinking depth (the signature field correlation, corroborated independently at r=0.971) showed the decline starting in late February, when thinking was still fully visible in the UI. If it were just a display change, you'd see no behavioral shift before March 8. The behavioral data shows a clear shift starting weeks earlier.
The more plausible technical explanation: when Claude Opus 4.6 launched with adaptive thinking in February, Anthropic set the default effort level to "medium" (value: 85) rather than "high". Adaptive thinking at medium effort means the model may skip deep reasoning for tasks it internally judges as simple. If that complexity assessment was miscalibrated for large, multi-file engineering work, it would apply shallow thinking to tasks that needed more. That's a default misconfiguration, not deliberate degradation.
## Why the cost math is backwards
Here's the part that should concern anyone running agentic Claude Code workloads: the model getting lazier didn't make things cheaper. It made them much more expensive.
When Claude stops reading files before editing them, it makes wrong edits that require correction, triggering more calls. When it stops mid-task, users restart, re-sending context with every new session. When it rewrites whole files instead of making targeted edits, output tokens explode. The result: worse engineering work, and a bill that went from $345 in February to $42,121 in March for the privilege.
There were also two independent prompt cache bugs, starting March 23, that broke caching and silently inflated costs by 10-20%. Those bugs explain some of the March spike. They don't explain the behavioral metrics (stop-hook violations, edits without prior reads) that shifted on March 8. Both problems existed simultaneously.
## The benchmark paradox
If you look at LMSYS Chatbot Arena as of April 6, 2026, Claude Opus 4.6 Thinking is ranked #1 overall with an Arena Elo of 1504 - the first model ever to cross the 1500 barrier. On the coding-specific leaderboard, Opus 4.6 and Opus 4.6 Thinking take the top two spots. On Artificial Analysis's intelligence index it scores 53, ranking 4th of 127 models.
Both things are true. Arena measures single-turn or short multi-turn interactions. Laurenzo's complaint is specifically about long-session, multi-file, multi-agent engineering work - concurrent sessions across 10 projects, hours of autonomous operation. That's not what Arena tests.
A model can top the leaderboard on a benchmark that doesn't measure what you actually need. For single-query coding help, Claude Opus 4.6 is probably as good as advertised. For the specific use case of autonomous, long-context, multi-file engineering - the thing Claude Code is supposed to be built for - there's a documented regression with measurable dates and behavioral data behind it.
## The Mythos timing: coincidence or sandbagging?
Anthropic announced Claude Mythos on April 7, 2026 - one day after The Register published the AMD story. Mythos scored 93.9% on SWE-bench Verified and 97.6% on USAMO 2026, described internally as a "step-change" above Opus and the first model to find "thousands of zero-day vulnerabilities" in major operating systems autonomously. Its codename "Capybara" leaked via a Claude Code source map on March 31.
The circumstantial narrative writes itself: Anthropic throttled Opus's thinking depth to push developers toward their next model. There's no evidence this is true. Mythos is restricted to a small set of cybersecurity partners (Amazon, Apple, Cisco, Microsoft, etc.) and won't be generally available anytime soon. You can't migrate your Claude Code setup to it. The degradation had no obvious commercial benefit for Anthropic.
The more boring explanation: the effort default change in February was a cost optimization decision that went wrong for power users, the cache bugs added fuel, and by March 8, it crossed a threshold where the behavioral impact became hard to ignore. Boris Cherny's "just UI" response suggests Anthropic may not have fully understood the extent of the issue when they initially responded.
## What actually helps
A few options if you're seeing this in your own sessions:
Add to your project's `.claude/settings.json`:

```json
{
  "showThinkingSummaries": true
}
```

This restores visibility into thinking. It doesn't increase thinking depth, but it lets you see when the model is reasoning shallowly and prompt for more.
You can set the effort level explicitly in your prompts ("Think thoroughly about this before editing any files") or use the `thinking` parameter directly if you're calling the API. The medium-effort default means Claude may skip deep reasoning on tasks it judges routine; an explicit push overrides that judgment.
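On the API side, that looks roughly like the following. This is a sketch built on Anthropic's documented extended-thinking request shape (`thinking` with a token budget); the model id is hypothetical, and whether Opus 4.6's adaptive effort honors an explicit budget identically is an assumption, not something the issue thread confirms:

```python
# Request payload pinning an explicit thinking budget instead of
# relying on the adaptive "medium" default.
request = {
    "model": "claude-opus-4-6",      # hypothetical model id
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 10000,      # reserve a deep reasoning budget
    },
    "messages": [
        {
            "role": "user",
            "content": "Think thoroughly before editing any files. "
                       "Then apply the smallest targeted diff.",
        }
    ],
}
# With the official Python SDK this would be passed as:
#   client.messages.create(**request)
```

The point is simply that the budget is under your control per request; you don't have to accept the default.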
The two March 23 cache bugs that inflated costs by 10-20% were patched in later versions. Check your Claude Code version with `claude --version` and update via `npm update -g @anthropic-ai/claude-code`. The token explosion issue from cache bugs is fixable. The thinking depth issue is separate.
Laurenzo caught the regression because she had automated hooks measuring session behavior. Most developers don't. At minimum, watch your API cost trends week over week. A 5x cost increase with no proportional output increase is a signal worth investigating before it becomes $42,000. Use the cost calculator to benchmark what a session should cost, and compare against what you're actually spending.
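At its simplest, that week-over-week check is a few lines over your billing exports. A sketch, assuming you already have weekly cost totals from your provider's billing data (the 5x threshold mirrors the signal mentioned above):

```python
def cost_alerts(weekly_costs, ratio_threshold=5.0):
    """Flag weeks whose spend jumped by more than ratio_threshold over
    the previous week. weekly_costs: ordered list of (label, dollars)."""
    alerts = []
    for (_, prev), (label, cur) in zip(weekly_costs, weekly_costs[1:]):
        if prev > 0 and cur / prev > ratio_threshold:
            alerts.append((label, cur / prev))
    return alerts

# Illustrative numbers shaped like the incident: flat spend, then a cliff.
weeks = [("W1", 80.0), ("W2", 90.0), ("W3", 95.0), ("W4", 9800.0)]
flagged = cost_alerts(weeks)
```

A cron job running this against exported billing data would have flagged the March spike in its first week, not at invoice time.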
## A note on cost planning for agentic workloads
Anthropic's official benchmarks for Claude Code say $6/day average, $12/day at the 90th percentile. Those numbers are for interactive use with official Claude Code, with its caching optimizations intact. They are not meaningful reference points for autonomous, multi-session, large-codebase engineering work.
If you run agentic Claude Code sessions across multiple projects concurrently, budget significantly higher - and build in behavioral monitoring from the start. The cost of a session that goes wrong (context fills, Claude starts looping, stops following instructions) can be multiples of a session that goes right. At Claude Opus 4.6 API rates ($5 input / $25 output per MTok), a session that rewrites full files instead of making targeted edits can easily add hundreds of dollars per session for the same net change.
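The arithmetic behind that claim, at the quoted $5 input / $25 output per MTok rates (file sizes and tokens-per-line are illustrative assumptions):

```python
IN_RATE = 5 / 1_000_000    # $ per input token
OUT_RATE = 25 / 1_000_000  # $ per output token

def output_cost(tokens):
    """Dollar cost of emitting `tokens` output tokens."""
    return tokens * OUT_RATE

# A 2,000-line file at ~15 tokens/line, regenerated in full on each of
# 50 iterations, vs. a 20-line targeted diff for the same net change:
full_rewrite = output_cost(2_000 * 15) * 50   # ~$37.50
targeted     = output_cost(20 * 15) * 50      # ~$0.38
```

That's a 100x gap per file, before counting the input tokens of re-reading the inflated context on every turn; across ten concurrent multi-file sessions, it compounds into the hundreds of dollars the paragraph describes.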
See our breakdown of what Claude Code actually costs per session and current Claude API pricing for the full picture.
## Sources
- GitHub issue #42796: Stella Laurenzo (AMD) - Claude Code unusable for complex engineering tasks
- Ben Vanik: Independent corroboration of thinking depth methodology (r=0.971, 7,146 samples)
- The Register: AMD AI director slams Claude Code after months of frustration (April 6, 2026)
- InfoWorld: Enterprise developers question Claude Code reliability (April 2026)
- The Register: Anthropic admits Claude Code usage limits crisis and cache bugs (March 31, 2026)
- TechCrunch: Anthropic announces Claude Mythos preview (April 7, 2026)
- Artificial Analysis: Claude Opus 4.6 Adaptive intelligence index
- Hacker News: "Has Claude Code quality gotten worse?" discussion thread