# Is Claude Code getting worse? The data says something did change on March 8.
An AMD AI director published 6,852 session logs showing Claude's thinking depth dropped 67% before anyone noticed. The result: costs climbed from $345 to $42,121 in one month as the model started doing more work to produce worse output.

Everyone's noticed it. Most explanations are wrong.
Developer complaints about Claude Code quality have been building since early March 2026. The usual pattern: anecdotal frustration, some Hacker News speculation, community noise that gets dismissed as AI discourse. This time is different, because on April 2, Stella Laurenzo - Senior Director of AI at AMD - filed GitHub issue #42796 with 6,852 session JSONL files, 234,760 tool calls, and a regression analysis precise enough to name the exact date the cliff appeared: March 8, 2026.
Her conclusion: "Claude cannot be trusted to perform complex engineering tasks." AMD switched providers. The issue attracted 128 comments in 6 days and was covered by The Register, InfoWorld, TechRadar, and WinBuzzer. Anthropic closed it on April 8 without explaining what was actually resolved.
## What 6,852 sessions actually showed
Laurenzo ran automated hooks across all her Claude Code sessions from January 30 to April 1, 2026. The hooks tracked tool call patterns - specifically, whether Claude read files before editing them, whether it finished tasks or stopped early, how often users had to interrupt and correct it. This is behavioral data from production workloads on real codebases, not a benchmark.
| Metric | Before Mar 8 | After Mar 8 | Change |
|---|---|---|---|
| Reads per edit | 6.6 | 2.0 | -70% |
| Edits without prior read | 6.2% | 33.7% | +443% |
| Full-file rewrites | 4.9% | 11.1% | +127% |
| Stop-hook violations / day | 0 | 10 | 0 → 173 total |
| User interrupts / 1K tool calls | 0.9 | 11.4 | +1,167% |
| Monthly cost (AWS Bedrock) | $345 | $42,121 | +122x |
| API requests / month | 1,498 | 119,341 | +80x |
| Output tokens / month | 0.97M | 62.60M | +64x |
Data from GitHub issue #42796. Cost measured at AWS Bedrock rates. Ben Vanik independently corroborated the thinking depth measurement methodology with Pearson correlation 0.971 on 7,146 paired samples.
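The first two table metrics can be approximated from session logs directly. Here is a minimal sketch of that computation, assuming a simplified JSONL schema of one tool call per line with `tool` and `file` fields; the real Claude Code session format is richer, and Laurenzo's actual hooks are not public:

```python
import json
from collections import Counter

def session_metrics(lines):
    """Compute reads-per-edit and percent of edits made without a prior
    read, from a simplified session log (hypothetical schema:
    {"tool": "Read"|"Edit", "file": "..."})."""
    counts = Counter()
    read_files = set()
    blind_edits = 0
    for line in lines:
        call = json.loads(line)
        counts[call["tool"]] += 1
        if call["tool"] == "Read":
            read_files.add(call["file"])
        elif call["tool"] == "Edit" and call["file"] not in read_files:
            blind_edits += 1  # edited a file it never read
    edits = counts["Edit"] or 1
    return {
        "reads_per_edit": counts["Read"] / edits,
        "pct_edits_without_read": 100 * blind_edits / edits,
    }

# Toy three-call session: one read, one informed edit, one blind edit.
log = [
    '{"tool": "Read", "file": "a.py"}',
    '{"tool": "Edit", "file": "a.py"}',
    '{"tool": "Edit", "file": "b.py"}',
]
m = session_metrics(log)
```

Run over thousands of sessions and bucketed by date, a step change in these two numbers is exactly the kind of signal the table shows.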
## The thinking depth fell 67% before anyone could see it
The most important finding in Laurenzo's analysis is this: thinking depth dropped by about two-thirds in late February - before the thinking redaction feature rolled out on March 8. The public narrative was that Anthropic just "hid" thinking summaries in Claude Code v2.1.69 for speed. But the behavioral deterioration started weeks earlier, while thinking was still fully visible.
Laurenzo's explanation: when the model skips deep thinking, it defaults to the cheapest available action. Edit without reading. Stop without finishing. Take the simplest fix instead of the correct one. This is why output tokens went up 64x while quality went down - the model generated more words doing less actual reasoning. Every senior engineer on her team reported the same experience independently.
## What Anthropic said, and why the data argues against it
Boris Cherny, head of Claude Code at Anthropic, posted a response in the thread: the thinking redaction is "merely UI-level, only used to hide the thinking process to improve response speed, and does not affect the model's internal actual reasoning logic, thinking budget, or underlying mechanisms." He noted that users can restore thinking summaries by adding showThinkingSummaries: true to settings.json.
Laurenzo rejected this directly. Her proxy metric for thinking depth (the signature field correlation, corroborated independently at r=0.971) showed the decline starting in late February, when thinking was still fully visible in the UI. If it were just a display change, you'd see no behavioral shift before March 8. The behavioral data shows a clear shift starting weeks earlier.
The more plausible technical explanation: when Claude Opus 4.6 launched with adaptive thinking in February, Anthropic set the default effort level to "medium" (value: 85) rather than "high". Adaptive thinking at medium effort means the model may skip deep reasoning for tasks it internally judges as simple. If that complexity assessment was miscalibrated for large, multi-file engineering work, it would apply shallow thinking to tasks that needed more. That's a default misconfiguration, not deliberate degradation.
## Why the cost math is backwards
Here's the part that should concern anyone running agentic Claude Code workloads: the model getting lazier didn't make things cheaper. It made them much more expensive.
When Claude stops reading files before editing them, it makes wrong edits that require correction, triggering more calls. When it stops mid-task, users restart, re-sending context with every new session. When it rewrites whole files instead of making targeted edits, output tokens explode. The result: worse engineering work, and a bill that went from $345 in February to $42,121 in March for the privilege.
There were also two independent prompt cache bugs, starting March 23, that broke caching and silently inflated costs by 10-20%. Those bugs explain some of the March spike. They don't explain the behavioral metrics (stop-hook violations, edits without prior reads) that shifted on March 8. Both problems existed simultaneously.
## The benchmark paradox
If you look at LMSYS Chatbot Arena as of April 6, 2026, Claude Opus 4.6 Thinking is ranked #1 overall with an Arena Elo of 1504 - the first model ever to cross the 1500 barrier. On the coding-specific leaderboard, Opus 4.6 and Opus 4.6 Thinking take the top two spots. On Artificial Analysis's intelligence index it scores 53, ranking 4th of 127 models.
Both things are true. Arena measures single-turn or short multi-turn interactions. Laurenzo's complaint is specifically about long-session, multi-file, multi-agent engineering work - concurrent sessions across 10 projects, hours of autonomous operation. That's not what Arena tests.
A model can top the leaderboard on a benchmark that doesn't measure what you actually need. For single-query coding help, Claude Opus 4.6 is probably as good as advertised. For the specific use case of autonomous, long-context, multi-file engineering - the thing Claude Code is supposed to be built for - there's a documented regression with measurable dates and behavioral data behind it.
## The Mythos timing: coincidence or sandbagging?
Anthropic announced Claude Mythos on April 7, 2026 - one day after The Register published the AMD story. Mythos scored 93.9% on SWE-bench Verified and 97.6% on USAMO 2026, described internally as a "step-change" above Opus and the first model to find "thousands of zero-day vulnerabilities" in major operating systems autonomously. Its codename "Capybara" leaked via a Claude Code source map on March 31.
The circumstantial narrative writes itself: Anthropic throttled Opus's thinking depth to push developers toward their next model. There's no evidence this is true. Mythos is restricted to a small set of cybersecurity partners (Amazon, Apple, Cisco, Microsoft, etc.) and won't be generally available anytime soon. You can't migrate your Claude Code setup to it. The degradation had no obvious commercial benefit for Anthropic.
The more boring explanation: the effort default change in February was a cost optimization decision that went wrong for power users, the cache bugs added fuel, and by March 8, it crossed a threshold where the behavioral impact became hard to ignore. Boris Cherny's "just UI" response suggests Anthropic may not have fully understood the extent of the issue when they initially responded.
## What actually helps
A few options if you're seeing this in your own sessions:
Add to your project's `.claude/settings.json`:

```json
{
  "showThinkingSummaries": true
}
```

This restores visibility into thinking. It doesn't increase thinking depth, but it lets you see when the model is reasoning shallowly and prompt for more.
You can set the effort level explicitly in your prompts ("Think thoroughly about this before editing any files") or use the `thinking` parameter directly if you're calling the API. The medium-effort default means Claude may skip deep reasoning on tasks it judges routine; an explicit push overrides that judgment.
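On the API side, that looks roughly like the following. This is a sketch built on Anthropic's documented extended-thinking request shape (`thinking` with a token budget); the model id is hypothetical, and whether Opus 4.6's adaptive effort honors an explicit budget identically is an assumption, not something the issue thread confirms:

```python
# Request payload pinning an explicit thinking budget instead of
# relying on the adaptive "medium" default.
request = {
    "model": "claude-opus-4-6",      # hypothetical model id
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 10000,      # reserve a deep reasoning budget
    },
    "messages": [
        {
            "role": "user",
            "content": "Think thoroughly before editing any files. "
                       "Then apply the smallest targeted diff.",
        }
    ],
}
# With the official Python SDK this would be passed as:
#   client.messages.create(**request)
```

The point is simply that the budget is under your control per request; you don't have to accept the default.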
The two March 23 cache bugs that inflated costs by 10-20% were patched in later versions. Check your Claude Code version with `claude --version` and update via `npm update -g @anthropic-ai/claude-code`. The token explosion issue from cache bugs is fixable. The thinking depth issue is separate.
Laurenzo caught the regression because she had automated hooks measuring session behavior. Most developers don't. At minimum, watch your API cost trends week over week. A 5x cost increase with no proportional output increase is a signal worth investigating before it becomes $42,000. Use the cost calculator to benchmark what a session should cost, and compare against what you're actually spending.
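At its simplest, that week-over-week check is a few lines over your billing exports. A sketch, assuming you already have weekly cost totals from your provider's billing data (the 5x threshold mirrors the signal mentioned above):

```python
def cost_alerts(weekly_costs, ratio_threshold=5.0):
    """Flag weeks whose spend jumped by more than ratio_threshold over
    the previous week. weekly_costs: ordered list of (label, dollars)."""
    alerts = []
    for (_, prev), (label, cur) in zip(weekly_costs, weekly_costs[1:]):
        if prev > 0 and cur / prev > ratio_threshold:
            alerts.append((label, cur / prev))
    return alerts

# Illustrative numbers shaped like the incident: flat spend, then a cliff.
weeks = [("W1", 80.0), ("W2", 90.0), ("W3", 95.0), ("W4", 9800.0)]
flagged = cost_alerts(weeks)
```

A cron job running this against exported billing data would have flagged the March spike in its first week, not at invoice time.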
## A note on cost planning for agentic workloads
Anthropic's official benchmarks for Claude Code say $6/day average, $12/day at the 90th percentile. Those numbers are for interactive use with official Claude Code, with its caching optimizations intact. They are not meaningful reference points for autonomous, multi-session, large-codebase engineering work.
If you run agentic Claude Code sessions across multiple projects concurrently, budget significantly higher - and build in behavioral monitoring from the start. The cost of a session that goes wrong (context fills, Claude starts looping, stops following instructions) can be multiples of a session that goes right. At Claude Opus 4.6 API rates ($5 input / $25 output per MTok), a session that rewrites full files instead of making targeted edits can easily add hundreds of dollars per session for the same net change.
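The arithmetic behind that claim, at the quoted $5 input / $25 output per MTok rates (file sizes and tokens-per-line are illustrative assumptions):

```python
IN_RATE = 5 / 1_000_000    # $ per input token
OUT_RATE = 25 / 1_000_000  # $ per output token

def output_cost(tokens):
    """Dollar cost of emitting `tokens` output tokens."""
    return tokens * OUT_RATE

# A 2,000-line file at ~15 tokens/line, regenerated in full on each of
# 50 iterations, vs. a 20-line targeted diff for the same net change:
full_rewrite = output_cost(2_000 * 15) * 50   # ~$37.50
targeted     = output_cost(20 * 15) * 50      # ~$0.38
```

That's a 100x gap per file, before counting the input tokens of re-reading the inflated context on every turn; across ten concurrent multi-file sessions, it compounds into the hundreds of dollars the paragraph describes.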
See our breakdown of what Claude Code actually costs per session and current Claude API pricing for the full picture.
## Sources
- GitHub issue #42796: Stella Laurenzo (AMD) - Claude Code unusable for complex engineering tasks
- Ben Vanik: Independent corroboration of thinking depth methodology (r=0.971, 7,146 samples)
- The Register: AMD AI director slams Claude Code after months of frustration (April 6, 2026)
- InfoWorld: Enterprise developers question Claude Code reliability (April 2026)
- The Register: Anthropic admits Claude Code usage limits crisis and cache bugs (March 31, 2026)
- TechCrunch: Anthropic announces Claude Mythos preview (April 7, 2026)
- Artificial Analysis: Claude Opus 4.6 Adaptive intelligence index
- Hacker News: "Has Claude Code quality gotten worse?" discussion thread