
Anthropic and ETH Zurich: a long CLAUDE.md worsens agent performance and raises costs

ETH Zurich analyzed 138 repositories and reached an uncomfortable conclusion: long CLAUDE.md and AGENTS.md files often do not help agents — they hinder them…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

CLAUDE.md and AGENTS.md were intended as a quick way to explain project rules to an agent, but new research suggests that long context hurts more often than it helps. On a sample of 138 Python repositories, researchers at ETH Zurich observed lower success rates and higher costs, especially with automatically generated files.

What the research showed

The authors of the paper Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? took 138 real repositories, collected 5694 pull requests, and ran tasks through four models: Claude Sonnet 4.5, Codex GPT-5.2, GPT-5.1 Mini, and Qwen3-30B. They compared three scenarios: no context file, LLM-generated instructions, and human-maintained AGENTS.md files. Importantly, the study used real tasks from live codebases rather than abstract demos.

For automatically generated files, the main result was disappointing: success rate dropped by 3% on average, while inference cost rose by more than 20%. Human-maintained files fared better, raising success by about 4%, but they also increased cost by nearly 19%. In other words, a context file is not a free performance boost. Even when it helps, the quality gain is modest compared with the constant overpayment for tokens and extra agent steps.
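To see why a modest success gain can still be a poor trade, one can normalize the baseline (no context file) to a success rate and cost of 1.0 and compute the cost per solved task. This is a back-of-the-envelope sketch under an assumption the article does not confirm: that the reported percentages are relative changes to the baseline rather than percentage points. The figures (3%, 20%, 4%, 19%) come from the paragraph above; the function name is hypothetical.

```python
def relative_cost_per_success(success_change: float, cost_change: float) -> float:
    """Cost per solved task relative to a baseline normalized to 1.0.

    Assumption: both changes are relative, e.g. -0.03 means success is
    97% of the baseline and 0.20 means cost is 120% of the baseline.
    """
    success = 1.0 + success_change
    cost = 1.0 + cost_change
    return cost / success


# Figures from the article; "more than 20%" and "nearly 19%" are rounded here.
scenarios = {
    "LLM-generated file": (-0.03, 0.20),
    "human-maintained file": (0.04, 0.19),
}

for name, (d_success, d_cost) in scenarios.items():
    ratio = relative_cost_per_success(d_success, d_cost)
    print(f"{name}: {ratio:.2f}x baseline cost per solved task")
```

Under these assumptions, each solved task costs roughly 24% more with a generated file (1.20 / 0.97) and roughly 14% more with a human-maintained one (1.19 / 1.04), which is the "overpayment" the article describes.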

Why long files hurt

The most counterintuitive finding of the paper is that descriptions of project structure barely help the agent navigate. Sections on directories, architecture, and the tech stack look useful to humans, but agents often find all of this themselves with grep, glob, and file reads. If the information already lives in pyproject.toml, package.json, a linter config, or the repository layout itself, a long explanation merely duplicates what the model can quickly discover on its own.

If the agent can learn this from the code itself, it's better to remove it from the instruction.
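One way to apply this rule before writing a context file is to check which facts the repository already exposes through standard files. A minimal sketch, assuming a Python or JavaScript project whose marker files sit in the repository root; the function `discoverable_facts` and the file-to-fact mapping are illustrative, not taken from the paper, and cover only a handful of common lockfiles and configs. .env.example is deliberately left out, since the article recommends keeping a pointer to it.

```python
from pathlib import Path

# Illustrative mapping: marker file -> fact an agent can infer just by listing or reading it.
MARKERS = {
    "uv.lock": "package manager: uv",
    "poetry.lock": "package manager: Poetry",
    "Pipfile.lock": "package manager: pipenv",
    "package-lock.json": "package manager: npm",
    "yarn.lock": "package manager: Yarn",
    "pnpm-lock.yaml": "package manager: pnpm",
    "pyproject.toml": "Python project metadata and tool config",
    "package.json": "JS dependencies and npm scripts",
}


def discoverable_facts(repo: Path) -> list[str]:
    """Return facts already visible from marker files in the repository root.

    Anything returned here is a candidate for removal from CLAUDE.md/AGENTS.md.
    Only the root directory is checked; nested projects are ignored.
    """
    return [fact for name, fact in MARKERS.items() if (repo / name).is_file()]


if __name__ == "__main__":
    for fact in discoverable_facts(Path(".")):
        print("already discoverable:", fact)
```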

The researchers also noticed that models given context files take more actions than the task requires. They re-read the instructions more often, run tests when it isn't necessary, and more readily invoke the tools mentioned in the file. The paper separately notes an increase in reasoning tokens for GPT models. In other words, the instruction doesn't simply add knowledge; it changes agent behavior, and the agent starts following rules even where they don't help solve the current task.

What should be kept

This research does not argue for abandoning CLAUDE.md or AGENTS.md altogether. Rather, it suggests that such files should be short and contain only what the agent can't reliably infer from code or configs on its own. The less decorative text, the lower the risk that the model gets stuck on unnecessary rituals instead of executing the specific request.

  • Non-standard test run commands
  • Package manager, if it's not obvious
  • Custom scripts, tools, and deployment specifics
  • Naming conventions, if they can't be quickly inferred from code
  • A link to .env.example or another critical entry-point file

Another study complicates the picture: in it, Codex with AGENTS.md worked faster and consumed fewer tokens. However, that sample was much smaller, and the correctness of results was evaluated only to a limited extent. So the overall conclusion remains cautious for now: short, practical context sometimes helps, while a long file surveying the architecture, tech stack, and general rules easily becomes expensive ballast. The ETH Zurich paper also barely addresses code maintainability and adherence to project style, so the debate over the usefulness of such files is not settled yet.

What this means

The practical takeaway is simple: treat CLAUDE.md as a list of fixes for specific agent errors, not as a project encyclopedia. If a line doesn't prevent a known failure, doesn't describe a non-standard command, and doesn't add context unavailable elsewhere, remove it. For teams that actively use coding agents, this is a direct reason to trim context files, cut token spending, and test on their own tasks which lines actually improve results and which only add expensive noise.
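Testing "on your own tasks" which lines help can be framed as a leave-one-out ablation: drop one line of the context file at a time, rerun a fixed task set, and compare success rate and cost with the full file. A minimal sketch; `evaluate` stands in for a hypothetical harness that runs your agent with a given context-file text over your tasks and returns a success rate and a cost. The paper does not prescribe this procedure, and a real run would need repeated trials because agent results are noisy.

```python
from typing import Callable, NamedTuple


class Result(NamedTuple):
    success_rate: float  # fraction of tasks solved, 0.0 to 1.0
    cost: float          # e.g. total tokens or dollars over the task set


# Hypothetical harness: runs the agent with the given context text over a fixed task set.
Evaluator = Callable[[str], Result]


def ablate_lines(context: str, evaluate: Evaluator) -> list[tuple[str, float, float]]:
    """Leave-one-out ablation over the non-empty lines of a context file.

    For each line, returns (line, success delta, cost delta) of removing it,
    relative to the full file. A line whose removal does not lower success
    but does lower cost is a candidate for deletion.
    """
    lines = context.splitlines()
    full = evaluate(context)
    report = []
    for i, line in enumerate(lines):
        if not line.strip():
            continue  # blank lines carry no instruction
        without = "\n".join(lines[:i] + lines[i + 1:])
        r = evaluate(without)
        report.append((line, r.success_rate - full.success_rate, r.cost - full.cost))
    return report


if __name__ == "__main__":
    # Toy stand-in evaluator for illustration only: the test-command line helps,
    # and every line adds the same fixed cost.
    def toy_evaluate(ctx: str) -> Result:
        helps = "uv run pytest" in ctx
        return Result(0.6 if helps else 0.5, 10.0 * len(ctx.splitlines()))

    context = "Run tests with: uv run pytest\nThe src/ directory holds the code"
    for line, d_success, d_cost in ablate_lines(context, toy_evaluate):
        print(f"{d_success:+.2f} success, {d_cost:+.0f} cost  <- removing: {line}")
```

With the toy evaluator, removing the test-command line lowers success, while removing the directory description saves cost without any loss, mirroring the article's advice to drop structure descriptions the agent can discover itself.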

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
