PAC1 exposed a weakness in reasoning models: a hardcoded agent passed the benchmark at lower cost
AI-processed from Habr AI; edited by Hamidun News
The closed benchmark PAC1, designed for AI agents, unexpectedly revealed a weak point in reasoning models themselves. Instead of a "smart" solution, a hackathon participant built a Zero-Cost Agent — a set of rigid algorithms that circumvented typical LLM errors and completed tasks more cheaply.
Where the agent broke down
PAC1 tests not the ability to reason elegantly but the ability to act within a constrained environment. The agent has to read logs, find the necessary files, and send emails, all while avoiding traps such as Indirect Prompt Injections, where malicious instructions hide inside data. According to the author's description, it is precisely in this setup that reasoning models began to falter: they hallucinated, lost context between actions, broke JSON structure, and inserted their own "reasoning" instead of clean output. In a closed sandbox this is critical, because a single wrong key or extra line immediately breaks the next step and rapidly inflates the API bill.
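To make the "single wrong key or extra line" failure concrete, here is a minimal sketch of a strict output gate that rejects anything other than a clean action object. The action names and required keys are hypothetical and are not taken from PAC1; the article does not describe the benchmark's actual schema.

```python
import json

# Hypothetical schema: action names and required keys are illustrative only,
# not the real PAC1 format.
ALLOWED_ACTIONS = {
    "read_file": {"path"},
    "send_email": {"to", "subject", "body"},
}


def parse_action(raw: str) -> dict:
    """Parse a model reply and reject anything that is not a clean action object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Covers free-form "reasoning" before or after the JSON payload.
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    expected = ALLOWED_ACTIONS[action] | {"action"}
    if set(data) != expected:
        # A missing, misspelled, or extra key fails here instead of in the next step.
        raise ValueError(f"keys must be exactly {sorted(expected)}")
    return data
```

A reply such as `Sure! {"action": "read_file", "path": "a.log"}` fails at the first check, so the error surfaces before the next step runs rather than after it.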
Why hardcoding won
After several failed attempts, the author abandoned the idea of "squeezing" the model through prompts and bet on a deterministic scenario. The result was the Zero-Cost Agent: essentially an algorithmic executor that does not simulate thinking but knows in advance which operations to check, in what order to traverse the file system, and how to respond to known classes of traps. Instead of universal intelligence, the solution uses a set of rules that can be tested in advance and rigidly controlled.
- rigid input and output format without free-form explanations;
- pre-described routes for file search and log reading;
- separate handling of dangerous instructions hidden within file contents;
- predictable steps for sending emails and other actions;
- rejection of retry loops that quickly burn LLM budgets.
This approach looks crude, but within PAC1, it delivers an advantage. The algorithm doesn't argue with itself, doesn't rewrite the answer, and doesn't waste tokens explaining obvious steps. Its cost is almost independent of the number of "reasoning" cycles, because there are no reasoning cycles in the chain. In the hackathon task, this turned an unstable agent system into a tool whose behavior can be predicted and measured.
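The rules listed above can be illustrated with a short sketch. It is not the author's code: the directory names, the `ERROR` marker, and the injection patterns are all assumptions made for the example. It shows a fixed search route, file contents treated strictly as data, and a single pass with no retries.

```python
import re
from pathlib import Path

# Hypothetical patterns; a real list would come from the traps actually observed.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"send .* to \S+@\S+", re.IGNORECASE),
]

# Fixed search route, visited in the same order every run (assumed directory names).
SEARCH_ROUTE = ["logs", "reports"]


def is_suspicious(text: str) -> bool:
    """Return True if the text matches any known injection pattern."""
    return any(pattern.search(text) for pattern in INJECTION_PATTERNS)


def find_error_lines(root: Path) -> dict:
    """Walk the fixed route once and collect log lines that contain 'ERROR'."""
    result = {"errors": [], "quarantined": []}
    for dirname in SEARCH_ROUTE:
        directory = root / dirname
        if not directory.is_dir():
            continue
        for path in sorted(directory.glob("*.log")):
            text = path.read_text(encoding="utf-8")
            if is_suspicious(text):
                # File contents are data, never instructions: skip the file and record it.
                result["quarantined"].append(path.name)
                continue
            result["errors"].extend(
                line.strip() for line in text.splitlines() if "ERROR" in line
            )
    return result
```

Because no step calls a model, the cost of a run depends only on how many files are on the route, which matches the article's point about the absence of reasoning cycles.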
What the experiment revealed
The PAC1 story challenges the popular thesis that a powerful reasoning model is enough to handle agent automation on its own. In practice, an environment with a file system, formal answers, and embedded attacks turns out to be closer to an engineering problem than to a conversation with an assistant. What matters here is validation, state control, limits on transitions, and explicit error handling. If a system consistently produces correct JSON and doesn't get distracted by false instructions, it beats a more "intelligent" but unstable model.
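State control, transition limits, and explicit error handling can be sketched as a small state machine. The states, the transition table, and the step cap below are hypothetical and chosen only to mirror the task types the article mentions. The handlers are passed in as plain callables, so the sketch stays independent of any particular model or API.

```python
from enum import Enum


class State(Enum):
    READ_LOGS = "read_logs"
    FIND_FILE = "find_file"
    SEND_EMAIL = "send_email"
    DONE = "done"
    FAILED = "failed"


TERMINAL = {State.DONE, State.FAILED}

# Allowed transitions (hypothetical task graph). FIND_FILE may repeat,
# for example to try the next directory, so a step cap is still needed.
TRANSITIONS = {
    State.READ_LOGS: {State.FIND_FILE, State.FAILED},
    State.FIND_FILE: {State.FIND_FILE, State.SEND_EMAIL, State.FAILED},
    State.SEND_EMAIL: {State.DONE, State.FAILED},
}

MAX_STEPS = 10  # assumed budget; tune per task


def run(handlers: dict, max_steps: int = MAX_STEPS) -> list:
    """Run handlers until a terminal state; return the trace of visited states."""
    state = State.READ_LOGS
    trace = [state]
    steps = 0
    while state not in TERMINAL:
        if steps >= max_steps:
            state = State.FAILED  # transition limit reached: stop, do not retry
        else:
            try:
                candidate = handlers[state]()
            except Exception:
                candidate = State.FAILED  # explicit error handling: any crash is a failure
            allowed = isinstance(candidate, State) and candidate in TRANSITIONS[state]
            state = candidate if allowed else State.FAILED
            steps += 1
        trace.append(state)
    return trace
```

An illegal jump, a crashing handler, and an endless loop all end in the same `FAILED` state, which keeps the run's behavior and its cost bounded.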
"If AI can't handle it, we'll replace it with good old hardcoding."
The author's phrase about "good old hardcoding" sounds provocative, but the meaning is quite pragmatic. It's not about the uselessness of neural networks, but about the boundaries of their application without rigid scaffolding. If the task is standard, the rules are known, and the cost of error is high, a set of deterministic heuristics sometimes delivers better results than a model with large context, elegant explanations, and a long chain of retry attempts. For corporate tasks with formal interfaces, this is especially noticeable: the system must be dull, verifiable, and predictable, not impressively verbose.
What this means
For AI agent developers, the PAC1 case is a reminder that system reliability often matters more than model power. In real products, a hybrid approach will increasingly be the norm: an LLM where variability and uncertainty have to be handled, and rigid logic where format, safety, cost, and repeatable results matter. Such combinations, rather than a pure bet on a single model, are likely to become the standard for production agents.
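One way to picture such a hybrid is a router that sends known task types to deterministic handlers and falls back to a model only for everything else. This is a sketch under assumptions: the task type `count_errors` and the `llm` callable are hypothetical, and the model is injected as a plain function so that no specific API is implied.

```python
from typing import Callable


def handle_count_errors(payload: str) -> str:
    """Deterministic rule: count lines that contain 'ERROR'."""
    return str(sum(1 for line in payload.splitlines() if "ERROR" in line))


# Registry of task types with known, testable rules (hypothetical names).
RULES: dict[str, Callable[[str], str]] = {
    "count_errors": handle_count_errors,
}


def route(task_type: str, payload: str, llm: Callable[[str], str]) -> tuple[str, str]:
    """Return (path_taken, answer); the model is called only when no rule exists."""
    handler = RULES.get(task_type)
    if handler is not None:
        return "rule", handler(payload)
    return "llm", llm(payload)
```

Returning the path taken alongside the answer makes it easy to measure how often the model is actually needed, and therefore what it costs.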