Habr AI→ original

Claude Code and Codex compared on a real-world task: Claude is stronger in RAG, Codex saves tokens

The author compared Claude Code and Codex in depth across benchmarks, a real-world RAG pipeline build, and day-to-day use. Claude proved stronger on long…

AI-processed from Habr AI; edited by Hamidun News
Claude Code and Codex compared on a real-world task: Claude is stronger in RAG, Codex saves tokens
Source: Habr AI. Collage: Hamidun News.
◐ Listen to article

Claude Code and Codex compared on a real task: Claude is stronger in RAG, Codex saves tokens

The comparison of Claude Code and Codex turned out to be more useful than typical battles over screenshots and blind sympathies. The author compared not only the Opus 4.6 and GPT-5.3-Codex models, but also how both agents behave in a real engineering task, where a working result matters more than a beautiful answer.

How they compared

First, the author looks at the completion time horizon metric from METR research. By this metric, Opus 4.6 handles tasks roughly equivalent to 12 hours of human work at 50% success rate, while GPT-5.3-Codex handles approximately 5 hours 50 minutes. The gap is noticeable, but the conclusion doesn't come down to one tool always being better. What matters more is this: a coding agent is useful not when it quickly writes code, but when it brings a task to a working state without unnecessary debugging loops. That's why for the practical part, they chose not a landing page or UI, but a measurable RAG pipeline for scientific papers.

  • Text extraction from PDF
  • Splitting articles into chunks
  • Generating embeddings and local index
  • Searching for relevant fragments by question
  • Answering only from found context or fallback

The conditions were identical for both tools: Python, PDF processing via PyMuPDF, independent choice of chunking strategy and vector storage, answer generation via llama-3.1-8b-instant, and a ban on hallucinations with weak evidentiary basis. For evaluation, they collected a dataset of five scientific papers and 100 questions with reference answers. This format is important because it removes subjectivity: here you can compare not the feeling about the code, but the quality of extraction, accuracy of answers, and how ready the agent delivers the result.

Where Claude won

By the author's experience, Claude Code feels like a more engaged partner. It starts working faster, more often leads the task to completion on its own, and puts fewer steps on the user. This aligned well with the experiment: Claude didn't just write files, but ran the pipeline end-to-end and made sure the script actually runs. Codex implemented the solution more slowly and on the first pass asked the user to install dependencies and check the run themselves, after which an error had to be fixed. For practical development, the difference between code is written and everything actually works turns out to be critical.

"Claude is a

Senior Developer who does the work with you, while Codex is a contractor."

This difference showed up in the final numbers too. With an LLM judge comparing the answers from both pipelines on correctness, completeness, relevance, and conciseness. Out of 100 questions, Claude Code's implementation won in 42 cases, Codex in 33, and 25 ended in a tie. The author attributes Claude's advantage not to model magic, but to a softer confidence threshold and possibly slightly higher generation temperature. Plus Claude has a noticeably shorter path to the first token in a new session, whereas Codex sometimes took almost a minute to start.

Where Codex is better

At the same time, Codex doesn't look like an outsider. On the contrary, in solution architecture it's often more neat. In the RAG case, Codex assembled more structured code: a pipeline class, centralized config, dataclass structures, argparse interface, and model consistency validation. Claude chose a flatter and faster implementation without such discipline. Technically both arrived at a similar search scheme, but details differed: Claude used ChromaDB and recursive character-level chunking with overlap, Codex used FAISS, sentence-based splitting, and three-level confidence scoring. For production code, such design might even be more important than winning in a single test run.

Another strong point of Codex is efficiency. According to the Morph breakdown cited in the article, Claude Code on comparable tasks spends 3.2–4.2 times more tokens. If these estimates are close to reality, Claude users will hit their subscription limits faster. But Anthropic has a stronger ecosystem offer around the product: the author's experience is better with an ecosystem of Claude Chat, Claude Code, and other services. There's also a pricing nuance: both have plans at $20 and $200 per month, but only Claude has an intermediate tier at $100. The tools' skills are generally compatible, but the community around Claude currently looks noticeably larger.

What does this mean

The main conclusion is simple: choosing between Claude Code and Codex based on a single number or someone else's thread on X is pointless. Claude currently looks stronger where long tasks, end-to-end completion, and ecosystem matter, while Codex is where code structure, token savings, and predictable engineering discipline are critical. With strictly prescribed requirements in AGENTS.md, the behavioral gap between them becomes smaller. It's better to check this on your own, short, and verifiable tasks.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…