ProjDevBench: Can AI Build Full-Fledged Software from Scratch?

Khamidun Zhemal

AI monitoring · Jiqizhixin (机器之心)

Feb 10, 2026 · 2 min

AI-processed from Jiqizhixin (机器之心); edited by Hamidun News

When we talk about artificial intelligence in software development, we usually think of examples like ChatGPT fixing a bug in a function within minutes, or Claude generating elegant code for a simple algorithm. But what happens if we ask an AI agent to design and build a complete application from scratch, including the architecture, dependency management, and component integration? Researchers from leading laboratories have been quietly working on this question and created ProjDevBench, a benchmark that exposes the real capabilities and limitations of current AI models acting as full-fledged software engineers. The results call for a second look at optimistic forecasts that automation will soon replace developers.

ProjDevBench differs fundamentally from earlier AI coding benchmarks. Where previous research checked whether a model could write a single function or solve a LeetCode problem, the new benchmark gives the AI a realistic task: create a finished product from scratch. Agents must not merely generate code but also make architectural decisions, break the project into modules, manage dependencies, write tests, and integrate everything into a working product. The task is not a set of isolated functions; it simulates real development, where every decision constrains the next and errors accumulate, complicating the entire system.

The structure of ProjDevBench itself reflects real challenges in software engineering. Agents are given specifications for projects of varying complexity: from simple utilities to applications with multiple layers of logic, databases, and external APIs. Models must understand requirements, plan code structure, select appropriate technologies and libraries, manage conflicts between components, and ensure functionality. This is quite similar to what a junior developer does on their first serious task, except without the ability to ask senior colleagues for advice or to have their pull requests reviewed.

The results were sobering even for optimists. Modern LLM agents powered by leading models such as GPT-4 and Claude do show progress over previous generations: they can competently break a project into modules, choose a sound architecture, and write functional code. But problems emerge quickly. Agents forget about dependencies between components and generate code that works in isolation but breaks during integration. They manage system state poorly and often cannot track how a change in one module affects others. Code quality also degrades as complexity grows: agents start duplicating logic instead of refactoring, turning a simple project into a tangled mess.
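To make the "works in isolation, breaks during integration" failure concrete, here is a minimal sketch in Python. It is not taken from ProjDevBench; the two modules, their names, and the record layout are hypothetical and chosen only to illustrate a mismatched interface between components.

```python
def load_user(user_id: int) -> dict:
    """Hypothetical storage module: returns a record keyed by 'user_id'."""
    return {"user_id": user_id, "name": "Ada"}


def format_greeting(user: dict) -> str:
    """Hypothetical presentation module: expects a record keyed by 'id'."""
    return f"Hello, {user['name']} (#{user['id']})"


def unit_checks_pass() -> bool:
    """Each module passes when exercised alone with hand-made inputs."""
    storage_ok = load_user(7) == {"user_id": 7, "name": "Ada"}
    greeting_ok = format_greeting({"id": 7, "name": "Ada"}) == "Hello, Ada (#7)"
    return storage_ok and greeting_ok


def integration_check_passes() -> bool:
    """Wiring the modules together exposes the mismatched key name."""
    try:
        format_greeting(load_user(7))
    except KeyError:
        return False
    return True
```

Here `unit_checks_pass()` succeeds while `integration_check_passes()` fails, because the two modules never agreed on the name of the identifier field. A shared record type or a single end-to-end test would surface the problem early.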

Moreover, AI agents perform poorly at project-level debugging. When something goes wrong, models often lose track of cause and effect and start changing random parts of the code instead of analyzing the problem logically. Testing, which should be an integral part of development, often becomes a formality: agents write tests that pass because they essentially repeat the logic of the code they are supposed to check.
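The testing problem can be illustrated with a short Python sketch. The function and its deliberate bug are hypothetical and not drawn from the benchmark; the point is only to contrast a test that mirrors the implementation with one derived from the requirement.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test; contains a deliberate bug."""
    # Bug: subtracts the raw percent value instead of a percentage of the price.
    return price - percent


def tautological_test() -> bool:
    """Recomputes the expected value with the same formula as the code."""
    price, percent = 200.0, 10.0
    expected = price - percent  # mirrors the implementation, bug included
    return apply_discount(price, percent) == expected


def independent_test() -> bool:
    """Checks against a value worked out by hand from the requirement."""
    # Requirement: 10% off 200 should be 180.
    return apply_discount(200.0, 10.0) == 180.0
```

The tautological test passes even though the function is wrong, since both sides compute `200 - 10`. The independent test fails and exposes the bug, because its expected value comes from the specification rather than from the code.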

These results do not mean AI is useless for development. They reveal a real gap between code generation and software engineering. The first is arithmetic; the second is an art. ProjDevBench underscores that the path to fully autonomous AI developers is still long. The future probably belongs to hybrid tools: AI assistants that generate code and propose solutions, but under the control of an experienced engineer ready to think strategically and see the whole picture.
