Google Presents Auto-Diagnose, an AI System for Finding the Causes of Integration Test Failures
AI-processed from MarkTechPost; edited by Hamidun News
Google unveiled Auto-Diagnose, an internal LLM-based system that parses the logs of failed integration tests, automatically extracts the key lines, and publishes a diagnosis directly in code review. For large engineering teams, it is an attempt to eliminate one of the most expensive hidden costs of development: the hours, and sometimes days, spent manually searching dozens of log files for the cause of a failure.
Google's problem is measurable. In an internal survey of 6,059 developers, diagnosing integration test failures ranked among the five most frequent complaints about engineering tools. A follow-up survey of 116 engineers showed that 38.4% of such failures took longer than an hour to diagnose and 8.9% took longer than a day. For unit tests, the figures were 2.7% and 0% respectively.
The reason is clear: an integration test almost never fails in one obvious place. A typical case involves a separate test driver, a set of services within the system under test, logs spread across different components, and a mass of warnings and errors unrelated to the root cause. In Google's study, the median failing test involved 16 log files and 2,801 log lines.
Auto-Diagnose is built into the existing development workflow. When an integration test fails, the system automatically receives an event, collects logs at INFO level and above from the test driver and the components of the system under test (SUT), consolidates them into a single stream, and sorts them by time. This stream, together with component metadata, is then sent to Gemini 2.5 Flash. The model is not fine-tuned on Google-specific logs: the bet is on a hard-coded prompt and on integration into the workflow rather than on fine-tuning. The prompt forces the model to follow explicit steps: find the relevant log sections, identify the component where the failure occurred, verify the context, and only then formulate a conclusion.
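The collection step described above (filter by level, merge, sort by time) can be sketched in a few lines of Python. This is only an illustration, not Google's implementation: the `LogLine` structure, the `SEVERITY` ranking, the `merge_logs` helper, and the sample log lines are all hypothetical names and data invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain
from typing import Iterable

# Assumed severity ranking; the article only says "INFO level and above".
SEVERITY = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


@dataclass(frozen=True)
class LogLine:
    timestamp: datetime
    component: str  # e.g. the test driver or one SUT service
    level: str
    message: str


def merge_logs(streams: Iterable[Iterable[LogLine]], min_level: str = "INFO") -> list[LogLine]:
    """Drop lines below min_level, then merge all components into one timeline."""
    threshold = SEVERITY[min_level]
    kept = (
        line
        for line in chain.from_iterable(streams)
        # Lines with an unknown level rank as 0 and are dropped.
        if SEVERITY.get(line.level, 0) >= threshold
    )
    # sorted() is stable, so lines with equal timestamps keep their input order.
    return sorted(kept, key=lambda line: line.timestamp)


# Hypothetical sample data, invented for illustration only.
driver_log = [
    LogLine(datetime(2025, 1, 1, 12, 0, 2), "test_driver", "ERROR", "request timed out"),
]
backend_log = [
    LogLine(datetime(2025, 1, 1, 12, 0, 0), "backend", "DEBUG", "loading config"),
    LogLine(datetime(2025, 1, 1, 12, 0, 1), "backend", "ERROR", "failed to start"),
]

merged = merge_logs([driver_log, backend_log])
for line in merged:
    print(line.timestamp.isoformat(), line.component, line.level, line.message)
```

In this sketch the backend error sorts ahead of the driver timeout, which is the point of a single time-ordered stream: the earliest relevant failure is read first, regardless of which file it came from.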
The key point is a prohibition on guessing. If the logs contain no lines from the exact component that failed to start or become healthy, the model must not speculate and should state directly that the data is insufficient. The response is then converted to a standard format and published in Critique, Google's internal code review system, where the developer immediately sees the conclusion, the investigation steps, and the most relevant log lines.
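A step-by-step prompt with an explicit refusal rule might look roughly like the sketch below. The wording is illustrative and is not Google's actual prompt; the `INSUFFICIENT_DATA` sentinel and the `build_prompt` and `is_refusal` helpers are assumptions made for the example, and no model API is called.

```python
# Hypothetical sentinel the model is asked to return instead of guessing.
INSUFFICIENT = "INSUFFICIENT_DATA"

# The four steps named in the article, phrased as instructions.
STEPS = [
    "Find the log sections relevant to the failure.",
    "Identify the component where the failure occurred.",
    "Verify the surrounding context.",
    "Only then formulate a conclusion.",
]


def build_prompt(merged_logs: str, component_metadata: str) -> str:
    """Assemble a fixed prompt: ordered steps, a no-guessing rule, then the data."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(STEPS, start=1))
    return (
        "You are diagnosing a failed integration test.\n"
        f"Follow these steps in order:\n{numbered}\n"
        "Do not guess. If the logs contain no lines from the component that "
        f"failed to start or become healthy, answer exactly {INSUFFICIENT}.\n\n"
        f"Component metadata:\n{component_metadata}\n\n"
        f"Logs:\n{merged_logs}\n"
    )


def is_refusal(response: str) -> bool:
    """True when the model declined to diagnose because the logs were incomplete."""
    return response.strip() == INSUFFICIENT


# Hypothetical inputs, invented for illustration only.
prompt = build_prompt(
    merged_logs="2025-01-01T12:00:01 backend ERROR failed to start",
    component_metadata="backend: SUT service\ntest_driver: test driver",
)
print(prompt)
```

Treating the refusal as a distinct, machine-checkable answer is a design choice in this sketch: it lets the caller tell "the logs were incomplete" apart from an actual diagnosis and handle the two differently.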
By the numbers, the system looks less like a lab prototype and more like a field-tested internal tool. In a manual evaluation of 71 real failures from 39 teams, Auto-Diagnose correctly identified the root cause in 64 cases, an accuracy of 90.14%.
Google then rolled it out to all integration test failures on code changes across the company, starting in May 2025. Since then, the system has covered 52,635 unique tests, 224,782 runs, 91,130 code changes, and 22,962 authors. The median time to publish a diagnosis in code review was 56 seconds and the 90th percentile was 346 seconds, so the result usually appears before the engineer has fully switched to another task.
On average, one run consumes 110,617 input tokens and generates 5,962 output tokens. Feedback is also positive: across 517 reviews from 437 developers, the share of "Not helpful" ratings was 5.8%, below Google's internal 10% threshold for such tools, and Auto-Diagnose ranked 14th in helpfulness among the 370 systems that publish findings in Critique.
There is also an important side benefit. The seven incorrect diagnoses in the manual evaluation turned out to stem not from the model itself but from problems in the logging infrastructure: in some cases test driver logs were not saved after a crash, and in others logs from the failed component were missing. Similar "we need more data" responses later helped identify around 20 more infrastructure problems.
The main significance of Auto-Diagnose, then, is not only that Google is speeding up the investigation of test failures. The company is demonstrating a more practical pattern for using LLMs in development: rather than asking the model to fix code blindly, embed it at a narrow point in the process, give it strict rules against speculation, and return the results to where the engineer is already working. For large teams, this may be a more valuable scenario than yet another "AI coding assistant," because it shortens the time needed to understand the cause of a failure, and that is what most often delays a release.