
Databricks integrated GPT-5.5 into enterprise AI agents after a record score on OfficeQA Pro


Source: OpenAI Blog. Collage: Hamidun News.

Databricks announced on May 15, 2026, that it is opening GPT-5.5 for corporate agent scenarios. The occasion was the model's best result on OfficeQA Pro — the company's benchmark for heavy document work, where accurate results matter more than eloquent answers.

Why OfficeQA Pro Matters

OfficeQA Pro tests not a model's general knowledge but the entire workflow: can the model parse a document, extract the right numbers, find relevant passages, connect multiple sources, and provide an answer grounded in data? This is a pain point for corporate AI agents. Production systems more often break not because the model "cannot think," but because it gets confused in tables, loses a number in a scan, or misreads an old PDF.

In its technical report, Databricks describes OfficeQA Pro as a set of 133 questions based on a corpus of US Treasury bulletins spanning nearly 100 years, from 1939 to 2025. The corpus contains about 89 thousand pages and more than 26 million numerical values. Such a dataset closely resembles a real corporate environment: archives, long documents, poorly digitized tables, outdated formats, and data where a single wrong digit changes the agent's entire output.

GPT-5.5 Results

According to OpenAI's case study on Databricks, GPT-5.5 in agent test mode cut the error rate by 46% compared with GPT-5.4 and became the first model to exceed 50% accuracy on OfficeQA Pro.

In a separate release note for GPT-5.5, OpenAI provides a more precise measure — 54.1% on this benchmark.

Compared with previous results, this is a notable shift: in the March OfficeQA Pro report, frontier agents with direct corpus access averaged only 34.1%. Databricks specifically highlights that the strongest gains came in heavy parsing scenarios.
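To make the link between accuracy and error rate concrete, here is a minimal sketch that computes a relative error reduction from two accuracy figures. The function name is illustrative, and the inputs are the 34.1% March average and the 54.1% GPT-5.5 score cited above. Because this compares against the March frontier average rather than against GPT-5.4, it is not expected to reproduce the 46% figure from the case study.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction by which the error rate shrinks.

    Both arguments are accuracies expressed in percent (0-100).
    """
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    if old_error <= 0:
        raise ValueError("old accuracy must be below 100%")
    return (old_error - new_error) / old_error


# March frontier-agent average vs. the GPT-5.5 score reported by OpenAI.
reduction = relative_error_reduction(34.1, 54.1)
print(f"Relative error reduction: {reduction:.1%}")
```

Under these inputs the error rate falls from 65.9% to 45.9%, a relative reduction of roughly 30%.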

GPT-5.5 reads old documents and scanned PDFs better, extracts numbers more accurately, and falls into unnecessary search loops less often in multi-step tasks. According to the team, the model became more reliable both at extracting context and at orchestrating multiple steps without additional oversight.

"With Codex and 5.5, we got the best result among all agents and models," said Databricks research engineer Arnav Singhvi.

How It Is Being Deployed Now

Databricks is opening GPT-5.5 for customer scenarios through Unity AI Gateway. The model can be used within workflows built on Agent Bricks and Supervisor API.

According to Databricks documentation, Supervisor API removes some low-level orchestration from teams: a developer specifies the model, tools, and instructions in a single request, and the platform itself runs the agent loop, invokes tools, selects next steps, and assembles the final answer. In practice, this means that GPT-5.5 in Databricks is embedded not as a separate chat widget, but as a control layer above corporate data and specialized sub-agents.
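The agent loop described above can be sketched in a few lines. This is a generic illustration, not the Supervisor API: the names `stub_model`, `lookup_value`, `TOOLS`, and `run_agent` are hypothetical, the model is replaced by a hard-coded stub, and the lookup table holds a made-up placeholder value.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool with placeholder data; a real deployment would expose
# catalog functions, retrieval tools, or sub-agents instead.
def lookup_value(query: str) -> str:
    table = {"example_metric": "42"}  # made-up value for illustration only
    return table.get(query, "not found")

TOOLS: Dict[str, Callable[[str], str]] = {"lookup_value": lookup_value}

def stub_model(instructions: str, history: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Stand-in for a model call.

    Returns ("tool", "name:argument") to request a tool call,
    or ("final", answer) to finish.
    """
    if not history:
        return ("tool", "lookup_value:example_metric")
    last_result = history[-1][1]
    return ("final", f"The value is {last_result}.")

def run_agent(instructions: str, max_steps: int = 5) -> str:
    """Run the loop: ask the model, invoke tools, stop on a final answer."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        kind, payload = stub_model(instructions, history)
        if kind == "final":
            return payload
        name, _, arg = payload.partition(":")
        tool = TOOLS.get(name)
        result = tool(arg) if tool else f"unknown tool {name}"
        history.append((payload, result))
    return "Stopped: step limit reached."

answer = run_agent("Answer using the lookup tool.")
print(answer)
```

A managed service takes over exactly this part: the step limit, tool dispatch, and history bookkeeping that a team would otherwise write and maintain itself.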

Around the model, Databricks builds a typical enterprise workflow:

- a single connection point for models and agents through Unity AI Gateway
- observability, limits, fallback routes, and audit trails
- integration with Agent Bricks, MCP servers, Unity Catalog functions, and other tools
- access control so users see only permitted sources and sub-agents

Databricks still marks some of these components, including Unity AI Gateway and Supervisor API, as beta in its documentation. But the direction is clear: the model is evaluated not on its own, but as a component of a managed, verifiable, and secure corporate system.
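The fallback routes and audit trails mentioned here can be illustrated with a small sketch. It assumes nothing about how Unity AI Gateway is implemented; `call_with_fallback`, `primary_model`, and `backup_model` are hypothetical names, and the simulated timeout exists only to trigger the fallback path.

```python
from typing import Callable, Dict, List, Tuple

Route = Tuple[str, Callable[[str], str]]

def call_with_fallback(prompt: str, routes: List[Route], audit_log: List[Dict[str, str]]) -> str:
    """Try each route in order, recording every attempt in the audit log."""
    for name, handler in routes:
        try:
            response = handler(prompt)
        except Exception as exc:
            audit_log.append({"route": name, "status": "error", "detail": str(exc)})
            continue
        audit_log.append({"route": name, "status": "ok", "detail": ""})
        return response
    raise RuntimeError("all routes failed")

# Hypothetical endpoints: the primary always fails to simulate an outage.
def primary_model(prompt: str) -> str:
    raise TimeoutError("primary route timed out")

def backup_model(prompt: str) -> str:
    return f"backup answer to: {prompt}"

audit_log: List[Dict[str, str]] = []
result = call_with_fallback(
    "summarize the bulletin",
    [("primary", primary_model), ("backup", backup_model)],
    audit_log,
)
print(result)
print(audit_log)
```

Keeping the audit entry for the failed attempt, not only the successful one, is what makes the trail useful when someone later asks why a given answer came from the backup route.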

What It Means

Databricks demonstrates a pragmatic direction for corporate AI: the winner is not simply the most talkative model, but the one that reliably reads messy documents, does not lose numbers, and carries long work scenarios through without unnecessary errors. If GPT-5.5 maintains this level in production, it will be deployed not for demos, but for automating real document and analytical processes.

Hamidun News
AI news without the noise. A daily editorial selection from 400+ sources. A product of Zhemal Hamidun, Head of AI at Alpina Digital.