Kaggle Under Google DeepMind Launches Benchmarks SDK for Comparing Large AI Models
Kaggle is no longer just a data science competition platform. In 2026, moving under Google DeepMind's wing, the platform launched a Benchmarks section and an…
AI-processed from Habr AI; edited by Hamidun News
Kaggle, a platform that millions of data science professionals know as the main arena for machine learning competitions, is changing its identity. The slogan "Your Home for Data Science" has given way to "The World's AI Proving Ground" — and this is not just a marketing rebrand. In 2026, Kaggle officially came under the management of AI Frontier — a new division of Google DeepMind.
The new oversight brings a new strategic focus. Kaggle is no longer simply a place for prediction or image classification competitions; its mission is now the systematic evaluation of large language and multimodal models under standardized conditions.
The main technical update is a new Benchmarks section on the website and an open Kaggle Benchmarks SDK on GitHub. This is a full-fledged framework for creating, managing, and running test suites. The mechanics are simple: a researcher describes a test (input data, expected result, quality metric), combines several tests into a group, and that group becomes a benchmark.
The SDK takes care of running models under equal conditions and generates the result: logs, JSON, comparison tables, leaderboards. The system's flexibility allows you to implement almost any testing mechanics — from classical accuracy to complex multi-step tasks with reasoning evaluation. At the same time, benchmark data and code can be kept in private datasets, closed to public access.
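To make the mechanics concrete, here is a minimal sketch in plain Python of the idea described above: a test as input, expected result, and metric; a benchmark as a group of tests; and a run that scores several models under identical conditions and emits a JSON leaderboard. It does not use the Kaggle Benchmarks SDK, and every name in it (TestCase, Benchmark, exact_match, model_a, model_b) is hypothetical, chosen only for illustration. The "models" are stub functions standing in for real model calls.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


# NOTE: all names below are hypothetical; this is NOT the Kaggle Benchmarks SDK API.
@dataclass
class TestCase:
    prompt: str                           # input data
    expected: str                         # expected result
    metric: Callable[[str, str], float]   # quality metric: (output, expected) -> score


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the answers match, ignoring case and surrounding whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


@dataclass
class Benchmark:
    name: str
    tests: List[TestCase]

    def run(self, models: Dict[str, Callable[[str], str]]) -> List[dict]:
        """Score every model on the same tests and return rows sorted best-first."""
        if not self.tests:
            raise ValueError("a benchmark needs at least one test")
        rows = []
        for model_name, model in models.items():
            scores = [t.metric(model(t.prompt), t.expected) for t in self.tests]
            rows.append({"model": model_name, "score": sum(scores) / len(scores)})
        return sorted(rows, key=lambda row: row["score"], reverse=True)


# Stub "models": simple lookup tables standing in for real model calls.
def model_a(prompt: str) -> str:
    return {"2 + 2 = ?": "4", "Capital of France?": "Paris"}.get(prompt, "")


def model_b(prompt: str) -> str:
    return {"2 + 2 = ?": "4"}.get(prompt, "unknown")


benchmark = Benchmark(
    name="toy-qa",
    tests=[
        TestCase("2 + 2 = ?", "4", exact_match),
        TestCase("Capital of France?", "paris", exact_match),
    ],
)

leaderboard = benchmark.run({"model_a": model_a, "model_b": model_b})
print(json.dumps(leaderboard, indent=2))
```

In this toy run, model_a answers both questions and model_b only one, so model_a ranks first. A real harness would also have to handle logging, retries, and private data, which this sketch leaves out.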
Companies can create internal standards for model evaluation without revealing their methodology and test cases to competitors. If they choose, they can make the benchmark public, and it becomes a shared community standard.
Why is this important right now?
The problem of fair evaluation of AI models is acute. Popular public benchmarks such as MMLU, HumanEval, and GPQA are regularly criticized: their data leaks into training sets, so models effectively take the exam with a cheat sheet rather than demonstrate real abilities. Large labs build closed internal tests, but small teams and academic groups lack that infrastructure. Kaggle Benchmarks SDK makes this toolkit accessible.
Google DeepMind gains obvious advantages from the platform's transformation. Kaggle, with its multimillion-member community, becomes a venue for demonstrating DeepMind's own models against competitors' under conditions perceived as neutral.
The community also benefits: creating a fair, reproducible benchmark used to require serious engineering work, and now it is accessible through a standard SDK.
Nostalgia for the old Kaggle is understandable. The days when a well-tuned XGBoost beating a neural network on tabular data was a sensation are gone. The industry's task has shifted from "who predicts more accurately" to "how to objectively measure what a large model does". Kaggle is adapting to this shift and, judging by the scale of the changes, intends to become the standard for that measurement.