Arena: the AI model ranking that can't be gamed, yet is funded by those it judges
AI-processed from TechCrunch; edited by Hamidun News
The language model market has hundreds of participants, and each one calls itself the best. The question of who decides which model actually is the best turned out to be a business question rather than a philosophical one. Arena, formerly known as LM Arena, has become the main public judge for frontier LLMs and in seven months has gone from a university research project to a startup with real influence on the industry.
The project grew out of the work of graduate students at the University of California, Berkeley. The idea is simple: instead of trusting benchmarks that companies can tune to their own advantage, ask real people to blindly compare two responses from anonymous models and choose the better one. The Elo system, familiar from chess ratings, turns millions of such votes into a single rating.
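To make the mechanism concrete, here is a minimal sketch of how pairwise votes could be folded into Elo ratings. It is an illustration under stated assumptions, not Arena's actual pipeline: the article names only the Elo system, so the K-factor of 32, the starting rating of 1000, the function names, and the model names are all hypothetical.

```python
from collections import defaultdict

K = 32          # assumed update step (K-factor); not given in the article
START = 1000.0  # assumed starting rating for every model


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def apply_votes(votes, k=K, start=START):
    """Fold blind pairwise votes into ratings, one vote at a time.

    votes: iterable of (model_a, model_b, winner),
           where winner is "a", "b", or "tie".
    """
    ratings = defaultdict(lambda: start)
    actual = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (actual[winner] - e_a)
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical models; a single vote in which model_x is preferred.
    print(apply_votes([("model_x", "model_y", "a")]))
```

With these assumed constants, one vote between two equally rated models moves the winner to 1016 and the loser to 984. An upset against a much higher-rated model shifts more points than an expected win, which is how a large stream of individual votes settles into a stable ordering. Note that this sequential form depends on vote order; a production leaderboard may fit ratings differently.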
Manipulating the rating is extremely difficult: voters do not know which model they are voting for, and the sheer size of the sample neutralizes random outliers. The effect turned out to be unexpectedly powerful. A model's position in Arena began to influence how venture investors perceive it, when companies announce launches, and how the PR narrative around new releases is built.
Reaching the top of the ranking means getting independent confirmation of quality, one that cannot be disputed by pointing to internal tests. But the system has a structural paradox that raises uncomfortable questions. Arena is financed by the very companies it evaluates.
OpenAI, Anthropic, Google, Meta and other major players support the platform one way or another. This creates a potential conflict of interest: the independent judge receives money from those being judged. The project team insists that the methodology protects against sponsor influence: the anonymity of votes and the transparency of the data leave no entry points for manipulation.
Critics, however, point out that the very fact of financial dependence undermines trust, even if everything is technically honest. A separate question is what exactly Arena measures. The rating reflects user preferences in open-ended dialogue, not a model's ability to solve specialized tasks such as writing code, analyzing documents, or working with data.
A model that appeals to a wide audience in everyday conversations may fall behind competitors where accuracy matters. This doesn't make the rating useless; it honestly measures what it measures. But equating a position in Arena with overall model quality would be an oversimplification.
Nevertheless, over the past two years, Arena has become a reference point that the industry cannot ignore. Companies build marketing campaigns around high positions, researchers cite the rating in academic papers, and journalists use it as a quick reference when covering new launches. The influence is real, regardless of the methodological debates.
The history of Arena shows how rapidly informal institutions of power form in the AI industry. No one appointed this rating as a standard; it became one because it filled a vacuum. The market needed an independent assessment, and the first to offer a convincing mechanism gained disproportionate influence.
The question is how long this balance will hold as stakes grow and conflicts of interest become more apparent.