Beeline Cloud compiled unusual AI benchmarks: from escape rooms to playing "human"

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Classical LLM benchmarks mostly measure factual knowledge and the ability to solve template problems, and they say less and less about how a model behaves in real-world conditions. That is why researchers and companies keep inventing strange, almost game-like tests: from getting out of an escape room to convincingly playing a human among other bots.

Escape Rooms and Traps

One of the most illustrative examples is the benchmark by engineer Jaemin Ha. It places models in a text-based escape room: they receive a description of the space, the available objects, and a task that has to be solved within the constraints of the physical world. For example, extracting a ping-pong ball from a narrow tube or pulling a jar with a password out of a tight opening.

This format targets not encyclopedic knowledge but the ability to account for context, object properties, and the order of actions. The catch is that useful items sit alongside distractors. The model must not just produce elegant reasoning but also tell a working tool from junk.

In tests, GPT-4 and Claude 3.5 Haiku sometimes grasped the idea of the solution but got lost in the details: they reached for an unnecessary ruler, put steps in the wrong order, or added actions that weren't needed. It is a good example of how LLMs stumble not on logic as such but on applied logic.

Attacks and Design

Another direction is security. The SCAM benchmark from 1Password doesn't ask the model whether an email looks like phishing. Instead it simulates real working conditions: incoming emails, suspicious links, fake login pages, and social engineering. In one illustrative example, Gemini 2.5 Flash hands over a password to a fake site in ten seconds. For the authors, this matters more than any academic metric: an agent must not only classify a threat but also avoid falling for it when it acts.

Taken together, the benchmarks in this selection probe:

  • physical reasoning in a constrained space
  • resistance to phishing and prompt injections
  • quality of interfaces and the resulting user experience
  • model behavior in a group where it needs to appear human

SCAM includes 30 scenarios across nine threat categories. The leaders of the February ranking, Claude Opus 4.6 and GPT-5.2, recognized dangerous situations 92% and 81% of the time, respectively. With a hardening system prompt added, their scores rose to 98% and 97%.

A completely different kind of test is Design Arena, where models compete at building interfaces, games, and visualizations. People vote blind, without knowing which model produced which result, and the votes feed an Elo rating. What gets tested here is not a single correct answer but the quality of the finished product. This approach works well where formal metrics fail.
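For readers unfamiliar with Elo, the way a single blind vote shifts two ratings can be sketched as follows. This is a minimal illustration, not Design Arena's actual implementation: the K-factor of 32, the 400-point scale, and the function names are assumptions taken from the textbook chess formulation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one blind pairwise vote.

    k is an assumed K-factor; the real platform may use a different value.
    """
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start level; a voter prefers the first model's build.
new_a, new_b = update_elo(1500.0, 1500.0, a_won=True)
print(new_a, new_b)
```

An upset moves the ratings more than an expected win does, because the update is proportional to how surprising the vote was.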

In one tournament, models were asked to create a browser-based alien shooter: one model's build didn't launch at all, while another produced a full game with difficulty progression and upgrades. Later, researchers used the platform to cross-check the results of their own OpenDesign benchmark against the community ratings and found about 60–80% agreement. That is not perfect accuracy, but it is useful calibration for tasks where taste and convenience can't be reduced to a single number.
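The article does not say how that agreement was computed. One plausible reading, shown here purely as an assumption, is pairwise agreement: for every pair of models, check whether the benchmark and the community rating put them in the same order. The function name and the sample scores below are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(benchmark_scores: dict, community_scores: dict) -> float:
    """Share of model pairs that both score sources rank in the same order.

    Pairs tied in either source are skipped. Returns NaN if no pair is comparable.
    """
    shared = sorted(set(benchmark_scores) & set(community_scores))
    agree = total = 0
    for a, b in combinations(shared, 2):
        bench_diff = benchmark_scores[a] - benchmark_scores[b]
        comm_diff = community_scores[a] - community_scores[b]
        if bench_diff == 0 or comm_diff == 0:
            continue  # a tie carries no ordering information
        total += 1
        if (bench_diff > 0) == (comm_diff > 0):
            agree += 1
    return agree / total if total else float("nan")


# Hypothetical scores, invented for illustration only.
benchmark = {"model_a": 3.0, "model_b": 2.0, "model_c": 1.0}
community = {"model_a": 1500, "model_b": 1400, "model_c": 1450}
print(pairwise_agreement(benchmark, community))
```

In this toy data the two sources agree on two of the three pairs and disagree on how model_b compares with model_c.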

Masquerading as Human

There are also purely experimental formats. In one social game, twenty-one language models took turns trying to work out which of the participants was human, even though there were no living people in the room at all. Each session seated six randomly selected models, and the winners were the last two left after the others had been voted out. The result is not a typical knowledge benchmark but a test of social adaptation, communication style, and the ability not to give away one's machine nature.

Each AI system tried to prove that it was the one creature of flesh and blood.

Claude Sonnet 4.5 performed best in this strange tournament: it won in 53% of rounds. Next was Gemini 2.0 Flash at 49.2%, and Claude 3 Haiku ended up at the bottom of the table with 6.7%. Researchers even asked Gemini 2.5 Pro to analyze opponents' answers and suggest how to more effectively masquerade as human. The advice worked for some: GPT-4o saw noticeable gains, winning roughly 12% more often, while Claude 3 Haiku's results declined. The conclusion is uncomfortable for the industry: a model can sound convincing but still behave unnaturally in live dialogue.
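The tournament's bookkeeping (six seats drawn from twenty-one models, eliminations until two remain, win rate per model) can be sketched as below. This is an assumption-heavy toy, not the researchers' harness: the votes are random placeholders where a real run would query each model, the one-elimination-per-round rule and random tie-break are guesses, and the model names are hypothetical.

```python
import random
from collections import Counter


def play_session(models, rng, seats=6, finalists=2):
    """Seat `seats` random models and vote one out per round until `finalists` remain."""
    seated = rng.sample(models, seats)
    players = list(seated)
    while len(players) > finalists:
        votes = Counter()
        for voter in players:
            # Placeholder vote: a real run would ask the model whom it suspects.
            others = [p for p in players if p != voter]
            votes[rng.choice(others)] += 1
        top = max(votes.values())
        most_voted = [p for p in players if votes[p] == top]
        players.remove(rng.choice(most_voted))  # random tie-break (assumption)
    return seated, players


def win_rates(models, sessions, seed=0):
    """Win rate per model: sessions survived to the end / sessions played."""
    rng = random.Random(seed)
    played, wins = Counter(), Counter()
    for _ in range(sessions):
        seated, winners = play_session(models, rng)
        played.update(seated)
        wins.update(winners)
    return {m: wins[m] / played[m] for m in models if played[m]}


models = [f"model_{i}" for i in range(21)]  # hypothetical names
rates = win_rates(models, sessions=500)
print(sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:3])
```

With purely random votes every model is expected to win about two sessions in six, so that figure is a natural baseline for reading reported win rates.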

What This Means

Unusual benchmarks are useful because they test LLMs where classical tests are silent: in environments with physical constraints, threats, subjective evaluation, and social pressure. They are not flawless either: the task sets themselves sometimes contain ambiguous wording and debatable answers. That is why the best approach is not to look for one ultimate test but to assemble a set of checks specific to the product and observe model behavior in several modes at once.
