MELT-1: how Metabolic AI tests agents for survival

Q: What is the source of this material?

The original publication is on Habr AI. Hamidun News processes and adapts materials with the help of AI.

Q: When was it published?

2026-05-17. Reading time: 2 min.

Habr published an article about MELT-1, a benchmark that measures not MMLU-style knowledge but how long an AI agent survives under distribution drift. Metabolic AI posted a 160

Hamidun News Editorial





MELT-1 is not MMLU, and it is not MMLU Pro. It is a new open benchmark that tests AI agents under real-world conditions. The question is not "what does the model know" but "how many hours will it survive when everything around it changes."

Three axes instead of one number

Conventional benchmarks (MMLU, ARC, GPQA) assume ideal conditions: static questions, stable data distribution. MELT-1 measures three things at once:

Compute economy: how much it costs to run an agent in production conditions (dollars per 1M successful solutions)
Survival under drift: how many hours the agent works without retraining before it starts making errors
Latency under stress: p99 sensor-to-actuator time at 40°C, over 30 consecutive days of inference, with 5 seeds and two temperature profiles
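To make the three axes concrete, here is a minimal Python sketch of how they could be computed from a log of agent attempts. It is not the MELT-1 harness: the `Step` record, the function names, the sliding-window size, and the 20% error threshold are all hypothetical choices made for illustration, and the benchmark's actual definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One logged task attempt (hypothetical record, not the MELT-1 format)."""
    hours: float        # time since deployment, in hours
    success: bool       # whether the attempt succeeded
    latency_ms: float   # sensor-to-actuator latency of the attempt
    cost_usd: float     # compute cost attributed to the attempt


def cost_per_million_successes(steps):
    """Total cost divided by successful attempts, scaled to 1M successes."""
    successes = sum(s.success for s in steps)
    if successes == 0:
        return float("inf")
    return sum(s.cost_usd for s in steps) / successes * 1_000_000


def survival_hours(steps, window=100, max_error_rate=0.2):
    """Hours until the error rate over a sliding window first exceeds
    the threshold (both parameters are assumptions)."""
    for end in range(window, len(steps) + 1):
        chunk = steps[end - window:end]
        errors = sum(not s.success for s in chunk)
        if errors / window > max_error_rate:
            return chunk[-1].hours
    # The threshold was never crossed: the agent survived the whole log.
    return steps[-1].hours if steps else 0.0


def p99_latency_ms(steps):
    """Nearest-rank 99th percentile of the logged latencies."""
    if not steps:
        raise ValueError("no steps logged")
    values = sorted(s.latency_ms for s in steps)
    rank = -(-99 * len(values) // 100)  # integer ceil(0.99 * n)
    return values[rank - 1]
```

In a layout like this, each function maps to one axis, so an agent's run can be summarized by three numbers instead of a single accuracy score.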

This is not a lab test. It is a scenario in which a real robot must work day and night, summer and winter.

Results: 1600× difference

On closed-loop manipulation (a robot grasps and stacks objects), Metabolic AI, a transformer-free architecture, outperformed a Llama-class 7B INT8 model by 9.4× on cost and by 8.5× on survival under drift. The composite difference is 1600×.
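The article does not say how the composite figure is formed. As a rough check, the short Python sketch below assumes the composite is a plain product of per-axis ratios (an assumption, not something the source states) and shows what the two quoted factors contribute on their own.

```python
# Figures quoted in the article.
cost_ratio = 9.4       # advantage on cost
survival_ratio = 8.5   # advantage on survival under drift
composite = 1600.0     # reported composite difference

# Assumption: the composite is a product of per-axis ratios.
two_axis_product = cost_ratio * survival_ratio
implied_remaining_factor = composite / two_axis_product

print(f"cost x survival = {two_axis_product:.1f}x")
print(f"implied remaining factor = {implied_remaining_factor:.1f}x")
```

Under that assumption, the two quoted factors account for about 80×, so roughly 20× would have to come from something else, presumably the latency axis. That last number is an inference from the arithmetic, not a result reported in the article.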

This is not because Llama is bad. It is because 7B transformers are designed for static knowledge retrieval, not for an embodied agent that has to stay up and running 24/7.

"Transformers die after 11 hours under drift," the authors write.

Openness as a standard

The Metabolic AI architecture is closed (a patent is under examination), but the benchmark is fully open: the harness, test scenes, oracle, sensitivity scripts, and the VAE drift encoder needed for reproduction. The methodology is published as a PDF and includes a section on threats to validity. The researchers invite others to run their own agents and post the results alongside.

This is how deep learning research should be done: closed IP, open benchmarks, reproducibility through code.

What this means

MELT-1 may become a new standard for robotics and embodied AI. MMLU shows whether a model is "smart." MELT-1 shows whether it is "viable."

