MELT-1: how Metabolic AI tests agents for survival
Habr has published an article about MELT-1, a benchmark that measures not MMLU-style knowledge but how long an AI agent survives under distribution drift. Metabolic AI posted a 160

MELT-1 is not MMLU and not MMLU-Pro. It is a new open benchmark for testing AI agents under real conditions: the question is not "what does the model know" but "how many hours will it survive when everything around it changes."
Three axes instead of one number
Conventional benchmarks (MMLU, ARC, GPQA) assume ideal conditions: static questions and a stable data distribution. MELT-1 measures three things at once:
- Computation economy: how much it costs to keep an agent running in production conditions ($ per 1M successful solutions)
- Survival under drift: how many hours the agent works without retraining before it starts making errors
- Latency under stress: p99 sensor-to-actuator time at 40°C, measured over 30 consecutive days of inference with 5 seeds and two temperature profiles.
This is not a lab test. It is a scenario in which a real robot must work day and night, summer and winter.
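To make the three axes concrete, here is a minimal sketch of how such metrics could be computed from a log of task attempts. It is not the MELT-1 harness: the `Episode` record, the function names, the 50-episode window and the 90% success threshold are all assumptions made for illustration, and the article does not say how the benchmark defines the moment an agent "starts making errors."

```python
import math
from dataclasses import dataclass


@dataclass
class Episode:
    """One closed-loop task attempt (hypothetical log record, not the MELT-1 format)."""
    t_hours: float     # time since deployment, in hours
    success: bool      # whether the agent completed the task
    latency_ms: float  # sensor-to-actuator time for this attempt
    cost_usd: float    # compute cost attributed to this attempt


def cost_per_million_successes(episodes):
    """Total spend divided by the number of successes, scaled to 1M successes."""
    total_cost = sum(e.cost_usd for e in episodes)
    successes = sum(1 for e in episodes if e.success)
    if successes == 0:
        return math.inf
    return total_cost / successes * 1_000_000


def survival_hours(episodes, window=50, min_success_rate=0.9):
    """Timestamp at which the rolling success rate first falls below the threshold.

    ASSUMPTION: "failure" means the success rate over the last `window`
    episodes drops below `min_success_rate`. If that never happens, the
    timestamp of the last episode is returned.
    """
    ordered = sorted(episodes, key=lambda e: e.t_hours)
    for end in range(window, len(ordered) + 1):
        recent = ordered[end - window:end]
        rate = sum(e.success for e in recent) / window
        if rate < min_success_rate:
            return recent[-1].t_hours
    return ordered[-1].t_hours if ordered else 0.0


def p99_latency_ms(episodes):
    """99th-percentile latency using the nearest-rank method."""
    values = sorted(e.latency_ms for e in episodes)
    if not values:
        raise ValueError("no episodes to summarize")
    rank = -(-(99 * len(values)) // 100)  # integer ceil(0.99 * n)
    return values[rank - 1]
```

A full run under the protocol described above would repeat this per seed and per temperature profile before aggregating. How MELT-1 aggregates across seeds is not stated in the article.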
Results: 1600× difference
On closed-loop manipulation (a robot grasps and stacks objects), Metabolic AI, an architecture without a transformer, outperformed a Llama-class 7B INT8 model by 9.4× on cost and by 8.5× on survival under drift. The composite score: 1600×.
That is not because Llama is bad. It is because 7B transformers are designed for static knowledge retrieval, not for an embodied agent that has to stay hot 24/7.
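A quick arithmetic check helps when reading the headline number. The two ratios quoted for cost and survival multiply to roughly 80, not 1600. If the composite is the product of the per-axis ratios (an assumption, since the article does not say how the composite is computed), the remaining factor of about 20 would have to come from the third axis, latency, whose ratio the article does not report.

```python
# Ratios quoted in the article.
cost_ratio = 9.4
survival_ratio = 8.5
composite = 1600.0

# ASSUMPTION: the composite is the product of the three per-axis ratios.
# The article does not state the aggregation formula or the latency ratio.
two_axis_product = cost_ratio * survival_ratio
implied_third_axis_ratio = composite / two_axis_product

print(f"cost x survival = {two_axis_product:.1f}")
print(f"implied third-axis ratio = {implied_third_axis_ratio:.1f}")
```

Under a different aggregation rule the implied figure would change, so it is best treated as a consistency check rather than a reported result.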
"Transformers die after 11 hours under drift," the authors write.
Openness as a standard
The Metabolic AI architecture is closed (a patent is under examination), but the benchmark is fully open: the harness, test scenes, oracle, sensitivity scripts, and the drift VAE encoder needed for reproduction. The methodology is published as a PDF with a section on threats to validity. The researchers invite others to run their own agents and post the results alongside.
This is how science in deep learning should be done: closed IP, open benchmarks, reproducibility through code.
What this means
MELT-1 may become a new standard for robotics and embodied AI. MMLU shows whether a model is "smart." MELT-1 shows whether it is "viable."