Positive Technologies Lists the Best Benchmarks for Evaluating LLMs in Cybersecurity
AI-processed from Habr AI; edited by Hamidun News
Positive Technologies has released a detailed breakdown of open benchmarks for evaluating large language models in cybersecurity tasks and reached a simple conclusion: testing LLMs solely on knowledge of terminology, standards, and CVEs has become almost pointless. Even comparatively small models consistently outperform humans in this area, while the real difference between systems emerges in tasks that require not remembering definitions but taking action: investigating incidents, solving CTF challenges, finding vulnerabilities, and writing patches. The review's author proposes dividing such tests into two classes.
The first is encyclopedic benchmarks, where the model answers questions on cryptography, network security, compliance, MITRE ATT&CK, CVE, and other topics. The second is skill-based, or action benchmarks, where the model is expected to deliver a practical result. The most illustrative example from the first group is CyberMetric.
It contains 10,000 questions across seven domains. Even older models such as gpt-3.5-turbo scored around 85% on it, while experienced specialists scored approximately 75%. In the author's assessment, the test is now useful mainly for small models, domain-specific quantization, and quick sanity checks.
SECURE is somewhat more demanding; it is assembled from materials on MITRE ATT&CK, CVE, CWE, and CISA. It checks not only factual knowledge but also the ability to assess risk, judge whether claims about specific vulnerabilities are correct, and calculate CVSS scores. The review rates AthenaBench, an updated version of the popular CTIBench for cyber threat intelligence tasks, higher still in practical value.
This benchmark checks whether a model can extract attack techniques, map CVEs to CWEs, forecast severity, and propose risk mitigation strategies. GPT-5 is named the leader there with a score of 66.1%, and enabling web search gave it additional gains in complex scenarios.
This is an important observation: even strong models need external context, and in applied security that mode is closer to a real analyst's work than a purely offline test. In the action category, the author highlights CyBench as one of the strongest open tests. It deploys full-fledged CTF tasks in an isolated environment and evaluates not only the final flag but also how close the agent came to the correct solution.
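The idea of crediting progress as well as the final flag can be illustrated with a small sketch. This is not CyBench's actual scoring code: the `TaskRun` record, the checkpoint list, and the rule "full credit once the flag is captured" are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRun:
    """One agent attempt at a CTF task (hypothetical record format)."""
    task_id: str
    flag_captured: bool
    # One boolean per intermediate checkpoint, True if the agent passed it.
    subtasks_passed: list[bool] = field(default_factory=list)


def progress_score(run: TaskRun) -> float:
    """Fraction of checkpoints passed; full credit if the flag was captured."""
    if run.flag_captured:
        return 1.0
    if not run.subtasks_passed:
        return 0.0
    return sum(run.subtasks_passed) / len(run.subtasks_passed)


def summarize(runs: list[TaskRun]) -> dict[str, float]:
    """Report the strict solve rate alongside the average partial progress."""
    if not runs:
        return {"solve_rate": 0.0, "mean_progress": 0.0}
    solved = sum(r.flag_captured for r in runs)
    return {
        "solve_rate": solved / len(runs),
        "mean_progress": sum(progress_score(r) for r in runs) / len(runs),
    }


# Invented sample data: one solved task, one half-finished, one with no progress.
runs = [
    TaskRun("web-01", True, [True, True, True]),
    TaskRun("pwn-02", False, [True, True, False, False]),
    TaskRun("crypto-03", False, [False, False]),
]
report = summarize(runs)
print(report)
```

Reporting both numbers side by side shows why partial credit matters: two agents with the same solve rate can differ noticeably in how far they get on the tasks they fail.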
On the open leaderboard at the time of the review, Claude Opus 4.6 led with 93%, followed by Claude 4.5 Sonnet and Grok 4.
The absolute result matters, but so does the pace of progress: over just a few model generations, the share of solved tasks grew from approximately 20% to more than 80%. This is no longer a demonstration of general capabilities but a signal that agentic LLMs are entering the zone of practical utility for offensive and research scenarios. To assess applied usefulness in vulnerability work, the author separately recommends BountyBench.
It measures tasks by their potential value on bug bounty platforms: the model must find a vulnerability, build an exploit, or write a patch, and the researchers also track the cost of each run in tokens. The results suggest that patching is easier for LLMs than detecting the vulnerability in the first place. Even closer to real-world defense is ExCyTIn-Bench, where an agent gets access to logs and investigates an attack step by step through SQL queries.
The leaders there are Claude Opus 4.5, GPT-5.1, and GPT-5, but something else is more important: agent architecture and patterns like ReAct significantly boost results even for weaker models.
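A ReAct-style investigation loop over logs can be sketched in a few lines. Everything here is an assumption for illustration: the `auth_logs` table and its rows are invented, and `scripted_policy` is a hard-coded stand-in for the LLM call, so this is not the ExCyTIn-Bench harness or its schema.

```python
import sqlite3

# Toy log table; the schema and rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_logs (ts TEXT, username TEXT, src_ip TEXT, result TEXT)"
)
conn.executemany(
    "INSERT INTO auth_logs VALUES (?, ?, ?, ?)",
    [
        ("2025-01-01T10:00", "alice", "10.0.0.5", "fail"),
        ("2025-01-01T10:01", "alice", "10.0.0.5", "fail"),
        ("2025-01-01T10:02", "alice", "10.0.0.5", "success"),
        ("2025-01-01T10:03", "bob", "10.0.0.9", "success"),
    ],
)


def scripted_policy(history):
    """Stand-in for an LLM: returns (thought, action, payload).

    A real agent would build a prompt from `history` and parse the model's
    reply; here the reasoning steps are hard-coded.
    """
    if not history:
        query = (
            "SELECT src_ip, COUNT(*) AS n FROM auth_logs "
            "WHERE result = ? GROUP BY src_ip ORDER BY n DESC"
        )
        return ("Find sources with failed logins.", "sql", (query, ("fail",)))
    if len(history) == 1:
        top_ip = history[0]["observation"][0][0]
        query = "SELECT username FROM auth_logs WHERE src_ip = ? AND result = ?"
        return ("Did that source ever succeed?", "sql", (query, (top_ip, "success")))
    return ("Enough evidence.", "answer", history[1]["observation"][0][0])


def react_loop(policy, conn, max_steps=5):
    """Alternate reasoning (policy) and acting (SQL) until an answer is given."""
    history = []
    for _ in range(max_steps):
        thought, action, payload = policy(history)
        if action == "answer":
            return payload, history
        query, params = payload
        observation = conn.execute(query, params).fetchall()
        history.append(
            {"thought": thought, "query": query, "observation": observation}
        )
    return None, history  # step budget exhausted without an answer


answer, trace = react_loop(scripted_policy, conn)
print(answer, len(trace))
```

The loop itself is independent of the model: swapping `scripted_policy` for a function that calls an LLM changes the quality of the thoughts and queries but not the structure of the investigation.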
In other words, in SOC tasks, much depends not only on the base model but also on how the working loop is constructed around it. At the same time, the review does not attempt to portray the market as a neat and mature system. On the contrary, one of the main criticisms is chaos in the benchmark landscape itself.
Some datasets quickly become outdated, others are too tied to a specific language or audience, like SecBench with a strong Chinese bias, while still others suffer from weak preparation of source materials. An example of such a questionable approach is CyberSOCEval: as a full benchmark, it looks unconvincing, although the part with real sandbox traces of malware can be useful as a dataset for EDR, antivirus, and analytical teams. The practical conclusion from the review is this: if you need to quickly and clearly compare LLMs for cybersecurity, the minimum set should be assembled from CyberMetric and AthenaBench to verify knowledge, CyBench and ExCyTIn-Bench to assess practical skills, and BountyBench when economic effect is important.
The main shift in perspective has already happened: the question is no longer whether the model knows basic things from the textbook, but how well it can work in an environment with noisy logs, multi-step attacks, ambiguous data, and costly errors. That is where the real value of LLMs for cybersecurity will be determined.