H Company released Holotron-12B — a model for agents with a 2x speed increase

Q: What is the source?

Originally published on Hugging Face Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

May 2, 2026. Reading time: 3 min.

Hamidun News Editorial

H Company has released Holotron-12B, a multimodal model for AI agents that operate user interfaces the way a human user would. It is built on NVIDIA's open Nemotron model and is aimed at high-throughput production use rather than impressive demos.

What it's designed for

Holotron-12B is positioned as a policy model for computer-use agents: systems that must see the screen, understand interface elements, choose the next action, and complete tasks end-to-end. Many multimodal models target static image recognition or standard image-based chat; Holotron-12B instead targets long sessions, chains of actions, and several screenshots held in context at once. In other words, it was designed as a working module for agentic systems rather than as a general-purpose assistant.

The developers at H Company fine-tuned the model on their own data mixture for UI element localization and navigation. The goal is for the agent to better understand buttons, input fields, page structures, and the relationship between visual context and action. Holotron-12B is already available on Hugging Face under the NVIDIA Open Model License, making it suitable as a foundation for web agents, internal automation tools, and online reinforcement learning pipelines.

Speed under load

The key bet in Holotron-12B is not just the quality of actions, but inference efficiency. The model is built on a hybrid SSM + attention architecture inherited from Nemotron. In essence, this is an attempt to solve the main problem of agentic workloads: long interaction histories, many high-resolution images, and dozens of parallel requests quickly hit memory and GPU bandwidth limits. With the SSM approach, state is stored more compactly than in a classic transformer with a large KV cache, so the model scales better in real-world scenarios.
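To make the memory argument above concrete, here is a rough Python sketch comparing how per-sequence memory grows for attention layers (KV cache) and for SSM layers (fixed-size state). All layer counts and dimensions are hypothetical placeholders chosen for illustration. They are not the published Holotron-12B or Nemotron configuration, and a hybrid model still keeps a KV cache for its attention layers.

```python
# Illustrative only: every size below is a hypothetical placeholder,
# not the actual Holotron-12B or Nemotron configuration.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Per-sequence KV cache size for a stack of attention layers.

    Keys and values are both cached (factor of 2), once per layer,
    per KV head, per token, so the size grows linearly with seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """Per-sequence recurrent state size for a stack of SSM layers.

    The state has a fixed shape, so it does not depend on sequence length.
    """
    return n_layers * d_inner * d_state * bytes_per_value


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / (1024 ** 2)


# Compare both quantities at a few context lengths (assumed values).
for seq_len in (4_096, 32_768, 131_072):
    kv = kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128)
    ssm = ssm_state_bytes(n_layers=40, d_inner=8_192, d_state=128)
    print(f"{seq_len:>7} tokens: KV cache {to_mib(kv):8.0f} MiB, "
          f"SSM state {to_mib(ssm):6.0f} MiB")
```

Under these assumed sizes, the KV cache grows linearly with context length while the SSM state stays constant, which is the effect the article attributes to the hybrid design.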

Tests were run on a single NVIDIA H100 via vLLM with SSM optimizations from version 0.14.1.
In real multimodal agent workloads, the model showed throughput more than 2x higher compared to Holo2-8B.
On the generation throughput chart, Holotron-12B achieved 149 tokens per second versus 69 for Holo2-8B.
At concurrency 100, total throughput increased to 8,900 tokens per second versus 5,100 for Holo2-8B.

For teams running large-scale data generation, annotation, or online RL pipelines, this is not a cosmetic improvement. If the model handles a larger batch load on the same hardware, the cost per agentic scenario drops and production deployment becomes easier. This is why H Company emphasizes not maximum model size, but the ability to serve long agentic sessions stably at high request concurrency.
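The throughput figures quoted above can be turned into ratios with a few lines of Python. This is a back-of-the-envelope sketch: the `relative_cost_per_token` helper is hypothetical and assumes both models run on identical hardware at the same hourly price, so that cost per token scales inversely with throughput.

```python
# Tokens-per-second figures as quoted in the article.
reported = {
    "generation throughput": {"Holotron-12B": 149, "Holo2-8B": 69},
    "total throughput at concurrency 100": {"Holotron-12B": 8_900, "Holo2-8B": 5_100},
}


def speedup(new_tps, old_tps):
    """Throughput ratio of the new model over the old one."""
    return new_tps / old_tps


def relative_cost_per_token(new_tps, old_tps):
    """Cost per token of the new model as a fraction of the old model's.

    Assumption: same GPU and same hourly price for both models.
    """
    return old_tps / new_tps


for name, tps in reported.items():
    s = speedup(tps["Holotron-12B"], tps["Holo2-8B"])
    c = relative_cost_per_token(tps["Holotron-12B"], tps["Holo2-8B"])
    print(f"{name}: {s:.2f}x throughput, {c:.0%} of the Holo2-8B cost per token")
```

The two pairs of numbers give roughly 2.16x and 1.75x, so the "more than 2x" claim corresponds to the generation throughput figure rather than to total throughput at concurrency 100.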

Training and benchmarks

Holotron-12B was trained in two stages. The base was the open multimodal model NVIDIA Nemotron-Nano-12B-v2-VL-BF16, after which H Company conducted supervised fine-tuning on a proprietary data mixture for localization and navigation. The developers specifically highlight the focus on screen understanding, grounding, and UI-level interactions — meaning the model's ability to not just describe the screen, but correctly bind an action to a specific interface element. The final checkpoint was trained on approximately 14 billion tokens.

Benchmark results look strong. On WebVoyager, the success rate rose from 35.1% for the base Nemotron model to 80.5% for Holotron-12B, slightly above the 80.2% of Holo2-8B. On GUI localization tasks, average accuracy increased to 74.2% versus 24.6% for the base version. The individual scores vary considerably: 49% on OSWorld-G, 66.1% on Showdown, 82% on GroundUI-1k, 83.8% on WebClick v1, and 89.9% on Screenspot V2. The improvement therefore covers not just one convenient benchmark, but several interface-understanding scenarios.
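As a quick consistency check on the numbers above, the following Python sketch recomputes the localization average and the gains over the base model. It assumes that the reported 74.2% is the unweighted mean of the five listed benchmarks. The article does not state this explicitly, but the arithmetic agrees with it.

```python
# Per-benchmark localization scores as quoted in the article (percent).
localization_scores = {
    "OSWorld-G": 49.0,
    "Showdown": 66.1,
    "GroundUI-1k": 82.0,
    "WebClick v1": 83.8,
    "Screenspot V2": 89.9,
}

# Assumption: the reported average is a simple unweighted mean.
average = sum(localization_scores.values()) / len(localization_scores)

# Gains over the base Nemotron model, in percentage points.
webvoyager_gain = 80.5 - 35.1
localization_gain = 74.2 - 24.6

print(f"Mean localization accuracy: {average:.1f}%")
print(f"WebVoyager gain: {webvoyager_gain:.1f} points")
print(f"Localization gain: {localization_gain:.1f} points")
```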

What this means

The AI agent market is gradually moving away from general-purpose VLMs toward more specialized models optimized for specific interface work and production economics. Holotron-12B is interesting precisely for this reason: it demonstrates that for computer-use systems today, what matters is not just benchmark percentages, but real throughput on a single GPU. For companies building browser or desktop agents, this is no longer a secondary metric — it is a baseline requirement for scaling.
