MolmoWeb-4B by Ai2: A Web Agent That Sees Websites Like Humans, Without HTML Parsing

Q: What is the source?

Originally published on MarkTechPost. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 30, 2026. Reading time: 3 min.


Hamidun News Editorial




Ai2 (Allen Institute for AI) has introduced MolmoWeb-4B, an open-source multimodal web agent that controls a browser exclusively using screenshots, without analyzing HTML.

Vision Instead of Parsing

Most web agents work with the DOM tree: they read a page's HTML, locate the elements they need, and interact with them programmatically. This approach tends to break on dynamic sites, Canvas interfaces, and JavaScript-heavy pages.

MolmoWeb takes a different approach. The model receives a screenshot of the current browser state and sees the page as an image—exactly as a human does. The agent's task: understand what's happening on the screen and decide what to do next. No HTML, no DOM selectors—only pixels and multimodal reasoning.

How the Pipeline Works

Under the hood, MolmoWeb-4B is a multimodal language model with 4 billion parameters. Loaded with 4-bit quantization, it fits on the T4 GPU of the free Google Colab tier, which matters for developers without expensive hardware.
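A rough back-of-the-envelope calculation shows why quantization matters here. The sketch below estimates memory for the weights alone; it ignores activations, caches, and runtime overhead, and the 15 GiB figure for usable T4 memory is an assumption, not a number from the source.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


PARAMS = 4e9          # "4 billion parameters", as stated in the article
T4_USABLE_GIB = 15    # assumption: approximate memory available on a Colab T4

for bits in (16, 8, 4):
    gib = weight_memory_gib(PARAMS, bits)
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB (T4 budget: ~{T4_USABLE_GIB} GiB)")
```

Under these assumptions, 16-bit weights take roughly half of the GPU before any image is processed, while 4-bit weights take under 2 GiB, leaving headroom for screenshots and generation.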

The agent's working cycle consists of five steps:

1. Capture a screenshot of the current browser state.
2. Pass the image to MolmoWeb-4B.
3. Let the model reason about the page state (chain-of-thought).
4. Predict the next action: a click, text input, or a scroll.
5. Execute the action and capture a new screenshot.
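The cycle above can be sketched as a plain loop. This is a minimal illustration, not the tutorial's code: the `capture`, `predict`, and `execute` callables, the dictionary-shaped actions, and the `"stop"` action type are all hypothetical names chosen for this example, and the model and browser are replaced with stubs.

```python
from typing import Callable


def run_agent(
    capture: Callable[[], bytes],
    predict: Callable[[bytes, str], dict],
    execute: Callable[[dict], None],
    task: str,
    max_steps: int = 10,
) -> list[dict]:
    """Run the screenshot -> model -> action loop and return the actions taken."""
    history: list[dict] = []
    for _ in range(max_steps):
        screenshot = capture()               # capture the current browser state
        action = predict(screenshot, task)   # model reasons, then predicts an action
        history.append(action)
        if action["type"] == "stop":         # hypothetical "task finished" signal
            break
        execute(action)                      # act; the next iteration re-captures
    return history


# Stubbed usage: a scripted "model" and no real browser.
script = iter([
    {"type": "click", "x": 640, "y": 120},
    {"type": "type", "text": "open source web agents"},
    {"type": "stop"},
])
history = run_agent(
    capture=lambda: b"fake-png-bytes",
    predict=lambda screenshot, task: next(script),
    execute=lambda action: None,
    task="search for open source web agents",
)
print([a["type"] for a in history])
```

Passing the three stages in as callables keeps the control flow testable without a GPU or a browser; the `max_steps` cap is a safeguard so a confused model cannot loop forever.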

The key idea of the prompt workflow is to force the model to reason explicitly before acting. The agent doesn't just 'see a button and click it'—it articulates what exactly it observes on the screen, explains why it should click right there, and only then generates coordinates or a command. This is an adaptation of chain-of-thought prompting for visual interface perception.
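One way to enforce "reason first, then act" in code is to reject any model response that lacks an explicit reasoning part. The sketch below assumes a hypothetical two-part text format, `Thought: ...` followed by `Action: name(args)`; the source does not specify MolmoWeb's actual output format, so treat the format and the `Step` type as illustrative only.

```python
import ast
import re
from dataclasses import dataclass


@dataclass
class Step:
    thought: str   # the model's stated reasoning
    name: str      # action name, e.g. "click"
    args: tuple    # literal arguments, e.g. (512, 88)


# Hypothetical format: "Thought: <reasoning>" then "Action: <name>(<args>)".
_STEP = re.compile(
    r"Thought:\s*(?P<thought>\S.*?)\s*Action:\s*(?P<name>\w+)\((?P<args>.*)\)\s*$",
    re.DOTALL,
)


def parse_step(text: str) -> Step:
    """Split a response into reasoning and action; fail if reasoning is missing."""
    match = _STEP.search(text)
    if match is None:
        raise ValueError("expected 'Thought: ...' followed by 'Action: name(args)'")
    raw_args = match.group("args").strip()
    try:
        # Parse the arguments as Python literals (numbers, quoted strings).
        args = ast.literal_eval(f"({raw_args},)") if raw_args else ()
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"could not parse action arguments: {raw_args!r}") from exc
    return Step(match.group("thought"), match.group("name"), args)


step = parse_step(
    "Thought: The search box is at the top of the page.\n"
    "Action: click(512, 88)"
)
print(step.name, step.args)
```

Using `ast.literal_eval` rather than `eval` means the arguments can only be plain literals, which is a sensible precaution when the text comes from a model.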

Open Access and Practice

MolmoWeb is released under an open Ai2 license, which means any developer can deploy their own web agent without dependence on paid APIs from OpenAI, Google, or Anthropic. The authors publish a complete tutorial: from setting up the environment in Colab and loading the model via Transformers to integrating with Playwright for browser control. The agent cycle is built from scratch—capturing a screenshot, passing it to the model, parsing the response, executing the action.
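The browser side of such a cycle can be sketched with Playwright's Python API, which exposes coordinate-based mouse and keyboard input, so no DOM selectors are needed. The action dictionaries and the `execute_action` and `demo` helpers below are hypothetical names for this example, not part of the published tutorial.

```python
def execute_action(page, action: dict) -> None:
    """Apply one predicted action to a Playwright page using pixel coordinates."""
    kind = action["type"]
    if kind == "click":
        page.mouse.click(action["x"], action["y"])
    elif kind == "type":
        page.keyboard.type(action["text"])
    elif kind == "scroll":
        page.mouse.wheel(0, action["dy"])   # vertical scroll by dy pixels
    else:
        raise ValueError(f"unsupported action: {kind!r}")


def demo(url: str) -> bytes:
    """Open a page, scroll once, and return a PNG screenshot as bytes."""
    # Imported lazily so execute_action can be exercised without a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(url)
        execute_action(page, {"type": "scroll", "dy": 400})
        png = page.screenshot()             # this is what the model would receive
        browser.close()
        return png
```

Calling `demo("https://example.com")` requires Playwright and a Chromium build to be installed. Fixing the viewport size is a deliberate choice: predicted coordinates only make sense if the screenshot and the live page share the same dimensions.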

Practical advantages:

Runs without API keys from external services
Requires no special site markup or browser plugins
Is not tied to a particular site or operating system
Runs on a Colab T4 in its 4-bit quantized form
Ships as a fully reproducible, openly available pipeline

Caveat: for now, this is a research tool. Speed (one step takes several seconds) and the accuracy of action prediction fall short of specialized agents with direct DOM access.

Context: The Browser Agents Race

Browser agents are one of the most active areas of AI development in 2025–2026. With Anthropic Computer Use, Google Project Mariner, and OpenAI Operator, the major players are all working on models that operate a computer on a person's behalf. MolmoWeb from Ai2 occupies its own niche: completely open, reproducible, and able to run on modest hardware. It is less a direct competitor to corporate products than a tool for researchers and developers who want to build agents independently.

What This Means

An open 4B-parameter browser agent that runs in Colab lowers the barrier to entry for web automation. Teams without corporate budgets get a working tool for experimenting with agents driven by vision rather than markup.

