Google released Gemma 4 on Hugging Face: multimodal models for local inference
AI-processed from Hugging Face Blog; edited by Hamidun News
Google DeepMind has unveiled the Gemma 4 family on Hugging Face, emphasizing not maximum model size, but a combination of power, multimodality, and the ability to run the model locally. The lineup includes four versions: from compact E2B and E4B for edge scenarios to 26B A4B and 31B for heavier tasks on workstations and server hardware.
What versions were released
The release took place on April 2, 2026. Hugging Face reports that Gemma 4 is available in both base and instruction variants, with the entire lineup distributed under the Apache 2.0 license. The two smaller models have a 128K context window; the two larger ones have 256K. Google and Hugging Face present the series not merely as chat models, but as a foundation for agentic scenarios, local assistants, and multimodal applications that need to handle text, images, video, and, in some configurations, audio.
- Gemma 4 E2B — effective 2.3B, approximately 5.1B with embeddings, 128K context
- Gemma 4 E4B — effective 4.5B, approximately 8B with embeddings, 128K context
- Gemma 4 26B A4B — MoE model with 26B total parameters and approximately 4B active, 256K context
- Gemma 4 31B — dense 31B model with 256K context
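For a rough sense of what these sizes mean on local hardware, the sketch below estimates weight memory as parameter count times bytes per parameter. The parameter counts are the ones listed above; the storage precisions are assumptions chosen for illustration, and the figures ignore KV cache, activations, and runtime overhead, so actual usage will be higher.

```python
# Parameter counts in billions, taken from the list above.
# E2B and E4B use the "with embeddings" figure; the MoE model uses total
# parameters, on the assumption that all experts are kept in memory even
# though only about 4B are active per token.
MODELS = {
    "E2B": 5.1,
    "E4B": 8.0,
    "26B A4B": 26.0,
    "31B": 31.0,
}

# Assumed storage precisions (bytes per parameter); illustrative only.
BYTES_PER_PARAM = {"16-bit": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param


def estimate_all() -> dict:
    """Return {model: {precision: GB}} for every model and precision."""
    return {
        name: {
            precision: weight_gb(size, nbytes)
            for precision, nbytes in BYTES_PER_PARAM.items()
        }
        for name, size in MODELS.items()
    }


if __name__ == "__main__":
    for name, row in estimate_all().items():
        cells = ", ".join(f"{p}: {gb:.1f} GB" for p, gb in row.items())
        print(f"{name:8s} {cells}")
```

Whether the embedding parameters of the smaller models must sit in accelerator memory depends on the runtime, so treat the E2B and E4B rows as upper bounds for weights alone.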
According to Google, the 31B model ranked third among open models in Arena AI's text ranking at the time of announcement, while 26B A4B ranked sixth. For a series designed in part for local deployment, this is a strong statement: Google is attempting to compete not only in the cloud with Gemini, but also in the open-model segment, where the balance of quality, speed, memory, production stability, and deployment flexibility matters.
What Gemma 4 can do
The Hugging Face blog emphasizes practical multimodal tests. The models can handle OCR, speech recognition, object detection, and coordinate identification in images. In one example, Gemma 4 finds a UI element in a screenshot from a plain-text query and immediately returns bounding boxes in JSON without additional format delimiters. For developers, this is useful: less boilerplate around the model and simpler assembly of visual agents and interface assistants.
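To illustrate why delimiter-free JSON reduces boilerplate, here is a minimal sketch of consuming such a response. The schema is an assumption, not taken from the source: each item is taken to carry a "label" and a "box_2d" of [y_min, x_min, y_max, x_max] on a 0 to 1000 normalized grid. The function name and the sample string are hypothetical.

```python
import json


def parse_boxes(raw: str, width: int, height: int) -> list:
    """Parse a JSON list of detections and scale boxes to pixel coordinates.

    Assumed (hypothetical) schema: each item has "label" and "box_2d" given
    as [y_min, x_min, y_max, x_max] on a 0-1000 normalized grid.
    """
    detections = json.loads(raw)  # no code-fence stripping needed
    boxes = []
    for det in detections:
        y0, x0, y1, x1 = det["box_2d"]
        boxes.append({
            "label": det["label"],
            "left": round(x0 * width / 1000),
            "top": round(y0 * height / 1000),
            "right": round(x1 * width / 1000),
            "bottom": round(y1 * height / 1000),
        })
    return boxes


# Hypothetical model output for a 1920x1080 screenshot.
sample = '[{"box_2d": [100, 250, 200, 750], "label": "Submit button"}]'
result = parse_boxes(sample, width=1920, height=1080)
print(result)
```

If the model's actual output uses a different key name or coordinate order, only the unpacking line needs to change.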
The list does not end there. Gemma 4 is also demonstrated reconstructing HTML pages from images, performing text-only and multimodal function calling, and correcting and completing code. The smaller E2B and E4B models accept audio and, in video tasks, can process video together with its audio track. The larger 26B A4B and 31B models understand video without audio. According to Hugging Face tests, even without separate post-training on video, the models confidently describe what is happening in a clip and caption complex images.
Why this is practical
Technically, Gemma 4 is built around several design choices intended to improve long-context performance and reduce inference cost. Among them are alternating local sliding-window attention and global full-context attention, separate RoPE configurations for different layers, Per-Layer Embeddings, and a shared KV cache. The last of these allows key-value states to be reused across layers, saving memory and computation, which is especially important for long generation and on-device inference.
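The interleaving of local and global attention can be sketched in a few lines. This is a generic illustration of the technique, not Gemma 4's actual configuration: the layer ratio, window size, and function names below are hypothetical values chosen for the example.

```python
from typing import List, Optional


def layer_types(num_layers: int, global_every: int) -> List[str]:
    """Label layers so that every `global_every`-th layer is global."""
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True if query position i may attend to key position j.

    window=None gives full causal (global) attention. Otherwise each token
    sees only the most recent `window` positions, itself included.
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_tokens(layer_type: str, context_len: int, window: int) -> int:
    """Tokens whose keys/values a layer must keep cached."""
    return context_len if layer_type == "global" else min(context_len, window)


if __name__ == "__main__":
    pattern = layer_types(num_layers=6, global_every=3)  # hypothetical ratio
    print(pattern)
    for row in causal_mask(seq_len=4, window=2):
        print(row)
    cached = sum(kv_tokens(t, context_len=128_000, window=1024) for t in pattern)
    print("cached tokens across layers:", cached)
```

The point of the pattern is visible in `kv_tokens`: a local layer's cache is bounded by the window regardless of context length, so only the global layers pay the full long-context cost.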
Another practical advantage is the breadth of the ecosystem already on release day. Hugging Face announces support for transformers, llama.cpp, MLX, transformers.js with WebGPU, and Mistral.rs, while TRL and Unsloth Studio are available for fine-tuning. This means Gemma 4 is not locked into a single stack: the model can be quickly tried in a browser, on a laptop, on Mac, in a local agent, or in a familiar Python pipeline.
For the open-model market, this is no longer a nice bonus but a necessary condition for real deployment.
What this means
Gemma 4 demonstrates where the open AI market is headed in 2026: less of a race for raw parameter count and more focus on multimodality, long context, and local deployment. If quality is confirmed in independent tests and production cases, developers will have another strong foundational model for agents, offline products, and enterprise scenarios where data privacy, latency, and inference cost are more important than dependence on cloud APIs.