Zyphra Released Zamba2-VL: Visual Models with 10x Faster Response
AI-processed from MarkTechPost; edited by Hamidun News
Zyphra has released an open family of vision-language models, Zamba2-VL, in 1.2B, 2.7B, and 7B parameter variants. At its core is a hybrid architecture that combines Mamba2 and Transformer blocks. The key result: time-to-first-token is roughly 10 times lower than for pure Transformer VLMs of comparable size.
Three sizes, one license
All three variants (1.2B, 2.7B, and 7B parameters) are released under the Apache 2.0 license, which permits commercial use, embedding in products, modification, and redistribution, subject to the license's attribution and notice requirements. This applies equally to commercial and research projects.
The Zamba2-VL models are full-fledged vision-language models. They process images and text jointly, which covers a wide range of tasks: image captioning, visual question answering, analysis of illustrated documents, parsing of user-interface screenshots, and work with medical images.
Unlike text-only LLMs, VLMs can answer questions about what is depicted in an image and combine visual and textual context in a single request.
On standard benchmarks, Zamba2-VL is on par with pure Transformer VLMs of comparable size. The move to a hybrid architecture does not trade accuracy for speed: both remain competitive.
How the hybrid backbone works
Most modern language and multimodal models are built on a pure Transformer architecture. In it, every token attends to the entire preceding sequence through the attention mechanism. This approach is powerful but computationally expensive: the cost of attention grows quadratically with context length. For long inputs, such as a prompt that includes many image tokens, this becomes the performance bottleneck and drives up time-to-first-token.
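The quadratic growth can be made concrete with a small count. The sketch below is illustrative only: the function name `causal_attention_pairs` is an assumption, and it counts query-key pairs for one attention head in one layer, ignoring every other cost in a real model.

```python
def causal_attention_pairs(num_tokens: int) -> int:
    """Count the query-key pairs scored by causal self-attention over a prompt.

    Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return num_tokens * (num_tokens + 1) // 2


# Doubling the prompt length roughly quadruples the number of pairs.
for n in (1_000, 2_000, 4_000):
    print(n, causal_attention_pairs(n))
```

Each doubling of the prompt multiplies the pair count by about four, which is the pattern behind slow prompt processing on long inputs.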
Mamba2 is an architecture based on state space models (SSMs). Instead of revisiting the full history at every step, it compresses the preceding context into a compact, fixed-size "state" that is updated as each new token is processed, so the cost grows linearly with sequence length.
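The idea of a fixed-size state can be shown with a toy linear recurrence. This is not Mamba2 itself, which uses learned, input-dependent parameters. The function `ssm_scan` and the constants `decay`, `gain`, and `state_size` are assumptions chosen purely for illustration.

```python
def ssm_scan(inputs, decay=0.9, gain=0.1, state_size=4):
    """Toy diagonal linear recurrence: h_t = decay * h_{t-1} + gain * x_t.

    Returns the final state and the number of update steps performed.
    The state never grows, no matter how long the input is.
    """
    state = [0.0] * state_size
    steps = 0
    for x in inputs:
        # One constant-cost update per token: work is linear in sequence length.
        state = [decay * h + gain * x for h in state]
        steps += 1
    return state, steps


short_state, short_steps = ssm_scan([1.0] * 10)
long_state, long_steps = ssm_scan([1.0] * 10_000)
print(len(short_state), short_steps)
print(len(long_state), long_steps)
```

The state keeps the same length for 10 tokens and for 10,000 tokens, while the number of updates equals the number of tokens.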
Zamba2-VL alternates Mamba2 blocks with regular Transformer layers: the SSM blocks provide speed and efficiency, while the Transformer layers add flexibility for complex dependencies.
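One way to picture such an interleaved backbone is as a schedule of block types. The sketch below is hypothetical: the article does not give the actual ratio or placement of attention layers in Zamba2-VL, so `hybrid_schedule` and its `attention_every` parameter are illustrative assumptions, not the model's real layout.

```python
def hybrid_schedule(num_blocks: int, attention_every: int) -> list[str]:
    """Build a hypothetical block layout for a hybrid backbone.

    Every `attention_every`-th block is an attention block; the rest are SSM blocks.
    """
    if num_blocks < 0 or attention_every < 1:
        raise ValueError("num_blocks must be >= 0 and attention_every >= 1")
    return [
        "attention" if (i + 1) % attention_every == 0 else "ssm"
        for i in range(num_blocks)
    ]


# Hypothetical example: 6 blocks with attention in every third position.
print(hybrid_schedule(6, 3))
```

With most positions filled by SSM blocks, only a minority of layers pay the quadratic attention cost.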
The result:
- Time-to-first-token is reduced by approximately 10 times
- Quality remains competitive with pure Transformer VLMs
- Smaller computational footprint during inference
- Better scaling on long contexts
- Ability to deploy on less powerful hardware without losing responsiveness
Why TTFT matters
Time-to-first-token (TTFT) is the interval between sending a request and the appearance of the first token of the response. It determines how responsive interactive systems feel: chatbots, voice assistants, and API services where response speed matters. While the model processes the prompt, the user waits. A high TTFT feels like the system is hanging, even if the final response is high-quality.
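TTFT can be measured for any service that streams tokens. The sketch below is a minimal illustration using only the Python standard library. `fake_stream` is a hypothetical stand-in for a real streaming model client, and its delays are made-up values, not measurements of Zamba2-VL.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_stream(prefill_s: float, num_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical token stream: waits for 'prompt processing', then emits tokens."""
    time.sleep(prefill_s)
    for i in range(num_tokens):
        yield f"tok{i}"
        time.sleep(per_token_s)


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (time to first token, total time, token count) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # The first token has arrived: record the elapsed time once.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, total, count


ttft, total, count = measure_ttft(fake_stream(0.05, 5, 0.01))
print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, tokens: {count}")
```

The same `measure_ttft` function can wrap any iterator of tokens, so two models can be compared on identical prompts.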
A 10-fold reduction in TTFT is a significant practical gain. On the same hardware, it means either a noticeably more responsive service or the capacity to handle more requests at once. For companies paying for GPU time, both options directly affect the product's unit economics.
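How much extra capacity a faster first token yields depends on how much of each request is spent before the first token versus generating the rest of the answer. The sketch below is a simplified model under stated assumptions: requests are handled one at a time, the timings are hypothetical, and `request_speedup` is an illustrative helper rather than a measurement.

```python
def request_speedup(prefill_s: float, decode_s: float, ttft_speedup: float = 10.0) -> float:
    """Overall per-request speedup when only the time before the first token shrinks.

    prefill_s: time before the first token (hypothetical baseline)
    decode_s:  time spent generating the remaining tokens (assumed unchanged)
    """
    if prefill_s < 0 or decode_s < 0 or ttft_speedup <= 0:
        raise ValueError("times must be non-negative and ttft_speedup positive")
    before = prefill_s + decode_s
    after = prefill_s / ttft_speedup + decode_s
    if after == 0:
        raise ValueError("request has zero duration")
    return before / after


# Hypothetical workloads: prompt-heavy versus generation-heavy requests.
print(round(request_speedup(prefill_s=2.0, decode_s=0.2), 2))
print(round(request_speedup(prefill_s=2.0, decode_s=2.0), 2))
```

Under this model, prompt-heavy workloads such as short answers about large images gain the most, while workloads dominated by long generated answers gain less in throughput but still feel more responsive.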
Open models with such response speed enable building products where response latency previously made an entire class of solutions unviable.
What it means
Hybrid SSM + Transformer architectures continue to move from academic papers into practical products. The release of Zamba2-VL as a family of three models, from the compact 1.2B to the full-size 7B, covers different deployment scenarios, from resource-constrained devices to server farms. Open licensing under Apache 2.0 lowers the barrier to entry: teams can take a ready-made, fast multimodal model without depending on commercial APIs, with their pricing, rate limits, and risk of sudden changes in terms.