PrismML and Google Bring Local 200B Model Inference Closer with Bonsai and TurboQuant
Giant local LLMs are already starting to look less exotic. PrismML compressed an 8B model to 1.15 GB in Bonsai, and Google Research introduced TurboQuant…
AI-processed from Habr AI; edited by Hamidun News
Local execution of very large language models is no longer a fantasy reserved for enthusiasts with a server rack. Two recent approaches, 1-bit weights in Bonsai from PrismML and TurboQuant KV-cache compression from Google Research, target two of the most expensive parts of inference: memory for the model weights and memory for long context.
How weights are compressed
PrismML has released Bonsai 8B under the Apache 2.0 license. The model is based on Qwen3-8B, and almost all of its weights are stored in a 1-bit representation. In practice this sharply reduces its size: approximately 1.15 GB versus 16.38 GB for the FP16 version, roughly 14 times smaller. The company emphasizes that this is not simply archive-style packing of the file: the format requires special kernels so that weights are not unpacked back to full FP16 during inference.
The scheme is coarse but not primitive: each weight is encoded as a single bit, and each group of 128 weights shares one FP16 scale. The effective cost is therefore about 1.125 bits per weight (1 bit plus 16/128 bits for the shared scale). According to PrismML, Bonsai 8B generates up to 368 tokens per second on an RTX 4090 and about 131 tokens per second on an M4 Pro, and remains competitive in quality among 8B models, though it does not top the benchmarks.
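The arithmetic behind the 1.125-bit figure, and the general shape of a sign-plus-shared-scale format, can be sketched in a few lines of Python. This is a minimal illustration, not PrismML's implementation: the function names are hypothetical, and taking the scale as the mean absolute value of the group is an assumption, since the source does not say how the scale is chosen. The parameter count is derived from the 16.38 GB FP16 figure above.

```python
# Sketch of sign-plus-shared-scale ("1-bit") group quantization.
# Assumption: the scale is the mean absolute value of the group; the source
# does not say how PrismML actually chooses it.

GROUP_SIZE = 128   # weights per shared scale (from the article)
SCALE_BITS = 16    # one FP16 scale per group (from the article)


def effective_bits_per_weight(group_size=GROUP_SIZE, scale_bits=SCALE_BITS):
    """Average storage per weight: 1 sign bit plus its share of the scale."""
    return 1 + scale_bits / group_size


def quantize_group(weights):
    """Return (sign_bits, scale) for one group of weights."""
    scale = sum(abs(w) for w in weights) / len(weights)
    sign_bits = [1 if w >= 0 else 0 for w in weights]
    return sign_bits, scale


def dequantize_group(sign_bits, scale):
    """Reconstruct every weight as +scale or -scale."""
    return [scale if bit else -scale for bit in sign_bits]


bits_per_weight = effective_bits_per_weight()

# Parameter count implied by the article's 16.38 GB FP16 figure (2 bytes each).
n_params = 16.38e9 / 2
one_bit_size_gb = n_params * bits_per_weight / 8 / 1e9

print(f"{bits_per_weight} bits per weight, about {one_bit_size_gb:.2f} GB")
```

Under these assumptions the estimate lands at about 1.15 GB, in line with the published size. The match is only approximate in principle, because not every weight in the model is stored in 1-bit form.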
How KV-cache is reduced
But light weights alone are not enough. Large models quickly build up a KV-cache, the working memory that stores token representations and grows with context length. This is where Google Research's TurboQuant comes in.
The method compresses the KV-cache without retraining the model and, according to the authors' results, maintains quality even at approximately 3–3.5 bits per channel, a range where ordinary quantization already begins to noticeably degrade answers. The approach rests on two key ideas: first, the data is rotated into a space where it is easier to compress heavily; then a separate step compensates for the compression error.
In this way TurboQuant addresses not only size but also the overhead that often eats up the benefit of ordinary vector quantization. In Google's tests, the method showed at least a sixfold reduction in KV-cache memory and faster attention computation compared with the uncompressed representation.
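To make the first idea more concrete, here is a simplified Python sketch of rotate-then-quantize. It is a stand-in under stated assumptions, not TurboQuant itself: the rotation is a randomized Hadamard transform, the quantizer is plain uniform rounding to 4 bits with a single scale, all function names are hypothetical, and the error-compensation step is not implemented. The vector with one large outlier is synthetic.

```python
import random


def hadamard(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    norm = n ** -0.5
    return [x * norm for x in v]


def rotate(v, signs):
    """Random sign flips followed by the Hadamard transform (an orthogonal map)."""
    return hadamard([x * s for x, s in zip(v, signs)])


def unrotate(v, signs):
    """Inverse of rotate: the normalized Hadamard transform is its own inverse."""
    return [x * s for x, s in zip(hadamard(v), signs)]


def quantize(v, bits):
    """Uniform symmetric quantization with one scale for the whole vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in v) / qmax or 1.0  # avoid a zero scale
    return [round(x / scale) for x in v], scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


random.seed(0)
dim = 128  # illustrative vector length
signs = [random.choice((-1.0, 1.0)) for _ in range(dim)]
key = [random.gauss(0.0, 1.0) for _ in range(dim)]
key[0] = 50.0  # synthetic outlier channel

rotated = rotate(key, signs)
codes, scale = quantize(rotated, bits=4)
restored = unrotate(dequantize(codes, scale), signs)
```

The rotation preserves the vector's norm and is exactly invertible, so nothing is lost before quantization; it only spreads the energy of the outlier across all coordinates so that a single scale fits the data better. Only the integer codes and the scale would need to be stored.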
If approaches are combined
The most interesting part begins where these two ideas are combined. If PrismML's 1-bit approach ever scales to models in the 200B+ class, and TurboQuant retains its properties on long contexts, local execution of such systems will no longer require servers with hundreds of gigabytes of memory. Taking Qwen3-235B-A22B as an example, the estimates are debatable in their technical details but look realistic rather than fantastical. They describe where hardware and inference are heading, not a finished product:
- Model weights in bfloat16: approximately 437.7 GiB
- Hypothetical 1-bit variant by analogy with Bonsai: approximately 30.8 GiB
- KV-cache for 128k context in 16 bits: approximately 23.5 GiB
- KV-cache with TurboQuant at 3.5 bits: approximately 5.1 GiB
- Total weights and cache: on the order of 36 GiB instead of more than 460 GiB
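The estimates in the list can be reproduced with a short back-of-the-envelope script. The sketch below rests on assumptions that the article does not state: a total of 235 billion parameters, and an attention layout of 94 layers, 4 KV heads, and a head dimension of 128 for Qwen3-235B-A22B. These values were chosen because they reproduce the 23.5 GiB figure; the 1.125 bits per weight is carried over from Bonsai's group format. All names are hypothetical.

```python
GIB = 2 ** 30

# Assumed model shape (not given in the article).
N_PARAMS = 235e9
N_LAYERS = 94
N_KV_HEADS = 4
HEAD_DIM = 128
CONTEXT_TOKENS = 128 * 1024  # "128k" context


def weights_gib(bits_per_weight):
    """Memory for all weights at the given average bits per weight."""
    return N_PARAMS * bits_per_weight / 8 / GIB


def kv_cache_gib(bits_per_value):
    """Memory for keys and values of every layer over the full context."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # keys + values
    return values_per_token * CONTEXT_TOKENS * bits_per_value / 8 / GIB


baseline = weights_gib(16) + kv_cache_gib(16)
compressed = weights_gib(1.125) + kv_cache_gib(3.5)

rows = [
    ("Weights, bfloat16", weights_gib(16)),
    ("Weights, 1.125 bits", weights_gib(1.125)),
    ("KV-cache, 16 bits", kv_cache_gib(16)),
    ("KV-cache, 3.5 bits", kv_cache_gib(3.5)),
    ("Total, uncompressed", baseline),
    ("Total, compressed", compressed),
]
for label, value in rows:
    print(f"{label}: {value:.1f} GiB")
```

The script counts only weights and KV-cache. Activations, runtime buffers, and any weights kept at higher precision are left out, so real memory use would be somewhat higher.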
This is not yet a promise of a ready-made 235B home assistant. Open questions remain about memory bandwidth, the quality of low-bit kernels, stability on real tasks, and how well the 1-bit scheme transfers from 8B to substantially larger models. But the direction is changing: previously the conversation was about how to compress a 7B or 14B model for a laptop, and now the question is whether a 200B-class model can be brought to local hardware.
What this means
The local LLM market is shifting from cosmetic optimization to architecturally significant advances in inference. If Bonsai and TurboQuant prove scalable, the winners will be not only enthusiasts but also companies that need privacy, low latency, and powerful models without constant cloud dependence. For corporate teams, this opens a path to a new class of local assistants running on one powerful node rather than on a separate cluster.