DeepSeek-V4-Pro Compressed 50x, Now Runs on a Free Kaggle T4
Researchers tested running DeepSeek-V4-Pro's 1.6 trillion parameters without an expensive cluster: the project author compressed weights via SVD, processed…
AI-processed from Habr AI; edited by Hamidun News
The DeepSeek-V4-Pro experiment demonstrates that even a model in the 1.6-trillion-parameter class can be brought to a working state without an H100 cluster, provided you abandon the idea of running it in its original form. Instead of full-fledged inference, the project author assembled an extremely aggressive approximation: the weights were compressed through low-rank decomposition, the giant shards were processed as a stream, and the architecture was manually adapted to existing tools. The result is far from production-ready, but the mere fact that it runs on a free NVIDIA T4 in Kaggle is a strong demonstration of how much mathematics and engineering ingenuity matter today.
The original description concerns DeepSeek-V4-Pro, which the author describes as a 1.6-trillion-parameter MoE model with weights exceeding 800 GB. Systems of this class typically need entirely different infrastructure: several H100s, large amounts of video memory, fast interconnects between nodes, and ample local disk space. Against this backdrop, choosing a free Kaggle instance with a T4, 16 GB of VRAM, and approximately 50 GB of disk looks less like an attempt to replicate a standard deployment and more like an experiment at the edge of what is possible. The problem statement itself also matters: the goal was not to preserve the model in its original form, but to check how much useful structure survives radical compression.
The key move in the project is abandoning standard 4-bit quantization in favor of SVD transmutation, that is, low-rank decomposition of the weight matrices: each matrix is approximated by the product of two much thinner factors. According to the author's description, a rank of 64 yielded roughly 50-fold compression. The scheme preserves the dominant dependencies between parameters but discards a great deal of detail, and with it part of the quality.
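As a back-of-the-envelope check on these figures, the sketch below counts what a rank-r factorization has to store. It assumes a weight matrix of shape m x n is replaced by two factors of shapes m x r and r x n (singular values folded into one of them), so storage falls from m*n values to r*(m+n). The 7168 x 7168 shape is purely illustrative and is not taken from the model's real configuration; the function names are hypothetical as well.

```python
def dense_params(m: int, n: int) -> int:
    """Number of values stored by a full m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of values stored by factors A (m x r) and B (r x n), where W ~ A @ B."""
    return r * (m + n)


def compression_ratio(m: int, n: int, r: int) -> float:
    """How many times smaller the factorized form is than the dense matrix."""
    return dense_params(m, n) / low_rank_params(m, n, r)


# Illustrative shape only (an assumption, not the model's actual dimensions).
ratio = compression_ratio(7168, 7168, 64)
print(f"rank-64 ratio for a 7168 x 7168 matrix: {ratio:.1f}x")
```

For a square n x n matrix the ratio reduces to n / (2r), so at rank 64 a figure of about 50x corresponds to matrices roughly 6,400 columns wide. The ratio says nothing about how much accuracy is lost; that depends on how quickly the singular values of each matrix decay.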
For a giant model this is a harsh trade-off: accuracy drops, but the system gets a chance to fit into the available hardware. In essence, what remains is no longer the original model in the full sense but its mathematical skeleton, which can still preserve part of the context and associative connections.
The second important element is handling the weights in an almost emergency MLOps mode. Instead of storing the entire parameter set locally, the author processed shards sequentially through safe_open: download one file, extract the needed tensor, compress it in RAM, send the result to the repository, and completely clear the cache before the next step. This made it possible to squeeze through the disk limit a set of weights that would normally never fit on a free machine. The author separately emphasizes that RAM consumption never exceeded 4 GB.
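The one-shard-at-a-time loop can be sketched in plain Python. This is a minimal illustration of the control flow only, not the author's code: `fetch`, `compress`, and `publish` are hypothetical callables standing in for the download step, the safe_open-based extraction and compression, and the upload to the repository.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def stream_shards(
    shard_names: Iterable[str],
    fetch: Callable[[str, Path], Path],     # hypothetical: download one shard into a directory
    compress: Callable[[Path], bytes],      # hypothetical: read tensors, return compressed payload
    publish: Callable[[str, bytes], None],  # hypothetical: upload the compressed result
) -> int:
    """Process shards sequentially so that at most one shard sits on disk at a time."""
    processed = 0
    for name in shard_names:
        # Each shard gets its own scratch directory.
        workdir = Path(tempfile.mkdtemp(prefix="shard_"))
        try:
            shard_path = fetch(name, workdir)
            payload = compress(shard_path)
            publish(name, payload)
            processed += 1
        finally:
            # Free the disk before the next shard, even if a step failed.
            shutil.rmtree(workdir, ignore_errors=True)
    return processed
```

Peak disk usage under this scheme is bounded by the largest single shard rather than by the full set of weights, which is what lets the job pass through a disk of about 50 GB. Peak RAM depends on what `compress` holds in memory at once, so loading one tensor at a time matters just as much there.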
The RAM figure is an important detail, because in such tasks you run into not only VRAM limits but also file logistics: the model physically cannot be unpacked without intermediate tricks.
The third layer of the construction is what might be called architectural identity theft. According to the author, the transformers library did not yet support DeepSeek-V4, so the configuration had to be disguised as DeepSeek-V2 and the MoE routing patched separately through monkey patching. From an engineering perspective this is a fragile technique: it depends on library versions, the config format, and the design of the expert router. But this step shows precisely that some of the limitations around large models stem not only from hardware but also from tool compatibility. If the stack does not yet know a new architecture, researchers often have to adapt the framework to the model first and only then deal with output quality.
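Both tricks can be shown on a toy example. The sketch below does not touch the transformers library or reproduce its internals: `ToyRouter` is a made-up stand-in for a library's MoE router, the config is an ordinary dict, and the strings "deepseek_v4" and "deepseek_v2" as well as the `original_model_type` key are assumptions chosen for illustration.

```python
import copy
from typing import Dict, List


def mask_config(config: Dict, pretend_type: str) -> Dict:
    """Return a copy of a config dict that claims to be a different architecture."""
    masked = copy.deepcopy(config)
    masked["original_model_type"] = masked.get("model_type")  # hypothetical bookkeeping key
    masked["model_type"] = pretend_type
    return masked


class ToyRouter:
    """Made-up stand-in for a library's expert router (not a real transformers class)."""

    def __init__(self, top_k: int = 1) -> None:
        self.top_k = top_k

    def route(self, scores: List[float]) -> List[int]:
        # "Library" behaviour: always pick the single highest-scoring expert.
        return [max(range(len(scores)), key=lambda i: scores[i])]


def patched_route(self: ToyRouter, scores: List[float]) -> List[int]:
    # Replacement behaviour: indices of the top_k experts, best first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[: self.top_k]


# Monkey patching: swap the method on the class at runtime, keeping the original around.
original_route = ToyRouter.route
ToyRouter.route = patched_route

masked = mask_config({"model_type": "deepseek_v4"}, "deepseek_v2")  # illustrative values
print(masked["model_type"], ToyRouter(top_k=2).route([0.1, 0.7, 0.2]))
```

The fragility is visible even here: the patch silently relies on the method being named `route` and on its argument list. A library release that renames or restructures the router would break such a patch without any change on the user's side.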
The result was a version of the model that, according to the author, fits in the memory of a single T4 and can maintain context, but degrades noticeably in quality. Side effects include hallucinations and the mixing of Russian, English, and Chinese in a single response. This makes the system a poor candidate for production scenarios where accuracy, stability, and predictability matter. As a proof of concept, however, the project works: it shows that even ultra-large open-weight models need not be discussed only in data-center terms, but can also be broken down into more accessible, albeit heavily reduced, configurations.
The main takeaway is not that the T4 has suddenly become a replacement for modern GPU clusters. Rather the opposite: the experiment clearly shows the cost of such compromises and the boundary beyond which running a model means research-level reconstruction rather than full-fledged inference. Yet it is precisely projects like this that advance the practice of compression, approximate inference, and accessible MLOps. The more such workarounds appear, the lower the entry barrier for those who want to experiment with large models without a corporate budget.