Google Gemma 4 31B reduced to 18.3 GB and run on free Kaggle
AI-processed from Habr AI; edited by Hamidun News
Google's 31-billion-parameter Gemma 4 was compressed from 62 GB down to 18.3 GB and run on free Kaggle, even though the disk limit there is smaller than the original model. To achieve this, the author assembled a rather strict but functional pipeline: on-the-fly quantization, cache deletion during execution, and direct upload of the model to Hugging Face Hub.
Hitting the Limits
The original Gemma 4 31B in float16 format takes about 62 GB, and the available disk on free Kaggle is limited to 57.6 GB. The problem isn't just downloading the model.
You still need to quantize it to 4-bit, produce a new set of weights of around 18 GB, and then upload the result. A simple tally shows that at some stage the original and the result must both be on disk, which already comes to around 80 GB. In other words, the standard scenario breaks down even before the first successful upload.
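The arithmetic behind those figures can be checked in a few lines. This is a rough sketch using only the numbers quoted in the article; the variable names are illustrative, and the 4-bit "floor" ignores any overhead, which is presumably why the published artifact (18.3 GB) is somewhat larger than 15.5 GB.

```python
# Back-of-the-envelope disk budget using the figures quoted in the article.
PARAMS = 31e9          # 31 billion parameters
GB = 1e9               # decimal gigabytes, matching the article's numbers

fp16_gb = PARAMS * 2 / GB         # float16 stores 2 bytes per weight
nf4_floor_gb = PARAMS * 0.5 / GB  # 4 bits per weight, no overhead counted

DISK_LIMIT_GB = 57.6   # free Kaggle disk limit quoted in the article
QUANTIZED_GB = 18.3    # size of the resulting 4-bit artifact

# Naive workflow: original weights and quantized result on disk together.
naive_peak_gb = fp16_gb + QUANTIZED_GB
shortfall_gb = naive_peak_gb - DISK_LIMIT_GB

print(f"float16 weights:      {fp16_gb:.1f} GB")
print(f"4-bit lower bound:    {nf4_floor_gb:.1f} GB")
print(f"naive peak disk use:  {naive_peak_gb:.1f} GB")
print(f"over the disk limit:  {shortfall_gb:.1f} GB")
```

Even the float16 weights alone exceed the limit, so the shortfall cannot be closed by tidying up afterwards; space has to be reclaimed while the job is still running.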
That's the point of this case: it is not about polished MLOps on expensive hardware, but about running a heavy open-source model under conditions where every gigabyte has to be earned. Instead of an A100, two free T4s were used, so the task came down to two limitations at once: disk and video memory. As a result, the entire pipeline had to be built not around convenience, but around survival: what to store, when to delete, and at what moment to send the artifact out.
How the Pipeline Was Built
The first key technique is to quantize during loading rather than after the full model has been prepared. For this, bitsandbytes and the NF4 format were used, and the device_map="auto" parameter allowed the model to be distributed automatically across two T4s with 15 GB of memory each. The idea is not to wait until all the weights have settled on disk before processing starts. The sooner compression begins, the lower the chance that temporary files and intermediate states will consume all the available space.
- The model loads and quantizes almost simultaneously, without a long pause between stages
- The device_map="auto" parameter distributes sharded weights between two T4s
- After loading into VRAM, the Hugging Face cache is deleted to free up space for the new artifact
- Instead of saving to disk, the model is sent directly to Hugging Face Hub
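The steps above can be sketched with the transformers and bitsandbytes APIs. This is a minimal sketch and not the author's actual script: `quantize_and_push`, `hf_cache_dir`, and both arguments are hypothetical names, `AutoModelForCausalLM` is an assumption about which loader class fits Gemma 4, the cache path lookup is a simplified version of the Hugging Face defaults, and the session is assumed to be already authenticated with the Hub.

```python
import os
import shutil


def hf_cache_dir() -> str:
    """Return the Hugging Face hub cache directory (simplified lookup)."""
    explicit = os.environ.get("HF_HUB_CACHE")
    if explicit:
        return explicit
    default_home = os.path.join(os.path.expanduser("~"), ".cache", "huggingface")
    return os.path.join(os.environ.get("HF_HOME", default_home), "hub")


def quantize_and_push(model_id: str, repo_id: str) -> None:
    """Load a model in 4-bit NF4, delete the local cache, push to the Hub."""
    # Heavy imports stay inside the function so the sketch can be imported
    # on a machine without a GPU or these libraries installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # assumption: float16 on T4
    )

    # Weights are quantized as they are loaded; device_map="auto" spreads
    # the layers across every visible GPU (two T4s in the article).
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )

    # The weights now live in VRAM, so the downloaded originals can go.
    shutil.rmtree(hf_cache_dir(), ignore_errors=True)

    # Serialize and upload in one call, with no separate save_pretrained
    # copy kept around (temporary space may still be used during upload).
    model.push_to_hub(repo_id)
```

Nothing here is recoverable once the cache is gone: if the upload fails, the download and quantization have to be repeated, which is the fragility the article describes.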
The most radical step was deleting the Hugging Face cache folder once the model was already loaded into video memory. The trick relies on Linux behavior: deleting a file removes its name from the file system immediately, even if a process still has it open, and the disk blocks are reclaimed as soon as nothing holds the file open any longer. With the weights already in VRAM, the original 62 GB on disk is no longer needed, so it stops getting in the way of creating the quantized shards. The final stage also avoided unnecessary copies: instead of a normal save followed by a separate upload, the result was sent directly to Hugging Face Hub via push_to_hub.
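The Linux behavior in question can be seen on a small scale. The sketch below is a standalone illustration, not part of the author's pipeline, and `read_after_unlink` is a hypothetical helper: it deletes a temporary file while a handle to it is still open and shows that the data stays readable through that handle. The kernel returns the space to the file system only after the last open handle is closed.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> tuple[bool, bytes]:
    """Delete an open file, then read it back through the open handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()

        os.unlink(path)                     # remove the directory entry
        still_listed = os.path.exists(path)

        handle.seek(0)
        data = handle.read()                # contents are still reachable
    # Leaving the with-block closes the handle; only now can the kernel
    # reclaim the blocks that the file occupied.
    return still_listed, data


still_listed, data = read_after_unlink(b"weights")
print("path still exists:", still_listed)
print("data read after delete:", data)
```

This relies on POSIX unlink semantics and would not behave the same way on Windows, which is one reason the approach is tied to Kaggle's Linux environment.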
"We didn't wait for the upload to finish, we started compressing the weights immediately."
What Came Out
The result was an 18.3 GB artifact: no longer a monster tied to server hardware, but a model you can work with on much more ordinary configurations. The author specifically emphasizes that this is the 31-billion-parameter Gemma 4, which is oriented, among other things, toward tasks involving complex code. The result itself was posted to Hugging Face, so this is not just a theoretical recipe, but a reproducible experiment with a concrete outcome and a ready-made build.
It's important to note that this is neither a universal recipe for production nor an example of neat infrastructure. From the description of the method, it's clear that the approach is fragile: it depends on Linux behavior, on the exact moment of cache deletion, and on the hope that no error will occur during execution that would force you to start almost from scratch. But that's exactly what makes this case interesting. It shows that the accessibility of large models is determined not only by budget but also by engineering ingenuity, especially where infrastructure is limited and you want to try a new model right now.
What It Means
Large open-source models like Google Gemma 4 can be adapted even to very modest infrastructure, if you rebuild the entire process around limitations rather than around a convenient pipeline. For developers, this is a good signal: the barrier to entry for experiments with large LLMs is lowered, but the price of this accessibility is more fragile and manual MLOps scenarios.