DeepSeek and Gemma: How a Hybrid LLM Experiment on Kaggle Broke the Transformers Library

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

Apr 29, 2026. Reading time: 3 min.


Hamidun News Editorial



Enthusiasts assembled a hybrid LLM almost against the rules: they took four layers of the 31B Gemma model and embedded them into an empty DeepSeek-V4 MoE architecture without any additional training. The result was not a working ChatGPT competitor, but the process itself showed how deeply PyTorch and Transformers can be broken while testing an extreme idea.

Why a hybrid is needed

The idea grew out of very tight constraints: two free T4 GPUs on Kaggle, about 30 GB of RAM, and no budget for full-scale fine-tuning. Instead of training, the authors decided to test a cruder approach: structural model grafting. Gemma-4-31B, compressed to the 4-bit NF4 format, served as the donor, while an empty DeepSeek-V4 with its Mixture-of-Experts router served as the "skeleton." The task sounded almost like a meme: transplant weights from one architecture into another and see whether the chimera could be brought to life at all.
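
As a rough back-of-the-envelope check on these constraints, the sketch below estimates raw weight storage for a nominal 31 billion parameters at different bit widths. It is only an estimate under stated assumptions: it ignores quantization constants, activations, and all other overhead, and the helper name `weight_gigabytes` is hypothetical.

```python
def weight_gigabytes(num_params, bits_per_param):
    """Raw weight storage in decimal gigabytes, ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter count taken from the "31B" in the model name (an assumption).
PARAMS = 31e9

nf4_gb = weight_gigabytes(PARAMS, 4)    # 4-bit NF4 storage
fp16_gb = weight_gigabytes(PARAMS, 16)  # 16-bit storage, for comparison

print(f"4-bit: {nf4_gb:.1f} GB, 16-bit: {fp16_gb:.1f} GB")
```

Under these assumptions the 4-bit weights alone take about 15.5 GB, roughly half of the 30 GB of RAM mentioned above, while 16-bit weights would need about 62 GB.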

From an engineering perspective, the project was interesting not for the quality of its answers but for the question of compatibility itself. Gemma and DeepSeek differ in dimensionality, normalization logic, attention blocks, and the rules for routing tokens between experts. Textbooks usually treat such transfers as impossible without intermediate projections and additional training. The authors deliberately ignored this rule and framed the experiment as a stress test of the libraries: what would break first, the model, Kaggle's memory, or the Hugging Face stack itself?

Where everything broke

The first failure came at the loading stage. Transformers did not recognize the custom model type `gemma4`, then crashed with a `dict.to_dict()` error: one of the internal modules had serialized the config into a plain dictionary and then immediately tried to use it as a config object. To proceed, the authors manually registered the type in `CONFIG_MAPPING` and added a monkey patch for `GenerationConfig` that wraps the dictionary in a proxy object on the fly. Without these patches, the operation would have ended before the chimera was even created.
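
The proxy idea can be illustrated without touching Transformers at all. The sketch below is not the authors' patch: `ConfigProxy` and `ensure_config_object` are hypothetical names, and the code only shows how a plain dictionary can be wrapped so that attribute access and a `to_dict()` call both succeed.

```python
class ConfigProxy:
    """Wraps a plain dict so it supports attribute access and to_dict()."""

    def __init__(self, data):
        # Keep a private copy so later changes to the caller's dict do not leak in.
        self._data = dict(data)

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            # Avoid infinite recursion if _data itself is not set yet.
            raise AttributeError(name)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

    def to_dict(self):
        # Return a copy, mirroring what a config object's to_dict() would provide.
        return dict(self._data)


def ensure_config_object(config):
    """Wrap plain dicts in a proxy; pass anything else through unchanged."""
    if isinstance(config, dict):
        return ConfigProxy(config)
    return config


cfg = ensure_config_object({"max_new_tokens": 32, "do_sample": False})
print(cfg.max_new_tokens, cfg.to_dict())
```

In a real monkey patch, a wrapper like this would be applied at the point where the library receives the dictionary; where exactly that happens depends on the Transformers version and is not specified in the source.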

Next, the library tried to initialize the empty DeepSeek sections with random noise by calling `normal_`. At that point PyTorch threw `NotImplementedError`: the donor weights had been loaded via bitsandbytes in 4-bit form and stored as `uint8`, while the normal distribution generator works only with floating-point tensors. The authors solved the problem in the most direct way: they intercepted `TORCH_INIT_FUNCTIONS["normal_"]` and prevented the function from touching byte tensors. The fix is not pretty, but it is narrowly targeted, and without it the transplant would not even start.
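
The interception pattern itself is small. The sketch below reproduces it with standard-library arrays standing in for tensors, so it runs without PyTorch: `array("B")` plays the role of packed `uint8` weights and `array("d")` the role of float weights. The registry `INIT_FUNCTIONS`, the stand-in `normal_`, and the wrapper `skip_packed` are all hypothetical names for illustration; with real tensors the dtype check would more likely use something like `tensor.is_floating_point()`.

```python
import random
from array import array


def normal_(buf, mean=0.0, std=1.0):
    """Stand-in initializer: fills a float buffer with Gaussian noise in place."""
    if buf.typecode not in ("f", "d"):
        raise NotImplementedError(f"normal_ is not implemented for {buf.typecode!r}")
    for i in range(len(buf)):
        buf[i] = random.gauss(mean, std)
    return buf


# Hypothetical registry that maps initializer names to functions.
INIT_FUNCTIONS = {"normal_": normal_}


def skip_packed(init_fn):
    """Wrap an initializer so that non-float (packed) buffers are left untouched."""
    def guarded(buf, *args, **kwargs):
        if buf.typecode not in ("f", "d"):
            return buf  # packed bytes already hold real weights; do not overwrite
        return init_fn(buf, *args, **kwargs)
    return guarded


# The interception: replace the registry entry with the guarded version.
INIT_FUNCTIONS["normal_"] = skip_packed(INIT_FUNCTIONS["normal_"])

packed = array("B", [7, 200, 13])
INIT_FUNCTIONS["normal_"](packed)            # no exception, bytes unchanged
floats = array("d", [0.0] * 4)
INIT_FUNCTIONS["normal_"](floats, std=0.02)  # float buffers are still initialized
```

Guarding on the data type rather than disabling initialization entirely keeps the normal behavior for every float buffer that still needs random values.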

The most unpleasant part turned out to be DeepSeek's experts and memory. The MoE blocks were stored not in a regular `ModuleList` but inside a custom class that could not simply be iterated over. In addition, unpacking a 31B Gemma layer in Kaggle's RAM led to OOM almost instantly. To work around this, the authors wrote a recursive "sonar" that finds the required sublayers by their `gate_proj`, `up_proj`, and `down_proj` attributes, and transferred the weights in micro-batches through the CPU, spreading the models across the two GPUs and calling the garbage collector constantly.
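
A recursive search of this kind might look like the following sketch. It is an illustration on plain Python objects, not the authors' code: the function name `find_expert_blocks` and the traversal rules (descend into dicts, lists, tuples, and instance attributes) are assumptions. Only the three attribute names come from the article.

```python
def find_expert_blocks(root, required=("gate_proj", "up_proj", "down_proj")):
    """Recursively collect (path, obj) pairs for objects exposing all required attributes."""
    found = []
    seen = set()  # object ids already visited, to survive cyclic references

    def visit(obj, path):
        if id(obj) in seen:
            return
        seen.add(id(obj))
        if all(hasattr(obj, name) for name in required):
            found.append((path, obj))
            return  # a matched block is a leaf for this search
        if isinstance(obj, dict):
            children = list(obj.items())
        elif isinstance(obj, (list, tuple)):
            children = list(enumerate(obj))
        elif hasattr(obj, "__dict__"):
            children = list(vars(obj).items())
        else:
            return  # ints, strings, and other leaves
        for key, child in children:
            visit(child, f"{path}.{key}" if path else str(key))

    visit(root, "")
    return found
```

Because the search keys on attribute names rather than on container types, it does not matter whether the experts sit in a list, a dictionary, or a custom wrapper class. The returned paths make it easy to log which blocks were found before any weights are copied.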

"In machine learning, there are no impenetrable architectural walls."

How they stitched the model

The final script reduced the entire operation to a sequence of hard patches and weight copies. For each of the four layers, it fitted the matrices to the target layer size, padding them with zeros where necessary, and transplanted the attention projections separately from the MLP components of each expert it found. Memory was freed immediately after each major copy operation. Even after the transplant, the authors had to rewrite the MoE router so that the model would not crash on dimension conflicts during inference and could output at least some text.
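
The shape-fitting step can be sketched on nested lists. The article only states that matrices were fitted to the target size and zero-padded where necessary; cropping rows and columns that overflow the target is an assumption made here, and `fit_matrix` is a hypothetical helper name.

```python
def fit_matrix(matrix, rows, cols):
    """Return a rows x cols copy of matrix: crop what overflows, zero-pad what is missing."""
    fitted = []
    for r in range(rows):
        # Rows beyond the donor matrix are treated as empty and become all zeros.
        source = matrix[r] if r < len(matrix) else []
        row = list(source[:cols])                # crop extra columns (assumption)
        row.extend([0.0] * (cols - len(row)))    # pad missing columns with zeros
        fitted.append(row)
    return fitted


donor = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
print(fit_matrix(donor, 3, 2))  # 3 x 2 target: columns cropped, one zero row added
```

A transfer like this preserves only the overlapping part of the donor matrix, which is consistent with the article's point that no projections were trained to reconcile the two architectures.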

The key technical steps were as follows:

- Registering the custom Gemma config in the global Transformers registry
- A monkey patch for `GenerationConfig` that works around the `dict.to_dict()` error
- Blocking `normal_` for 4-bit `uint8` weights
- A recursive search for experts and transfer of matrices in micro-batches through the CPU
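
The micro-batch transfer from the last step can be sketched as a chunked copy. This is an illustration on plain lists rather than tensors; `copy_in_chunks` and its `chunk_size` parameter are hypothetical. In a PyTorch setting the staging copy would live on the CPU, and the cleanup step might additionally release cached GPU memory, which is an assumption rather than something the source describes.

```python
import gc


def copy_in_chunks(source_rows, target_rows, chunk_size=2):
    """Copy rows from source to target a few at a time, freeing temporaries between chunks."""
    total = min(len(source_rows), len(target_rows))
    for start in range(0, total, chunk_size):
        stop = min(start + chunk_size, total)
        # Staging copy: only chunk_size rows are duplicated at any moment.
        chunk = [list(row) for row in source_rows[start:stop]]
        target_rows[start:stop] = chunk
        del chunk
        gc.collect()  # aggressive cleanup, mirroring the constant GC calls in the article
    return total


source = [[1, 2], [3, 4], [5, 6]]
target = [[0, 0] for _ in range(3)]
copied = copy_in_chunks(source, target)
print(copied, target)
```

The point of the pattern is that peak extra memory is bounded by the chunk size instead of by the size of the whole layer.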

Testing the resulting chimera showed the expected result: the model did start up, but it generated incoherent text, a mix of random tokens and fragments of phrases. The authors still considered this a success, because the goal was not answer quality but a demonstration of principle. They did not align vocabularies, train transition projections, or fine-tune after the transplant, so a chaotic stream of tokens at the output was almost inevitable. However, the experiment showed that architectural constraints often turn out to be not a wall but a long list of low-level problems that can be bypassed with code.

What this means

Such experiments don't create a finished product, but they do a good job of showing the boundaries of modern open-source LLMs. If weights, quantization, routing, and hidden states can be at least partially aligned without additional training, then space opens up for more practical hybrids, this time with proper projections, vocabulary adaptation, and subsequent tuning. For developers, this is a signal: the stack around models is still fragile, which means the space for non-standard assemblies is only growing.

