
Whisper and Gemma 3 linked with contrastive learning for low-cost speech input to LLMs


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Adding voice input to an LLM in a cost-effective way turned out to be more complex than multimodality papers promise. The experiment's author attempted to connect the Whisper audio encoder and Gemma 3 language model through a compact projector, and after a series of failures arrived at a working configuration using contrastive learning.

How the Stack Was Built

The idea was simple: rather than train an expensive multimodal system from scratch, take a ready-made audio encoder and a ready-made LLM and connect them with a "translator" between their embedding spaces. Whisper Medium was chosen as the encoder because its internal representations are better tuned for speech recognition than those of self-supervised alternatives. On the text side, the author used Gemma 3 4B. A two-layer MLP projector served as the bridge, compressing audio vectors and translating them into the LLM's embedding space.
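The bridge described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's code: the widths (1024 for Whisper Medium, 2560 for Gemma 3 4B), the hidden size, the GELU activation, and the idea of "compressing" by stacking four neighboring frames are all assumptions made for the example.

```python
import numpy as np

WHISPER_DIM = 1024   # assumed width of Whisper Medium encoder outputs
GEMMA_DIM = 2560     # assumed hidden size of Gemma 3 4B
STACK = 4            # hypothetical: number of frames merged into one output vector


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class Projector:
    """Two-layer MLP mapping stacked audio frames to LLM-sized vectors."""

    def __init__(self, in_dim=WHISPER_DIM, out_dim=GEMMA_DIM, stack=STACK,
                 hidden=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.stack = stack
        self.w1 = rng.normal(0.0, 0.02, (in_dim * stack, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.02, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, frames):
        # frames: (T, in_dim) encoder outputs for one utterance
        t, d = frames.shape
        t_trim = t - t % self.stack  # drop trailing frames that do not fill a group
        stacked = frames[:t_trim].reshape(t_trim // self.stack, d * self.stack)
        return gelu(stacked @ self.w1 + self.b1) @ self.w2 + self.b2


# Random stand-in for the encoder output of one 30-second clip (1500 frames).
frames = np.random.default_rng(1).normal(size=(1500, WHISPER_DIM))
audio_tokens = Projector()(frames)
print(audio_tokens.shape)  # (375, 2560)
```

The resulting vectors would be placed in the LLM's input sequence where text token embeddings normally go; stacking shortens the sequence the LLM has to attend over.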

To avoid training the model only on clean studio English, the training stream was assembled from several datasets and mixed dynamically, so that from the start the system saw speech varying in quality, language, and pronunciation style. The author stresses that the mix is there not for pretty statistics but to keep the system from getting used to a single acoustic environment and a single language in the first epochs. Otherwise any deviation, whether noise, a pause, or a Russian fragment, would immediately break recognition.

  • LibriSpeech train.360 as the corpus base
  • LibriSpeech train.100 as additional clean English
  • Russian LibriSpeech for Russian speech
  • DisfluencySpeech with pauses, misspeakings, and stuttering
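Dynamic mixing of the corpora listed above could look roughly like the sketch below. The sampling weights and the toy string examples are hypothetical; the article does not give the actual proportions or loader code.

```python
import collections
import itertools
import random


def mix_streams(streams, weights, seed=0):
    """Yield (source_name, example) pairs, choosing the source at random each step."""
    rng = random.Random(seed)
    names = list(streams)
    iterators = {name: itertools.cycle(streams[name]) for name in names}
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(iterators[name])


# Toy stand-ins for the four corpora; real loaders would yield (audio, transcript) pairs.
streams = {
    "librispeech_train_360": ["ls360_a", "ls360_b", "ls360_c"],
    "librispeech_train_100": ["ls100_a", "ls100_b"],
    "russian_librispeech": ["ru_a", "ru_b"],
    "disfluency_speech": ["dis_a", "dis_b"],
}
# Hypothetical mixing weights, not taken from the article.
weights = {
    "librispeech_train_360": 0.5,
    "librispeech_train_100": 0.2,
    "russian_librispeech": 0.2,
    "disfluency_speech": 0.1,
}

batch = list(itertools.islice(mix_streams(streams, weights), 8))
counts = collections.Counter(
    name for name, _ in itertools.islice(mix_streams(streams, weights), 2000)
)
print(batch)
print(counts)
```

Because the source is drawn anew for every example, each batch tends to contain several corpora at once rather than long runs from a single one.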

Why Everything Broke

The first attempt relied on the most obvious recipe: teacher forcing with standard cross-entropy on transcripts. The LLM received an instruction, the audio vectors, and the correct text as input, with loss computed only on the answer tokens. In practice the scheme barely heard the recording: the model produced incoherent fragments, and WER could get stuck around 300%. Even after adding LoRA, it became clear the problem ran deeper: the projector wasn't bringing the audio signal to a place where the language model could read it. Gemma retained too strong a prior for the familiar geometry of text tokens.
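A WER of 300% may look impossible, but word error rate divides the number of substitutions, deletions, and insertions by the length of the reference, so a model that rambles far past the true transcript can score well above 100%. The sketch below is a plain word-level edit distance with invented example sentences; it assumes a non-empty reference.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or a free match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(wer("hello world", "hello world"))  # 0.0
print(wer("hello world", "hello word"))   # 0.5
# Two correct words buried in six extra ones: 6 insertions / 2 reference words.
print(wer("hello world", "hello hello world world the the the the"))  # 3.0
```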

Then came a series of targeted fixes. The author added a stage zero in which Gemma first simply learned to rewrite text following instructions, since a non-instruction-tuned version was being used. Next came experiments with quantization and regularizers: commitment loss was supposed to keep projector outputs close to known embeddings, SWD (sliced Wasserstein distance) to align the distributions of audio and text vectors, entropy loss to force the system to use more codes, and VICReg to prevent individual coordinates from collapsing.
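To make three of these terms concrete, here is a hedged NumPy sketch of a commitment penalty, a code-usage entropy, and a VICReg-style variance hinge. It only computes the values; the tiny codebook, `beta=0.25`, and the use of hard nearest-neighbor counts for the entropy are assumptions for illustration (a trainable version would need a differentiable form), and none of it is taken from the author's code.

```python
import numpy as np


def quantize(z, codebook):
    """Index of the nearest codebook row for each row of z."""
    # squared distances via |a|^2 - 2ab + |b|^2, shape (N, K)
    d2 = (z**2).sum(1, keepdims=True) - 2.0 * z @ codebook.T + (codebook**2).sum(1)
    return d2.argmin(axis=1)


def commitment_loss(z, codebook, beta=0.25):
    """Mean squared distance from each output to its nearest code, scaled by beta."""
    idx = quantize(z, codebook)
    return float(beta * np.mean(np.sum((z - codebook[idx]) ** 2, axis=1)))


def code_usage_entropy(z, codebook):
    """Entropy (in nats) of how often each code is selected; 0 means a single code."""
    idx = quantize(z, codebook)
    p = np.bincount(idx, minlength=len(codebook)) / len(idx)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def variance_penalty(z, target_std=1.0, eps=1e-4):
    """VICReg-style hinge: penalize coordinates whose std falls below target_std."""
    std = np.sqrt(z.var(axis=0) + eps)
    return float(np.mean(np.maximum(0.0, target_std - std)))


codebook = np.eye(4)                          # toy "known embeddings"
collapsed = np.tile(codebook[0], (8, 1))      # every output is the same vector
spread = np.repeat(codebook, 2, axis=0)       # outputs cover all four codes

print(code_usage_entropy(collapsed, codebook), variance_penalty(collapsed))
print(code_usage_entropy(spread, codebook), variance_penalty(spread))
```

Note that the collapsed outputs have zero commitment loss: sitting exactly on one known embedding satisfies that term perfectly, which is one reason the other penalties were needed alongside it.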

t-SNE visualizations helped identify two main problems: representation collapse and a geometric gap between audio and text spaces. But each new adjustment treated only one symptom. SWD improved distribution shape without improving content. Entropy loss expanded code usage but did so arbitrarily. VICReg increased variance, yet vectors scattered chaotically. The system repeatedly found a workaround where metrics looked locally better while actual recognition didn't emerge.
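One way to see why SWD could fix shape without fixing content: a distribution-level distance does not know which audio vector belongs to which text. The sketch below is an illustration with synthetic data, not the author's experiment. It computes a Monte Carlo sliced Wasserstein distance and shows that shuffling the audio-to-text pairing leaves it unchanged while the per-pair error explodes.

```python
import numpy as np


def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Approximate squared sliced Wasserstein-2 distance between equal-sized point sets."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(x.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)
    # Sorting each 1-D projection discards which row came from which sample.
    px = np.sort(x @ dirs, axis=0)
    py = np.sort(y @ dirs, axis=0)
    return float(np.mean((px - py) ** 2))


def pair_error(x, y):
    """Mean squared distance between row i of x and row i of y."""
    return float(np.mean(np.sum((x - y) ** 2, axis=1)))


rng = np.random.default_rng(0)
text = rng.normal(size=(32, 16))                              # synthetic text vectors
audio_matched = text + 0.01 * rng.normal(size=text.shape)     # each row near its own text
audio_shuffled = np.roll(audio_matched, 1, axis=0)            # same points, wrong pairing

swd_matched = sliced_wasserstein(audio_matched, text)
swd_shuffled = sliced_wasserstein(audio_shuffled, text)
print(swd_matched, swd_shuffled)                              # effectively equal
print(pair_error(audio_matched, text), pair_error(audio_shuffled, text))
```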

This became the main lesson of the regularization phase: with a weak primary signal, the model optimizes the mathematics rather than the meaning.

What Actually Worked

The turning point was abandoning the idea that alignment could be achieved through indirect penalties alone. The author made contrastive learning the primary signal and switched to symmetric InfoNCE: an audio vector should be closer to its transcription than to all other texts in the batch, and vice versa. Unlike previous regularizers, this loss specifies not general statistics but specific pairwise relationships.
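A symmetric InfoNCE loss of this kind can be sketched as follows. The sketch assumes each utterance and each transcript has already been reduced to a single vector (for example by mean pooling) and uses a hypothetical temperature of 0.07; the article specifies neither.

```python
import numpy as np


def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(audio, text, temperature=0.07):
    """Row i of audio and row i of text are a true pair; all other rows are negatives."""
    a = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (B, B) cosine similarities; diagonal = true pairs
    diag = np.arange(len(a))
    loss_a2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each audio picks its text
    loss_t2a = -log_softmax(logits, axis=0)[diag, diag].mean()  # each text picks its audio
    return float(0.5 * (loss_a2t + loss_t2a))


rng = np.random.default_rng(0)
text = rng.normal(size=(32, 16))                        # synthetic pooled text vectors
audio = text + 0.01 * rng.normal(size=text.shape)       # audio vectors near their transcripts

loss_matched = symmetric_info_nce(audio, text)
loss_shuffled = symmetric_info_nce(np.roll(audio, 1, axis=0), text)  # wrong pairing
print(loss_matched, loss_shuffled)
```

The loss is low only when each audio vector sits next to its own transcript, so a wrong pairing is penalized even if the two clouds of points overlap perfectly. Every other item in the batch serves as a negative, which is consistent with the observation that batch size mattered.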

With a large batch this worked noticeably better: the loss curve fell smoothly without sharp jumps, and WER dropped to 35%. The result doesn't yet match commercial ASR systems, but it's no longer random noise. In the logs, the model began making phonetically plausible errors: it picked up word sounds and confused them more like a person with poor hearing than a broken text generator. For a first pass, this matters more than the absolute WER number: the system stopped simulating answers and started genuinely using sound.

This is what the author considers the main sign of progress.

"But the main thing is that it's already hearing."

What This Means

The case shows that a cheap audio modality for local LLMs is possible, but not through a "magic" MLP projector as papers suggest. Simply pairing a ready-made encoder with an LLM starts working only once there is a strong alignment signal between them. For developers the takeaway is important: if you want to add voice to your own model without expensive training from scratch, a contrastive stage may turn out to be not an option but a mandatory foundation.

