Hugging Face Explains Fine-tuning of Multimodal Embeddings and Reranker Models

AI-processed from Hugging Face Blog; edited by Hamidun News

Hugging Face has demonstrated something important for practical AI: multimodal search models don't necessarily need to be replaced with larger ones to achieve noticeable quality improvements. In a new guide to Sentence Transformers, the team broke down how to train and fine-tune embedding and reranker models that work not only with text, but also with images, audio, and video. The main idea is simple: if a company already has a general multimodal checkpoint, it can be adapted to its specific task and yield better results than switching to a heavier universal model.

As a practical example, the guide takes on Visual Document Retrieval: given a text query, the system must find the correct document page, where each page is stored as a screenshot. In this scenario the model must understand not just words but also page structure, tables, charts, captions, and visual layout. For the experiment, the Qwen3-VL-Embedding-2B model was fine-tuned on an English-language subset of the LlamaIndex dataset.

The original dataset contains about 500 thousand multilingual query-image pairs; the version prepared for the experiment retained 53,512 English examples. The first 10 thousand records were used for training and the next 300 for evaluation.

The pipeline itself differs little from standard text training in Sentence Transformers. The article emphasizes that the trainer, training arguments, and dataset loading remain the same, while the key differences stem from multimodality. First, the model is loaded with processor_kwargs and model_kwargs, which control image processing quality, computational precision, and the attention implementation. Second, the data can contain text, images, audio, video, or dictionaries that combine several modalities. Third, preprocessing is performed automatically through model.preprocess(). For the main task, the guide uses CachedMultipleNegativesRankingLoss with mini_batch_size=1, which lets a large VLM train without running out of memory, while gradient caching preserves the benefits of a large effective batch size.
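To make the idea behind this loss more concrete, here is a minimal pure-Python sketch, not the library's implementation. It computes an in-batch ranking loss in which page i is the positive for query i and every other page in the batch acts as a negative. It then shows that encoding the batch one item at a time gives the same loss as encoding it all at once. The names toy_encoder, encode_in_mini_batches, and in_batch_ranking_loss are hypothetical, the scale of 20 on cosine similarity is an assumption, and the backward pass that gradient caching also handles is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def in_batch_ranking_loss(query_embs, page_embs, scale=20.0):
    """Mean cross-entropy: page i is the positive for query i,
    all other pages in the batch serve as negatives."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, page) for page in page_embs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def toy_encoder(item):
    """Hypothetical stand-in for a real model: maps a pair of numbers
    to a 3-dimensional embedding."""
    a, b = item
    return [a + b, a - b, a * b]


def encode_in_mini_batches(items, mini_batch_size):
    """Encode a batch chunk by chunk, so only one chunk needs to be
    processed by the encoder at any time."""
    embeddings = []
    for start in range(0, len(items), mini_batch_size):
        chunk = items[start:start + mini_batch_size]
        embeddings.extend(toy_encoder(item) for item in chunk)
    return embeddings


# Toy data: queries[i] and pages[i] form a matching pair.
queries = [(1.0, 0.2), (0.1, 1.0), (-1.0, 0.5)]
pages = [(0.9, 0.3), (0.2, 1.1), (-0.8, 0.4)]

query_embs = encode_in_mini_batches(queries, mini_batch_size=1)
page_embs_small = encode_in_mini_batches(pages, mini_batch_size=1)
page_embs_full = encode_in_mini_batches(pages, mini_batch_size=len(pages))

loss_small = in_batch_ranking_loss(query_embs, page_embs_small)
loss_full = in_batch_ranking_loss(query_embs, page_embs_full)
print(loss_small, loss_full)
```

Because each item is embedded independently, the chunk size changes only how much work the encoder does at once, not the loss. The number of negatives each query is compared against still equals the full batch size.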

Particular emphasis is placed on MatryoshkaLoss. This wrapper around the base loss function teaches the model to concentrate useful information in the early dimensions of the embedding. In practice, this allows reducing vector size during deployment without a sharp drop in search quality.
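On the deployment side, shrinking a Matryoshka-style embedding amounts to keeping its leading dimensions and renormalizing. The sketch below illustrates this with invented 8-dimensional toy vectors and adds a rough storage estimate. The helper names are hypothetical, and the figure of one million vectors and the 4 bytes per value (float32) are assumptions. The 2048 and 1024 sizes are the Qwen3-VL figures the article cites.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, pages, dim):
    """Return the name of the page closest to the query at the given size."""
    truncated_query = truncate_and_normalize(query, dim)
    return max(
        pages,
        key=lambda name: cosine(
            truncated_query, truncate_and_normalize(pages[name], dim)
        ),
    )


def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for a flat index, assuming float32 values by default."""
    return num_vectors * dim * bytes_per_value


# Invented toy vectors whose largest values sit in the early dimensions.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.03]
pages = {
    "page_a": [0.8, 0.2, 0.25, 0.1, 0.3, 0.01, 0.0, 0.02],
    "page_b": [0.1, 0.9, 0.0, 0.3, 0.02, 0.05, 0.01, 0.0],
}

best_full = best_match(query, pages, dim=8)
best_truncated = best_match(query, pages, dim=4)
print(best_full, best_truncated)

# Storage for a hypothetical index of one million vectors.
full_bytes = index_size_bytes(1_000_000, 2048)
half_bytes = index_size_bytes(1_000_000, 1024)
print(full_bytes, half_bytes)
```

In this toy case the top result is the same at both sizes. Whether that holds for a real model depends on training that concentrates information in the leading dimensions, which is what MatryoshkaLoss is meant to do.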

For Qwen3-VL, the full embedding size is 2048 dimensions, but after such training the model retains almost the same quality even when truncated to 512 dimensions. Moreover, the final model's configuration was saved with truncate_dim=1024, so by default it returns vectors half the full size, which reduces storage and index requirements.

The results look convincing even without lengthy caveats. After one epoch, the fine-tuned version achieved an NDCG@10 of 0.947 on the evaluation set, while the baseline Qwen3-VL-Embedding-2B scored 0.888.
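For readers unfamiliar with the metric, NDCG@10 rewards placing relevant results near the top of the first ten positions. The sketch below shows the standard formula with linear gain. The function names and the example rankings are illustrative and are not taken from the article's evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked results.
    The result at 0-based position `rank` is discounted by log2(rank + 2)."""
    return sum(
        rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the actual ranking divided by the DCG of the ideal ranking.
    `ranked_relevances` lists the relevance of every candidate, in the
    order the system returned them."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Illustrative rankings with exactly one relevant page per query.
hit_at_first = [1, 0, 0]
hit_at_second = [0, 1, 0]
hit_at_eleventh = [0] * 10 + [1]

scores = [ndcg_at_k(r) for r in (hit_at_first, hit_at_second, hit_at_eleventh)]
mean_ndcg = sum(scores) / len(scores)
print(scores, mean_ndcg)
```

With a single relevant page per query, the score for one query depends only on where that page lands. It is perfect at position one, lower at position two, and zero once the page falls outside the top ten. The reported figure is the average over all evaluation queries.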

In the comparison table, this 2-billion-parameter model outperformed not only the original version but also larger systems, including Qwen3-VL-Embedding-8B with a score of 0.923, and several other current multimodal solutions. Additionally, the guide shows that at 512 dimensions the fine-tuned model achieves 0.945, remaining nearly at its peak, and even at 64 dimensions it retains over 92% of maximum quality. For teams that weigh index cost and latency, this is not a minor detail but a practical argument in favor of the approach.

At the end, Hugging Face specifically notes that the same stack can also train multimodal reranker models. This uses CrossEncoderTrainer and specialized loss functions; in the any-to-any reranker example, the model is trained to decide whether an image matches a text by returning a binary score. This matters because in real search systems the retriever and reranker often work in tandem: the first quickly selects candidates, and the second precisely reorders them.

What this means: the era of "take the biggest multimodal checkpoint and hope for the best" is ending. Hugging Face demonstrates a more pragmatic path: take an already available model, fine-tune it on your domain, stay compatible with the familiar Sentence Transformers pipeline, and, if necessary, even shrink the embeddings without noticeable degradation. For teams building search across documents, catalogs, media archives, or internal knowledge bases, this is a direct signal: the quality of multimodal search is increasingly determined not by model size per se but by the quality of domain-specific tuning.
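The retriever and reranker tandem mentioned above can be sketched as a two-stage pipeline. This is a toy illustration, not the CrossEncoderTrainer workflow. The page descriptions, embeddings, and toy_pair_scorer are invented stand-ins for page screenshots, a real embedding model, and a trained reranker.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_embedding, corpus, top_k):
    """Stage 1: rank every page by embedding similarity, keep the top_k."""
    ranked = sorted(
        corpus,
        key=lambda page: dot(query_embedding, page["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]


def rerank(query, candidates, pair_scorer):
    """Stage 2: score each (query, page) pair jointly and reorder the
    shortlist produced by the retriever."""
    return sorted(
        candidates, key=lambda page: pair_scorer(query, page), reverse=True
    )


def toy_pair_scorer(query, page):
    """Hypothetical stand-in for a trained reranker: the fraction of
    query words that appear in the page description."""
    words = query.lower().split()
    description = page["description"].lower().split()
    return sum(word in description for word in words) / len(words)


# Invented corpus: descriptions stand in for page screenshots.
corpus = [
    {"id": "p1", "embedding": [0.9, 0.1],
     "description": "quarterly revenue table by region"},
    {"id": "p2", "embedding": [0.8, 0.3],
     "description": "bar chart of quarterly revenue growth"},
    {"id": "p3", "embedding": [0.1, 0.9],
     "description": "employee onboarding checklist"},
]

query = "revenue growth chart"
query_embedding = [1.0, 0.2]  # invented embedding for the query

shortlist = retrieve(query_embedding, corpus, top_k=2)
reranked = rerank(query, shortlist, toy_pair_scorer)
print([page["id"] for page in shortlist])
print([page["id"] for page in reranked])
```

The split reflects a cost trade-off. The first stage compares precomputed vectors and can scan the whole corpus, while the second stage scores each pair jointly and is therefore applied only to the shortlist.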
