Ettin Reranker from Hugging Face: 6 Models for Precise Search Reranking

Q: What is the source?

Originally published on Hugging Face Blog. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-05-21. Reading time: 3 min.

Hugging Face introduced Ettin Reranker, a family of 6 rerankers ranging from 17 million to 1 billion parameters.

Hamidun News Editorial




Hugging Face released a family of 6 Ettin rerankers based on the ModernBERT architecture. These state-of-the-art models are designed for the second stage of search, reranking already retrieved results, and were trained by distillation from a larger teacher model.

What is a Reranker

A reranker (cross-encoder) is a model that takes a (query, document) pair as input and outputs a single relevance score. The key difference from standard embedding models: a reranker encodes both sequences jointly, so query and document tokens attend to each other through every transformer layer. This makes rerankers more accurate than comparing independently computed embeddings, but also more expensive, because each (query, document) pair needs its own forward pass.
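To make the joint-encoding point concrete, here is a toy sketch of how inputs are typically laid out for BERT-style models. The whitespace tokenizer, the function names, and the literal [CLS]/[SEP] markers are illustrative assumptions, not the actual Ettin tokenizer or API.

```python
def tokenize(text):
    """Toy whitespace tokenizer (a real model uses a subword tokenizer)."""
    return text.lower().split()


def bi_encoder_inputs(query, document):
    """An embedding model encodes each text separately: two sequences,
    so no attention layer ever sees query and document tokens together."""
    return (
        ["[CLS]"] + tokenize(query) + ["[SEP]"],
        ["[CLS]"] + tokenize(document) + ["[SEP]"],
    )


def cross_encoder_input(query, document):
    """A reranker encodes the pair as one sequence, so every layer can
    attend across the query/document boundary."""
    return ["[CLS]"] + tokenize(query) + ["[SEP]"] + tokenize(document) + ["[SEP]"]


query = "what is a reranker"
document = "A reranker scores query document pairs"
q_seq, d_seq = bi_encoder_inputs(query, document)
pair_seq = cross_encoder_input(query, document)
print(len(q_seq), len(d_seq), len(pair_seq))
```

The single joined sequence is what lets a cross-encoder model interactions between query and document terms. It is also why document representations cannot be precomputed: the sequence does not exist until the query arrives.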

How It Works in Practice

Standard search uses a retrieve-then-rerank pattern:

First step: a fast embedding model retrieves the top-K candidates
Second step: the reranker re-sorts these K candidates with high precision
Result: better quality without excessive computational costs

This pattern is more efficient than running the reranker on the entire corpus. Ettin models are trained specifically for this scenario.
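The two steps can be sketched in a few lines. This is a minimal illustration, not the Ettin pipeline: `embed`, `cosine`, `toy_rerank_score`, and `retrieve_then_rerank` are hypothetical names, and both scorers are toy stand-ins (bag-of-words vectors for the embedding model, term overlap plus a phrase bonus for the cross-encoder).

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_rerank_score(query, document):
    """Toy stand-in for a cross-encoder: it sees both texts at once, so it
    can reward the query appearing as an exact phrase in the document."""
    q, d = query.lower().split(), document.lower().split()
    overlap = len(set(q) & set(d)) / len(set(q))
    phrase_bonus = 1.0 if " ".join(q) in " ".join(d) else 0.0
    return overlap + phrase_bonus


def retrieve_then_rerank(query, corpus, top_k=3):
    # Step 1: cheap similarity over the whole corpus, keep the top-K candidates.
    q_vec = embed(query)
    candidates = sorted(corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True)[:top_k]
    # Step 2: the expensive pairwise scorer runs on K candidates only.
    return sorted(candidates, key=lambda doc: toy_rerank_score(query, doc), reverse=True)


corpus = [
    "reranking improves search quality",
    "precise search reranking with small models",
    "search engines index documents",
    "cooking pasta at home",
]
results = retrieve_then_rerank("search reranking", corpus, top_k=3)
for doc in results:
    print(doc)
```

In this toy corpus the retrieval step ranks "reranking improves search quality" first, and the second stage promotes the document containing the exact phrase. With a real reranker the second-stage scorer would be a model call on each (query, candidate) pair.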

Architecture and Optimizations

All models in the family use ModernBERT as a foundation with several key optimizations:

Flash Attention 2 for attention acceleration
Unpadded sequences — each layer sees only actual tokens
CLS pooling instead of mean pooling (found more accurate in an ablation)
Layer stack: Transformer → Pooling → Dense layer → LayerNorm → Dense layer
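The part of that stack after the transformer can be sketched in plain Python. This is a rough illustration under assumptions: the article gives only the layer order, so the dimensions, the hand-picked weights, the absence of activation functions, and the LayerNorm without learned scale and shift are all placeholders, and every function name is hypothetical.

```python
import math


def cls_pool(token_states):
    """CLS pooling: keep only the first token's hidden state."""
    return token_states[0]


def mean_pool(token_states):
    """Mean pooling, shown for contrast: average over all tokens."""
    n = len(token_states)
    return [sum(col) / n for col in zip(*token_states)]


def dense(x, weights, bias):
    """Linear layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift, for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rerank_head(token_states, w1, b1, w2, b2):
    pooled = cls_pool(token_states)             # Pooling
    hidden = layer_norm(dense(pooled, w1, b1))  # Dense layer, then LayerNorm
    return dense(hidden, w2, b2)[0]             # Dense layer to one relevance score


# Three tokens with hidden size 2, standing in for transformer outputs.
token_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[0.5, -0.5]], [0.0]
score = rerank_head(token_states, w1, b1, w2, b2)
print(cls_pool(token_states), mean_pool(token_states), score)
```

The only difference between the two pooling choices is which vector enters the head: the first token's state, or the average over all tokens.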

Unpadding gives a particularly large boost: thanks to it, the 150M model runs 2.3 times faster than two other 150M ModernBERT-based models. The combined speedup from bf16, Flash Attention, and unpadding ranges from 1.7x to 8.3x depending on model size.
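The idea behind unpadding can be shown with a toy batch. This sketch only counts tokens; it is not the ModernBERT implementation. The helper names are hypothetical, and `cu_seqlens` follows the naming commonly used for the boundary array passed to variable-length attention kernels.

```python
def pad_batch(sequences, pad_id=0):
    """Conventional batching: pad every sequence to the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def unpad_batch(sequences):
    """Unpadded batching: concatenate real tokens and record boundaries.

    Returns the flat token list and the cumulative sequence lengths that
    mark where each sequence starts and ends inside it.
    """
    flat, cu_seqlens = [], [0]
    for seq in sequences:
        flat.extend(seq)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens


# Three (query, document) pairs of very different token lengths.
batch = [[5, 6, 7], [8], [9, 10, 11, 12, 13, 14, 15, 16]]
padded = pad_batch(batch)
flat, cu_seqlens = unpad_batch(batch)

padded_tokens = sum(len(row) for row in padded)  # 3 rows * 8 = 24
real_tokens = len(flat)                          # 3 + 1 + 8 = 12
print(padded_tokens, real_tokens, cu_seqlens)
```

Here half of the padded batch is padding that every layer would otherwise process. How much is saved in practice depends on how much sequence lengths vary within a batch, so the speedups quoted above are measurements and cannot be derived from this toy.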

Performance in Numbers

On the MTEB(eng, v2) benchmark, even the compact versions impress, and throughput is high across the family:

17M: 7517 pairs per second (on H100)
32M: 6602 pairs per second
150M: 3237 pairs per second (2.3x faster than two other 150M ModernBERT-based models)
1B: 2.4x faster than the teacher model (1.54B)

This means that for most applications, there is a version that is both fast and accurate.
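As a back-of-the-envelope check, the quoted throughput figures can be turned into an estimated reranking time per query. The candidate count of 100 is an assumption chosen for illustration, and the estimate ignores tokenization, data transfer, and the fact that throughput at small batch sizes may be lower than the peak figures.

```python
# Throughput figures quoted in the article, in pairs per second
# (the article attaches "on H100" to the 17M figure).
PAIRS_PER_SECOND = {"17M": 7517, "32M": 6602, "150M": 3237}


def rerank_latency_ms(num_candidates, pairs_per_second):
    """Estimated time in milliseconds to score num_candidates pairs at a
    sustained throughput, ignoring all per-request overhead."""
    return 1000.0 * num_candidates / pairs_per_second


latencies = {name: rerank_latency_ms(100, rate) for name, rate in PAIRS_PER_SECOND.items()}
for name, ms in latencies.items():
    print(f"{name}: {ms:.1f} ms for 100 candidates")
```

Under these assumptions, reranking 100 candidates takes roughly 13 ms with the 17M model, 15 ms with the 32M model, and 31 ms with the 150M model.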

What This Means

Ettin Reranker makes high-precision search more accessible. The compact versions allow reranking to be integrated even into applications with limited computational resources, while the larger versions compete with state-of-the-art rerankers. Because the distillation uses open data, anyone can reproduce the results and train their own version.
