
Ettin Reranker from Hugging Face: 6 Models for Precise Search Reranking

Hugging Face introduced Ettin Reranker, a family of 6 rerankers ranging from 17 million to 1 billion parameters. The models are built on ModernBERT and trained by distillation from a larger model.

AI-processed from Hugging Face Blog; edited by Hamidun News

Hugging Face released a family of 6 Ettin rerankers based on the ModernBERT architecture. These state-of-the-art models are designed for second-stage reranking of search results and were trained by distillation from a larger teacher model.

What is a Reranker

A reranker (cross-encoder) is a class of models that takes a (query, document) pair as input and outputs a single relevance score. The key difference from standard embedding models is that a reranker encodes both sequences jointly, so query and document tokens can attend to each other through all transformer layers. This makes rerankers more accurate but also more computationally expensive, because every (query, document) pair requires its own forward pass and document representations cannot be precomputed.
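
The difference in interface can be illustrated with a toy sketch. Nothing here is a neural model: `embed`, `bi_encoder_score`, and `cross_encoder_score` are hypothetical stand-ins chosen only to show that an embedding model scores two independently computed vectors, while a cross-encoder looks at the pair together. The bag-of-words encoder exaggerates the gap, since real embedding models do capture word order within each text.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bi-encoder: encode ONE text on its own as a bag-of-words vector."""
    return Counter(text.lower().split())


def bi_encoder_score(query_vec: Counter, doc_vec: Counter) -> float:
    """Cosine similarity between two independently computed vectors."""
    dot = sum(query_vec[t] * doc_vec[t] for t in query_vec)
    norm = math.sqrt(sum(v * v for v in query_vec.values())) * math.sqrt(
        sum(v * v for v in doc_vec.values())
    )
    return dot / norm if norm else 0.0


def cross_encoder_score(query: str, doc: str) -> float:
    """Toy cross-encoder: sees the pair jointly, so it can compare word order.

    Returns the fraction of query bigrams that also occur in the document.
    """
    q = query.lower().split()
    d = doc.lower().split()
    q_bigrams = list(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    if not q_bigrams:
        return 0.0
    return sum(b in d_bigrams for b in q_bigrams) / len(q_bigrams)


query = "dog bites man"
docs = ["man bites dog", "dog bites man"]

# Document vectors can be computed once and reused for any query.
doc_vecs = [embed(d) for d in docs]
q_vec = embed(query)
bi_scores = [bi_encoder_score(q_vec, v) for v in doc_vecs]

# The joint scorer must run once per (query, document) pair.
cross_scores = [cross_encoder_score(query, d) for d in docs]

print(bi_scores)     # both documents get the same score
print(cross_scores)  # the joint scorer separates them
```

The toy embedding cannot tell the two documents apart, while the joint scorer can, at the cost of one scoring call per pair.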

How It Works in Practice

Standard search uses a retrieve-then-rerank pattern:

  • First step: a fast embedding model retrieves the top-K candidates
  • Second step: the reranker re-sorts these K candidates with high precision
  • Result: better quality without excessive computational costs

This pattern is more efficient than running the reranker on the entire corpus. Ettin models are trained specifically for this scenario.
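
The two steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: `retrieve_then_rerank`, `word_overlap`, and `bigram_overlap` are hypothetical names, and the two scoring functions are cheap toys standing in for an embedding model and a reranker. The point is only the control flow: the expensive scorer runs on K candidates, not on the whole corpus.

```python
import heapq
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    fast_score: Scorer,
    precise_score: Scorer,
    top_k: int,
) -> List[Tuple[float, str]]:
    """Keep the top_k documents by the cheap score, then re-sort them by the precise score."""
    # Step 1: cheap scoring over the entire corpus.
    candidates = heapq.nlargest(top_k, corpus, key=lambda doc: fast_score(query, doc))
    # Step 2: expensive scoring over the K candidates only.
    scored = [(precise_score(query, doc), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for the fast retriever: number of shared words."""
    return float(len(set(query.split()) & set(doc.split())))


def bigram_overlap(query: str, doc: str) -> float:
    """Toy stand-in for the reranker: fraction of query bigrams found in the document."""
    q, d = query.split(), doc.split()
    q_bigrams = list(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    return sum(b in d_bigrams for b in q_bigrams) / max(len(q_bigrams), 1)


calls = {"precise": 0}


def counted_precise(query: str, doc: str) -> float:
    """Wrap the precise scorer to count how often it is invoked."""
    calls["precise"] += 1
    return bigram_overlap(query, doc)


corpus = [
    "how to train a reranker model",
    "reranker model a train to how",  # same words, scrambled order
    "cooking pasta at home",
    "train schedules for the weekend",
]
results = retrieve_then_rerank(
    "train a reranker model", corpus, word_overlap, counted_precise, top_k=2
)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
print("precise scorer calls:", calls["precise"])
```

With a corpus of four documents and K = 2, the precise scorer is called twice. In a real system the corpus may hold millions of documents while K stays in the tens or hundreds, which is where the savings come from.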

Architecture and Optimizations

All models in the family use ModernBERT as a foundation with several key optimizations:

  • Flash Attention 2 for attention acceleration
  • Unpadded sequences: each layer sees only actual tokens, not padding
  • CLS pooling instead of mean pooling (found to be more accurate in ablations)
  • Special architecture: Transformer → Pooling → Dense layer → LayerNorm → Dense layer
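
The pooling and head layers listed above can be sketched in plain Python. The source gives only the order of the layers, so everything else here is an assumption made for illustration: the tiny dimensions, the hand-picked weights, the absence of an activation function between layers, and a LayerNorm without learned scale and shift.

```python
import math
from typing import List

Vector = List[float]


def cls_pool(token_embeddings: List[Vector]) -> Vector:
    """CLS pooling: use the first token's vector as the sequence summary."""
    return token_embeddings[0]


def mean_pool(token_embeddings: List[Vector]) -> Vector:
    """Mean pooling, shown for contrast: average over all token vectors."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def dense(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Affine layer y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rerank_head(
    token_embeddings: List[Vector],
    w1: List[Vector],
    b1: Vector,
    w2: List[Vector],
    b2: Vector,
) -> float:
    """Pooling -> Dense -> LayerNorm -> Dense, returning one relevance score."""
    pooled = cls_pool(token_embeddings)
    hidden = layer_norm(dense(pooled, w1, b1))
    return dense(hidden, w2, b2)[0]


# Hypothetical transformer output: 3 tokens, hidden size 2.
tokens = [[1.0, 3.0], [0.0, 0.0], [2.0, 0.0]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # identity layer, for readability
w2, b2 = [[0.0, 1.0]], [0.0]                   # maps hidden size 2 to 1 score

score = rerank_head(tokens, w1, b1, w2, b2)
print(cls_pool(tokens), mean_pool(tokens), round(score, 4))
```

The final dense layer maps the hidden vector to a single number, which matches the single relevance score a reranker outputs per pair.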

Unpadded sequences provide a particular boost. Thanks to this, the 150M model runs 2.3 times faster than two other 150M models based on ModernBERT. The overall acceleration from bf16 + Flash Attention + unpadding reaches 1.7–8.3x depending on model size.
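
A rough way to see why unpadding helps is to count tokens. The sketch below assumes a hypothetical batch of four sequences with made-up lengths. A padded batch processes `batch_size * max_length` positions, while an unpadded batch concatenates only the real tokens and tracks sequence boundaries with cumulative offsets, similar in spirit to what variable-length attention kernels expect.

```python
from itertools import accumulate
from typing import List


def padded_token_count(lengths: List[int]) -> int:
    """Positions processed when every sequence is padded to the longest one."""
    return len(lengths) * max(lengths)


def unpadded_token_count(lengths: List[int]) -> int:
    """Positions processed when only real tokens are kept."""
    return sum(lengths)


def cumulative_offsets(lengths: List[int]) -> List[int]:
    """Start offset of each sequence in the concatenated batch, plus the total."""
    return [0, *accumulate(lengths)]


lengths = [12, 40, 7, 512]  # hypothetical token counts per (query, document) pair

padded = padded_token_count(lengths)
unpadded = unpadded_token_count(lengths)
wasted_fraction = 1 - unpadded / padded

print(padded, unpadded, f"{wasted_fraction:.1%}")
print(cumulative_offsets(lengths))
```

In this made-up batch one long document forces three short ones to be padded, so most padded positions carry no information. The actual speedup depends on the length distribution of real traffic and on the kernels used.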

Performance in Numbers

On the MTEB(eng, v2) benchmark, even the compact versions impress. Reported throughput figures:

  • 17M version processes 7517 pairs per second (on H100)
  • 32M — 6602 pairs per second
  • 150M — 3237 pairs per second (2.3x faster than two other ModernBERT-based 150M models)
  • 1B version runs 2.4x faster than the teacher model (1.54B)

This means that for most applications, there is a version that is both fast and accurate.
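
To put the throughput figures in perspective, they can be converted into a rough per-query reranking time. This is back-of-the-envelope arithmetic, not a benchmark: it assumes the quoted pairs-per-second rates hold for a single query's candidates on the same hardware, takes a hypothetical K = 100 candidates, and ignores retrieval, tokenization, and batching overhead.

```python
from typing import Dict

# Throughput figures quoted in the article (pairs per second).
PAIRS_PER_SECOND: Dict[str, int] = {"17M": 7517, "32M": 6602, "150M": 3237}


def rerank_latency_ms(num_candidates: int, pairs_per_second: float) -> float:
    """Estimated time in milliseconds to score num_candidates (query, document) pairs."""
    return 1000.0 * num_candidates / pairs_per_second


TOP_K = 100  # hypothetical number of candidates passed to the reranker

latencies = {
    name: rerank_latency_ms(TOP_K, pps) for name, pps in PAIRS_PER_SECOND.items()
}
for name, ms in latencies.items():
    print(f"{name}: about {ms:.1f} ms to rerank {TOP_K} candidates")
```

Under these assumptions the smallest model needs roughly 13 ms per query and the 150M model roughly 31 ms, which gives a sense of the trade-off between model size and speed.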

What This Means

Ettin Reranker makes high-precision search more accessible. The compact versions allow reranking to be integrated even into applications with limited computational resources, while the larger versions compete with state-of-the-art models. Distillation on open data means anyone can reproduce the results and train their own version.
