red_mad_robot engineer showed how to build an NER service for résumés: from annotation to API

A red_mad_robot NLP engineer broke down the full NER pipeline for résumés — from task definition and BIO annotation to model training and an API service. The…

Hamidun News Editorial

AI monitoring · Habr AI

May 2, 2026· 3 min

AI-processed from Habr AI; edited by Hamidun News



Habr has published a practical article that tackles NER not in theory but on a real use case: resumes for an HR system. The author walks through the complete path, from choosing entities and preparing a dataset to training a model and launching an API that can be integrated into a working product.

From Task to Schema

At the heart of the case is a practical task: recruiters need a module that takes resumes in PDF or DOC format, extracts the text, and finds first names, surnames, email addresses, phone numbers, skills, and expected salary. Pulling labeled spans out of free text is exactly what NER does, which is why the author chooses it. An important point in the article: the author starts not with the model, but with clarifying business requirements.

First, you need to understand the domain, the language of the documents, the input format, and which fields the customer actually needs. Without this, even a good model will solve the wrong problem. After formalizing requirements, the author defines specific tags: NAME, SURNAME, EMAIL, PHONE, SKILL, and SALARY.

For annotation, he chooses the BIO scheme because in resumes, skills often appear consecutively, and it's important for the model to understand where each individual entity begins. An example like "Python SQL Docker" is telling: if the annotation scheme is chosen poorly, the system can merge several skills into one. This is precisely the case where a seemingly minor decision at the start later affects the quality of the entire system.
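To make the "Python SQL Docker" example concrete, here is a minimal sketch of how BIO tags are decoded back into entities. The function name `bio_to_spans` and the whitespace tokenization are assumptions for illustration, not code from the original article.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (label, text) entities."""
    spans: List[Tuple[str, str]] = []
    label = None          # label of the entity currently being built
    parts: List[str] = []  # tokens collected for that entity

    def close() -> None:
        # Flush the open entity, if any, into the result list.
        nonlocal label, parts
        if label is not None:
            spans.append((label, " ".join(parts)))
        label, parts = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close()  # a B- tag always starts a new entity
            label, parts = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            parts.append(token)  # continue the open entity
        else:
            close()  # "O" or a stray I- tag ends the open entity
    close()
    return spans


tokens = ["Python", "SQL", "Docker"]

# Each skill opens with its own B- tag, so three entities come back.
print(bio_to_spans(tokens, ["B-SKILL", "B-SKILL", "B-SKILL"]))
# [('SKILL', 'Python'), ('SKILL', 'SQL'), ('SKILL', 'Docker')]

# Tagged as one continued span, the same tokens collapse into one skill.
print(bio_to_spans(tokens, ["B-SKILL", "I-SKILL", "I-SKILL"]))
# [('SKILL', 'Python SQL Docker')]
```

The second call shows the failure mode the article warns about: without a distinct "begin" marker for every skill, adjacent skills are indistinguishable from one long entity.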

Where Datasets Break

The largest section of the article is devoted to data, and this makes sense: this is where most of the time is usually spent. For manual annotation, the author uses Label Studio, but separately addresses two practical problems: the service doesn't parse PDF properly on its own and works poorly with simple TXT import. So before annotation, he writes a small Python script that extracts text from PDF and converts documents into JSON format convenient for Label Studio.
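A conversion step of this kind might look roughly like the sketch below. It assumes the text has already been extracted from the PDF by a library of your choice, and it writes the list-of-tasks JSON shape that Label Studio accepts on import, where each task carries its payload under `data`. The helper names, the `source_file` field, and the whitespace cleanup are hypothetical; the author's actual script is not reproduced in this summary.

```python
import json
from typing import Dict, List


def clean_extracted_text(raw: str) -> str:
    """Collapse repeated spaces and drop the empty lines PDF extraction often leaves."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def to_label_studio_tasks(documents: Dict[str, str]) -> List[dict]:
    """Build one import task per document.

    The "text" key is an assumption: it must match the variable used in the
    labeling config of the project (for example value="$text").
    """
    return [
        {"data": {"text": clean_extracted_text(text), "source_file": name}}
        for name, text in documents.items()
    ]


# Hypothetical input: file name mapped to already-extracted raw text.
docs = {"resume_001.pdf": "Ivan  Petrov\n\n\nPython   SQL Docker\n"}
tasks = to_label_studio_tasks(docs)
print(json.dumps(tasks, ensure_ascii=False, indent=2))
```

Keeping the source file name next to the text makes it easier to trace an annotation back to the original document when an extraction error is found later.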

"Garbage in, garbage out."

What follows is particularly useful for applied teams. The author found no ready-made Russian-language resume dataset on Hugging Face, Kaggle, or other popular platforms. So he had to combine several approaches to corpus preparation:

manual annotation of real or prepared resumes in Label Studio
translation of English-language resume datasets into Russian with additional verification
generation of synthetic resumes with an LLM, filled with fictional data from Faker
additional annotation of EMAIL and PHONE via regex where the LLM made mistakes
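The regex step from the last item can be sketched as follows. The patterns here are assumptions: the email pattern is deliberately simple, and the phone pattern only targets Russian-style numbers starting with +7 or 8. The article does not publish the author's exact expressions.

```python
import re
from typing import Dict, List

# Simplified patterns for illustration only; real data needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+7|8)[\s(-]*\d{3}[\s)-]*\d{3}[\s-]?\d{2}[\s-]?\d{2}(?!\d)"
)


def regex_annotate(text: str) -> List[Dict[str, object]]:
    """Return EMAIL and PHONE spans with character offsets, sorted by position."""
    spans: List[Dict[str, object]] = []
    for label, pattern in (("EMAIL", EMAIL_RE), ("PHONE", PHONE_RE)):
        for match in pattern.finditer(text):
            spans.append(
                {
                    "label": label,
                    "start": match.start(),
                    "end": match.end(),
                    "text": match.group(),
                }
            )
    return sorted(spans, key=lambda span: span["start"])


sample = "Contacts: ivan.petrov@example.com, +7 (999) 123-45-67"
for span in regex_annotate(sample):
    print(span["label"], repr(span["text"]))
# EMAIL 'ivan.petrov@example.com'
# PHONE '+7 (999) 123-45-67'
```

Character offsets are kept because span-based annotation tools and token alignment code generally work with start and end positions rather than matched strings.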

The article shows well that an LLM doesn't eliminate the data quality problem; it only changes its form. During resume translation, emails, phone numbers, and technology names could get mangled, and during automatic annotation the model sometimes labeled employers such as Oracle as skills or, conversely, skipped obvious entities. Another important insight: skills need to be annotated throughout the entire text, not just in the "Skills" section, otherwise the model will learn the template of the section rather than the entity itself. For the demonstration, the author collected 114 documents and six different resume templates, but notes explicitly that production typically requires thousands of examples.

Why mBERT Was Chosen

The modeling phase is built around the BERT family. For the experiment, the author compares six pre-trained models, including mBERT, XLM-RoBERTa, ruBERT, and ruRoBERTa-large. Training ran on an RTX 3090 with 16 GB of VRAM, and long documents were additionally split into chunks to fit within the model's context window.
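One common way to do this kind of chunking is a sliding window with overlap, sketched below on a plain token list. This is an assumption about the approach, not the author's code: the summary does not say how the documents were split, and the `max_len` and `overlap` values are illustrative. BERT-base models accept at most 512 positions, so a real `max_len` should also leave room for the special tokens the tokenizer adds.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_len: int = 510, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that an entity cut at one
    window boundary appears whole in the neighbouring window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")

    step = max_len - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Tiny example with artificial sizes so the windows are easy to inspect.
words = [f"w{i}" for i in range(10)]
for chunk in chunk_tokens(words, max_len=4, overlap=1):
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

At inference time the predictions from overlapping windows have to be merged, for example by keeping each entity once when it is found in both windows.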

F1-score was used for evaluation, and formally ruRoBERTa-large performed best. But the author doesn't stop there and moves on to a question that matters more in real projects than a table of metrics: what does each extra fraction of quality cost? The final choice fell not on the benchmark leader, but on bert-base-multilingual-cased.
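For reference, here is a minimal sketch of an entity-level F1 with exact-match scoring, one common convention for NER. The summary does not say which F1 variant the author computed, so treat the matching rule and the `(label, start, end)` representation as assumptions.

```python
from typing import Set, Tuple

# An entity is identified by its label and character offsets.
Entity = Tuple[str, int, int]


def entity_f1(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match F1: a prediction counts only if label and both offsets agree."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {("NAME", 0, 4), ("SKILL", 20, 26), ("SKILL", 27, 30)}
# The model found the name but merged the two adjacent skills into one span.
predicted = {("NAME", 0, 4), ("SKILL", 20, 30)}

print(round(entity_f1(gold, predicted), 2))
# 0.4
```

With exact matching, a merged span such as the one above is penalized twice: it is a false positive, and both original skills become false negatives.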

The reason is simple: mBERT has 178M parameters versus 355M for ruRoBERTa-large, and the F1 difference was around 0.01. This trade-off provides faster inference and less stringent hardware requirements, which is especially important if the service later needs to run not just on GPU but also on CPU.

On top of the selected model, the author builds a FastAPI service with a /predict endpoint that takes resume text and returns the entities found. In other words, the article takes the case beyond an experiment in a notebook to a minimum viable API.
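A service of this shape could be sketched as follows. Only the framework (FastAPI) and the route (/predict) come from the article; the request and response schemas, the `create_app` factory, and `extract_entities` are assumptions. In particular, `extract_entities` is a regex stand-in that only finds emails, used here in place of the trained mBERT model so the example stays self-contained.

```python
import re
from typing import Dict, List

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_entities(text: str) -> List[Dict[str, object]]:
    """Placeholder for model inference: NOT the article's mBERT model.

    A real service would tokenize the text, run the fine-tuned model,
    and decode BIO tags into spans here.
    """
    return [
        {"label": "EMAIL", "start": m.start(), "end": m.end(), "text": m.group()}
        for m in EMAIL_RE.finditer(text)
    ]


def create_app():
    """Application factory; FastAPI is imported lazily so the inference
    function above can be imported and tested without the web stack."""
    from fastapi import FastAPI
    from pydantic import BaseModel

    class ResumeIn(BaseModel):
        text: str  # plain text already extracted from the PDF or DOC file

    app = FastAPI(title="Resume NER (sketch)")

    @app.post("/predict")
    def predict(payload: ResumeIn) -> dict:
        return {"entities": extract_entities(payload.text)}

    return app


print(extract_entities("Reach me at ivan.petrov@example.com"))
```

Because `create_app` is a factory, an ASGI server has to be told to call it, for example with uvicorn's `--factory` option. Keeping inference in a plain function also makes it easy to test separately from the HTTP layer.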

What This Means

The red_mad_robot analysis is valuable because it presents NER as an engineering task, not an academic exercise. The main conclusion is simple: working results come not from choosing a trendy model, but from a combination of correct problem formulation, a rigorous annotation scheme, manual data verification, and a pragmatic compromise between quality and inference cost.
