NER of a Healthy Person: Why Spans Finally Beat BIO-Tags
AI-processed from Habr AI; edited by Hamidun News
Imagine you're building a house, but instead of working with bricks or entire walls, you force workers to describe each grain of sand in the mortar. That's roughly what we've been doing in NLP for the past ten years, using BIO tagging for Named Entity Recognition (NER). We've grown accustomed to the idea that a model should label every token: here's where the entity started (B), here it continues (I), and here we're outside any entity (O). This was convenient for the mathematics and for good old CRF layers, but it's monstrously inefficient for real systems.
The problem is that an entity in text isn't a sequence of labels, but a coherent fragment with physical boundaries. When we force a model to predict tags for each individual piece of a word, we create colossal redundancy and unnecessary failure points. Anyone who's trained BERT or its derivatives for NER tasks knows this specific pain.
Modern tokenizers like WordPiece or BPE break complex words into subtokens. As a result, a simple surname can turn into three or four fragments, and you end up having to either mask the extra pieces or invent workarounds to merge them in post-processing. You get a prediction that still needs a long and painful decoding step just to answer a simple question: "Where is the director's name here?"
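To make that decoding step concrete, here is a minimal sketch of the post-processing a BIO pipeline typically needs. The subtoken split, the predicted tags, and the helper names `first_subtoken_tags` and `bio_to_spans` are all invented for illustration and are not taken from any library. The sketch assumes the common convention of keeping only the tag predicted for the first subtoken of each word, and it assumes the tokenizer can provide a `word_ids` list that maps every subtoken to its word index (`None` for special tokens).

```python
from typing import List, Optional, Tuple


def first_subtoken_tags(subtoken_tags: List[str],
                        word_ids: List[Optional[int]]) -> List[str]:
    """Keep only the tag predicted for the first subtoken of every word."""
    word_tags = []
    previous = None
    for tag, word_id in zip(subtoken_tags, word_ids):
        if word_id is not None and word_id != previous:
            word_tags.append(tag)
        previous = word_id
    return word_tags


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collapse word-level BIO tags into (start, end_exclusive, label) spans."""
    spans = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((start, i, label))
            if tag == "O":
                start, label = None, None
            else:
                # "B-X" opens a span; a stray "I-X" is leniently treated as an opening too
                start, label = i, tag[2:]
    return spans


# Hypothetical subtoken split of "The director is Wojciechowski" (not real tokenizer output).
subtokens = ["[CLS]", "The", "director", "is", "Wo", "##jcie", "##chow", "##ski", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, None]
# Hypothetical model output: note the noisy "O" in the middle of the surname.
predicted = ["O", "O", "O", "O", "B-PER", "I-PER", "O", "I-PER", "O"]

word_tags = first_subtoken_tags(predicted, word_ids)
entities = bio_to_spans(word_tags)
print(word_tags)  # ['O', 'O', 'O', 'B-PER']
print(entities)   # [(3, 4, 'PER')]
```

Two helper functions and an alignment convention are needed before the pipeline can report a single name, and the noisy tag on the third fragment of the surname is silently discarded rather than resolved.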
The transition to a span-level approach isn't architectural excess for its own sake; it's an admission that we've followed the path of least resistance for too long. Instead of classifying each token, modern systems treat text as a set of candidate spans. The model learns to determine boundaries (a start index and an end index) and to assign an entity type to that span.
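As a rough sketch of that idea, the snippet below enumerates every candidate span up to a maximum width and asks a classifier for a label. Everything here is an assumption made for illustration: `Span`, `enumerate_spans`, `extract`, and the gazetteer-backed `toy_classify` are hypothetical names, and the dictionary lookup merely stands in for a trained span scorer.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence, Tuple


class Span(NamedTuple):
    start: int  # index of the first word, inclusive
    end: int    # index one past the last word, exclusive
    label: str


def enumerate_spans(n_words: int, max_width: int) -> List[Tuple[int, int]]:
    """List every (start, end) pair covering at most max_width words."""
    return [
        (start, end)
        for start in range(n_words)
        for end in range(start + 1, min(start + max_width, n_words) + 1)
    ]


def extract(words: Sequence[str],
            classify: Callable[[Sequence[str]], Optional[str]],
            max_width: int = 4) -> List[Span]:
    """Keep the candidate spans for which the classifier returns a label."""
    found = []
    for start, end in enumerate_spans(len(words), max_width):
        label = classify(words[start:end])
        if label is not None:
            found.append(Span(start, end, label))
    return found


# Toy stand-in for a trained span classifier (assumption, for illustration only).
GAZETTEER = {("Anna", "Karenina"): "PER", ("Paris",): "LOC"}


def toy_classify(span_words: Sequence[str]) -> Optional[str]:
    return GAZETTEER.get(tuple(span_words))


words = "Anna Karenina moved to Paris".split()
entities = extract(words, toy_classify)
print(entities)
# [Span(start=0, end=2, label='PER'), Span(start=4, end=5, label='LOC')]
```

The output is already a list of entity objects with boundaries and a type; no tag sequence has to be decoded afterwards. The price is that the number of candidates grows with sentence length times `max_width`, which is why this sketch caps the span width.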
This removes the problem of inconsistent sequences by construction. In the BIO world, a model can output B-ORG for one token and then I-PER for the next: an organization that continues as a person. With the span-level approach, such a logical error simply cannot be expressed, because each span carries exactly one label.
The model simply says: "From the third to the fifth word we have a location." And this assertion is atomic.
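The contrast can be shown in a few lines. This is only a sketch under assumptions: `bio_violations` and `Span` are hypothetical names, the tag sequence is invented, and the span uses 0-based word indices with an exclusive end, so "from the third to the fifth word" becomes `start=2, end=5`.

```python
from dataclasses import dataclass
from typing import List


def bio_violations(tags: List[str]) -> List[int]:
    """Return positions where an I-X tag does not continue an entity of type X."""
    bad = []
    previous = "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            bad.append(i)
        previous = tag
    return bad


@dataclass(frozen=True)
class Span:
    """One atomic claim: words [start, end) form an entity of type `label`."""
    start: int
    end: int
    label: str

    def __post_init__(self) -> None:
        if not 0 <= self.start < self.end:
            raise ValueError("a span needs 0 <= start < end")


# An invented BIO prediction: an organization that "continues" as a person.
tags = ["O", "O", "B-ORG", "I-PER", "I-PER"]
print(bio_violations(tags))  # [3]

# The span-level equivalent of "from the third to the fifth word we have a location".
location = Span(start=2, end=5, label="LOC")
print(location)  # Span(start=2, end=5, label='LOC')
```

A BIO pipeline has to run a check like `bio_violations` and then decide how to repair what it finds. A span has a single `label` field, so there is nothing to cross-check between tokens.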
Moreover, a single flat BIO sequence cannot represent nested entities, because each token carries exactly one tag. Try to tag the phrase "Moscow State University" when your application needs to extract both the city (Moscow) and the educational institution as separate objects. Within a one-dimensional sequence of tags this turns into a combinatorial nightmare or requires layering multiple models on top of each other. Spans solve this problem naturally: the same text segment, or a subset of it, can belong to different categories at different levels of abstraction. This is critical for legal documents, where a contract is nested in an addendum, which is nested in a deed, or for medicine, where a symptom name can be part of a complex syndrome name.
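Here is the "Moscow State University" case as a small sketch. The `Span` type and the helpers `nested_pairs` and `text_of` are hypothetical names used only for illustration; the point is that two overlapping spans coexist in one list, whereas a flat tag sequence such as `["B-ORG", "I-ORG", "I-ORG"]` has no slot left for the city.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Span(NamedTuple):
    start: int  # first word, inclusive
    end: int    # one past the last word, exclusive
    label: str


def text_of(span: Span, words: Sequence[str]) -> str:
    """Recover the surface text covered by a span."""
    return " ".join(words[span.start:span.end])


def nested_pairs(spans: Sequence[Span]) -> List[Tuple[Span, Span]]:
    """Return (inner, outer) pairs where inner lies within outer's boundaries."""
    return [
        (inner, outer)
        for inner in spans
        for outer in spans
        if inner != outer and outer.start <= inner.start and inner.end <= outer.end
    ]


words = ["Moscow", "State", "University"]
spans = [
    Span(0, 1, "LOC"),  # the city
    Span(0, 3, "ORG"),  # the educational institution
]

for inner, outer in nested_pairs(spans):
    print(f"{text_of(inner, words)} ({inner.label}) is nested in "
          f"{text_of(outer, words)} ({outer.label})")
# Moscow (LOC) is nested in Moscow State University (ORG)
```

Nothing about the representation had to change to support nesting; the entities are simply independent records that happen to overlap.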
Why is it important to talk about this right now? We're rapidly moving out of the era of "let the model output something" and into the era of reliable, industrial AI. In real pipelines, data cleanliness, ease of maintenance, and predictable results have become more important than squeezing an extra percentage point of F1 out of overused academic datasets like CoNLL-2003. Using spans lets you radically simplify code, get rid of hundreds of lines of regular expressions for stitching tokens together, and make models more resilient to tokenization noise.
If your NER module still outputs an endless stream of tags that you then try to assemble into meaningful objects, you're stuck in the past. The modern stack requires working directly with semantic boundaries. This is not only faster to develop, but also simply more logical from a linguistic perspective. We don't read words letter by letter; we perceive phrases and objects as a whole. It's time our models started doing the same.
The key point: It's time to stop teaching models to see tokens and start teaching them to see semantic blocks. The future of NER lies with architectures that work directly with object boundaries, leaving BIO-tags in the history books.