
Habr AI: TAPe detection moved away from transformers and toward almost free segmentation


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

In the seventh entry of their TAPe detection diary, the Habr AI team described an important turn: the model abandoned transformers in favor of a lighter scheme of local connections between patches. Paradoxically, the simplification not only shrank the system but also produced an unexpected side effect: the first signs of skin and clothing segmentation without separate annotations.

Why Remove Transformers

In previous versions of the architecture, transformers handled global connections between visual fragments, but that luxury is expensive in both parameter count and computation. This is acceptable for a research system, but not always for practical detection.

The Habr AI team decided to test whether they could abandon the heavy attention mechanism and keep only what truly helps assemble an object from its observed parts. Based on intermediate results, this step noticeably lightens the model without breaking the core idea of TAPe representation.

The point of the experiment is not to declare transformers unnecessary. Rather, it is that for certain computer vision tasks, local connections work better than one might expect, especially when the model seeks the most informative and contrasting fragments of the scene. If an object can be described through a set of characteristic patches and their immediate neighborhoods, then some of the global complexity can indeed be removed. This makes training cheaper and the architecture simpler to analyze and iterate on.
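To get a rough feel for the savings, the sketch below counts how many patch pairs are scored under full self-attention versus a small local window. The 14x14 patch grid and the window radius are illustrative assumptions, not figures from the diary, and the function names are hypothetical.

```python
def global_pair_count(num_patches: int) -> int:
    """Ordered patch pairs scored by full self-attention (self-pairs included)."""
    return num_patches * num_patches


def local_pair_count(grid_h: int, grid_w: int, radius: int = 1) -> int:
    """Ordered patch pairs when each patch only sees a (2r+1)x(2r+1) window.

    The window is clipped at the image border; self-pairs are included.
    """
    total = 0
    for row in range(grid_h):
        for col in range(grid_w):
            rows = min(grid_h - 1, row + radius) - max(0, row - radius) + 1
            cols = min(grid_w - 1, col + radius) - max(0, col - radius) + 1
            total += rows * cols
    return total


if __name__ == "__main__":
    # Assumed grid: 14 x 14 = 196 patches (hypothetical, for illustration only).
    print(global_pair_count(14 * 14))   # 38416
    print(local_pair_count(14, 14, 1))  # 1600
```

Under these assumptions the local scheme scores about 24 times fewer pairs. The gap widens as the grid grows, since the global count is quadratic in the number of patches while the local count is roughly linear.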

How Patches Are Connected

Instead of a large block that tries to view the entire image at once, the model builds local associations between TAPe patches. That is, it connects not abstract tokens across the entire image, but neighboring or structurally similar regions from which an object description gradually emerges. This approach is closer to engineering logic: first find key details, then understand which pieces belong together, and only then assemble a complete picture. For detection this is especially useful when boundaries, contours, and the most pronounced visual transitions matter.
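The idea of linking only neighboring, similar regions can be sketched as follows. This is a minimal illustration, not the TAPe implementation: the feature vectors, the cosine similarity measure, the `radius`, and the `min_similarity` threshold are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def local_links(features, radius=1, min_similarity=0.9):
    """Link each patch to nearby patches whose features are similar enough.

    `features` is a grid (list of rows) of feature vectors, one per patch.
    Returns a list of ((row, col), (row2, col2)) pairs, each pair listed once.
    """
    height, width = len(features), len(features[0])
    links = []
    for r in range(height):
        for c in range(width):
            for r2 in range(r, min(height, r + radius + 1)):
                for c2 in range(max(0, c - radius), min(width, c + radius + 1)):
                    if (r2, c2) <= (r, c):
                        continue  # skip self and pairs already visited
                    if cosine(features[r][c], features[r2][c2]) >= min_similarity:
                        links.append(((r, c), (r2, c2)))
    return links


if __name__ == "__main__":
    # Toy 2x2 grid: the top row shares one feature, the bottom row another.
    grid = [[(1.0, 0.0), (1.0, 0.0)],
            [(0.0, 1.0), (0.0, 1.0)]]
    print(local_links(grid))
```

In the toy grid only the horizontal pairs are linked, because the top and bottom rows have orthogonal features. No patch is ever compared with patches outside its window, which is what keeps the cost low.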

The authors describe the practical effect as follows:

  • the model requires fewer parameters than the transformer variant;
  • computational cost drops, making it easier to experiment with architecture;
  • the most contrasting patches begin to serve as anchor points for object description;
  • the internal representation becomes clearer: you can see which local connections actually work;
  • on complex objects like a human, the model can identify not only the silhouette but also internal boundaries.

The last point is the most interesting. When the system relies on contrasting areas, it inadvertently begins to distinguish not only the object from the background but also different zones within the object itself. For a human figure, such a natural boundary is often the transition between skin and clothing. This was not a separate training goal, but a logical consequence of the chosen strategy.
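One simple way to picture "contrasting patches as anchor points" is to score each patch by how much its pixel intensities vary and keep the highest-scoring ones. The sketch below does this with a plain standard deviation on a grayscale image. The diary does not say how TAPe measures contrast, so the metric, the patch size, and the function names are assumptions.

```python
from statistics import pstdev


def patch_contrast(image, top, left, size):
    """Population standard deviation of pixel intensities inside one square patch."""
    pixels = [image[r][c]
              for r in range(top, top + size)
              for c in range(left, left + size)]
    return pstdev(pixels)


def top_contrast_patches(image, patch_size, k):
    """Return grid positions (row, col) of the k highest-contrast patches.

    `image` is a list of rows of grayscale intensities. Patches do not overlap;
    any partial patch at the right or bottom edge is ignored.
    """
    height, width = len(image), len(image[0])
    scored = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            score = patch_contrast(image, top, left, patch_size)
            scored.append((score, (top // patch_size, left // patch_size)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [position for _, position in scored[:k]]


if __name__ == "__main__":
    # Toy 4x4 image: a bright vertical edge runs through the top-right patch.
    image = [[0, 0, 0, 9],
             [0, 0, 0, 9],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    print(top_contrast_patches(image, patch_size=2, k=1))  # [(0, 1)]
```

Patches that straddle an edge score highest, which matches the article's observation that anchors tend to land on boundaries such as the transition between skin and clothing.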

Where Did the Segmentation Come From

The most curious result in the diary is a rudimentary segmentation that appears to arise on its own. The authors do not explicitly teach the model the concept of "skin", nor do they train it to fill in a face according to a mask. But when the system seeks the most contrasting and stable patches, it inevitably latches onto the boundaries between exposed skin, hair, clothing, and background. Within the "human" object, clothing becomes a natural divider, and skin is uniform enough for the model to begin treating it as a separate visual class.
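A toy example shows how regions can fall out of local links alone, with no labels involved. The sketch groups adjacent patches whose mean intensities differ by at most a tolerance, using a breadth-first flood fill. It is an illustration of the general effect under assumed inputs, not the mechanism described in the diary; the intensity values and the `tolerance` parameter are made up.

```python
from collections import deque


def group_patches(patch_values, tolerance):
    """Label connected groups of 4-neighboring patches with similar values.

    `patch_values` is a grid of per-patch mean intensities. Two adjacent
    patches join the same region if their values differ by <= tolerance.
    Returns a grid of integer region labels, numbered in scan order.
    """
    height, width = len(patch_values), len(patch_values[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for r in range(height):
        for c in range(width):
            if labels[r][c] != -1:
                continue  # already assigned to a region
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and labels[nr][nc] == -1
                            and abs(patch_values[nr][nc] - patch_values[cr][cc]) <= tolerance):
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
            next_label += 1
    return labels


if __name__ == "__main__":
    # Hypothetical intensities: 10 = background, 200 = skin, 90 = clothing.
    grid = [[10, 10, 10],
            [10, 200, 10],
            [10, 90, 10],
            [10, 90, 10]]
    for row in group_patches(grid, tolerance=20):
        print(row)
```

On this grid the function returns three regions: the background (label 0), the single bright "skin" patch (label 1), and the two "clothing" patches (label 2). Nothing in the code knows what skin or clothing is; the split comes purely from where neighboring patches stop resembling each other.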

This is not yet full segmentation in a practical sense: it is a side effect of the representation, not a ready-made model that could replace specialized solutions. Still, such effects often suggest where to take the architecture next. If a simple scheme of local associations already divides an object internally, the next step could be cheaper and more accurate segmentation without heavy overhead. Hence the phrase "almost free": the new capability appears not as a separate expensive module, but as a consequence of a simplification already made.

What This Means

The TAPe story makes a broader point: in computer vision, not every improvement requires a larger model. Sometimes dropping a complex block pays off twice: it cuts the system's cost and reveals new properties of the representation. If the effect holds up in subsequent iterations, Habr AI may find a more compact path from detection to segmentation.

