TAPe raised classification accuracy to 77% and compared the results with YOLO on a small COCO dataset
In the eighth part of the TAPe diary, the authors brought together several key improvements: segmentation using contrasting patches at the object boundary…
AI-processed from Habr AI; edited by Hamidun News
The eighth TAPe diary entry focuses not on a single feature but on assembling an almost complete detection workflow: the authors improved segmentation, raised classification accuracy to 77%, and ran a first comparison with YOLO on a small COCO subset. Their interim conclusion is optimistic: the model is already starting to work in scenarios where classical detectors typically need significantly more data.
Segmentation by Boundary
The main update at this stage is segmentation by contrast patches at the actual object boundary, rather than by a conventional bounding box around it. The idea is to divide the image into more natural segments and then assemble the object from them, instead of trying to predict the box directly from raw pixels. According to the authors, this is what quickly improved patch merging quality and made it possible to form more accurate segments for each object in the image.
In parallel, the team tried several other architectural solutions: additional heads, different ways of selecting similar segments, and more complex aggregation variants. These approaches did not produce the expected effect. The reason, as the authors describe it, is that such schemes tried to adapt TAPe data to a familiar architecture instead of using the data as is.
In practice, a more direct approach worked better: rely on the structure of TAPe representations themselves and improve connections between patches.
Classification Without Learning Rate
The next problem was more practical: some patches failed to land in the correct segment. If one or more image fragments are not associated with an object, classifying them correctly becomes difficult, because the model has no indication of what the fragment belongs to. To bring training closer to real model behavior, the authors began simulating step-by-step segment growth from a single patch during training, repeating the same logic used at inference.
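Step-by-step growth from a single patch can be pictured as region growing on a patch grid. The following is a minimal sketch, not the authors' implementation: the per-patch feature vectors, the cosine-similarity criterion, the `threshold` value, and the names `cosine` and `grow_segment` are all assumptions made for illustration.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def grow_segment(features, seed, threshold=0.9):
    """Grow a segment from one seed patch, one neighbour at a time.

    features:  2D grid (list of rows) of hypothetical per-patch feature vectors.
    seed:      (row, col) of the starting patch.
    threshold: assumed similarity cutoff for linking two adjacent patches.
    Returns the patches in the order they joined the segment.
    """
    rows, cols = len(features), len(features[0])
    segment = [seed]
    in_segment = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue  # outside the image
            if (nr, nc) in in_segment:
                continue  # already part of the segment
            # Link the neighbour only if it resembles the patch it touches.
            if cosine(features[r][c], features[nr][nc]) >= threshold:
                in_segment.add((nr, nc))
                segment.append((nr, nc))
                queue.append((nr, nc))
    return segment
```

Running the same function during training and at inference is one way to keep the two code paths identical, which is the point the authors make about matching training to real model behavior.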
This improved the search for correct connections between patches, but did not remove every limitation. The remaining problem is "non-growing" regions: segments that lack context and are misclassified as a result. For such cases, TAPe now additionally checks neighboring areas and smooths the context.
Separately, the authors describe another important engineering goal: systematically eliminating hyperparameters that can break system behavior. One such parameter was the learning rate, which they decided to abandon in this version along with gradient descent.
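The entry does not say how the neighbor check works. As one possible reading, the sketch below smooths per-patch class labels with a majority vote over the four adjacent patches. The voting rule and the name `smooth_labels` are assumptions, not the TAPe method.

```python
from collections import Counter


def smooth_labels(labels):
    """Relabel a patch when a strict majority of its neighbours agree.

    labels: 2D grid (list of rows) of predicted class labels, one per patch.
    Returns a new grid; the input is left unchanged.
    """
    rows, cols = len(labels), len(labels[0])
    smoothed = [row[:] for row in labels]
    for r in range(rows):
        for c in range(cols):
            neighbours = [
                labels[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            if not neighbours:
                continue  # a 1x1 grid has no context to borrow
            label, count = Counter(neighbours).most_common(1)[0]
            # Ties keep the original label, so class boundaries are preserved.
            if count * 2 > len(neighbours):
                smoothed[r][c] = label
    return smoothed
```

With this rule an isolated patch surrounded by another class takes its neighbors' label, while a patch sitting on a boundary between two classes is left alone.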
- Segmentation now proceeds by contrast patches at the object boundary
- Classification accuracy has risen to 77%
- Dropping the learning rate added about 3% accuracy
- The weakest points so far are small segments and lack of context
- The team's next goal is to reach at least 80% classification accuracy
The authors single out one reference point: DETR publications report classification accuracy of around 79%, although it is unclear whether that figure accounts for detection errors. For TAPe, this is not yet the finish line, but it is the nearest goal. Full tests on the entire COCO dataset are still ahead because they take a lot of time, but it is already clear that the architecture has become more stable and better aligned with self-supervised learning tasks.
First Tests with YOLO
The most notable part of the entry is the first direct benchmark against YOLO. For the experiment, the authors took a small slice of COCO with 5,000 images and split it 70/30: 3,500 images for training and 1,500 for testing. For standard detectors, this volume proved critically insufficient. The TAPe authors claim that on this dataset YOLO practically does not converge, and its detection level remains around 1%.
"YOLO does not converge at all for the dataset we use for testing."
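The 70/30 split of the 5,000-image slice is easy to reproduce. The sketch below is an illustration only: the entry does not say how images were selected or shuffled, so the fixed seed, the integer IDs, and the name `split_dataset` are assumptions.

```python
import random


def split_dataset(image_ids, train_fraction=0.7, seed=0):
    """Shuffle image IDs reproducibly and split them into train and test."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # local RNG, global state untouched
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Placeholder IDs standing in for the 5,000 COCO images.
train_ids, test_ids = split_dataset(range(5000))
print(len(train_ids), len(test_ids))  # 3500 1500
```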
The comparison is not yet final in terms of mAP50, mAP50-95, speed, and number of parameters: the authors are still preparing a separate post with full benchmarks against YOLO and RF-DETR. But even this early result matters because it illustrates the project's main thesis. TAPe aims to be not just another detection model, but an architecture that can work with dozens of images per class where more conventional approaches require hundreds of thousands of examples and much heavier pre-trained bases.
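For readers unfamiliar with the metrics named here: both rest on intersection over union (IoU) between a predicted and a ground-truth box. mAP50 counts a detection as a match at IoU of at least 0.5, while mAP50-95 averages over ten thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows only that matching criterion, not a full mAP computation; the `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(predicted, truth):
    """Return the thresholds at which this prediction counts as a match."""
    score = iou(predicted, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]
```

A box that overlaps its target only loosely can still count under mAP50 while contributing little to mAP50-95, which is why the two numbers are usually reported together.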
What This Means
If TAPe confirms these results across the full set of metrics, it will be a strong argument for computer vision that relies on data structure rather than scale alone. For teams with small datasets, this is especially important: the cost of entry for quality detection could drop significantly.