TAPe reaches 98% accuracy on a 2% COCO subset and begins moving from centroid to bounding-box detection
AI-processed from Habr AI; edited by Hamidun News
TAPe continues its series of open computer vision experiments on COCO and reports a new local milestone: accuracy has reached approximately 98% on a 2% subset of the dataset. In parallel, the team has reduced false positives and begun transitioning the model from centroid detection to full bounding box detection.
What the test showed
The new TAPe run was conducted not on the entire COCO dataset but on a 2% subset of approximately 2,400 images, used for rapid iteration. Under these conditions, the team reports around 98% accuracy on its current metric. The key change was the use of inverse pyramids during fine-tuning and data collection: a precise TAPe patch stays at the center, while the scale increases with distance from it. In essence, the model learns to view an object at a local level and in slightly broader context at the same time, which helps separate useful signal from background noise.
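The article does not specify how the inverse pyramid is built, so the following is only one possible reading, sketched under stated assumptions: nested square crops share a center, the innermost is the smallest and most precise, and each further level doubles the side length. The function name `inverse_pyramid_boxes` and the parameters `base`, `levels`, and `growth` are hypothetical and do not come from TAPe.

```python
def inverse_pyramid_boxes(cx, cy, base=16, levels=3, growth=2,
                          width=640, height=480):
    """Return nested crop boxes (x0, y0, x1, y1) centered on (cx, cy).

    Level 0 is the smallest, most precise patch. Each further level
    multiplies the side length by `growth`, so the crop covers more
    context. Boxes are clipped to the image bounds.
    All parameter values are illustrative assumptions.
    """
    boxes = []
    for level in range(levels):
        half = base * growth ** level // 2  # half of the side length
        x0 = max(0, cx - half)
        y0 = max(0, cy - half)
        x1 = min(width, cx + half)
        y1 = min(height, cy + half)
        boxes.append((x0, y0, x1, y1))
    return boxes


if __name__ == "__main__":
    for level, box in enumerate(inverse_pyramid_boxes(100, 100)):
        print(level, box)
```

In a real pipeline each crop would typically be resized to a common resolution, so outer levels carry broader but coarser context. Whether TAPe does this is not stated in the source.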
For the authors, this is neither a final benchmark nor grounds to claim that COCO detection is solved. Rather, it is an intermediate check that the chosen scheme does yield improvements on a small data slice and makes errors faster to find. The article emphasizes not only the accuracy gain but also the reduction in false positives, which matters as much as the hit rate in applied systems.
How training was configured
In parallel, the team was tuning basic training parameters: how many prototypes each class needs, how many background TAPe patches the model should see, and how to balance background against the objects themselves. According to the authors, the best result currently comes from a fairly simple configuration: two prototypes per class and roughly twice as many background examples as object examples. The logic is that background is less distinctive, so the system needs to see more of it to stop treating everything as an object. Too much background backfires quickly, however: the model starts classifying nearly everything as background.
The article also describes a two-stage embedding training mode: first, representations are pushed apart to reduce overlap between classes, and then similar objects are pulled closer to each other for accuracy. The authors expect that in the future, some of these stages could be replaced by training on pre-prepared TAPe objects.
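The source gives no loss functions for the two stages, so the sketch below only illustrates the general push-then-pull idea with two simple, commonly used terms. Everything in it is an assumption: the names `push_apart_loss` and `pull_together_loss`, the margin-based separation term, and the use of class centroids as anchors.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def push_apart_loss(centroids, margin=1.0):
    """Stage 1 (assumed form): penalize class centroids closer than `margin`.

    `centroids` maps a class name to its embedding vector.
    """
    names = sorted(centroids)
    loss = 0.0
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = margin - distance(centroids[a], centroids[b])
            loss += max(0.0, gap) ** 2
    return loss


def pull_together_loss(embeddings, labels, centroids):
    """Stage 2 (assumed form): mean squared distance to the own-class centroid."""
    total = sum(distance(e, centroids[l]) ** 2
                for e, l in zip(embeddings, labels))
    return total / len(embeddings)


if __name__ == "__main__":
    centroids = {"zebra": [0.0, 0.0], "boat": [3.0, 4.0]}
    print(push_apart_loss(centroids, margin=6.0))
    print(pull_together_loss([[0.0, 1.0], [3.0, 4.0]],
                             ["zebra", "boat"], centroids))
```

A training loop would minimize the first term until classes stop overlapping, then switch to the second. The actual TAPe schedule and objectives may differ.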
- For rapid tests, approximately 2% of COCO was used—roughly 2,400 images
- The best number of prototypes per class is now 2
- The working balance is roughly twice as many background patches as objects
- False positives were reduced to 30 on a set of approximately 1,500 images
- Also under test: how many "views" the model needs for detection without full classification
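The background balance and the false-positive figure from the list above can be made concrete with a short sketch. It assumes patches are held in plain Python lists and that background is subsampled at random. The helper names `balance_patches` and `false_positives_per_image` are hypothetical, and the article does not say how TAPe actually draws its background patches.

```python
import random


def balance_patches(object_patches, background_patches, ratio=2.0, seed=0):
    """Build a labeled training set with about `ratio` background patches
    per object patch (the article reports roughly 2:1 working best).

    Random subsampling is an assumption, not the authors' stated method.
    """
    rng = random.Random(seed)
    target = min(len(background_patches),
                 round(ratio * len(object_patches)))
    sampled_bg = rng.sample(background_patches, target)
    batch = [(p, "object") for p in object_patches]
    batch += [(p, "background") for p in sampled_bg]
    rng.shuffle(batch)
    return batch


def false_positives_per_image(false_positives, num_images):
    """Average number of false positives per evaluated image."""
    return false_positives / num_images


if __name__ == "__main__":
    objects = list(range(10))
    background = list(range(100, 200))
    batch = balance_patches(objects, background)
    print(sum(1 for _, label in batch if label == "background"))
    # Figures reported in the article: 30 false positives, ~1,500 images.
    print(false_positives_per_image(30, 1500))
```

With the reported figures, the second helper gives about 0.02 false positives per image, or roughly one per 50 images.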
Transition to boxes
The most notable shift in the log is the move from searching for object centroids to constructing rectangles around objects. Previously, TAPe in this experiment series primarily searched for the center of each target; now the team is beginning to report results in the conventional object detection format. At this early stage, the authors are cautious in their assessments and give no final quality figures for boxes, but report that the first results look good visually.
Another interesting aspect is the experiments with the number of "views" the model needs. For detection without classification, according to the team, viewing the corners and the center of the image proved sufficient. This is an important signal for the architecture itself: if an object can be localized from a small number of observations, the system could potentially be simpler and cheaper than classical heavy-duty pipelines. For now, however, this applies only to detection without a commitment to precise classification.
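One way to picture "corners and center" is as five fixed crops, similar in spirit to the five-crop scheme often used in image classification. The sketch below is an illustration under that assumption only: the article does not give the size or exact placement of TAPe's views, and `five_view_boxes` and its parameters are hypothetical.

```python
def five_view_boxes(width, height, view_w, view_h):
    """Return five crop boxes (x0, y0, x1, y1): four corners plus center.

    View size and placement are illustrative assumptions.
    """
    if view_w > width or view_h > height:
        raise ValueError("view must fit inside the image")
    cx0 = (width - view_w) // 2   # left edge of the centered view
    cy0 = (height - view_h) // 2  # top edge of the centered view
    return {
        "top_left": (0, 0, view_w, view_h),
        "top_right": (width - view_w, 0, width, view_h),
        "bottom_left": (0, height - view_h, view_w, height),
        "bottom_right": (width - view_w, height - view_h, width, height),
        "center": (cx0, cy0, cx0 + view_w, cy0 + view_h),
    }


if __name__ == "__main__":
    for name, box in five_view_boxes(640, 480, 224, 224).items():
        print(name, box)
```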
Results remain uneven across classes. Average centroid detection accuracy is currently around 72%, but for the most textured classes the metric rises above 90%, reaching 93–94% for objects such as piano keys, zebras, or boats. The system struggles most with forks, because of their small size, and with people, because of high variability: in the dataset, a person can be a close-up face, a figure seen from behind, or someone seated, and grouping all of these under one label makes the task significantly harder.
What this means
So far, TAPe looks less like a ready competitor to YOLO on a general benchmark and more like the careful build-up of a working alternative: higher accuracy on a small COCO slice, less noise, and a first step toward full boxes. If the team sustains this progress as it moves from centroids to detection on stricter metrics, the approach will gain practical weight as well as research value.