TAPe reached 74% accuracy on COCO and began moving away from standard transformers
AI-processed from Habr AI; edited by Hamidun News
A team that keeps an experiment journal on TAPe, its approach to computer vision, reported a new intermediate result on COCO: embeddings trained on fully synthetic data achieved 74% classification accuracy. The authors also reached a second conclusion: standard transformers help verify hypotheses quickly, but in this architecture they become a bottleneck.
How the experiment was structured
The TAPe approach is based on the idea of working not with raw pixels, but with structured elements of an image and the relationships between them. In the new iteration, the authors tackled two tasks at once. The first was training embeddings with a scheme similar to iBOT, but entirely on synthetic data generated according to TAPe rules. The second was standard classification, in which the model has to assign each patch to one of the 80 COCO classes based on its description. This pipeline separates representation learning from applied validation on real images.
- synthetic TAPe data instead of realistic pixel-based generations
- two training tasks: embeddings and classification
- 3,500 images from the COCO validation split, used for training
- 1,500 images from the same validation split, held out for testing
Training on the validation part of COCO looks unusual, but that was the point of the experiment. The authors took a small dataset in which all 80 classes are already represented and whose images are considered more challenging than examples from the training set. This makes it possible to see quickly whether the approach converges under difficult conditions. By their logic, if the model starts working confidently on such a set, further scaling to larger data becomes an engineering task rather than a question of fundamental learnability.
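The 3,500/1,500 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the article does not say how the images were assigned to each part, so the seeded shuffle, the function name `split_validation_ids`, and the placeholder ids are all assumptions.

```python
import random

def split_validation_ids(image_ids, n_train=3500, n_test=1500, seed=0):
    """Shuffle ids deterministically and cut them into disjoint train/test lists."""
    ids = sorted(image_ids)
    if n_train + n_test > len(ids):
        raise ValueError("not enough images for the requested split")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_test]

# Placeholder ids standing in for the 5,000 images of the COCO validation split.
all_ids = list(range(5000))
train_ids, test_ids = split_validation_ids(all_ids)

print(len(train_ids), len(test_ids))
```

Keeping the two lists disjoint is what makes the 1,500-image test score an honest check of a model fitted on the other 3,500.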
Results on COCO
On the first task, the model achieved 82% accuracy in conditional patch reconstruction. The authors read this as a sign that the embeddings already carry sufficiently useful structure, though there is room for improvement. On the classification task, the result was 74% accuracy.
For a first version, this is a notable level, especially because the model was not trained on a giant corpus of natural images: its early training relies entirely on synthetic TAPe data. The authors specifically emphasize the context of this number. By their estimate, the best models for COCO show around 79% on comparable metrics, so there is still a gap, but it no longer looks fundamental.
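To put the two figures side by side, here is a small calculation using only the numbers quoted above. The 79% reference is the authors' own estimate, and the variable names are illustrative.

```python
tape_acc = 0.74  # classification accuracy reported for TAPe
best_acc = 0.79  # authors' estimate for the best COCO models on comparable metrics

gap_points = round((best_acc - tape_acc) * 100, 1)  # gap in percentage points
relative = round(tape_acc / best_acc, 3)            # TAPe as a share of the reference

print(gap_points, relative)  # 5.0 0.937
```

In other words, the reported result sits about 5 percentage points below the reference, or at roughly 94% of it.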
Even more important, in the authors' view, is that TAPe continues to converge on a very small dataset. They contrast this with the YOLO family, which, they claim, struggles to converge even on 5,000 images and whose strong configurations typically require pre-training on ImageNet.
Why transformers get in the way
Currently, the connections between patches in this architecture are still handled by standard transformers. The reason is pragmatic: experiments run faster on them, and it is quicker to check whether the overall approach works at scale. For a research journal, this is a logical compromise.
If the basic hypothesis isn't confirmed, there's no point in immediately building a specialized architecture. But as quality improves, this temporary layer has started to show its limitations. The main complaint about transformers here is that the attention mechanism tries to relearn dependencies between patches that are already explicitly specified in TAPe data.
The authors believe that such a layer is not only redundant but can also corrupt the structured representations themselves. Added to this are slow convergence on full COCO and dependence on standard gradient descent. Therefore, the next step for the project is to move toward an architecture more suited to TAPe, where connections between elements are not reconstructed anew by attention but are used as part of the original structure.
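The contrast between the two ways of connecting patches can be illustrated with a toy example. This is a sketch under stated assumptions, not the TAPe architecture, which the article does not describe. The attention here omits learned query/key projections for brevity, and the adjacency matrix, feature values, and function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, features):
    """Each output row is a weighted sum of the patch feature rows."""
    dim = len(features[0])
    return [[sum(w * f[d] for w, f in zip(w_row, features)) for d in range(dim)]
            for w_row in weights]

def attention_weights(features):
    """Weights inferred from the features themselves (scaled dot-product scores)."""
    scale = math.sqrt(len(features[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in features]
              for q in features]
    return [softmax(row) for row in scores]

def structural_weights(adjacency):
    """Weights read directly from a given relation graph, row-normalized."""
    return [[a / sum(row) for a in row] for row in adjacency]

# Three toy patches with 2-d features (made-up numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical explicit relations: patch 0 is linked to itself and to patch 2, and so on.
adjacency = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]

inferred = mix(attention_weights(features), features)  # relations estimated from the data
given = mix(structural_weights(adjacency), features)   # relations taken as specified

print(given[0])  # [1.0, 0.5]
```

In the attention path every patch receives some weight from every other patch, including pairs the graph says are unrelated. In the structural path, the zero entries of the adjacency matrix stay exactly zero, which is the property the authors want to keep.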
What this means
The experiment so far looks like an early but already meaningful signal: synthetic structured data can produce working embeddings and competitive classification even on a small and challenging slice of COCO. If the next version of TAPe maintains these results after abandoning transformers, this would be a serious argument in favor of alternative CV stacks that are less dependent on huge corpora of pixel data.