YADRO trained the KVADRA_T tablet to recognize multiple objects in a frame in 20 ms
AI-processed from Habr AI; edited by Hamidun News
YADRO has described how it solved a task that resembles object detection but has to run much faster on a mobile device: the KVADRA_T tablet was taught to recognize a person, a document, text, QR codes, and barcodes in a single frame at the same time. The final multi-label model reached an average F1-score of 94% and met the speed requirements for running directly on the device.
Why not multiclass
The company explains that standard multiclass classification was unsuitable here because of the very nature of the task. A single image can contain a person, a passport, lines of text, and a code to scan at the same time, while the classic setup tries to pick only one dominant class.
For a smart gallery or verification scenarios, that is not enough: the device needs to understand the full composition of the scene rather than guess the main object. Running a separate model for each object type was also a poor option, because on a tablet that quickly eats into the time and resource budget.
That is why the team moved to a multi-label approach, where each class is determined independently. But a simple setup with one shared classification head did not work here either: the classes belong to different visual domains, and the shared features started to interfere with one another.
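The difference between the two setups can be shown with a minimal sketch. The label names, the logit values, and the 0.5 threshold below are illustrative assumptions, not values from YADRO's model: softmax with argmax returns one winner, while an independent sigmoid per class can report several objects at once.

```python
import math

# Hypothetical label order; the article names these five object types.
LABELS = ["person", "document", "text", "qr_code", "barcode"]

def softmax(logits):
    """Normalize logits into probabilities that compete and sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Squash one logit into an independent probability."""
    return 1.0 / (1.0 + math.exp(-x))

def multiclass_predict(logits):
    """Classic multiclass: pick the single most probable class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [LABELS[best]]

def multilabel_predict(logits, threshold=0.5):
    """Multi-label: every class is accepted or rejected on its own."""
    return [label for label, x in zip(LABELS, logits) if sigmoid(x) >= threshold]

# Made-up logits for a photo of a person holding a passport.
logits = [2.1, 1.7, 0.9, -3.0, -2.5]
print(multiclass_predict(logits))
print(multilabel_predict(logits))
```

On these made-up logits the multiclass rule reports only the person, while the multi-label rule also keeps the document and the text.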
In the first version, built on MobileNetV3 with a single head, the model reached an F1-score of around 82%. After the switch to a multi-head architecture with independent heads for different object types, the average F1-score rose to 94%, a gain of roughly 12 percentage points.
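As a rough structural sketch of the multi-head idea, the plain-Python code below feeds one shared feature vector into several heads that each own their parameters. The article does not say how YADRO grouped classes into heads or how large the heads are, so `HEAD_GROUPS`, the feature size, and the random weights are all hypothetical stand-ins.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Head:
    """One independent classifier: a linear layer with a sigmoid per label."""

    def __init__(self, labels, feature_dim, rng):
        self.labels = labels
        # Each head owns its weights, so heads do not share parameters.
        self.weights = {
            label: [rng.uniform(-1.0, 1.0) for _ in range(feature_dim)]
            for label in labels
        }

    def forward(self, features):
        return {
            label: sigmoid(sum(w * f for w, f in zip(self.weights[label], features)))
            for label in self.labels
        }

class MultiHeadClassifier:
    """Shared backbone features go to every head; outputs are merged."""

    def __init__(self, head_groups, feature_dim, seed=0):
        rng = random.Random(seed)
        self.heads = {
            name: Head(labels, feature_dim, rng)
            for name, labels in head_groups.items()
        }

    def forward(self, features):
        scores = {}
        for head in self.heads.values():
            scores.update(head.forward(features))
        return scores

# Hypothetical grouping of the five classes into three heads.
HEAD_GROUPS = {
    "people": ["person"],
    "paper": ["document", "text"],
    "codes": ["qr_code", "barcode"],
}

model = MultiHeadClassifier(HEAD_GROUPS, feature_dim=8)
features = [0.5] * 8  # stand-in for the backbone's output vector
scores = model.forward(features)
print(scores)
```

In a real network the heads would be trained layers on top of the backbone; the point here is only that each object type gets its own parameters instead of competing inside one shared output layer.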
How they built the dataset
The hardest part of the project turned out to be the data rather than the choice of backbone. The team had no ready-made public multi-label dataset with the required combination of classes, so the dataset was assembled almost from scratch from Roboflow, Kaggle, and open-source repositories. In parallel, they had to track licenses so the data could be used in a product.
In the end, the developer assembled and cleaned a set of 193,000 images, where it was especially difficult to preserve balance between related classes such as document and text. For automatic labeling, they first tested standard SOTA detectors, mainly models from the YOLO family, but their quality for this task turned out to be insufficient.
After that, the team switched to vision-language models and built a pipeline around them for data cleaning and enrichment. This made it possible not only to label the images but also to remove duplicates and then selectively close gaps in the statistics for rare label combinations.
- compared classical detectors and VLMs for different classes
- chose Qwen2.5-VL-72B-Instruct as the main labeler because it delivered an F1-score of about 98% across classes
- removed duplicates via pHash and checked ambiguous cases via SSIM
- filled in missing label combinations using prompt filters such as "text present, but no document?"
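The two-stage duplicate check from the list can be sketched as a triage on perceptual-hash distance. This is an assumption-laden illustration: it takes 64-bit hashes as already computed, it does not implement pHash or SSIM themselves, and the thresholds `duplicate_max` and `ambiguous_max` are hypothetical rather than YADRO's values.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")

def triage_pair(hash_a, hash_b, duplicate_max=4, ambiguous_max=12):
    """Classify a pair by hash distance (thresholds are assumptions)."""
    distance = hamming_distance(hash_a, hash_b)
    if distance <= duplicate_max:
        return "duplicate"
    if distance <= ambiguous_max:
        return "check_with_ssim"  # borderline: hand over to a slower check
    return "distinct"

def find_duplicates(hashes):
    """Return (names to drop, pairs that need a second look)."""
    names = sorted(hashes)
    drop, review = set(), []
    for i, first in enumerate(names):
        if first in drop:
            continue
        for second in names[i + 1:]:
            if second in drop:
                continue
            verdict = triage_pair(hashes[first], hashes[second])
            if verdict == "duplicate":
                drop.add(second)
            elif verdict == "check_with_ssim":
                review.append((first, second))
    return drop, review

# Made-up hashes: b differs from a by 1 bit, c by 8 bits, d by all 64 bits.
hashes = {
    "a.jpg": 0xF0F0F0F0F0F0F0F0,
    "b.jpg": 0xF0F0F0F0F0F0F0F1,
    "c.jpg": 0xF0F0F0F0F0F0F00F,
    "d.jpg": 0x0F0F0F0F0F0F0F0F,
}
drop, review = find_duplicates(hashes)
print(drop, review)
```

The pairwise loop is quadratic, so a dataset with hundreds of thousands of images would need some form of indexing or bucketing on top of this idea.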
A separate problem emerged with the text class. Because of the nature of the task, the model easily latched onto patterns and lines that looked like letters, so this class had to be additionally constrained and balanced.
This approach made it possible not just to assemble a large dataset, but to make it suitable for a mobile multi-label model, where an error in class distribution quickly turns into false positives on real images.
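One simple way to watch the class distribution in a multi-label dataset is to count label combinations and query for specific patterns, such as text without a document. The sketch below uses invented annotations and hypothetical helper names; it only illustrates the bookkeeping, not YADRO's pipeline.

```python
from collections import Counter

def combination_counts(annotations):
    """Count how often each exact set of labels occurs."""
    return Counter(frozenset(labels) for labels in annotations)

def count_matching(annotations, present, absent=()):
    """Count images that contain all of `present` and none of `absent`."""
    present, absent = set(present), set(absent)
    return sum(
        1
        for labels in annotations
        if present <= set(labels) and not (absent & set(labels))
    )

# Invented per-image label sets.
annotations = [
    {"document", "text"},
    {"document", "text"},
    {"document", "text"},
    {"person"},
    {"person", "document", "text"},
    {"text"},
    {"qr_code"},
]

counts = combination_counts(annotations)
text_without_document = count_matching(annotations, {"text"}, {"document"})
person_with_qr = count_matching(annotations, {"person", "qr_code"})
print(counts.most_common(1), text_without_document, person_with_qr)
```

In this toy set, document and text almost always appear together, which is exactly the kind of imbalance that could teach a model to predict one whenever it sees the other.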
What the tests showed
Following a series of experiments, the team settled on MobileNetV3 Large. The input resolution was chosen as a compromise between quality and speed: at 1024, inference was too heavy, so the final choice was 640, which kept metrics similar while noticeably speeding up processing.
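A quick back-of-the-envelope calculation hints at why the larger input is so much heavier. It assumes square inputs (the article states 640x640 but does not give the second dimension for 1024) and that the cost of a convolutional backbone grows roughly with the number of input pixels; real NPU latency need not scale exactly this way.

```python
def pixel_count(side: int) -> int:
    """Pixels in a square input of the given side length."""
    return side * side

large = pixel_count(1024)   # 1,048,576 pixels
small = pixel_count(640)    # 409,600 pixels
ratio = large / small

print(f"1024x1024 has {ratio:.2f}x the pixels of 640x640")
```

Under these assumptions the 1024 input carries about 2.56 times as many pixels per frame as the 640 input.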
Hyperparameters were tuned with Optuna, while the experiments themselves and training diagnostics were managed in ClearML. This helped track gradient distributions, dataset versions, and the quality of individual runs without ad hoc manual bookkeeping.
After training, the model was converted to ONNX and then to TFLite and RKNN so it could run on mobile and hardware-accelerated configurations. On the NPU of the KVADRA_T tablet, inference at 640x640 takes about 20 ms, and the full frame-processing path fits into roughly 30 ms.
That is comfortably under the 50 ms limit the team set for the project. According to the developer, this headroom can be spent on the next iteration of the model. YADRO plans to add multi-label classification in the next release of kvadraOS.
"I plan to use the spare 20 ms to make the model more complex."
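The timing figures in this section fit together as simple arithmetic. The sketch below uses only the approximate numbers quoted in the article (20 ms inference, 30 ms full path, 50 ms target); the function name and the frames-per-second estimate, which assumes frames are processed strictly one after another, are illustrative additions.

```python
def latency_budget(target_ms: float, pipeline_ms: float, inference_ms: float) -> dict:
    """Break a per-frame time budget into its parts."""
    return {
        # Time left before the target limit is hit.
        "headroom_ms": target_ms - pipeline_ms,
        # Pre- and post-processing around the model itself.
        "non_model_ms": pipeline_ms - inference_ms,
        # Upper bound on throughput if frames are handled sequentially.
        "max_fps": 1000.0 / pipeline_ms,
    }

budget = latency_budget(target_ms=50.0, pipeline_ms=30.0, inference_ms=20.0)
print(budget)
```

With these inputs the headroom comes out at 20 ms, which matches the "spare 20 ms" in the quote, and roughly 10 ms of the full path is spent outside the model.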
What it means
YADRO demonstrated something important for edge AI: even on a tablet, near-real-time recognition of a complex scene is achievable if you assemble the dataset correctly, split the classification heads, and do not try to solve everything with one universal model.
For the market, this is another signal that useful CV functions will increasingly run locally rather than in the cloud.