TII Releases Falcon Perception — 0.6B Model for Object Segmentation and Text-Based Search

AI-processed from MarkTechPost; edited by Hamidun News

On April 1, 2026, TII presented Falcon Perception, a compact multimodal model with 0.6 billion parameters that finds and segments objects in an image from plain text queries, without a fixed class list. For the market, this is an important signal: visual understanding tasks that have long been solved with complex pipelines of separate modules can now be handled by a single unified architecture that ships under an open license and does not need a giant model.

Most modern computer vision systems are still built on a modular scheme: one encoder extracts visual features, another block mixes them with text, and then a separate decoder predicts bounding boxes, masks, or answers. This approach works, but scales poorly: each new type of error is usually fixed with a new module, and the interaction between language and images remains limited.

In Falcon Perception, the Technology Innovation Institute team from Abu Dhabi bets on an early fusion approach: image and text enter a common sequence of tokens from the first transformer layer. Architecturally, the model is structured as a single Transformer with a hybrid attention scheme. Image tokens see each other bidirectionally and gather global visual context, while text and auxiliary tokens are decoded causally, relying on the already-processed image.
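
The hybrid attention scheme described above can be illustrated with a small mask builder. This is a minimal sketch, not TII's implementation: it assumes that image tokens occupy the first positions of the sequence and that text and auxiliary tokens follow them. The function name and the layout are hypothetical.

```python
from typing import List


def hybrid_attention_mask(num_image: int, num_text: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k.

    Assumed layout (not confirmed by the source): image tokens occupy
    positions [0, num_image), text and auxiliary tokens come after them.
    """
    total = num_image + num_text
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_image:
                # Image queries see every image token in both directions,
                # but never look ahead at text tokens.
                mask[q][k] = k < num_image
            else:
                # Text/auxiliary queries see the whole image plus the
                # text tokens at or before their own position (causal).
                mask[q][k] = k <= q
    return mask


if __name__ == "__main__":
    # Print a small mask: "x" means attention is allowed, "." means blocked.
    for row in hybrid_attention_mask(num_image=3, num_text=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With three image tokens and two text tokens, the first three rows allow attention to all three image positions only, while each text row additionally allows attention to itself and to earlier text positions.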

For each detected object, the model goes through a short chain of steps: it first predicts the center coordinates, then the size, and finally a segmentation mask. This interface handles a variable number of objects, from zero to hundreds in a single image, and keeps mask generation from becoming computationally expensive.
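
To make the per-object chain more concrete, here is a rough sketch of how a stream of center and size predictions could be grouped into instances and turned into boxes. The flat `[cx, cy, w, h, ...]` layout, the normalized coordinates, and the names `Instance` and `decode_instances` are all assumptions for illustration. The source does not describe the model's actual output encoding, and the mask step is left out for the same reason.

```python
from dataclasses import dataclass
from typing import List, Tuple


def _clip(value: float) -> float:
    """Clamp a normalized coordinate to the [0, 1] range."""
    return min(1.0, max(0.0, value))


@dataclass
class Instance:
    center: Tuple[float, float]  # (cx, cy), assumed normalized to [0, 1]
    size: Tuple[float, float]    # (width, height), assumed normalized to [0, 1]

    def box(self) -> Tuple[float, float, float, float]:
        """Corner-format box (x0, y0, x1, y1), clipped to the image."""
        cx, cy = self.center
        w, h = self.size
        return (_clip(cx - w / 2), _clip(cy - h / 2),
                _clip(cx + w / 2), _clip(cy + h / 2))


def decode_instances(stream: List[float]) -> List[Instance]:
    """Group a hypothetical flat [cx, cy, w, h, ...] stream into instances.

    An empty stream yields zero objects, so "nothing found" needs no
    special case.
    """
    if len(stream) % 4 != 0:
        raise ValueError("stream length must be a multiple of 4")
    return [
        Instance((stream[i], stream[i + 1]), (stream[i + 2], stream[i + 3]))
        for i in range(0, len(stream), 4)
    ]


if __name__ == "__main__":
    for obj in decode_instances([0.5, 0.5, 0.25, 0.5]):
        print(obj.box())
```

Predicting the center before the size means each later step can condition on the earlier one, which is the property the step ordering in the text relies on.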

Under the hood, Falcon Perception rests on substantial data preparation. The model was initialized by distillation from DINOv3 and SigLIP2 to combine strong local visual features with better language grounding. It was then trained on a dataset of 54 million images, 195 million positive text expressions, and 488 million hard negative examples. For automatic annotation and filtering, an ensemble of SAM 3, Qwen3-VL-30B, and Moondream3 was used, with disputed cases sent for manual review.

Separately, TII introduced PBench — a new diagnostic benchmark that breaks down results by complexity levels: from simple objects to OCR hints, spatial relationships, and dense scenes with hundreds of instances.

On the metrics, the release looks convincing. On SA-Co, one of the open segmentation benchmarks, Falcon Perception achieved 68.0 Macro-F1 against 62.3 for SAM 3. The gain is most notable where plain object recognition is not enough: attributes and subtypes, queries that refer to text visible in the frame, and spatial phrasings such as "car on the left" or "third window from the left."
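
For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of per-category F1 scores, so rare and hard categories count as much as common ones. The sketch below shows the generic definition only. The category names and counts are invented, and SA-Co's exact evaluation protocol may differ in details such as how empty categories are handled.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives.

    Returning 0.0 for an empty category is a convention chosen here,
    not something taken from the benchmark.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of per-category F1 scores."""
    if not counts:
        raise ValueError("at least one category is required")
    return sum(f1(*c) for c in counts.values()) / len(counts)


if __name__ == "__main__":
    # Invented (tp, fp, fn) counts, for illustration only.
    example = {
        "car on the left": (8, 2, 2),            # F1 = 0.8
        "third window from the left": (1, 1, 1),  # F1 = 0.5
    }
    print(round(macro_f1(example), 3))
```

In this toy example the easy category scores 0.8 and the hard one 0.5, and the macro average of about 0.65 weights them equally regardless of how many instances each contains.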

On PBench, the gap on simple objects is small, but it reaches 21.9 points on spatial tasks, 13.4 on OCR-guided queries, and 15.8 on relational tasks. The weak point for now is presence calibration: on MCC, the model trails SAM 3 with a score of 0.64 versus 0.82, meaning that on hard negative queries it still errs more often when deciding whether the object is present at all.
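
MCC here is the Matthews correlation coefficient, which scores a binary decision (in this case "object present" versus "object absent") using all four cells of the confusion matrix. The sketch below implements the standard formula. The counts are invented for illustration and are not taken from PBench.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary presence decision.

    Ranges from -1 (always wrong) through 0 (no better than chance)
    to +1 (perfect). Returns 0.0 when the denominator vanishes, which
    is a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Invented counts: 50 queries with the object present, 50 without.
    print(mcc(tp=40, tn=40, fp=10, fn=10))  # 10 false alarms, 10 misses
    print(mcc(tp=45, tn=45, fp=5, fn=5))    # 5 false alarms, 5 misses
```

Because false positives on negative queries pull the numerator down directly, a model that often reports an object that is not there is penalized even if its masks are accurate when the object does exist.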

The model also has a pragmatic side. Falcon Perception is released under Apache 2.0, available on Hugging Face and GitHub, and designed not only for lab experiments but also for practical deployment. Inference uses a stack based on PyTorch FlexAttention and paged KV cache; according to the team, on H100 typical latencies are around 100 ms for prefill, around 200 ms for feature upsampling, and approximately 50 ms for decoding multiple instances.
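
As a back-of-the-envelope reading of these figures, the stages can be summed into a rough per-image budget. This sketch assumes the three stages run strictly one after another for a single image with no batching or overlap, which the source does not state, so the result is only an order-of-magnitude estimate.

```python
from typing import Dict

# Approximate per-stage latencies on H100 as quoted by the team (milliseconds).
STAGE_LATENCY_MS: Dict[str, int] = {
    "prefill": 100,
    "feature_upsampling": 200,
    "instance_decoding": 50,
}


def total_latency_ms(stages: Dict[str, int]) -> int:
    """Sum of stage latencies, assuming strictly sequential execution."""
    return sum(stages.values())


def images_per_second(stages: Dict[str, int]) -> float:
    """Single-stream throughput implied by the sequential total."""
    return 1000 / total_latency_ms(stages)


if __name__ == "__main__":
    print(total_latency_ms(STAGE_LATENCY_MS))             # 350
    print(round(images_per_second(STAGE_LATENCY_MS), 2))  # 2.86
```

Under that assumption, feature upsampling accounts for more than half of the roughly 350 ms total.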

TII also demonstrated that the same early fusion recipe transfers to OCR: the accompanying Falcon OCR model with 0.3 billion parameters scored 80.3 on olmOCR and 88.64 on OmniDocBench.

The main takeaway here is not that TII released another compact vision-language model. It is far more important that Falcon Perception demonstrates the viability of a simpler and more unified approach to visual understanding: one architecture, one common stack, and fewer workarounds between language and vision. If the team improves presence calibration and reduces the number of false positives on hard negative scenarios, Falcon has a chance to become a strong foundation for assistants, robotics, visual search, and interfaces where an image needs to be understood from human text rather than from a predetermined class list.
