IBM releases Granite 4.0 3B Vision for extracting data from documents and charts
AI-processed from Hugging Face Blog; edited by Hamidun News
IBM and its Granite team have released Granite 4.0 3B Vision, a compact multimodal model designed for enterprise documents. It is not aimed at general visual tasks. Instead, it targets the extraction of structured data from tables, charts, forms, and complex PDFs.
What the model can do
IBM focuses on practical scenarios where general-purpose multimodal models often fail because of complex layouts and the need to link text accurately to its visual context. Granite 4.0 3B Vision can read tables with multi-level row and column headers, convert charts into machine-readable formats, and find semantic key-value pairs in forms and invoices. The model also retains an image description mode: you can give it a document or image and ask for a detailed explanation of its contents.
- Extracting tables from document images, including complex structures
- Converting charts into CSV, text descriptions, or code
- Finding semantic key-value pairs in forms, invoices, and questionnaires
- Working standalone or within a pipeline with Docling
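As a rough illustration of why chart-to-CSV output is useful downstream, the sketch below parses a CSV reply with Python's standard `csv` module. The reply text, column names, and the `parse_chart_csv` helper are invented for this example and are not taken from IBM's documentation.

```python
import csv
import io

# Hypothetical model reply to a chart-to-CSV request (invented for illustration).
model_reply = """quarter,revenue_musd
Q1,12.5
Q2,14.0
Q3,13.2
"""


def parse_chart_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text.strip()))
    return [dict(row) for row in reader]


rows = parse_chart_csv(model_reply)

# Once the chart is tabular, ordinary data code applies.
total = sum(float(row["revenue_musd"]) for row in rows)
print(rows[0])
print(round(total, 1))
```

Real model output may need validation (for example, checking that every row has the same number of fields) before it is trusted.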
How Granite is built
IBM attributes the model's quality to three technical decisions. The first is the ChartNet dataset for chart understanding. It contains 1.7 million synthetic and filtered examples covering 24 chart types and 6 visualization libraries. Each sample has five linked representations: the plotting code, the rendered image, the underlying data table, a text description, and a set of Q&A pairs. This annotation teaches the model not just to describe a picture, but to recover the data structure and meaning of a chart.
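One way to picture a sample with five linked representations is as a single record. The sketch below is a guess at the shape, not ChartNet's actual schema: the class name, field names, and task labels are all hypothetical.

```python
from dataclasses import dataclass, field


def table_to_csv(rows: list[dict[str, object]]) -> str:
    """Render a list of row dictionaries as CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    header = list(rows[0].keys())
    lines = [",".join(header)]
    for row in rows:
        lines.append(",".join(str(row[key]) for key in header))
    return "\n".join(lines)


@dataclass
class ChartSample:
    """Hypothetical record holding the five representations of one chart."""
    plot_code: str                      # code that draws the chart
    image_path: str                     # rendered image
    data_table: list[dict[str, object]]  # underlying data
    summary: str                        # text description
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)

    def training_targets(self) -> dict[str, str]:
        """Derive several text targets from the same image (task names are illustrative)."""
        return {
            "chart2csv": table_to_csv(self.data_table),
            "chart2summary": self.summary,
            "chart2code": self.plot_code,
        }


sample = ChartSample(
    plot_code="plt.bar(['A', 'B'], [3, 5])",
    image_path="chart_0001.png",
    data_table=[{"label": "A", "value": 3}, {"label": "B", "value": 5}],
    summary="Bar B is higher than bar A.",
    qa_pairs=[("Which bar is highest?", "B")],
)
targets = sample.training_targets()
print(targets["chart2csv"])
```

Because every target is derived from the same chart, the representations stay consistent with one another, which is the property the article credits for teaching data recovery rather than captioning.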
The second decision is a variant of the DeepStack Injection architecture. IBM separates visual features by type: more abstract features are fed to early layers for semantic understanding, while highly detailed features go to later layers to preserve precision when tying elements to their location on the page. The third decision is modular packaging. Granite 4.0 3B Vision ships as a LoRA adapter on top of Granite 4.0 Micro, so the same deployment can serve both multimodal requests and regular text tasks without a separate model. For enterprise stacks, this matters more than simply adding parameters.
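The appeal of LoRA packaging can be shown with the standard LoRA formulation, where a frozen weight matrix W is augmented by a low-rank update B·A. The sketch below is a minimal pure-Python illustration; the `LoRALinear` class, the `adapter_enabled` flag, and the toy numbers are assumptions for this example and say nothing about how IBM's serving stack actually switches modes.

```python
def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class LoRALinear:
    """A frozen linear layer with an optional low-rank (LoRA) update.

    Output is W x when the adapter is off, and W x + scale * B (A x) when it is on.
    """

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight    # frozen base weights, shape (out, in)
        self.lora_a = lora_a    # down-projection, shape (rank, in)
        self.lora_b = lora_b    # up-projection, shape (out, rank)
        self.scale = scale
        self.adapter_enabled = False

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        if not self.adapter_enabled:
            return base  # plain text request: base model behavior, untouched
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],        # rank 1
    lora_b=[[0.5], [-0.5]],
)
x = [2.0, 3.0]

text_output = layer.forward(x)      # adapter off
layer.adapter_enabled = True
vision_output = layer.forward(x)    # adapter on
print(text_output, vision_output)
```

The base weights are never modified, so one copy of the model in memory can answer both kinds of request.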
Results on benchmarks
On benchmarks, the model outperforms many larger competitors. On the ChartNet validation set, it achieved the best result on Chart2Summary (86.4%) and ranked second on Chart2CSV with 62.1%, behind only Qwen3.5-9B, which is more than twice as large. In table extraction, Granite leads on several tests: 92.1 on cropped PubTablesV2, 79.3 on full-page PubTablesV2, 64.0 on OmniDocBench, and 88.1 on TableVQA. For KVP tasks on VAREX, the model reached 85.5% exact match in zero-shot mode.
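The article does not say how VAREX scores exact match. As one plausible reading, the sketch below counts a field as correct only when the predicted value equals the reference after whitespace and case normalization. The function names, the normalization rule, and the invoice fields are assumptions made for illustration.

```python
def normalize(value: str) -> str:
    """Collapse whitespace and ignore case before comparing values."""
    return " ".join(value.split()).casefold()


def exact_match_rate(predicted: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of reference fields whose predicted value matches exactly after normalization."""
    if not gold:
        return 0.0
    hits = sum(
        1
        for key, expected in gold.items()
        if key in predicted and normalize(predicted[key]) == normalize(expected)
    )
    return hits / len(gold)


# Invented invoice fields; "currency" is deliberately wrong in the prediction.
gold = {
    "invoice_number": "INV-1042",
    "total": "1,250.00",
    "currency": "USD",
    "due_date": "2025-01-31",
}
predicted = {
    "invoice_number": "inv-1042",
    "total": "1,250.00",
    "currency": "EUR",
    "due_date": "2025-01-31",
}

score = exact_match_rate(predicted, gold)
print(score)  # 0.75
```

Exact match is strict by design: a value that is nearly right still scores zero, which suits invoices and forms where a single wrong digit matters.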
IBM also describes two deployment modes. In the simple variant, the model runs as a standalone extraction tool for individual images such as forms, receipts, or charts. In the larger scenario, it connects to Docling, which handles OCR, layout parsing, visual element detection, and fragment segmentation. Granite then receives already-prepared tables and figures, which reduces compute cost and increases throughput on large document archives.
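The cost saving in the pipeline mode comes from calling the vision model only on the regions that need it. The sketch below shows that routing idea with hypothetical types; it does not use Docling's real API, and the `Region` class, region kinds, and task names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical layout region produced by an upstream document parser."""
    kind: str                        # e.g. "text", "table", "figure"
    page: int
    bbox: tuple[int, int, int, int]  # left, top, right, bottom in pixels


# Only these region kinds are sent to the vision model (illustrative mapping).
TASK_FOR_KIND = {
    "table": "table_extraction",
    "figure": "chart_to_csv",
}


def plan_vlm_calls(regions: list[Region]) -> list[tuple[str, Region]]:
    """Pair each table or figure with a task; plain text regions are skipped."""
    return [
        (TASK_FOR_KIND[region.kind], region)
        for region in regions
        if region.kind in TASK_FOR_KIND
    ]


regions = [
    Region("text", 1, (40, 40, 560, 200)),
    Region("table", 1, (40, 220, 560, 480)),
    Region("text", 2, (40, 40, 560, 300)),
    Region("figure", 2, (40, 320, 560, 700)),
    Region("text", 3, (40, 40, 560, 760)),
]

calls = plan_vlm_calls(regions)
print(len(calls), "vision calls for", len(regions), "regions")
```

In this toy document, three of the five regions never reach the vision model, which is the kind of saving that adds up across a large archive.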
What it means
For the enterprise AI market, this signals that the race is not only about large universal models. IBM shows a different path: a compact VLM that solves a narrow but expensive business task, namely turning documents, reports, and forms into structured data. If the quality holds up in real deployments, such models will reach production systems faster than heavier multimodal platforms.