Zhipu AI releases GLM-OCR, a compact 0.9-billion-parameter OCR model for documents
Zhipu AI unveiled GLM-OCR, a compact 0.9-billion-parameter multimodal OCR model for parsing real-world documents. The system can handle tables, formulas, and…
AI-processed from MarkTechPost; edited by Hamidun News
Zhipu AI and researchers from Tsinghua University have presented GLM-OCR, a multimodal OCR model with 0.9 billion parameters for parsing real-world documents. The bet is on a balance of quality, speed, and inference cost rather than on maximum size.
Why OCR is difficult
Conventional OCR has long handled clean text on flat scans well, but real documents are far messier. They mix tables, formulas, stamps, handwritten fields, code blocks, columns, and non-standard reading order. This is where classical pipelines break: they can recognize characters but lose the page structure and the relationships between blocks.
Large multimodal models improve document understanding overall, but they bring a different problem: price and speed. If a model reads a page like a general-purpose vision-language system and generates its answer one token at a time, inference becomes expensive and slow. In production, where invoices, contracts, reports, and questionnaires have to be processed at volume, this is a real engineering constraint, not an academic footnote.
How GLM-OCR works
At the core of GLM-OCR is a combination of the CogViT visual encoder with 0.4 billion parameters, a lightweight cross-modal connector, and a GLM language decoder with 0.5 billion parameters.
The main technical idea is Multi-Token Prediction. Instead of predicting strictly one token per step, the model is trained to predict ten tokens per step, and at inference it generates an average of 5.2 tokens per step.
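The effect on the number of decoding steps can be illustrated with simple arithmetic. The sketch below is not GLM-OCR code: it only compares how many decoder steps an output would need at one token per step versus the reported average of 5.2 tokens per step. The function name and the 1,000-token output length are assumptions chosen for illustration.

```python
import math


def decoding_steps(output_tokens: int, tokens_per_step: float) -> int:
    """Return how many decoder steps are needed to emit output_tokens."""
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    return math.ceil(output_tokens / tokens_per_step)


# Assumed output length for one page; not a figure from the report.
OUTPUT_TOKENS = 1000

# Classic autoregressive decoding: one token per step.
baseline_steps = decoding_steps(OUTPUT_TOKENS, 1.0)

# Multi-Token Prediction at the reported inference average of 5.2 tokens per step.
mtp_steps = decoding_steps(OUTPUT_TOKENS, 5.2)

print(f"one token per step: {baseline_steps} steps")
print(f"5.2 tokens per step: {mtp_steps} steps")
```

Fewer steps do not convert one-to-one into wall-clock speed, since each multi-token step may do more work than a single-token step.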
According to the authors, this gives about a 50% throughput improvement, while a parameter-sharing scheme keeps memory use from growing sharply. At the system level, the model also does not read the entire page in one flat pass. First, PP-DocLayout-V3 segments the document into semantic regions, and then GLM-OCR processes the individual regions in parallel.
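The two-stage flow (layout first, then per-region recognition in parallel) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `detect_regions` and `recognize_region` are hypothetical stand-ins for the layout model and for a single GLM-OCR call, and they return canned values so the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    kind: str    # e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels
    order: int   # reading-order index assigned by the layout stage


def detect_regions(page_image) -> list[Region]:
    """Hypothetical stand-in for the layout stage; returns fixed regions."""
    return [
        Region("text", (0, 0, 800, 120), 0),
        Region("table", (0, 140, 800, 520), 1),
        Region("formula", (0, 540, 800, 620), 2),
    ]


def recognize_region(page_image, region: Region) -> str:
    """Hypothetical stand-in for one recognition call on a cropped region."""
    return f"<{region.kind} content from {region.bbox}>"


def parse_page(page_image, max_workers: int = 4) -> str:
    regions = detect_regions(page_image)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Regions are recognized concurrently; map() returns results in input order.
        texts = list(pool.map(lambda r: recognize_region(page_image, r), regions))
    # Reassemble the page following the reading order from the layout stage.
    ordered = sorted(zip(regions, texts), key=lambda pair: pair[0].order)
    return "\n\n".join(text for _, text in ordered)


result = parse_page(page_image=None)
print(result)
```

Splitting the page this way keeps each recognition call short and lets the serving layer batch or parallelize regions, at the cost of depending on the layout stage to find regions and reading order correctly.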
For document parsing, the output is structured Markdown and JSON. For key information extraction (KIE), the full document image is fed in along with a prompt, and the model directly generates JSON that follows a given schema.
- Parses pages by regions before recognition
- Processes found blocks in parallel
- Returns structured Markdown and JSON
- Offers a separate KIE mode for field extraction
- Runs through a cloud API or locally
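To make the KIE mode more concrete, here is a minimal sketch of the prompt-plus-schema idea and of checking the returned JSON on the caller's side. Everything in it is assumed for illustration: the invoice schema, the prompt wording, and `fake_model`, which returns a canned answer in place of a real model call. The actual prompt format GLM-OCR expects may differ.

```python
import json

# Hypothetical schema for an invoice; field names and types are illustrative.
INVOICE_SCHEMA = {"invoice_number": str, "issue_date": str, "total_amount": float}


def build_kie_prompt(schema: dict) -> str:
    """Describe the wanted fields in a plain-text instruction."""
    fields = ", ".join(f'"{name}" ({tp.__name__})' for name, tp in schema.items())
    return "Extract these fields from the document and answer with JSON only: " + fields


def fake_model(image, prompt: str) -> str:
    """Stand-in for a real model call; returns a canned JSON answer."""
    return '{"invoice_number": "INV-001", "issue_date": "2025-01-15", "total_amount": 1280.5}'


def validate(raw: str, schema: dict) -> dict:
    """Parse the model output and check that every field exists with the right type."""
    data = json.loads(raw)
    missing = [name for name in schema if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name, tp in schema.items():
        if not isinstance(data[name], tp):
            raise TypeError(f"field {name!r} should be {tp.__name__}")
    return data


prompt = build_kie_prompt(INVOICE_SCHEMA)
fields = validate(fake_model(image=None, prompt=prompt), INVOICE_SCHEMA)
print(fields)
```

Validating on the caller's side is a design choice worth keeping even with a strong model, since downstream systems usually expect every field to be present and correctly typed.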
What the tests showed
On public benchmarks, the model shows strong results, but without universal leadership. GLM-OCR scored 94.6 on OmniDocBench v1.5, 94.0 on OCRBench for text recognition, 96.5 on UniMERNet for formulas and 86.0 on TEDS_TEST for tables. In field extraction tasks, the model showed 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE. This is a good set of numbers for a model of this size, especially when compared to significantly heavier multimodal systems. At the same time, the authors themselves leave important caveats.
On PubTabNet, GLM-OCR is not first: MinerU 2.5 scores 88.4 against 85.2. And among the reference models, Gemini-3-Pro shows higher results on KIE. So the accurate framing is that GLM-OCR is one of the leaders among open, compact solutions, but it does not beat every alternative in every scenario.
From a practical perspective, the project does not look like a purely laboratory exercise. The authors declare support for vLLM, SGLang and Ollama, as well as fine-tuning through LLaMA-Factory. The report cites throughput of 0.67 images per second and 1.86 PDF pages per second in the authors' test configuration. For cloud use, a MaaS API is available at 0.2 yuan per million tokens; by the team's calculations, one yuan covers approximately 2,000 A4 scans or 200 simple 10-page PDFs.
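The quoted figures can be turned into a rough budget estimate. The sketch below uses only the numbers reported above. The per-page token count is not stated in the report; it is implied by the team's "2,000 A4 scans per yuan" estimate and should be treated as an assumption, as should the 100,000-page workload. Note that 200 PDFs of 10 pages each is also 2,000 pages, so the two estimates agree.

```python
PRICE_YUAN_PER_MILLION_TOKENS = 0.2  # quoted MaaS API price
A4_SCANS_PER_YUAN = 2000             # team's estimate
PDF_PAGES_PER_SECOND = 1.86          # reported throughput in the authors' test setup

# One yuan buys 1 / 0.2 = 5 million tokens.
tokens_per_yuan = 1_000_000 / PRICE_YUAN_PER_MILLION_TOKENS

# Implied (not reported) average token budget per A4 scan.
implied_tokens_per_scan = tokens_per_yuan / A4_SCANS_PER_YUAN


def cost_yuan(pages: int, tokens_per_page: float = implied_tokens_per_scan) -> float:
    """Estimated API cost in yuan for a given number of pages."""
    return pages * tokens_per_page / 1_000_000 * PRICE_YUAN_PER_MILLION_TOKENS


def processing_hours(pages: int, pages_per_second: float = PDF_PAGES_PER_SECOND) -> float:
    """Estimated processing time at the reported single-configuration throughput."""
    return pages / pages_per_second / 3600


PAGES = 100_000  # assumed workload, for illustration only
print(f"implied tokens per scan: {implied_tokens_per_scan:.0f}")
print(f"API cost for {PAGES} pages: {cost_yuan(PAGES):.2f} yuan")
print(f"time for {PAGES} pages: {processing_hours(PAGES):.1f} hours")
```

Real costs depend on how dense the pages are, so the implied per-scan figure is best read as an order-of-magnitude guide.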
What this means
The document AI market is noticeably shifting from giant general-purpose models to more compact, specialized systems where predictable cost matters as much as quality. For businesses, this is a good signal: parsing invoices, contracts, scientific articles, and internal forms is becoming easier to run in production without excessive spending on hardware and inference.