NVIDIA introduced Nemotron OCR v2: multilingual OCR trained on 12.2 million synthetic documents
AI-processed from Hugging Face Blog; edited by Hamidun News
NVIDIA has described how it built Nemotron OCR v2, a multilingual OCR system whose main gains came not from clever architectural tricks but from synthetic data at scale. The company assembled a dataset of 12.26 million artificially generated documents and used it to train a model that reads multiple languages with a single engine and processes up to 34.7 pages per second on a single A100.
Focus on Data
The previous version, Nemotron OCR v1, handled English confidently but failed on other languages. On the SynthDoG benchmark, Normalized Edit Distance (NED, where 0 means a perfect match) was very high for Japanese, Korean, Russian, and Chinese: the output sometimes barely resembled the original text. The team expanded the character set from 855 to 14,244 characters to cover Cyrillic and CJK scripts, but the improvement was small. The model could formally output the required characters, but it had barely seen them during training.
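To make the metric concrete, here is a minimal sketch of one common way to compute a normalized edit distance: the Levenshtein distance divided by the length of the longer string. The article does not say which normalization NVIDIA uses, so the denominator and the function names are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(prediction, reference) / longest
```

Under this definition, 0.0 means the prediction matches the reference exactly and 1.0 means nothing matches, which is why v1 scores above 0.5 indicate output that barely resembles the source text.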
"The bottleneck was data, not architecture."
This became the turning point of the project. Manually annotating millions of documents with boxes at the word, line, and paragraph levels would have been too expensive, and scraping PDFs from the web yields noisy text layers full of errors. So NVIDIA took a different path: generate documents programmatically, so that the exact coordinates, transcription, and reading order of every fragment are known in advance.
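The core idea, that a generator which places the text also knows every label for free, can be sketched in a few lines. This is a toy layout routine and not NVIDIA's pipeline: the fixed `char_width` metric, the page dimensions, and the `WordBox` type are all hypothetical stand-ins for a real font renderer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordBox:
    text: str    # exact transcription, known because we placed it
    x: int       # left edge in pixels
    y: int       # top edge in pixels
    width: int
    height: int
    order: int   # index in reading order


def layout_words(text: str, page_width: int = 400, char_width: int = 10,
                 line_height: int = 20, margin: int = 20) -> List[WordBox]:
    """Place words left to right, wrapping at the right margin.

    Assumes a monospaced font where every character is `char_width` wide.
    """
    boxes: List[WordBox] = []
    x, y = margin, margin
    for order, word in enumerate(text.split()):
        width = len(word) * char_width
        # Wrap to the next line if the word would cross the right margin,
        # unless we are already at the start of a line.
        if x + width > page_width - margin and x > margin:
            x = margin
            y += line_height
        boxes.append(WordBox(word, x, y, width, line_height, order))
        x += width + char_width  # advance past the word and one space
    return boxes
```

A real renderer would then draw each word at its box and apply augmentations, but the annotation is already complete before a single pixel is drawn.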
How the Corpus Was Built
For text, NVIDIA used mOSCAR, a large multilingual web corpus with 163 language subsets. Instead of dictionary word lists or machine-generated text, this provides realistic phrases with natural word and character distributions. As a rendering engine, the company took SynthDoG from the Donut project and substantially reworked it. The output is not just page images but a complete hierarchical annotation at the word, line, and paragraph levels, plus a relationship graph that defines the reading order.
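One way to picture such an annotation is as nested words, lines, and paragraphs with a separate list of directed edges between lines. The schema below is a hypothetical illustration, not the format NVIDIA ships, and it assumes the reading-order graph is a simple chain in which each line has at most one successor.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def union(boxes: List[BBox]) -> BBox:
    """Smallest box that encloses all the given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


@dataclass
class Word:
    text: str
    bbox: BBox


@dataclass
class Line:
    words: List[Word]

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def bbox(self) -> BBox:
        return union([w.bbox for w in self.words])


@dataclass
class Paragraph:
    lines: List[Line]

    @property
    def bbox(self) -> BBox:
        return union([line.bbox for line in self.lines])


@dataclass
class Page:
    paragraphs: List[Paragraph]
    # Directed edges (from_line, to_line) over the flattened line indices.
    reading_edges: List[Tuple[int, int]] = field(default_factory=list)

    def lines(self) -> List[Line]:
        return [line for p in self.paragraphs for line in p.lines]

    def reading_order(self) -> List[str]:
        """Follow the edge chain from the line that has no incoming edge."""
        lines = self.lines()
        if not lines:
            return []
        successor = dict(self.reading_edges)
        targets = set(successor.values())
        start = next((i for i in range(len(lines)) if i not in targets), 0)
        ordered, current = [], start
        while current is not None and len(ordered) < len(lines):
            ordered.append(lines[current].text)
            current = successor.get(current)
        return ordered
```

Keeping the reading order as explicit edges, rather than relying on storage order, is what lets layouts such as multiple columns or tables be read in the intended sequence.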
The pipeline added several important elements to make the synthetic data closer to real documents:
- multi-template layouts: columns, tables, vertical text, tables of contents, slides, and Word-style pages
- annotation not only by boxes but also by element hierarchy and relationships between lines
- transition to line-based recognition for Japanese, Korean, and Chinese, where word boundaries are often ambiguous
- a large pool of open fonts, from 165 to 1,258 per language, including Google Fonts and Noto families
- aggressive augmentations: shadows, outlines, noise, blur, distortions, brightness and background changes
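The switch to line-based recognition in the list above is easy to motivate with a small sketch. Splitting on whitespace gives usable word targets for English, but a Japanese sentence contains no spaces, so the same rule returns the whole sentence as one "word". The language codes and the `recognition_units` helper are hypothetical, used only to illustrate the choice.

```python
# Languages for which the whole line is the recognition target (assumed codes).
LINE_LEVEL_LANGUAGES = {"ja", "ko", "zh"}


def recognition_units(line_text: str, language: str) -> list:
    """Return the training targets for one rendered line of text."""
    if language in LINE_LEVEL_LANGUAGES:
        return [line_text]      # one target: the full line
    return line_text.split()    # one target per whitespace-delimited word


english = recognition_units("synthetic data at scale", "en")
japanese = recognition_units("これは日本語の文です", "ja")
```

Here `english` holds four word targets, while `japanese` holds a single line target, which avoids having to guess word boundaries that the script does not mark.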
The resulting dataset contains 12,258,146 examples across six language categories: English, Japanese, Korean, Russian, Simplified Chinese, and Traditional Chinese. The multilingual model itself operates as a single stack for English, Russian, Japanese, Korean, and Chinese, without a separate language detection step. The extension logic is also simple: if a new language has a text corpus and suitable fonts, the pipeline can scale further without manual annotation and without rewriting the architecture.
Speed and Trade-offs
Nemotron OCR v2 was trained not only on synthetic data but also on approximately 680,000 real images. The architecture consists of three parts: a text detector based on RegNetX-8GF, a Transformer-based recognizer, and a relational module that models which lines and blocks are connected. The key idea is that a heavy convolutional backbone processes the page once, and its features are then reused by all other components. This way, the system does not repeat the expensive computation at each pipeline stage.
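The feature-sharing pattern can be sketched without any deep learning library. The class below is only a structural illustration: the `toy_*` functions are stand-ins that operate on lists of strings, not RegNetX-8GF, a Transformer, or NVIDIA's relational module, and every name is hypothetical.

```python
class SharedFeaturePipeline:
    """Runs a heavy feature extractor once and shares its output with three heads."""

    def __init__(self, backbone, detector, recognizer, relational):
        self.backbone = backbone
        self.detector = detector
        self.recognizer = recognizer
        self.relational = relational
        self.backbone_calls = 0  # instrumentation for this example only

    def run(self, page):
        self.backbone_calls += 1
        features = self.backbone(page)              # expensive step, once per page
        regions = self.detector(features)           # where the text is
        texts = [self.recognizer(features, r) for r in regions]  # what it says
        order = self.relational(features, regions)  # how the regions connect
        return {"regions": regions, "texts": texts, "order": order}


# Toy stand-ins: a "page" is a list of strings, one per row.
def toy_backbone(page):
    return {"rows": list(page)}


def toy_detector(features):
    # A row counts as a text region if it is not blank.
    return [i for i, row in enumerate(features["rows"]) if row.strip()]


def toy_recognizer(features, region):
    return features["rows"][region].strip()


def toy_relational(features, regions):
    # Link consecutive regions top to bottom.
    return list(zip(regions, regions[1:]))
```

However many regions the detector finds, `backbone_calls` increases by exactly one per page, because the recognizer and the relational head read from the same `features` object instead of re-encoding the image.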
On the synthetic benchmark, the improvement is substantial. NED dropped from 0.564 in Nemotron OCR v1 to 0.043 in v2 for Russian, from 0.723 to 0.046 for Japanese, from 0.923 to 0.047 for Korean, and from 0.784 to 0.035 for Simplified Chinese. According to NVIDIA, the unified multilingual model even outperformed the specialized per-language variants of PaddleOCR on this dataset. On the real-world OmniDocBench benchmark, the picture is more mixed: Nemotron OCR v2 processes 34.7 pages per second versus 1.2 for PaddleOCR v5, a speed advantage of more than 28x, but on some subsets it lags behind the best competitors in accuracy. The model clearly favors processing speed over maximum quality at any cost.
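The figures quoted above can be checked with a few lines of arithmetic. The numbers are taken from the article; the dictionary layout and helper names are just one way to organize them.

```python
# (v1 NED, v2 NED) on the synthetic benchmark, as quoted in the article.
NED_SCORES = {
    "Russian": (0.564, 0.043),
    "Japanese": (0.723, 0.046),
    "Korean": (0.923, 0.047),
    "Simplified Chinese": (0.784, 0.035),
}

PAGES_PER_SECOND = {"Nemotron OCR v2": 34.7, "PaddleOCR v5": 1.2}


def reduction_factor(before: float, after: float) -> float:
    """How many times smaller the error became (higher is better)."""
    return before / after


def speedup(fast: float, slow: float) -> float:
    """Throughput ratio between two systems."""
    return fast / slow


ned_reductions = {
    language: reduction_factor(v1, v2)
    for language, (v1, v2) in NED_SCORES.items()
}
throughput_ratio = speedup(PAGES_PER_SECOND["Nemotron OCR v2"],
                           PAGES_PER_SECOND["PaddleOCR v5"])

for language, factor in ned_reductions.items():
    print(f"{language}: NED reduced {factor:.1f}x")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

The throughput ratio comes out a little under 29, consistent with the "more than 28x" claim, and each of the four languages shows an error reduction of more than tenfold on the synthetic benchmark.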
What It Means
Nemotron OCR v2 is a good signal for the document AI market: synthetic data is no longer just a demonstration but a practical way to launch multilingual OCR models quickly and extend them to new writing systems. For companies, this means a cheaper path to document recognition, especially where speed, versatility, and control over annotation matter more than setting a record on every benchmark.