Vision-Language Model (VLM)
A Vision-Language Model (VLM) is an AI model that jointly processes visual inputs (images or video) and natural language text, enabling tasks such as image captioning, visual question answering, and document understanding.
Architecturally, a VLM is a multimodal model that combines a visual encoding component with a language model so that it can reason over images and text together. VLMs can describe images in natural language, answer questions about visual content, ground textual references to specific image regions, and perform optical character recognition in complex layouts. Some generative variants also produce images conditioned on text prompts. The term VLM conventionally refers to the vision-text pairing specifically, which distinguishes it from broader multimodal systems that additionally handle audio or structured data.
The dominant VLM architecture pairs a pretrained vision encoder with a decoder-only language model. The encoder is most commonly a Vision Transformer (ViT), often one pretrained with a CLIP-style contrastive objective. The encoder turns image patches into dense embeddings, and a connector (a linear or MLP projection, or a cross-attention module) maps them into the language model's token embedding space. The language model then processes the combined sequence of visual and textual tokens and generates its output autoregressively. This design, used in LLaVA, PaliGemma (Google), InternVL (Shanghai AI Lab), and Qwen-VL (Alibaba), allows instruction fine-tuning to transfer the language model's existing conversational capabilities to the visual domain. CLIP-style contrastive pretraining of the vision encoder on hundreds of millions of image-text pairs typically provides the initial cross-modal alignment.
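The data flow described above (patches, encoder, projection, concatenation with text embeddings) can be illustrated with a minimal pure-Python sketch. It is not the implementation of any model named here: the function names (patchify, make_linear, linear, build_input_sequence), the dimensions, and the random weights are all hypothetical, and a single linear layer stands in for the full vision encoder, which in a real ViT would also include positional embeddings and transformer layers.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "size must be a multiple of patch"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def linear(weights, vector):
    """Matrix-vector product: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_input_sequence(image, text_token_ids, patch,
                         vision_dim=16, lm_dim=8, vocab_size=32, seed=0):
    """Return the combined list of visual and text embeddings, all of width lm_dim."""
    rng = random.Random(seed)
    patches = patchify(image, patch)

    # Stand-in vision encoder: a single linear patch embedding.
    encoder = make_linear(len(patches[0]), vision_dim, rng)
    # Connector: projects vision features into the language model's embedding space.
    projector = make_linear(vision_dim, lm_dim, rng)
    # Stand-in for the language model's token embedding table.
    embedding_table = make_linear(lm_dim, vocab_size, rng)

    visual_tokens = [linear(projector, linear(encoder, p)) for p in patches]
    text_tokens = [embedding_table[token_id] for token_id in text_token_ids]

    # Visual tokens are placed before the text tokens; the language model
    # would then attend over this single sequence.
    return visual_tokens + text_tokens


image = [[[0.5, 0.5, 0.5] for _ in range(8)] for _ in range(8)]  # 8x8 RGB image
sequence = build_input_sequence(image, text_token_ids=[1, 5, 7], patch=4)
print(len(sequence), len(sequence[0]))  # prints "7 8": 4 visual + 3 text tokens
```

The number of visual tokens is fixed by the image and patch sizes: an 8x8 image with 4-pixel patches gives (8/4)^2 = 4 tokens, and by the same arithmetic a 336x336 input with 14-pixel patches gives (336/14)^2 = 576. This is one reason image resolution has a direct effect on sequence length and inference cost in this design.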
VLMs are practically significant because a large share of real-world information is embedded in visual form: invoices, scientific figures, engineering schematics, satellite imagery, and medical scans. A VLM can parse a photographed invoice and extract line-item data, read a pathology slide image and flag abnormalities, or interpret a floor plan and answer spatial questions about room adjacency. These tasks previously required a purpose-built computer vision pipeline for each document type.
As of 2026, high-capability VLMs are available both as commercial APIs and as open-weight models. Proprietary models such as GPT-4V, GPT-4o, Gemini 2.0, and Claude with vision report strong results on benchmarks such as MMMU (Massive Multidiscipline Multimodal Understanding) and DocVQA. Open-weight checkpoints including LLaVA-NeXT, PaliGemma 2, and InternVL2 are widely deployed in research and production. Top models approach human-level performance on several visual question-answering benchmarks, but fine-grained spatial reasoning, precise object counting, and reading very small or degraded text remain active areas of improvement.