Models

Multimodal Model

A multimodal model is an AI system that processes and generates data across more than one modality — such as text, images, audio, or video — within a single unified architecture.

The most common combination is text and images, but frontier systems in 2026 also handle audio, video, structured tables, and code. Unlike a pipeline that chains separate unimodal models, a multimodal model processes all inputs jointly, which allows cross-modal reasoning in a single inference pass. For example, it can answer a question that requires reading text embedded in an image and combining it with the surrounding conversational context.

Most architectures pair modality-specific encoders with a central language model backbone. A vision encoder — typically a Vision Transformer (ViT) pretrained with contrastive objectives such as CLIP — converts image patches into dense embeddings. A lightweight projection layer (an MLP or cross-attention module) maps these into the language model's token embedding space, allowing the autoregressive decoder to attend to visual and textual tokens together. Audio and video inputs are handled by analogous encoders. Some systems, such as GPT-4o, go further and train a single model end-to-end across modalities rather than composing separate modules.
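The encoder-plus-projection design described above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not any particular model's implementation: the "vision encoder" is a single random linear map standing in for a pretrained ViT, the projector is one linear layer rather than an MLP or cross-attention module, and every name and size here (`patchify`, `build_input_sequence`, `PATCH`, `D_VISION`, `D_MODEL`, `VOCAB`) is hypothetical. It only shows how image patches become vectors of the same width as text token embeddings, so that a decoder could attend over one combined sequence.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def linear(vectors, weights):
    """Bias-free linear map: each vector (length d_in) times weights (d_in x d_out)."""
    d_out = len(weights[0])
    return [[sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(d_out)]
            for v in vectors]


def random_matrix(rows, cols, rng):
    """Small random weights; a real system would load trained parameters instead."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
PATCH, D_VISION, D_MODEL, VOCAB = 4, 8, 16, 100  # toy sizes, chosen for illustration

encoder_w = random_matrix(PATCH * PATCH, D_VISION, rng)  # stand-in for a pretrained ViT
projector_w = random_matrix(D_VISION, D_MODEL, rng)      # stand-in for the projection layer
token_table = random_matrix(VOCAB, D_MODEL, rng)         # stand-in for the LM embedding table


def build_input_sequence(image, token_ids):
    """Return one sequence of D_MODEL-wide vectors: visual tokens, then text tokens."""
    patch_vectors = patchify(image, PATCH)
    visual_features = linear(patch_vectors, encoder_w)   # "encode" each patch
    visual_tokens = linear(visual_features, projector_w)  # map into the LM embedding space
    text_tokens = [token_table[t] for t in token_ids]     # ordinary embedding lookup
    return visual_tokens + text_tokens


image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8 x 8 toy image
sequence = build_input_sequence(image, [5, 17, 42])
print(len(sequence), len(sequence[0]))  # 4 image patches + 3 text tokens, each 16 wide
```

The same arithmetic scales up: a 224 x 224 image cut into 16 x 16 patches yields 14 x 14 = 196 visual tokens, which is why image inputs consume a noticeable share of a model's context window.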

Multimodal capability is significant because real-world information rarely arrives in a single format. Scientific papers combine text, figures, and equations; customer service involves speech and screen content; manufacturing inspection depends on images and sensor streams. A multimodal model can replace entire pipelines of specialized tools, reducing latency, integration complexity, and failure modes at handoff points between components.

As of 2026, native multimodality is a baseline expectation for frontier AI products. GPT-4o, Gemini 2.0 and 2.5, and Claude 3.7 and 4 all accept text and image inputs, and some of them (the Gemini models, for instance) also accept audio and video. Open-weight multimodal models, including Llama 3.2 Vision, Qwen2-VL, and InternVL2, have substantially closed the gap with proprietary systems on standard benchmarks. Research focus has shifted toward any-to-any generation: systems that produce images, audio, or video as fluidly as text.

Example

An analyst uploads a 40-page earnings report containing embedded charts and footnoted tables to a multimodal model and asks it to identify the three largest year-over-year revenue changes. The model reads the charts and tables in context, cross-references the textual discussion, and returns a ranked answer that cites specific page locations.

Related terms

Vision-Language Model (VLM), Large Language Model (LLM), Text-to-Image Model, Speech Recognition (ASR)
