Models

Diffusion Model

A diffusion model is a generative AI system that produces images, audio, or video by learning to reverse an iterative noise-addition process, starting from random noise and progressively denoising it through learned steps to produce a coherent output.

A diffusion model is a class of generative model that learns the distribution of real data, most commonly images, by training to undo a stochastic process that gradually corrupts clean data into pure noise. The approach is grounded in non-equilibrium statistical thermodynamics, was first proposed for deep learning by Sohl-Dickstein et al. in 2015, and was popularized by the Denoising Diffusion Probabilistic Models (DDPM) paper by Ho et al. in 2020. Diffusion models now underlie most leading image-generation systems: Stability AI's Stable Diffusion, OpenAI's DALL-E 3, Google's Imagen and Imagen 3, and Adobe Firefly all rely on diffusion or closely related score-based generative modeling.

Training involves two processes. The forward process takes a clean data sample and adds small amounts of Gaussian noise over T steps (often 1,000 or more), producing a sequence that runs from the original image to pure noise; this process is fixed and has no learned parameters. The reverse process is what the neural network learns: given a noisy sample at step t and the corresponding noise level, it predicts the noise that was added so that the noise can be removed. At inference, the model starts from random Gaussian noise and applies this denoising step repeatedly to produce a sample from the learned data distribution. Text-conditioned generation is achieved via cross-attention to text embeddings (from CLIP or T5), and classifier-free guidance strengthens adherence to the text prompt at the cost of sample diversity. Accelerated samplers such as DDIM and DPM-Solver have reduced the required denoising steps from thousands to tens, and consistency distillation can reduce this to a single step. Flow matching, a related training formulation rather than a sampler, also supports sampling in few steps.
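The forward noising, the noise-prediction objective, classifier-free guidance, and a single reverse step described above can be sketched in NumPy as follows. This is a minimal illustration, not any product's implementation: the linear noise schedule and its endpoint values follow common DDPM practice and are assumptions here, and all function names are hypothetical. The neural network itself is omitted; its output is represented by the `eps_pred` arguments.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (assumed values, commonly used with DDPM)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative signal retention up to step t
    return betas, alphas, alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    """Fixed forward process: jump from clean x0 directly to noisy x_t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps  # eps is the training target for the network

def noise_prediction_loss(eps_pred, eps):
    """Simple denoising objective: mean squared error on the noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction toward the prompt."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def reverse_step(x_t, t, eps_pred, betas, alphas, alpha_bars, rng):
    """One DDPM-style sampling step from x_t to x_{t-1}."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
```

In a full sampler, `reverse_step` would be called for t = T-1 down to 0, with `eps_pred` coming from the trained network (combined via `guided_noise` when a prompt is used). A guidance scale of 1 reproduces the conditional prediction; larger values trade diversity for prompt adherence.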

Diffusion models largely supplanted generative adversarial networks (GANs) as the dominant image-generation paradigm after 2022. GANs require a delicate adversarial training balance between generator and discriminator, suffer from mode collapse, and are difficult to scale. Diffusion models train with a simple denoising objective, are stable to optimize, produce diverse high-quality samples, and scale predictably with model and data size. The public release of Stable Diffusion 1.x in August 2022, which allowed running image generation on consumer GPUs, triggered rapid adoption and an explosion of downstream products and research.

As of 2026, diffusion and closely related flow-matching models remain the backbone of commercial image and video generation. Flow matching, used in Stable Diffusion 3, Black Forest Labs' FLUX models, and several video generators, offers a mathematically cleaner training objective and faster inference than DDPM-style diffusion, and is increasingly preferred in new systems. Video diffusion models, including OpenAI's Sora (announced February 2024) and Google's Veo, extend the approach to video by modeling temporal coherence across frames. Beyond visuals, diffusion-based and flow-based approaches are applied to speech synthesis (for example, Meta's Voicebox, which is trained with flow matching) and to molecular design, where models like those from Schrödinger and Insilico Medicine generate candidate drug molecules by denoising in 3-D coordinate space.
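To illustrate why the flow-matching objective is often described as cleaner, here is a minimal NumPy sketch of one common variant: a straight-line path between data and noise, as in rectified flow. The time convention (data at t = 0, noise at t = 1), the Euler integrator, and the function names are assumptions made for illustration and are not taken from any specific system named above.

```python
import numpy as np

def flow_matching_pair(x0, t, rng):
    """Build one training example on a straight path from data to noise."""
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps  # point on the path at time t
    velocity = eps - x0             # regression target for the network
    return x_t, velocity

def euler_sample(velocity_fn, x1, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) back to t = 0 (data)."""
    x = x1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x
```

The network is trained to regress `velocity` from `(x_t, t)` with a mean squared error, and sampling is plain ODE integration, so the number of steps is a direct speed and quality trade-off.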

Example

A graphic designer types a text prompt into Adobe Firefly; the diffusion model starts from random noise and applies a sequence of denoising steps (typically tens with modern samplers) guided by the prompt's text embedding, producing a photorealistic image in a few seconds.

Related terms

Text-to-Image Model, Text-to-Video Model, Multimodal Model, Generative Adversarial Network (GAN)
