Models

Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a deep learning architecture that applies learned filters, whose weights are shared across positions, to local patches of the input (most commonly images) in order to detect hierarchical spatial features such as edges, textures, and objects.

A CNN stacks three main kinds of operations: convolutional layers that slide small filters across the input to produce feature maps, nonlinear activation functions (typically ReLU) applied element-wise, and pooling layers that downsample feature maps, reducing spatial resolution and giving a degree of invariance to small translations. In a deep CNN, early layers detect low-level features such as edges and corners, while deeper layers combine these into high-level semantic concepts such as faces or vehicles.
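
The convolution, activation, and pooling steps described above can be sketched in plain Python. This is a toy illustration only, assuming a single-channel input, stride 1, and no padding; the function names `conv2d`, `relu`, and `max_pool` are hypothetical helpers written for this example, and the kernel is hand-picked rather than learned as it would be in a real CNN.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    return the resulting feature map as a list of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise ReLU: negative responses are clamped to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping size x size max pooling; ragged edges are dropped."""
    return [
        [
            max(fmap[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hand-picked vertical-edge detector (assumption: a trained network
# would learn its filter values from data instead).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d(image, kernel)      # 4x4 map, strongest at the edge
features = max_pool(relu(feature_map))   # 2x2 map after downsampling
print(features)
```

The same nine kernel values are reused at every position of the image, which is the weight sharing the entry refers to.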

The weight-sharing property of convolutions is the key efficiency insight: the same filter is applied at every spatial location, drastically reducing the number of parameters compared to fully connected layers on the same input. A 3×3 convolutional filter on an RGB image has only 27 weights (3 × 3 × 3, plus one bias) regardless of image size, whereas a single fully connected neuron over a 224×224 RGB image would need 224 × 224 × 3 = 150,528 weights. This inductive bias toward local spatial patterns also makes CNNs sample-efficient on image-structured data.
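
The arithmetic behind this comparison can be checked with a short sketch. The helper names `conv_weights` and `dense_weights` are hypothetical, biases are left out, and the dense case assumes a single neuron connected to every value of a 224×224 RGB input.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels=1):
    """Weights in a convolutional layer (biases excluded).
    Note that the image size does not appear in the formula."""
    return kernel_h * kernel_w * in_channels * out_channels


def dense_weights(height, width, channels, units=1):
    """Weights in a fully connected layer that reads every input value
    (biases excluded)."""
    return height * width * channels * units


conv_w = conv_weights(3, 3, 3)         # one 3x3 filter over RGB
dense_w = dense_weights(224, 224, 3)   # one neuron over a 224x224 RGB image

print(conv_w, dense_w, dense_w // conv_w)
```

Under these assumptions the single dense neuron needs several thousand times as many weights as the filter, and the gap grows with image size while the filter's count stays fixed.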

CNNs produced landmark results that drove the modern deep learning era. AlexNet (Krizhevsky et al., 2012) won the ILSVRC 2012 ImageNet challenge by a wide margin over traditional computer vision methods, triggering broad adoption of deep networks. Subsequent architectures, including VGG (2014), ResNet (2015, whose residual skip connections made it practical to train networks more than 100 layers deep), EfficientNet (2019), and ConvNeXt (2022), pushed accuracy further. CNNs are also the backbone of object detection (YOLO series, Faster R-CNN), image segmentation (U-Net, originally developed for biomedical imaging), and video analysis systems.

As of 2026, CNNs coexist with Vision Transformers (ViT) and hybrid architectures. Pure attention-based models lead on several benchmarks, but CNNs remain preferred for real-time mobile inference, medical imaging, satellite imagery, and industrial inspection due to their computational efficiency. ConvNeXt V2 and the hybrid convolution-attention model FastViT exemplify continued refinement of convolutional designs, and CNN backbones remain embedded in production systems at Google, Meta, and NVIDIA.

Example

An autonomous driving system uses a YOLO-variant CNN running at 30 frames per second on an onboard GPU to detect pedestrians, vehicles, and traffic signs in the front camera feed with low latency.
