Training

Gradient Descent

Gradient descent is an iterative optimization algorithm that trains machine learning models by repeatedly adjusting parameters in the direction that most reduces a loss function, using partial derivatives computed via backpropagation to guide each update step.

Gradient descent is the foundational optimization algorithm used to train machine learning models. It works by iteratively adjusting a model's parameters in the direction that most reduces a scalar loss function — a measure of prediction error computed over training examples. The core update rule subtracts a fraction of the gradient of the loss with respect to each parameter, where the fraction is controlled by a hyperparameter called the learning rate: a value too large causes oscillation or divergence, while a value too small makes training prohibitively slow.
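The update rule and the effect of the learning rate can be sketched in a few lines. This is a minimal illustration rather than a production optimizer: the quadratic toy loss, the helper names (gradient_descent, toy_grad, norm), and the three learning rates are assumptions chosen for the example.

```python
def gradient_descent(grad_fn, theta0, learning_rate, steps):
    """Repeat the update theta <- theta - learning_rate * gradient(theta)."""
    theta = list(theta0)
    for _ in range(steps):
        grad = grad_fn(theta)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


def toy_grad(theta):
    # Hypothetical toy loss L(theta) = 0.5 * sum(t**2); its gradient is theta itself,
    # and its unique minimum is at the origin.
    return list(theta)


def norm(vec):
    return sum(x * x for x in vec) ** 0.5


start = [4.0, -2.0]
well_tuned = gradient_descent(toy_grad, start, learning_rate=0.5, steps=50)
too_small = gradient_descent(toy_grad, start, learning_rate=1e-4, steps=50)
too_large = gradient_descent(toy_grad, start, learning_rate=2.5, steps=50)

print("well tuned:", norm(well_tuned))  # distance to the minimum shrinks toward zero
print("too small: ", norm(too_small))   # barely moves from the starting point
print("too large: ", norm(too_large))   # overshoots on every step and diverges
```

On this toy loss each step multiplies the parameters by (1 - learning_rate), so a rate of 0.5 halves the distance to the minimum each step, a rate of 1e-4 leaves it almost unchanged, and a rate of 2.5 flips the sign and grows the magnitude by 1.5x per step.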

In practice, computing the gradient over the full training dataset at every step is prohibitively expensive for large datasets, so stochastic gradient descent (SGD) and its mini-batch variants estimate the full gradient from a randomly sampled subset of examples at each step. For neural networks, the gradient is computed via backpropagation, which applies the chain rule of calculus to propagate error signals backward through the layers. Practical variants address common training challenges: momentum accumulates an exponentially decaying moving average of past gradients to accelerate progress and dampen oscillations; Adam (Adaptive Moment Estimation, introduced by Kingma and Ba in 2014) maintains per-parameter adaptive learning rates based on running estimates of the first and second moments of the gradient; AdamW decouples weight decay from the gradient-based update, which improves regularization, and has become the dominant optimizer for large language model pre-training.
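A rough sketch of how mini-batch sampling and an AdamW-style update fit together, using only the Python standard library. The toy data (y = 3x + 1), the helper names (adamw_step, minibatch_grads), and all hyperparameter values are assumptions made for illustration; real training code would normally use an optimizer from a framework instead.

```python
import math
import random


def adamw_step(params, grads, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW-style update. m and v are updated in place; t is the 1-based step."""
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g        # first-moment estimate
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m[i] / (1.0 - beta1 ** t)              # bias correction
        v_hat = v[i] / (1.0 - beta2 ** t)
        p = p - lr * weight_decay * p                  # decoupled weight decay
        p = p - lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive gradient step
        updated.append(p)
    return updated


def minibatch_grads(params, batch):
    """Gradient of mean squared error for the model y_hat = w * x + b on one batch."""
    w, b = params
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return [grad_w, grad_b]


rng = random.Random(0)
# Hypothetical noise-free training set drawn from y = 3x + 1.
data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1.0, 1.0) for _ in range(256))]

params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    batch = rng.sample(data, 32)           # random subset approximates the full gradient
    grads = minibatch_grads(params, batch)
    params = adamw_step(params, grads, m, v, t, lr=0.05)

w, b = params
print(w, b)  # should end up close to the true values 3 and 1
```

The weight-decay line acts on the parameter directly and never passes through the moment estimates, which is the sense in which the decay is decoupled from the gradient-based update.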

Gradient descent is important not because it is guaranteed to find a global minimum — the loss landscapes of deep neural networks are highly non-convex, containing many local minima and saddle points — but because it reliably finds parameter configurations with low training loss and strong empirical generalization. Understanding its failure modes — exploding or vanishing gradients, loss spikes, and sensitivity to learning rate schedules — is a core competency for practitioners training modern systems at scale.
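Vanishing and exploding gradients both follow from the chain rule: the gradient reaching an early layer is a product of per-layer local derivatives, so it shrinks or grows geometrically with depth. The sketch below illustrates this with a chain of scalar units; the function name input_gradient, the single shared weight, and the depth of 30 are assumptions for the example and far simpler than a real network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(weight, depth, x=0.5, activation="sigmoid"):
    """d(output)/d(input) for a chain of `depth` scalar units, via the chain rule."""
    a = x
    local_derivs = []
    for _ in range(depth):                 # forward pass, recording local derivatives
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local_derivs.append(weight * a * (1.0 - a))  # sigmoid'(z) is at most 0.25
        else:                              # identity activation
            a = z
            local_derivs.append(weight)
    grad = 1.0
    for d in reversed(local_derivs):       # backward pass: multiply local derivatives
        grad *= d
    return grad


vanishing = input_gradient(weight=1.0, depth=30, activation="sigmoid")
exploding = input_gradient(weight=1.5, depth=30, activation="identity")
print(vanishing)  # tiny: each sigmoid layer scales the gradient by at most 0.25
print(exploding)  # large: each layer scales the gradient by 1.5
```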

As of 2026, AdamW combined with a cosine or linear learning rate schedule with warmup remains the standard optimizer for pre-training large language models across most major laboratories. Research into alternatives continues: Sophia uses an estimate of the Hessian diagonal as curvature information to rescale gradient updates, and Muon approximately orthogonalizes the updates to weight matrices using Newton-Schulz iterations; both have shown promising results on language model training benchmarks. Distributed training across thousands of GPUs requires careful gradient synchronization, and techniques that reduce memory and compute cost, such as gradient checkpointing, mixed-precision training in BF16 or FP8, and ZeRO-style sharding of optimizer state, gradients, and parameters, have become standard infrastructure for frontier model training.
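A cosine schedule with linear warmup can be written as a small function of the step index. This is one common formulation, not the schedule of any particular laboratory; the function name lr_at_step, its parameters, and the example values (peak rate 1e-3, 100 warmup steps, 1000 total steps) are assumptions for illustration.

```python
import math


def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate for a 0-based step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)          # hold at min_lr past the end of training
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 99, 100, 550, 1000):
    print(step, lr_at_step(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000))
```

The warmup phase keeps early updates small while the optimizer's moment estimates are still unreliable, and the cosine phase lowers the rate smoothly so that late training takes progressively smaller steps.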

Example

During the pre-training of a large language model, AdamW processes mini-batches of token sequences: for each batch, the cross-entropy loss is backpropagated through the stack of transformer layers (typically dozens to over a hundred), and the resulting gradients are used to update up to hundreds of billions of parameters. This repeats over many thousands of iterations until validation loss plateaus.

Latest news on this topic

Habr AI Breaks Down Gradient Descent in C++ and CUDA Through MNIST Model Training (2026-04-30)