Training

Model Checkpoint

A model checkpoint is a saved snapshot of a neural network's weights and optimizer state at a specific point during training, enabling resumption after hardware failures and selection of the best-performing version across training steps.

A model checkpoint is a serialized file or set of files capturing the complete state of a neural network at a given training step or epoch: the model weights, optimizer state (including momentum buffers and adaptive learning rate statistics), the current learning rate schedule position, and the step or epoch index. Saving checkpoints at regular intervals is standard practice in any non-trivial deep learning training run.
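As a rough, framework-agnostic illustration of the state listed above, the sketch below bundles weights, optimizer state, schedule position, and step index into one dictionary and writes it to disk. It uses only the Python standard library and JSON so that it runs anywhere; the helper names `save_checkpoint` and `load_checkpoint`, the toy values, and the file name are assumptions made for this example, and a real training job would serialize tensors with its framework's own tools rather than JSON.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the checkpoint to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic on the same filesystem, so a crash mid-write
        # does not leave a truncated file at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Read a checkpoint written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)


# Hypothetical toy state; real checkpoints hold large tensors, not short lists.
state = {
    "step": 3000,
    "model": {"layer.weight": [0.1, -0.2], "layer.bias": [0.0]},
    "optimizer": {"momentum": {"layer.weight": [0.01, 0.02], "layer.bias": [0.0]}},
    "scheduler": {"last_step": 3000, "lr": 1e-4},
}

ckpt_path = os.path.join(tempfile.mkdtemp(), "step_3000.json")
save_checkpoint(ckpt_path, state)
restored = load_checkpoint(ckpt_path)
```

Writing to a temporary file and renaming it is one common way to make sure a failure during the save itself cannot corrupt the most recent good checkpoint.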

Checkpoints are written using framework-specific serialization formats. PyTorch uses .pt or .pth files produced by torch.save; TensorFlow uses SavedModel directories or .ckpt shards; the Hugging Face ecosystem has broadly adopted the safetensors format for model weights, which loads faster and avoids the arbitrary-code-execution risk of pickle-based serialization. Checkpoint sizes range from a few megabytes for small classifiers to hundreds of gigabytes for 70B+ parameter language models; the largest are often stored as sharded files in distributed object storage.
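The size figures above follow from simple arithmetic: parameter count times bytes stored per parameter. The sketch below is a back-of-the-envelope estimate only. The function name `checkpoint_size_gb` is hypothetical, and the defaults are assumptions: 2 bytes per weight (16-bit precision) and, when optimizer state is included, 8 bytes per parameter for two 32-bit Adam-style buffers. Actual sizes depend on the precision, optimizer, and format used.

```python
def checkpoint_size_gb(num_params, weight_bytes=2, optimizer_bytes_per_param=0):
    """Estimate checkpoint size in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * (weight_bytes + optimizer_bytes_per_param)
    return total_bytes / 1e9


# Assumed 70B-parameter model stored in 16-bit precision.
weights_only_gb = checkpoint_size_gb(70_000_000_000)

# Same model plus two assumed 32-bit optimizer buffers per parameter.
with_optimizer_gb = checkpoint_size_gb(70_000_000_000, optimizer_bytes_per_param=8)

print(f"weights only:   {weights_only_gb:.0f} GB")    # 140 GB
print(f"with optimizer: {with_optimizer_gb:.0f} GB")  # 700 GB
```

Under these assumptions a resumable training checkpoint is several times larger than the weights-only file used for inference, which is one reason the two are often stored separately.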

Checkpoints serve three main purposes in practice: fault tolerance (resuming a multi-week training run after a node failure without losing all progress), model selection (retaining the checkpoint from the step with the best validation metric rather than blindly taking the final weights), and deployment (using saved weights directly as the production inference artifact). Checkpoint averaging, which computes the element-wise mean of weights across several recent checkpoints, is an additional technique sometimes used to improve generalization.
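Model selection and checkpoint averaging can both be sketched in a few lines. The example below represents each checkpoint's weights as a dictionary of plain Python lists so that it runs without any framework; the function names `average_checkpoints` and `select_best`, the toy weights, and the validation losses are all made up for illustration. With real tensors the same element-wise mean would be computed per parameter tensor.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of weights across checkpoints with identical keys and shapes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th element of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


def select_best(val_loss_by_step):
    """Return the step whose checkpoint has the lowest validation loss."""
    return min(val_loss_by_step, key=val_loss_by_step.get)


# Hypothetical weights from three recent checkpoints.
recent = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [2.0, 4.0], "b": [3.0]},
    {"w": [3.0, 6.0], "b": [6.0]},
]
averaged = average_checkpoints(recent)
print(averaged)  # {'w': [2.0, 4.0], 'b': [3.0]}

# Hypothetical validation losses recorded at each saved step.
best_step = select_best({2000: 0.52, 2500: 0.47, 3000: 0.49})
print(best_step)  # 2500
```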

As of 2026, the Hugging Face Hub hosts millions of public model checkpoints, making checkpoint sharing the de facto standard for model distribution and reproducibility. Cloud training platforms such as AWS SageMaker, Google Vertex AI, and Azure ML offer built-in checkpoint management with automatic periodic saves to object storage such as S3 or GCS. For very large models, the time required to write and reload a checkpoint can itself become a bottleneck, prompting work on asynchronous checkpointing and incremental delta saves.
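One way to picture asynchronous checkpointing is as a two-phase save: take a quick in-memory snapshot of the state, then hand the slow disk write to a background thread so training can continue. The sketch below shows that idea with the Python standard library only. The function name `save_async` and the toy state are hypothetical, and production systems use their framework's or platform's own mechanisms, which are not shown here.

```python
import copy
import json
import os
import tempfile
import threading


def save_async(state, path):
    """Snapshot the state synchronously, then write it on a background thread."""
    # Copy first: training may mutate `state` as soon as this function returns.
    snapshot = copy.deepcopy(state)

    def _write():
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(snapshot, f)
        os.replace(tmp_path, path)  # publish the file only once it is complete

    writer = threading.Thread(target=_write)
    writer.start()
    return writer


state = {"step": 500, "weights": [0.1, 0.2]}
path = os.path.join(tempfile.mkdtemp(), "step_500.json")

writer = save_async(state, path)
state["step"] = 501         # training continues while the write is in flight
state["weights"][0] = 0.15

writer.join()  # wait before exiting or before reusing the same path
with open(path) as f:
    saved = json.load(f)
print(saved)  # {'step': 500, 'weights': [0.1, 0.2]}
```

The snapshot is what keeps the saved file consistent: the values written are those from step 500, even though the live state had already moved on when the write finished.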

Example

During a week-long fine-tuning run of a 13B-parameter language model, checkpoints are saved every 500 training steps to cloud storage; when a GPU node fails at step 3,200, the run resumes from the step-3,000 checkpoint rather than restarting from scratch.
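The resume point in this example is just the failure step rounded down to the nearest multiple of the save interval. A minimal sketch, with the hypothetical helper name `latest_checkpoint_step`:

```python
def latest_checkpoint_step(failed_step, save_every):
    """Most recent step at which a checkpoint was saved before the failure."""
    return (failed_step // save_every) * save_every


resume_step = latest_checkpoint_step(3200, 500)
lost_steps = 3200 - resume_step

print(resume_step)  # 3000
print(lost_steps)   # 200
```

With a 500-step interval, at most 499 steps of work are ever repeated after a failure; a shorter interval reduces that loss at the cost of more frequent writes.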
