💡Checkpointing

What is it?

Jaideep Ray
2 min read · Jan 29, 2022

Checkpoints are essentially snapshots of the running job state taken at regular intervals and stored in persistent storage.

Where is it needed?

  1. Long-running ML jobs (training, evaluation) are failure-prone. Checkpointing is needed to recover from a failure and resume the job (a minimal save-and-resume sketch follows this list).
  2. Checkpointing also keeps model snapshots at various time intervals / epochs. This is useful for online training and for improving inference model accuracy.
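As a rough illustration of failure recovery, here is a minimal PyTorch-style sketch of saving and resuming a training job. The file path, function names, and saved fields are assumptions, not a standard API:

```python
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"  # hypothetical location

def save_checkpoint(model, optimizer, epoch, path=CKPT_PATH):
    # Snapshot everything needed to resume: weights, optimizer state, progress.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

def resume_if_possible(model, optimizer, path=CKPT_PATH):
    # Resume from the last checkpoint if one exists; otherwise start at epoch 0.
    if not os.path.exists(path):
        return 0
    state = torch.load(path)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1
```

A training loop would call resume_if_possible once at startup and save_checkpoint at whatever frequency the job chooses.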

Parameters of checkpoint management:

For the job:

  1. Restart accuracy: When the job is restarted from a checkpoint, it must not lose accuracy. This is particularly important for training jobs.
  2. Frequency: The frequency of taking checkpoints must be well thought out. It should not be too frequent (very expensive) or too sparse (risk of losing significant work). Job-dependent heuristics can be used to determine the right frequency (one such heuristic is sketched below).
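As one example of such a heuristic, the sketch below checkpoints on whichever budget is exhausted first, a step count or a wall-clock interval. The class name and default values are made up for illustration:

```python
import time

class CheckpointPolicy:
    """Checkpoint every `max_steps` training steps or every `max_seconds`
    of wall-clock time, whichever comes first (illustrative defaults)."""

    def __init__(self, max_steps=1000, max_seconds=600):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self._last_step = 0
        self._last_time = time.monotonic()

    def should_checkpoint(self, step):
        due = (step - self._last_step >= self.max_steps
               or time.monotonic() - self._last_time >= self.max_seconds)
        if due:
            # Reset the counters so the next interval is measured from here.
            self._last_step = step
            self._last_time = time.monotonic()
        return due
```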

For the checkpoint store:

  1. Write bandwidth: Checkpointing large artifacts such as models can consume significant write bandwidth, especially when thousands of parallel training jobs are all writing checkpoints to the store. Incremental checkpointing should be enabled wherever applicable, and checkpoints should be written from a separate process/thread so the main job does not stall (see the sketch below).
  2. Storage capacity: Depending on checkpoint usage, this may become a bottleneck. For example, storing multiple checkpoints for each model use-case for several weeks would require a tremendous amount of storage infrastructure. Model compression techniques like quantization and pruning can help.
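To illustrate the separate-thread point, here is a minimal sketch of asynchronous checkpoint writing. The function names are assumptions, and a production system would also handle incremental state and write failures:

```python
import threading
import torch

def snapshot_on_cpu(state_dict):
    # Copy tensors to CPU so the background write does not race with
    # in-place parameter updates on the accelerator. (A simplification:
    # nested optimizer state would need the same treatment.)
    return {k: v.detach().cpu().clone() if torch.is_tensor(v) else v
            for k, v in state_dict.items()}

def async_save(model, path):
    # The (cheap) in-memory copy happens on the main thread; the (slow)
    # serialization and store write happens in a background thread.
    snapshot = snapshot_on_cpu(model.state_dict())
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # call .join() before exiting to guarantee durability
```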

Checkpointing stages:

The checkpointing process consists of three main stages (an end-to-end sketch follows the list):

  1. Create an in-memory snapshot of the job state. For example, in a training job the snapshot would consist of weight parameters, layer information, optimizer state, and the training config.
  2. Build an optimized checkpoint. Use compression or other techniques to optimize storage.
  3. Write the checkpoint to checkpoint storage (replicated).
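Putting the three stages together, a hedged end-to-end sketch; the snapshot fields, gzip compression, and local file path are all stand-ins for what a real job and checkpoint store would use:

```python
import gzip
import io
import torch

def build_checkpoint(model, optimizer, config, step):
    # Stage 1: in-memory snapshot of the job state.
    state = {
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "config": config,
    }
    # Stage 2: build an optimized artifact. gzip stands in for whatever
    # compression / deduplication the checkpoint store actually supports.
    buffer = io.BytesIO()
    torch.save(state, buffer)
    return gzip.compress(buffer.getvalue())

def write_checkpoint(blob, path):
    # Stage 3: persist to the (replicated) checkpoint store; a local file
    # path stands in for the real store here.
    with open(path, "wb") as f:
        f.write(blob)
```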
