National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Checkpointing (Creating a Restart File)

A checkpoint file, also known as a restart file, saves the state of a simulation and allows the run to be restarted.

Why Create a Checkpoint File?

There are several reasons to checkpoint a simulation.

  • Simulation Longer than the Available Wall Clock Limit of a Class
  • In some cases the user may not have a choice. To be fair to all users, most NERSC batch submission classes have a wall clock limit of 24-48 hours. If a simulation takes longer than the wall clock limit of a given class, the user must checkpoint the run, resubmit the job, and restart from the checkpoint file.

  • Save Simulation in case of Machine Outage
  • Although NERSC works to keep all computing resources available to users as much of the time as possible, individual nodes or other parts of the system will inevitably fail at times, particularly on new research machines. Because hardware failures are decidedly still a part of supercomputing, NERSC does not refund users for failed jobs except in rare circumstances. A user can avoid losing an entire run by periodically checkpointing the simulation. If a machine outage does cause the job to fail, the user can restart the simulation from the last checkpoint file.

  • Change Direction of a Simulation
  • In some cases a user may want to change the parameters of a long-running simulation without going back to the beginning. In other cases, an intermediate checkpoint can be used to verify that the simulation is giving the expected results.

What to Store in a Checkpoint File?

Everything a user needs to save the state and restart a simulation should be included in the checkpoint file.

  • Simulation Data
  • Users should store all necessary physical quantity data. Depending on the application, this might include grid variables such as density, pressure, or magnetic field strength; molecular states; the number and position of particles; the timestep; etc. Some quantities may not need to be stored because they can be easily derived from other variables.

  • Parameter Values
  • All runtime parameters (inputs that can change from simulation to simulation) should be stored in the checkpoint file.

  • Meta-Data
  • A user may also want to include a description of the run as a reminder of its specifics. This might include the date of the run, the simulation name or description, the compilation flags, the version of the code used, etc.

  • Storing vs Re-computing Data
  • The user also needs to consider whether it is more efficient to store simulation data or to recalculate it from other stored values. Some quantities can be derived from other values, reducing the amount of data that needs to be stored. One example is a block- or patch-based grid code that uses guard (or ghost) cells to hold copies of neighboring processors' data. In some 3D simulations, storing guard cells could double, triple, or quadruple the amount of data that needs to be stored. Conversely, refilling the guard cells after a restart requires interprocessor communication.
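The categories above can be sketched as a single checkpoint record. This is a minimal illustration, not a recommended production format: it uses JSON from the Python standard library, and every name in it (`write_checkpoint`, `read_checkpoint`, `internal_energy`, the field and parameter names) is hypothetical. The ideal-gas relation stands in for "a quantity that is recomputed rather than stored".

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_checkpoint(path, step, time, fields, params, meta):
    """Store simulation data, runtime parameters, and meta-data in one file."""
    checkpoint = {
        "simulation_data": {"step": step, "time": time, "fields": fields},
        "parameters": params,
        "metadata": meta,
    }
    with open(path, "w") as f:
        json.dump(checkpoint, f)


def read_checkpoint(path):
    """Read back the full checkpoint record."""
    with open(path) as f:
        return json.load(f)


def internal_energy(fields, gamma):
    """Recompute a derived quantity (ideal gas: e = p / ((gamma - 1) * rho))
    instead of storing it in the checkpoint."""
    return [
        p / ((gamma - 1.0) * rho)
        for p, rho in zip(fields["pressure"], fields["density"])
    ]


# Hypothetical 1D run: only interior cells are stored, no guard cells.
fields = {"density": [1.0, 0.9, 0.8, 0.7], "pressure": [2.0, 1.8, 1.6, 1.4]}
params = {"dt": 0.01, "gamma": 1.4, "n_cells": 4}
meta = {
    "run_name": "demo_run",  # assumed name, for illustration only
    "written_at": datetime.now(timezone.utc).isoformat(),
    "code_version": "unknown",
}

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
write_checkpoint(path, step=10, time=0.1, fields=fields, params=params, meta=meta)

restored = read_checkpoint(path)
energy = internal_energy(
    restored["simulation_data"]["fields"], restored["parameters"]["gamma"]
)
```

On restart, the code reads the parameters from the checkpoint rather than from a possibly edited input file, and rebuilds the derived quantities before continuing.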

How Often to Checkpoint a Simulation?

  • Based on Wall-Clock Time
  • Deciding how often to checkpoint a simulation depends on the trade-off between how long it takes to write a checkpoint file and the likelihood that the simulation will need to be restarted. Checkpointing increases the job's runtime and thus the amount that the user is charged. At the same time, the user does not want a job to fail in the 23rd hour of a 24-hour run and have to start again from the beginning. Checkpointing a run every 4-8 hours is a good starting point. From there the user can increase or decrease the checkpoint frequency depending on machine stability and the time it takes to write a checkpoint file.

    For simulations requiring long runtimes, the user should also make sure to checkpoint the run right before the wall clock limit is reached.

  • Based on Simulation Time or State
  • Besides checkpointing based on wall-clock time, the user may want to checkpoint after a simulation has reached a given state. This could be based on the number of timesteps, an equilibrium point, or an intermediate position in the code that the user wants to make sure isn't lost.

  • Disk Quota Considerations
  • For some users, creating checkpoint files is an IO-intensive process that takes up a lot of disk space. Because it is often impossible to monitor jobs and disk quotas in real time, many users implement rolling checkpoints in which only the last 2 checkpoint files are kept. As the simulation advances, each new checkpoint file overwrites the older of the two, so the 2 most recent checkpoint files are always stored.
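The timing rule and the rolling scheme described above might be combined as in the following sketch. It is an illustration under stated assumptions: the function names, the `checkpoint_<slot>.pkl` naming, and the use of `pickle` are all hypothetical choices, and the interval and margin values are examples, not NERSC recommendations. Writing to a temporary file and renaming it is an added safeguard (not from the text above) so that a crash during the write cannot damage an existing checkpoint.

```python
import os
import pickle
import tempfile


def should_checkpoint(elapsed, last_checkpoint, interval, wall_limit, margin):
    """Return True when a checkpoint is due.

    All arguments are wall-clock seconds measured from the start of the job.
    A checkpoint is due when `interval` seconds have passed since the last
    one, or when the job is within `margin` seconds of its wall clock limit.
    """
    due_by_interval = elapsed - last_checkpoint >= interval
    near_wall_limit = elapsed >= wall_limit - margin
    return due_by_interval or near_wall_limit


def write_rolling_checkpoint(directory, counter, state, keep=2):
    """Write checkpoint number `counter` into one of `keep` reusable slots."""
    slot = counter % keep
    final_path = os.path.join(directory, f"checkpoint_{slot}.pkl")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump({"counter": counter, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    # Overwrite the old slot only after the new file is completely written.
    os.replace(tmp_path, final_path)
    return final_path


def latest_checkpoint(directory):
    """Return the stored record with the highest counter, or None."""
    best = None
    for name in os.listdir(directory):
        if name.startswith("checkpoint_") and name.endswith(".pkl"):
            with open(os.path.join(directory, name), "rb") as f:
                record = pickle.load(f)
            if best is None or record["counter"] > best["counter"]:
                best = record
    return best


workdir = tempfile.mkdtemp()
for counter in range(5):
    write_rolling_checkpoint(workdir, counter, state={"step": counter * 100})

files = sorted(os.listdir(workdir))
newest = latest_checkpoint(workdir)

HOUR = 3600
early = should_checkpoint(1 * HOUR, 0, 4 * HOUR, 24 * HOUR, 600)
periodic = should_checkpoint(5 * HOUR, 0, 4 * HOUR, 24 * HOUR, 600)
final = should_checkpoint(24 * HOUR - 300, 22 * HOUR, 4 * HOUR, 24 * HOUR, 600)
```

Once the wall-limit condition fires, `should_checkpoint` keeps returning True, so a real code would write that last checkpoint once and then exit cleanly.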

Checkpoint Formats

  • Standard IO Formats - Portability and Longevity
  • For maximum portability, a checkpoint file should be in a standard self-describing IO format, such as HDF, HDF5, netCDF, or Parallel-NetCDF. This allows a user to run a simulation on one machine and restart it on another. Using a standard IO format also relieves the user of having to keep track of both the code version and the post-processing tool version.

  • One File Per Processor vs Single Shared File
  • There are numerous reasons why writing to a single shared file or a few files, rather than having each processor write its own file, makes sense. From a data management, data analysis, and often a file system meta-data server perspective, a single shared file is a clear winner. We have found relatively few circumstances in which the overhead of the single shared file approach makes it prohibitively expensive. Additionally, if a single shared file approach is used, a code can often be restarted more easily on a different number of processors.
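To illustrate why a single shared file eases restarting on a different number of processors, the sketch below stores a global 1D array in one file, with each "rank" writing its block at a computed offset. It is a serial stand-in under loud assumptions: the ranks are simulated by a loop, the raw binary layout and all function names are hypothetical, and a real parallel code would more likely use a library such as HDF5 or Parallel-NetCDF, which also makes the file self-describing.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<q")  # global cell count, little-endian int64
VALUE_SIZE = struct.calcsize("<d")  # one float64 per cell


def block_range(n_global, n_ranks, rank):
    """Return the [start, stop) range of global cells owned by `rank`."""
    base, extra = divmod(n_global, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def create_shared_file(path, n_global):
    """Create the file and record the global size in its header."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(n_global))


def read_global_size(path):
    with open(path, "rb") as f:
        return HEADER.unpack(f.read(HEADER.size))[0]


def write_block(path, start, values):
    """Write one rank's values at its offset in the shared file."""
    with open(path, "r+b") as f:
        f.seek(HEADER.size + start * VALUE_SIZE)
        f.write(struct.pack(f"<{len(values)}d", *values))


def read_block(path, start, stop):
    """Read the cells in [start, stop) regardless of who wrote them."""
    with open(path, "rb") as f:
        f.seek(HEADER.size + start * VALUE_SIZE)
        raw = f.read((stop - start) * VALUE_SIZE)
    return list(struct.unpack(f"<{stop - start}d", raw))


global_data = [float(i) for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "shared_checkpoint.bin")

# "Run" on 4 ranks: each rank writes only the block it owns.
create_shared_file(path, len(global_data))
for rank in range(4):
    start, stop = block_range(len(global_data), 4, rank)
    write_block(path, start, global_data[start:stop])

# "Restart" on 3 ranks: each rank reads the block it owns under the new layout.
n_global = read_global_size(path)
restart_blocks = [read_block(path, *block_range(n_global, 3, r)) for r in range(3)]
reassembled = [v for block in restart_blocks for v in block]
```

Because the file layout depends only on global cell indices, nothing in it records how many ranks wrote it, which is what allows the restart to use a different decomposition.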

Verifying Checkpoint/Restart is Working Correctly

To verify that a checkpointing implementation is working correctly, the user should run a few tests.

  • Step 1: Run a simulation to a certain state, perhaps 20 timesteps, write a checkpoint file and verify that the simulation gives the expected results.
  • Step 2: Run the same simulation from the beginning and checkpoint the run at an intermediate position, say after 10 timesteps. Then stop the running job.
  • Step 3: Restart the job from the intermediate checkpoint file and run the simulation to the ending point (timestep 20 in this example), then write a final checkpoint file. Verify that the restarted simulation gives the expected results. The checkpoint file from Step 1 and the final checkpoint file from Step 3 should contain identical physical data, to machine precision. For codes that use a random number generator, the seed value should be stored in the checkpoint file.
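The three steps above can be sketched with a toy simulation. Everything here is hypothetical: the "simulation" is a random walk, the helper names are made up for illustration, and `pickle` stands in for a real checkpoint format. Note one assumption that goes beyond the text: the sketch stores the full random number generator state rather than only the seed, since a restart in the middle of a run must resume the random sequence where it left off.

```python
import os
import pickle
import random
import tempfile


def advance(state, rng, n_steps):
    """Advance a toy random-walk simulation by `n_steps` timesteps."""
    for _ in range(n_steps):
        state["x"] += rng.uniform(-1.0, 1.0)
        state["step"] += 1
    return state


def save(path, state, rng):
    """Write the simulation state and the generator state to a checkpoint."""
    with open(path, "wb") as f:
        pickle.dump({"state": state, "rng_state": rng.getstate()}, f)


def load(path):
    """Read a checkpoint and rebuild the generator in its saved state."""
    with open(path, "rb") as f:
        record = pickle.load(f)
    rng = random.Random()
    rng.setstate(record["rng_state"])
    return record["state"], rng


workdir = tempfile.mkdtemp()
reference_path = os.path.join(workdir, "reference.pkl")
intermediate_path = os.path.join(workdir, "intermediate.pkl")
final_path = os.path.join(workdir, "final.pkl")

# Step 1: uninterrupted run to timestep 20, then checkpoint.
rng = random.Random(12345)
save(reference_path, advance({"step": 0, "x": 0.0}, rng, 20), rng)

# Step 2: same run from the beginning, checkpoint at timestep 10, then stop.
rng = random.Random(12345)
save(intermediate_path, advance({"step": 0, "x": 0.0}, rng, 10), rng)

# Step 3: restart from the intermediate checkpoint and run to timestep 20.
state, rng = load(intermediate_path)
save(final_path, advance(state, rng, 10), rng)

# Compare the checkpoint from Step 1 with the final checkpoint from Step 3.
reference_state, _ = load(reference_path)
final_state, _ = load(final_path)
identical = reference_state == final_state
```

In this serial toy the two results match exactly. A real code would compare every stored field in the two checkpoint files in the same way.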

Working with Third-Party Applications

Not all third-party software applications support checkpointing. Users should check the application's documentation, because some applications can be checkpointed for certain calculations but not for others.

Advanced Checkpoint/Restart Features

  • Restarting on a different number of processors
  • If a checkpoint file is stored in a general enough format and the code that reads it is also flexible, a simulation may be able to be restarted on a different number of processors. This can be of great use if the memory footprint of a simulation grows over time: the run can be checkpointed and restarted on a larger number of processors.

  • Trigger a checkpoint dump from outside the code
  • It can be useful for the code to be able to "checkpoint on demand". One approach is for the code to check periodically for the existence of a file or flag that can be set while the job is running. If the job needs to be stopped for any reason, the user can then trigger a checkpoint at the last possible moment.
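A flag-file check of the kind described above might look like the following sketch. The names (`run`, `flag_path`, `check_every`) are hypothetical, the time loop is a toy, and the `request_at` argument exists only to simulate a user creating the flag file while the job runs. On a real system the user would create the file from outside the job, for example with `touch`.

```python
import os
import pickle
import tempfile


def run(n_steps, flag_path, checkpoint_path, check_every=5, request_at=None):
    """Toy time loop that checkpoints and stops when a flag file appears.

    The flag is polled only every `check_every` steps to keep the file
    system check cheap. `request_at` simulates an external request by
    creating the flag file at that step.
    """
    state = {"step": 0, "value": 0.0}
    for _ in range(n_steps):
        state["value"] += 0.5
        state["step"] += 1

        if request_at is not None and state["step"] == request_at:
            open(flag_path, "w").close()  # stands in for the user's action

        if state["step"] % check_every == 0 and os.path.exists(flag_path):
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
            os.remove(flag_path)  # clear the request so a restart is not stopped
            return state, True
    return state, False


workdir = tempfile.mkdtemp()
flag_path = os.path.join(workdir, "CHECKPOINT_NOW")
checkpoint_path = os.path.join(workdir, "on_demand.pkl")

# The request arrives at step 12 and is noticed at the next check, step 15.
state, stopped = run(100, flag_path, checkpoint_path, request_at=12)
with open(checkpoint_path, "rb") as f:
    saved = pickle.load(f)

# Without a request the run goes to completion and writes nothing on demand.
full_state, full_stopped = run(20, flag_path, os.path.join(workdir, "unused.pkl"))
```

The polling interval sets the worst-case delay between the request and the checkpoint, so it should be short compared with the time the user has before the job must stop.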


Page last modified: Mon, 28 Jan 2008 01:59:08 GMT
Page URL: http://www.nersc.gov/nusers/systems/franklin/io_strategies/checkpointing.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov