National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Checkpoint and Restart Considerations

On April 12, 2007, checkpoint/restart capability was enabled by default for all LoadLeveler jobs. Checkpointing allows NERSC to suspend jobs and resume them later, rather than kill them, if the need arises.

This change will be transparent to most jobs. However, checkpointing imposes some limitations, which are detailed below. If you need to change the default behavior, you can disable checkpoint/restart by specifying the checkpoint=no LoadLeveler directive in your batch script, or by unsetting the CHECKPOINT environment variable in your script.
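
For a job step that is launched from a Python wrapper, the environment-variable route could look roughly like the sketch below. This is an illustration, not NERSC-provided tooling: the helper name env_without_checkpoint is hypothetical, and the sketch assumes that removing CHECKPOINT from a child's environment has the same effect as unsetting it in the batch script.

```python
import os


def env_without_checkpoint(env=None):
    """Return a copy of an environment mapping with CHECKPOINT removed.

    Hypothetical helper: the assumption is that a program started with
    this environment runs with checkpoint/restart disabled.
    """
    source = os.environ if env is None else env
    child_env = dict(source)
    # Removing the key mirrors "unset CHECKPOINT" in a shell script.
    child_env.pop("CHECKPOINT", None)
    return child_env
```

The returned dictionary can be passed as the env argument of subprocess.run when starting the program; the caller's own environment is left unchanged.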

We have had one report of a Python script failing when checkpoint/restart is enabled.

Checkpoint and restart limitations

Use of the checkpoint and restart function has certain limitations. If you plan to use checkpoint and restart, you need to be aware of the types of programs that cannot be checkpointed, as well as certain program, operating system, node, and other restrictions.

Programs that cannot be checkpointed

The following programs cannot be checkpointed:

  • Programs that do not have the environment variable CHECKPOINT set to yes.
  • Programs that are being run under:
    • The dynamic probe class library (DPCL).
    • Any debugger that is not checkpoint/restart-capable.
  • Processes that use:
    • Extended shmat support
    • Pinned shared memory segments
  • Sets of processes in which any process is running a setuid program when a checkpoint occurs.
  • Jobs for which POE input or output is a pipe.
  • Jobs for which POE input or output is redirected, unless the job is submitted from a shell that had the CHECKPOINT environment variable set to yes before the shell was started. If POE is run from inside a shell script and is run in the background, the script must be started from a shell started in the same manner for the job to be able to be checkpointed.
  • Jobs that are run using the switch or network table sample programs.
  • Interactive POE jobs for which the su command was used prior to checkpointing or restarting the job.
  • User space programs that are not run under a resource manager that communicates with POE (for example, LoadLeveler®).

Program restrictions

Any program that meets both these criteria:

  • is compiled with one of the threaded compile scripts provided by PE
  • may be checkpointed prior to its main() function being invoked

must wait for the 0031-114 message to appear in POE's STDERR before issuing the checkpoint of the parallel job. Otherwise, a subsequent restart of the job may fail.

Note:
The MP_INFOLEVEL environment variable, or the -infolevel command-line option, must be set to a value of at least 2 for this message to appear.

Any program that meets both these criteria:

  • is compiled with one of the threaded compile scripts provided by PE
  • may be checkpointed immediately after the parallel job is restarted

must wait for the 0031-117 message to appear in POE's STDERR before issuing the checkpoint of the restarted job. Otherwise, the checkpoint of the job may fail.

Note:
The MP_INFOLEVEL environment variable, or the -infolevel command-line option, must be set to a value of at least 2 for this message to appear.
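
Both waits described above (for 0031-114 before the first checkpoint, and for 0031-117 before checkpointing a restarted job) amount to watching POE's STDERR for a message ID. A minimal sketch, assuming the STDERR output is available as an iterable of text lines (for example, a log file being read); the function name wait_for_poe_message is hypothetical, and how the checkpoint is then issued is outside the sketch:

```python
def wait_for_poe_message(stderr_lines, message_id):
    """Consume lines until one contains message_id.

    stderr_lines: any iterable of text lines taken from POE's STDERR
                  (how that stream is captured is an assumption here).
    message_id:   "0031-114" or "0031-117", as described in the text.
    Returns True once the ID is seen, False if the stream ends first.
    """
    for line in stderr_lines:
        if message_id in line:
            return True
    return False
```

A caller would issue the checkpoint only after this returns True. If MP_INFOLEVEL is below 2, the message never appears and the function returns False when the stream ends.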

AIX function restrictions

The following AIX functions will fail, with an errno of ENOTSUP, if the CHECKPOINT environment variable is set to yes in the environment of the calling program:

  • clock_getcpuclockid()
  • clock_getres()
  • clock_gettime()
  • clock_nanosleep()
  • clock_settime()
  • mlock()
  • mlockall()
  • mq_close()
  • mq_getattr()
  • mq_notify()
  • mq_open()
  • mq_receive()
  • mq_send()
  • mq_setattr()
  • mq_timedreceive()
  • mq_timedsend()
  • mq_unlink()
  • munlock()
  • munlockall()
  • nanosleep()
  • pthread_barrierattr_init()
  • pthread_barrierattr_destroy()
  • pthread_barrierattr_getpshared()
  • pthread_barrierattr_setpshared()
  • pthread_barrier_destroy()
  • pthread_barrier_init()
  • pthread_barrier_wait()
  • pthread_condattr_getclock()
  • pthread_condattr_setclock()
  • pthread_getcpuclockid()
  • pthread_mutexattr_getprioceiling()
  • pthread_mutexattr_getprotocol()
  • pthread_mutexattr_setprioceiling()
  • pthread_mutexattr_setprotocol()
  • pthread_mutex_getprioceiling()
  • pthread_mutex_setprioceiling()
  • pthread_mutex_timedlock()
  • pthread_rwlock_timedrdlock()
  • pthread_rwlock_timedwrlock()
  • pthread_setschedprio()
  • pthread_spin_destroy()
  • pthread_spin_init()
  • pthread_spin_lock()
  • pthread_spin_trylock()
  • pthread_spin_unlock()
  • sched_getparam()
  • sched_get_priority_max()
  • sched_get_priority_min()
  • sched_getscheduler()
  • sched_rr_get_interval()
  • sched_setparam()
  • sched_setscheduler()
  • sem_close()
  • sem_destroy()
  • sem_getvalue()
  • sem_init()
  • sem_open()
  • sem_post()
  • sem_timedwait()
  • sem_trywait()
  • sem_unlink()
  • sem_wait()
  • shm_open()
  • shm_unlink()
  • timer_create()
  • timer_delete()
  • timer_getoverrun()
  • timer_gettime()
  • timer_settime()

Node restrictions

The node on which a process is restarted must have:

  • The same operating system level (including PTFs). In addition, a restarted process may not load a module that requires a system call from a kernel extension that was not present at checkpoint time.
  • The same switch type as the node where the checkpoint occurred.
  • The capabilities enabled in /etc/security/user that were enabled for that user on the node on which the checkpoint operation was performed.

If any threads in the parallel task were bound to a specific processor ID at checkpoint time, that processor ID must exist on the node where that task is restarted.
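
The node requirements above can be read as a comparison between the checkpoint node and a candidate restart node. The following sketch only illustrates that comparison; the NodeInfo record and its field names are hypothetical, how the values would be gathered on a real system is not shown, and the kernel-extension rule is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical summary of the node properties named in the text."""
    os_level: str             # operating system level, including PTFs
    switch_type: str
    capabilities: frozenset   # capabilities enabled for the user
    processor_ids: frozenset  # processor IDs that exist on the node


def restart_problems(ckpt_node, restart_node, bound_processor_ids=()):
    """List the reasons restart_node does not satisfy the restrictions."""
    problems = []
    if restart_node.os_level != ckpt_node.os_level:
        problems.append("operating system level differs")
    if restart_node.switch_type != ckpt_node.switch_type:
        problems.append("switch type differs")
    # Capabilities enabled at checkpoint time must also be enabled here.
    missing_caps = ckpt_node.capabilities - restart_node.capabilities
    if missing_caps:
        problems.append("missing capabilities: " + ", ".join(sorted(missing_caps)))
    # Processor IDs that threads were bound to must exist on this node.
    missing_cpus = set(bound_processor_ids) - restart_node.processor_ids
    if missing_cpus:
        problems.append(
            "missing processor IDs: " + ", ".join(str(c) for c in sorted(missing_cpus))
        )
    return problems
```

An empty list means none of the modelled restrictions is violated; it does not guarantee that a restart will succeed.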

Task-related restrictions

  • The number of tasks and the task geometry (the tasks that are common within a node) must be the same on a restart as it was when the job was checkpointed.
  • Any regular file open in a parallel task when that task is checkpointed must be present on the node where that task is restarted, including the executable and any dynamically loaded libraries or objects.
  • If any task within a parallel application uses sockets or pipes, user callbacks should be registered to save data that may be in transit when a checkpoint occurs, and to restore the data when the task is resumed after a checkpoint or restart. Similarly, any user shared memory should be saved and restored.
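
The first two restrictions in this list lend themselves to a simple pre-restart check. The sketch below assumes a task geometry can be represented as one group of task IDs per node and that only the grouping matters, not the order of the nodes; both function names are hypothetical.

```python
import os


def same_task_geometry(before, after):
    """Compare two geometries given as iterables of per-node task ID groups.

    Assumption: two geometries match when they contain the same groups of
    tasks, regardless of the order in which nodes or tasks are listed.
    """
    def normalize(geometry):
        return {frozenset(node_tasks) for node_tasks in geometry}
    return normalize(before) == normalize(after)


def missing_files(paths):
    """Return the paths that are not present as regular files on this node."""
    return [p for p in paths if not os.path.isfile(p)]
```

For the second check, the list of paths would need to include the executable and any dynamically loaded libraries or objects, as the text requires.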

Pthread and atomic lock restrictions

  • A checkpoint operation will not begin on a parallel task until each user thread in that task has released all pthread locks, if held.

    This can potentially cause a significant delay from the time a checkpoint is issued until the checkpoint actually occurs. Also, any thread of a process that is being checkpointed that does not hold any pthread locks and tries to acquire one will be stopped immediately. There are no similar actions performed for atomic locks (_check_lock and _clear_lock, for example).

  • Atomic locks must be used in such a way that they do not prevent the releasing of pthread locks during a checkpoint.

    For example, if a checkpoint occurs and thread 1 holds a pthread lock and is waiting for an atomic lock, and thread 2 tries to acquire a different pthread lock (and does not hold any other pthread locks) before releasing the atomic lock that thread 1 is waiting for, the checkpoint will hang.

  • If a pthread lock is held when a parallel task creates a new process (either implicitly using popen, for example, or explicitly using fork or exec) and the releasing of the lock is contingent on some action of the new process, the CHECKPOINT environment variable must be set to no before causing the new process to be created.

    Otherwise, the parent process may be checkpointed (but not yet stopped) before the creation of the new process, which would result in the new process being checkpointed and stopped immediately.

  • A parallel task must not hold a pthread lock when creating a new process (either implicitly using popen for example, or explicitly using fork) if the releasing of the lock is contingent on some action of the new process.

    Otherwise, a checkpoint could occur that would cause the child process to be stopped before the parent could release the pthread lock, causing the checkpoint operation to hang.

  • The checkpoint operation may hang if any user pthread locks are held across:
    • Any collective communication calls in MPI (or in LAPI, if the application uses LAPI).
    • Calls to mpc_init_ckpt or mp_init_ckpt.
    • Any blocking MPI call that returns only after action on some other task.
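
A common way to respect the restrictions above is to hold a lock only while touching shared data, and to release it before any call that may block on another task or process. The sketch below shows that pattern with Python's threading.Lock as a stand-in; whether such a lock counts as a pthread lock in the sense of this section depends on the interpreter build and is an assumption. The names pending, queue_item, and drain_and_send are hypothetical.

```python
import threading

pending_lock = threading.Lock()
pending = []


def queue_item(item):
    """Add an item to the shared list; the lock is held only briefly."""
    with pending_lock:
        pending.append(item)


def drain_and_send(blocking_send):
    """Copy the shared data under the lock, then block without holding it.

    blocking_send stands in for a collective or blocking communication
    call, or for starting a child process.
    """
    with pending_lock:
        batch = list(pending)
        pending.clear()
    # The lock has been released here, so a checkpoint request is not
    # held up waiting for this thread while the call below blocks.
    blocking_send(batch)
    return len(batch)
```

The cost of this design is one extra copy of the data; in exchange, no lock is held across the blocking call.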

Other restrictions

  • Processes cannot be profiled at the time a checkpoint is taken.
  • There can be no devices other than TTYs or /dev/null open at the time a checkpoint is taken.
  • Open files must either have an absolute pathname that is less than or equal to PATHMAX in length, or must have a relative pathname that is less than or equal to PATHMAX in length from the current directory at the time they were opened. The current directory must have an absolute pathname that is less than or equal to PATHMAX in length.
  • Semaphores or message queues that are used within the set of processes being checkpointed must only be used by processes within the set of processes being checkpointed.

    This condition is not verified when a set of processes is checkpointed. The checkpoint and restart operations will succeed, but inconsistent results can occur after the restart.

  • The processes that create shared memory must be checkpointed with the processes using the shared memory if the shared memory is ever detached from all processes being checkpointed. Otherwise, the shared memory may not be available after a restart operation.
  • The ability to checkpoint and restart a process is not supported for B1 and C2 security configurations.
  • A process can checkpoint another process only if it can send a signal to the process.

    In other words, the privilege checking for checkpointing processes is identical to the privilege checking for sending a signal to the process. A privileged process (the effective user ID is 0) can checkpoint any process. A set of processes can only be checkpointed if each process in the set can be checkpointed.

  • A process can restart another process only if it can change its entire privilege state (real, saved, and effective versions of user ID, group ID, and group list) to match that of the restarted process.
  • A set of processes can be restarted only if each process in the set can be restarted.
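
Because the privilege check for checkpointing is stated to be identical to the check for sending a signal, a process can probe in advance whether it would be allowed to checkpoint another. A minimal sketch for POSIX systems, using signal 0, which performs the permission and existence checks without delivering a signal; the function names are hypothetical.

```python
import os


def can_signal(pid):
    """Return True if this process may send a signal to pid (POSIX only).

    Signal 0 is never delivered; the call only performs error checking.
    Do not use this on Windows, where os.kill behaves differently.
    """
    try:
        os.kill(pid, 0)
    except PermissionError:
        return False  # process exists, but we lack the privilege
    except ProcessLookupError:
        return False  # no such process
    return True


def can_signal_all(pids):
    """A set can be checkpointed only if every process in it can be."""
    return all(can_signal(pid) for pid in pids)
```

The result is only a snapshot: a process may exit, or its ownership may change, between the probe and the actual checkpoint.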

Page last modified: Fri, 13 Apr 2007 18:38:38 GMT
Page URL: http://www.nersc.gov/nusers/systems/bassi/running_jobs/checkpoint.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
