Checkpointing Brings Reliability to Massively Parallel Processing

In August 1997, NERSC achieved a milestone in high-performance computing: successfully stopping and restarting a number of scientific computing jobs on a Cray T3E supercomputer without losing any data or interrupting the computations. This accomplishment opened a new era of robust, reliable, production-mode MPP computing.

Called "checkpointing," the stop/restart procedure -- achieved twice in one week -- is believed to be the first of its kind accomplished on an MPP system. Checkpointing involves bringing all of the programs running on the computer to the same stage and stopping them, recording all of their information, and transferring that information out of the machine. Then, after the system has been used for something else, the data is put back in and all the programs are set running again -- all on a machine capable of carrying out tens of billions of operations per second.
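
The stop, record, transfer, restore, and resume sequence described above can be illustrated with a toy sketch in Python. This is not how the T3E's system software worked; the job representation and the function names (new_job, run, checkpoint, restart, demo) are hypothetical, invented only to show the idea that a job stopped at a consistent point can be saved, moved off the machine, and later resumed with the same result.

```python
import json

def new_job(n):
    # Hypothetical toy "program": sums the integers 0..n-1 one step at a time.
    return {"n": n, "i": 0, "total": 0}

def step(job):
    # Advance one job by a single unit of work, if it has any left.
    if job["i"] < job["n"]:
        job["total"] += job["i"]
        job["i"] += 1

def run(jobs, max_steps=None):
    # Run all jobs in lockstep; stop early after max_steps rounds if given.
    steps = 0
    while any(j["i"] < j["n"] for j in jobs):
        if max_steps is not None and steps >= max_steps:
            break
        for j in jobs:
            step(j)
        steps += 1

def checkpoint(jobs):
    # Every job is stopped at a step boundary, so the saved state is consistent.
    # The returned string stands in for data transferred out of the machine.
    return json.dumps(jobs)

def restart(blob):
    # Put the saved state back in; the jobs can then continue where they stopped.
    return json.loads(blob)

def demo():
    jobs = [new_job(10), new_job(100), new_job(1000)]
    run(jobs, max_steps=50)      # run for a while, then stop everything
    blob = checkpoint(jobs)      # record the state and move it out
    del jobs                     # the machine is now free for other work
    restored = restart(blob)     # later: load the state back in
    run(restored)                # and let the jobs run to completion
    return [j["total"] for j in restored]

if __name__ == "__main__":
    print(demo())
```

In this sketch the restarted jobs finish with the same totals they would have produced had they never been interrupted, which is the property the NERSC demonstration achieved for real applications without requiring them to be reprogrammed.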

Successful checkpointing will allow the NERSC staff to use the T3E-900's 512 processors more efficiently by moving jobs between the processors and making larger pools of processors available quickly for bigger jobs. It will also allow NERSC to make the entire 512-processor computer available to tackle a single, complex problem when necessary, as well as carry out upgrades and maintenance without disrupting the work of researchers.

The checkpointing procedure was successfully demonstrated on both of NERSC's T3Es, the 512-processor T3E-900 and the 96-processor T3E-600. The restarted jobs were running on clusters ranging from 16 to 256 processors. After the machines were brought back online, James Craw, head of the NERSC Systems Group, commented, "It's kind of ironic that we achieved this major milestone and none of our users noticed -- which was our objective."

"As far as I know, no other MPP system is planning to do system-wide checkpoint/restart without having to reprogram applications," said Bill Kramer, Deputy Director of NERSC. "This is really a momentous step for those of us in the high-performance computing community."


