National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 

Historical: Running on Seaborg
Seaborg Decommissioned January 2008

Seaborg Batch Jobs

This document describes LoadLeveler, the IBM SP batch queuing system, as implemented at NERSC. It explains how to submit batch jobs, how to monitor their progress after submission, and how to delete unwanted jobs. Example scripts are included that can be used as templates for customization.


Overview

All production runs on the SP should be done through the batch system. IBM's batch system software is called LoadLeveler. Since interactive resources are scarce on this system, most development and debugging runs should also be done in batch. This page describes how to use LoadLeveler.

To submit a job to LoadLeveler, create a batch script that contains the job's requirements, then submit the script to the batch system. You can then monitor the job as it progresses through the queue and delete unwanted jobs. Jobs may be submitted at various NERSC charge priority values.

Here is a list of terms commonly used in reference to LoadLeveler.

node

A Seaborg node contains 16 CPUs. A batch job has exclusive use of each node it is allocated and is charged for all 16 CPUs on that node, even if it runs fewer than 16 tasks per node.

SP processors within a node share a common pool of memory.

keyword

A LoadLeveler batch script consists of UNIX shell commands and LoadLeveler keywords. See Loadleveler keywords used at NERSC. A complete list of all LoadLeveler keywords is available in IBM's documentation, but not all available keywords are relevant or useful on the NERSC SP.

job ID

LoadLeveler assigns an ID to every submitted job. The term "Job ID" refers to the identifier used to monitor or delete a job.

class

A job is submitted to a specific submit class and is then routed to a LoadLeveler class. The user puts the submit class into the batch script, and LoadLeveler determines which LoadLeveler class the job should be routed to. Each class has associated run time and node limits, and a user should select the class that best fits the job's requirements.

Historical: Class Info and Policies for Seaborg

Classes and Job Scheduling

All LoadLeveler jobs must be submitted to a valid submit class. If the class doesn't exist, no error message is issued: the job is accepted but sits in the queue indefinitely. If this happens, delete the job using the llcancel command and resubmit it to an available class.

NERSC users specify one of the following submit classes to queue jobs. Upon submission the job is routed to the appropriate LoadLeveler class according to the following criteria. (Users cannot directly access the LoadLeveler classes.)

Submit Class  Job Type         LL Class (note 1)  Nodes        Max Wallclock  Relative Priority (note 3)  Class Charge Factor
------------  ---------------  -----------------  -----------  -------------  --------------------------  -------------------
interactive   parallel         interactive        1-8          30 mins        1                           1
debug         parallel         debug              1-24         30 mins        2                           1
premium       parallel         premium            1-299        24 hrs         3                           2
regular       parallel         reg_1              1-31         12 hrs         5                           1
              parallel         reg_1l             1-31         24 hrs         5                           1
              parallel         reg_32             32-47        48 hrs         5                           1
              parallel         reg_48             48-255       48 hrs         4                           .5 (see note 4)
              parallel         reg_256            256-299      48 hrs         4                           .5 (see note 4)
              parallel         reg_300 (note 6)   300-ALL      48 hrs         see note 6                  .5 (see note 4)
              serial (note 2)  reg_s              1 processor  12 hrs         see note 5                  see note 7
low           parallel         low                1-299        12 hrs         6                           .5
xfer          serial           xfer               1 processor  4 hrs          see note 5                  see note 7

Notes

1 - Jobs must be submitted to the "Submit Class," not the "LL Class".

2 - Serial jobs must be submitted with a job_type of "serial" and a class of "regular". See Running Serial Jobs with Loadleveler for more information on the serial job classes.

3 - The priorities listed in the table are relative. NERSC assigns priorities in terms of "equivalent days waiting in the queue". For example, a premium job on Seaborg receives a two-day boost over the largest regular class jobs. You can view typical class and queue wait times on the job stats webpage. In addition to the relative priority given to jobs depending on their LoadLeveler class, users with large allocations who need a certain turnaround in order to use their allocation may be eligible to receive a scheduling "boost".

4 - Jobs that run in the reg_48, reg_256, and reg_300 classes are discounted 50% of the regular rate.

5 - The transfer (xfer) class runs on a dedicated node, and thus does not compete with the other classes.

6 - The reg_300 class will usually have a run limit of zero. NERSC staff will monitor this class and make special arrangements to run jobs of this size.

7 - The serial and xfer jobs are only charged for running on 1 processor, not the entire node.

See also Queue Policies for information on run limits and Using large-memory nodes for information on specifying memory requirements.
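The routing of parallel jobs submitted to the "regular" submit class can be sketched in a few lines of Python. This is only an illustration of the table above: the thresholds are copied from the table, while the function name `route_regular` and the handling of the exact boundary values are assumptions, not NERSC's actual configuration.

```python
# Sketch of how a parallel job in the "regular" submit class maps to an
# LL class, using the node and wall clock ranges from the class table.
# route_regular is a hypothetical helper, not a NERSC or IBM tool.

def route_regular(nodes: int, wall_hours: float) -> str:
    """Return the LL class a parallel 'regular' job would be routed to."""
    if nodes < 1:
        raise ValueError("a job needs at least one node")
    if nodes <= 31:
        # reg_1 allows up to 12 hours; longer small jobs go to reg_1l.
        return "reg_1" if wall_hours <= 12 else "reg_1l"
    if nodes <= 47:
        return "reg_32"
    if nodes <= 255:
        return "reg_48"
    if nodes <= 299:
        return "reg_256"
    return "reg_300"


if __name__ == "__main__":
    for nodes, hours in [(8, 6), (8, 18), (40, 24), (128, 48), (300, 12)]:
        print(nodes, hours, route_regular(nodes, hours))
```

The other submit classes (interactive, debug, premium, low, xfer) each map to a single LL class of the same name, so they need no such logic.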


You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.

If you request more wall clock time than the class allows (the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit the requested time, a default of 30 minutes is used.
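As a minimal sketch of this adjustment rule, assuming times are expressed in minutes (the function and constant names are hypothetical, chosen only for illustration):

```python
# Sketch of the wall clock adjustment described above: requests above the
# class maximum are cut back to the maximum, and a missing request falls
# back to the 30-minute default. Names are illustrative assumptions.

DEFAULT_MINUTES = 30  # default stated in the text when no limit is given


def effective_wall_clock(requested_minutes, class_max_minutes):
    """Return the wall clock limit (in minutes) the job would run with."""
    if requested_minutes is None:
        return DEFAULT_MINUTES
    return min(requested_minutes, class_max_minutes)


if __name__ == "__main__":
    # A 36-hour request in a class with a 24-hour maximum.
    print(effective_wall_clock(36 * 60, 24 * 60))
```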

You must explicitly specify the number of nodes with the keyword "#@ node = "; otherwise your job will run on one node.

Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See Accounting on the SP.

The classes are configured to give the best service to premium and regular jobs. Users are expected to use the regular class for at least 80 percent of their jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.
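To compare the relative cost of classes, the following Python sketch multiplies the processor-hours a job occupies by the class charge factor from the table. It is an assumption-laden illustration only: the real NERSC accounting formula is described in Accounting on the SP and may include factors not modeled here, and the names `CHARGE_FACTOR` and `relative_charge` are hypothetical.

```python
# Relative cost sketch: whole nodes are charged (16 CPUs each), scaled by
# the class charge factor from the class table. This is not the official
# accounting formula; it only compares classes against each other.

CPUS_PER_NODE = 16

CHARGE_FACTOR = {
    "interactive": 1.0, "debug": 1.0, "premium": 2.0,
    "reg_1": 1.0, "reg_1l": 1.0, "reg_32": 1.0,
    "reg_48": 0.5, "reg_256": 0.5, "reg_300": 0.5,
    "low": 0.5,
}


def relative_charge(ll_class, nodes, wall_hours):
    """Processor-hours occupied by a parallel job times the class factor."""
    return nodes * CPUS_PER_NODE * wall_hours * CHARGE_FACTOR[ll_class]


if __name__ == "__main__":
    # The same 4-node, 2-hour job costs twice as much in premium.
    print(relative_charge("reg_1", 4, 2), relative_charge("premium", 4, 2))
```

Serial and xfer jobs are left out because they are charged for one processor rather than a whole node (note 7).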

LoadLeveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time of the job that is scheduled to begin next. The technique is advantageous from both a user and a system perspective: it allows faster turnaround for shorter jobs, and it maximizes system usage.
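The idea behind backfilling can be illustrated with a simplified Python sketch. It assumes a conservative rule: an idle job may start now only if it fits in the currently free nodes and its requested wall clock limit ends before the next scheduled job is due to start. The real LoadLeveler scheduler is more elaborate, and every name here (`Job`, `backfill_candidates`) is hypothetical.

```python
# Simplified backfill sketch. Assumption: a job is safe to backfill only if
# it fits in the free nodes AND finishes (by its requested limit) before the
# next scheduled job needs to start. Not LoadLeveler's actual algorithm.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    wall_hours: float  # requested wall clock limit


def backfill_candidates(idle_jobs, free_nodes, hours_until_next_start):
    """Return names of idle jobs that could start now without delaying
    the job that is scheduled to begin next."""
    started = []
    for job in idle_jobs:  # assumed to be in priority order already
        if job.nodes <= free_nodes and job.wall_hours <= hours_until_next_start:
            started.append(job.name)
            free_nodes -= job.nodes
    return started


if __name__ == "__main__":
    queue = [Job("big", 64, 12.0), Job("small", 4, 1.0),
             Job("long", 2, 6.0), Job("tiny", 4, 0.5)]
    print(backfill_candidates(queue, free_nodes=10, hours_until_next_start=2.0))
```

This is why an accurate, short wall_clock_limit can help a job start sooner: the shorter the request, the more gaps it fits into.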

Running Serial Jobs with Loadleveler

There are two LoadLeveler serial classes on Seaborg, serial and xfer. Jobs run in these classes are charged only for one processor's actual wall clock time. These classes are designed for

  • preprocessing data needed by large parallel runs
  • postprocessing data produced by large parallel runs
  • transferring data between Seaborg, HPSS and NERSC servers (such as the visualization server)
  • transferring data between the user's home site and NERSC.

These two classes typically have very short queue wait times, usually less than 10 minutes, so if you submit a job to one of these queues as the last statement of a large parallel job, you can expect the serial job to start almost immediately.

These two classes are not designed for single node multi-threaded jobs, e.g. OpenMP. Jobs of this type should be submitted with these LoadLeveler keywords:

#@ job_type = parallel
#@ node            = 1

Serial Job Class

This class is for serial pre- and post-processing of data. The jobs in this class share a single 16 GB node, and at most 16 jobs can share the node at the same time. The class has wall clock and CPU limits of 12 hours and a memory limit of 1 GB.

You can make use of the serial class by specifying the following LoadLeveler keywords:
#@ job_type = serial
#@ class = regular

xfer Job Class

This class is intended specifically for data transfer jobs. The run limit for the class is 8 jobs, which share a node specially configured for good network access to HPSS. The class has wall clock and CPU limits of 4 hours and a memory limit of 1 GB. (It takes approximately 3 hours to transfer 1 terabyte of data from Seaborg to HPSS.)
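For planning purposes, the quoted rate can be turned into a rough estimate of whether a transfer fits within one xfer job. The Python sketch below assumes that transfer time scales linearly with data volume, which real transfers may not do; the names are hypothetical.

```python
# Rough transfer-time estimate from the figures in the text: about 3 hours
# per terabyte and a 4-hour wall clock limit for the xfer class.
# Linear scaling is an assumption made for illustration only.

HOURS_PER_TB = 3.0      # approximate rate quoted in the text
XFER_LIMIT_HOURS = 4.0  # wall clock limit of the xfer class


def estimated_transfer_hours(terabytes):
    """Estimated hours needed to move the given volume to HPSS."""
    return terabytes * HOURS_PER_TB


def fits_in_one_xfer_job(terabytes):
    """True if the estimate stays within the xfer wall clock limit."""
    return estimated_transfer_hours(terabytes) <= XFER_LIMIT_HOURS


if __name__ == "__main__":
    for size in (0.5, 1.0, 2.0):
        print(size, estimated_transfer_hours(size), fits_in_one_xfer_job(size))
```

Under these assumptions, larger data sets would need to be split across several xfer jobs.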

Note that you can submit an xfer job from within your parallel batch script, so that a parallel job can start a transfer job to move its files. The serial data transfer job is initiated when the XCPU signal is sent (soft_limit) or the parallel computational program completes.

This class is specifically intended only for data transfer and not for any other type of serial job. There is a discussion with examples of using HPSS in a batch job at Accessing HPSS - Batch Jobs.

You can make use of the xfer class by specifying the following LoadLeveler keywords:

#@ job_type = serial
#@ class = xfer

Interactive/Debug Schedule

  • From 5AM PST to 6PM PST, 16 nodes from the compute pool are reserved for debug and interactive use only. This should allow good turnaround time for debug and interactive work. Note that due to limitations in LoadLeveler, these nodes cannot be guaranteed to be available at 5AM. NERSC will make the best effort possible to meet the 5AM availability.
  • Independent of these reserved nodes, interactive and debug jobs may run on any nodes throughout the cluster as they become available.

NERSC Queue Policies for Seaborg

  • For the production batch classes, each user may have:
    • 3 jobs running (this parameter can be adjusted depending on system load).
    • 4 jobs in Idle state (jobs queued to run; this parameter can be adjusted depending on system load).
    If you have 4 jobs queued (in Idle state) and need to run an Interactive or Debug job, place one of your jobs on User Hold: llhold jobid. To requeue the job: llhold -r jobid.
  • The combined number of debug and interactive jobs that a user may have submitted or running at a given time must be two or fewer. Note that this policy only applies to jobs run in the interactive and debug batch classes. This includes parallel jobs (anything compiled with one of the "mp" compilers, e.g. mpxlf90) that are executed from the command line, as well as those jobs (parallel or otherwise) that are explicitly submitted to these two classes with the "llsubmit" command. The policy has NO effect on sequential programs executed from the command line, including all the normal Unix commands.
  • The class run limit for reg_1l (regular jobs using 1 to 31 nodes and requesting more than 12 wall hours) is 15 jobs running. There are no other class run limits.
  • Any job that has been in the queue for 7 days or more, and is in the "user hold" (LoadLeveler status HU) state, will be removed from the system. Note that this means:
    • Jobs may not be held for more than 7 days; and
    • Jobs older than 7 days may not be held.
  • A 60 minute time limit is enforced on all user processes on the login nodes.
  • Seaborg is occasionally removed from service for scheduled maintenance. Users will be given seven days' notice before such events, usually in the "Message of the Day" (MOTD), which is displayed upon login and is also available online. The handling of the production workload depends on the nature of the downtime:
    • In most cases, NERSC will checkpoint all running jobs before the maintenance period, and restart them after the maintenance is complete. Jobs that cannot be checkpointed will be killed.
    • In rare cases where the maintenance involves updates to certain runtime components, restarting from checkpoint files is not possible. In this case, 48 hours before the scheduled downtime LoadLeveler will be configured to only start jobs that will complete before the downtime, based on jobs' requested wall clock time limits. As the downtime draws nearer, shorter jobs will be started. Thus for two days before a scheduled downtime, jobs will start in an order that does not necessarily reflect their submit order.
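The drain rule in the last bullet can be sketched as a simple check in Python. The 48-hour window comes from the text; the function name and the treatment of the exact boundary are assumptions for illustration.

```python
# Sketch of the pre-downtime drain: inside the 48-hour window, a job may
# start only if its requested wall clock limit ends before the downtime.
# may_start_before_downtime is a hypothetical helper.

def may_start_before_downtime(requested_hours, hours_until_downtime,
                              drain_window_hours=48.0):
    """True if a job with this requested limit could still be started."""
    if hours_until_downtime > drain_window_hours:
        return True  # outside the drain window: normal scheduling applies
    return requested_hours <= hours_until_downtime


if __name__ == "__main__":
    # 36 hours before the downtime, a 24-hour job can start, a 48-hour job cannot.
    print(may_start_before_downtime(24, 36), may_start_before_downtime(48, 36))
```

Requesting a realistic wall clock limit therefore matters most in the two days before a scheduled downtime.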

Common commands

Here's a list of commonly used Loadleveler commands with brief descriptions. Detailed information on these commands can be obtained by typing:

% man commandname

Command   Description
--------  -----------
llsubmit  Submits a script file to LoadLeveler for execution.
llqs      Lists all LoadLeveler jobs, including requested wall clock time and number of nodes.
llq       Same as llqs but with less information.
llcancel  Deletes a job from the LoadLeveler queue.
llclass   Displays information about the configured classes.
llstatus  Displays the status of individual nodes on the SP.

The llqs command is specific to NERSC. It has a number of options; please see the man page.

The llq command has largely been superseded by llqs, but it is still useful for details about a particular job. See Monitoring jobs for more information.

Creating batch jobs

To submit a batch job to Loadleveler you need to create a script file containing the commands that make up the batch job. This file must contain the following types of commands:

  • Loadleveler keywords embedded in the beginning of the script file. They provide Loadleveler with information needed to process the job.
  • Any AIX commands that can be entered interactively, including shell commands and user-created programs.

Here is an example of a script file named myjob that runs a program named a.out when submitted from Seaborg:

#@ job_name        = myjob
#@ account_no      = repo_name
#@ output          = myjob.out
#@ error           = myjob.err
#@ job_type        = parallel
#@ environment     = COPY_ALL
#@ notification    = complete
#@ network.MPI     = csss,not_shared,us
#@ node_usage      = not_shared
#@ class           = regular
#
#
#@ tasks_per_node  = 16
#@ node            = 1
#@ wall_clock_limit= 01:00:00
#
#@ queue

./a.out 

A program to check a LoadLeveler script for simple errors is available on Seaborg.

% ll_check_script script_name

Loadleveler keywords used at NERSC

For most jobs run on Seaborg, NERSC suggests that the following keywords be specified. Some of these keywords may have default values, but NERSC considers them to be undefined, and you should never rely on their existence or persistence.

A complete description of Loadleveler keywords can be found in IBM's documentation.

Required LoadLeveler Keywords
#@ job_type = parallel
    Specifies that your job is parallel.
#@ node = N
    Specifies that your job will use N nodes.
#@ network.MPI = csss,not_shared,US
    Specifies communication protocol and adapters. Please see the LL documentation if you wish to use LAPI.
#@ class = class
    Specifies that your job should be run in the submit class named class.
#@ tasks_per_node = N
    Specifies the number of tasks that will run on each node. To run a different number of tasks on different nodes, see Selecting the number of tasks and nodes.
#@ wall_clock_limit = HH:MM:SS
    Specifies that your job requires HH:MM:SS of wall clock time.
#@ queue
    Places your job in the queue. It must be the last keyword specified. Any keywords placed after this in the script are ignored by the current job step.

Recommended Loadleveler Keywords
#@ node_usage = not_shared
    Specifies that your job should not share nodes with other jobs. This is the default and cannot be overridden.
#@ job_name = jobname
    Specifies the name of your job.
#@ account_no = repo_name
    Specifies the repository that will be charged. If omitted, your default repository will be used.
#@ notification = complete
    Specifies that email should be sent upon completion of the job. Other options include always, error, start, and never. If not specified, you receive no notification.
#@ shell = /usr/bin/shell
    Specifies the shell that parses the script. If not specified, your login shell will be used.

Other useful LoadLeveler keywords
#@ output = myjob.out
    Specifies the name and location of STDOUT. If not given, the default is /dev/null. The default directory is the submitting directory.
#@ error = myjob.err
    Specifies the name and location of STDERR. If not given, the default is /dev/null. The default directory is the submitting directory.
#@ environment = COPY_ALL
    Specifies that all environment variables from your shell should be used. You can also list individual variables, separated with semicolons.
#@ restart = no
    Specifies that LoadLeveler should not requeue your job if it must vacate a node on which it is running.

If you are using a script that was used on another (non-SMP) SP, beware of the following keywords:

  • #@ max_processors
  • #@ min_processors

These keywords specify the maximum/minimum number of nodes (not processors!) requested for a parallel job, regardless of the number of processors contained in the node. The node keyword should be used instead.

Selecting the number of tasks and nodes

You must specify the number of tasks and number of nodes you wish to use. Remember that Seaborg has a maximum of 16 tasks per node; if you ask for more than 16 tasks per node (either explicitly or implicitly) your job will never run.

You usually specify numbers of tasks and nodes by including two of the following three LoadLeveler keywords in your batch script:

  • #@ node = desired number of nodes
  • #@ total_tasks = desired number of tasks
  • #@ tasks_per_node = desired number of tasks per node; up to a maximum of 16 on Seaborg

For example, to run on 4 nodes and pack the nodes with 16 MPI tasks per node, in your LoadLeveler script use:

#@ node = 4
#@ tasks_per_node = 16

To use a number of tasks that is not a multiple of 16, do not set the tasks_per_node keyword. For 29 tasks on 3 nodes, you would use:

#@ node = 3
#@ total_tasks = 29

LoadLeveler will allocate tasks on the nodes in a round-robin fashion.
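For the 29-task example, a round-robin placement can be sketched in Python as follows. The text only states that allocation is round-robin; the exact assignment of task IDs to nodes shown here (task i on node i mod N) is an assumption, and `round_robin` is a hypothetical helper.

```python
# Sketch of round-robin task placement. Assumption: task i is placed on
# node (i mod number_of_nodes); LoadLeveler's actual ordering may differ.

def round_robin(total_tasks, nodes):
    """Return one list of task IDs per node."""
    layout = [[] for _ in range(nodes)]
    for task in range(total_tasks):
        layout[task % nodes].append(task)
    return layout


if __name__ == "__main__":
    layout = round_robin(29, 3)
    # Tasks per node; each count must stay at or below 16 on Seaborg.
    print([len(tasks) for tasks in layout])
```

With 29 tasks on 3 nodes no node exceeds the 16-task limit, whereas 29 tasks on a single node would never run.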

It is also possible to select which tasks run together on a node. However, you must be very careful to completely specify all tasks and the correct number of nodes. Do not use any of the three keywords listed above. Instead, to run 8 tasks on 3 nodes with task 0 on one node, odd tasks on another node and the remaining even tasks on the third node, use:

#@ task_geometry = {(0)(1,3,5,7)(2,4,6)}

For more about task assignment considerations, see the IBM LoadLeveler documentation.
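Because an incomplete task_geometry is easy to write by accident, a small check can help before submitting. The Python sketch below is not a NERSC or IBM tool. It assumes that task IDs must cover 0 to N-1 exactly once and that no node may hold more than 16 tasks; the helper names are hypothetical.

```python
# Sketch of a task_geometry sanity check. Assumptions: every task ID from
# 0 to N-1 appears exactly once, and a node holds at most 16 tasks.
import re


def parse_task_geometry(value):
    """Parse '{(0)(1,3,5,7)(2,4,6)}' into [[0], [1, 3, 5, 7], [2, 4, 6]]."""
    text = value.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("task_geometry must be wrapped in braces")
    groups = re.findall(r"\(([^()]*)\)", text[1:-1])
    return [[int(task) for task in group.split(",")] for group in groups]


def check_task_geometry(value, max_tasks_per_node=16):
    """Return (number_of_nodes, number_of_tasks) or raise ValueError."""
    nodes = parse_task_geometry(value)
    tasks = sorted(task for node in nodes for task in node)
    if tasks != list(range(len(tasks))):
        raise ValueError("task IDs must cover 0..N-1 exactly once")
    if any(len(node) > max_tasks_per_node for node in nodes):
        raise ValueError("too many tasks on one node")
    return len(nodes), len(tasks)


if __name__ == "__main__":
    print(check_task_geometry("{(0)(1,3,5,7)(2,4,6)}"))
```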

Using large-memory nodes

Each Seaborg node has a minimum of 16 GB of memory and a maximum of 64 GB. In the current configuration there are 4 compute nodes with 64 GB of memory and 64 nodes with 32 GB. The remaining 312 compute nodes have 16 GB.

You may set a "requirement" in your batch script that will run your job only on a node with a specified amount of memory. Be sure you understand the information on the Memory Management on the SP web page before targeting your job to run on large-memory nodes.

To use the 64-GB nodes exclusively (maximum of 4 nodes), use this line in your LoadLeveler script:

#@ requirements = (Memory == 65536)

To use the 32-GB nodes (maximum of 64 nodes), use:

#@ requirements = (Memory >= 32768)

Submitting Jobs

Once you have created a script file, it can be submitted to Loadleveler for execution using the llsubmit command:

% llsubmit script_file
llsubmit: The job "s01007.nersc.gov.101" has been submitted.

script_file specifies the name of the script file containing commands that comprise the batch job and keywords which control various aspects of the job, e.g., resource limits, output files, mail messages, etc.

You do not specify a job priority in addition to the class. The class will contain the name of the priority at which the job will be charged. See Class info and job scheduling.

There are system limits on the number of jobs a user can have in various states. See NERSC Queue Policies for Seaborg.

Job steps and dependencies

LoadLeveler allows you to define "job steps" in a single batch script. Job steps are essentially different sets of instructions on how to run the job. They are useful because one set of instructions can be made to execute depending on the results of a previous set of instructions. This is referred to as a dependency.

See the LoadLeveler documentation for full details. Here we describe how to run an executable based on whether or not a previous program completed successfully. This strategy might be used by a code that writes checkpoint files at the end of a run and then uses those files as input for the next run.

You define steps in a LoadLeveler script by naming them with the keyword definition

#@ step_name = step_name

followed by other definitions for that step and finally a "queue" directive. Additional steps follow.

Some LoadLeveler directives must be included in each step. A typical first job step for an MPI code would look like this:

#@ step_name    = step1
#@ job_type = parallel
#@ node = 8
#@ tasks_per_node = 16 
#@ network.MPI = csss,not_shared,us
#@ executable = a.out
#@ queue

The second job step contains a dependency, as specified with the "dependency" keyword. Here we assume that if the first job step is successful it returns an exit code of 0 (zero). Here is a typical second job step:

#@ step_name    = step2
#@ dependency   = (step1==0)
#@ job_type = parallel
#@ node = 4
#@ tasks_per_node = 16 
#@ network.MPI = csss,not_shared,us
#@ executable = b.out
#@ queue

The second job step will only be run if step1 has an exit code of 0. The executables a.out and b.out can themselves be executable shell scripts.
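When a run has to be chained many times, for example a series of checkpoint and restart runs, the step definitions can be generated rather than typed. The following Python sketch prints step blocks in the same form as the two examples above, each depending on a zero exit code from the previous step. It uses only keywords that appear in this document; the helper names `render_step` and `render_chain` are hypothetical.

```python
# Sketch: generate chained LoadLeveler job steps, each one depending on a
# zero exit code from the step before it. Only keywords shown in this
# document are emitted; the helpers themselves are illustrative.

def render_step(name, executable, nodes, tasks_per_node, depends_on=None):
    """Return the keyword lines for one job step as a single string."""
    lines = [f"#@ step_name = {name}"]
    if depends_on is not None:
        lines.append(f"#@ dependency = ({depends_on}==0)")
    lines += [
        "#@ job_type = parallel",
        f"#@ node = {nodes}",
        f"#@ tasks_per_node = {tasks_per_node}",
        "#@ network.MPI = csss,not_shared,us",
        f"#@ executable = {executable}",
        "#@ queue",
    ]
    return "\n".join(lines)


def render_chain(steps):
    """steps: list of (name, executable, nodes, tasks_per_node) tuples,
    to be run one after another."""
    blocks, previous = [], None
    for name, executable, nodes, tasks in steps:
        blocks.append(render_step(name, executable, nodes, tasks, previous))
        previous = name
    return "\n#\n".join(blocks)


if __name__ == "__main__":
    print(render_chain([("step1", "a.out", 8, 16), ("step2", "b.out", 4, 16)]))
```

The generated text would still need the job-wide keywords (class, wall_clock_limit, output, and so on) placed ahead of the first step.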

You can also include "if" statements in your LoadLeveler script itself to take different actions for different job steps. Here is a full example script that runs the toy "flip" program introduced in the Introduction to MPI tutorial.

#LL WWW Script Generator. Richard Gerber, NERSC
#@ class = debug       
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 0:05:00
#@ notification = always 
#@ output = $(host).$(jobid).$(stepid).out
#@ error = $(host).$(jobid).$(stepid).err
#@ environment = COPY_ALL
#
#@ step_name    = step1
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node =  4
#@ network.MPI = csss,not_shared,us 
#@ queue
#
#@ step_name    = step2
#@ dependency   = (step1==0)
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node =  2
#@ network.MPI = csss,not_shared,us 
#@ queue


if ($LOADL_STEP_NAME == "step1") then
        ./flip
else if ($LOADL_STEP_NAME == "step2") then
        echo "Step 1 returned exit code of 0."
        echo "100500" >flip.in
        ./flip
else
        echo "Error: no value for step name."
endif 
exit

Monitoring jobs

To monitor your job after submission, you can use the llqs command:

seaborg%  llqs 
Step Id        JobName      UserName  Class  ST NDS WallClck Submit Time
-------------- ------------ -------- ------- -- --- -------- -----------
s01007.1087.0   a240         buffy    regular R   32 00:31:44  3/13 04:30
s01001.529.0    s1.x         willow   regular R   64 00:28:17  3/12 21:45
s01001.578.0    xdnull       xander   debug   R    5 00:05:19  3/14 12:44
s01009.929.0    s01009.ners  spike    regular R  128 03:57:27  3/13 05:17
s01001.530.0    s2.x         willow   regular I   64 04:00:00  3/12 21:48
s01001.532.0    s3.x         willow   regular I   64 04:00:00  3/12 21:50
s01001.533.0    y1.x         willow   regular I   64 04:00:00  3/12 22:17
s01001.534.0    y2.x         willow   regular I   64 04:00:00  3/12 22:17
s01001.535.0    y3.x         willow   regular I   64 04:00:00  3/12 22:17
s01001.537.0    s01001.nerg  spike    regular I  128 02:30:00  3/13 06:10
s01009.930.0    s01009.nerg  spike    regular I  128 02:30:00  3/13 07:17

This information can also be found on the Seaborg Batch Queue Status page.

The jobs are listed in order from highest priority to lowest priority. The first column contains the job ID, which differs from the identifier printed by the llsubmit command: llsubmit returns the full node name, including the nersc.gov domain, which the first column of the above output omits. Otherwise the Job IDs are identical.
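A script that submits a job and later looks for it in the queue listing has to bridge these two forms. The Python sketch below extracts the identifier from the llsubmit message shown earlier and strips the domain. The message format is taken from the example in this document; the function names are hypothetical, and the trailing step number visible in the sample listing (such as .0) is not added.

```python
# Sketch: convert the identifier printed by llsubmit into the short form
# used in the queue listing by dropping the nersc.gov domain.
import re


def job_id_from_llsubmit(message):
    """Extract the quoted job identifier from an llsubmit message."""
    match = re.search(r'The job "([^"]+)" has been submitted', message)
    if match is None:
        raise ValueError("no job identifier found in message")
    return match.group(1)


def short_job_id(llsubmit_id, domain=".nersc.gov"):
    """'s01007.nersc.gov.101' -> 's01007.101'."""
    host, sep, number = llsubmit_id.rpartition(".")
    if not sep or not number.isdigit():
        raise ValueError(f"unexpected job identifier: {llsubmit_id!r}")
    if host.endswith(domain):
        host = host[: -len(domain)]
    return f"{host}.{number}"


if __name__ == "__main__":
    msg = 'llsubmit: The job "s01007.nersc.gov.101" has been submitted.'
    print(short_job_id(job_id_from_llsubmit(msg)))
```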

The second and third columns contain the job and owner names. The fourth lists which class the job was submitted to. The llclass command lists the available classes and their limits. Jobs submitted to the premium class will be scheduled faster but charged at a higher rate. Please see Priority Charging for more information.

The fifth column is the Loadleveler job state, which is usually one of the following:

ST  Job State       Description
--  --------------  -----------
R   Running         The job is currently running.
I   Idle            The job is being considered to run.
NQ  Not Queued      The job is not currently being considered to run. This is probably because you have submitted more than 10 jobs.
ST  Starting        The job is starting to run.
RP  Remove Pending  The job is in the process of being removed.
HU  User Hold       The user put the job on hold. You must issue the llhold -r command in order for it to be considered for scheduling.
HS  System Hold     The job was put on hold by the system. This is probably because you are over disk quota in $HOME.

There are other possible states, but they occur infrequently. Please check the Using and Administering Loadleveler guide for more information.

The rest of the columns list the requested number of nodes, the wall clock time, and submission time information. If the job is running, then the wall clock time is the time remaining before the job will finish.
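Output in this column layout is easy to post-process, for example to count how many of each user's jobs are idle. The Python sketch below parses lines in the format of the sample listing above. It assumes whitespace-separated fields and job names without spaces, and the field names are labels chosen here rather than anything defined by llqs.

```python
# Sketch: parse llqs-style output (format taken from the sample listing in
# this document) and count jobs per user in a given state.
# Assumption: fields are whitespace-separated and job names have no spaces.
from collections import Counter

SAMPLE = """\
Step Id        JobName      UserName  Class  ST NDS WallClck Submit Time
-------------- ------------ -------- ------- -- --- -------- -----------
s01007.1087.0   a240         buffy    regular R   32 00:31:44  3/13 04:30
s01001.529.0    s1.x         willow   regular R   64 00:28:17  3/12 21:45
s01001.530.0    s2.x         willow   regular I   64 04:00:00  3/12 21:48
s01001.532.0    s3.x         willow   regular I   64 04:00:00  3/12 21:50
s01001.537.0    s01001.nerg  spike    regular I  128 02:30:00  3/13 06:10
"""

FIELDS = ("step_id", "job_name", "user", "job_class", "state",
          "nodes", "wall_clock", "submit_date", "submit_time")


def parse_llqs(text):
    """Return one dict per job line, skipping the header and rule lines."""
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(FIELDS) or parts[0].startswith(("Step", "-")):
            continue
        job = dict(zip(FIELDS, parts))
        job["nodes"] = int(job["nodes"])
        jobs.append(job)
    return jobs


def count_by_state(jobs, state):
    """Count jobs per user that are in the given state, e.g. 'I' for Idle."""
    return Counter(job["user"] for job in jobs if job["state"] == state)


if __name__ == "__main__":
    print(count_by_state(parse_llqs(SAMPLE), "I"))
```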

Another useful command is llq with the -l option. This will display extensive information about one or all of your jobs. You can only use it on jobs you submitted. Please see the man page for details.

Using phost to determine node IDs

NERSC has written a utility, phost, that can be used to record which tasks ran on which node.

Deleting jobs

The llcancel command is used to delete a job.

% llcancel joblist

The joblist contains the list of Job IDs you want to delete. You can get these IDs from the first column of the output from either the llqs or the llq command:

seaborg%  llq 
Id                  Owner      Submitted   ST PRI Class        Running On 
------------------- ---------- ----------- -- --- ------------ -----------
 s01015.137.0       bones      10/7  12:05 R  50  regular       s00608    
 s01005.84.0        spock      10/7  16:51 R  50  regular       s01816    
 s01015.177.0       jimkirk    10/7  17:01 R  50  interactive   s00713    
 s01015.174.0       sulu       10/7  16:54 R  50  low           s01601    
 s01015.176.0       uhura      10/7  17:00 R  50  debug         s00101    

5 job steps in queue, 0 waiting, 0 pending, 5 running, 0 held

For example, if user spock wanted to delete a job, the command would be:

% llcancel s01005.84.0
llcancel: Cancel command has been sent to the central manager.

The llcancel command works for both running and queued jobs.

Account charging

By default, job usage is charged to your default repository. Use the NERSC Information Management (NIM) system to view your default repo.

You can specify the repo to be charged in your LoadLeveler script. Use this keyword:

#@ account_no = repo_name

Your job charge depends on which class you specify. In general, on Seaborg one submits a job to the class which contains the name of the desired NERSC batch charging priority. For example, a premium job is submitted to the class named "premium". Interactive and debug jobs are charged at the same rate as regular class jobs.

See Accounting on the SP for information about how accounts are charged.

Example batch script

Here is an example batch script.

#@ job_name        = myjob
#@ account_no      = repo_name
#@ output          = myjob.out
#@ error           = myjob.err
#@ job_type        = parallel
#@ environment     = COPY_ALL
#@ notification    = complete
#@ network.MPI     = csss,not_shared,us
#@ node_usage      = not_shared
#@ class           = regular
#
#
#@ node            = 16
#@ tasks_per_node  = 16 
#@ wall_clock_limit= 01:00:00
#
#@ queue

./a.out < input


Page last modified: Tue, 22 Apr 2008 17:19:05 GMT
Page URL: http://www.nersc.gov/nusers/systems/SP/old_stuff/running_jobs/batch.php
Web contact: webmaster@nersc.gov
Computing questions: consult@nersc.gov
