Running Jobs on Bassi
On this page
Related pages
- Job Launch Overview
- Communication Protocols
- Runtime Configuration and Options
- Checkpoint/restart considerations
- Interactive Jobs
- Batch Jobs
- Memory Considerations
- Class Info & Policies
- Monitoring Jobs
- IBM Error Messages
Introduction
NERSC's IBM POWER 5 p575 system, named Bassi, has 111 SMP "compute" nodes with 8 processors per node for a total of 888 processors. Each node has a common pool of 32 GBytes of memory.
The nodes are connected via a high-speed network known as HPS (also called Federation). Each node has a "dual-link" HPS adapter, which attaches to a dual-plane network. Each link can support 2 GB/s of data transfer in each direction.
Bassi supports both interactive and batch processing. Interactive jobs are limited to 4 nodes and 30 minutes of wallclock time. Parallel programs can be run either using distributed memory message passing, shared memory threading, or some combination of the two.
[Top]MPI Codes
Most programs running on Bassi execute in parallel and use the Message Passing Interface, or MPI, to communicate among separate tasks. The default computing environment is configured to support MPI jobs as transparently as possible. The discussions on these pages assumes you are running an MPI code unless otherwise noted.
[Top]OpenMP Codes
Interactive jobs that use OpenMP must run with the the environment variable MP_TASK_AFFINITY unset. As of August 2, 2006 (Parallel Environment 3.3.2.4) jobs run through LoadLeveler batch scripts ignore this environment variable. Make sure that you do not use the "rset" keyword to request task affinity in your batch script.
THESE SETTINGS ARE WRONG FOR OPENMP CODES. DO NOT USE FOR OPENMP CODES. #@ rset = rset_mcm_affinity #@ mcm_affinity_options = mcm_distribute, mcm_mem_pref, mcm_sni_none THESE SETTINGS ARE WRONG FOR OPENMP CODES. DO NOT USE FOR OPENMP CODES.
For interactive jobs use the appropriate statement:
unsetenv MP_TASK_AFFINITY (for csh, tcsh) unset MP_TASK_AFFINITY (for sh, ksh, bash)
The default number of OpenMP threads per task (process) is 8. If you are running a pure OpenMP code, you should run using one task per node. If you are using more than one task per node, you should reduce OMP_NUM_THREADS so that the product of tasks x threads is 8. A combination of tasks and threads significantly in exess of 8 may result in poor performance.
OpenMP code will probably performer better with the following environment variable setting:
export XLSMPTOPTS=SPINS=0;YIELDS=0 (for sh-like shells) setenv XLSMPTOPTS 'SPINS=0;YIELDS=0' (for csh-like shells)
If you are calling malloc() or ALLOCATE in OpenMP threads, you may want to set this environment variable:
export MALLOCMULTIHEAP=true (sh-like shells) setenv MALLOCMULTIHEAP true (csh-like shells)
Otherwise, memory allocations will be serialized on a node.
[Top]MPI-I/O Codes
Jobs that use MPI-I/O or explict MPI threading routines must run with the the environment variable MP_SINGLE_THREAD unset. As of August 10, 2006 MP_SINGLE_THREAD is no longer set in the default Bassi environment.
[Top]Running on Bassi
Job Launch Overview: POE and LoadLeveler
Bassi uses two software packages to run parallel programs: the Parallel Operating Environment (POE) executes parallel programs and LoadLeveler schedules jobs. Users can interact with this IBM software in a number of different ways and at a number of different levels. This can be very confusing, so a brief discussion of POE and LoadLeveler follow.
- Parallel Operating Environment (POE)
-
POE is used to run parallel programs. This product augments the basic AIX operating system with software needed to run parallel programs. The command poe (all lower case) executes parallel programs. However, the poe command is not explicitly required to run a parallel program, depending on which options were used to compile an executable. POE recognizes environment variables and poe command line flags that specify how a parallel program should run. Please see "Operation and Use" in the IBM Manuals for more on POE.
- LoadLeveler
-
LoadLeveler is used in addition to POE in order to run parallel jobs. Loadleveler is a "job management system" that is used to schedule all parallel jobs, regardless of whether the jobs are batch or interactive. More information on Loadleveler can be found on the IBM Batch page.
When running in job in batch mode, a user submits to LoadLeveler a script that contains commands and LoadLeveler keywords. The value of the LoadLeveler keywords determines how the code executes (e.g. number of nodes used, number of tasks, etc.)
You control how your parallel job executes by specifying
- LoadLeveler keyword values (batch mode), and/or
- values passed to POE on the command line, and/or
- environment variables
In batch mode you should completely specify how your job should run using LoadLeveler keywords exclusively, if possible. NERSC recommends that you be as explicit as possible in your specifications in order to avoid confusion.
In interactive mode poe command-line options override environment variable settings.
Avoid confusion! POE vs. LoadLeveler keywords and options
It is important to make the distinction between LoadLeveler keywords and poe command line options. They do not have the same names in general. For example, node is a LoadLeveler keyword, but is not a poe command-line option. The poe option is called nodes and is not a LoadLeveler keyword. total_tasks is a LoadLeveler keyword, but not a poe command-line switch. Therefore poe will completely ignore -total_tasks on the command line without warning or comment. For example, the following will run 4 tasks, rather than 8 tasks as might be expected:
% poe ./a.out -nodes 4 -total_tasks 8 (does not work as expected!)
Because the default value of the MP_TASKS_PER_NODE POE environment variable is 1, this command line will run 1 task on each of 4 nodes, and ignore the total_tasks specification on the command line because it is not a valid poe command line option. See Interactive jobs.
Managing Memory Usage
Each Bassi node has 32 GB of total memory. Some memory is used by the operating system (AIX, HPS switch, etc); the remainder - about 26.4 GB - is available for user applications. If your code tries to use more than the available physical memory it will start writing memory pages to disk. This will certainly cause poor code performance and excessive paging will lead to unpredictable behavior: code crashes, hangs, and possibly node crashes.
System software monitors node memory usage and attempts to protect itself by killing jobs that exceed a given memory limit. Please plan your runs to fit into the available physical memory on each node.
NOTE: The changes discussed here, scheduled for Aug. 14, have been deferred pending performance and functionality validation of the AIX 5.3 TL8 upgrade of August 14.
- Prior to Aug. 14, 2008: Any job that tries to use more than 27032 MB of memory on a node will be killed by the system software.
- Aug. 14, 2008 and later: Any job that tries to use more than 7400 MB of small page memory will be killed by the system software.
Bassi's POWER 5 nodes support a "large page" memory configuration. High Performance scientific applications generally benefit from using "large page." By default, parallel applications built on Bassi are enabled to use large pages. It is possible to build binaries that do not use large pages, however. See Memory Considerations for more details.
The Bassi compute nodes are configured to use 20 GB of large-page memory. The size of the large-page pool is set at boot time and cannot be otherwise changed. If a large-page-enabled application requests more memory than is available in large pages, additional memory will be allocated and used from the small memory pool (perhaps at lower performance). However, a code that runs only in small pages cannot access the large-page memory pool.
Please use the following table to help make effective use of the memory on the Bassi compute nodes. For running interactive serical utilities on the login nodes, see Running Interactive Jobs.
| Bassi Compute Nodes (8 CPUs per node) | |
|---|---|
| Total Physical Memory: | 32 GB |
| Large-Page Memory Pool Size: | 20 GB |
| Large-Page Memory Available to Large-Page Enabled User Applications: | 18.922 GB |
| Memory in Small Pages: | 12 GB |
| Small-Page Memory Available to User Applications: | 7400 MB |
| Total Memory Available to Large-Page Enabled User Applications: | 26.15 GB |
| Total Memory Available to Non-Large-Page Enabled User Applications: | 7400 MB |
Job Memory Limits
The maximum memory your code can use on a single node is 26.4 GB (27032 MB). If you want to be able to access the maximum amount of memory, consult the following table to see what you need to specify in your LoadLeveler batch script:| Tasks per Node | LoadLeveler Directive | |
|---|---|---|
| Aug. 14, 2008 and after | Earlier setting | |
| 8 | None needed | None needed |
| 7 | #@resources = ConsumableMemory(1057 mb) | #@resources = ConsumableMemory(3861 mb) |
| 6 | #@resources = ConsumableMemory(1233 mb) | #@resources = ConsumableMemory(4505 mb) |
| 5 | #@resources = ConsumableMemory(1480 mb) | #@resources = ConsumableMemory(5406 mb) |
| 4 | #@resources = ConsumableMemory(1850 mb) | #@resources = ConsumableMemory(6758 mb) |
| 3 | #@resources = ConsumableMemory(2466 mb) | #@resources = ConsumableMemory(9010 mb) |
| 2 | #@resources = ConsumableMemory(3700 mb) | #@resources = ConsumableMemory(13516 mb) |
| 1 | #@resources = ConsumableMemory(7400 mb) | #@resources = ConsumableMemory(27032 mb) |
Memory limits are being enforced via LoadLeveler. Jobs that request more than 7400 MB of ConsumableMemory per node will not start. Default values are in place so that job scripts that request 8 tasks per node can access the maximum amount of memory. Jobs will be killed if they try to access more than 7400 MB of small-page memory per node.
Jobs that run with fewer than 8 tasks per node need to correctly specify their memory usage in batch scripts. The maximum memory a job can use on a single node is calculated by:
(tasks per node requested) * (ConsumableMemory requested per task)
The default value of ConsumableMemory per task is set at 925 mb. If you run fewer than 8 tasks per node and want to access the full amount of memory permitted, you need add a line to your LoadLeveler batch script:
#@resources = ConsumableMemory(N mb)
where N is the amount of memory per task that your job will use. NOTE: Each task is not limited to N mb; rather, all tasks on a given node are limited to using an aggregate amount of memory equal to: (number of tasks on that node)*(N mb) with a maximum limit of 6552 mb of small-page memory. N must be an integer. Here "tasks" refers to the number of independent instances of your executable that are started (initiated) on a node.
For example, to run 2 tasks per node and access the maximum amount of memory, set this in your batch script:#@tasks_per_node = 2 #@resources = ConsumableMemory(3700 mb)
Running on Bassi
Communication protocols
Most parallel jobs that run on Bassi do so through an explicit message passing interfaces. The choice of API and protocols must be specified at run time, by setting environment variables for interactive jobs and through a network LoadLeveler keyword for batch jobs (see below and elsewhere on these pages.)
Message Passing APIs
There are two explicit message passing interfaces, or APIs, available. Calls to these libraries are made in source code and a run-time library containing these functions is loaded when the job executes.
- MPI
-
Most codes use MPI, the popular Message Passing Interface. The default programming and interactive run-time environment is configured to support MPI. See the IBM Documentation for details of the IBM implementation of MPI.
Most MPI batch job scripts need to include this line:
#@network.MPI = sn_all,not_shared,us
- LAPI
-
The low-level IBM LAPI interface is also available. See the IBM Documentation for details.
Most LAPI batch job scripts need to include this line:
#@network.LAPI = sn_all,not_shared,us
Interactive jobs that use LAPI must set an environment variable thusly:
bassi% export MP_MSG_API=LAPI (sh-like shells) bassi% setenv MP_MSG_API LAPI (csh-like shells)
Communications Protocols
There are two separate implementations of the communications library that actually ships data packets across the network:
- Internet Protocol (IP)
-
This implementation uses the standard IP protocol that commonly links computers over ethernet networks. IP offers less performance that User Space (see below), but some specialized application may require IP.
Batch job scripts that require IP need to include one of these lines:
#@network.MPI = sn_all,not_shared,ip #@network.LAPI = sn_all,not_shared,ip
Interactive jobs that use IP must set an environment variable thusly:
bassi% export MP_EUILIB=ip (sh-like shells) bassi% setenv MP_EUILIB ip (csh-like shells)
- User Space (US) Communication Subsystem
-
This communication subsystem is designed to take advantage of Bassi's high performance switch (HPS).
You will use the "US" protocol in almost all cases. The library is dynamically linked to your code at runtime. You do not have to modify your source code, but you must specify a protocol when you run your program.
For MPI batch jobs using US, the follow line must appear in the batch script:
#@network.MPI = sn_all,not_shared,us
Running on Bassi: Runtime Configuration and Options
NOTE: Beginning September 2, 2006 the environment variables settings described below are ignored by the runtime system for jobs submitted through the batch system. They now apply only to jobs launched interactively from the shell command line.
- Managing Task and Memory Affinity on SMPs
- Communication over HPS
- Improved MPI Latency for Single-Threaded Applications
Managing Task and Memory Affinity on SMPs
Bassi's SMP nodes are organized around components called Multi-chip Modules, MCM's. An MCM contains several processors, I/O buses, and memory. An MCM on Bassi contains one active processor. (A Bassi MCM might be better termed a "DCM," or dual-chip module, of which one is active on Bassi.) While a processor in an MCM can access the I/O bus and memory in another MCM, most scientific applications will see improved performance if the processor, the memory it uses, and the I/O adapter it connects to, are all in the same MCM.
Threaded applications, including OpenMP codes, should not request affinity in most cases. If task affinity is requested, for example, all threads spawned from a single task will be bound to a single MCM. Usually these codes would prefer that AIX schedule and migrate the threads as efficiently as it can.
The runtime behavior is controlled by keyword settings in batch job scripts and by environment variable settings for interactive parallel jobs.
Batch jobs that want memory, task, or I/O affinity must include two lines in their batch scripts, one to request affinity and another to specify affinity options. The most common specification for jobs run at NERSC is expected to be the following:
#@ rset = rset_mcm_affinity #@ mcm_affinity_options = mcm_distribute, mcm_mem_pref, mcm_sni_none
Memory Affinity
By requesting MCM memory affinity, a processor will preferentially obtain memory from the local MCM. This setting is independent of the task or I/O affinity described below.
| Type | Specification | Values | Default |
|---|---|---|---|
| Environment variable: | MEMORY_AFFINITY | MCM | -1 (none) | MCM* |
| LoadLeveler directive: | #@mcm_affinity_options | mcm_mem_none | mcm_mem_pref | mcm_mem_req | mcm_mem_none |
In batch jobs a setting of mcm_mem_pref will request that memory be allocated from the local MCM whenever possible, mcm_mem_none specifies no memory affinity, and mcm_mem_req requires that all memory be allocated from the local MCM only.
Task Affinity
Task affinity settings controls the placement of tasks of a parallel job. By requesting MCM afinity, a task will not be migrated between MCM's during its execution. The tasks are allocated in a round-robin fashion among the MCM's attached to the job. By default, the tasks are allocated to all the MCMs in the node.
Most codes will benefit from using MCM task affinity, but this also binds all tasks and spawned threads to a single MCM, thus disabling effective use of OpenMP. OpenMP users need to unset MP_TASK_AFFINITY before running interactive parallel jobs. Likewise, MPI-IO and parallel HDF5 make heavy use of threads and will perform poorly unless task affinity is disabled.
| Type | Specification | Values | Default |
|---|---|---|---|
| Environment variable: | MP_TASK_AFFINITY | MCM | -1 (none) | MCM* |
| LoadLeveler directive: | #@ mcm_affinity_options | mcm_distribute | mcm_accumulate | mcm_accumulate |
In batch job scripts a setting of mcm_distribute will distribute the tasks as described above. A setting of mcm_acculate will attempt to accumulate all tasks onto a single MCM whenever possible.
Communication over HPS
The network switch on Bassi is known as HPS (High Performance Switch), or "Federation."
Remote Direct Memory Access (RDMA) is a mechanism which allows large contiguous messages to be transferred while reducing the message transfer overhead. To use RDMA, in interactive parallel jobs set the environment variable MP_USE_BULK_XFER; for batch jobs you must add the keyword #@bulkxfer=yes.
| Type | Specification | Values | Default |
|---|---|---|---|
| Environment variable: | MP_USE_BULK_XFER | yes | no | yes* |
| LoadLeveler directive: | #@bulkxfer | yes | no | no |
Contiguous messages with data lengths greater than or equal to the value of MP_BULK_MIN_MSG_SIZE will use the bulk transfer path. Messages with data lengths that are smaller than the value you specify for this environment variable, or are noncontiguous, will use packet mode transfer.
| Type | Specification | Values | Default |
|---|---|---|---|
| Environment variable: | MP_BULK_MIN_MSG_SIZE | 4K-2048M | 150K |
| LoadLeveler directive: | N/A | N/A | N/A |
The following image shows the point-to-point bandwidth for RDMA vs. non-RDMA MPI communication using MP_BULK_MIN_MSG_SIZE=4096. Click for a larger PDF version of the image.
Improved MPI Latency for Single-Threaded Applications
To avoid lock overheads in a program that is known to be single-threaded (user-created threads), set the environment variable MP_SINGLE_THREAD to yes. The internode MPI latency is reduced from about 5.1 microseconds to 4.5 microseconds if MP_SINGLE_THREAD is set to "yes."
MPI-IO and MPI one-sided functions are unavailable if this variable is set to yes.
| Type | Specification | Values | Default |
|---|---|---|---|
| Environment variable: | MP_SINGLE_THREAD | yes | no | no |
| LoadLeveler directive: | N/A | N/A | N/A |
Running on Bassi
Interactive jobs
Serial Jobs
You run interactive serial programs on the node you are currently logged onto by typing the executable file's name, e.g.:
bassi% ./a.out
Processes run on the login nodes are subject to limits on certain resources such as memory. These limits are in place to make sure that the login nodes are available and responsive to all users logged in. If your program needs greater resources than provided by the interactive limits, you should run it in batch. There are both soft (default) and hard (maximum) limits. The limits are:
| Resource | Soft Limit | Hard Limit |
|---|---|---|
| Memory (data) | 128 MB | 2 GB |
| CPU Time | 3600 secs | 3600 secs |
| MAX Processes | 512 | 512 |
You query and change these limits with the limit command for csh and tcsh users and with the ulimit command for ksh, sh, and bash users.
Query the limits:
% limit (csh, tcsh: shows limits in effect) % limit -h (csh, tcsh: shows maximum values) % ulimit -a (bash, sh, ksh: shows limits in effect) % limit -aH (bash,sh, ksh: shows maximum values)
% limit datasize 2097151 (csh, tcsh) % ulimit -d 2097152 (bash, sh, ksh)
You cannot run interactive serial jobs on any of the "compute" nodes.
Parallel Jobs
Interactive parallel programs are executed by the Parallel Operating Environment (POE) software. POE can run the job only on the "compute" nodes. Parallel jobs do not run on the login nodes that are used for interactive terminal sessions and interactive serial jobs.
Interactive jobs are run through the LoadLeveler job scheduling software in the interactive class. Therefore interactive jobs are subject to class resource limits and policies associated with the interactive class.
To run a parallel job interactively:
- compile your program with one of the "parallel" compiler invocations, e.g. mpxlf
- use the poe command to invoke the Parallel Operating Environment.
NOTE: Use of the poe command is optional for programs compiled with one of the "parallel" compiler invocations. Those programs will execute just as if you had typed poe at the command line. In fact, there is no way to run an executable compiled with an "mp" compiler as a "serial" program.
Options can be passed to poe on the command line or by setting environment variables. Command line options override the environment variable settings.
Interactive jobs are charged to your default repository, unless you specify otherwise. See IBM Accounts & Charging.
Required POE flags
You should specify two of the following:
| POE command line option | POE Environment variable | Default | Description |
|---|---|---|---|
| -procs | MP_PROCS | 1 | The number of program tasks. |
| -nodes | MP_NODES | 1* | Specifies the number of physical nodes on which to run the parallel tasks. |
| -tasks_per_node | MP_TASKS_PER_NODE | 1* | Specifies the number of tasks to be run on each of the physical nodes. |
If you choose not to run with tasks per node set to the number of processors on the node, LoadLeveler will allocate tasks to the nodes in a balanced fashion.
There are a number of other available flags available. Two of the most useful flags are -retry N -retrycount M. These options specify an attempt to launch your parallel job should be made M times, with wait of N seconds between launch attempts. This is a good way of running an interactive job when the machine is busy.
Note that these are not the LoadLeveler keywords that are used in batch scripts, even though some of the names may be similar or even identical. Command-line arguments that poe does not recognize are passed to your program as arguments without warning or comment by poe.
See Operation and Use, Volume 1 Using the Parallel Operating Environment for POE command line flags and environment variables.
For example, to run a parallel job with N total tasks on M "compute" nodes, with 10 attempts to run the job, waiting 30 seconds between attempts, you would type:
bassi% poe ./a.out -procs N -nodes M -retry 30 -retrycount 10
Running on Bassi
Bassi Batch Jobs
This document describes the IBM SP batch queuing system called Loadleveler as implemented at NERSC. It explains how to submit batch jobs, how to monitor the progress of jobs after they are submitted, and how to delete unwanted jobs. Example scripts are included which can be used as templates for customization.
- Overview
- Class info, scheduling and policies
- Common commands
- Creating batch jobs
- Loadleveler keywords commonly used at NERSC
- Selecting the number of tasks and nodes
- Submitting jobs
- Job steps and dependencies
- Monitoring jobs
- Deleting jobs
- Account charging
- Example batch script
Overview
All production runs on the SP should be done through the batch system. IBM's batch system software is called LoadLeveler. Since interactive resources are scarce on this system, most development and debugging runs should also be done in batch. This page describes how to use Loadleveler.
To submit a job to LoadLeveler create a batch script which contains the job's requirements, then submit the job to the batch system. You can then monitor the job as it progresses through the queue and can delete unwanted jobs. Jobs may be submitted at various NERSC charge priority values.
Here is a list of terms commonly used in reference to Loadleveler.
- node
A Bassi node contains 8 CPUs. Your batch job has exclusive use of each node it uses and you are charged for the use of all CPUs while your job is resident on a node.
Processors within a node share a common pool of memory.
- keyword
A Loadleveler batch script consists of UNIX shell commands and LoadLeveler keywords. See Loadleveler keywords commonly used on Bassi. A complete list of all Loadleveler keywords is available in IBM's documentation, but note that not all available keywords are relevant or useful on Bassi.
- job ID
Loadleveler assigns an ID to every submitted job. The term "Job ID" will refer to the identifier used to monitor or delete a job.
- class
This is similar to what is commonly called a queue. Each class has associated limits and properties, and a user selects a "submit" class for the job. There are also "destination" classes in which the job ultimately runs. In general, users submit to a submit class and the job is automatically routed to a destination class. Users cannot submit jobs to a destination class directly.
Class Info and Policies
Classes and Job Scheduling
All Loadleveler jobs must be submitted to a valid submit class. If the class doesn't exist, no error message will be issued. The job will be submitted, but will sit in the queue indefinitely. If this happens to you, you must delete the job using the llcancel command and resubmit to an available class.
NERSC users specify one of the following submit classes to queue jobs. Upon submission the job is routed to the appropriate LoadLeveler class according to the following criteria. (Users can not directly access the LoadLeveler classes.)
| Submit Class1 | Job Type | Destination Class2 | Nodes | Available Processors | Max Wallclock | Relative Priority3 | MPP Charge (units of nodes*wall hrs) | Availability |
|---|---|---|---|---|---|---|---|---|
| interactive | parallel | interactive | 1-4 | 1-32 | 30 mins | 1 | 48 | Everyone |
| debug4 | parallel | debug | 1-8 | 1-64 | 30 mins | 2 | 48 | Everyone |
| premium5 | parallel | premium | 1-48 | 1-384 | 12 hrs | 4 | 96 | Everyone |
| regular | parallel | reg_1 | 1-15 | 1-120 | 36 hrs | 5 | 48 | Everyone |
| parallel | reg_16 | 16-31 | 121-248 | 18 hrs | 5 | 48 | Everyone | |
| parallel | reg_32 | 32-48 | 249-384 | 18 hrs | 5 | 48 | Everyone | |
| low | parallel | low | 1-32 | 1-256 | 12 hrs | 6 | 24 | Everyone |
| special6 | parallel | special | 1-64 | 1-512 | 48 hrs | 3 | 48 | By special arrangement |
| full_config6 | parallel | full_config | 1-ALL | 1-ALL | 48 hrs | 3 | 48 | By special arrangement |
Notes
1 - This is the class name to be used in LoadLeveler scripts.
2 - Users cannot submit scripts directly to a destination class,
but this is the class name that will appear when using job monitoring utilities.
3 - The priorites listed in the table are relative. NERSC assigns priorities in terms of
"equivalent days waiting in the queue".
In addition to the relative priority given to jobs depending on their LoadLeveler class, certain projects
with high priority within DOE receive a "scheduling boost". These tend to be INCITE projects.
4 - 4 nodes are reserved exclusively for interactive and debug use weekdays from 5:00 to 18:00
Pacific Time.
5 - The intent of the premium queue is to
allow for faster turnaround before conferences and urgent project deadlines.
It should be used with care, and in most cases a project should not spend more than 10 percent of its
time in premium.
6 - Available by special arrangement only.
See also Queue Policies, below, for information on run limits.
You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.
If you request more wall clock time than allowed by the class (as indicated by the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit requested time, then a default of 30 minutes will be used.
Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See MPP Accounting.
The classes are configured to give the best service to premium and regular jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.
Loadleveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time for the job that is scheduled to begin next. This scheduling technique is advantageous from both a user and system perspective. It allows a faster turn around for shorter jobs, and it maximizes system usage.
NERSC Queue Policies for Bassi
- For the production batch classes, each user may have:
- 3 jobs running (this parameter can be adjusted depending on system load).
- 4 jobs in Idle state (jobs queued to run; this parameter can be adjusted depending on system load).
- The combined number of debug and interactive jobs that a user may have submitted or running at a given time must be two or fewer. Note that this policy only applies to jobs run in the interactive and debug batch classes. This includes parallel jobs (anything compiled with one of the "mp" compilers, e.g. mpxlf90) that are executed from the command line, as well as those jobs (parallel or otherwise) that are explicitly submitted to these two classes with the "llsubmit" command. The policy has NO effect on sequential programs executed from the command line, including all the normal Unix commands.
- The interactive and debug classes are to be used for code development, testing, and debugging. Production runs are strictly prohibited from using the interactive and debug classes. User accounts are subject to suspension if they are determined to be using the interactive or debug class for production computing.
- Any job that has been in the queue for 7 days or more, and is in the "user
hold" (Loadlever status HU) state, will be removed from the system. Note that
this means:
- Jobs may not be held for more than 7 days; and
- Jobs older than 7 days may not be held.
- Since jobs on User Hold age in the queue, their release may perturb the scheduler such that overall system throughput is degraded. In such circustances NERSC may change the state of User Hold jobs to System Hold, and release them only when overall system throughput will not be affected.
- A 60 minute time limit is enforced on all user processes on the login nodes.
- Job "chaining" in the debug and interactive classes is strictly forbidden. Chaining is defined as using a batch script to submit another batch script. User accounts are subject to suspension if they are found to be chaining jobs in the debug or interactive classes.
- Bassi is occassionally removed from service for maintenance. Users will be given seven days notice before such events, usually on the "Message of the Day" (MOTD), which is displayed upon login and is also available here. Usually, a system reservation will be made so that all jobs will finish normally before a maintence period; however, jobs that are running - for any reason - may be terminated at the start of a maintence period.
Common commands
Here's a list of commonly used Loadleveler commands with brief descriptions. Detailed information on these commands can be obtained by typing:
% man commandname
| Command | Description |
|---|---|
| llsubmit | Submits a script file to Loadleveler for execution |
| llqs | Lists all Loadleveler jobs, including requested wall clock time and number of nodes. |
| llq | Same as llqs but with less information. |
| llcancel | Deletes a job from the Loadleveler queue |
| llclass | Displays information about the configured classes |
| llstatus | Displays status of individual nodes on the SP |
The llqs command is specific to NERSC. There are a number of options available, please see the man page.
The llq command has been usurped by llqs but it is still useful for details about a particular job. See Monitoring jobs for more information.
Creating batch jobs
To submit a batch job to Loadleveler you need to create a script file containing the commands that make up the batch job. This file must contain the following types of commands:
- Loadleveler keywords embedded in the beginning of the script file. They provide Loadleveler with information needed to process the job.
- Any AIX commands that can be entered interactively, including shell commands and user-created programs.
Here is an example of a script file named myjob that
runs a program named a.out when submitted from Bassi:
#@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = sn_all,not_shared,us #@ node_usage = not_shared #@ class = regular #@ bulkxfer = yes # # #@ tasks_per_node = 8 #@ node = 1 #@ wall_clock_limit= 01:00:00 # #@ queue ./a.out
A program to check a LoadLeveler script for common simple errors is available on Bassi. This does not provide a comphrehensive error check (far from it, in fact!)
% ll_check_script script_name
LoadLeveler keywords commonly used at NERSC
For most jobs run on Bassi, NERSC suggests that the following keywords be specified. Some of these keywords may have default values, but NERSC considers them to be undefined and you should never rely on their existence or persistence.
There are many, many LoadLeveler keywords. Many do not have any meaning in the NERSC environment. Many others can contain site-specific values. NERSC recommends that scripts contain the minimum set of keywords needed to run your job in the way you want. If you are trying to use keywords that you do not understand, it is best to delete them. A common error is to specify a keyword and value that do not apply at NERSC (or are wrong for the NERSC environment); such jobs will be accepted into the queue, but will never run.
A complete description of Loadleveler keywords can be found in IBM's LoadLeveler manual.
#@ job_type = parallel |
Specifies that your job is parallel. |
#@ node = N |
Specifies that your job will use N nodes. |
#@ network.MPI =sn_all,not_shared,US |
Specifies communication protocol and adapters. csss is an alias for sn_all. Please see the LL documentation if you wish to use LAPI. |
#@ class = class |
Specifies that your job should be run in the submit class class. |
#@ tasks_per_node = N |
Specifies the number of tasks that will run on each node. To run a different number of tasks on different nodes, see Selecting the number of nodes and tasks. |
#@ wall_clock_limit = HH:MM:SS |
Specifies that your job requires HH:MM:SS of wall clock time. |
#@ queue |
Places your job in the queue. It must be the last keyword specified. Any keywords placed after this in the script are ignored by the current job step. |
#@ node_usage = not_shared |
Specifies that your job should not share nodes with other jobs. This is the default and cannot be overridden. |
#@ job_name = jobname |
Specifies the name of your job. |
#@ account_no = repo_name |
Specifies the repository that will be charged. If omitted, your default repository will be used. |
#@ notification = complete |
Specifies that email should be sent upon completion of the job. Other options include always, error, start, and never. The default is complete. |
#@ shell = /usr/bin/shell |
Specifies the shell that parses the script. If not specified, your login shell will be used. |
#@ bulkxfer = yes |
Specifies the the job use "bulk transfer" for message passing. Most codes will benefit from using "bulk transfer". |
#@resources = ConsumableMemory(N mb) |
Specifies the memory available to your job per node. See Memory usage considerations. The default value of N is 3395 mb per task. |
#@ output = myjob.out |
Specifies the name and location of STDOUT. If not given, the default is /dev/null. The default directory is the submitting directory. |
#@ error = myjob.err |
specifies the name and location of STDERR. If not given, the default is /dev/null. The default directory is the submitting directory. |
#@ environment = COPY_ALL |
Specifies that all environment variables from your shell should be used. You can also list individual variables which should be separated with semicolons. |
If you are using a script that was used on another (non-SMP) IBM machine, beware of the following keywords:
- #@ max_processors
- #@ min_processors
These keywords specify the maximum/minimum number of nodes (not processors!) requested for a parallel job, regardless of the number of processors contained in the node. The node keyword should be used instead.
Selecting the number of tasks and nodes
You must specify the number of tasks and number of nodes you wish to use. Remember that Bassi has a maximum of 8 tasks per node; if you ask for more than 8 tasks per node (either explicitly or implicitly) your job will never run.
You usually specify numbers of tasks and nodes by including two of the following three LoadLeveler keywords in your batch script:
- #@ node = desired number of nodes
- #@ total_tasks = desired number of tasks
- #@ tasks_per_node = desired number of tasks per node; up to a maximum of 8 on Bassi
For example, to run on 4 nodes and pack the nodes with 8 MPI tasks per node, in your LoadLeveler script use:
#@ node = 4 #@ tasks_per_node = 8
To use a number of tasks that is not a multiple of 8, do not set the tasks_per_node keyword. For 29 tasks on 6 nodes, you would use:
#@ node = 6 #@ total_tasks = 29
LoadLeveler will allocate tasks on the nodes in a round-robin fashion.
It is also possible to select which tasks run together on a node. However you must be very careful to completely specify all tasks and the correct number of nodes. Do not use any of the three keywords listed above. Instead, to run 8 tasks on 3 nodes with task 0 on one node, odd tasks on another node and the remaining even tasks on the third node, use:
#@task_geometry = {(0)(1,3,5,7)(2,4,6)}
See LoadLeveler documentation on "Task Assignment Considerations".
Submitting Jobs
Once you have created a script file, it can be submitted to Loadleveler for execution using the llsubmit command:
% llsubmit script_file llsubmit: The job "s01007.nersc.gov.101" has been submitted.
script_file specifies the name of the script file containing commands that comprise the batch job and keywords which control various aspects of the job, e.g., resource limits, output files, mail messages, etc.
You do not specify a job priority in addition to the class. The class will contain the name of the priority at which the job will be charged. See Class info and job scheduling.
There are system limits on the number of jobs a user can have in various states. See NERSC queue policies for Bassi.
Job steps and dependencies
LoadLeveler allows you to define "job steps" in a single batch script. Each step is essentially a different job that can run based on a set of "dependencies" on other job steps contained in the same script. LoadLeveler class limits apply to each step separately.
See the LoadLeveler documentation for full details. Here we describe how to run an executable based on whether or not a previous program completed successfully. This strategy might be used by a code that writes checkpoint files at the end of a run and then uses those files as input for the next run.
You define steps in a LoadLeveler script by naming them with the keyword definition
#@step_name = step_name
followed by other definitions for that step and finally a "queue" directive. Additional steps follow.
Some LoadLeveler directives must be included in each step. A typical first job step for an MPI code would look like this:
#@ step_name = step1 #@ job_type = parallel #@ node = 8 #@ wall_clock_limit = 24:00:00 #@ class = regular #@ tasks_per_node = 8 #@ network.MPI = sn_all,not_shared,us #@ executable = a.out #@ queue
The second job step contains a dependency, as specified on the with the "dependency" keyword. Here we are going to assume that if the first job step is successful it returns an error code of 0 (zero). Here is a typical second job step:
#@ step_name = step2 #@ dependency = (step1==0) #@ job_type = parallel #@ wall_clock_limit = 24:00:00 #@ class = regular #@ node = 4 #@ tasks_per_node = 8 #@ network.MPI = sn_all,not_shared,us #@ executable = b.out #@ queue
The second job step will only be run if step1 has an exit code of 0. The executable a.out and b.out can themselves be executable shell scripts.
You can also include "if" statements in your LoadLeveler script itself to take different actions for different job steps. Here's an full example script that run the toy "flip" program introduced in the Introduction to MPI tutorial.
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 0:05:00
#@ notification = always
#@ output = $(host).$(jobid).$(stepid).out
#@ error = $(host).$(jobid).$(stepid).err
#@ environment = COPY_ALL
#
#@ step_name = step1
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 4
#@ network.MPI = sn_all,not_shared,us
#@ queue
#
#@ step_name = step2
#@ dependency = (step1==0)
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 2
#@ network.MPI = sn_all,not_shared,us
#@ queue
if ($LOADL_STEP_NAME == "step1") then
./flip
else if ($LOADL_STEP_NAME == "step2") then
echo "Step 1 returned exit code of 0."
echo "100500" >flip.in
./flip
else
echo "Error: no value for step name."
endif
exit
Monitoring jobs
To monitor your job after submission, you can use the llqs command:
% llqs Step Id JobName UserName Class ST NDS WallClck Submit Time ------------- ------------ -------- ------- -- --- -------- ----------- b0307.1087.0 a240 buffy regular R 32 00:31:44 3/13 04:30 b0301.529.0 s1.x willow regular R 64 00:28:17 3/12 21:45 b0301.578.0 xdnull xander debug R 5 00:05:19 3/14 12:44 b0309.929.0 s01009.ners spike regular R 128 03:57:27 3/13 05:17 b0301.530.0 s2.x willow regular I 64 04:00:00 3/12 21:48 b0301.532.0 s3.x willow regular I 64 04:00:00 3/12 21:50 b0301.533.0 y1.x willow regular I 64 04:00:00 3/12 22:17 b0301.534.0 y2.x willow regular I 64 04:00:00 3/12 22:17 b0301.535.0 y3.x willow regular I 64 04:00:00 3/12 22:17 b0301.537.0 s01001.nerg spike regular I 128 02:30:00 3/13 06:10 b0309.930.0 s01009.nerg spike regular I 128 02:30:00 3/13 07:17
This information can also be found on the NERSC Batch Queue Status page.
Using phost to determine node IDs
NERSC has written a utility, phost, that can be used to record which tasks ran on which node.
Deleting jobs
The llcancel command is used to delete a job.
% llcancel joblist
The joblist contains the list of Job IDs you want to delete. You can get this name from the first column of the output from either the llqs or the llq command:
bassi% llq Id Owner Submitted ST PRI Class Running On ------------------- ---------- ----------- -- --- ------------ ----------- b01015.137.0 bones 10/7 12:05 R 50 regular s00608 b01005.84.0 spock 10/7 16:51 R 50 regular s01816 b01015.177.0 jimkirk 10/7 17:01 R 50 interactive s00713 b01015.174.0 sulu 10/7 16:54 R 50 low s01601 b01015.176.0 uhura 10/7 17:00 R 50 debug s00101 5 job steps in queue, 0 waiting, 0 pending, 5 running, 0 held
For example, if user spock wanted to delete a job, then the command is:
% llcancel s01005.84.0 llcancel: Cancel command has been sent to the central manager.
The llcancel command works for both running and queued jobs.
Account charging
By default, job usage is charged to your default repository. Use the NERSC Information Management (NIM) system to view your default repo.
You can specify the repo to be charged in your LoadLeveler script. Use this keyword:
#@account_no = repo_name
Your job charge depends on which class you specify. In general, one submits a job to the class which contains the name of the desired NERSC batch charging priority. For example, a premium job is submitted to the class named "premium". Interactive and debug jobs are charged at the same rate as regular class jobs.
See Accounting on the IBM for information about how accounts are charged.
Example batch script
Here is an example batch script.
#@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = sn_all,not_shared,us #@ node_usage = not_shared #@ class = regular # # #@ node = 16 #@ tasks_per_node = 8 #@ wall_clock_limit= 01:00:00 # #@ bulkxfer = yes # #@ queue ./a.out < input
Class Info and Policies
Classes and Job Scheduling
All Loadleveler jobs must be submitted to a valid submit class. If the class doesn't exist, no error message will be issued. The job will be submitted, but will sit in the queue indefinitely. If this happens to you, you must delete the job using the llcancel command and resubmit to an available class.
NERSC users specify one of the following submit classes to queue jobs. Upon submission the job is routed to the appropriate LoadLeveler class according to the following criteria. (Users can not directly access the LoadLeveler classes.)
| Submit Class1 | Job Type | Destination Class2 | Nodes | Available Processors | Max Wallclock | Relative Priority3 | MPP Charge (units of nodes*wall hrs) | Availability |
|---|---|---|---|---|---|---|---|---|
| interactive | parallel | interactive | 1-4 | 1-32 | 30 mins | 1 | 48 | Everyone |
| debug4 | parallel | debug | 1-8 | 1-64 | 30 mins | 2 | 48 | Everyone |
| premium5 | parallel | premium | 1-48 | 1-384 | 12 hrs | 4 | 96 | Everyone |
| regular | parallel | reg_1 | 1-15 | 1-120 | 36 hrs | 5 | 48 | Everyone |
| parallel | reg_16 | 16-31 | 121-248 | 18 hrs | 5 | 48 | Everyone | |
| parallel | reg_32 | 32-48 | 249-384 | 18 hrs | 5 | 48 | Everyone | |
| low | parallel | low | 1-32 | 1-256 | 12 hrs | 6 | 24 | Everyone |
| special6 | parallel | special | 1-64 | 1-512 | 48 hrs | 3 | 48 | By special arrangement |
| full_config6 | parallel | full_config | 1-ALL | 1-ALL | 48 hrs | 3 | 48 | By special arrangement |
Notes
1 - This is the class name to be used in LoadLeveler scripts.
2 - Users cannot submit scripts directly to a destination class,
but this is the class name that will appear when using job monitoring utilities.
3 - The priorites listed in the table are relative. NERSC assigns priorities in terms of
"equivalent days waiting in the queue".
In addition to the relative priority given to jobs depending on their LoadLeveler class, certain projects
with high priority within DOE receive a "scheduling boost". These tend to be INCITE projects.
4 - 4 nodes are reserved exclusively for interactive and debug use weekdays from 5:00 to 18:00
Pacific Time.
5 - The intent of the premium queue is to
allow for faster turnaround before conferences and urgent project deadlines.
It should be used with care, and in most cases a project should not spend more than 10 percent of its
time in premium.
6 - Available by special arrangement only.
See also Queue Policies, below, for information on run limits.
You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.
If you request more wall clock time than allowed by the class (as indicated by the Max Wallclock column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit requested time, then a default of 30 minutes will be used.
Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See MPP Accounting.
The classes are configured to give the best service to premium and regular jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.
Loadleveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time for the job that is scheduled to begin next. This scheduling technique is advantageous from both a user and system perspective. It allows a faster turn around for shorter jobs, and it maximizes system usage.
NERSC Queue Policies for Bassi
- For the production batch classes, each user may have:
- 3 jobs running (this parameter can be adjusted depending on system load).
- 4 jobs in Idle state (jobs queued to run; this parameter can be adjusted depending on system load).
- The combined number of debug and interactive jobs that a user may have submitted or running at a given time must be two or fewer. Note that this policy only applies to jobs run in the interactive and debug batch classes. This includes parallel jobs (anything compiled with one of the "mp" compilers, e.g. mpxlf90) that are executed from the command line, as well as those jobs (parallel or otherwise) that are explicitly submitted to these two classes with the "llsubmit" command. The policy has NO effect on sequential programs executed from the command line, including all the normal Unix commands.
- The interactive and debug classes are to be used for code development, testing, and debugging. Production runs are strictly prohibited from using the interactive and debug classes. User accounts are subject to suspension if they are determined to be using the interactive or debug class for production computing.
- Any job that has been in the queue for 7 days or more, and is in the "user
hold" (Loadlever status HU) state, will be removed from the system. Note that
this means:
- Jobs may not be held for more than 7 days; and
- Jobs older than 7 days may not be held.
- Since jobs on User Hold age in the queue, their release may perturb the scheduler such that overall system throughput is degraded. In such circustances NERSC may change the state of User Hold jobs to System Hold, and release them only when overall system throughput will not be affected.
- A 60 minute time limit is enforced on all user processes on the login nodes.
- Job "chaining" in the debug and interactive classes is strictly forbidden. Chaining is defined as using a batch script to submit another batch script. User accounts are subject to suspension if they are found to be chaining jobs in the debug or interactive classes.
- Bassi is occassionally removed from service for maintenance. Users will be given seven days notice before such events, usually on the "Message of the Day" (MOTD), which is displayed upon login and is also available here. Usually, a system reservation will be made so that all jobs will finish normally before a maintence period; however, jobs that are running - for any reason - may be terminated at the start of a maintence period.
Running on Bassi
Monitoring your job
NERSC has developed a utility for users to get information about queued and running jobs called llqs. In addition to the information provided by the native Loadleveler llq command, it displays the number of nodes and wallclock time. The -C option shows the repository to which the job is being charged. The -u username option shows jobs of just one user.
Bassi's queues are displayed on the web, updated every 10 minutes. A completed jobs list is updated daily at midnight.
Other utilities are provided by IBM. See Monitoring Batch Jobs
![]() |
Page last modified: Thu, 17 Jan 2008 23:41:20 GMT Page URL: http://www.nersc.gov/nusers/resources/bassi/running_jobs/print.php Web contact: webmaster@nersc.gov Computing questions: consult@nersc.gov Privacy and Security Notice |
![]() |

