Running Jobs |
Historical: Running on Seaborg |
| Submit Class | Job Type | LL Class1 | Nodes | Max Wallclock | Relative Priority3 | Class Charge Factor |
|---|---|---|---|---|---|---|
| interactive | parallel | interactive | 1-8 | 30 mins | 1 | 1 |
| debug | parallel | debug | 1-24 | 30 mins | 2 | 1 |
| premium | parallel | premium | 1-299 | 24 hrs | 3 | 2 |
| regular | parallel | reg_1 | 1-31 | 12 hrs | 5 | 1 |
| parallel | reg_1l | 1-31 | 24 hrs | 5 | 1 | |
| parallel | reg_32 | 32-47 | 48 hrs | 5 | 1 | |
| parallel | reg_48 | 48-255 | 48 hrs | 4 | .5 see note4 | |
| parallel | reg_256 | 256-299 | 48 hrs | 4 | .5 see note4 | |
| parallel | reg_3006 | 300-ALL | 48 hrs | see note6 | .5 see note4 | |
| serial2 | reg_s | 1 processor | 12 hrs | see note5 | see note7 | |
| low | parallel | low | 1-299 | 12 hrs | 6 | .5 |
| xfer | serial | xfer | 1 processor | 4 hrs | see note5 | see note7 |
2 - Serial jobs must be submitted with a job_type of "serial" and a class of "regular". See Running Serial Jobs with Loadleveler for more information on the serial job classes.
3 - The priorites listed in the table are relative. NERSC assigns priorities in terms of "equivalent days waiting in the queue". For example, a premium job on Seaborg receives a two day boost over the largest regular class jobs. You can view typical class and queue wait times on the job stats webpage. In addition to the relative priority given to jobs depending on their LoadLeveler class, users with large allocations who must have a certain turn around in order to use their allocation may be eligible to receive a scheduling "boost".
4 - Jobs that run in the reg_48, reg_256, and reg_300 classes are discounted 50% of the regular rate.
5 - The transfer (xfer) class runs on a dedicated node, and thus does not compete with the other classes.
6 - The reg_300 class will usually have a run limit of zero. NERSC staff will monitor this class and make special arrangements to run jobs of this size.
7 - The serial and xfer jobs are only charged for running on 1 processor, not the entire node.
See also Queue Policies for information on run limits and Using large-memory nodes for information on specifying memory requirements.
You can use the llclass command on the system to obtain information about the LoadLeveler classes. Detailed information about a single LoadLeveler class can be found using llclass -l classname.
If you request more wall clock time than allowed by the class (as indicated by the Max Time column in the table above), your job will be submitted with the wall clock time adjusted to the maximum allowed. If you omit requested time, then a default of 30 minutes will be used.
You must explicitly specify the number of nodes with the stanza "#@ node = ", otherwise your job will be run on one node.
Your job will be charged and scheduled according to the priority listed in the class name. Both interactive and debug are charged at the regular rate. See Accounting on the SP.
The classes are configured to give the best service to premium and regular jobs. Users are expected to use the regular class for at least 80 percent of their jobs. Premium jobs are charged at twice the rate of regular jobs, but are scheduled at a higher priority.
Loadleveler uses a scheduling technique called "backfilling". This method starts smaller, shorter jobs if they will not affect the start time for the job that is scheduled to begin next. This scheduling technique is advantageous from both a user and system perspective. It allows a faster turn around for shorter jobs, and it maximizes system usage.
There are two LoadLeveler serial classes on Seaborg, serial and xfer. Jobs run in these classes are charged only for one processor's actual wall clock time. These classes are designed for
These two classes typically have very short queue wait times, usually less than 10 minutes, so if you submit a job to one of these queues as the last statement of a large parallel job, you can expect the serial job to start almost immediately.
These two classes are not designed for single node multi-threaded jobs, e.g. OpenMP. Jobs of this type should be submitted with these LoadLeveler keywords:
#@ job_type = parallel #@ node = 1
This class is for serial pre-and-post data processing. The jobs in this class share a single 16 GB node. There is a run limit of 16 jobs that can share the node at the same time. It has wall clock and CPU limits of 12 hours and a memory limit of 1 GigaByte.
You can make use of the serial class by specifying the following LoadLeveler keywords:#@ job_type = serial #@ class = regular
This class is intended specifically for data transfer jobs. The run limit for the class is 8 jobs sharing a node especially configured for good network access to HPSS. It has wall clock and CPU limits of 4 hours and a memory limit of 1 GigaByte. (It takes approximately 3 hours for 1 terabyte of data to be transferred to HPSS from Seaborg).
Note you can submit a xfer job from within your parallel batch script. The example below shows how a parallel job may submit an xfer job to move files. The serial data transfer job is initiated when the XCPU signal is sent (soft_limit) or the parallel computational program completes.
This class is specifically intended only for data transfer and not for any other type of serial job. There is a discussion with examples of using HPSS in a batch job at Accessing HPSS - Batch Jobs.
You can make use of the xfer class by specifying the following LoadLeveler keywords:
#@ job_type = serial #@ class = xfer
Here's a list of commonly used Loadleveler commands with brief descriptions. Detailed information on these commands can be obtained by typing:
% man commandname
| Command | Description |
|---|---|
| llsubmit | Submits a script file to Loadleveler for execution |
| llqs | Lists all Loadleveler jobs, including requested wall clock time and number of nodes. |
| llq | Same as llqs but with less information. |
| llcancel | Deletes a job from the Loadleveler queue |
| llclass | Displays information about the configured classes |
| llstatus | Displays status of individual nodes on the SP |
The llqs command is specific to NERSC. There are a number of options available, please see the man page.
The llq command has been usurped by llqs but it is still useful for details about a particular job. See Monitoring jobs for more information.
To submit a batch job to Loadleveler you need to create a script file containing the commands that make up the batch job. This file must contain the following types of commands:
Here is an example of a script file named myjob that
runs a program named a.out when submitted from seaborg:
#@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = csss,not_shared,us #@ node_usage = not_shared #@ class = regular # # #@ tasks_per_node = 16 #@ node = 1 #@ wall_clock_limit= 01:00:00 # #@ queue ./a.out
A program to check a LoadLeveler script for simple errors is available on Seaborg.
% ll_check_script script_name
For most jobs run on seaborg, NERSC suggests that the following keywords be specified. Some of these keywords may have default values, but NERSC considers them to be undefined and you should never rely on their existence or persistence.
A complete description of Loadleveler keywords can be found in IBM's documentation.
#@ job_type = parallel |
Specifies that your job is parallel. |
#@ node = N |
Specifies that your job will use N nodes. |
#@ network.MPI = csss,not_shared,US |
Specifies communication protocol and adapters. Please see the LL documentation if you wish to use LAPI. |
#@ class = class |
Specifies that your job should be run in the submit class class. |
#@ tasks_per_node = N |
Specifies the number of tasks that will run on each node. To run a different number of tasks on different nodes, see Selecting the number of nodes and tasks. |
#@ wall_clock_limit = HH:MM:SS |
Specifies that your job requires HH:MM:SS of wall clock time. |
#@ queue |
Places your job in the queue. It must be the last keyword specified. Any keywords placed after this in the script are ignored by the current job step. |
#@ node_usage = not_shared |
Specifies that your job should not share nodes with other jobs. This is the default and cannot be overridden. |
#@ job_name = jobname |
Specifies the name of your job. |
#@ account_no = repo_name |
Specifies the repository that will be charged. If omitted, your default repository will be used. |
#@ notification = complete |
Specifies that email should be sent upon completion of the job. Other options include always, error, start, and never. If not specified, you receive no notification. |
#@ shell = /usr/bin/shell |
Specifies the shell that parses the script. If not specified, your login shell will be used. |
#@ output = myjob.out |
Specifies the name and location of STDOUT. If not given, the default is /dev/null. The default directory is the submitting directory. |
#@ error = myjob.err |
specifies the name and location of STDERR. If not given, the default is /dev/null. The default directory is the submitting directory. |
#@ environment = COPY_ALL |
Specifies that all environment variables from your shell should be used. You can also list individual variables which should be separated with semicolons. |
#@ restart = no |
Specifies that your job should not be requeued, if Load Leverler can requeue it, if it must vacate a node on which it is running. |
If you are using a script that was used on another (non-SMP) SP, beware of the following keywords:
These keywords specify the maximum/minimum number of nodes (not processors!) requested for a parallel job, regardless of the number of processors contained in the node. The node keyword should be used instead.
You must specify the number of tasks and number of nodes you wish to use. Remember that Seaborg has a maximum of 16 tasks per node; if you ask for more than 16 tasks per node (either explicitly or implicitly) your job will never run.
You usually specify numbers of tasks and nodes by including two of the following three LoadLeveler keywords in your batch script:
For example, to run on 4 nodes and pack the nodes with 16 MPI tasks per node, in your LoadLeveler script use:
#@ node = 4 #@ tasks_per_node = 16
To use a number of tasks that is not a multiple of 16, do not set the tasks_per_node keyword. For 29 tasks on 3 nodes, you would use:
#@ node = 3 #@ total_tasks = 29
LoadLeveler will allocate tasks on the nodes in a round-robin fashion.
It is also possible to select which tasks run together on a node. However you must be very careful to completely specify all tasks and the correct number of nodes. Do not use any of the three keywords listed above. Instead, to run 8 tasks on 3 nodes with task 0 on one node, odd tasks on another node and the remaining even tasks on the third node, use:
#@task_geometry = {(0)(1,3,5,7)(2,4,6)}
For more about task assignment considerations, see the IBM LoadLeveler documentation.
Each Seaborg node has a minimum of 16 GB of memory and a maximum of 64 GB. In the current configuration there are 4 compute nodes with 64 GB of memory and 64 nodes with 32 GB. The remaining 312 compute nodes have 16 GB.
You may set a "requirement" in your batch script that will run your job only on a node with a specified amount of memory. Be sure you understand the information on the Memory Management on the SP web page before targeting your job to run on large-memory nodes.
To use the 64-GB nodes exclusively (maximum of 4 nodes), use this line in your LoadLeveler script:
#@ requirements = (Memory == 65536)
To use the 32-GB nodes (maximum of 64 nodes), use:
#@ requirements = (Memory >= 32768)
Once you have created a script file, it can be submitted to Loadleveler for execution using the llsubmit command:
% llsubmit script_file llsubmit: The job "s01007.nersc.gov.101" has been submitted.
script_file specifies the name of the script file containing commands that comprise the batch job and keywords which control various aspects of the job, e.g., resource limits, output files, mail messages, etc.
You do not specify a job priority in addition to the class. The class will contain the name of the priority at which the job will be charged. See Class info and job scheduling.
There are system limits on the number of jobs a user can have in various states. See NERSC queue policies for seaborg.
LoadLeveler allows you to define "job steps" in a single batch script. Job steps are essentially different sets of instructions on how to run the job. They can be very useful in that one set of instructions can be set to execute depending the results of a previous set of instructions. This is referred to as a dependency.
See the LoadLeveler documentation for full details. Here we describe how to run an executable based on whether or not a previous program completed successfully. This strategy might be used by a code that writes checkpoint files at the end of a run and then uses those files as input for the next run.
You define steps in a LoadLeveler script by naming them with the keyword definition
#@step_name = step_name
followed by other definitions for that step and finally a "queue" directive. Additional steps follow.
Some LoadLeveler directives must be included in each step. A typical first job step for an MPI code would look like this:
#@ step_name = step1 #@ job_type = parallel #@ node = 8 #@ tasks_per_node = 16 #@ network.MPI = csss,not_shared,us #@ executable = a.out #@ queue
The second job step contains a dependency, as specified on the with the "dependency" keyword. Here we are going to assume that if the first job step is successful it returns an error code of 0 (zero). Here is a typical second job step:
#@ step_name = step2 #@ dependency = (step1==0) #@ job_type = parallel #@ node = 4 #@ tasks_per_node = 16 #@ network.MPI = csss,not_shared,us #@ executable = b.out #@ queue
The second job step will only be run if step1 has an exit code of 0. The executable a.out and b.out can themselves be executable shell scripts.
You can also include "if" statements in your LoadLeveler script itself to take different actions for different job steps. Here's an full example script that run the toy "flip" program introduced in the Introduction to MPI tutorial.
#LL WWW Script Generator. Richard Gerber, NERSC
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 0:05:00
#@ notification = always
#@ output = $(host).$(jobid).$(stepid).out
#@ error = $(host).$(jobid).$(stepid).err
#@ environment = COPY_ALL
#
#@ step_name = step1
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 4
#@ network.MPI = csss,not_shared,us
#@ queue
#
#@ step_name = step2
#@ dependency = (step1==0)
#@ job_type = parallel
#@ node = 1
#@ tasks_per_node = 2
#@ network.MPI = csss,not_shared,us
#@ queue
if ($LOADL_STEP_NAME == "step1") then
./flip
else if ($LOADL_STEP_NAME == "step2") then
echo "Step 1 returned exit code of 0."
echo "100500" >flip.in
./flip
else
echo "Error: no value for step name."
endif
exit
To monitor your job after submission, you can use the llqs command:
seaborg% llqs Step Id JobName UserName Class ST NDS WallClck Submit Time -------------- ------------ -------- ------- -- --- -------- ----------- s01007.1087.0 a240 buffy regular R 32 00:31:44 3/13 04:30 s01001.529.0 s1.x willow regular R 64 00:28:17 3/12 21:45 s01001.578.0 xdnull xander debug R 5 00:05:19 3/14 12:44 s01009.929.0 s01009.ners spike regular R 128 03:57:27 3/13 05:17 s01001.530.0 s2.x willow regular I 64 04:00:00 3/12 21:48 s01001.532.0 s3.x willow regular I 64 04:00:00 3/12 21:50 s01001.533.0 y1.x willow regular I 64 04:00:00 3/12 22:17 s01001.534.0 y2.x willow regular I 64 04:00:00 3/12 22:17 s01001.535.0 y3.x willow regular I 64 04:00:00 3/12 22:17 s01001.537.0 s01001.nerg spike regular I 128 02:30:00 3/13 06:10 s01009.930.0 s01009.nerg spike regular I 128 02:30:00 3/13 07:17
This information can also be found on the Seaborg Batch Queue Status page.
The jobs are listed in order from highest priority to lowest priority. The first column contains the job ID, which you will notice is different from the identifier given from the llsubmit command. The difference is that llsubmit returns the full node name, including the nersc.gov domain, which the first column of the above output does not contain. Otherwise the Job ID's are identical.
The second and third columns contain the job and owner names. The fourth lists which class the job was submitted to. The llclass command lists the available classes and their limits. Jobs submitted to the premium class will be scheduled faster but charged at a higher rate. Please see Priority Charging for more information.
The fifth column is the Loadleveler job state, which is usually one of the following:
| ST | Job State | Description |
|---|---|---|
| R | Running | The job is currently running. |
| I | Idle | The job is being considered to run. |
| NQ | Not Queued | The job is not currently being considered to run. This is probably because you have submitted more than 10 jobs. |
| ST | Starting | The job is starting to run. |
| RP | Remove Pending | The job is in the process of being removed. |
| HU | User Hold | The user put the job on hold. You must issue the llhold -r command in order for it to be considered for scheduling. |
| HS | System Hold | The job was put on hold by the system. This is probably because you are over disk quota in $HOME. |
There are other possible states, but they occur infrequently. Please check the Using and Administering Loadleveler guide for more information.
The rest of the columns list the requested number of nodes, the wall clock time, and submission time information. If the job is running, then the wall clock time is the time remaining before the job will finish.
Another useful command is llq with the -l option. This will display extensive information about one or all of your jobs. You can only use it on jobs you submitted. Please see the man page for details.
NERSC has written a utility, phost, that can be used to record which tasks ran on which node.
The llcancel command is used to delete a job.
% llcancel joblist
The joblist contains the list of Job IDs you want to delete. You can get this name from the first column of the output from either the llqs or the llq command:
seaborg% llq Id Owner Submitted ST PRI Class Running On ------------------- ---------- ----------- -- --- ------------ ----------- s01015.137.0 bones 10/7 12:05 R 50 regular s00608 s01005.84.0 spock 10/7 16:51 R 50 regular s01816 s01015.177.0 jimkirk 10/7 17:01 R 50 interactive s00713 s01015.174.0 sulu 10/7 16:54 R 50 low s01601 s01015.176.0 uhura 10/7 17:00 R 50 debug s00101 5 job steps in queue, 0 waiting, 0 pending, 5 running, 0 held
For example, if user spock wanted to delete a job, then the command is:
% llcancel s01005.84.0 llcancel: Cancel command has been sent to the central manager.
The llcancel command works for both running and queued jobs.
By default, job usage is charged to your default repository. Use the NERSC Information Management (NIM) system to view your default repo.
You can specify the repo to be charged in your LoadLeveler script. Use this keyword:
#@account_no = repo_name
Your job charge depends on which class you specify. In general, on seaborg one submits a job to the class which contains the name of the desired NERSC batch charging priority. For example, a premium job is submitted to the class named "premium". Interactive and debug jobs are charged at the same rate as regular class jobs.
See Accounting on the SP for information about how accounts are charged.
Here is an example batch script.
#@ job_name = myjob #@ account_no = repo_name #@ output = myjob.out #@ error = myjob.err #@ job_type = parallel #@ environment = COPY_ALL #@ notification = complete #@ network.MPI = csss,not_shared,us #@ node_usage = not_shared #@ class = regular # # #@ node = 16 #@ tasks_per_node = 16 #@ wall_clock_limit= 01:00:00 # #@ queue ./a.out < input
![]() |
Page last modified: Tue, 22 Apr 2008 17:19:05 GMT Page URL: http://www.nersc.gov/nusers/systems/SP/old_stuff/running_jobs/batch.php Web contact: webmaster@nersc.gov Computing questions: consult@nersc.gov Privacy and Security Notice |
![]() |