LoadLeveler is a job management system that allows users
to run more jobs in less time by matching their processing
needs to available resources. LoadLeveler serves as a job
scheduler and provides a facility for building, submitting
and processing jobs quickly and efficiently in a dynamic
environment. LoadLeveler is the batch system on SERC's
IBM RS/6000 workstations and servers, and also on the IBM RS/6000
SP.
The following FAQ will help you use LoadLeveler at SERC.
When do I use LoadLeveler?
Resource-intensive jobs that need a lot of CPU time, memory or disk space
should be submitted through LoadLeveler.
How does LoadLeveler work?
LoadLeveler processes jobs and monitors the workload by running the following
daemons and processes:
LoadL_master: referred to as the master daemon, this manages all daemons on its resident machine.
LoadL_schedd: referred to as the schedd daemon, this manages batch submissions on its resident machine.
LoadL_shadow: spawned by the schedd daemon, the shadow process communicates with the starter process for a job on a server machine.
LoadL_startd: referred to as the startd daemon, this accepts dispatched jobs on its resident machine.
LoadL_starter: spawned by the startd daemon, the starter process manages a running job on a server machine.
LoadL_kbdd: referred to as the keyboard daemon, this monitors keyboard and mouse activity on its resident AIX machine.
LoadL_collector: referred to as the collector daemon, this is the central collector of machine status from all machines in the LoadLeveler pool.
LoadL_negotiator: referred to as the negotiator daemon, this is the central scheduler and collector of job status from all machines in the LoadLeveler pool.
Some of these components reside on every workstation or host machine
in the LoadLeveler pool. Others reside only on the host designated as the
central manager.
What queues are available for me to submit jobs? What are the resource
and access limits to these queues?
Job Classes in Research Domain
| Workstations / Servers | CPU time limit |
| On IBM 340 Headless | q30h, q120h, q240h, q480h, q960h, q50hrs_h : 30, 120, 240, 480, 960 minutes and 50 hours respectively. |
| On IBM 340 / 43P | |
| On IBM 590/591 | q480s, q960s : 480 and 960 minutes respectively. q480g, q960g : Queues for Gaussian'92 users with 480 and 960 minutes respectively. Needs special validation. |
| On IBM 595 | |
Job Classes in Course Domain
| Workstations / Servers | |
| On IBM 43P | q120p, q240p, q480p : 120, 240 and 480 minutes respectively. |
| On IBM 591 | q120s, q240s, q480s, q960s : 120, 240, 480 and 960 minutes respectively. |
Job Classes in Common Domain
| Workstations / Servers | |
| On IBM 591 | |
How do I submit a job to LoadLeveler?
Include /home/loadl/bin in your default search path. Create a command file
(see the sample command file) and submit it using:
llsubmit command_file_name
The sample command file submits an executable named a.out
from your current working directory to a machine where the job class q240h
is defined.
Standard error and standard output are written to the files error.log
and output.log respectively.
You may also use the GUI-based xloadl for
the same purpose.
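A minimal sketch of such a command file, using the a.out, q240h, error.log and output.log names described above. The file name job.cmd is a hypothetical choice, and the class keyword and the closing queue statement follow standard LoadLeveler command-file syntax; they are assumptions here, not a copy of the SERC sample file.

```shell
#!/bin/sh
# Write a LoadLeveler command file; job.cmd is a hypothetical name.
cat > job.cmd <<'EOF'
# @ executable = a.out
# @ class = q240h
# @ output = output.log
# @ error = error.log
# @ queue
EOF

# Submit only where LoadLeveler is actually installed.
if command -v llsubmit >/dev/null 2>&1; then
    llsubmit job.cmd
else
    echo "llsubmit not found; is /home/loadl/bin in your PATH?"
fi
```

Lines beginning with "# @" are read by LoadLeveler as keywords, while the shell treats them as comments.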
How do I know the status of my job?
Use the llq command. This command returns
information about jobs in the LoadLeveler job queue.
A session might look like:
% llq
| ID | Owner | Submitted | ST | PRI | CLASS | Running On |
| ibm580_2.12910.0 | ochjag | 11/4 10:06 | R | 50 | q960s | ibm580_1 |
| ibm580_1.21264.0 | ochpanda | 11/4 12:12 | R | 50 | q480s | ibm580_2 |
| ibm580_2.12923.0 | sscmsh | 11/4 13:44 | R | 50 | q960g | ibm580_8 |
| ibm580_1.21261.0 | seckiran | 11/4 10:39 | I | 50 | q480s | |
| ibm580_7.6079.0 | mecsumit | 11/3 21:16 | I | 50 | q960s | |
| ibms10.1097.0 | secajay | 11/4 00:06 | P | 50 | q960s | |
| ibm580_8.3156.0 | secharsh | 11/4 12:00 | P | 50 | q480s | |
Each field is defined as:
| Id | |
| Owner | |
| Submitted | The date and time the job was submitted. Note that this has nothing to do with the start time of the job. |
| ST | The state of the job. The possible values are:
R Running: The job is running.
I Idle: The job is waiting for another job by the same owner to finish, or for resources.
P Pending: The job is trying to allocate resources.
D Deferred: The job is waiting for resources to be available.
ST Starting: The job is starting.
C Completed: The job is completed.
H Hold: The job is held and will not run until released.
RM Removed: The job has been removed from the queue.
NQ NotQueued: You already have a job running.
NR Not Run: The job will never be run because a dependency associated with the job was found to be false.
Typical state sequences for a job are:
Idle, Pending, Starting, Running.
Or:
Idle, Pending, Deferred, Pending, Deferred.
Or:
Pending, Starting, Running.
|
| PRI | The user-level priority of the job; at this time all users at SERC have the same user-level priority. |
| Class | |
| Running On | |
How do I know when my job got scheduled
and when it got completed?
Use the notification keyword in the batch file.
Syntax: notification = always | error | start | never | complete
Description: notification specifies when the user named in notify_user
is sent mail. The options are:
| always | Notify you when the job begins, ends, or if it incurs error conditions. |
| error | Notify you only if the job fails. |
| start | Notify you only when the job begins. |
| never | Never notify you. |
| complete | Notify you only when the job ends. This is the default. |
See the sample command file for correct usage.
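A minimal sketch of a command file that requests mail at every stage of a job. The file name notify.cmd and the address user@example.org are hypothetical placeholders, and the closing queue statement follows standard LoadLeveler syntax rather than the SERC sample file.

```shell
#!/bin/sh
# notify.cmd and the mail address are placeholders for illustration.
cat > notify.cmd <<'EOF'
# @ executable = a.out
# @ notification = always
# @ notify_user = user@example.org
# @ queue
EOF
```

Replacing always with error, start, never or complete selects one of the other behaviours listed above.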
How do I collect the job's standard output
and error?
Specify the output and error keywords in your batch file.
The output and error keywords specify the names of the files to use as
standard output (stdout) and standard error (stderr) when your job runs.
If they are not specified, the file /dev/null is used.
See the sample command file for correct usage.
Where does my job get scheduled?
Jobs get scheduled depending on
the queue you have submitted to.
For more information on queues, refer to the question on available queues above.
Can I get my job scheduled on to a particular machine?
Yes. Specify the machine name with the "Machine" keyword in the
requirements option of the command file:
requirements = (Machine == "machine_name")
See the sample command file for correct usage.
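A minimal sketch of a command file pinned to one machine. The host name ibm580_1 is only borrowed from the llq listing as an illustration, pinned.cmd is a hypothetical file name, and the closing queue statement is assumed from standard LoadLeveler syntax.

```shell
#!/bin/sh
# Example host name taken from the llq listing; replace it with your target machine.
machine="ibm580_1"

# Unquoted here-document so that $machine is expanded into the requirements line.
cat > pinned.cmd <<EOF
# @ executable = a.out
# @ requirements = (Machine == "$machine")
# @ queue
EOF
```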
How do I cancel my submitted job(s)?
Use the command "llcancel job_id" to cancel one or more jobs from
the LoadLeveler queue.
You can get the job identification number (job_id) using the llq
command.
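A rough sketch of picking job IDs out of an llq listing before cancelling them. It assumes whitespace-separated columns in the order shown in the llq session earlier (job ID first, owner second) with any header or summary lines already removed; the jobs_of helper is hypothetical, and the real llq layout may differ on your system.

```shell
#!/bin/sh
# Three rows copied from the llq session shown earlier in this FAQ.
listing='ibm580_2.12910.0 ochjag 11/4 10:06 R 50 q960s ibm580_1
ibm580_1.21261.0 seckiran 11/4 10:39 I 50 q480s
ibm580_7.6079.0 mecsumit 11/3 21:16 I 50 q960s'

# Print the job IDs (column 1) of all jobs whose owner (column 2) matches $1.
jobs_of() {
    printf '%s\n' "$listing" | awk -v owner="$1" '$2 == owner { print $1 }'
}

ids=$(jobs_of seckiran)
echo "$ids"

# On a LoadLeveler machine the IDs could then be handed to llcancel:
#   llcancel $ids
```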
What kind of jobs can be run under LoadLeveler?
Resource-intensive jobs that need a lot of CPU time, memory or disk space
should be submitted through LoadLeveler.
How do I run my long running matlab jobs?
Create a file, say file1.m, containing your MATLAB code (files that contain
code in the MATLAB language are called M-files).
Create an executable, say file2, containing the following line: "matlab
< file1.m".
In the batch file include # @ executable = file2
Example
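A minimal sketch of the three files described above. The M-file content is a placeholder, matlab.cmd is a hypothetical file name, and the closing queue statement is assumed from standard LoadLeveler syntax.

```shell
#!/bin/sh
# file1.m: a trivial M-file used as placeholder content.
cat > file1.m <<'EOF'
disp(magic(4))
EOF

# file2: wrapper script that feeds the M-file to MATLAB on standard input.
cat > file2 <<'EOF'
#!/bin/sh
matlab < file1.m
EOF
chmod +x file2

# matlab.cmd: command file naming the wrapper as the executable.
cat > matlab.cmd <<'EOF'
# @ executable = file2
# @ output = output.log
# @ error = error.log
# @ queue
EOF
```

The wrapper must be executable (hence the chmod), and file1.m is found only if the job starts in the directory that contains it.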
Are there any limits on the number of jobs I can submit?
Yes, you can submit a maximum
of 5 jobs to LoadLeveler.
How can I be nice to my colleagues when submitting several
jobs simultaneously ?
Submit the jobs with a lower priority. The priority of a job can be set
using the user_priority keyword in the batch file. When you
build a job you can set a user priority for that job by
assigning it a number between 0 and 100, inclusive. A
higher number corresponds to a higher priority.
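A minimal sketch that generates three command files with a low user priority. The value 10, the run_N file names and the input_N.dat arguments are hypothetical; the arguments keyword and the closing queue statement are assumed from standard LoadLeveler syntax.

```shell
#!/bin/sh
# Any value from 0 to 100 is allowed; 10 is an arbitrary low choice.
priority=10

# One command file per run, each with its own output and error files.
for n in 1 2 3; do
    cat > "run_$n.cmd" <<EOF
# @ executable = a.out
# @ arguments = input_$n.dat
# @ user_priority = $priority
# @ output = run_$n.out
# @ error = run_$n.err
# @ queue
EOF
done
```

Each generated file would then be submitted separately with llsubmit.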
Are there limits imposed by LoadLeveler on individual jobs
(number of jobs scheduled, CPU limits etc.)?
Use the "llclass"
command to see the maximum slots and free slots for all
defined queues.
Use of /tmp in submitting LoadLeveler jobs?
/tmp is a local
filesystem. If you redirect your output to /tmp, it is written
to /tmp on the machine where your job is scheduled to
run, which is unknown to you until the job gets scheduled
by LoadLeveler.
Then how about /temp?
/temp
is a globally mounted filesystem. If you redirect your output
to /temp, it is written to a location that is
visible from all research pool machines. However, note that
/temp is a scratch area, so you are advised to back up
your files promptly.
Does the current working directory play any role while submitting
jobs?
Yes. If you submit a job and have not specified initialdir
in the batch file, the initial directory is the current
working directory at the time you submit the job. File names
in the command file that do not begin with a
/ are relative to the initial
directory.
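A minimal sketch of a command file that sets the initial directory explicitly. The directory /home/user/project, the results subdirectory and the file name dirjob.cmd are hypothetical, and the closing queue statement is assumed from standard LoadLeveler syntax.

```shell
#!/bin/sh
# All paths below are placeholders for illustration.
cat > dirjob.cmd <<'EOF'
# @ initialdir = /home/user/project
# @ executable = a.out
# @ output = results/output.log
# @ error = /temp/user/error.log
# @ queue
EOF
```

Under the rule above, a.out and results/output.log are taken relative to /home/user/project, while the error file keeps its absolute path because it begins with a /.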
Why does my job end abruptly when submitted
to LoadLeveler while it runs well from a terminal?
There are two likely causes:
You have set some environment variables. Commands that set a terminal state,
such as "tset" or "stty", should be avoided.
You are submitting a large executable. In that case, instead of submitting
the executable directly, create an executable, say file.exe, that contains
the full path of your executable, and specify file.exe
in the command file.
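A minimal sketch of such a wrapper. The path /home/user/bin/big_program and the file name bigjob.cmd are hypothetical, and the closing queue statement is assumed from standard LoadLeveler syntax.

```shell
#!/bin/sh
# file.exe: small wrapper that runs the real program by its full path.
cat > file.exe <<'EOF'
#!/bin/sh
exec /home/user/bin/big_program
EOF
chmod +x file.exe

# bigjob.cmd: command file that names the wrapper instead of the large binary.
cat > bigjob.cmd <<'EOF'
# @ executable = file.exe
# @ output = output.log
# @ error = error.log
# @ queue
EOF
```

Only the small wrapper is handed to LoadLeveler; the large binary is read from its own location when the job runs.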
How do I run my long running Mathematica
jobs?
Create a file, file1, that contains code in the Mathematica language. Create
an executable, file2, containing the following line: math -batchoutput<file1.
In the batch file include # @ queue = q960m and
# @ executable = file2. If you do not specify the queue as q960m, your
job will not run on a machine that has Mathematica.
EXAMPLE
Problems in submitting/executing jobs under LoadLeveler
Unable to submit jobs from a machine
| Error message | Possible cause |
| llsubmit: Command not found | LoadLeveler home directory not mounted |
| submit: Schedd on ibm340_21 cannot store the executables. Job not submitted. | Disk space in the LoadLeveler filesystem is inadequate |
| submit: Unable to connect to host running schedd | LoadLeveler local and central daemons are not running |
For further assistance, please contact HelpDesk@SERC by
e-mail or phone (#444 within SERC).