SERC

SERC Computing Facility: Software

LoadLeveler is a job management system that allows users to run more jobs in less time by matching their processing needs to available resources. It serves as a job scheduler and provides a facility for building, submitting and processing jobs quickly and efficiently in a dynamic environment. LoadLeveler is the batch system on SERC's IBM RS/6000 workstations and servers and also on the IBM RS/6000 SP.

The following FAQ explains how to use LoadLeveler at SERC.



When do I use LoadLeveler?

Resource-intensive jobs that need a lot of CPU time, memory or disk space should be submitted through LoadLeveler.

How does LoadLeveler work?

LoadLeveler processes jobs and monitors the workload by running the following daemons and processes:

LoadL_master, referred to as the master daemon, manages all daemons on its resident machine.
LoadL_schedd, referred to as the schedd daemon, manages batch submissions on its resident machine.
LoadL_shadow, spawned by the schedd daemon, is the shadow process that communicates with the starter process for a job on a server machine.
LoadL_startd, referred to as the startd daemon, accepts dispatched jobs on its resident machine.
LoadL_starter, spawned by the startd daemon, is the starter process that manages a running job on a server machine.
LoadL_kbdd, referred to as the keyboard daemon, monitors keyboard and mouse activity on its resident AIX machine.
LoadL_collector, referred to as the collector daemon, is the central collector of machine status from all machines in the LoadLeveler pool.
LoadL_negotiator, referred to as the negotiator daemon, is the central scheduler and collector of job status from all machines in the LoadLeveler pool.

Some of these components reside on every workstation or host machine in the LoadLeveler pool. Others reside only on the host designated as the central manager.
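
To see which of these daemons are running on the machine you are logged in to, the process table can be searched for names containing LoadL_. The following is a minimal sketch, assuming a POSIX-style ps that accepts -e and -o comm=; the function name list_loadl_daemons is hypothetical and not part of LoadLeveler.

```shell
#!/bin/sh
# List the distinct LoadLeveler daemons and processes running on this machine.
# Assumes a POSIX-style ps; process names may be truncated on some systems.
list_loadl_daemons() {
    ps -e -o comm= 2>/dev/null | grep 'LoadL_' | sort -u
}

running=$(list_loadl_daemons)
if [ -n "$running" ]; then
    echo "$running"
else
    echo "No LoadL_ processes found on $(uname -n)"
fi
```

On a machine that is not the central manager, the collector and negotiator daemons would not be expected in the list.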

What queues are available for me to submit jobs? What are the resource and access limits of these queues?

Job Classes in Research Domain

Workstations / Servers CPU time limit
On IBM 340 (headless)

    q30h, q120h, q240h, q480h, q960h, q50hrs_h :
    30, 120, 240, 480, 960 minutes and 50 hours respectively.

On IBM 340 / 43P

    q960w : 960 minutes

On IBM 590/591

    q480s, q960s : 480 and 960 minutes respectively.
    q480g, q960g : Queues for Gaussian'92 users with 480 and 960 minutes respectively. Needs special validation.

On IBM 595

    q960t : 960 minutes

Job Classes in Course Domain

Workstations / Servers CPU time limit

On IBM 43P

    q120p, q240p, q480p :
    120, 240 and 480 minutes respectively.

On IBM 591

    q120s, q240s, q480s, q960s :
    120, 240, 480 and 960 minutes respectively.

Job Classes in Common Domain

Workstations / Servers CPU time limit

On IBM 591

    q960m : 960 minutes for Mathematica jobs.

How do I submit a job to LoadLeveler?

Include /home/loadl/bin in your default search path. Create a command file (sample command file) and submit that file using: llsubmit command_file_name
The sample command file submits an executable named a.out from your current working directory to a machine where the job class q240h is defined.
The standard error and standard output are written to the files error.log and output.log respectively.

You may also use the GUI-based xloadl for the same purpose.
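
A command file matching the description above might look like the following sketch. The keywords executable, class, output, error and queue are standard LoadLeveler job command file keywords, but their use here is an assumption about the sample file, as is the file name job.cmd. The script calls llsubmit only when it is found on the search path.

```shell
#!/bin/sh
# Write a job command file; every "# @" line is a LoadLeveler keyword statement.
cat > job.cmd <<'EOF'
# @ executable = a.out
# @ class      = q240h
# @ output     = output.log
# @ error      = error.log
# @ queue
EOF

# Submit the job where LoadLeveler is installed; otherwise just show the file.
if command -v llsubmit >/dev/null 2>&1; then
    llsubmit job.cmd
else
    cat job.cmd
fi
```

The closing queue statement is what places the job in the queue; the statements before it describe that job.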

How do I know the status of my job?

Use the llq command. This command returns information about jobs in the LoadLeveler job queue.
A session might look like:
% llq

ID                 Owner     Submitted    ST  PRI  CLASS  Running On
ibm580_2.12910.0   ochjag    11/4 10:06   R   50   q960s  ibm580_1
ibm580_1.21264.0   ochpanda  11/4 12:12   R   50   q480s  ibm580_2
ibm580_2.12923.0   sscmsh    11/4 13:44   R   50   q960g  ibm580_8
ibm580_1.21261.0   seckiran  11/4 10:39   I   50   q480s
ibm580_7.6079.0    mecsumit  11/3 21:16   I   50   q960s
ibms10.1097.0      secajay   11/4 00:06   P   50   q960s
ibm580_8.3156.0    secharsh  11/4 12:00   P   50   q480s

7 jobs in queue 2 waiting, 2 pending, 3 running, 0 held.

Each field is defined as:

Id

    The identification number of the job.

   
Owner

    The login ID of the owner of the job.

   
Submitted

    The date and time the job was submitted. Note that this has nothing to do with the start time of the job.

   
ST

    The STate of the job.
    The possible values are:
    R Running: The job is running
    I Idle: The job is waiting for another job by the same owner to finish, or for resources.
    P Pending: The job is trying to allocate resources
    D Deferred: The job is waiting for resources to be available.
    ST STarting: The job is starting
    C Completed: The job is completed
    H Hold: The job is held and will not run until released.
    RM ReMoved: The job has been removed from the queue
    NQ NotQueued: You already have a job running.
    NR Not Run: The job will never be run because a dependency associated with the job was found to be false.

    A typical path of job states might be:

    Idle, Pending, STarting, Running.
    Or:
    Idle, Pending, Deferred, Pending, Deferred.
    Or:
    Pending, Starting, Running.

   
PRI

    The user-level priority of the job; at this time all users at the SERC have the same user-level priority.

   
Class

    The class of the job.

   
Running On

    The node a job is running on.
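
Because the state is the fifth whitespace-separated field of each job line, llq output can be summarised with standard tools. The sketch below counts jobs per state. The function name count_states is hypothetical, the column layout is assumed to be the one shown in the session above, and a few rows from that session are embedded so that the example runs without LoadLeveler.

```shell
#!/bin/sh
# Count jobs per state in llq-style output read from standard input.
# A job line is recognised by an ID of the form host.number.number.
count_states() {
    awk '$1 ~ /\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Sample rows copied from the session above.
count_states <<'EOF'
ID Owner Submitted ST PRI CLASS Running On
ibm580_2.12910.0 ochjag 11/4 10:06 R 50 q960s ibm580_1
ibm580_1.21264.0 ochpanda 11/4 12:12 R 50 q480s ibm580_2
ibm580_1.21261.0 seckiran 11/4 10:39 I 50 q480s
ibms10.1097.0 secajay 11/4 00:06 P 50 q960s
EOF
```

On a machine where LoadLeveler is installed, the same function could be fed live data with: llq | count_states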

How do I know when my job got scheduled and when it got completed?

Use the notification keyword in the batch file.
Syntax: notification = always | error | start | never | complete

Description:
notification specifies when mail is sent to the user named in notify_user. The options are:

always
    Notify you when the job begins, ends, or if it incurs error conditions.
error
    Notify you only if the job fails.
start
    Notify you only when the job begins.
never
    Never notify you.
complete
    Notify you only when the job ends. This is the default.

See sample commandfile for correct usage
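
As a sketch of how these keywords fit into a command file, the following writes a file that requests mail at every stage of the job. The address user@example.org and the file name notify.cmd are placeholders, and the class q240h is chosen only for illustration.

```shell
#!/bin/sh
# Command file that asks for mail when the job starts, ends, or fails.
cat > notify.cmd <<'EOF'
# @ executable   = a.out
# @ class        = q240h
# @ notification = always
# @ notify_user  = user@example.org
# @ queue
EOF
cat notify.cmd
```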

How do I collect the job's standard output and error?

Specify the output and error keywords in your batch file.

Syntax:

    output = filename
    error = filename

The output and error keywords specify the names of the files to use as standard output (stdout) and standard error (stderr) when your job runs. If they are not specified, the file /dev/null is used.

See sample commandfile for correct usage

Where does my job get scheduled?

Jobs are scheduled according to the queue you submit them to.
For more information on queues, refer to the question on available queues above.

Can I get my job scheduled on to a particular machine?

Yes. Specify the machine name with the Machine keyword in the requirements statement of the command file:

    requirements = (Machine == "machine_name")

See sample commandfile for correct usage
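
A minimal sketch of a command file that uses this keyword is given below. The machine name ibm580_1 and the class q960s are taken from the llq session shown earlier purely as illustration, and the file name pinned.cmd is hypothetical. The chosen machine presumably has to serve the chosen class for the job to be dispatched.

```shell
#!/bin/sh
# Pin a job to one machine; the machine name is the first argument (default ibm580_1).
machine=${1:-ibm580_1}
cat > pinned.cmd <<EOF
# @ executable   = a.out
# @ class        = q960s
# @ requirements = (Machine == "$machine")
# @ queue
EOF
cat pinned.cmd
```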

How do I cancel my submitted job(s)?

Use the command "llcancel job_id" to cancel one or more jobs from the LoadLeveler queue.
You can get the job identification number (job_id) using the llq command.

What kind of jobs can be run under LoadLeveler?

Resource-intensive jobs that need a lot of CPU time, memory or disk space should be submitted through LoadLeveler.

How do I run my long running MATLAB jobs?

Create a file, say file1.m, that contains code in the MATLAB language (such files are called M-files).
Create an executable, say file2, containing the following line: "matlab < file1.m".
In the batch file include # @ executable = file2

Example
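
The three steps above might be put together as in the sketch below. Only file1.m, file2 and the executable keyword come from the steps above; the M-file contents, the class q960h and the file names matlab.cmd, matlab.out and matlab.err are illustrative assumptions.

```shell
#!/bin/sh
# Step 1: an M-file (hypothetical contents) that ends with exit so MATLAB terminates.
cat > file1.m <<'EOF'
a = magic(4);
disp(sum(a(:)))
exit
EOF

# Step 2: a wrapper script that feeds the M-file to MATLAB.
cat > file2 <<'EOF'
#!/bin/sh
matlab < file1.m
EOF
chmod +x file2

# Step 3: a command file that runs the wrapper as the job's executable.
cat > matlab.cmd <<'EOF'
# @ executable = file2
# @ class      = q960h
# @ output     = matlab.out
# @ error      = matlab.err
# @ queue
EOF
```

The job would then be submitted with: llsubmit matlab.cmd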

Are there any limits on the number of jobs I can submit?

Yes, one can submit a maximum of 5 jobs to LoadLeveler.

How can I be nice to my colleagues when submitting several jobs simultaneously?

Submit your jobs with a lower priority. The priority of a job is set with the user_priority keyword in the batch file: when you build a job, assign it a number between 0 and 100, inclusive. A higher number corresponds to a higher priority.
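
As an illustration, the sketch below writes three command files whose user_priority values decrease from 40 to 20. The file names run1.cmd to run3.cmd, the input files input1.dat to input3.dat and the class q240h are hypothetical; arguments is the LoadLeveler keyword for passing command-line arguments to the executable.

```shell
#!/bin/sh
# Write three command files with decreasing user_priority (40, 30, 20).
prio=40
for n in 1 2 3; do
    cat > "run$n.cmd" <<EOF
# @ executable    = a.out
# @ arguments     = input$n.dat
# @ class         = q240h
# @ user_priority = $prio
# @ output        = run$n.out
# @ error         = run$n.err
# @ queue
EOF
    prio=$((prio - 10))
done

# Show the priority assigned in each file.
grep user_priority run1.cmd run2.cmd run3.cmd
```

Each file would then be submitted separately with llsubmit, staying within the limit of 5 jobs.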

Are there limits imposed by LoadLeveler on individual jobs (number of jobs scheduled, CPU limits, etc.)?

Use the "llclass" command to see the maximum slots and free slots for all defined queues.

Use of /tmp in submitting LoadLeveler jobs?

/tmp is a local filesystem. If you redirect your output to /tmp, it is written to /tmp on the machine where your job runs, and you cannot know which machine that is until LoadLeveler schedules the job.

Then how about /temp?

/temp is a globally mounted filesystem. If you redirect your output to /temp, it is written to a directory that is visible from all research pool machines. Note, however, that /temp is a scratch area, so you are advised to back up your files promptly.

Does the current working directory play any role while submitting jobs?

Yes. If you have not specified initialdir in the batch file, the initial directory is the current working directory at the time you submit the job. File names mentioned in the command file that do not begin with a / are relative to the initial directory.

Why does my job end abruptly when submitted to LoadLeveler while it runs well from a terminal?

You have set some environment variables. Commands that set a terminal state, such as "tset" or "stty", should be avoided.

                  OR

You are submitting a large executable. In that case, instead of submitting the executable directly, make an executable, say file.exe, that contains the full path of your executable, and specify file.exe in the command file.

How do I run my long running Mathematica jobs?

Create a file, say file1, that contains code in the Mathematica language. Create an executable, say file2, containing the following line: math -batchoutput < file1. In the batch file include # @ queue = q960m and # @ executable = file2. If you do not specify the queue as q960m, your job will not run on a machine that has Mathematica.

EXAMPLE

Problems in submitting/executing jobs under LoadLeveler

Unable to submit jobs from a machine

Error message: llsubmit: Command not found
Possible cause: LoadLeveler home directory not mounted

Error message: submit: Schedd on ibm340_21 cannot store the executables. Job not submitted.
Possible cause: Disk space in the LoadLeveler filesystem is inadequate

Error message: submit: Unable to connect to host running schedd
Possible cause: LoadLeveler local and central daemons are not running

For further assistance, please contact HelpDesk@SERC by E-mail or phone (#444 within SERC).