Running Jobs on NCSA's Xeon Linux Cluster

  1. Overview
  2. Running MPI Programs
  3. Queues
  4. Disk Space for Batch Jobs
  5. LSF batch Commands
    1. bsub
      1. bsub for Interactive Jobs
    2. bjobs
    3. bhist
    4. bkill
    5. bacct
    6. bpeek
  6. Sample Batch Script
  7. Managing Batch Scripts
  8. LSF Documentation

1. Overview

Tungsten runs the job manager LSF (Load Sharing Facility) batch, a load-sharing batch system from Platform Computing. See the lsfintro man page for a description of LSF, and the lsfbatch man page for a list of the batch commands available in LSF batch.

The access nodes are restricted to compiling tasks and have a runtime limit of 30 minutes. Processes that exceed the limits may be terminated. Use interactive job submissions for debugging.

2. Running MPI Programs

The Xeon Cluster uses ChaMPIon/Pro for running MPI programs. Instead of using mpirun to run MPI programs, use the cmpirun command. For example, to run the program testMPI on 4 processors:

   cmpirun -np 4 -lsf testMPI < myin > myout

See "cmpirun -h" for brief help. There are many environment variables and options that can be used with cmpirun. Some of the most frequently used are:

-gdb
Starts the job under the gdb debugger.
-lsf
Needed to work with machines allocated by LSF.
-mpi_debug
Enables extra checking in the ChaMPIon/Pro library.
-mpi_verbose
Enables verbose output from the ChaMPIon/Pro library.
-np n
Specifies the number of processors n to run on.
-poll
Turns on polling mode for Myrinet jobs. This is especially useful for jobs with large numbers of small messages.
-scale_level 2
Set to 2 when running with more than 1000 processors.
-timeout seconds
Sets the timeout period during startup.
-tv
Starts the job under the TotalView debugger.
-verbose
Displays errors and warnings.

Also see Debugging in the ChaMPIon/Pro environment for cmpirun debugging options.
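
As a rough illustration of how these options combine, the sketch below assembles a cmpirun argument list and prints the resulting command instead of running it. The function name cmpirun_args is hypothetical; only the -np, -lsf and -scale_level flags come from the list above, and the 1000-processor threshold follows the -scale_level description.

```shell
# Build a cmpirun argument list for a given processor count.
# cmpirun_args is a hypothetical helper, not part of ChaMPIon/Pro.
cmpirun_args() {
    np=$1
    args="-np $np -lsf"
    # The option list suggests -scale_level 2 beyond 1000 processors.
    if [ "$np" -gt 1000 ]; then
        args="$args -scale_level 2"
    fi
    printf '%s\n' "$args"
}

# Dry run: print the command line instead of executing it.
echo "cmpirun $(cmpirun_args 4) testMPI"
```

Printing the command first makes it easy to check the flags before placing the real cmpirun call in a batch script.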

3. Queues

The following queues are currently available for users:

Queue             Walltime    Max # Nodes
debug             30 mins     8
normal (default)  48 hours    512(*)
long              100 hours   512(*)

(*) We recommend that you limit the number of nodes per job to 512.
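
The table can be read as a simple decision rule. The following sketch, which assumes the limits listed above and treats the recommended 512-node figure as a hard cap, picks the shortest queue that fits a job. The function name pick_queue and the "none" result are illustrative only.

```shell
# Pick the shortest-walltime queue that fits a job.
# Arguments: number of nodes, walltime in minutes.
# pick_queue is a hypothetical helper; limits come from the queue table.
pick_queue() {
    nodes=$1
    minutes=$2
    if [ "$nodes" -gt 512 ]; then
        echo "none"                        # above the recommended node limit
    elif [ "$nodes" -le 8 ] && [ "$minutes" -le 30 ]; then
        echo "debug"
    elif [ "$minutes" -le 2880 ]; then     # 48 hours
        echo "normal"
    elif [ "$minutes" -le 6000 ]; then     # 100 hours
        echo "long"
    else
        echo "none"
    fi
}

pick_queue 4 20
```

The result can then be passed to bsub with the -q option.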

4. Disk Space for Batch Jobs

The system creates a scratch directory for each running batch job. The job directory is created for you when LSF starts your job and is accessible within the batch script through the $SCR environment variable. See the sample batch script for how to use $SCR in a batch job.

The cdjob command can be used to change the working directory to the scratch directory of a running batch job. The syntax is:

     cdjob jobid

Your job scratch directory may be deleted soon after your job completes, so you should take care to transfer results to the mass storage system at the end of your job script.

5. LSF batch Commands

A complete list of LSF batch commands can be found in the man page for lsfbatch. Below are brief descriptions of the more useful commands. For more detailed information, refer to the individual man pages.

    5.1 bsub

    The bsub command is used to submit a batch job to a queue.
    • All options to bsub can be specified either on the command line or as a line in a script (known as an embedded option). If embedded options are used, the script must be submitted using the following format:

      bsub < script_name

      where script_name is the name of the script and the < is required. Scripts submitted this way are spooled, meaning the system saves a copy of the script. Hence, changing the script file after the job is submitted does not affect execution.

      To execute a script in C shell, use the following as the first line of your script:

      #!/bin/csh

    • To use embedded bsub options in batch scripts, begin each line containing options with #BSUB (leave at least one blank space between the BSUB and the start of the first option).

    • The main bsub options are listed below. The sample batch script illustrates bsub usage and options. Also see the bsub man page for other options.

      • -n proc: specifies the number of processes (default = 1). This is the maximum number of active processes at any given time during the lifetime of the job. If different numbers of processors are used over the lifetime of the job, you must specify the maximum number used.

      • -W run_time_limit: specifies the total job wall clock time (default = 30 mins). The syntax is [hour:]minute.

      • -R "span[ptile=X]": specifies the number of processors X to use per node, either 1 or 2 (default = 2).

      • -o out_file: stores the standard output and standard error of the job in the file out_file.

      • -J job_name: specifies a job name.

      • -N: sends mail at the end of a job.

      • -P psn: charges your job to a specific project (PSN).

      • -q queuename: submits your job to the queuename queue.
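
To show how embedded options look in practice, the sketch below prints a set of #BSUB lines suitable for the top of a batch script and converts a walltime in minutes to the [hour:]minute form used by -W. The helper names bsub_walltime and bsub_header, and the sample values (4 processes, 90 minutes, job name testjob), are illustrative assumptions. It is written for /bin/sh rather than csh.

```shell
# Convert minutes to the [hour:]minute form that -W expects.
bsub_walltime() {
    minutes=$1
    if [ "$minutes" -lt 60 ]; then
        printf '%d\n' "$minutes"
    else
        printf '%d:%02d\n' $((minutes / 60)) $((minutes % 60))
    fi
}

# Print embedded-option lines for the top of a batch script.
# Arguments: processes, walltime in minutes, queue, job name.
bsub_header() {
    printf '#BSUB -n %s\n' "$1"
    printf '#BSUB -W %s\n' "$(bsub_walltime "$2")"
    printf '#BSUB -q %s\n' "$3"
    printf '#BSUB -J %s\n' "$4"
    printf '#BSUB -o %s.out\n' "$4"
}

bsub_header 4 90 normal testjob
```

A script that begins with these lines would be submitted with bsub < script_name, as described above.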

    5.1.1 bsub for Interactive Jobs

    The -Is option tells bsub you want to run an interactive job. You can also use other bsub options such as those documented in the sample batch script. For example, the following command:

       bsub -Is -n4 -W 1:00 tcsh
    

    will run an interactive job on 4 processors using tcsh with a wallclock limit of 1 hour.

    After you enter the command, you will have to wait for LSF to start the job. As with any job, your interactive job will wait in the queue until the specified number of nodes is available. If you specify a small number of nodes, the wait will be shorter.

    When you are done with your runs, you can use the exit command to end the job.

    You will be charged for the wall clock time used by all requested nodes until you end the job.
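
Because the charge grows with both the node count and the time the session stays open, a quick estimate can help. The sketch below assumes that usage is proportional to nodes multiplied by wall clock minutes. The actual charging rate is not stated here, and the function name node_minutes is hypothetical.

```shell
# Estimate the node-minutes an interactive job occupies.
# Arguments: processes (-n), processors per node (ptile), wall clock minutes.
# Assumption: usage scales as nodes x minutes; the real rate may differ.
node_minutes() {
    procs=$1
    ptile=$2
    minutes=$3
    nodes=$(( (procs + ptile - 1) / ptile ))   # round up to whole nodes
    echo $(( nodes * minutes ))
}

# -n4 with the default ptile of 2, held for one hour: 2 nodes x 60 minutes.
node_minutes 4 2 60
```

Exiting the interactive shell as soon as the work is finished keeps this figure down.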

    5.2 bjobs

    The bjobs command displays the status of jobs. Enter bjobs to find the status of your jobs. To limit the output to a particular job, specify the jobid on the command line. To find the status of all jobs on the system use the -u all option.

    For example, the following command returns information on all jobs currently in the queue:

     % bjobs -u all
    JOBID  USER      STAT QUEUE    FROM    EXEC JOB_NAME     NDS   WALL   ELAP
    67513  jdoe       RUN normal   tuna    tuna isajob2       32  11:00  11:04
    67518  smith      RUN normal   tuna    tuna deltatest     48  12:00   6:24
    67519  brown      RUN normal   tuna    tuna testjob       12  12:00   3:40
    67570  jdoe       RUN normal   tunb    tuna 32run         32   6:00   2:46
    67529  plum       RUN normal   tuna    tunb bigset        16  12:00   2:37
    67572  black      RUN normal   tuna    tuna interactive    2   6:00   2:37
    67846  jdoe       RUN normal   tuna    tunb 2short        32   3:00   1:21
    56534  brown     PEND normal   tuna                      256   2:00
    58901  white     PEND normal   tunb         runit        256  12:00
    58931  white     PEND normal   tunb         bench        256   0:30
    67517  jdoe      PEND normal   tuna         200run       200  12:00
    

    Popular bjobs options:

    • -r: prints information only about running jobs
    • -l: prints more detailed information, can be used with a jobid or -u all
      The following command will print detailed information on job 67513: bjobs -l 67513
    • -q: prints information only about jobs in a particular queue
      The following command prints information about all jobs in the normal queue: bjobs -u all -q normal

    For a full list of bjobs options, see the bjobs man page.

    On Tungsten, bjobs is actually an NCSA wrapper for the real LSF bjobs command. It was created to eliminate the listing of the nodes in a running job and to display some new columns:

    • NDS: the number of nodes requested
    • WALL: the wall clock limit
    • ELAP: the elapsed time that the job has been running (format HH:MM)
    • EXEC: has been changed to indicate which subcluster the job is running on instead of displaying the full list of compute nodes in the job

    Users can still run the real LSF bjobs command by specifying ${LSF_BINDIR}/bjobs.
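
The extra columns make the wrapper's output easy to summarize. As a sketch, assuming the column layout of the sample listing above (RUN rows end with NDS, WALL and ELAP; PEND rows have no ELAP value), the following function totals the requested nodes by job state. The name nodes_by_stat is hypothetical.

```shell
# Sum requested nodes (NDS) by job state from a bjobs listing on stdin.
nodes_by_stat() {
    awk '
        $3 == "RUN"  { run  += $(NF-2) }   # RUN rows end with NDS WALL ELAP
        $3 == "PEND" { pend += $(NF-1) }   # PEND rows have no ELAP value
        END { printf "RUN %d\nPEND %d\n", run, pend }
    '
}

# Typical use on the cluster (not run here):
#   bjobs -u all | nodes_by_stat
```

Counting fields from the end of the line avoids problems with pending jobs whose EXEC or JOB_NAME column is empty. For the sample listing above, this would report 174 nodes running and 968 nodes pending.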

    5.3 bhist

    The bhist command displays the history of batch jobs in the LSF batch system. See the man page for more information. For older jobs, make sure to use the -n option to specify the number of event log files that bhist searches. The default is 1; i.e., the current event log file.

    For example,

    bhist -n4 -l jobid gives detailed information on a particular job that ran in the last few days.

    bhist -n4 -l -a -u userid gives detailed information on all jobs in the last few days for a particular user.

    5.4 bkill

    The bkill command deletes a queued job or kills a running job. Obtain the jobid using the bjobs command. Using the sample session shown above, user plum deletes his batch job by entering:

     % bkill 67529
     Job deleted.
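
When several queued jobs need to be removed, the job IDs can be pulled from a bjobs listing first. The sketch below assumes the listing format shown in the bjobs section (user in the second column, state in the third). The function name pending_ids is hypothetical.

```shell
# Print the job IDs of one user's pending jobs from a bjobs listing on stdin.
pending_ids() {
    awk -v user="$1" '$2 == user && $3 == "PEND" { print $1 }'
}

# Typical use on the cluster (not run here); bkill accepts several job IDs:
#   bkill $(bjobs -u all | pending_ids jdoe)
```

Check the printed IDs before passing them to bkill, and note that the list is empty when the user has no pending jobs.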
    

    5.5 bacct

    The bacct command displays accounting information that LSF batch keeps on completed batch jobs.

    bacct -l jobid
    gives detailed information on a particular job (use the bhist command to find your jobid)
    bacct -b -u userid
    gives a summary of information on all jobs for a particular user
    bacct -l -C 2004/04/20,2004/04/22 -u userid
    gives detailed information on all jobs for a particular user completed between the days specified.

    NOTE: NCSA system accounting used to compute CPU usage is done separately from that of LSF batch, so accounting information returned by bacct should be treated as approximate.

    5.6 bpeek

    The bpeek command displays the stdout and stderr output of an unfinished batch job in the LSF batch system up to the time that the command is invoked. It is useful for monitoring the progress of a job and identifying errors. Users can only invoke bpeek on their own jobs. Enter bpeek jobid to get information on a particular job.

6. Sample Batch Script

A sample LSF batch script for a ChaMPIon/Pro MPI job is available in /usr/local/doc/lsf. You can copy it and modify it as needed for your own use.

The sample batch script uses scratch space for batch jobs ($SCR). It also uses UniTree for permanent storage of files. It assumes that the executable and any input files are already on UniTree. If that's not true in your case or if you have problems with UniTree within batch jobs, see this FAQ.

7. Managing Batch Scripts

The find_batch_scripts program helps you locate batch scripts on the system, should you forget their location.

8. LSF Documentation (PDF)

Note: These documents are only available to NCSA HPC users and require an NCSA login.
