- Overview
- Interactive Use
- Queues
- Batch Commands
- qsub
- qsub -I
- qstat
- qhist
- qhosts
- qpeek
- qps
- qdel
- Sample Batch Script
- Managing Batch Scripts
- Disk Space for Batch Jobs
- Automated Saving of Files from Batch Jobs
1. Overview
The NCSA SGI Altix uses the Altair Portable Batch System (PBS) Pro with the Moab Workload Manager for running jobs. To keep all jobs running within each system's memory (for best performance) and to improve system uptime, a memory specification is required for every batch job and is enforced. See PBS memory enforcement and monitoring for details.
2. Interactive Use
The access node cobalt.ncsa.uiuc.edu is available for interactive
use.
User limits (for all active login sessions) are as follows:
- a maximum of 4 processes per job
- 4 Gbyte memory per process
- CPU time of 30 mins per process
Jobs exceeding the above policy will be terminated.
In general, interactive use should be limited to compiling, other development tasks such as editing source and debugging, and limited staging of files. The batch system is available for all other jobs.
See the section on qsub -I for instructions on how to run an interactive job on the compute nodes.
3. Queues
The following queues are currently available for users:
| Queue | Wall Clock Limit | Max #Processors | Max Memory |
| debug | 30 mins | 24 | 256 Gbytes |
| standard | 50 hours | 256 | 1.5 Tbytes |
| extended | 144 hours | 8 | 15 Gbytes |
| long | 144 hours | 128 | 1.5 Tbytes |
| dedicated(1) | 200 hours | 504 | 3 Tbytes |
(1) Queue available by special request. Please send email to consult@ncsa.uiuc.edu to request access.
Note: For best performance, the machine has been configured to reserve 8 processors to protect the operating system and other system processes, so it is strongly recommended that jobs in the dedicated queue request a maximum of 504 processors.
Jobs in the dedicated queue will be charged for all CPUs on the host regardless of how many processors the job uses.
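The limits in the table above can be checked before submitting. The following is a minimal sketch, not a site-provided tool: pick_queue is a hypothetical helper name, 1.5 Tbytes is assumed to mean 1536 Gbytes, and the dedicated queue is left out because it requires special access.

```shell
# pick_queue WALL_MINUTES NCPUS MEM_GB
# Hypothetical helper: prints the first queue in the table whose limits
# cover the request, or returns 1 if no listed queue fits.
pick_queue() {
    local minutes=$1 ncpus=$2 mem_gb=$3
    local name max_min max_cpus max_mem
    # Columns: queue, wall clock limit (minutes), max processors, max memory (Gbytes).
    # 50 hours = 3000 minutes, 144 hours = 8640 minutes; 1.5 Tbytes assumed = 1536 Gbytes.
    while read -r name max_min max_cpus max_mem; do
        if [ "$minutes" -le "$max_min" ] &&
           [ "$ncpus" -le "$max_cpus" ] &&
           [ "$mem_gb" -le "$max_mem" ]; then
            echo "$name"
            return 0
        fi
    done <<EOF
debug 30 24 256
standard 3000 256 1536
extended 8640 8 15
long 8640 128 1536
EOF
    return 1
}
```

For instance, pick_queue 600 64 128 (10 hours, 64 processors, 128 Gbytes) selects the standard queue, since the request exceeds the debug wall clock limit but fits within the standard limits.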
4. Batch Commands
Below are brief descriptions of the useful batch commands.
For more detailed information, refer to the individual man pages or the PBS Users' Guide.
4.1. qsub
The qsub command is used to submit a batch job to a queue.
All options to qsub can be specified either on the command line
or as a line in a script (known as an embedded option). Command line
options have precedence over embedded options.
Scripts can be submitted using
qsub [list of qsub options] script_name
The main qsub options are listed below.
The sample batch script illustrates
qsub usage and options.
Also see the qsub man page for other options.
- -l resource_list: specifies resource limits.
The resource_list argument is of the form:
resource_name[=[value]][,resource_name[=[value]],...]
The required resource names are:
walltime: maximum wall clock time (hh:mm:ss) [default: 10 mins]
ncpus: the number of processors to use.
mem: the total memory required for the job (all processors).
It is important to provide an accurate estimate of the memory requirement because
of the way the batch system allocates memory and processors.
Note: The memory specification for your job will be enforced, so your job must run within the requested memory. Jobs will be terminated if they exceed their memory request. See also: Checking memory use
Example:
#PBS -l walltime=00:30:00 -l ncpus=8 -l mem=16gb
- -q queue_name: specifies the queue name. [default: standard]
- -N jobname: specifies the job name.
- -o out_file:
store the standard output of the job to file out_file.
[default: <jobname>.o<PBS_JOBID>]
- -j oe:
merge standard output and standard error into standard output file.
- -k oe:
place standard output and standard error files in your $HOME
directory. The filenames will be of the form
<jobname>.o<PBS_JOBID> and <jobname>.e<PBS_JOBID>
respectively. If this option
is used in conjunction with -j oe,
standard output and standard error are combined into standard output file.
The -k option overrides the -o option.
- -V:
export all your environment variables to the batch job.
- -m be:
send mail at the beginning and end of a job.
- -A psn:
charge your job to a specific project (PSN). This is needed only by users who belong to more than one PSN.
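The options listed above can be combined as embedded options in a job script. The following is a minimal sketch rather than one of the site's sample scripts: the file name myjob.pbs, the job name, and the echo workload are placeholders chosen for illustration.

```shell
# Write a small job script. Lines starting with #PBS are embedded qsub
# options; to the shell they are ordinary comments.
cat > myjob.pbs <<'EOF'
#!/bin/sh
#PBS -l walltime=00:30:00,ncpus=8,mem=16gb
#PBS -q debug
#PBS -N myjob
#PBS -j oe
#PBS -m be
#PBS -V

# Placeholder workload; replace with your own program.
echo "myjob started in $(pwd)"
EOF

# Submit on a system where qsub is installed:
#   qsub myjob.pbs
# Command line options take precedence over embedded ones, so this
# would submit the same script to the standard queue instead of debug:
#   qsub -q standard myjob.pbs
```

The request stays within the debug queue limits (30 minutes, 24 processors, 256 Gbytes), and -j oe merges standard error into the standard output file.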
4.1.1. qsub -I
The -I option tells qsub you want to run an interactive job. You can also
use other qsub options such as those documented in the batch sample scripts
(/usr/local/doc/pbs/samples/).
For example, the following command:
qsub -I -V -l walltime=00:30:00,ncpus=8,mem=8gb
will run an interactive job with a wall clock limit of 30 minutes, using
8 processors and 8 gigabytes of memory.
After you enter the command, you will have to wait for PBS to start the
job. As with any job, your interactive job will wait in the queue until
the specified number of nodes is available. If you specify a small
number of nodes, the wait will be shorter. Once the job starts, you
will see something like this:
qsub: waiting for job 1298.co-login1 to start
qsub: job 1298.co-login1 ready
----------------------------------------
!Begin PBS Prologue Thu Aug 5 12:45:53 CDT 2004
Job ID: 1298
Username: arnoldg
Group: aau
Creating Batch Directory 1298 in /scratch/batch
----------------------------------------
$
When you are done with your interactive commands, you can use the exit command to end
the job:
$ exit
logout
qsub: job 1298.co-login1 completed
You will be charged for the CPU time used by all requested nodes until you
end the job.
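Because mem is the total for the whole job (all processors), the value passed to -l can be derived from a per-processor estimate. The following is a sketch under that reading; resource_list is a hypothetical helper name, not part of PBS.

```shell
# resource_list WALLTIME NCPUS MEM_PER_CPU_GB
# Hypothetical helper: prints a value for qsub -l in which mem is the
# job total, i.e. the per-processor estimate multiplied by ncpus.
resource_list() {
    local walltime=$1 ncpus=$2 mem_per_cpu_gb=$3
    echo "walltime=${walltime},ncpus=${ncpus},mem=$((ncpus * mem_per_cpu_gb))gb"
}

resource_list 00:30:00 8 1
# prints: walltime=00:30:00,ncpus=8,mem=8gb
```

With this helper, qsub -I -V -l "$(resource_list 00:30:00 8 1)" is equivalent to the interactive example in this section.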
4.2. qstat
The qstat command displays the status of PBS batch jobs.
- qstat -a gives the status of all jobs on the system.
- qstat -n lists nodes allocated to a running job in
addition to basic information.
- qstat -f PBS_JOBID gives detailed information
on a particular job.
- qstat -q provides summary information on all the queues.
See the man page for other options available.
4.3. qhist
The qhist command summarizes the raw accounting record(s) for one or more jobs. See the output of "qhist --help" for details.
To display information about a specific job, the syntax is qhist PBS_JOBID.
$ qhist 2242
JobId: 2242
JobName: crafty_debug
User: ian
Project: aau
Queue: standard
Job limits:
wall clock: 01:00:00
cpus: 8
Queued: 08/19/04 12:37
Started: 08/19/04 12:56
Ended: 08/19/04 13:14
Usage:
wall clock: 00:18:43
qhist can also produce tables of information from the PBS raw accounting records. For example, to create a table for your jobs that started between August 10, 2004 and August 14, 2004, run the following command:
$ qhist -S 08/10/2004,08/14/2004
JobId JobName NCPU Stat StartDate EndDate
-----------------------------------------------------------------
1658 crafty_debug 8 E 08/10/04 14:08 08/10/04 14:16
1661 crafty_debug 8 E 08/10/04 16:14 08/10/04 16:22
1662 crafty_debug 8 E 08/10/04 16:51 08/10/04 17:01
1733 STDIN 16 E 08/11/04 14:17 08/11/04 14:36
1778 crafty_debug 8 E 08/12/04 12:18 08/12/04 12:19
1840 crafty_debug 8 E 08/12/04 14:06 08/12/04 14:15
1872 crafty_debug 8 E 08/12/04 15:51 08/12/04 16:00
1891 crafty_debug 8 E 08/13/04 11:35 08/13/04 11:43
1912 crafty_debug 8 E 08/13/04 15:35 08/13/04 15:44
1915 crafty_debug 8 E 08/13/04 15:50 08/13/04 15:59
1918 crafty_debug 8 E 08/13/04 16:12 08/13/04 16:21
1919 STDIN 4 E 08/13/04 16:53 08/13/04 17:04
-----------------------------------------------------------------
Total # jobs = 12
Notes:
- Depending upon the search criteria, qhist may search through records for a couple of days on both ends of the date range you specify in order to collect more information about the job(s).
- The Stat column displays the last known status of the job:
- Q - queued
- H - queued and in hold state
- D - deleted (delete record found, but end record not found)
- S - started running (start record found)
- E - ended, can be either normally or via deletion (end record found)
- A - aborted by Torque server, for example due to user being over quota or failed job dependencies (-W depend).
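The table form of qhist output lends itself to simple post-processing. The following is a sketch that assumes the column layout shown in the example above (JobId, JobName, NCPU, Stat, StartDate, EndDate); summarize_qhist is a hypothetical helper, not a qhist option.

```shell
# summarize_qhist: reads a qhist table on standard input and prints, for
# each job name, the number of jobs and the sum of their NCPU values.
# Only rows whose first field is a numeric job ID are counted, so the
# header, separator, and "Total # jobs" lines are ignored.
summarize_qhist() {
    awk '$1 ~ /^[0-9]+$/ { jobs[$2]++; cpus[$2] += $3 }
         END { for (name in jobs)
                   printf "%s jobs=%d ncpus_total=%d\n", name, jobs[name], cpus[name] }' |
        LC_ALL=C sort
}
```

On the system, the helper would be used by piping a qhist table into it, for example the output of qhist -S with a date range.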
4.4. qhosts
The qhosts command summarizes PBS information for hosts and provides counts of claimed CPU and MEM for each host in the Cobalt array. Load Average, memory used, and uptime data are also provided. Example:
[consult@co-login1 ~/bin]$ qhosts
---------PBS job data-----------|-----Performance Co-Pilot measurments-----
| [..load average....]
HOST JOBS CPUS MEM[gb] | 1 min 5 min 15 min MEM UPTIME
----------------------------------------------------------------------------
co-compute1 3 148 790 | 133 131 143 733 10 days 01:07
co-compute2 3 132 612 | 106 112 117 991 04 days 03:15
co-login1 | 0 1 1 178 10 days 01:16
co-viz1 | 0 0 0 2 14 days 16:34
co-viz2 | 0 0 0 2 10 days 01:24
co-viz3 | 0 0 0 0 00 days 01:14
co-viz4 | 0 0 0 2 04 days 04:12
co-viz5 | 0 0 0 2 10 days 01:24
co-viz6 | 0 0 0 2 10 days 01:21
co-viz7 | 0 0 0 2 08 days 03:19
co-viz8 | 0 0 0 2 08 days 05:53
4.5. qpeek
The qpeek command (currently in beta) displays the standard output and standard error of an unfinished job.
qpeek JobID
will display the stdout/stderr for a specific job.
See qpeek -h for other options.
4.6. qps
The qps command (currently in beta) prints ps-style information for processes running on the Cobalt array. For threaded processes (OpenMP or other threads), only the first thread is shown.
qps
will display the process information for all your processes on the Cobalt array.
qps -j JOBID
will display the process information for a particular job.
See qps -h for other options.
4.7. qdel
The qdel command deletes a queued job or kills a running job.
The syntax is qdel PBS_JOBID.
5. Sample Batch Script
Sample batch scripts are available in the directory
/usr/local/doc/pbs/samples for use as a template.
The sample batch scripts use UniTree for permanent storage of files. They assume that the executable and any input files are already on UniTree. If that is not true in your case, or if you have problems with UniTree within batch jobs, see this FAQ.
6. Managing Batch Scripts
There is a program named find_batch_scripts that will help you locate batch scripts on the system (should you forget their location).
7. Disk Space for Batch Jobs
Scratch space for batch jobs is provided via a per-job scratch directory that
is created at the beginning of the job. This directory is created under
/scratch/batch, and is based on the JobID. If the batch script uses one of the sample scripts as a template, the name of this scratch directory is
available to job scripts with the
$SCR environment variable.
Your job scratch directory may be deleted soon (possibly immediately) after your job completes, so you should take care to transfer results to the mass storage system before the job ends (see the section Automated Saving of Files from Batch Jobs).
The cdjob command
can be used to change the working directory to the scratch directory of a
running batch job.
The syntax is
cdjob PBS_JOBID
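A job script that is not based on the sample scripts may not have $SCR set. The following is a sketch of a fallback; job_scratch_dir is a hypothetical helper, and the /scratch/batch/<numeric job ID> path is an assumption inferred from the prologue line "Creating Batch Directory 1298 in /scratch/batch" for job 1298.co-login1, not a documented guarantee.

```shell
# job_scratch_dir: prints the scratch directory for the current job.
# Prefers $SCR (set by the sample scripts); otherwise assumes the
# directory is /scratch/batch/<numeric part of $PBS_JOBID>.
job_scratch_dir() {
    if [ -n "${SCR:-}" ]; then
        echo "$SCR"
    elif [ -n "${PBS_JOBID:-}" ]; then
        # Strip everything from the first dot: 1298.co-login1 -> 1298
        echo "/scratch/batch/${PBS_JOBID%%.*}"
    else
        echo "job_scratch_dir: not running inside a batch job" >&2
        return 1
    fi
}
```

Inside a job script, the result would typically be used as cd "$(job_scratch_dir)" || exit 1 before the program is run.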
8. Automated Saving of Files from Batch Jobs
The saveafterjob utility is available for
automated, guaranteed saving of output files from batch jobs to the mass
storage system.
For details on its use, see the saveafterjob
page and the sample PBS batch scripts.