Xeon Cluster Running Jobs FAQ

 

I have a job in the queue and it's not running yet--why?

    At NCSA, we run the same job scheduling package on all of our batch systems. Jobs are scheduled first-in, first-out (FIFO), with some important modifications to make sure that all jobs and users get a fair chance to run. Some of the scenarios you may observe are explained here.

    Q: Why are many machines or resources currently idle?

    A: The scheduler has to let resources sit idle while it collects enough of them for a large cpu-count job, so the system looks this way in the period before such a job starts. The large job will probably start very soon, and the idle resources will then be in use again.

    Q: User X submitted a job like mine after I submitted my job, yet their job has already started. Why?

    A: Our scheduler keeps track of how many jobs each user has run recently. If User X has run few jobs recently and you have run more, User X gets higher priority so that they can catch up and everyone gets a fair share of the system.

    The scheduler will also backfill smaller and shorter jobs to utilize any idle resources.

    Q: How can I get a test, debug, or any of my jobs to start sooner--I really need to try something out now!

    A: The debug queue is available for small cpu-count jobs. In addition, because the scheduler has to let resources sit idle before starting any large (many-cpu) job, you can try to take advantage of that situation by submitting a job that requests a short walltime.

    Q: I don't understand why my job hasn't started. How can anyone get work done when the queues are so busy?

    A: See the output of the qs command. It shows the queued jobs along with the time they have been waiting in the queue. By comparing your jobs with others that have waited a similar time, you can see whether other users are having the same experience.

I have an account and can login, but why do I get the message saying I have no accounts that can be used for batch jobs on this system when I try to submit jobs?

    That's probably because your account has expired or its allocation has been used up. Alternatively, you may have only a 10-SU courtesy account, which cannot be used to submit batch jobs. You can use the 'tgusage' command to check your account's expiration date and the SUs that were allocated and used.

My job is done, where is the output?

If your batch script specified the -o option, the job's standard output (stdout) should be in the named file, in the directory from which you issued the bsub command. If the file isn't there, check your email: the batch system may have been unable to write the output file, for example because of a bad path in the -o option or because you ran bsub in a directory where you don't have write permission.

If you didn't use the -o option, the output should have been sent by email to your NCSA account. Double-check your email forwarding.
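
Before resubmitting, it can help to confirm that the output location is usable. The following is a minimal sketch, not part of the batch system: check_output_path is a hypothetical helper name, and it only tests that the directory part of the -o path exists and is writable on the machine where you run it.

```shell
#!/bin/sh
# check_output_path FILE   (hypothetical helper, not an LSF command)
# Succeeds if the directory that would hold FILE exists and is writable.
# Prints a short diagnosis to stderr otherwise.
check_output_path() {
    dir=$(dirname "$1")
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable: $dir" >&2
        return 1
    fi
    return 0
}

# Example with an -o style file name relative to the current directory.
# %J is replaced by the job ID at run time, so only the directory matters.
check_output_path "testjob.%J.o" && echo "output location looks usable"
```

Run it from the directory where you intend to run bsub. A passing check does not guarantee success, since the file is written later by the batch system, but a failing check points at the likely cause.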

What does "MPI [194]: Send completion error (12) to rank nodeid" mean?

This error means that there is a problem with the Myrinet network on one or more nodes. If this error occurs, please send email to consult@ncsa.uiuc.edu with the standard error and output files from the batch job that failed.

What does "FATAL ERROR on MPI node 16: GM send to MPI node 80 failed: ..." mean?

The complete error message may look like this:

FATAL ERROR on MPI node 16 (tuna254): GM send to MPI node 80 (???
[00:60:dd:49:76:fd]) failed: status 18 (target node was unreachable)
check the target host, mapping or cables
Small/Ctrl message completion error!

This generally means that one of the nodes your job is running on has a problem. Please send email to consult@ncsa.uiuc.edu with the standard error and output files from the batch job that failed.

My job failed with MPI_COMM_RANK: Null communicator error. What could be wrong?

If you see an error message like this: 0 - MPI_COMM_RANK: Null communicator, [0] Aborting program! p0_16883: p4_error: 197, make sure that you used the same version of MPI for compiling and for running, and that your code includes the corresponding header file mpi.h.

What does "Connection to tuna015 closed by remote host" mean?

If this error occurs, generally it means there was some problem with the system. Please send email to consult@ncsa.uiuc.edu with the standard error and output files from the batch job that failed.

I got "[0]MPI Abort by user, [0] Aborting program!" error, what's going wrong?

It's possible that some input files are missing. Please make sure you have all your working files in place.

My job failed with the error message: "/u/ac/userid/.lsbatch/1169494531.jobid: line 8: cannot create temp file for here document: No space left on device". What does that mean?

First, check your home directory quota with the "quota" command. If you are sure that you have plenty of space left in your home directory, please report the problem to consult@ncsa.uiuc.edu. Some of the directories on the local disk may have filled up.
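
To see roughly how much space is left before reporting the problem, something like the sketch below may help. It is an illustration only: free_kb is a hypothetical helper name, and checking TMPDIR (falling back to /tmp) is an assumption about where the shell places its temporary files on this system.

```shell
#!/bin/sh
# free_kb DIR   (hypothetical helper)
# Print the free space, in 1024-byte blocks, on the filesystem holding DIR.
# df -P selects the portable one-line-per-filesystem format, in which the
# fourth column is the available space.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Home directory plus the assumed temporary-file location.
for dir in "$HOME" "${TMPDIR:-/tmp}"; do
    echo "$dir: $(free_kb "$dir") KB free"
done
```

This reports filesystem free space, not your quota, so a home directory can still be over quota even when the filesystem has room.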

I have a serial program, and I want to run multiple simulations with it on a set of machines as one batch job. How can I do that?

This job script is an example of how you can run a serial program or command concurrently on a set of machines using ssh. Note that, to make efficient use of the machines, the instances of your program on each machine should complete in about the same time. Otherwise, machines that finish early will sit idle and waste resources.

#!/bin/sh
#BSUB -n 4              # Specify 4 processes
#BSUB -W 1:00           # Specify job run time limit of 1 hour
#BSUB -P abc            # Charge job to project abc (recommended for users
                        # with multiple projects)
#BSUB -o testjob.%J.o   # Store the standard output and standard error of the
                        # job in file testjob.jobid.o (optional)
#BSUB -N                # Send mail when job terminates (optional)
#BSUB -J testjob        # Specify job name (optional)


# This shell script would run a command or set of commands for you on each
# machine in your job.

for host in `cat $LSB_NODEFILE | uniq`
                # ^^^^^^^^^^^^ use PBS_NODEFILE for PBS or torque batch systems
do
         ( ssh -a -q -x $host "$HOME/bin/a.out.sh $SCR" ) &
                             # ^^^^^^^^^^^^^^^^^^^^^^ your commands in quotes
done
wait   # waits for all the outstanding ssh subshells above to complete

The a.out.sh script could resemble the one below if you wanted to run multiple copies of the same serial program on each machine to use the available cpus.

#!/bin/sh

N=2    # run this many copies per host
SCR=$1
PROGRAM="${HOME}/a.out.serial"

# change to job scratch directory $SCR
cd "$SCR" || exit 1

# make a directory for this machine/node and move into it
HOST=`hostname`
mkdir -p $HOST
cd $HOST

for ITERATION in `seq 1 $N`
do
  # open a sub shell and setup the serial run there, backgrounding the subshell
  (
    # stop this copy if its directory cannot be created or entered
    mkdir -p "$ITERATION" && cd "$ITERATION" || exit 1
    # copy any needed input files to here, untar a bundle here ...
    cp $HOME/input.dat .
    $PROGRAM > output
  ) &
  # ^ important, do not omit the ampersand
done

# wait until each of the subshells from the for loop above completes
wait
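
A plain wait, as used above, does not report whether any of the background copies failed. One possible variation is sketched below; it is not part of the original scripts. The function name run_copies is hypothetical, and the command to run is passed as arguments so the pattern can be tried on any machine without ssh or a batch job.

```shell
#!/bin/sh
# run_copies N CMD [ARGS...]   (hypothetical helper)
# Start N background copies of CMD, each in its own numbered subdirectory
# of the current directory, with stdout and stderr saved to ./output there.
# Then wait for each copy by PID. The return status is the number of
# copies that exited non-zero.
run_copies() {
    n=$1
    shift
    pids=""
    i=1
    while [ "$i" -le "$n" ]; do
        (
            mkdir -p "$i" && cd "$i" || exit 1
            "$@" > output 2>&1
        ) &
        pids="$pids $!"    # remember the PID of this background subshell
        i=$((i + 1))
    done

    failed=0
    for pid in $pids; do
        # wait PID returns that subshell's exit status
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}
```

For example, calling run_copies 2 "$HOME/a.out.serial" from a scratch directory would start two copies and return 0 only if both exited successfully. Waiting on each PID separately is what makes the individual exit statuses available.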
