Grid Universal Remote (GUR) Co-scheduler
Co-scheduling is the process of making and synchronizing reservations on computational resources at multiple
sites. The GUR tool is a Python script that uses the ssh and scp commands to help users make reservations, compile
programs, and co-schedule jobs. GUR is installed on the IA-64 clusters at NCSA (Mercury)
and at SDSC, and can be invoked from the command line on either machine. Co-scheduling is expected to be
available on resources at other sites in the near future. A Web interface is in development.
Access to this software is provided by adding the +gur softenv key.
Note: mpich-g2 is
required to run a co-scheduled MPI job. Code should be compiled under the same
mpich-g2 environment at both sites specified in the co-schedule jobfile. To
make the Intel-built mpich-g2 available in your environment, use the
softenv key "+mpich-g2-intel".
Example: soft add +mpich-g2-intel
Paths and Policies
Policies at NCSA for co-scheduling are the same as for other reservations; reservation policies are documented here.
GUR Workflow
- User runs grid-proxy-init to establish a grid credential
grid-proxy-init
- User constructs an appropriate jobfile (See example jobfiles)
vi jobfile
or
gur.py --dumpjobfile --output=metajob.script
- User runs GUR with the jobfile as input
gur.py --reserve --jobfile=jobfile
GUR returns the path to a file containing the reservation information
GUR: metajob submitted:
/<working directory path>/<username>/info/gur/test/gurdata/metajob.1190763126.7100041
- GUR makes reservations at remote clusters. GUR uses gsissh
to invoke commands on remote machines.
- User runs jobs on remote clusters (See example rsl files)
mpirun -globusrsl job.rsl
- User cancels the reservation, with the metajob file returned by GUR as input
gur.py --cancel --metajobfile=/rmount/users01/sdsc/<username>/info/gur/test/gurdata/metajob.1190763126.7100041
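The reserve and cancel steps above can be sketched in Python. This is a minimal sketch, not part of GUR: the helper names are hypothetical, only the gur.py options shown above are used, and the parser assumes the metajob path follows the "GUR: metajob submitted:" line, possibly wrapped across lines in the terminal output.

```python
def reserve_command(jobfile):
    """Build the argument list for the reservation step."""
    return ["gur.py", "--reserve", f"--jobfile={jobfile}"]


def cancel_command(metajobfile):
    """Build the argument list for the cancellation step."""
    return ["gur.py", "--cancel", f"--metajobfile={metajobfile}"]


def metajob_path(output):
    """Extract the metajob file path from GUR's reservation output.

    Assumption: the path is printed after the 'GUR: metajob submitted:'
    line and may be split over several lines; a blank line ends it.
    """
    marker = "GUR: metajob submitted:"
    _, found, rest = output.partition(marker)
    if not found:
        return None
    pieces = []
    for line in rest.splitlines():
        line = line.strip()
        if not line and pieces:
            break  # blank line after the path: stop collecting
        if line:
            pieces.append(line)
    return "".join(pieces) or None


# Sample output in the shape shown above (<username> is a placeholder).
sample = """GUR: metajob submitted:
/rmount/users01/sdsc/<username>
/info/gur/test/gurdata/metajob.1190763126.7100041
"""

path = metajob_path(sample)
print(cancel_command(path))
```

The argument lists can be passed to subprocess.run on a machine where gur.py is on the path; nothing here contacts a remote site.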
Example Jobfiles, by scenario
Lines ending with " \" are wrapped here for display only; type each such line and the line that follows it as a single line.
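When copying an example jobfile from this page, the wrapped lines have to be rejoined by hand. A small helper can do this; the following is a sketch under the assumption that the " \" marker and the line break are simply dropped, with nothing inserted in their place. The function name is hypothetical and not part of GUR.

```python
def join_continued(text):
    """Rejoin lines that were wrapped with a trailing backslash."""
    joined = []
    pending = None  # text accumulated from wrapped lines, if any
    for raw in text.splitlines():
        # A fragment that continues a wrapped line loses its leading indent.
        piece = raw.rstrip() if pending is None else raw.strip()
        current = piece if pending is None else pending + piece
        if current.endswith("\\"):
            # Drop the backslash and the space in front of it.
            pending = current[:-1].rstrip()
        else:
            joined.append(current)
            pending = None
    if pending is not None:
        joined.append(pending)
    return "\n".join(joined)


wrapped = r"""machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu
duration = 3600"""

print(join_continued(wrapped))
```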
Scenario 1: 128 nodes over two systems, without regard to how they are distributed between the systems
[metajob]
# total nodes
total_nodes = 128
machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu
# allow re-ordering of the machine_preference list?
machine_preference_reorder = yes
# duration
duration = 3600
earliest_start = 11:30_06/07/2007
latest_end = 17:00_06/15/2007
# use single or multiple clusters ('single' or 'multiple')
usage_pattern = multiple
# Machine-specific info
machines_dict_string = {
'tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid#ia64-compute' : {
'username_string' : '',
'account_string' : 'TG-XYZ999999X',
'email_notify' : 'johndoe@sdsc.edu',
'min_int' : 1,
'max_int' : 128
},
'grid-hg.ncsa.teragrid.org#slash \
#jobmanager-pbs#fastcpu' : {
'username_string' : '',
'account_string' : 'TG-XYZ999999X',
'email_notify' : 'johndoe@sdsc.edu',
'min_int' : 1,
'max_int' : 128
},
}
Scenario 2: 256 nodes over two systems, 128 nodes each
[metajob]
# total nodes
total_nodes = 256
machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu
# allow re-ordering of the machine_preference list?
machine_preference_reorder = yes
# duration
duration = 3600
earliest_start = 11:30_06/07/2007
latest_end = 17:00_06/15/2007
# use single or multiple clusters ('single' or 'multiple')
usage_pattern = multiple
# Machine-specific info
machines_dict_string = {
'tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid#ia64-compute' : {
'username_string' : '',
'account_string' : 'TG-XYZ999999X',
'email_notify' : 'johndoe@sdsc.edu',
'min_int' : 128,
'max_int' : 128
},
'grid-hg.ncsa.teragrid.org#slash \
#jobmanager-pbs#fastcpu' : {
'username_string' : '',
'account_string' : 'TG-XYZ999999X',
'email_notify' : 'johndoe@sdsc.edu',
'min_int' : 128,
'max_int' : 128
},
}
Scenario 3: 64 nodes over two systems; placing all of them on one cluster is acceptable
[metajob]
# total nodes
total_nodes = 64
machine_preference = tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid \
#ia64-compute,grid-hg.ncsa.teragrid.org \
#slash#jobmanager-pbs#fastcpu
# allow re-ordering of the machine_preference list?
machine_preference_reorder = yes
# duration
duration = 3600
earliest_start = 11:30_06/07/2007
latest_end = 17:00_06/15/2007
# use single or multiple clusters ('single' or 'multiple')
usage_pattern = multiple
# Machine-specific info
machines_dict_string = {
'tg-login1.sdsc.teragrid.org:2119 \
#slash#jobmanager-pbs_gcc_resid#ia64-compute' : {
'username_string' : '',
'account_string' : 'TG-XYZ999999X',
'email_notify' : 'johndoe@sdsc.edu',
'min_int' : 0,
'max_int' : 64
},
'grid-hg.ncsa.teragrid.org#slash \
#jobmanager-pbs#fastcpu' : {
'username_string' : '',
'account_string' : 'TG-XYZ999999X',
'email_notify' : 'johndoe@sdsc.edu',
'min_int' : 0,
'max_int' : 64
},
}
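Before submitting a jobfile, it can help to check that its values are consistent with each other. The sketch below is not part of GUR and rests on assumptions inferred from the examples above: times are written as HH:MM_MM/DD/YYYY, duration is in seconds, min_int and max_int are per-machine node bounds, and the machines dictionary can be read as a Python literal. The function names are hypothetical.

```python
import ast
from datetime import datetime, timedelta

# Assumed layout of earliest_start / latest_end, e.g. 11:30_06/07/2007.
TIME_FORMAT = "%H:%M_%m/%d/%Y"


def window_fits(earliest_start, latest_end, duration_seconds):
    """True if the duration fits between earliest_start and latest_end."""
    start = datetime.strptime(earliest_start, TIME_FORMAT)
    end = datetime.strptime(latest_end, TIME_FORMAT)
    return end - start >= timedelta(seconds=duration_seconds)


def bounds_allow_total(total_nodes, machines):
    """True if total_nodes lies between the summed per-machine bounds."""
    low = sum(m["min_int"] for m in machines.values())
    high = sum(m["max_int"] for m in machines.values())
    return low <= total_nodes <= high


# Node bounds from Scenario 2, with the wrapped keys joined onto one line.
machines = ast.literal_eval("""{
    'tg-login1.sdsc.teragrid.org:2119#slash#jobmanager-pbs_gcc_resid#ia64-compute':
        {'min_int': 128, 'max_int': 128},
    'grid-hg.ncsa.teragrid.org#slash#jobmanager-pbs#fastcpu':
        {'min_int': 128, 'max_int': 128},
}""")

print(window_fits("11:30_06/07/2007", "17:00_06/15/2007", 3600))  # True
print(bounds_allow_total(256, machines))                          # True
```

A request for 300 nodes with the same bounds would fail the second check, since the two machines together allow at most 256.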
Example job.rsl file
+
(&
(resourceManagerContact="grid-hg.ncsa.teragrid.org \
/jobmanager-pbs")
(count=2)
(hostcount=1)
(maxTime=10)
(jobtype=mpi)
(label="subjob 0")
(environment=(TESTENV1 1)
(GLOBUS_DUROC_SUBJOB_INDEX 0)
(TESTENV2 2))
(arguments= -t 10 -n 2 -l 10 -i 0.03125)
(directory=/home/ncsa/kenneth/testprog)
(executable=/home/ncsa/kenneth/testprog/ring26g2)
(reservation_id=johndoe.1289)
)
(&
(resourceManagerContact="tg-login1.sdsc.teragrid.org:2119 \
/jobmanager-pbs_gcc_resid")
(count=2)
(hostcount=1)
(maxTime=10)
(jobtype=mpi)
(label="subjob 1")
(environment=(TESTENV1 1)
(GLOBUS_DUROC_SUBJOB_INDEX 1)
(TESTENV2 2))
(arguments= -t 10 -n 2 -l 10 -i 0.03125)
(directory=/users/kenneth/testprog)
(executable=/users/kenneth/testprog/ring26g2)
(reservation_id=1191274794)
)
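An RSL file of this shape can also be generated rather than written by hand, which keeps the subjob index, label, and reservation ID of each subjob in step. The following is a minimal sketch with hypothetical function names; it emits only a subset of the attributes shown above (hostcount, maxTime, arguments, and the TESTENV variables are left out) and assumes one subjob per site, numbered from 0.

```python
def subjob_rsl(index, contact, count, directory, executable, reservation_id):
    """Return the RSL text for one subjob of a multi-request."""
    lines = [
        "(&",
        f'(resourceManagerContact="{contact}")',
        f"(count={count})",
        "(jobtype=mpi)",
        f'(label="subjob {index}")',
        f"(environment=(GLOBUS_DUROC_SUBJOB_INDEX {index}))",
        f"(directory={directory})",
        f"(executable={executable})",
        f"(reservation_id={reservation_id})",
        ")",
    ]
    return "\n".join(lines)


def multi_request(subjobs):
    """Join the subjobs into one multi-request, introduced by '+'."""
    parts = [subjob_rsl(i, **job) for i, job in enumerate(subjobs)]
    return "+\n" + "\n".join(parts)


# Values taken from the example file above.
subjobs = [
    {
        "contact": "grid-hg.ncsa.teragrid.org/jobmanager-pbs",
        "count": 2,
        "directory": "/home/ncsa/kenneth/testprog",
        "executable": "/home/ncsa/kenneth/testprog/ring26g2",
        "reservation_id": "johndoe.1289",
    },
    {
        "contact": "tg-login1.sdsc.teragrid.org:2119/jobmanager-pbs_gcc_resid",
        "count": 2,
        "directory": "/users/kenneth/testprog",
        "executable": "/users/kenneth/testprog/ring26g2",
        "reservation_id": "1191274794",
    },
]

print(multi_request(subjobs))
```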