Overview
This is an example of how to run jobs, in particular SLIC, the Simulator for the Linear Collider, on FermiGrid, which is part of the Open Science Grid. SLIC is a Geant4-based simulation package that uses an XML geometry input format called LCDD to describe the geometry, sensitive detectors and readout geometry. In this example SLIC is tarred up and placed on web-accessible disk space. Each grid job wgets the tar file, unpacks it and runs SLIC on the stdhep file that is provided with the tar package.
This is only one way to do it. Other options include:
- sending the tar file with the job submission
- installing SLIC on nodes that are visible to the grid worker nodes
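To make the tar-file approach concrete, here is a local sketch of how such a distribution tarball can be packaged. The names (SimDist, scripts/slic.sh) mirror the example used later, but the contents here are placeholders, purely illustrative:

```shell
# Build a SimDist.tgz the way the example distribution is packaged
# (placeholder contents; the real tarball holds the SLIC binaries,
# run scripts and an stdhep input file).
mkdir -p SimDist/scripts
printf '#!/bin/sh\necho "slic placeholder"\n' > SimDist/scripts/slic.sh
chmod +x SimDist/scripts/slic.sh
tar czf SimDist.tgz SimDist
# A grid job would later fetch and unpack it with:
#   wget http://<some web server>/SimDist.tgz && tar xzf SimDist.tgz
tar tzf SimDist.tgz   # list the contents to verify the package
```

The tarball would then be copied to the web-accessible area so that the worker nodes can download it.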
Prerequisites for sending jobs to the GRID
- get a DOE grid certificate from http://security.fnal.gov/pki/Get-Personal-DOEGrids-Cert.html
  This page also explains how to export the certificate from the browser and how to deal with directory permissions and such.
- register with the ILC VO (Virtual Organization) at http://cd-amr.fnal.gov/ilc/ilcsim/ilcvo-registration.shtml which will guide you to https://voms.fnal.gov:8443/vomrs/ilc/vomrs
- everything is set up on ILCSIM, so to try things out it is recommended to get an account on ILCSIM using the following form: http://cd-amr.fnal.gov/ilc/ilcsim/ilcsim.shtml
Setting up your own gateway to the grid is beyond the scope of this write-up: it involves installing the Virtual Data Toolkit (VDT), obtaining a host certificate for the gateway machine, and so on. For an administration guide see the Fermi Grid web page.
Setting up the Environment
To set up the environment and to get the necessary grid Proxy log into ILCSIM and issue the following commands:
source /fnal/ups/grid/setup.sh
voms-proxy-init -voms ilc:/ilc/detector   # give passwd etc.
To check the status of the proxy:
voms-proxy-info -all
Submitting the first Job
Now you should be all set up to submit a first trivial test job, just to make sure that everything is working. Cut and paste the following lines into your terminal window. This will submit a grid job which starts 5 separate processes. The processes do nothing exciting: each executes sleep for 10 seconds and then terminates. Since no output is created, the files sleep_grid.out.$(Cluster).$(Process) and sleep_grid.err.$(Cluster).$(Process) should be empty.
(Note: $(Cluster) represents the job number and $(Process) represents the (5) process numbers.)
The condor log files are sleep_grid.log.$(Cluster).$(Process).
cat > sleep_grid << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = /bin/sleep
transfer_output = true
transfer_error = true
transfer_executable = true
log = sleep_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = sleep_grid.out.\$(Cluster).\$(Process)
error = sleep_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
Arguments = 10
queue 5
+EOF
condor_submit sleep_grid
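The backslashes in front of the Condor macros are needed because the here-document delimiter (+EOF) is unquoted, so the shell would otherwise try to expand $(...) as a command substitution. A quick demonstration:

```shell
# With an unquoted here-document delimiter the shell expands $(...),
# so Condor macros must be written as \$(Cluster) to reach the file intact.
cat > demo_submit << +EOF
log = demo.log.\$(Cluster).\$(Process)
+EOF
cat demo_submit
```

This prints `log = demo.log.$(Cluster).$(Process)`, i.e. the macros survive verbatim for Condor to expand at submit time.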
Submitting a Job running SLIC
Submit the job with condor_submit mytestslicjob.run, where the job description file mytestslicjob.run looks like:
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = /home2/ilc/wenzel/grid/test_slic.sh
transfer_output = true
transfer_error = true
transfer_executable = true
log = myjob.log.$(Cluster).$(Process)
notification = NEVER
output = myjob.out.$(Cluster).$(Process)
error = myjob.err.$(Cluster).$(Process)
stream_output = false
stream_error = false
globusrsl = (jobtype=single)(maxwalltime=999)
queue
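If the wrapper script accepted command-line options, they could be passed with an Arguments line just as in the sleep example above. The options shown here are hypothetical; the script triggered below hard-codes its inputs and takes no arguments:

```
Arguments = -r 5 -i ffHAA_2k.stdhep
```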
which triggers the following script:
#!/bin/sh -f
wget http://kyoto.fnal.gov/wenzel/SimDist.tgz
tar xzf SimDist.tgz
cd SimDist
printenv
scripts/slic.sh -r 5 -g sidaug05.lcdd -i ffHAA_2k.stdhep -o ffHAA_2k
ls ffHAA_2k.slcio
# This sets up the environment for osg
. $OSG_GRID/setup.sh
source $VDT_LOCATION/setup.csh
srmcp "file:///${PWD}/ffHAA_2k.slcio" "srm://cmssrm.fnal.gov:8443/srm/managerv1?SFN=/2/wenzel/slic/ffHAA_2k.slcio"
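Since srmcp has nothing to copy if SLIC produced no output, a defensive variant of the last step could first check for the output file. This is a hypothetical sketch, not part of the original script:

```shell
# Hypothetical guard: only attempt the srmcp transfer if SLIC
# actually produced a non-empty output file.
outfile=ffHAA_2k.slcio
if [ -s "$outfile" ]; then
  echo "would copy $outfile with srmcp"
else
  echo "SLIC produced no output; skipping copy"
fi
```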
To run some commands directly on the grid head nodes use a syntax like this:
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /usr/bin/printenv
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/df
The examples above show how to check which grid applications are installed, what the runtime environment of a job looks like, and which file systems are mounted. To check for available SLIC/SimDist distributions type:
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app/ilc/detector/SimDist/
Checking and killing your jobs
You can see the status of all jobs using the following command:
condor_q or condor_q -globus
Or to check the jobs submitted by user <username>:
condor_q -submitter <username>
You can view information about all requests with the following command:
condor_status -submitters
To cancel a job type condor_rm followed by the job number:
condor_rm <job number>
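condor_rm needs the cluster number, which condor_submit prints when the job is queued (a line like "1 job(s) submitted to cluster 42."). If you want to capture that number in a script, something like the following works; the canned line below stands in for real condor_submit output:

```shell
# Extract the cluster id from a condor_submit status line
# (canned example line, since no scheduler is available here).
line="1 job(s) submitted to cluster 42."
cluster=$(echo "$line" | sed -n 's/.*cluster \([0-9]*\)\./\1/p')
echo "$cluster"   # → 42
# condor_rm "$cluster" would then remove the whole cluster
```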