How do I use the OSG Grid

Current version by Hans Wenzel on Aug 19, 2009 09:26.


 

 {toc}
  
 h2. Overview
  
 This example shows how to run jobs, and especially [SLIC|https://confluence.slac.stanford.edu/display/ilc/SLIC], the Simulator for the Linear Collider, on [*FermiGrid*|http://fermigrid.fnal.gov/], which is part of the [*Open Science Grid*|http://www.opensciencegrid.org/]. *SLIC* is a Geant4-based simulation package that uses an XML geometry input format called LCDD to describe geometry, sensitive detectors and readout geometry.
 {warning}The example scripts should be pasted directly into the terminal on ILCSIM. Do _not_ use an editor, as the escape characters will not be interpreted correctly.
 {warning}
 
 h2. Prerequisites
  
 # Obtain a DOE grid certificate from [http://security.fnal.gov/pki/Get-Personal-DOEGrids-Cert.html], which also explains how to export the certificate from the browser, how to deal with directory permissions, etc.
 # Register with the ILC VO (Virtual Organization) at [http://cd-amr.fnal.gov/ilc/ilcsim/ilcvo-registration.shtml], which will guide you to: [https://voms.fnal.gov:8443/vomrs/ilc/vomrs]
 # Get an account on *ILCSIM* and *ILCSIM2* using the following form: [http://cd-amr.fnal.gov/ilc/ilcsim/ilcsim.shtml]. These machines serve as a portal to the grid.
 {note}Setting up your own gateway to the grid is beyond the scope of this write-up. It involves installing and configuring the Virtual Data Toolkit ([VDT|http://vdt.cs.wisc.edu/]), installing a *host certificate* for the gateway machine, etc. For an administrative guide see the [Fermi Grid web page|http://fermigrid.fnal.gov/].
 {note}
  
 h2. Setup and Configuration
  
  
 h3. Kerberos
  
 Fermilab uses Kerberos for external authentication. _This section assumes that you have a Fermilab Kerberos principal._ Follow [these instructions|http://computing.fnal.gov/cd/forms/requirements_offsite_new.html] if you need an account at Fermilab and are authorized to obtain one.
  
 Assuming that your machine has recent versions of SSH and Kerberos and you will _not_ be using a Cryptocard, download Fermilab's official Kerberos configuration file:
 {noformat}wget http://security.fnal.gov/krb5.conf
 {noformat}Set the environment variable KRB5_CONFIG to point to the Fermilab configuration file.
 {noformat}export KRB5_CONFIG=`pwd`/krb5.conf
 {noformat}This variable can be added to your shell profile or set in a script; the configuration file it points to overrides the one in _/etc_.
  
 h3. Connecting to ILCSIM
  
 Initialize the Kerberos session.
 {noformat}kinit -f USERNAME@FNAL.GOV
{noformat}Connect to _ilcsim_ using _ssh_.
 {noformat}ssh USERNAME@ilcsim.fnal.gov
 {noformat}You may need to use [^ssh_config] for the SSH configuration file.
 {noformat}ssh -F ssh_config USERNAME@ilcsim.fnal.gov
 {noformat}
 
 h3. Set Up the Grid Tools
  
 Set up the grid tools in a bash shell.
 {noformat}source /fnal/ups/grid/setup.sh
 {noformat}Set up the grid tools in tcsh or csh.
 {noformat}source /fnal/ups/grid/setup.csh
 {noformat}
 
 h3. Session Certificate and Quotas
  
 Finally, obtain a session certificate.
 {noformat}voms-proxy-init -voms ilc:/ilc/sid
 {noformat}By default the proxy is valid for 12 hours, which is probably too short for your job. To obtain a proxy that is valid for 72 hours, issue the command:
 {noformat}voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
 {noformat}The quotas on the general FermiGrid are listed at [http://fermigrid.fnal.gov/grid_users/fermigrid_gpgrid.pdf]. The following groups should have a higher limit on the number of concurrent jobs than the generic _/ilc/detector_, so feel free to substitute them in the _voms-proxy-init_ command:
 * */ilc/sid* \- SiD
 * */ilc/ilddet* \- ILC Large Detector
  
 To check the status of the proxy:
 {noformat}voms-proxy-info -all
 {noformat}To check quotas and to check how many slots are already taken:
 {noformat}condor_config_val GROUP_QUOTA_group_siddet -name fnpc5x1.fnal.gov -pool fnpccm1.fnal.gov
 condor_userprio -all -pool fnpccm1.fnal.gov
 {noformat}
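 Since a proxy that expires while a job is running causes trouble, it can help to compare the remaining proxy lifetime with the expected run time before submitting. The following is only a minimal sketch and not part of the original recipe: the helper name _proxy_status_ and the 24 hour job length are illustrative assumptions. It relies on the _-timeleft_ option of _voms-proxy-info_, which prints the remaining lifetime in seconds.

```shell
#!/bin/sh
# proxy_status SECONDS_LEFT NEEDED_HOURS  (hypothetical helper)
# Prints "ok" if the remaining lifetime covers NEEDED_HOURS, else "renew".
proxy_status() {
    if [ "$1" -ge $(( $2 * 3600 )) ]; then
        echo ok
    else
        echo renew
    fi
}

# Query the real proxy only where the VOMS client tools are installed.
if command -v voms-proxy-info >/dev/null 2>&1; then
    left=$(voms-proxy-info -timeleft 2>/dev/null)
    # Treat a missing proxy or unexpected output as zero seconds left.
    case $left in
        ''|*[!0-9]*) left=0 ;;
    esac
    proxy_status "$left" 24   # 24 hours is an assumed job length
fi
```

 If the answer is _renew_, request a longer proxy with the _-valid_ option shown above before submitting.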
 
 h3. Running from an External Site
  
 If you want to submit jobs from a node other than ILCSIM, the ilc VOMS server information needs to be explicitly provided.
  
 The following should be put into a file, _ilc-fermilab-voms_.
 {noformat}"ilc" "voms.fnal.gov" "15023" "/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov" "ilc"
 {noformat}Now a valid grid certificate can be obtained by referencing this configuration file.
 {noformat}voms-proxy-init -voms ilc:/ilc -userconf ./ilc-fermilab-voms
 {noformat}
 {warning}The above command will fail if _ilc-fermilab-voms_ is not owned by your account.
 {warning}Provided that your local node is configured correctly, this should allow you to start a grid session on an external node outside Fermilab.
 {info}Simple commands such as _globus-job-run_ should work "out of the box" from an external site. In order to actually submit jobs to the Fermilab batch system, you will need to have a [Condor|http://www.cs.wisc.edu/condor/] job scheduler running. Talk to your site administrator about setting up this software, which can be configured as part of the [VDT|http://vdt.cs.wisc.edu/].
 {info}
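 The configuration file and the ownership requirement mentioned in the warning can be handled in one step. This is a hedged sketch that assumes the file is created in the current directory; the ownership test uses the shell's _-O_ operator and does not contact the VOMS server.

```shell
#!/bin/sh
# Write the VOMS server entry quoted in the text to a file in the current directory.
conf=./ilc-fermilab-voms
printf '%s\n' '"ilc" "voms.fnal.gov" "15023" "/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov" "ilc"' > "$conf"

# -O is true only when the file exists and is owned by the effective user.
if [ -O "$conf" ]; then
    echo "$conf is owned by $(id -un)"
else
    echo "$conf is not owned by you; voms-proxy-init is expected to fail" >&2
fi
```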
 
 h2. Example Grid Jobs
  
  
 h3. Submitting the First Example Jobs
  
 Now you should be all set up to submit a test job to make sure that everything is working. Copy and paste the following lines into your terminal window. This submits a grid job that starts 5 separate processes, each of which just executes sleep for 10 seconds before terminating. Since sleep produces no output, the files sleep_grid.out.$(Cluster).$(Process) and sleep_grid.err.$(Cluster).$(Process) should be empty.
  
 *Note\!*: $(Cluster) represents the job number and $(Process) represents the process number (0 to 4 for the 5 processes). The Condor log files are named sleep_grid.log.$(Cluster).$(Process).
 {noformat}cat > sleep_grid << +EOF
 universe = grid
 type = gt2
 globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
 executable = /bin/sleep
 transfer_output = true
 transfer_error = true
 transfer_executable = true
 log = sleep_grid.log.\$(Cluster).\$(Process)
 notification = NEVER
 output = sleep_grid.out.\$(Cluster).\$(Process)
 error = sleep_grid.err.\$(Cluster).\$(Process)
 stream_output = false
 stream_error = false
 ShouldTransferFiles = YES
 WhenToTransferOutput = ON_EXIT
 globusrsl = (jobtype=single)(maxwalltime=999)
 Arguments = 10
 queue 5
 +EOF
  
  
 condor_submit sleep_grid
 {noformat}The second example is an exploration job that reports the runtime environment it encounters and the file systems that are mounted. This is often useful for finding out what is available on the worker nodes :-) . After the job has finished, have a look at env_grid.out.$(Cluster).$(Process).
  
 *Note\!*: The grid job doesn't inherit the run time environment from your interactive session\!
 {noformat}rm -f env_grid.sh
 cat > env_grid.sh << +EOF
 #!/bin/sh -f
 printenv
 pwd
 cd \${_CONDOR_SCRATCH_DIR}
 pwd
 #
 # This sets up the environment for osg in case we want to
 # use grid services like srmcp
 #
 . \$OSG_GRID/setup.sh
 source \${VDT_LOCATION}/setup.sh
 printenv
 /bin/df
 +EOF
 chmod +x env_grid.sh
  
 rm -f env_grid.run
 cat > env_grid.run << +EOF
 universe = grid
 type = gt2
 globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
 executable = ./env_grid.sh
 transfer_output = true
 transfer_error = true
 transfer_executable = true
 log = env_grid.log.\$(Cluster).\$(Process)
 notification = NEVER
 output = env_grid.out.\$(Cluster).\$(Process)
 error = env_grid.err.\$(Cluster).\$(Process)
 stream_output = false
 stream_error = false
 ShouldTransferFiles = YES
 WhenToTransferOutput = ON_EXIT
 globusrsl = (jobtype=single)(maxwalltime=999)
 queue
 +EOF
  
 condor_submit env_grid.run
  
 {noformat}
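 The scripts above are created with here-documents whose delimiter (+EOF) is unquoted, so the submitting shell expands every unescaped $ while the file is being written. This is why variables that must be resolved on the worker node are written with a backslash in front of the $. A small self-contained sketch of the difference, using a hypothetical variable _DEMO_HOST_ and a scratch directory (both are assumptions for illustration):

```shell
#!/bin/sh
# DEMO_HOST is a hypothetical variable that exists only in this shell.
DEMO_HOST=submit-node
demo_dir=$(mktemp -d)

# Unquoted delimiter: $DEMO_HOST is expanded now, \$ is written as a literal $.
cat > "$demo_dir/heredoc_demo.sh" << +EOF
#!/bin/sh
echo "expanded at creation: $DEMO_HOST"
echo "expanded at run time: \${DEMO_HOST:-unset}"
+EOF

# Show the generated script, then run it in a fresh shell.
cat "$demo_dir/heredoc_demo.sh"
sh "$demo_dir/heredoc_demo.sh"
```

 The first line of the generated script contains the literal text submit-node, while the second still contains the variable reference and is evaluated only where the script runs.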
 
 h3. Submitting a Job running SLIC
  
 Now finally let's run SLIC :-) . We will use the SLIC installation and a data set that are available on the grid worker nodes. As in the previous examples, copy and paste the contents below:
 {noformat}rm -f slic_grid.csh
 cat > slic_grid.csh << +EOF
 #!/bin/csh
 echo start
 /bin/date
 cd \${_CONDOR_SCRATCH_DIR}
 setenv LABELRUN slic_grid-\${ClusterProcess}
 setenv TARFILE \${LABELRUN}-results.tar
 echo \${TARFILE}
 echo start
 /bin/date
 mkdir results
 /grid/app/ilc/sid/SimDist/v2r4p2/SimDist/scripts/slic.sh -r 5 \
 -g /grid/app/ilc/detector/SimDist/detectors/sid01/sid01.lcdd \
 -i /grid/data/ilc/detector/LDC/stdhep/ZZ_run10.stdhep -o ./results/ZZ_run10\${LABELRUN} >& \
 ./results/ZZ_run10\${LABELRUN}.lis
 ls -lh results
 /bin/date
 echo "build output tarball: " \${TARFILE}
 tar -cf \${TARFILE} results
 echo done
 +EOF
 chmod +x slic_grid.csh
  
 rm -f slic_grid.run
 cat > slic_grid.run << +EOF
 universe = grid
 type = gt2
 globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
 executable = ./slic_grid.csh
 transfer_output = true
 transfer_error = true
 transfer_executable = true
 environment = "ClusterProcess=\$(Cluster)-\$(Process)"
 transfer_output_files = slic_grid-\$(Cluster)-\$(Process)-results.tar
 log = slic_grid.log.\$(Cluster).\$(Process)
 notification = NEVER
 output = slic_grid.out.\$(Cluster).\$(Process)
 error = slic_grid.err.\$(Cluster).\$(Process)
 stream_output = false
 stream_error = false
 ShouldTransferFiles = YES
 WhenToTransferOutput = ON_EXIT
 globusrsl = (jobtype=single)(maxwalltime=999)
 queue
 +EOF
  
 condor_submit slic_grid.run
  
 {noformat}
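 Once the job has finished, the tarball named in transfer_output_files is copied back to the submit directory. A minimal sketch for unpacking such tarballs, each into its own directory, could look like this. The helper name _unpack_results_ and the directory naming scheme are assumptions for illustration.

```shell
#!/bin/sh
# Unpack every result tarball matching the transfer_output_files pattern
# into a directory named after the tarball (hypothetical helper).
unpack_results() {
    for tarball in slic_grid-*-results.tar; do
        [ -e "$tarball" ] || continue      # no match: the glob stays literal
        dest=${tarball%-results.tar}       # strip the "-results.tar" suffix
        mkdir -p "$dest"
        tar -xf "$tarball" -C "$dest"
        echo "unpacked $tarball into $dest/"
    done
}

# Run in the directory from which the jobs were submitted.
unpack_results
```

 Each directory then contains the _results_ folder written by the job, including the .lis file with the SLIC log output.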
 
 h3. Running Commands directly on the Head Node
  
 To run some commands directly on the grid head nodes use a syntax like this:
 {noformat}globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app
 globus-job-run fngp-osg.fnal.gov/jobmanager-condor /usr/bin/printenv
 globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/df
 {noformat}The examples above show how to check what grid applications are installed, the runtime environment of a job and what file systems are mounted. To check for available SLIC/[SimDist|http://confluence.slac.stanford.edu/display/ilc/Simulator+Distribution] distributions type:
 {noformat}globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app/ilc/detector/SimDist/
 {noformat}
h3. Checking and Killing your Jobs, Releasing Held Jobs
  
  
 You can see the status of all jobs using the following command:
 {noformat}condor_q
 {noformat}or
 {noformat}condor_q -globus
 {noformat}Or to check the jobs submitted by user <username>:
 {noformat}condor_q -submitter <username>
 {noformat}You can view information about all requests with the following command:
 {noformat}condor_status -submitters
 {noformat}To cancel a job, type condor_rm followed by the job number:
 {noformat}condor_rm <job number>
 {noformat}Condor can put a job into the held state when, for example, the proxy expires while the job is running. In that case the job might still be running fine on the worker node, but even after successful completion no log files or other output will be copied back. To remedy that situation, renew the proxy and then release the jobs:
  
 {noformat}voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
 condor_release -all
 {noformat}