Overview

This example shows how to run jobs, in particular SLIC, the Simulator for the Linear Collider, on FermiGrid, which is part of the Open Science Grid. SLIC is a Geant4-based simulation package that uses an XML input format called LCDD to describe the detector geometry, sensitive detectors, and readout.

Note! The examples are written so that you can cut and paste them directly from your browser window into a terminal session on ILCSIM. Do not cut and paste them into an editor; that will not work unless you remove the escaping backslashes (\). Instead, copy the files that are created and modify them to fit your own needs.

Prerequisites

  1. Obtain a DOE grid certificate from http://security.fnal.gov/pki/Get-Personal-DOEGrids-Cert.html, which also explains how to export the certificate from the browser and how to deal with directory permissions, etc.
  2. Register with the ILC VO (Virtual Organization) at http://cd-amr.fnal.gov/ilc/ilcsim/ilcvo-registration.shtml, which will guide you to https://voms.fnal.gov:8443/vomrs/ilc/vomrs
  3. Get an account on ILCSIM and ILCSIM2 using the following form: http://cd-amr.fnal.gov/ilc/ilcsim/ilcsim.shtml. These machines serve as portals to the grid.

Setting up your own gateway to the grid is beyond the scope of this write-up. It involves installing and configuring the Virtual Data Toolkit (VDT), installing a host certificate for the gateway machine, etc. For an administrative guide, see the Fermi Grid web page.

Connecting from an External Site

Kerberos Configuration

Fermilab uses Kerberos for external authentication. This section assumes that you have a Fermilab Kerberos principal. Follow these instructions if you need an account at Fermilab and are authorized to obtain one.

Assuming that your machine has recent versions of SSH and Kerberos and you will not be using a Cryptocard, download Fermilab's official Kerberos configuration file. You will need root permissions to execute the following command which backs up your current configuration and replaces it with the new one.

cd /etc
cp krb5.conf krb5.conf.bkp
wget http://security.fnal.gov/krb5.conf
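
To verify that the new configuration is in place, you can check the default realm it defines. The value FNAL.GOV is an assumption about what Fermilab's official krb5.conf contains; adjust your expectation if your output differs.

# quick sanity check of the installed configuration (FNAL.GOV is assumed here)
grep default_realm /etc/krb5.conf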

Logging In

Logging in to a Fermilab gateway machine should now work provided that Kerberos has been configured correctly.

Initialize the Kerberos session.

kinit -f USERNAME

Now ssh to a gateway machine. You may need to use ssh_config as your SSH configuration file for the Kerberos ticket forwarding to work correctly.

ssh -F ssh_config USERNAME@flxi02.fnal.gov
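
If you do not already have such an ssh_config file, a minimal sketch of its contents is shown below. The GSSAPI settings are standard OpenSSH client options that enable Kerberos authentication and forwarding of your ticket to the gateway; the host pattern is an assumption and can be widened or narrowed as needed.

# minimal ssh_config sketch for Kerberos (GSSAPI) login to Fermilab gateways
Host *.fnal.gov
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes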

Once on the gateway node inside Fermilab, execute kinit again.

kinit -f USERNAME

Now you should be able to connect to the ilcsim node.

ssh ilcsim

Grid

If you want to use a grid-enabled node outside Fermilab, you may only need to add a single configuration file containing the ilc VOMS server information.

The following should be put into a file that we will call ilc-fermilab-voms.

"ilc" "fermigrid2.fnal.gov" "15023" "/DC=org/DC=doegrids/OU=Services/CN=host/fermigrid2.fnal.gov" "ilc"

Now a certificate can be obtained by referencing this configuration when running voms-proxy-init.

voms-proxy-init -voms ilc:/ilc -userconf ./ilc-fermilab-voms

The file ilc-fermilab-voms must be owned by your account. The command will fail if this file has some other owner.

Provided that your local node is configured correctly, this should allow you to start a grid session on an external node outside Fermilab.
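
To confirm that the ilc VO attributes were actually attached to the proxy, voms-proxy-info can print the FQANs (fully qualified attribute names); you should see the /ilc group listed.

# list the VO attributes attached to the current proxy
voms-proxy-info -fqan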

Setting up the Environment at Fermilab

To set up the environment and to get the necessary grid proxy certificate, log onto ILCSIM and issue the following commands:

source /fnal/ups/grid/setup.sh
voms-proxy-init -voms ilc:/ilc/detector
# enter the passphrase of your grid certificate when prompted

To check the status of the proxy:

voms-proxy-info -all
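
The proxy only lives for a limited time, so a long interactive session may need to renew it. Below is a small sketch that re-runs voms-proxy-init when less than one hour of lifetime remains; the one-hour threshold is an arbitrary choice.

# renew the ilc proxy if less than 3600 seconds of lifetime remain
left=$(voms-proxy-info -timeleft 2>/dev/null || echo 0)
if [ "${left:-0}" -lt 3600 ]; then
    voms-proxy-init -voms ilc:/ilc/detector
fi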

Submitting the First Example Jobs

Now you should be all set up to submit a test job to make sure that everything is working. Cut and paste the following lines into your terminal window. This will submit a grid job that starts 5 separate processes. Each process just executes sleep for 10 seconds before terminating. Since no output is created, the sleep_grid.out.$(Cluster).$(Process) and sleep_grid.err.$(Cluster).$(Process) files should be empty.

(Note!: $(Cluster) represents the Condor job (cluster) number and $(Process) the process number, 0 through 4 for the 5 queued processes. The Condor log files are named sleep_grid.log.$(Cluster).$(Process). In the submit file below the dollar signs are escaped with backslashes so that the shell does not expand them while creating the file.)

cat > sleep_grid << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = /bin/sleep
transfer_output = true
transfer_error = true
transfer_executable = true
log = sleep_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = sleep_grid.out.\$(Cluster).\$(Process)
error = sleep_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
Arguments = 10
queue 5
+EOF


condor_submit sleep_grid
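
condor_submit prints the cluster number of the new job. A quick way to follow the five processes and to confirm afterwards that the output and error files are empty while the log files are not; 1234 below is a placeholder for your actual cluster number.

# follow the job until all five processes have left the queue
condor_q 1234
# after completion: .out and .err should be empty, .log should not
ls -l sleep_grid.out.1234.* sleep_grid.err.1234.* sleep_grid.log.1234.*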

The second example is an exploration job that reports the runtime environment it encounters and the file systems that are mounted. This is often useful for finding out what is available on the worker nodes. Have a look at env_grid.out.$(Cluster).$(Process) after the job completes.

Note!: The grid job does not inherit the runtime environment from your interactive session!

rm -f env_grid.sh
cat > env_grid.sh << +EOF
#!/bin/sh -f
printenv
cd \${_CONDOR_SCRATCH_DIR}
pwd
#
# This sets up the environment for osg in case we want to
# use grid services like srmcp
#
. \${OSG_GRID}/setup.sh
source \${VDT_LOCATION}/setup.sh
printenv
/bin/df
+EOF
chmod +x env_grid.sh

rm -f env_grid.run
cat > env_grid.run << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = ./env_grid.sh
transfer_output = true
transfer_error = true
transfer_executable = true
log = env_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = env_grid.out.\$(Cluster).\$(Process)
error = env_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF

condor_submit env_grid.run
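
Once the exploration job has finished, the interesting information is in env_grid.out.$(Cluster).$(Process). For example, the grid-related environment variables and the mounted file systems can be pulled out as sketched below; the variable-name prefixes are assumptions about what the worker node defines.

# grid-related environment variables reported by the worker node
grep -E '^(OSG_|VDT_|GLOBUS_)' env_grid.out.*
# the output of /bin/df appears at the end of the same file
tail -20 env_grid.out.*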

Submitting a Job running SLIC

Now let's finally run SLIC. We will use the SLIC installation and a data set that are available on the grid worker nodes. As in the previous examples, cut and paste the contents below:

rm -f slic_grid.csh
cat > slic_grid.csh << +EOF
#!/bin/csh
echo start
/bin/date
setenv LABELRUN slic_grid-\${ClusterProcess}
setenv TARFILE \${LABELRUN}-results.tar
echo \${TARFILE}
echo start
/bin/date
mkdir results
/grid/app/ilc/detector/SimDist/Oct-31-2007/SimDist/scripts/slic.sh -r 5 \
-g /grid/app/ilc/detector/SimDist/detectors/sid01/sid01.lcdd            \
-i /grid/data/ilc/detector/LDC/stdhep/ZZ_run10.stdhep -o ./results/ZZ_run10\${LABELRUN} >& \
./results/ZZ_run10\${LABELRUN}.lis
ls -lh results
/bin/date
echo "build output tarball: " \${TARFILE}
tar -cf \${TARFILE} results
echo done
+EOF
chmod +x slic_grid.csh

rm -f slic_grid.run
cat > slic_grid.run << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = ./slic_grid.csh
transfer_output = true
transfer_error = true
transfer_executable = true
environment = "ClusterProcess=\$(Cluster)-\$(Process)"
transfer_output_files = slic_grid-\$(Cluster)-\$(Process)-results.tar
log = slic_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = slic_grid.out.\$(Cluster).\$(Process)
error = slic_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF

condor_submit slic_grid.run
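
When the SLIC job completes, the tarball named in transfer_output_files is transferred back to your submission directory. Below is a sketch for unpacking and inspecting it; 1234-0 is a placeholder for the actual cluster and process numbers, and the file names follow the pattern set in slic_grid.csh.

# unpack the returned results (replace 1234-0 with your cluster-process label)
tar -xf slic_grid-1234-0-results.tar
ls -lh results/
# the log of the SLIC run:
less results/ZZ_run10slic_grid-1234-0.lis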

Running Commands directly on the Head Node

To run commands directly on the grid head node, use a syntax like this:

globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /usr/bin/printenv
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/df

The examples above show how to check which grid applications are installed, what the runtime environment of a job looks like, and which file systems are mounted. To check for available SLIC/SimDist distributions, type:

globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app/ilc/detector/SimDist/
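
If you run such checks frequently, a small shell function keeps the contact string in one place; the name osgrun is just a suggestion.

# convenience wrapper around globus-job-run for the FermiGrid head node
osgrun() {
    globus-job-run fngp-osg.fnal.gov/jobmanager-condor "$@"
}
# examples:
osgrun /bin/ls /grid/app/ilc/detector/SimDist/
osgrun /bin/df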

Checking and Killing your Jobs

You can see the status of all jobs using one of the following commands:

condor_q
or
condor_q -globus

Or to check the jobs submitted by user <username>:

condor_q  -submitter <username>

You can view information about all submitters and their queued jobs with the following command:

condor_status  -submitters

To cancel a job, type condor_rm followed by the job number:

condor_rm <job number>
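
condor_rm also accepts a cluster number or a user name, so several jobs can be removed at once:

# remove every process of one cluster
condor_rm 1234
# remove all jobs belonging to your account on this submit node
condor_rm $(whoami)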

Submitting Jobs to Fermilab Externally

As long as you have a valid grid session, the only additional step for submitting jobs to the Fermi grid nodes from an external site is to specify the job scheduler URL.

globus-job-submit fngp-osg.fnal.gov/jobmanager-condor myJob.run

The file myJob.run must be a Condor JDL file.
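
globus-job-submit returns a job contact URL that can be used to follow the job from the same external machine. Below is a sketch of the typical cycle using the standard Globus client commands; the job contact is the placeholder printed by the submit step.

# submit in batch mode; the command prints the job contact URL
globus-job-submit fngp-osg.fnal.gov/jobmanager-condor myJob.run
# check the state and retrieve the output using that contact
globus-job-status <job contact>
globus-job-get-output <job contact>
globus-job-clean <job contact>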
