
...

Warning

The example scripts should be pasted directly into the terminal on detsim. Do not use an editor, as the escape characters will not be interpreted correctly.

...

  1. Obtain a DOE grid certificate from http://security.fnal.gov/pki/Get-Personal-DOEGrids-Cert.html, which also explains how to export the certificate from the browser, how to deal with directory permissions, and so on.
  2. Register with the ILC VO (Virtual Organization) at http://cd-amr.fnal.gov/ilc/ilcsim/ilcvo-registration.shtml, which will guide you to https://voms.fnal.gov:8443/vomrs/ilc/vomrs.
  3. Get an account on detsim using the following form: http://cd-amr.fnal.gov/ilc/ilcsim/ilcsim.shtml. This machine serves as a portal to the grid.
    Note

    Setting up your own gateway to the grid is beyond the scope of this write-up. It involves installing and configuring the Virtual Data Toolkit (VDT), installing a host certificate for the gateway machine, etc. For an administrative guide, see the Fermi Grid web page.

...

This variable can be added to your shell profile or set up in a script; the configuration file it points to will override the one in /etc.
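
For example, in a bash profile this amounts to exporting the variable. The name GRID_CONFIG_FILE and the path below are only placeholders for the variable and file described above.

No Format
# In ~/.bash_profile -- GRID_CONFIG_FILE and the path are placeholders
export GRID_CONFIG_FILE=$HOME/grid/myconfig.conf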

Connecting to detsim

Initialize the Kerberos session.

No Format
kinit -f USERNAME@FNAL.GOV

Connect to detsim using ssh.

No Format
ssh USERNAME@detsim.fnal.gov

You may need to point ssh at a custom configuration file (here called ssh_config) with the -F option.

No Format
ssh -F ssh_config USERNAME@detsim.fnal.gov
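
If Kerberos (GSSAPI) login is not picked up by default, the ssh_config file might contain something like the following minimal sketch. The options are standard OpenSSH settings; adjust them to your own setup.

No Format
# Minimal ssh_config sketch for Kerberos/GSSAPI logins to FNAL hosts
Host *.fnal.gov
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes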

Set Up the Grid Tools

Set up the grid tools in a bash shell.

No Format
source /fnal/ups/grid/setup.sh

Set up the grid tools in tcsh or csh.

No Format
source /fnal/ups/grid/setup.csh
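
To check that the setup script put the tools on your PATH, you can, for example, locate the commands and print the Condor version.

No Format
which voms-proxy-init condor_submit
condor_version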

Session Certificate and Quotas

...

If you want to submit jobs from a node other than detsim, the ILC VOMS server information needs to be provided explicitly.
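
As a sketch only: voms-proxy-init can be pointed at a vomses file that names the ILC VOMS server explicitly. The host below is taken from the registration page above, but PORT and SERVER_CERT_DN are placeholders; take the real values from the ILC VO documentation.

No Format
cat > ilc-vomses << +EOF
"ilc" "voms.fnal.gov" "PORT" "SERVER_CERT_DN" "ilc"
+EOF
voms-proxy-init -voms ilc -vomses ./ilc-vomses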

...

No Format
cat > sleep_grid << +EOF
universe = grid
grid_resource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = /bin/sleep
transfer_output = true
transfer_error = true
transfer_executable = true
log = sleep_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = sleep_grid.out.\$(Cluster).\$(Process)
error = sleep_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
Arguments = 10
queue 5
+EOF


condor_submit sleep_grid
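
When the five jobs finish, Condor transfers the log, output, and error files named in the submit file back to the submission directory, where they can be inspected, for example:

No Format
ls -l sleep_grid.log.* sleep_grid.out.* sleep_grid.err.*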

...

No Format
rm -f env_grid.sh
cat > env_grid.sh << +EOF
#!/bin/sh -f
printenv
pwd
cd \${_CONDOR_SCRATCH_DIR}
pwd
#
# This sets up the environment for osg in case we want to
# use grid services like srmcp
#
. \$OSG_GRID/setup.sh
source \${VDT_LOCATION}/setup.sh
printenv
/bin/df
+EOF
chmod +x env_grid.sh

rm -f env_grid.run
cat > env_grid.run << +EOF
universe = grid
grid_resource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = ./env_grid.sh
transfer_output = true
transfer_error = true
transfer_executable = true
log = env_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = env_grid.out.\$(Cluster).\$(Process)
error = env_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF

condor_submit env_grid.run
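
The printenv output ends up in the env_grid.out.* files, so you can check, for example, which OSG variables and scratch directory the worker node provided:

No Format
grep OSG_ env_grid.out.*
grep _CONDOR_SCRATCH_DIR env_grid.out.*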

...

No Format
rm -f slic_grid.csh
cat > slic_grid.csh << +EOF
#!/bin/csh
echo start
/bin/date
cd \${_CONDOR_SCRATCH_DIR}
setenv LABELRUN slic_grid-\${ClusterProcess}
setenv TARFILE \${LABELRUN}-results.tar
echo \${TARFILE}
echo start
/bin/date
mkdir results
/grid/app/ilc/sid/SimDist/v2r4p2/SimDist/scripts/slic.sh -r 5   \
-g /grid/app/ilc/detector/SimDist/detectors/sid01/sid01.lcdd            \
-i /grid/data/ilc/detector/LDC/stdhep/ZZ_run10.stdhep -o ./results/ZZ_run10\${LABELRUN} >& \
./results/ZZ_run10\${LABELRUN}.lis
ls -lh results
/bin/date
echo "build output tarball: " \${TARFILE}
tar -cf \${TARFILE} results
echo done
+EOF
chmod +x slic_grid.csh

rm -f slic_grid.run
cat > slic_grid.run << +EOF
universe = grid
grid_resource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = ./slic_grid.csh
transfer_output = true
transfer_error = true
transfer_executable = true
environment = "ClusterProcess=\$(Cluster)-\$(Process)"
transfer_output_files = slic_grid-\$(Cluster)-\$(Process)-results.tar
log = slic_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = slic_grid.out.\$(Cluster).\$(Process)
error = slic_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF

condor_submit slic_grid.run
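
When a job completes, the results tarball named in transfer_output_files is copied back to the submission directory; substitute the actual cluster and process numbers to unpack and inspect it:

No Format
tar -xf slic_grid-CLUSTER-PROCESS-results.tar
ls -lh results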

...

No Format
condor_q -submitter <username>

You can view information about all requests with the following command:

No Format
condor_status -submitters

To cancel a job, type condor_rm followed by the job number:

No Format
condor_rm <job number>

Condor can put a job into the held state when, for example, the proxy expires while the job is running. In that case the job may still be running fine on the worker node, but even after successful completion no log files or other output will be copied back. To remedy the situation, renew the proxy and then release the jobs, as shown below.
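
For example, assuming the proxy is renewed with voms-proxy-init for the ilc VO, each held job can then be released with condor_release (repeat for every held job number listed by condor_q):

No Format
voms-proxy-init -voms ilc
condor_release <job number>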

...