...
Warning |
---|
The example scripts should be pasted directly into the terminal on detsim. Do not use an editor, as the escape characters will not be interpreted correctly. |
...
- Obtain a DOE grid certificate from http://security.fnal.gov/pki/Get-Personal-DOEGrids-Cert.html, which also explains how to export the certificate from the browser, how to deal with directory permissions, etc.
- Register with the ILC VO (Virtual organization) at http://cd-amr.fnal.gov/ilc/ilcsim/ilcvo-registration.shtml, which will guide you to: https://voms.fnal.gov:8443/vomrs/ilc/vomrs
- Get an account on detsim, using the following form: http://cd-amr.fnal.gov/ilc/ilcsim/ilcsim.shtml. This machine serves as a portal to the grid.
Note Setting up your own gateway to the grid is beyond the scope of this write-up. It involves installing and configuring the Virtual Data Toolkit (VDT), installing a host certificate for the gateway machine, etc. For an administrative guide see the FermiGrid web page.
Setup and Configuration
Kerberos
...
Download the file.
No Format |
---|
wget http://security.fnal.gov/krb5.conf
|
Set the environment variable KRB5_CONFIG to point to the Fermilab configuration file.
No Format |
---|
export KRB5_CONFIG=`pwd`/krb5.conf
|
This variable can be added to your shell profile or set up in a script; the configuration file it points to will override the one in /etc.
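For example, a minimal snippet that could go in ~/.bash_profile (the file name and location are just an illustration, not part of the official instructions) downloads the configuration once and points KRB5_CONFIG at it on every login:
No Format |
---|
# Hypothetical ~/.bash_profile snippet: fetch the Fermilab Kerberos
# configuration if it is not there yet, then point KRB5_CONFIG at it.
if [ ! -f "$HOME/krb5.conf" ]; then
    wget -q -O "$HOME/krb5.conf" http://security.fnal.gov/krb5.conf
fi
export KRB5_CONFIG="$HOME/krb5.conf"
|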
Logging In
...
Connecting to detsim
Initialize the Kerberos session.
No Format |
---|
kinit -f USERNAME@FNAL.GOV
|
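To verify that the ticket was obtained and is forwardable (needed for the ssh hop below), you can list the cached credentials; this check is optional and not part of the original recipe:
No Format |
---|
# The 'F' flag in the output indicates a forwardable ticket
klist -f
|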
Connect to detsim using ssh
Now ssh to a gateway machine at Fermilab.
No Format |
---|
ssh USERNAME@detsim.fnal.gov
|
You may need to point ssh at a dedicated SSH configuration file (here called ssh_config) for the Kerberos ticket forwarding to work correctly.
No Format |
---|
ssh -F ssh_config USERNAME@detsim.fnal.gov
|
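A minimal ssh_config that enables Kerberos (GSSAPI) authentication and credential forwarding might look like the following. This is a sketch of commonly used OpenSSH options, not the official Fermilab configuration file:
No Format |
---|
# Example ssh_config (assumed, not the official Fermilab file)
Host *.fnal.gov
    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes
|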
Session Certificate and quotas
Now you should be able to connect to the ilcsim node.
No Format |
---|
ssh ilcsim
|
Setup the Grid Tools
To set up the environment and to get the necessary grid proxy certificate, issue the following commands on the ILCSIM machine (assumes a bash shell). You will be asked for your grid certificate passphrase.
No Format |
---|
source /fnal/ups/grid/setup.sh
voms-proxy-init -voms ilc:/ilc/sid
|
By default the proxy is valid for 12 hours, which is probably too short for your job. To obtain a proxy that is valid for 72 hours, issue the command:
No Format |
---|
voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
|
The following groups should have a higher limit on the number of concurrent jobs than the generic /ilc/detector group, so feel free to substitute them in the voms-proxy-init command (see the example after this list). The following web page lists the quotas on the general FermiGrid: http://fermigrid.fnal.gov/grid_users/fermigrid_gpgrid.pdf
- /ilc/sid - SiD
- /ilc/ilddet - ILC Large Detector
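For example, to request a 72-hour proxy in the ILD group instead of SiD (same flags as above, only the group changes):
No Format |
---|
voms-proxy-init -valid 72:00 -voms ilc:/ilc/ilddet
|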
To check the status of the proxy:
No Format |
---|
voms-proxy-info -all
|
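If you only need the remaining lifetime, for example in a script that renews the proxy before submitting, voms-proxy-info can report it in seconds. This is a convenience sketch, not part of the original instructions:
No Format |
---|
# Hypothetical helper: renew the proxy when less than about one hour remains
timeleft=$(voms-proxy-info -timeleft 2>/dev/null || echo 0)
if [ "$timeleft" -lt 3600 ]; then
    voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
fi
|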
To check quotas and to check how many slots are already taken:
No Format |
---|
condor_config_val GROUP_QUOTA_group_siddet -name fnpc5x1.fnal.gov -pool fnpccm1.fnal.gov
condor_userprio -all -pool fnpccm1.fnal.gov
|
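If you want to compare the quotas of several accounting groups at once, the same command can simply be looped over. Only group_siddet is taken from the example above; the other group name is a guess based on the VOMS groups listed earlier:
No Format |
---|
# Sketch: query the static quota for a few accounting groups
for g in group_siddet group_ilddet; do
    echo -n "$g: "
    condor_config_val GROUP_QUOTA_$g -name fnpc5x1.fnal.gov -pool fnpccm1.fnal.gov
done
|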
Running from an External Site
If you want to submit jobs from a node other than detsim, the ILC VOMS server information needs to be provided explicitly.
The following should be put into a file, ilc-fermilab-voms.
No Format |
---|
"ilc" "fermigrid2voms.fnal.gov" "15023" "/DC=org/DC=doegrids/OU=Services/CN=hosthttp/fermigrid2voms.fnal.gov" "ilc" |
Now a valid grid certificate can be obtained by referencing this configuration file.
No Format |
---|
voms-proxy-init -voms ilc:/ilc -userconf ./ilc-fermilab-voms
|
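Putting the two steps together, a small script on the external node might look like this; the file name and the 72-hour lifetime are simply carried over from the examples above:
No Format |
---|
# Create the VOMS server description once, then request a long-lived proxy
cat > ilc-fermilab-voms << +EOF
"ilc" "voms.fnal.gov" "15023" "/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov" "ilc"
+EOF
voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid -userconf ./ilc-fermilab-voms
|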
...
(Note: $(Cluster) is the Condor cluster (job) number and $(Process) is the process number, 0 through 4 for the five queued jobs.)
The Condor log files are named sleep_grid.log.$(Cluster).$(Process).
No Format |
---|
cat > sleep_grid << +EOF
universe = grid
grid_resource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = /bin/sleep
transfer_output = true
transfer_error = true
transfer_executable = true
log = sleep_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = sleep_grid.out.\$(Cluster).\$(Process)
error = sleep_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
Arguments = 10
queue 5
+EOF
condor_submit sleep_grid
|
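Once the five sleep jobs have finished (check with condor_q, described below), the transferred output, error, and log files appear in the submission directory and can be inspected directly; this is just a quick sanity check, not part of the original recipe:
No Format |
---|
# After completion: each of the 5 processes leaves its own .out/.err/.log file
ls -l sleep_grid.out.* sleep_grid.err.* sleep_grid.log.*
cat sleep_grid.out.*
|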
...
Note: The grid job does not inherit the runtime environment from your interactive session!
No Format |
---|
rm -f env_grid.sh
cat > env_grid.sh << +EOF
#!/bin/sh -f
printenv
pwd
cd \${_CONDOR_SCRATCH_DIR}
pwd
#
# This sets up the environment for OSG in case we want to
# use grid services like srmcp
#
source \${VDT_LOCATION}/setup.sh
printenv
/bin/df
+EOF
chmod +x env_grid.sh
rm -f env_grid.run
cat > env_grid.run << +EOF
universe = grid
grid_resource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = ./env_grid.sh
transfer_output = true
transfer_error = true
transfer_executable = true
log = env_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = env_grid.out.\$(Cluster).\$(Process)
error = env_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF
condor_submit env_grid.run
|
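When the job completes, the output file contains the environment as seen on the worker node; grepping for OSG-, VDT-, or Condor-related variables is a quick way to see what the grid setup provides. This is a convenience step, not part of the original recipe:
No Format |
---|
# Inspect the environment reported back by the worker node
grep -E 'OSG|CONDOR|VDT' env_grid.out.*
|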
...
Now, finally, let's run SLIC. We will use the SLIC installation and a data set that are available on the grid worker nodes. As in the previous examples, cut and paste the content below:
No Format |
---|
rm -f slic_grid.csh
cat > slic_grid.csh << +EOF
#!/bin/csh
echo start
/bin/date
cd \${_CONDOR_SCRATCH_DIR}
setenv LABELRUN slic_grid-\${ClusterProcess}
setenv TARFILE \${LABELRUN}-results.tar
echo \${TARFILE}
echo start
/bin/date
mkdir results
/grid/app/ilc/sid/SimDist/v2r4p2/SimDist/scripts/slic.sh -r 5 \
  -g /grid/app/ilc/detector/SimDist/detectors/sid01/sid01.lcdd \
  -i /grid/data/ilc/detector/LDC/stdhep/ZZ_run10.stdhep -o ./results/ZZ_run10\${LABELRUN} >& \
  ./results/ZZ_run10\${LABELRUN}.lis
ls -lh results
/bin/date
echo "build output tarball: " \${TARFILE}
tar -cf \${TARFILE} results
echo done
+EOF
chmod +x slic_grid.csh
rm -f slic_grid.run
cat > slic_grid.run << +EOF
universe = grid
grid_resource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = ./slic_grid.csh
transfer_output = true
transfer_error = true
transfer_executable = true
environment = "ClusterProcess=\$(Cluster)-\$(Process)"
transfer_output_files = slic_grid-\$(Cluster)-\$(Process)-results.tar
log = slic_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = slic_grid.out.\$(Cluster).\$(Process)
error = slic_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF
condor_submit slic_grid.run
|
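The submit file transfers one tarball per process back to the submission directory. After the job completes you can unpack it to look at the simulated events and the SLIC log; the cluster and process numbers below are placeholders for the values reported by condor_submit:
No Format |
---|
# Replace <Cluster> and <Process> with the numbers from your submission
tar -xf slic_grid-<Cluster>-<Process>-results.tar
ls -lh results/
|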
...
To run some commands directly on the grid head nodes use a syntax like this:
No Format |
---|
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /usr/bin/printenv
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/df
|
The examples above show how to check what grid applications are installed, the runtime environment of a job, and which file systems are mounted. To check for available SLIC/SimDist distributions type:
No Format |
---|
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app/ilc/detector/SimDist/
|
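The same technique can be used to browse the staged data sets, for example the stdhep files used in the SLIC example above:
No Format |
---|
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/data/ilc/detector/LDC/stdhep
|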
Checking and Killing Your Jobs, Releasing Held Jobs
You can see the status of all jobs using the following command:
No Format |
---|
condor_q
|
or
No Format |
---|
condor_q -globus
|
Or to check the jobs submitted by user <username>:
No Format |
---|
condor_q -submitter <username>
|
You can view information about all requests with the following command:
No Format |
---|
condor_status -submitters
|
To cancel a job type condor_rm followed by the job number:
No Format |
---|
condor_rm <job number>
|
Condor can put a job into the held state when, for example, the proxy expires while the job is running. In that case the job might still be running fine on the worker node, but even after successful completion no log files etc. will be copied back. To remedy the situation, renew the proxy and then release the jobs.
No Format |
---|
voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
condor_release -all
|
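To see which of your jobs are currently held (and the hold reason) before and after releasing them, condor_q can restrict the listing to held jobs:
No Format |
---|
condor_q -hold
|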