...
Warning |
---|
The example scripts should be pasted directly into the terminal on detsim. Do not use an editor, as the escape characters will not be interpreted correctly. |
...
- Obtain a DOE grid certificate from http://security.fnal.gov/pki/Get-Personal-DOEGrids-Cert.html, which also explains how to export the certificate from the browser, deal with directory permissions, etc.
- Register with the ILC VO (Virtual Organization) at http://cd-amr.fnal.gov/ilc/ilcsim/ilcvo-registration.shtml, which will guide you to https://voms.fnal.gov:8443/vomrs/ilc/vomrs
- Get an account on detsim using the following form: http://cd-amr.fnal.gov/ilc/ilcsim/ilcsim.shtml. This machine serves as a portal to the grid.
Note Setting up your own gateway to the grid is beyond the scope of this write-up. It involves installing and configuring the Virtual Data Toolkit (VDT), installing a host certificate for the gateway machine, etc. For an administrative guide, see the Fermi Grid web page.
...
This variable can be added to your shell profile or set up in a script, and the configuration file it points to will override the one in /etc.
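For example, in a bash profile this might look as follows; the variable name VOMS_USERCONF and the path are only illustrative stand-ins for the variable described above.
No Format |
---|
# Illustrative only: the variable name and path are placeholders.
# Point the grid tools at a private configuration file that
# overrides the system-wide one in /etc.
export VOMS_USERCONF=$HOME/grid/my_vomses
|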
Connecting to detsim
Initialize the Kerberos session.
No Format |
---|
kinit -f USERNAME@FNAL.GOV |
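You can verify that a forwardable ticket was obtained by listing the ticket cache with its flags:
No Format |
---|
klist -f |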
Connect to detsim using ssh.
No Format |
---|
ssh USERNAME@detsim.fnal.gov |
You may need to use the -F option to point ssh at an alternate configuration file, ssh_config:
No Format |
---|
ssh -F ssh_config USERNAME@detsim.fnal.gov |
Setting Up the Grid Tools
Set up the grid tools in a bash shell.
No Format |
---|
source /fnal/ups/grid/setup.sh
|
Set up the grid tools in tcsh or csh.
No Format |
---|
source /fnal/ups/grid/setup.csh
|
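As a quick check that the setup script worked, you can verify that the Condor commands used below are now on your PATH, e.g.:
No Format |
---|
which condor_submit condor_q |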
Session Certificate and Quotas
...
If you want to submit jobs from a node other than detsim, the ilc VOMS server information needs to be provided explicitly.
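With a standard VOMS client this is typically done by supplying a vomses file on the command line. The sketch below is illustrative only; the port and certificate DN are placeholders, to be replaced with the values published for the ilc VO on the registration pages above.
No Format |
---|
# Sketch only: PORT and the DN are placeholders; take the real
# values from the ilc VO registration pages listed above.
echo '"ilc" "voms.fnal.gov" "PORT" "/DN/of/voms.fnal.gov" "ilc"' > my_vomses
voms-proxy-init -voms ilc -vomses my_vomses
|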
...
No Format |
---|
cat > sleep_grid << +EOF
universe = grid
GridResource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = /bin/sleep
transfer_output = true
transfer_error = true
transfer_executable = true
log = sleep_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = sleep_grid.out.\$(Cluster).\$(Process)
error = sleep_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
Arguments = 10
queue 5
+EOF
condor_submit sleep_grid
|
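The queue 5 statement submits five copies of the job. Once they have run, the per-job output files named after the cluster and process numbers should appear in the submit directory:
No Format |
---|
condor_q
ls sleep_grid.out.* sleep_grid.err.*
|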
...
No Format |
---|
rm -f env_grid.sh
cat > env_grid.sh << +EOF
#!/bin/sh -f
printenv
pwd
cd \${_CONDOR_SCRATCH_DIR}
pwd
#
# This sets up the environment for osg in case we want to
# use grid services like srmcp
#
. \$OSG_GRID/setup.sh
source \${VDT_LOCATION}/setup.sh
printenv
/bin/df
+EOF
chmod +x env_grid.sh
rm -f env_grid.run
cat > env_grid.run << +EOF
universe = grid
GridResource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = ./env_grid.sh
transfer_output = true
transfer_error = true
transfer_executable = true
log = env_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = env_grid.out.\$(Cluster).\$(Process)
error = env_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF
condor_submit env_grid.run
|
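When the job completes, the environment captured by printenv ends up in the returned output file. For example, you can check where the scratch directory lived on the worker node:
No Format |
---|
grep _CONDOR_SCRATCH_DIR env_grid.out.*
|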
...
No Format |
---|
rm -f slic_grid.csh
cat > slic_grid.csh << +EOF
#!/bin/csh
echo start
/bin/date
cd \${_CONDOR_SCRATCH_DIR}
setenv LABELRUN slic_grid-\${ClusterProcess}
setenv TARFILE \${LABELRUN}-results.tar
echo \${TARFILE}
echo start
/bin/date
mkdir results
/grid/app/ilc/sid/SimDist/v2r4p2/SimDist/scripts/slic.sh -r 5 \
 -g /grid/app/ilc/detector/SimDist/detectors/sid01/sid01.lcdd \
 -i /grid/data/ilc/detector/LDC/stdhep/ZZ_run10.stdhep -o ./results/ZZ_run10\${LABELRUN} >& \
 ./results/ZZ_run10\${LABELRUN}.lis
ls -lh results
/bin/date
echo "build output tarball: " \${TARFILE}
tar -cf \${TARFILE} results
echo done
+EOF
chmod +x slic_grid.csh
rm -f slic_grid.run
cat > slic_grid.run << +EOF
universe = grid
GridResource = gt2 fnpcosg1.fnal.gov/jobmanager-condor
executable = ./slic_grid.csh
transfer_output = true
transfer_error = true
transfer_executable = true
environment = "ClusterProcess=\$(Cluster)-\$(Process)"
transfer_output_files = slic_grid-\$(Cluster)-\$(Process)-results.tar
log = slic_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = slic_grid.out.\$(Cluster).\$(Process)
error = slic_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF
condor_submit slic_grid.run
|
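Once the job finishes, Condor transfers the results tarball back to the submit directory; unpack it to get at the SLIC output (the cluster and process numbers will vary):
No Format |
---|
tar -xf slic_grid-<cluster>-<process>-results.tar
ls -lh results
|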
...
No Format |
---|
condor_q -submitter <username> |
You can view information about all submitters' requests with the following command:
No Format |
---|
condor_status -submitters |
To cancel a job, type condor_rm followed by the job number:
No Format |
---|
condor_rm <job number> |
Condor can put a job into the held state when, for example, the proxy expires while the job is running. In that case the job might still be running fine on the worker node, but even after successful completion no log files or other output will be copied back. To remedy the situation, renew the proxy and then release the jobs, as shown below.
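For example, assuming the proxy is renewed with voms-proxy-init as above, the held jobs can then be released with condor_release:
No Format |
---|
voms-proxy-init -voms ilc
condor_release <job number>
|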
...