...
Download the file.
No Format |
---|
wget http://security.fnal.gov/krb5.conf
|
Set the environment variable KRB5_CONFIG to point to the Fermilab configuration file.
No Format |
---|
export KRB5_CONFIG=`pwd`/krb5.conf
|
...
Initialize the Kerberos session.
No Format |
---|
kinit -f USERNAME@FNAL.GOV
|
Connect to ilcsim using ssh.
No Format |
---|
ssh USERNAME@ilcsim.fnal.gov
|
You may need to point ssh at a custom configuration file, ssh_config, with the -F option.
No Format |
---|
ssh -F ssh_config USERNAME@ilcsim.fnal.gov
|
...
Set up the grid tools in a bash shell.
No Format |
---|
source /fnal/ups/grid/setup.sh
|
Set up the grid tools in tcsh or csh.
No Format |
---|
source /fnal/ups/grid/setup.csh
|
...
Finally, obtain a session certificate (a VOMS proxy).
No Format |
---|
voms-proxy-init -voms ilc:/ilc/sid
|
By default the proxy is valid for 12 hours, which is probably too short for your job. To obtain a proxy that is valid for 72 hours, issue the command:
No Format |
---|
voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
|
...
To check the status of the proxy:
No Format |
---|
voms-proxy-info -all
|
To check quotas and how many slots are already taken:
No Format |
---|
condor_config_val GROUP_QUOTA_group_siddet -name fnpc5x1.fnal.gov -pool fnpccm1.fnal.gov
condor_userprio -all -pool fnpccm1.fnal.gov
|
...
The following should be put into a file named ilc-fermilab-voms. The quoted fields are the VO alias, the VOMS server host, the port, the server's certificate DN, and the VO name.
No Format |
---|
"ilc" "voms.fnal.gov" "15023" "/DC=org/DC=doegrids/OU=Services/CN=http/voms.fnal.gov" "ilc"
|
Now a valid grid proxy can be obtained by referencing this configuration file.
No Format |
---|
voms-proxy-init -voms ilc:/ilc -userconf ./ilc-fermilab-voms
|
...
(Note: $(Cluster) is the Condor cluster (job) number and $(Process) is the process number, 0 through 4 for the five jobs queued below.)
The condor log files are: sleep_grid.log.$(Cluster).$(Process)
No Format |
---|
cat > sleep_grid << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = /bin/sleep
transfer_output = true
transfer_error = true
transfer_executable = true
log = sleep_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = sleep_grid.out.\$(Cluster).\$(Process)
error = sleep_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
Arguments = 10
queue 5
+EOF
condor_submit sleep_grid
|
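Condor expands $(Cluster) and $(Process) in the submit file before the job starts. For a hypothetical cluster number 123, the five queued jobs above would produce log files named as in this sketch:

```shell
# Sketch: the log file names Condor generates for a hypothetical
# cluster number 123 submitted with "queue 5" (processes 0-4).
cluster=123
for process in 0 1 2 3 4; do
  echo "sleep_grid.log.${cluster}.${process}"
done > expected_logs.txt
cat expected_logs.txt
```

The .out and .err files follow the same naming pattern.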
...
Note: the grid job does not inherit the runtime environment of your interactive session!
No Format |
---|
rm -f env_grid.sh
cat > env_grid.sh << +EOF
#!/bin/sh -f
printenv
pwd
cd \${_CONDOR_SCRATCH_DIR}
pwd
#
# This sets up the environment for osg in case we want to
# use grid services like srmcp
#
. \${OSG_GRID}/setup.sh
source \${VDT_LOCATION}/setup.sh
printenv
/bin/df
+EOF
chmod +x env_grid.sh
rm -f env_grid.run
cat > env_grid.run << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = ./env_grid.sh
transfer_output = true
transfer_error = true
transfer_executable = true
log = env_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = env_grid.out.\$(Cluster).\$(Process)
error = env_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF
condor_submit env_grid.run
|
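Once the env_grid output files come back, it is worth checking that the worker node actually defines the variables the later examples rely on. A sketch of that check; the sample output file here is fabricated for illustration, in practice grep the real env_grid.out.&lt;Cluster&gt;.&lt;Process&gt; file:

```shell
# Fabricated stand-in for an env_grid.out.<Cluster>.<Process> file.
cat > env_grid.out.sample << 'EOF'
OSG_GRID=/usr/local/osg
_CONDOR_SCRATCH_DIR=/scratch/job1
EOF

# Check that the variables the SLIC example depends on are defined
# in the job's printenv output.
for var in OSG_GRID _CONDOR_SCRATCH_DIR; do
  if grep -q "^${var}=" env_grid.out.sample; then
    echo "${var} present"
  else
    echo "${var} MISSING"
  fi
done > env_check.txt
cat env_check.txt
```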
...
Now, finally, let's run SLIC. We will use the SLIC installation and a data set that are available on the grid worker nodes. As in the previous examples, cut and paste the contents below:
No Format |
---|
rm -f slic_grid.csh
cat > slic_grid.csh << +EOF
#!/bin/csh
echo start
/bin/date
cd \${_CONDOR_SCRATCH_DIR}
setenv LABELRUN slic_grid-\${ClusterProcess}
setenv TARFILE \${LABELRUN}-results.tar
echo \${TARFILE}
echo start
/bin/date
mkdir results
/grid/app/ilc/sid/SimDist/v2r4p2/SimDist/scripts/slic.sh -r 5 \
-g /grid/app/ilc/detector/SimDist/detectors/sid01/sid01.lcdd \
-i /grid/data/ilc/detector/LDC/stdhep/ZZ_run10.stdhep -o ./results/ZZ_run10\${LABELRUN} >& \
./results/ZZ_run10\${LABELRUN}.lis
ls -lh results
/bin/date
echo "build output tarball: " \${TARFILE}
tar -cf \${TARFILE} results
echo done
+EOF
chmod +x slic_grid.csh
rm -f slic_grid.run
cat > slic_grid.run << +EOF
universe = grid
type = gt2
globusscheduler = fngp-osg.fnal.gov/jobmanager-condor
executable = ./slic_grid.csh
transfer_output = true
transfer_error = true
transfer_executable = true
environment = "ClusterProcess=\$(Cluster)-\$(Process)"
transfer_output_files = slic_grid-\$(Cluster)-\$(Process)-results.tar
log = slic_grid.log.\$(Cluster).\$(Process)
notification = NEVER
output = slic_grid.out.\$(Cluster).\$(Process)
error = slic_grid.err.\$(Cluster).\$(Process)
stream_output = false
stream_error = false
ShouldTransferFiles = YES
WhenToTransferOutput = ON_EXIT
globusrsl = (jobtype=single)(maxwalltime=999)
queue
+EOF
condor_submit slic_grid.run
|
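When the slic_grid jobs finish, each one transfers back a single results tarball. A sketch of unpacking every tarball into its own directory so the per-job outputs do not overwrite each other; a dummy tarball is created first only so the loop has something to operate on:

```shell
# Create a dummy results tarball matching the naming scheme above
# (cluster 1, process 0), purely for demonstration.
mkdir -p results
echo demo > results/ZZ_run10slic_grid-1-0.lis
tar -cf slic_grid-1-0-results.tar results
rm -rf results

# Unpack each returned tarball into a directory named after it,
# keeping outputs from different jobs separate.
for tarball in slic_grid-*-results.tar; do
  dir="${tarball%.tar}"
  mkdir -p "$dir"
  tar -xf "$tarball" -C "$dir"
done
ls slic_grid-1-0-results/results
```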
...
To run some commands directly on the grid head nodes, use a syntax like this:
No Format |
---|
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /usr/bin/printenv
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/df
|
The examples above show how to check which grid applications are installed, what the runtime environment of a job looks like, and which file systems are mounted. To check for available SLIC/SimDist distributions, type:
No Format |
---|
globus-job-run fngp-osg.fnal.gov/jobmanager-condor /bin/ls /grid/app/ilc/detector/SimDist/
|
...
Checking and Killing your Jobs, Releasing Held Jobs
You can see the status of all jobs using the following command:
No Format |
---|
condor_q
|
or
No Format |
---|
condor_q -globus
|
Or to check the jobs submitted by user <username>:
No Format |
---|
condor_q -submitter <username>
|
You can view information about all requests with the following command:
No Format |
---|
condor_status -submitters
|
To cancel a job type condor_rm followed by the job number:
No Format |
---|
condor_rm <job number>
|
Condor can put a job into the held state when, for example, the proxy expires while the job is running. In that case the job may still be running fine on the worker node, but even after successful completion no log files etc. will be copied back. To remedy this situation, renew the proxy and then release the jobs.
No Format |
---|
voms-proxy-init -valid 72:00 -voms ilc:/ilc/sid
condor_release -all
|
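Before releasing held jobs it helps to confirm that the renewed proxy will outlive the job's walltime limit. A sketch of that comparison, assuming timeleft comes from `voms-proxy-info -timeleft` (which prints the remaining lifetime in seconds); the value below is fabricated for the demo:

```shell
# In practice: timeleft=$(voms-proxy-info -timeleft)
timeleft=259200          # 72 h in seconds, fabricated for this demo
maxwalltime_min=999      # matches the globusrsl in the examples above

# Release only if the proxy outlasts the job's maximum walltime.
if [ "$timeleft" -gt $(( maxwalltime_min * 60 )) ]; then
  decision="release"
else
  decision="renew"
fi
echo "$decision"
```

If the result is "renew", run voms-proxy-init again before condor_release.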