Q: What is the meaning of the LSF ERROR:
Job 813316: </afs/slac/package/pinger/analysis/wrap-analyze-daily.pl --basedir /nfs/slac/g/net/pinger --usemetric --dataset hep --date 365days --set_metric 1 # unreachability,conditional_loss_probabaility, 6.2hrs on kiso> in cluster <slac> Done

ERROR: The output file name you specified relative to the submit-time current working directory of /u/sf/pinger/ in the bsub -o option is incorrect or not useable for output. This job was running on dole0014 and attempted to write the data to that location but failed. To prevent losing the output, an attempt has been made to temporarily store it in /nfs/farm/knackery/pinger.job.813316.output.

Failure to write to the specified output file location would typically be caused by the following kinds of things:
  o The output file location is out of space or over the user's quota;
  o The user does not have write privileges to the file;
  o The directory specified for the file does not exist;
  o The output file specification did not specify a full path and the current working directory at the time of job submission was not what was intended.
  o Some system failure occured such that the output file could not be opened or mounted.

The file shown above should be readable from any interactive server. Please erase it after you have looked at it. The file will be erased automatically in 72 hours.
Answer
Look at the file indicated:
[cottrell@pinger ~]$ cat /nfs/farm/knackery/pinger.job.813316.output
Job </afs/slac/package/pinger/analysis/wrap-analyze-daily.pl --basedir /nfs/slac/g/net/pinger --usemetric --dataset hep --date 365days --set_metric 1 # unreachability,conditional_loss_probabaility, 6.2hrs on kiso> was submitted from host <lnxcron> by user <pinger> in cluster <slac>.
Job was executed on host(s) <dole0014>, in queue <xxl>, as user <pinger> in cluster <slac>.
</u/sf/pinger> was used as the home directory.
</u/sf/pinger> was used as the working directory.
Started at Sat Aug 17 08:10:05 2013
Results reported at Mon Aug 19 20:05:49 2013

Cannot open your job file: /u/sf/pinger/.lsbatch/1376752201.813316

Successfully completed.

Resource usage summary:
    CPU time               : 6574.51 sec.
    Max Memory             : 607 MB
    Average Memory         : 457.35 MB
    Total Requested Memory : -
    Delta Memory           : -
    (Delta: the difference between total requested memory and actual max usage.)
    Max Swap               : 1002 MB
    Max Processes          : 6
    Max Threads            : 7

PS: Unable to read output data from the stdout buffer file </u/sf/pinger/.lsbatch/1376752201.813316.out>: your job was probably aborted prematurely.
Looking at /u/sf/pinger/.lsbatch
[cottrell@pinger ~]$ ls /u/sf/pinger/.lsbatch/
1369482901.324134*
[cottrell@pinger ~]$ ls -ld /u/sf/pinger/.lsbatch
drwx------ 2 pinger sf 2048 Aug 17 23:52 /u/sf/pinger/.lsbatch/
[cottrell@pinger ~]$ df ~pinger
Filesystem 1K-blocks Used Available Use% Mounted on
AFS          9000000    0   9000000   0% /afs
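The first three causes listed in the LSF message (no space, no write privilege, missing directory) can also be checked from a script. The following is only a minimal Python sketch: the function name check_output_dir is hypothetical, and on AFS the results need care, because access there is governed by ACLs and tokens rather than Unix mode bits, and df reports placeholder sizes for /afs (the AFS command fs listquota is the usual way to see the real volume quota). It should be run as the user that owns the job.

```python
import os
import shutil


def check_output_dir(path):
    """Check one directory for the simple failure causes LSF lists.

    Hypothetical helper for illustration only. Returns a dict with:
      exists     - the path is an existing directory
      writable   - the current user appears able to create files in it
      free_bytes - free space reported by the filesystem (None if missing)
    """
    exists = os.path.isdir(path)
    report = {"exists": exists, "writable": False, "free_bytes": None}
    if exists:
        # Creating a file needs both write and search permission on the directory.
        report["writable"] = os.access(path, os.W_OK | os.X_OK)
        report["free_bytes"] = shutil.disk_usage(path).free
    return report


if __name__ == "__main__":
    # LSF keeps its per-job files in ~/.lsbatch of the submitting user.
    print(check_output_dir(os.path.expanduser("~/.lsbatch")))
```

If all three checks look healthy, as they do in the listing above, the remaining suspects are the working directory and a system or authentication failure at the time the job finished.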
The corresponding trscrontab entry is:
lnxcron;700 10 08 * * * /usr/local/bin/bsub -q xxl /afs/slac/package/pinger/analysis/wrap-analyze-daily.pl --basedir /nfs/slac/g/net/pinger --usemetric --dataset hep --date 365days --set_metric 1 # unreachability,conditional_loss_probabaility, 6.2hrs on kiso
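The token is set for 700 minutes, i.e. 11 hours 40 minutes. However, the job ran longer than that: the job report shows a start at 08:10:05 and results reported at 20:05:49, which is about 12 hours by time of day alone (and the dates, Sat Aug 17 and Mon Aug 19, suggest the elapsed time was closer to 60 hours). The token had therefore most likely expired by the time LSF tried to access its files under /u/sf/pinger/.lsbatch, which is in AFS. We need to increase the token lifetime in the trscrontab entry.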
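The comparison between the token lifetime and the job's elapsed time can be checked with a few lines of Python. This is a sketch only: the two timestamps are copied from the job report above, the 700 minutes comes from the trscrontab entry, and the function name token_outlived is hypothetical. It computes the elapsed time with the dates included, and also the time-of-day gap that the 12-hour figure corresponds to.

```python
from datetime import datetime, timedelta

# Format of the "Started at" / "Results reported at" lines in the LSF report.
LSF_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"

# Values copied from the job report and the trscrontab entry above.
started = datetime.strptime("Sat Aug 17 08:10:05 2013", LSF_TIME_FORMAT)
reported = datetime.strptime("Mon Aug 19 20:05:49 2013", LSF_TIME_FORMAT)
token_lifetime = timedelta(minutes=700)


def token_outlived(start, end, lifetime):
    """Return True if the job ended after a token issued at start expired."""
    return (end - start) > lifetime


# Elapsed time taking the dates at face value.
elapsed = reported - started
# Gap between the two times of day only (ignoring the dates).
clock_gap = timedelta(seconds=elapsed.seconds)

print("token lifetime:", token_lifetime)   # 11:40:00
print("elapsed:", elapsed)                 # 2 days, 11:55:44
print("time-of-day gap:", clock_gap)       # 11:55:44
print("token expired before job end:",
      token_outlived(started, reported, token_lifetime))  # True
```

Either reading of the timestamps exceeds 11 hours 40 minutes, so the new token lifetime should be chosen with a comfortable margin over the longest run seen so far.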