Q: What does the following LSF error mean?
Job 813316: </afs/slac/package/pinger/analysis/wrap-analyze-daily.pl --basedir /nfs/slac/g/net/pinger --usemetric --dataset hep --date 365days --set_metric 1 # unreachability,conditional_loss_probabaility, 6.2hrs on kiso> in cluster <slac> Done

ERROR:  The output file name you specified

relative to the submit-time current working directory of
    /u/sf/pinger/
in the bsub -o option is incorrect or not useable for output.
This job was running on dole0014 and attempted to write the data to
that location but failed.

To prevent losing the output, an attempt has been made to temporarily
store it in
    /nfs/farm/knackery/pinger.job.813316.output.

Failure to write to the specified output file location would typically
be caused by the following kinds of things:
 o The output file location is out of space or over the user's quota;
 o The user does not have write privileges to the file;
 o The directory specified for the file does not exist;
 o The output file specification did not specify a full path and the
   current working directory at the time of job submission was not
   what was intended.
 o Some system failure occured such that the output file could not
   be opened or mounted.

The file shown above should be readable from any interactive server.
Please erase it after you have looked at it.
The file will be erased automatically in 72 hours.
Answer

Look at the temporary output file named in the message:

[cottrell@pinger ~]$ cat /nfs/farm/knackery/pinger.job.813316.output
Job </afs/slac/package/pinger/analysis/wrap-analyze-daily.pl --basedir /nfs/slac/g/net/pinger --usemetric --dataset hep --date 365days --set_metric 1 # unreachability,conditional_loss_probabaility, 6.2hrs on kiso> was submitted from host <lnxcron> by user <pinger> in cluster <slac>.
Job was executed on host(s) <dole0014>, in queue <xxl>, as user <pinger> in cluster <slac>.
</u/sf/pinger> was used as the home directory.
</u/sf/pinger> was used as the working directory.
Started at Sat Aug 17 08:10:05 2013
Results reported at Mon Aug 19 20:05:49 2013
Cannot open your job file: /u/sf/pinger/.lsbatch/1376752201.813316
Successfully completed.

Resource usage summary:

    CPU time :               6574.51 sec.
    Max Memory :             607 MB
    Average Memory :         457.35 MB
    Total Requested Memory : -
    Delta Memory :           -
    (Delta: the difference between total requested memory and actual max usage.)
    Max Swap :               1002 MB

    Max Processes :          6
    Max Threads :            7


PS:

Unable to read output data from the stdout buffer file </u/sf/pinger/.lsbatch/1376752201.813316.out>: your job was probably aborted prematurely.

Looking at /u/sf/pinger/.lsbatch (note that the job file 1376752201.813316 named in the report is not there):

[cottrell@pinger ~]$ ls /u/sf/pinger/.lsbatch/
1369482901.324134*
[cottrell@pinger ~]$ ls -ld /u/sf/pinger/.lsbatch
drwx------ 2 pinger sf 2048 Aug 17 23:52 /u/sf/pinger/.lsbatch/
[cottrell@pinger ~]$ df ~pinger
Filesystem           1K-blocks      Used Available Use% Mounted on
AFS                    9000000         0   9000000   0% /afs
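The checks above (directory listing, permissions, free space) can be scripted. The sketch below is a hypothetical helper, `check_output_dir`, not part of LSF. Besides testing existence and mode bits, it tries to create a scratch file, which is the most direct test because it uses the caller's current credentials. On AFS the Unix mode bits shown by `ls -ld` do not reflect the directory ACLs, and `df` reports placeholder figures (as in the output above), so the real ACLs and volume quota are better checked with `fs listacl` and `fs listquota`. The probe write is only meaningful if it is run under the same identity and token the batch job would have.

```python
import os
import shutil
import tempfile
from typing import List


def check_output_dir(path: str) -> List[str]:
    """Return problems that would stop a job writing output into `path`.

    Hypothetical helper: checks the causes listed in the LSF message
    (missing directory, no write permission, no free space).
    """
    if not os.path.isdir(path):
        return [f"{path}: directory does not exist"]

    problems = []
    if not os.access(path, os.W_OK):
        problems.append(f"{path}: access() reports no write permission")

    # On AFS, df-style numbers are placeholders, so treat this as a hint only.
    if shutil.disk_usage(path).free == 0:
        problems.append(f"{path}: filesystem reports no free space")

    # The decisive test: actually create (and auto-delete) a scratch file
    # using the caller's current credentials, including any AFS token.
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"{path}: cannot create a file ({exc.strerror})")
    return problems


if __name__ == "__main__":
    for p in ("/u/sf/pinger", "/u/sf/pinger/.lsbatch"):
        print(p, check_output_dir(p) or "OK")
```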

Looking at the trscrontab, the entry is:

lnxcron;700 10 08 * * * /usr/local/bin/bsub -q xxl /afs/slac/package/pinger/analysis/wrap-analyze-daily.pl --basedir /nfs/slac/g/net/pinger --usemetric --dataset hep --date 365days --set_metric 1 # unreachability,conditional_loss_probabaility, 6.2hrs on kiso
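The entry packs several settings into one line. The trscrontab format is not documented here, so the following layout is an assumption inferred from this single entry: `host;token-lifetime-in-minutes`, then the five standard cron time fields, then the command. Read that way, `10 08 * * *` means daily at 08:10, which is consistent with the `Started at Sat Aug 17 08:10:05` line in the job report. A hypothetical parser, `parse_entry`, that splits the entry on that assumption:

```python
from typing import NamedTuple


class TrscronEntry(NamedTuple):
    host: str           # machine the job is launched from
    token_minutes: int  # token lifetime granted to the job, in minutes
    schedule: str       # the five cron time fields
    command: str        # everything after the time fields


def parse_entry(line: str) -> TrscronEntry:
    """Split a trscrontab line of the ASSUMED form
    'host;minutes min hour day-of-month month day-of-week command...'.
    Raises ValueError if the line does not fit that shape."""
    host, rest = line.split(";", 1)   # ValueError if there is no ';'
    fields = rest.split(None, 6)      # lifetime, 5 time fields, command
    if len(fields) < 7:
        raise ValueError("expected a token lifetime, 5 cron fields and a command")
    return TrscronEntry(host, int(fields[0]), " ".join(fields[1:6]), fields[6])


# The entry above, shortened (the trailing '#' comment is omitted).
entry = parse_entry(
    "lnxcron;700 10 08 * * * /usr/local/bin/bsub -q xxl "
    "/afs/slac/package/pinger/analysis/wrap-analyze-daily.pl "
    "--basedir /nfs/slac/g/net/pinger --usemetric --dataset hep "
    "--date 365days --set_metric 1"
)
print(entry.host, entry.token_minutes, divmod(entry.token_minutes, 60))  # lnxcron 700 (11, 40)
print(entry.schedule)  # 10 08 * * *
```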

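The token lifetime is set to 700 minutes (11 hrs 40 mins), but the job ran much longer than that: it started at Sat Aug 17 08:10:05 and its results were reported at Mon Aug 19 20:05:49, roughly 60 hours later. The AFS token therefore presumably expired before LSF tried to write the job output to the AFS home directory /u/sf/pinger. We need to increase the token lifetime in the trscrontab entry.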
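As a rough check, the run time can be computed from the `Started at` and `Results reported at` lines of the job report and compared with the token lifetime from the trscrontab entry. This is a sketch: `wall_clock` and `token_outlived` are hypothetical helper names, and the timestamp format is taken from the report above (local time, no time zone).

```python
from datetime import datetime, timedelta

LSF_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"   # e.g. "Sat Aug 17 08:10:05 2013"


def wall_clock(report: str) -> timedelta:
    """Elapsed time between the 'Started at' and 'Results reported at' lines."""
    started = reported = None
    for line in report.splitlines():
        if line.startswith("Started at "):
            started = datetime.strptime(line[len("Started at "):].strip(), LSF_TIME_FORMAT)
        elif line.startswith("Results reported at "):
            reported = datetime.strptime(line[len("Results reported at "):].strip(), LSF_TIME_FORMAT)
    if started is None or reported is None:
        raise ValueError("report lacks a start or finish time")
    # Both stamps are naive local times; fine unless a DST change falls in between.
    return reported - started


def token_outlived(report: str, token_minutes: int) -> bool:
    """True if the job ran longer than the token lifetime (in minutes)."""
    return wall_clock(report) > timedelta(minutes=token_minutes)


report = """Started at Sat Aug 17 08:10:05 2013
Results reported at Mon Aug 19 20:05:49 2013"""
print(wall_clock(report))           # 2 days, 11:55:44
print(token_outlived(report, 700))  # True
```

Whatever new lifetime is chosen should cover the job's actual run time with some margin, not just the 11 hrs 40 mins currently configured.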
