Only one delivery can process a run at a time. This is enforced by a lock file in the run directory on u52/L1. If there are permanent failures in the run and another part of the run is waiting on the lock, the lock file has to be removed by hand. It should never be removed unless the only failures in the run are permanent ones, or there's a deadlock. Even then you have to wear a helmet and sign a waiver.
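
For readers unfamiliar with lock files, the sketch below shows the general technique in Python. It is not the pipeline's actual code: the file name `delivery.lock` and both function names are hypothetical, and the real lock file on u52/L1 may be created and named differently. It only illustrates why a crashed or permanently failed process leaves a stale lock that nothing but manual removal will clear.

```python
import os

# Hypothetical name; the real lock file name is not given on this page.
LOCK_NAME = "delivery.lock"


def acquire_run_lock(run_dir, lock_name=LOCK_NAME):
    """Try to take the lock for run_dir. Return True on success, False if it is held."""
    lock_path = os.path.join(run_dir, lock_name)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one caller wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        # Record the owner's PID to help whoever has to clean up by hand.
        handle.write(f"{os.getpid()}\n")
    return True


def release_run_lock(run_dir, lock_name=LOCK_NAME):
    """Remove the lock file. If the owner dies before this call, the lock stays behind."""
    os.remove(os.path.join(run_dir, lock_name))
```

In this sketch a second call to `acquire_run_lock` on the same directory returns `False` until `release_run_lock` has run, which mirrors the "another part of the run is waiting" situation described above.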

AFS buffers & idle threads

When the AFS servers where we keep temporary files hiccup, it's usually because they ran low on idle threads. It is possible to monitor the idle-thread count and intervene to stave off disaster. Unfortunately, the count is only available from Nagios, which only works inside SLAC's firewall.

  • Using ssh as a proxy

Useful resources

  • List of failed runs. Report failed runs and their status here. When a run is fixed (meaning no red cross appears anywhere anymore), you can delete it from the list. The runs listed as 'ignore error (it is fixed)' are the ones that were fixed with a fake delivery, so they still appear as failed in the Data Processing page.
  • AFS hiccups. We keep track here of the AFS hiccups (or at least we try to). You can add date/time, LSF job ID and the 'missing' file.
  • LSF hiccups. We keep track here of the LSF hiccups (or at least we try to). You can add date/time, LSF batch host(s) and the reason, if known.
  • List of runs incorrectly marked as InProgress. These are runs where acq status is Complete, but L1 status is InProgress.
