Preparing XRootD cluster for a partial outage

Not all files in the FGST xrootd cluster are copied to tape, and not all file types are automatically staged to disk when a file is not found on disk. While xrootd servers are down, for example during an outage, some files will therefore be unavailable, and in general running jobs that read files from xrootd is not advised.
However, L1 can continue to run, because it just writes files and only reads files that it produced itself. The files it reads were typically produced within the last few hours.

  1. About 24h before the outage, make all xrootd servers that will be down read-only.
  2. Shortly before the outage, update the server file list /afs/slac.stanford.edu/g/glast/applications/xrootd/config/glast. It must contain the servers that stay up during the outage.
    This list is used by L1 and also by some of the client tools (xrdls).
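
Step 2 could be prepared with a small helper such as the sketch below. It is only an illustration: the format of the server file list is not described here, so the code assumes one hostname per line, and the function names are hypothetical. It writes the filtered list to a separate file so it can be reviewed before it is copied into place.

```python
def remaining_servers(current, going_down):
    """Return the hosts in `current` that are not in `going_down`.

    Order is preserved; the comparison ignores case and surrounding whitespace.
    """
    down = {host.strip().lower() for host in going_down}
    return [host for host in current if host.strip().lower() not in down]


def read_server_list(path):
    """Read a server list, ASSUMING one hostname per line (format not confirmed)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_server_list(path, servers):
    """Write one hostname per line to `path` (a scratch file, not the live list)."""
    with open(path, "w") as handle:
        for host in servers:
            handle.write(host + "\n")
```

A typical use would be to call read_server_list on the current list, pass the result and the hosts scheduled for the outage to remaining_servers, and write the outcome to a scratch file with write_server_list for inspection.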

Issues:

  1. If the fraction of servers that go down is too high, the redirector will stall and not redirect any clients. It stays in this state until enough servers have connected again. Restarting the redirectors will fix this issue. The threshold is configurable.
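
Before an outage it can help to check whether the planned number of servers going down is likely to trigger this stall. The sketch below is a rough estimate only: it assumes the threshold can be expressed as the minimum fraction of servers that must remain connected, and both the function name and the threshold values are hypothetical, not taken from the actual redirector configuration.

```python
def redirector_would_stall(total_servers, servers_down, min_fraction_up):
    """Estimate whether the redirector would stall.

    ASSUMPTION: the redirector stalls when the fraction of connected servers
    drops below `min_fraction_up` (a value between 0 and 1).
    """
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= servers_down <= total_servers:
        raise ValueError("servers_down must be between 0 and total_servers")
    fraction_up = (total_servers - servers_down) / total_servers
    return fraction_up < min_fraction_up
```

For example, with 4 servers, 2 of them down, and an assumed minimum of 0.75, the function returns True, which suggests either adjusting the threshold or planning a redirector restart.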