...

Issues (mainly based on Tom's input)

Bugs/Reliability issues

  1. Unreliability of the redirector-server link for writing files: if a server containing an old version of a file is delayed, off the network, hung, or shut down, the redirector writes the new file to a new server. This leaves two servers holding different versions of the same file.
  2. Unreliability of the redirector-server link for reading or removing files: if the server holding the desired file is delayed, off the network, hung, or shut down, the redirector has no way to distinguish this situation from a request for a nonexistent file.
    1. Note: propitious use of the dataCatalog might mitigate the ill effects of the items above. For example, if all files in xrootd are registered in the dataCatalog, a special-purpose script could "verify" all files by using the xrootd "stat" command (this could be done, say, asynchronously after a file's creation). Then all subsequent xrootd requests to read/write/delete/stat/copy/etc. could be wrapped in a script that first checks with the dataCatalog to confirm the initial conditions and then properly reports unexpected error situations.
  3. I (Tom) now think that all GLAST xrootd servers should be dedicated to that one function and not shared with other duties, such as NFS serving. Wilko recently relinquished /u44 (shared with Monte Carlo data disk /u45) but is still using (part of) /u42, whose server is shared by /u43. Further, /u42 itself is shared with other NFS users.
  4. The xrootd servers (redirector and disk servers) have in the past failed to restart after a machine reboot (or failed to start talking to each other after a reboot); this needs to be fixed.
  5. There is a known bug in the xrootd "rm" command which can cause client code to hang.
  6. The "xrdcp -f" command is not working
  7. The "rm" command now only removes an xrootd file from disk - and not HPSS. This is probably okay unless a new version of that file is created, which may cause confusion.
    1. This may become more critical when we add the ability to automatically tar files before archiving them
  8. There has been some evidence that xrootd has scaling issues, and we do not know how much load it will sustain. I have suggested (8/22/2007) setting up a Pipeline task expressly for the purpose of hammering on the system to locate its limitations and weaknesses, but we need an xrootd expert's help in designing the appropriate tool (a rough sketch of what such a task might do follows this list).
  9. There seems to be no global tool for assessing the status of the entire xrootd infrastructure as used by GLAST (including redirector(s) and server(s)). Perhaps a Nagios script or similar should be set up for proper monitoring (see the second sketch after this list).
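
A minimal sketch of what the hammering task of item 8 might look like, assuming files are fetched with xrdcp through the redirector. The hostname, test path, and concurrency numbers are placeholders, not actual GLAST configuration:

    #!/usr/bin/env python
    # Hypothetical xrootd load generator: many concurrent clients each
    # repeatedly copy one test file through the redirector; failures and
    # latencies are tallied at the end.
    import os, subprocess, tempfile, threading, time

    REDIRECTOR = "glast-rdr.slac.stanford.edu"  # placeholder redirector host
    TEST_FILE = "/glast/test/loadtest.dat"      # placeholder xrootd path
    WORKERS, ITERATIONS = 20, 50                # tune to taste

    results, lock = [], threading.Lock()
    tmpdir = tempfile.mkdtemp()

    def worker(wid):
        for i in range(ITERATIONS):
            dest = os.path.join(tmpdir, "w%d_i%d.dat" % (wid, i))
            start = time.time()
            rc = subprocess.call(
                ["xrdcp", "root://%s/%s" % (REDIRECTOR, TEST_FILE), dest])
            elapsed = time.time() - start
            if os.path.exists(dest):
                os.remove(dest)
            with lock:
                results.append((rc, elapsed))

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(WORKERS)]
    for t in threads: t.start()
    for t in threads: t.join()

    failures = sum(1 for rc, _ in results if rc != 0)
    times = [t for _, t in results]
    print("copies: %d  failures: %d  mean: %.2fs  max: %.2fs"
          % (len(results), failures, sum(times) / len(times), max(times)))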
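
For item 9, even a trivial connectivity probe of every redirector and data-server host would catch the most common failures. This sketch (hostnames again placeholders) checks that each machine accepts connections on the standard xrootd port, 1094, and exits nonzero in the Nagios style if any does not; a real check would also stat a known file through the redirector:

    #!/usr/bin/env python
    # Hypothetical xrootd health probe: try to open a TCP connection to
    # port 1094 (the standard xrootd port) on every host.
    import socket, sys

    HOSTS = ["glast-rdr.slac.stanford.edu",    # placeholder redirector
             "glast-data1.slac.stanford.edu",  # placeholder data servers
             "glast-data2.slac.stanford.edu"]
    PORT = 1094
    down = []

    for host in HOSTS:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(5)
        try:
            s.connect((host, PORT))
            print("%s: ok" % host)
        except socket.error:
            print("%s: DOWN" % host)
            down.append(host)
        s.close()

    # Nagios convention: 0 = OK, 2 = CRITICAL
    sys.exit(2 if down else 0)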

Miscellaneous/Management issues

  1. The casual way xrootd software components are managed (e.g., Wilko's private directory with no concept of a "release"). In fact, I set up a directory with appropriate AFS permissions expressly for this purpose (8/27/2007), /afs/slac.stanford.edu/g/glast/ground/PipelineConfig/xrootd, but it has not yet been used.
  2. Documentation for xrootd is spread over multiple disconnected Confluence pages; it should be consolidated in a single place, with appropriate links to other, external docs, i.e.,

...

  1. It would be nice if xrootd kept a file's original date when it is copied in, and maintained this date even if the file is added to or restored from HPSS.
  2. The command syntax, error messages, and return codes of commonly used xrootd commands could be improved to make them easier to use within (python) scripts.
  3. Lack of a reasonable way to list files in xrootd (e.g., an "ls" command).
    1. A possible alternative would be a nightly job which lists the files on all servers, combines the listings, and flags inconsistencies (e.g., the same file with different size/date); a rough sketch follows this list.
  4. Ability to migrate (or duplicate) data between disks without going via HPSS.
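
A minimal sketch of the comparison half of such a nightly job, assuming each server has already produced a listing with one "path size mtime" line per file; the listing format and the find invocation in the comment are assumptions, not an existing GLAST convention:

    #!/usr/bin/env python
    # Hypothetical nightly consistency check. Each argument is a per-server
    # listing file holding "path size mtime" lines, e.g. generated on each
    # server with:  find /xrootd/data -type f -printf "%p %s %T@\n"
    # (assumes no whitespace in file names).
    import sys
    from collections import defaultdict

    seen = defaultdict(list)  # path -> [(server listing, size, mtime), ...]
    for listing in sys.argv[1:]:
        for line in open(listing):
            path, size, mtime = line.split()
            seen[path].append((listing, int(size), float(mtime)))

    for path, copies in sorted(seen.items()):
        if len(copies) > 1:
            print("DUPLICATE: %s on %s"
                  % (path, ", ".join(c[0] for c in copies)))
            if len(set(c[1] for c in copies)) > 1:
                print("  (sizes differ: %s)" % sorted(set(c[1] for c in copies)))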

Note: propitious use of the dataCatalog might mitigate the ill effects of items 3 and 4. For example, if all files in xrootd are registered in the dataCatalog, a special-purpose script could "verify" all files by using the xrootd "stat" command (this could be done, say, asynchronously after a file's creation). Then all subsequent xrootd requests to read/write/delete/stat/copy/etc. could be wrapped in a script that first checks with the dataCatalog to confirm the initial conditions and then properly reports unexpected error situations (a sketch of such a wrapper follows).
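
A rough sketch of such a wrapper for reads, assuming some way to query the dataCatalog; catalog_has below is a hypothetical stand-in, since the real catalog interface is not spelled out here:

    #!/usr/bin/env python
    # Hypothetical dataCatalog-aware read wrapper: on failure, consult the
    # catalog to distinguish "file does not exist" from "the server holding
    # the file is delayed, hung, or off the network".
    import subprocess, sys

    def catalog_has(path):
        # Stand-in for a real dataCatalog query.
        raise NotImplementedError("wire the dataCatalog lookup in here")

    def safe_read(redirector, path, dest):
        rc = subprocess.call(
            ["xrdcp", "root://%s/%s" % (redirector, path), dest])
        if rc == 0:
            return 0
        if catalog_has(path):
            # Registered but unreadable: most likely a delayed, hung, or
            # offline server rather than a missing file.
            sys.stderr.write("ERROR: %s registered but unreachable\n" % path)
        else:
            sys.stderr.write("ERROR: %s is not a registered file\n" % path)
        return rc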
