Winter 2017
This page discusses, tracks, and records issues relating to the use of HPSS as a transparent backup for certain types of Fermi files stored in XROOTD. The motivation is to explore whether overall disk space, and the associated life-cycle replacement costs, can be reduced by relying more on tape (HPSS). Currently, HPSS is used in two ways: 1) as a list-driven second-copy backup for critical files; and 2) as a single-copy archive for old Pass7 data, which is automatically retrieved to disk when a user attempts access. This page explores the automatic archive/purge-from-disk/retrieve cycle as it might be applied to large, less frequently accessed files, e.g., recon, and how it would affect users attempting to access archived files. It is expected that such a system might have a significant impact on any future large-scale reprocessing projects.
Contacts: Wilko Kroeger, Nicola Omodei, Tom Glanzman, Richard Dubois
Initial Assessment
- Understand what steps would be necessary to implement the above scheme
- Test the system to validate the concepts
- Estimate manpower required to implement and maintain
- Estimate project cost savings
A first pass at identifying the steps needed for implementation yielded the following:
- Activating XROOTD storage assets for HPSS caching
- Use of a pre-stage command for hauling in large numbers of files, e.g., for a large reprocessing
- Automatic staging for small and/or non time-critical requests
  - latency for individual files could range from tens of seconds to minutes, owing to heavy load on the tape system
- Activating (or developing) an automatic scheme for purging unused files from disk after they have been written to HPSS
Testing
Several of these features have already seen testing or even production use.
HPSS is already used for archiving critical files, e.g., spacecraft downlinks from Goddard. It serves as a second backup copy in the event of catastrophic disk failure. (Note that due to RAID technology, single or even double disk failures are not fatal. However, in early 2016, SLAC lost an entire file server due to an overheating problem and while most data was recovered, some was lost.) This scheme has been in production since launch (and maybe before). There are no new "links" to be established between XROOTD and HPSS.
During the Pass8 reprocessing, the old Pass7 data were migrated to HPSS and removed from disk. However, they remained in the XROOTD system and users who attempted to access these files would trigger an automatic retrieval mechanism which involved a request to mount a tape, read the file and place the file back on disk where the user would have normal access. This system has been in place for several years, initially with only a single XROOTD server (wain017) authorized to make the retrieval request. Wilko has reported that the system has seen use, even to the extent that he needed to clear space for additional retrieval requests. Depending on the situation (batch vs. interactive), users might have seen a delay during file OPEN actions if that file only resided on tape.
A "pre-stage" command already exists which, armed with a list of archived/purged files, brings them all back to disk en masse. This is a likely scenario for a massive future reprocessing project. The pre-stage command has been tested in a limited way (prior to the Pass8 reprocessing).
A possible future test might include preparing a pre-stage list of one month's recon files and then calling them all back to disk to see how well the system operates and on what time-scale. This would emulate the preparation for a merit-level reprocessing project. (For the purposes of the test, old Pass7 recon files could be used. After the test, we would simply purge them from disk once again.)
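As a rough illustration of the list-preparation step for such a test, the sketch below selects one month's recon files from a catalog and writes their paths to a text file, one per line. The `CatalogEntry` record, its field names, the example paths, and the one-path-per-line list format are all assumptions made for illustration; the actual catalog query and the input format expected by the pre-stage command would need to be confirmed with Wilko.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record; the field names are illustrative only."""
    path: str         # logical path of the file in XROOTD
    file_type: str    # e.g. "recon" or "merit"
    data_date: date   # date of the data contained in the file


def build_prestage_list(entries: Iterable[CatalogEntry],
                        year: int, month: int,
                        file_type: str = "recon") -> List[str]:
    """Return the sorted paths of all files of one type for one calendar month."""
    selected = [
        e.path for e in entries
        if e.file_type == file_type
        and e.data_date.year == year
        and e.data_date.month == month
    ]
    return sorted(selected)


def write_prestage_list(paths: Iterable[str], out_path: str) -> None:
    """Write one path per line (assumed list format, not confirmed)."""
    with open(out_path, "w") as fh:
        for p in paths:
            fh.write(p + "\n")
```

Recording the wall-clock time at which the list is submitted, and again when the last file is back on disk, would give the overall time-scale the test is meant to measure.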
Other Development Needs
More specific development needs were outlined by Wilko.
Developing and implementing a disk purge policy
- possibly based on some combination of file type, size, last access time
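To make the discussion of a purge policy more concrete, here is a minimal sketch of one possible selection rule combining file type, size, and last access time. It is not an existing XROOTD or HPSS feature: the `DiskFile` record, the function name, and every threshold (file types, 100 MiB minimum size, 180 idle days) are placeholders chosen for illustration. The sketch also assumes that only files already written to HPSS are eligible for purging.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class DiskFile:
    """Hypothetical per-file record; the field names are illustrative only."""
    path: str
    file_type: str      # e.g. "recon"
    size_bytes: int
    last_access: float  # seconds since the epoch
    on_tape: bool       # True once a copy has been written to HPSS


def select_purge_candidates(files: Iterable[DiskFile],
                            now: float,
                            purge_types: Sequence[str] = ("recon",),
                            min_size_bytes: int = 100 * 1024**2,
                            min_idle_days: float = 180.0,
                            bytes_to_free: Optional[int] = None) -> List[DiskFile]:
    """Pick files to purge from disk, least recently accessed first.

    A file is eligible only if it already has a tape copy, is of a
    purgeable type, is large enough, and has been idle long enough.
    If bytes_to_free is given, stop once that much space is covered.
    """
    idle_cutoff = now - min_idle_days * SECONDS_PER_DAY
    eligible = [
        f for f in files
        if f.on_tape
        and f.file_type in purge_types
        and f.size_bytes >= min_size_bytes
        and f.last_access <= idle_cutoff
    ]
    eligible.sort(key=lambda f: f.last_access)  # oldest access first

    if bytes_to_free is None:
        return eligible

    chosen: List[DiskFile] = []
    freed = 0
    for f in eligible:
        if freed >= bytes_to_free:
            break
        chosen.append(f)
        freed += f.size_bytes
    return chosen
```

Ordering by last access time means that the files least likely to be requested again are removed first, which should limit the number of users who encounter staging delays.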
Developing and implementing a policy for dual-copy files
- tape, like disk space, is not immune to possible data loss (ageing, loss, mechanical problems, etc.)
Developing some new monitoring tools to ensure the system is running smoothly
- overall usage statistics, latencies
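As one example of what a latency monitor might report, the sketch below summarizes a set of measured staging latencies (in seconds) as a count, median, 95th percentile, and maximum. How the latencies would be collected is not addressed here; the function name and the choice of the nearest-rank percentile method are assumptions made for illustration.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_latencies(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Summarize staging latencies given in seconds.

    The 95th percentile uses the nearest-rank method: the smallest
    observed value such that at least 95% of samples are <= it.
    """
    if not latencies_s:
        return {"count": 0}

    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # 1-based rank
    return {
        "count": len(ordered),
        "median_s": statistics.median(ordered),
        "p95_s": ordered[rank - 1],
        "max_s": ordered[-1],
    }
```

Tracking the 95th percentile and maximum alongside the median would help show whether a few requests wait far longer than typical ones, for instance when the tape system is heavily loaded.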