
Policy by Folder


...

Space          Size       Backup        Lifetime     Comment
xtc            Unlimited  Tape archive  4 months     Raw data
usrdaq         Unlimited  Tape archive  4 months     Raw data from users' DAQ systems
hdf5           Unlimited  Tape archive  4 months     Data translated to HDF5
scratch        Unlimited  None          4 months     Temporary data (lifetime not guaranteed)
xtc/hdf5       10TB       n/a           2 years      Selected XTC and HDF5 runs
ftc            10TB       None          2 years      Filtered, translated, compressed
results        2TB        Tape backup   2 years
res            10TB       Tape          6 months     Analysis results
User home      20GB       Disk + tape   Indefinite   User code
Tape archive   Unlimited  Two copies    10 years     Raw data

...

In the past couple of years we have observed some aspects of the LCLS data retention policy which are not ideal:

  1. All experiments are treated equally, even though a few institutions copy the data home and do not need to keep it on disk at SLAC: ideally we would reserve that disk space for other experiments that do rely on it.
  2. Some folders and the different storage classes (short, medium and long term) were not always properly understood or properly used (e.g. ftc was often treated as scratch).
  3. It has been extremely hard, under the conditions of limited space, to keep the promise of preserving all the data on disk for their supposed lifetime: this has proved particularly tricky for scratch, where users can easily write tens of terabytes in a few hours.

The policy proposed below relies on two enabling technologies:

  1. We can constantly monitor the usage of the file system without affecting its performance.
  2. We can actually enforce the quota on a specific experimental directory at the file system level (as opposed to just alerting users who use too much space).

The addition of these capabilities allows us to enforce a policy that adapts to the actual usage of the system.

Proposed Policy

Raw data storage (xtc/ and hdf5/ folders)

...

  1. Deleting data too early, i.e. when files are still being actively accessed, can cause large, concurrent restore operations from tape which negatively affect the performance of the system.

We have studied the data usage over time and we have observed that:

  • The rate at which data are accessed starts decreasing around 130 days after the experiment ends, both for raw and generated data.

Proposed Policy

Based on the observations above we propose to:

  • Eliminate the short and medium storage classes.
  • Switch from an experiment-wide policy, where all files from one experiment stay on disk for a fixed duration after the end of the experiment, to a run-wide policy, where the expiration status of the files is determined by the access pattern of the files of a specific run.

...

  • This will be enforced by the constant monitoring of the file systems to determine which files were accessed and when. We will also track whether the files were actually analyzed or just "touched" to dodge the policy.

The actual expiration threshold will be calculated dynamically, based on the amount of free space available on the corresponding file system at the time of the cleanup procedure. Files with the most recent access times will stay on disk; the cleanup process will remove older files until it achieves the goal of preserving 20% of free space on the file system. The cleanup will affect whole runs (not individual files) which were determined to be "expired". An automatic notification message will be sent to the PI (or all members?) of the affected experiment after each cleanup.
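The cleanup pass above can be sketched as a simple greedy procedure. This is only an illustration of the idea, not the actual implementation: the run identifiers, sizes and access times below are made-up placeholders, not real monitoring data.

```python
# Illustrative sketch of the run-level cleanup: remove whole runs,
# oldest last-access time first, until the target fraction of the
# file system (20% by default) is free.

def cleanup(runs, capacity, free, target=0.20):
    """runs: list of (run_id, size, last_access); returns run ids to delete."""
    expired = []
    # Runs with the oldest last-access times go first; recently
    # accessed runs survive the pass.
    for run_id, size, last_access in sorted(runs, key=lambda r: r[2]):
        if free / capacity >= target:
            break
        free += size          # deleting a whole run frees its full size
        expired.append(run_id)
    return expired

# Hypothetical example: 100 TB file system with only 10 TB free.
runs = [("r0001", 30, "2014-01-10"),
        ("r0002", 25, "2014-03-02"),
        ("r0003", 20, "2014-06-15")]
print(cleanup(runs, capacity=100, free=10))   # -> ['r0001']
```

Deleting the single oldest run already brings the free space above the 20% goal, so the more recently accessed runs stay on disk.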

User data folders (res/, scratch/ and ftc/)

The amount of storage available for scratch/ will be increased by redistributing the storage resources freed by the ftc/ file systems. As for the raw data folders (xtc and hdf5), the retention of files on scratch will switch from the fixed-term expiration model to the last-access-time-based model. The cleanup algorithm will remove the files with the oldest access times until the desired goal of having 20% of free space is met.

To compensate for the loss of guaranteed longer-term storage caused by the elimination of ftc/ and the change in the file expiration terms of scratch/, we will increase the quota on the res/ file systems. Note that res/ is the only user data folder which is backed up to tape. We will also enforce the per-experiment storage quota for res/ (something we could not do before): the new mechanism will prevent members of an experiment from writing new files into their res/ folder after the experiment's quota limit is reached.

The existing 24-month expiration policy will be respected after the change. At the same time, however, we are changing the expiration enforcement technology for the file system by switching to Lustre HSM/HPSS. This technology allows expired files to stay in the file system namespace (i.e. remain visible to users with the 'ls' command) while the actual file content is removed from disk. Files which are no longer on disk will be automatically restored from tape on the first attempt to open them. This process is meant to be transparent to a user application, except for an extra delay before the file content becomes available to the user process: a process opening such a file will simply block at the open stage until the file has been brought back from tape to disk.

Note that the last measure may also allow an experiment to exceed the nominal amount of data associated with res/: the per-experiment quota will only apply to files whose content is actually on disk. Older files which were copied to tape and whose content is no longer present on disk will not be counted toward the experiment's hard quota. We may also allow users to control the file migration process by giving them a command-line and/or Web-based tool for inspecting the status of files on disk and requesting transitions (releasing files from disk or restoring them from tape). This should prevent inefficient use of the batch system by analysis jobs hitting files which are not actually on disk.
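The intended accounting can be sketched as follows. This is a minimal illustration under assumed data structures (a per-file size plus an on_disk flag); the function names and numbers are hypothetical, not the real quota implementation.

```python
# Sketch of per-experiment quota accounting for res/: only content that
# is actually on disk counts toward the hard quota; files released to
# tape (on_disk=False) are excluded. All names and sizes are illustrative.

def disk_usage(files):
    """files: list of (size, on_disk); returns the usage counted for quota."""
    return sum(size for size, on_disk in files if on_disk)

def can_write(files, new_size, quota):
    """Reject new writes once the on-disk usage would exceed the quota."""
    return disk_usage(files) + new_size <= quota

files = [(6, True), (5, False), (3, True)]  # the 5-unit file is tape-only
print(disk_usage(files))        # -> 9, the tape-only file is excluded
print(can_write(files, 1, 10))  # -> True
print(can_write(files, 2, 10))  # -> False
```

Note how the experiment's total data (14 units) exceeds its on-disk usage (9 units): the tape-resident portion does not block new writes until it is restored to disk.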

...

  • Change the lifetime for raw and scratch data disks to 4 months. (For raw data the cleanup will affect all files, i.e. all streams and chunks, which make up one run, rather than individual files.)
  • Eliminate the ftc/ folder.
  • Rename the usr folder to usrdaq.
  • Rename the res folder to results.
  • Increase the quota of the results folder to 2TB.

Notes

  • The backup of the results and home folders is fundamentally different from the archiving of the xtc/hdf5/usrdaq folders. In the latter case the folders are frozen after the end of the experiment and their contents are stored on tape once. In the case of results and home, files which are deleted from disk are eventually, i.e. after a long enough time, also deleted from tape.
  • Please do not store under the scratch folder data that you cannot recreate because this directory is not backed up and the oldest files on scratch may be deleted at any time to make space for data from new experiments.