
Storage Classes

| Space        | Size      | Backup       | Lifetime   | Comment                          |
|--------------|-----------|--------------|------------|----------------------------------|
| xtc          | Unlimited | Tape archive | 2 months   | Raw data                         |
| usr          | Unlimited | Tape archive | 2 months   | Raw data from users' DAQ systems |
| hdf5         | Unlimited | Tape archive | 2 months   | Data translated to HDF5          |
| scratch      | Unlimited | None         | 1 month    | Temporary data                   |
| res          | 10 TB     | Tape         | 6 months   | Analysis results                 |
| User home    | 20 GB     | Disk + tape  | Indefinite | User code                        |
| Tape archive | Unlimited | Two copies   | 10 years   | Raw data                         |

Rationale for Proposed Policy

In the past couple of years we have observed a few non-ideal aspects of the LCLS data retention policy:

  1. All experiments are treated equally, even though a few institutions copy their data home and do not need to keep it on disk at SLAC; ideally we would reserve that disk space for other experiments that do rely on it.
  2. Some folders and the different storage classes (short, medium and long term) were not always understood or used properly (e.g., ftc/ was often treated as scratch).
  3. It has been extremely hard to honor the promised lifetime of data on disk given the limited space; this has proved particularly tricky for scratch/, where users can easily write tens of terabytes in a few hours.

The policy proposed below relies on two enabling technologies:

  1. We can continuously monitor the usage of the file system without affecting its performance.
  2. We can actually enforce the quota on a specific experiment's directory at the file-system level (as opposed to merely alerting users who use too much space).

The addition of these capabilities allows us to enforce a policy that adapts to the actual usage of the system.

Proposed Policy

Raw data storage (xtc/ and hdf5/ folders)

The SHORT-TERM and MEDIUM-TERM storage classes are eliminated. We are switching from a guaranteed, unconditional stay of files on disk for a fixed duration to an access-time-based algorithm for determining when files expire. This will be enforced through continuous monitoring of the file systems to record which files were accessed and when. We will also track whether files were actually analyzed or merely "touched" to dodge the policy.

The actual expiration threshold will be calculated dynamically from the amount of free space available on the corresponding file system at the time of the cleanup procedure. Files with more recent access times will stay on disk; the cleanup process will remove older files until it reaches the goal of keeping 20% of the file system free. The cleanup will affect whole runs (not individual files) that have been determined to be expired. An automatic notification will be sent to the PI (or all members?) of the affected experiment after each cleanup.
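
To make the cleanup procedure concrete, here is a minimal sketch in Python. It is illustrative only: the run grouping, paths, and the notification stub are hypothetical, and the real tool would draw access times from the file-system monitoring data rather than from stat() calls.

```python
import os
import shutil

FREE_SPACE_GOAL = 0.20  # policy target: keep 20% of the file system free


def free_fraction(fs_root):
    """Fraction of the file system that is currently free."""
    usage = shutil.disk_usage(fs_root)
    return usage.free / usage.total


def newest_atime(files):
    """A run expires as a whole, so it is as 'recent' as its most recently accessed file."""
    return max(os.stat(f).st_atime for f in files)


def notify_experiment(expired_runs):
    """Stub for the automatic notification sent to the PI (or all members)."""
    print("Removed expired runs:", expired_runs)


def cleanup(fs_root, runs):
    """Remove whole runs, least recently accessed first, until the goal is met.

    `runs` maps a run identifier to the list of its files (hypothetical
    structure; the real tool would group xtc/hdf5 files by run number).
    """
    expired = []
    for run, files in sorted(runs.items(), key=lambda kv: newest_atime(kv[1])):
        if free_fraction(fs_root) >= FREE_SPACE_GOAL:
            break
        for f in files:
            os.remove(f)
        expired.append(run)
    if expired:
        notify_experiment(expired)
```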

User data folders (res/, scratch/ and ftc/)

The amount of storage available for scratch/ will be increased by redistributing the storage freed by retiring the ftc/ file systems. As with the raw data folders (xtc/ and hdf5/), file retention on scratch/ will switch from the fixed-term expiration model to the last-access-time model: the cleanup algorithm will remove the files with the oldest access times until the goal of 20% free space is met.

To compensate for the loss of guaranteed longer-term storage caused by the elimination of ftc/ and the change in the expiration terms of scratch/, we will increase the quota on the res/ file systems. Note that res/ is the only user data folder that is backed up to tape. We will also enforce the per-experiment storage quota for res/ at the file-system level (something we could not do before): the new mechanism will prevent members of an experiment from writing new files to their res/ folder once the experiment's quota limit is reached.
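
The policy text does not name the enforcement mechanism. As one possibility, on a Lustre file system the hard limit could be implemented with project quotas; the sketch below shells out to the standard lfs commands, but the directory layout, project id, and mount point are assumptions.

```python
import subprocess


def enforce_res_quota(experiment_dir, project_id, mount_point="/lustre/res",
                      hard_limit="10T"):
    """Illustrative per-experiment hard quota via Lustre project quotas.

    All paths and ids here are hypothetical examples.
    """
    # Tag the experiment's res/ tree with a project id; -s makes new files inherit it.
    subprocess.run(
        ["lfs", "project", "-p", str(project_id), "-r", "-s", experiment_dir],
        check=True)
    # Set a hard block limit for the project; writes fail once it is exceeded.
    subprocess.run(
        ["lfs", "setquota", "-p", str(project_id), "-B", hard_limit, mount_point],
        check=True)
```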

The existing 24-month expiration policy will be respected after the change. At the same time, however, we are changing the expiration enforcement technology by switching to Lustre HSM/HPSS. This technology allows expired files to remain in the file system _namespace_ (visible to users via the 'ls' command) while their actual content is removed from disk. Files whose content is not on disk will be automatically restored from tape on the first attempt to open them. The process is meant to be transparent to the user application, except for an extra delay before the file content becomes available: a process that hits such a file will simply hang at the open stage until the file is brought back from tape to disk.
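
From the point of view of analysis code nothing changes; a plain open() on a released file simply blocks until the content has been staged back from tape. For example (hypothetical path):

```python
# Opening a file whose content was released to tape requires no special API:
# the open()/read() simply blocks while Lustre HSM restores the file from HPSS.
with open("/lustre/res/cxi12345/summary.h5", "rb") as f:  # hypothetical path
    header = f.read(1024)  # proceeds once the restore has completed
```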

Note that the last measure may also allow an experiment to exceed the nominal amount of data associated with res/: the per-experiment quota will apply only to files whose content is on disk. Older files that have been copied to tape and whose content is no longer on disk will not count toward the experiment's hard quota. We may also allow users to control the file migration process by giving them a command-line and/or Web-based tool for inspecting the on-disk status of files and requesting transitions (releasing files from disk or restoring them from tape). This should prevent inefficient use of the batch system by analysis jobs stalling on files that are not actually on disk.
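
Such a command-line tool could be a thin wrapper around the standard Lustre HSM commands (lfs hsm_state, hsm_release, hsm_restore); the wrapper below is a hypothetical sketch.

```python
import subprocess


def file_status(path):
    """Show whether a file's content is on disk, on tape, or both."""
    out = subprocess.run(["lfs", "hsm_state", path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


def release_from_disk(path):
    """Drop the on-disk copy of an already archived file, keeping its namespace entry."""
    subprocess.run(["lfs", "hsm_release", path], check=True)


def prefetch_from_tape(path):
    """Request a restore ahead of time so batch jobs do not stall at open()."""
    subprocess.run(["lfs", "hsm_restore", path], check=True)
```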

The ftc/ folder will be eliminated. The policy for the usr/ folder will not change.