Agenda

  • Requirements for controlling how files are archived
  • Archiving small files
    • htar vs tar
    • db vs file
    • mpsstage script
      • how to modify for accessing files from archives
      • how to handle simultaneous requests for different files from the same archive
  • File access control (nfs group vs file)
  • Plan of attack and action items

Requirements for controlling how files are archived

  • Need to be able to control how files are handled based on folder and file type
    • Perhaps via regular expression – although Wilko is worried about performance
  • Need to be able to specify:
    • File should never be archived
    • File should be archived immediately
    • File should be archived once it is a certain age
  • Need to be able to specify minimum size for non-tar archiving
  • Need to be able to specify disposition of disk file after archived
    • Delete immediately
    • Never delete
    • Delete once it is a certain age
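
The requirements above can be read as a small decision function applied to each disk file. The following is only a sketch: the names `Policy` and `decide` are hypothetical, and it assumes that a file is deleted only after it has been archived and that "a certain age" is measured from the file's own age.

```python
from dataclasses import dataclass
from typing import Optional

DAY = 86400  # seconds in one day


@dataclass
class Policy:
    # For both ages: None means "never", 0 means "immediately",
    # any other value is an age in days.
    archive_after_days: Optional[float] = 1.0
    delete_after_days: Optional[float] = None
    # Files smaller than this many bytes are bundled into a tar archive.
    min_direct_size: int = 0


def decide(policy, size_bytes, age_seconds, archived):
    """Return the next action for one disk file.

    One of 'none', 'archive-tar', 'archive-direct' or 'delete'.
    """
    if not archived:
        if policy.archive_after_days is None:
            return "none"  # never archive
        if age_seconds < policy.archive_after_days * DAY:
            return "none"  # not old enough yet
        if size_bytes >= policy.min_direct_size:
            return "archive-direct"
        return "archive-tar"
    # The file is already archived: only the disk copy is left to handle.
    if policy.delete_after_days is None:
        return "none"  # never delete
    if age_seconds >= policy.delete_after_days * DAY:
        return "delete"
    return "none"
```

Keeping the decision separate from the code that walks the file system would let the rules change without touching the archiving scripts.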

Somehow we need a mechanism for modifying these rules fairly easily. For example, while an MC task is running the default may be not to archive its files, but once the files have been verified we may want to archive them immediately, and perhaps delete some from disk.

Archive configuration file

Strawman proposal

/glast maxTarSize=500m archive=1d noDelete
/glast/MC noArchive
/glast/MC/task-55 archive=0d
/glast/MC/task-55/.*-recon.root deleteAfterArchive=0d
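
One possible reading of the strawman format, sketched in Python. Everything about the semantics is an assumption: each line is a regular expression that matches a path prefix, lines are applied top to bottom, and a later matching line overrides options set by an earlier one. The function names and the keys of the resulting dictionary are hypothetical, and the delete option is spelled `deleteAfterArchive` in this sketch. Patterns are compiled once when the file is read, which may ease the performance concern about regular expressions.

```python
import re

UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert e.g. '500m' to bytes (binary multiples are assumed)."""
    suffix = text[-1].lower()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def parse_days(text):
    """Convert e.g. '1d' to 1.0 (only a day suffix is assumed)."""
    return float(text.rstrip("d"))


def parse_rules(config_text):
    """Return a list of (compiled pattern, settings) in file order."""
    rules = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line:
            continue
        pattern, *options = line.split()
        settings = {}
        for opt in options:
            key, _, value = opt.partition("=")
            if key == "noArchive":
                settings["archive_days"] = None
            elif key == "archive":
                settings["archive_days"] = parse_days(value)
            elif key == "noDelete":
                settings["delete_days"] = None
            elif key == "deleteAfterArchive":
                settings["delete_days"] = parse_days(value)
            elif key == "maxTarSize":
                settings["max_tar_size"] = parse_size(value)
            else:
                raise ValueError(f"unknown option {opt!r}")
        # Match the pattern as a whole path or as a directory prefix.
        rules.append((re.compile(pattern + r"(?:/|$)"), settings))
    return rules


def resolve(rules, path):
    """Merge the settings of every matching rule; later rules win."""
    policy = {}
    for regex, settings in rules:
        if regex.match(path):
            policy.update(settings)
    return policy


STRAWMAN = """
/glast maxTarSize=500m archive=1d noDelete
/glast/MC noArchive
/glast/MC/task-55 archive=0d
/glast/MC/task-55/.*-recon.root deleteAfterArchive=0d
"""

rules = parse_rules(STRAWMAN)
```

With these assumptions, `resolve(rules, "/glast/MC/task-55/run1-recon.root")` (a made-up file name) inherits `maxTarSize` from `/glast`, is archived immediately, and is deleted from disk right after archiving, while other files under `/glast/MC` are not archived at all.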


5 Comments

  1. Follow up meeting on February 5th

    • 250TB, data server
      • 1250 – tapes purchase in progress
      • Can check status with Teri
    • htar
      • Makes info file
      • Wilko unable to compile it
        • Maybe it can be compiled in afs package area
        • John has given Wilko access to package area
    • Attempt to pipe command using tar and pftp did not work
      • pftp is part of SDK software (customized by Andy/Bill)
      • Wilko will send example of failure to Lance
      • Lance will contact SDK support
    • Wilko/Tony work with Karen to set up migration database tables
      • No progress on glast style config file
    • tcpip queue size
      • Queue size was increased from default of 127 to 1024 on wain006
        • Count of dropped connections did not change (few 1000 per day)
        • xrootd client was not restarted
          • Should try restarting the server
    • deleting files from big run
      • Taking a long time – zfs – system IO rate becomes very large
        • Lance could look at this while it is happening
        • Memory used for caching may need to be adjusted?
      • Some problems with restarting xrootd (took 10 minutes)
      • Wilko will continue to delete files
  2. I've reserved the SLUO conference room for Feb 12, 10:30-12:00, for another meeting.

  3. February 12

    • Database tables have been defined for keeping track of tarred files
      • Karen has sent request to db-admin to set up glast_xroot table space
        • Lance asks if all HPSS files should be in database?
    • Connecting to Oracle from Python (cx_Oracle) is understood
      • Perhaps use SQLAlchemy – but probably overkill?
        • Need to build cx_Oracle for Solaris (ISOC uses Linux)
    • Need to decide whether to use scratch space on xroot servers for temp tar files?
    • Wilko had no time to look at htar or pftp since last week
      • Andrew can look at building htar/pftp for Solaris (sparc and x86)
    • TCPIP queue size
      • Wilko discovered queue size was hard-wired in xrootd daemon to 15
      • Wilko has recompiled with 64 instead of 15 but daemons are not restarted yet
    • All files Tom wanted deleted have been deleted
      • Didn't really learn anything new about why system slows down so much
      • Stefan Luitz may have some experience with OS level parameters
        • Wilko will check with Stefan
      • There is now a test "thumper" for experimenting with.
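
A rough sketch of how tables for keeping track of tarred files could be used from Python. It is not the schema that was actually defined: the table and column names are hypothetical, and `sqlite3` stands in for Oracle so that the example runs without cx_Oracle (the connection call and bind-variable syntax would differ there). The tar file is written to local disk only; the transfer to HPSS is left out.

```python
import os
import sqlite3
import tarfile
import tempfile

# Hypothetical schema: one row per tar archive, one row per file inside it.
SCHEMA = """
CREATE TABLE tar_archive (
    archive_id INTEGER PRIMARY KEY,
    hpss_path  TEXT NOT NULL UNIQUE
);
CREATE TABLE tar_member (
    file_path  TEXT PRIMARY KEY,
    archive_id INTEGER NOT NULL REFERENCES tar_archive(archive_id),
    size_bytes INTEGER NOT NULL
);
"""


def open_db():
    """Create an in-memory database holding the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def archive_files(db, tar_path, root, rel_paths):
    """Write rel_paths (relative to root) into tar_path and record each member."""
    with tarfile.open(tar_path, "w") as tar:
        for rel in rel_paths:
            tar.add(os.path.join(root, rel), arcname=rel)
    cur = db.execute(
        "INSERT INTO tar_archive (hpss_path) VALUES (?)", (tar_path,)
    )
    archive_id = cur.lastrowid
    db.executemany(
        "INSERT INTO tar_member (file_path, archive_id, size_bytes) "
        "VALUES (?, ?, ?)",
        [
            (rel, archive_id, os.path.getsize(os.path.join(root, rel)))
            for rel in rel_paths
        ],
    )
    db.commit()
    return archive_id


def locate(db, file_path):
    """Return the archive that holds file_path, or None if it is not tarred."""
    row = db.execute(
        "SELECT a.hpss_path FROM tar_member m "
        "JOIN tar_archive a USING (archive_id) WHERE m.file_path = ?",
        (file_path,),
    ).fetchone()
    return row[0] if row else None
```

A staging script could call something like `locate` to find which archive to fetch before extracting a single member from it.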
  4. February 19

    Here are some very rough notes from today's meeting.
    Present: AndyH (on phone), Lance, Wilko, Tom, John

    (1) Wilko reported that xrootd on wain006 crashed. Andy will look at the core file. Wilko also reported some strange messages in dmesg, but said that Lance had previously looked at them and did not think they were a problem.

    (2) More zfs conversions: Tom and Lance discussed time and work involved. We should expect more requests will be coming.

    (3) htar & pftp: Andrew may have built htar for solaris/sparc.
    Wilko has not had a chance to test it yet. We don't really need it.
    Andrew also compiled the htar code for solaris/x86, but we don't have the x86 hpss libraries to link with. Should be possible to build from source; but might be a lot of work.
    Andy reported that he will redo "stuff" so we can use the standard version of pftp instead of the SLAC-modified version.

    (3a) Wilko: plan is to use tar and Oracle DB instead. But for that, need the solaris/x86 Oracle libraries, so can build the python (cx_Oracle) oracle interface. John will work with Ian to get those installed. Hope to have something working by 2/26 (for solaris/sparc?).

    (4) More discussion of deleting xrootd files; need more informative return codes? problem with rewriting files (hence unique names). Need more info on use cases. Invite Warren next week?

    (5) What else for hpss? Need tapes (there is a pool). Was some hpss feature disabled to speed up something? (Wilko?)

    (6) discussion of storage classes, file families, tracking tape usage and reporting.

    (7) Tom will check to see if there are top-level file systems that do not need to be ever migrated to hpss. Interest in selective migration/deletion/retention/purging. Wilko will document current situation (in confluence) and maybe suggest what enhancements are feasible.

    (8) There will be a copy of the science data on Goddard. Maybe housekeeping too? Is there anything which needs two tape copies at SLAC? (pftp can handle this?)

    (9) Lance estimated that there were currently 600 free tapes in the silos. Another 1200 are here and waiting to be added and initialized.

    Meet again next week.

    John

  5. February 26

    • Duplicate files
      • Wilko has made a list of duplicate files in OpsSim directory
      • There are 2100 files created between Feb 13 10pm and Feb 14 afternoon
        • Suspicion is that wain007 filled up (due to misconfiguration?)
          • Andy doesn't think this can explain the problem
            • Believes it should fail to write file if disk is full
          • Misconfiguration was due to all stuff from Lyon going to wain007
            • Otherwise should stop when only 10GB (?) left
      • Andy and Wilko will look into the cause
      • Wilko will work with Tom to clean up file duplicates
    • Running through the OpsSim directory takes one hour on each machine (1.5 million files)
      • xrootd has some notification mechanism which could maybe be used to produce a list of recently written files (or deleted or updated)
    • Not clear xrd client is load balancing between the two redirectors
      • Maybe the code is broken
    • Is it possible to get a return code from xrdcp when a server is down?
      • Normally server keeps record of existing server for 10 minutes
        • This should be increased to 20 minutes, since ranger only attempts restart every 15 minutes
        • Note that server will not allow program to proceed if it knows that file is on server that is down
          • This information is only cached for about 30 minutes
    • Could have multiple redirectors with different policies
      • Documentation exists on configuration options, but it is terse
      • It is possible to (for example) prohibit reads/writes while a server is down
    • Getting a return code from xrdcp does not seem to be possible right now
      • Tom will look at configuration
    • xrootd server crashed on wain006. Andy found a bug. We are not yet using this version.
      • There are more changes in CVS which we do not yet have.
        • There is a new cmsd (more scalable) but olbd is still there.
      • Sounds like not critical to deploy new version
    • Potentially could be SATA timeouts – there is an OS patch (but maybe just turns off messages)
      • Lance is currently testing these patches
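
For the duplicate-file item above, a list of duplicates could be produced from per-server file listings roughly as follows. This is a sketch under the assumption that "duplicate" means the same logical path being present on more than one data server; the function name and the input format are hypothetical.

```python
from collections import defaultdict


def find_duplicates(listings):
    """Find paths that appear on more than one server.

    listings maps a server name to an iterable of logical file paths.
    Returns {path: sorted list of servers holding it} for duplicates only.
    """
    holders = defaultdict(set)
    for server, paths in listings.items():
        for path in paths:
            holders[path].add(server)
    return {
        path: sorted(servers)
        for path, servers in holders.items()
        if len(servers) > 1
    }
```

Each listing still has to come from a full scan of the directory on every machine, so this does nothing about the hour-long walk; a list of recently written files would shrink the input instead.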