Mechanism

Latest proposal (12/2/22): maybe we don't need to support the .inprogress idea and can use endrun+timeouts for live mode.  This is made possible by the fact that the smd files are not chunked.  Plan: keep support for both .inprogress files and .xtc2 files as shown in the algorithm below.  Prefer the endrun+timeout mode, but keep the .inprogress mode because we already have it.  May need to rethink live mode running at remote sites (where "holes" can appear because of parallel transfers).

Currently the data mover gives .xtc2 files an additional .inprogress extension while they are being moved.  This introduces some additional complexity but has these advantages:

  • (we now think this is incorrect, since the smd file, which regulates psana's analysis, is not chunked) allows psana2 to know when a file is complete so that, for example, it can go on to the next chunk without timing out (e.g. if different chunk files are transferred in parallel, the existence of chunk N+1 is not an indication that chunk N is complete).
  • allows users to know when run moving is complete (or stuck/crashed).

This pseudo-code attempts to capture the logic of handling the in-progress files.  This reflects the handling of one chunk:

The inode tells whether a file is the same file it was before the rename (aka mv).  If psana2 sees a .xtc2 file with the same name as the .inprogress file but a different inode number (this could be because something has gone wrong ...), it should crash (rare case).

import os
from time import sleep

f = None
if live_mode:
    try:
        f = open('.xtc2.inprogress', 'rb')
        inode_save = os.fstat(f.fileno()).st_ino
    except FileNotFoundError:
        f = None
if not f:
    f = open('.xtc2', 'rb')
    live_mode = False  # only use live mode on in-progress files
read_timeout_countdown = 10
while True:
    bytes = f.read()
    if len(bytes) == 0:
        if live_mode:
            # could this be an issue with 300K cores reading near end-of-file
            # and hitting the metadata server?
            if dot_xtc2_exists():
                # file renamed from .xtc2.inprogress to .xtc2
                if not inode_same(inode_save): raise RuntimeError()  # maybe: e.g. two movers, unlikely to happen
                # file complete. read remaining bytes. (or set live_mode
                # to False and redo the whole loop)
                while True:
                    bytes = f.read()
                    if len(bytes) == 0: break  # eof
                break  # done with this chunk
            sleep(1)
            # in a mode without the above renaming, could also check for
            # a new chunk file existing, assuming the mover moves them sequentially
            read_timeout_countdown -= 1
            if read_timeout_countdown == 0: break  # timeout, only if movers stuck/restarted: be noisy in this case
        else:
            break  # no more data in non-live mode
    else:
        read_timeout_countdown = 10
        if endrun_in(bytes): break  # pseudocode: detect the EndRun transition
f.close()
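The helpers dot_xtc2_exists() and inode_same() above are pseudocode.  A minimal sketch of how they might be implemented with os.stat, assuming the chunk's final .xtc2 path is known (the path handling here is illustrative, and endrun_in() is left out since it depends on parsing the xtc2 datagrams):

import os

def dot_xtc2_exists(xtc2_path):
    # True once the mover has renamed .xtc2.inprogress to .xtc2
    return os.path.exists(xtc2_path)

def inode_same(xtc2_path, inode_save):
    # True if the .xtc2 file is the same underlying file we opened as
    # .xtc2.inprogress (a rename/mv preserves the inode); a different
    # inode means something has gone wrong (e.g. two movers)
    return os.stat(xtc2_path).st_ino == inode_save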

Handling of Disable

  • the big-data is written by a multi-threaded file writer (BufferedFileWriterMT) with a 100ms timeout
  • the smd is written by a single-threaded file writer (BufferedFileWriter) with no timeout (which means that data, like a disable transition, can get "stuck" here).  transitions getting stuck here is OK, since no big-data cores will try to fseek to those events
  • in addition to the 100ms timeout on the big-data flushing, there are two mechanisms for flushing smd and big-data to disk (note that SlowUpdate will typically keep the file-flushing going even for very slow detectors); see the sketch after this list:
    • the drp write buffers fills up
    • if the difference between two subsequent timestamps of events/transitions becomes large enough
  • Conclusion: we currently cannot handle long user-initiated disable times elegantly without increasing the psana live-mode timeout.  In the future we could imagine "watching" for disable and increasing the timeout dynamically, but we still need a timeout (it cannot be infinite) because otherwise psana could hang forever.  Could imagine watching for file locks (assumes that we use a file-system that supports system-wide locks).
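A sketch of the two flush triggers described above (write buffer full, or a large gap between consecutive timestamps); the class name and thresholds are illustrative, not the actual BufferedFileWriter/BufferedFileWriterMT implementation:

class FlushPolicy:
    # illustrative only: flush when the drp write buffer fills up, or when the
    # difference between two subsequent event/transition timestamps gets large
    def __init__(self, buffer_size_bytes, max_ts_gap):
        self.buffer_size_bytes = buffer_size_bytes
        self.max_ts_gap = max_ts_gap   # "large enough" timestamp difference
        self.buffered = 0
        self.last_ts = None

    def should_flush(self, nbytes, ts):
        self.buffered += nbytes
        gap_large = self.last_ts is not None and (ts - self.last_ts) > self.max_ts_gap
        self.last_ts = ts
        if self.buffered >= self.buffer_size_bytes or gap_large:
            self.buffered = 0
            return True
        return False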

Locks vs .inprogress Methods

  • locks are used by the DAQ, .inprogress by the movers
  • ideally would have just one mechanism to make psana's job easier
  • locks are nice because if the daq crashes then they automatically get released (vs. a "dangling" .inprogress)
    • if a mover crashes, it will try to transfer the file again (from the beginning, cleaning up the old .inprogress file) until it succeeds.
    • could we have the DAQ do something similar?  have the DAQ clean up the old .inprogress files on restart?  feels messy.
  • locks don't work everywhere: they are local to the node, so they don't work for external accesses.  on lustre they work globally, but not on weka.  no guarantee (not a posix feature).  see the flock sketch after this list
  • could we have the movers use endrun+timeouts as well?  data movement to remote sites can have long latencies, so the timeouts could get long
  • for remote sites, parallel file moving will create "holes" and needs another mechanism to regulate file access so the holes don't cause problems
    • maybe live mode isn't needed if runs are short enough? or maybe live mode is only done at slac, on a subset of the data, if it's too compute-intensive for slac
  • options for remote live-mode:
    • have remote site read files from slac?
    • transfer files with "holes" to remote site
  • do we need parallel file transfers since the daq splits large streams into multiple streams of 5GB/s?
    • wilko says we will still need it because transfers to, say, JLAB, won't reach 5GB/s
    • parallel transfers not only useful for the network, but also for the writing to the filesystems where you need multiple cores writing for improved performance
  • for chunking:
    • we don't chunk the smd files, which is what drives psana's live-mode analysis, so that solves the live-mode chunking problem.
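For comparison with the .inprogress method, a minimal flock sketch (the filename and the probe function are hypothetical).  The point made in the list above is that an advisory lock is dropped by the kernel when the holding process exits, so a DAQ crash cannot leave a "dangling" lock the way it can leave a dangling .inprogress file; but the lock is only node-local unless the filesystem (e.g. lustre, not weka) propagates it:

import fcntl

# writer (DAQ) side: hold an exclusive lock while the file is incomplete;
# the kernel releases it automatically if the writer crashes
with open('chunk.xtc2', 'ab') as f:   # filename is illustrative
    fcntl.flock(f, fcntl.LOCK_EX)
    # ... write datagrams ...
    fcntl.flock(f, fcntl.LOCK_UN)

# reader (psana) side: non-blocking probe to see whether a writer still holds the lock
def writer_still_active(path):
    with open(path, 'rb') as f:
        try:
            fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
            fcntl.flock(f, fcntl.LOCK_UN)
            return False  # no exclusive lock held: writer finished (or crashed)
        except OSError:
            return True   # exclusive lock held: writer still active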

options:

  • daq and movers use .inprogress
  • daq and movers use locks
  • daq uses locks and movers use .inprogress
  • eliminate .inprogress and use endrun+timeouts
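For the last option, the read loop above would simplify to something like this sketch: no .inprogress handling and no rename to wait for, with completeness signalled only by EndRun or a timeout (endrun_in() is the same pseudocode placeholder as above):

from time import sleep

read_timeout_countdown = 10
f = open('.xtc2', 'rb')
while True:
    bytes = f.read()
    if len(bytes) == 0:
        sleep(1)                     # wait for the DAQ/mover to append more data
        read_timeout_countdown -= 1
        if read_timeout_countdown == 0:
            break                    # timeout: movers stuck/restarted, be noisy
    else:
        read_timeout_countdown = 10
        if endrun_in(bytes):
            break                    # EndRun seen: normal end of run
f.close()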

Weka Filesystem Bug Workaround

Discussion with wilko, mona, cpo on sept. 7 2023

The issue: weka can corrupt data when reading files in live mode.  Watching the file size with fstat doesn't seem to be sufficient to protect against this.  Instead try to create an algorithm using the fstat file modification time ("mtime").

problem:
- if we only use the mtime we sleep even if we're slow and in the middle of a file
- it's not clear how to modify the algorithm to only sleep near the end of the file (how do we define what "near end of file" means, for sleeping?)
    1. could it be a size in bytes?  will the same number work for smd/1Hz and bigdata/1MHz?
    2. could use the dgram timestamps (within ~10s of mtime, has to be larger than the batch
       timespan) to say we're near the end of the file
        - possibility: whenever we iterate to a datagram, if its timestamp is close to mtime: sleep
        - fails if wilko copies data to nersc "later" and we try to use live-mode, but maybe
            we can ignore that case (could in principle also program in extra wait time)
        - when we go to internal timing the timestamps are not current. could we ignore this case?
- could we put a non-zero value "at the end" so we can check for corruption? not clear how to do this

Leaning towards (2) at the moment.
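A rough sketch of option (2), assuming we can extract each dgram's timestamp (in seconds) as we iterate and compare it against the mtime from fstat; the threshold and function names are assumptions, not an agreed design:

import os, time

NEAR_MTIME_SEC = 10   # ~10s, has to be larger than the batch timespan

def maybe_sleep_near_eof(f, dgram_timestamp_sec):
    # if the current dgram's timestamp is close to the file's mtime we are
    # probably near the end of a file that is still being written, so sleep
    # to give weka time to settle before reading further
    mtime = os.fstat(f.fileno()).st_mtime
    if abs(mtime - dgram_timestamp_sec) < NEAR_MTIME_SEC:
        time.sleep(1)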

could data get corrupted by movers going from ffb to offline? (lcls2 only).  maybe we should avoid live-mode data-moving?  Wilko does have a checksum that he watches to catch such corruption.
