Mechanism

Latest proposal (12/2/22): maybe we don't need to support the .inprogress idea and can use endrun+timeouts for live mode.  This is made possible by the fact that the smd files are not chunked.  Plan: keep support for both .inprogress files and .xtc2 files as shown in the algorithm below.  Prefer the endrun+timeout mode, but keep the .inprogress mode because we already have it.  May need to rethink live mode running at remote sites (where "holes" can appear because of parallel transfers).

Currently the data mover gives .xtc2 files an additional .inprogress extension while they are being moved.  This introduces some additional complexity but has these advantages:

  • (we now think this is incorrect, since the smd file, which regulates psana's analysis, is not chunked) allows psana2 to know when a file is complete so that, for example, it can go on to the next chunk without timing out (e.g. if different chunk files are transferred in parallel, the existence of chunk N+1 is not an indication that chunk N is complete).
  • allows users to know when run moving is complete (or stuck/crashed).

This pseudo-code attempts to capture the logic of handling the in-progress files.  This reflects the handling of one chunk:

The inode tells whether a file is the same file it was before the rename (aka mv).  If psana2 sees a .xtc2 file with the same name as the .inprogress file but a different inode number (this could be because something has gone wrong ...), it should crash (rare case).

import os
from time import sleep

f = None
if live_mode:
    try:
        f = open('.xtc2.inprogress', 'rb')
        inode_save = os.fstat(f.fileno()).st_ino
    except FileNotFoundError:
        f = None
if not f:
    f = open('.xtc2', 'rb')
    live_mode = False  # only use live mode on in-progress files
read_timeout_countdown = 10
while True:
    bytes = f.read()
    if len(bytes) == 0:
        if live_mode:
            # could this be an issue with 300K cores reading near end-of-file
            # and hitting the metadata server?
            if dot_xtc2_exists():
                # file renamed from .xtc2.inprogress to .xtc2
                if not inode_same(inode_save): raise RuntimeError()  # maybe: e.g. two movers, unlikely to happen
                # file complete. read remaining bytes. (or set live_mode
                # to False and redo the whole loop)
                while True:
                    bytes = f.read()
                    if len(bytes) == 0: break  # eof
                break  # done with this chunk
            sleep(1)
            # in a mode without the above renaming, could also check for
            # a new chunk file existing, assuming the mover moves them sequentially
            read_timeout_countdown -= 1
            if read_timeout_countdown == 0: break  # timeout, only if movers stuck/restarted: be noisy in this case
        else:
            break  # no more data in non-live mode
    else:
        read_timeout_countdown = 10
        if endrun_in(bytes): break  # pseudocode: detect the EndRun transition
f.close()
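The helpers dot_xtc2_exists() and inode_same() above are pseudocode.  A minimal sketch of how they might be implemented with os.stat, assuming the chunk's final .xtc2 path is known (the path handling here is illustrative, and endrun_in() is left out since it depends on parsing the xtc2 datagrams):

import os

def dot_xtc2_exists(xtc2_path):
    # True once the mover has renamed .xtc2.inprogress to .xtc2
    return os.path.exists(xtc2_path)

def inode_same(xtc2_path, inode_save):
    # True if the .xtc2 file is the same underlying file we opened as
    # .xtc2.inprogress (a rename/mv preserves the inode); a different
    # inode means something has gone wrong (e.g. two movers)
    return os.stat(xtc2_path).st_ino == inode_save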

Handling of Disable

  • the big-data is written by a multi-threaded file writer (BufferedFileWriterMT) with a 100ms timeout
  • the smd is written by a single-threaded file writer (BufferedFileWriter) with no timeout (which means that data, like a disable transition, can get "stuck" here).  transitions getting stuck here is OK, since no big-data cores will try to fseek to those events
  • in addition to the 100ms timeout on the big-data flushing, there are two mechanisms for flushing smd and big-data to disk (note that SlowUpdate will typically keep the file-flushing going even for very slow detectors); see the sketch after this list:
    • the drp write buffers fills up
    • if the difference between two subsequent timestamps of events/transitions becomes large enough
  • Conclusion: we currently cannot handle long user-initiated disable times elegantly without increasing the psana live-mode timeout.  In the future we could imagine "watching" for disable and increasing the timeout dynamically, but we still need a timeout (it cannot be infinite) because otherwise psana could hang forever.  Could imagine watching for file locks (assumes that we use a file-system that supports system-wide locks).
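A sketch of the two flush triggers described above (write buffer full, or a large gap between consecutive timestamps); the class name and thresholds are illustrative, not the actual BufferedFileWriter/BufferedFileWriterMT implementation:

class FlushPolicy:
    # illustrative only: flush when the drp write buffer fills up, or when the
    # difference between two subsequent event/transition timestamps gets large
    def __init__(self, buffer_size_bytes, max_ts_gap):
        self.buffer_size_bytes = buffer_size_bytes
        self.max_ts_gap = max_ts_gap   # "large enough" timestamp difference
        self.buffered = 0
        self.last_ts = None

    def should_flush(self, nbytes, ts):
        self.buffered += nbytes
        gap_large = self.last_ts is not None and (ts - self.last_ts) > self.max_ts_gap
        self.last_ts = ts
        if self.buffered >= self.buffer_size_bytes or gap_large:
            self.buffered = 0
            return True
        return False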

Locks vs .inprogress Methods

  • locks are used by the DAQ, .inprogress by the movers
  • ideally would have just one mechanism to make psana's job easier
  • locks are nice because if the daq crashes then they automatically get released (vs. a "dangling" .inprogress)
    • if a mover crashes, it will try to transfer the file again (from the beginning, cleaning up the old .inprogress file) until it succeeds.
    • could we have the DAQ do something similar?  have the DAQ clean up the old .inprogress files on restart?  feels messy.
  • locks don't work everywhere: they are local to the node, so they don't work for external accesses.  on lustre they work globally, but not on weka.  no guarantee (not a posix feature).  see the flock sketch after this list
  • could we have the movers use endrun+timeouts as well?  data movement to remote sites can have long latencies, so the timeouts could get long
  • for remote sites, parallel file moving will create "holes" and needs another mechanism to regulate file access so the holes don't cause problems
    • maybe live mode isn't needed if runs are short enough? or maybe live mode is only done at slac, on a subset of the data, if it's too compute-intensive for slac
  • options for remote live-mode:
    • have remote site read files from slac?
    • transfer files with "holes" to remote site
  • do we need parallel file transfers since the daq splits large streams into multiple streams of 5GB/s?
    • wilko says we will still need it because transfers to, say, JLAB, won't reach 5GB/s
    • parallel transfers not only useful for the network, but also for the writing to the filesystems where you need multiple cores writing for improved performance
  • for chunking:
    • we don't chunk the smd files, which is what drives psana's live-mode analysis, so that solves the live-mode chunking problem.
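For comparison with the .inprogress method, a minimal flock sketch (the filename and the probe function are hypothetical).  The point made in the list above is that an advisory lock is dropped by the kernel when the holding process exits, so a DAQ crash cannot leave a "dangling" lock the way it can leave a dangling .inprogress file; but the lock is only node-local unless the filesystem (e.g. lustre, not weka) propagates it:

import fcntl

# writer (DAQ) side: hold an exclusive lock while the file is incomplete;
# the kernel releases it automatically if the writer crashes
with open('chunk.xtc2', 'ab') as f:   # filename is illustrative
    fcntl.flock(f, fcntl.LOCK_EX)
    # ... write datagrams ...
    fcntl.flock(f, fcntl.LOCK_UN)

# reader (psana) side: non-blocking probe to see whether a writer still holds the lock
def writer_still_active(path):
    with open(path, 'rb') as f:
        try:
            fcntl.flock(f, fcntl.LOCK_SH | fcntl.LOCK_NB)
            fcntl.flock(f, fcntl.LOCK_UN)
            return False  # no exclusive lock held: writer finished (or crashed)
        except OSError:
            return True   # exclusive lock held: writer still active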

options:

  • daq and movers use .inprogress
  • daq and movers use locks
  • daq uses locks and movers use .inprogress
  • eliminate .inprogress and use endrun+timeouts
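For the last option, the read loop above would simplify to something like this sketch: no .inprogress handling and no rename to wait for, with completeness signalled only by EndRun or a timeout (endrun_in() is the same pseudocode placeholder as above):

from time import sleep

read_timeout_countdown = 10
f = open('.xtc2', 'rb')
while True:
    bytes = f.read()
    if len(bytes) == 0:
        sleep(1)                     # wait for the DAQ/mover to append more data
        read_timeout_countdown -= 1
        if read_timeout_countdown == 0:
            break                    # timeout: movers stuck/restarted, be noisy
    else:
        read_timeout_countdown = 10
        if endrun_in(bytes):
            break                    # EndRun seen: normal end of run
f.close()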

Weka Filesystem Bug Workaround

Discussion with wilko, mona, cpo on sept. 7 2023

The issue: weka can corrupt data when reading files in live mode.  Watching the file size with fstat doesn't seem to be sufficient to protect against this.  Instead try to create an algorithm using the fstat file modification time ("mtime").

problem:
- if we only use the mtime we sleep even if we're slow and in the middle of a file
- it's not clear how to modify the algorithm to only sleep near the end of the file (how do we define what "near end of file" means, for sleeping?)
    1. could it be a size in bytes?  will the same number work for smd/1Hz and bigdata/1MHz?
    2. could use the dgram timestamps (within ~10s of mtime, has to be larger than the batch
       timespan) to say we're near the end of the file
        - possibility: whenever we iterate to a datagram, if its timestamp is close to mtime: sleep
        - fails if wilko copies data to nersc "later" and we try to use live-mode, but maybe
            we can ignore that case (could in principle also program in extra wait time)
        - when we go to internal timing the timestamps are not current. could we ignore this case?
- could we put a non-zero value "at the end" so we can check for corruption? not clear how to do this

Leaning towards (2) at the moment.
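A rough sketch of option (2), assuming we can extract each dgram's timestamp (in seconds) as we iterate and compare it against the mtime from fstat; the threshold and function names are assumptions, not an agreed design:

import os, time

NEAR_MTIME_SEC = 10   # ~10s, has to be larger than the batch timespan

def maybe_sleep_near_eof(f, dgram_timestamp_sec):
    # if the current dgram's timestamp is close to the file's mtime we are
    # probably near the end of a file that is still being written, so sleep
    # to give weka time to settle before reading further
    mtime = os.fstat(f.fileno()).st_mtime
    if abs(mtime - dgram_timestamp_sec) < NEAR_MTIME_SEC:
        time.sleep(1)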

could data get corrupted by movers going from ffb to offline? (lcls2 only).  maybe we should avoid live-mode data-moving?  Wilko does have a checksum that he watches to catch such corruption.
