...

  • Valerio is going to move psana2 on s3df to spack in the next few weeks
  • Ric has the "graphs" approach to kernel launching working.  Tracking down a tricky segfault after 300 events.
  • Stefano is working on streams.  Having trouble reproducing the previous compilation: LC is broken with spack (unhappy with flags).  Getting advice from Gabriel and Valerio.  Looks like old versions of the compiler are being picked up (gcc4).  Valerio and Gabriel provided guidance on how to fix that.

Jan. 21, 2025

  • Ric has DAQ running robustly, albeit only at 10 kHz at the moment.  Fixed the out-of-order event issue.  Ric worries about cpu→gpu communication, or the 3 kernel launches (with graphs).  Currently 4 streams.  Will try the profiler.
    • Ric found a way to run cuda graphs without sudo, and perhaps that is impacting performance?  Ric found that the "handshake word" can be cleared by using a cuda kernel instead of a cuda API call.  There is also a second "write enable" register that needs to be written to on every event.  Clearing of the handshake is done at the end of one of the kernels (the one that communicates with the cpu), and the driver API can then also be called at that point from the cpu.
  • Jeremy talked about having 1 CPU-thread per GPU-stream (that's how the cuda-graph test program was written).  Changed to have all GPU-streams handled by 1 thread (to solve the event-ordering problem).  Could this affect performance?  Will look at the profiler output for this.  Ric has an idea for how to go back to multiple cpu threads, but it does complicate the code (and more task switches?).  This issue crops up "per KCU" and so will get worse with multiple KCUs (e.g. 4 KCUs each with 4 GPU-streams would give us 16 threads).  Scales poorly with more DMA buffers.
  • Stefano identified a bug in the rate calculation.  Now back to getting 60 GB/s with LC for 1 segment 352x384 (single-precision) with 4 streams, which is great news.
  • cpo points out that we could perhaps batch over events and reuse the "integrating detector" idea (roughly) in psana, if necessary.
  • Ric may be ready to have Gabriel launch his calibration kernel in the gpu-branch of the code
  • Ric worries that merging the gpu-branch could disrupt the main-branch (MemPool in cpu drp's, now broken into two pieces: MemPoolCpu and MemPoolGpu that inherit from MemPool).  Ric thinks it could be OK, but we need to try.  We should run rix or tmo out of Ric's area.
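
For reference, the MemPool split mentioned above can be pictured with a minimal C++ sketch.  Only the three class names (MemPool, MemPoolCpu, MemPoolGpu) and the inheritance relationship come from the notes; every member, constructor argument and the describe() helper below are hypothetical, chosen for illustration, and do not reflect the actual drp code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical base class: holds what both flavours are assumed to share.
class MemPool {
public:
    explicit MemPool(std::size_t nBuffers) : m_nBuffers(nBuffers) {}
    virtual ~MemPool() = default;

    std::size_t nBuffers() const { return m_nBuffers; }

    // Each flavour reports where its buffers live (illustrative only).
    virtual std::string kind() const = 0;

private:
    std::size_t m_nBuffers;
};

// Hypothetical CPU flavour: buffers in host memory.
class MemPoolCpu : public MemPool {
public:
    using MemPool::MemPool;
    std::string kind() const override { return "cpu"; }
};

// Hypothetical GPU flavour: buffers in device memory.
class MemPoolGpu : public MemPool {
public:
    using MemPool::MemPool;
    std::string kind() const override { return "gpu"; }
};

// Code written against the base class works with either flavour.  This is
// the property that would need checking when running rix or tmo from the
// gpu-branch.
inline std::string describe(const MemPool& pool)
{
    return pool.kind() + ":" + std::to_string(pool.nBuffers());
}
```

In this sketch a cpu drp would construct a MemPoolCpu and pass it wherever a MemPool reference is expected, so existing callers would not need to know which flavour they hold.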
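
As a rough cross-check of the LC throughput figure quoted above, the arithmetic can be sketched in C++.  It assumes that 60 GB/s means 60e9 bytes/s measured on the uncompressed input, and that each event is one 352x384 segment of 4-byte floats; neither assumption is stated in the notes, and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte single precision");

// Bytes in one segment: rows x cols x sizeof(float).
constexpr std::size_t segmentBytes(std::size_t rows, std::size_t cols)
{
    return rows * cols * sizeof(float);
}

// Event rate (Hz) implied by a sustained throughput in bytes/s, assuming
// one segment per event.
inline double impliedEventRateHz(double bytesPerSecond, std::size_t bytesPerEvent)
{
    assert(bytesPerEvent > 0);
    return bytesPerSecond / static_cast<double>(bytesPerEvent);
}
```

Under those assumptions, segmentBytes(352, 384) is 540672 bytes, and impliedEventRateHz(60e9, 540672) comes out at roughly 111 kHz for a single segment.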