...

  • Valerio is going to move psana2 on s3df to spack in the next few weeks
  • Ric has the "graphs" approach to kernel launching working.  Tracking down a tricky segfault after 300 events.
  • Stefano working on streams.  Having trouble reproducing a previous compilation: LC is broken with spack (unhappy with flags).  Getting advice from Gabriel and Valerio.  It looks like old compiler versions (gcc4) are being picked up.  Valerio and Gabriel provided guidance on how to fix that.
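A hedged sketch of the kind of fix involved (the exact compiler version and package name below are assumptions for illustration, not what Valerio and Gabriel actually prescribed):

```shell
# Register the compilers spack should know about, then check that the
# old system gcc (4.x) is not what gets picked up by default:
spack compiler find
spack compilers
# Pin a build to a modern compiler explicitly with a %compiler spec:
spack install <package> %gcc@11
```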

Jan. 21, 2025

  • Ric has the DAQ running robustly, albeit only at 10kHz at the moment.  Fixed the out-of-order event issue.  Ric worries about cpu→gpu communication, or the 3 kernel launches (with graphs).  Currently 4 streams.  Will try the profiler.
    • Ric found a way to run cuda graphs without sudo; perhaps that is impacting performance?  Ric found that the "handshake word" can be cleared by using a cuda kernel instead of a cuda API call.  There is also a second "write enable" register that needs to be written on every event.  Clearing of the handshake is done at the end of one of the kernels (which is communicating with the cpu), and then the driver API can also be called at that point from the cpu.
  • Jeremy talked about having 1 CPU-thread per GPU-stream (that's how the cuda-graph test program was written).  Changed to have all GPU-streams handled by 1 thread (to solve the event-ordering problem).  Could this affect performance?  Will look at the profiler output for this.  Ric has an idea for how to go back to multiple CPU-threads, but it complicates the code (and adds more task switches?).  This issue crops up "per KCU" and so will get worse with multiple KCUs (e.g. 4 KCUs each with 4 GPU-streams would give us 16 threads).  Scales poorly with more DMA buffers.
  • Stefano identified a bug in the rate calculation.  Now back to getting 60GB/s with LC for 1 segment of 352x384 (single precision) with 4 streams, which is great news.
  • cpo points out that we could perhaps batch over events and reuse the "integrating detector" idea (roughly) in psana, if necessary.
  • Ric may be ready to have Gabriel launch his calibration kernel in the gpu branch of the code.
  • Ric worries that merging the gpu branch could disrupt the main branch (MemPool in the cpu drp's is now broken into two pieces, MemPoolCpu and MemPoolGpu, which inherit from MemPool).  Ric thinks it could be OK, but we need to try.  We should run rix or tmo out of Ric's area.
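The handshake-clearing idea above might be sketched roughly like this (register names and layout are our guesses, not the actual DRP code): clearing the word from a device kernel lets the clear be enqueued in the same stream as the processing kernels, and so captured into a CUDA graph, instead of issuing a separate host-side API call per event.

```cuda
#include <cuda_runtime.h>

// Hypothetical layout: a DMA "handshake word" that releases the buffer
// back to the firmware, plus a per-event "write enable" register.
__global__ void clearHandshake(volatile unsigned* handshake,
                               volatile unsigned* writeEnable)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *handshake   = 0;  // release the DMA buffer
        *writeEnable = 1;  // re-arm for the next event
    }
}

// Enqueued after the processing kernels in the same stream, so it can be
// captured into a CUDA graph along with them:
//   clearHandshake<<<1, 1, 0, stream>>>(handshake, writeEnable);
```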

Jan. 28, 2025

  • Ric tried the GPU dma-buffer release and it is working.  Switched from volatile to atomic operations.  The rate has been boosted from 10kHz to 15kHz with all the latest improvements.  Ric has a "multiple buffer" idea: it allows cpu teb-packet sending to proceed in parallel with gpu processing (and an early buffer release), and it eliminated a loop over pcie transactions.  The profiler is still on the list to understand the 15kHz limit.  Still have a "sudo" issue that needs to be addressed.  Could we use "capabilities" to have processes run with privilege?  Depends on the filesystem where the executables reside.  Could use some advice on the sudo issue, but it is not the highest priority at the moment.
  • Stefano played around with threads per block, but always gets about 60GB/s for 1 segment.  Still waiting for cuSZ to catch up to LC.  Stefano feels blocked waiting for Robert/Jiannan to make progress on streams and Huffman-tree changes (reusing the Huffman tree across events).  Texture memory is read-only global memory and so a little faster: use it for calibration constants?  Peakfinder8 from pyFAI, and perhaps look at angular-integration performance again with the profiler?  Why is LC faster than angular integration?
  • Stefano will post updated LC results (including threads/block study and profiler output) here: Benchmarks
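The volatile→atomic change mentioned above can be sketched as follows (a minimal illustration with made-up names, assuming the flag lives in managed or host-pinned memory; the real DRP code may differ). The point is that system-scope atomics give ordering guarantees across the CPU/GPU boundary that a plain volatile flag does not.

```cuda
#include <cuda/atomic>

// Sketch only: CPU<->GPU handshake with a system-scope atomic flag.
using Flag = cuda::atomic<unsigned, cuda::thread_scope_system>;

__global__ void publishEvent(Flag* done)
{
    // ... process the event ...
    // Release ordering makes the event data written above visible to the
    // CPU before the flag flips; volatile alone does not order this.
    done->store(1, cuda::memory_order_release);
}

void cpuWait(Flag* done)
{
    while (done->load(cuda::memory_order_acquire) == 0)
        ;  // poll (a real implementation would back off or sleep)
}
```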

Feb. 3, 2025

  • Stefano will add code used for LC measurements to the Benchmarks confluence page, perhaps along with a fork of Martin's code to add streaming.
  • Trying to resurrect the pyFAI environment in spack for peakfinder8 (this was done for the angular integration).  Tricky, since it has a lot of dependencies.  Gabriel pointed Stefano to a dataset to use for peak-finder tests: xpptut15 run 670.  Will run this on the H100 using intermediate numpy or h5 files.
  • H100 node still not installed.  Larry is purchasing another one?
  • cpo suggests we should ask TID if they could set up/run a 5-fpga benchmark in lab3 on dev06, and we would reuse it on the H100 when possible; like the cpu version of the "PRBS".
  • Ric was looking at testdma performance with Jeremy's code.  Ric got a "queue" of multiple dma buffers working: 4 one-event buffers (same as last week; a constant in the firmware?), with the GPU copying timing-header/dma-error-results/teb-input-data into new "intermediate buffers" (call them "teb buffers"?; allocated by "cudaMallocManaged", used by the CPU; currently have 8, but can have as many as Ric wants).  Currently the pebble buffers have the same count as the intermediate buffers, but this isn't a requirement.  Calibration (actually calibration+peakfinding?) would have similar buffers, but earlier in the pipeline ("calibration buffers"?), for storing calibrated images.  But still getting 15kHz.  testdma with graph mode enabled didn't work (early exit after 1 event).  Tops out at 15kHz.  Need more one-event buffers?  Aim for a 15kHz demo at the meeting with TID.  Matt used dev06 for timing, but hopefully Jeremy/Mudit can use it now.
  • List of buffers:
    • one-event buffers (gpu only)
    • calibration buffers (gpu only)
    • intermediate buffers ("teb buffers"?, shared with CPU)
    • pebble (on the CPU)
    • gpu-only buffers for the output of data reduction (not written yet, but perhaps part of the calibration buffers).  Perhaps the "calibration buffers" are becoming a gpu-only pebble?
  • List of processes:
    • dma
    • calibration
    • peak-finding to compute the teb input
    • data-reduction (e.g. peak-finding, SZ compression, ROI...)
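A plain C++ sketch of the buffer pools listed above (the struct and field names are ours, not the DRP code; the counts follow the notes, which have 4 one-event buffers and 8 intermediate buffers, with the pebble count currently, but not necessarily, matching):

```cpp
#include <cstddef>

// Sketch only: the per-KCU buffer pools, sized independently since the
// notes say the counts need not match each other.
struct BufferPools {
    std::size_t oneEvent;     // one-event DMA buffers (GPU only)
    std::size_t calibration;  // calibrated-image buffers (GPU only)
    std::size_t teb;          // intermediate "teb" buffers (CPU-visible, cudaMallocManaged)
    std::size_t pebble;       // pebble buffers on the CPU
};

// Total buffers in flight for one KCU.
inline std::size_t totalBuffers(const BufferPools& p)
{
    return p.oneEvent + p.calibration + p.teb + p.pebble;
}
```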