
Update May 29: Upon further analysis (after correcting an error in the data-saving format), LC-GPU compression speeds on an A5000 for segments of 352x384 pixels are:

segments    GB/s
1           6
2           12
4           21
6           31
8           38
16          60
8x16        102
100x16      111

May 14 and 20, 2024

  • Stefano showed 68 GB/s LC compression speed with a 7.5x compression ratio (raw uint16 vs. compressed float32) for 1 segment, and 88 GB/s for 16 segments.
    • Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate them into his scripts
    • Stefano will post his results to Confluence and we will share them with Martin
    • Stefano will also continue to work on cuSZ
    • Stefano has verified that different data compresses at the same 68 GB/s for 1 segment
    • cupy in Spack supports up to CUDA 12.1 but not 12.3; Robert is working on it
  • Gabriel's results for pedestal/gain/common-mode correction on 2-byte "fake" raw data (see the calibration-kernel sketch after this list):
    • 1 segment (352x384) - 11-12 us -> 22.755 GB/s
    • 6 segments (6x352x384) - 40 us -> 40.109 GB/s
    • 12 segments (12x352x384) - 72 us -> 44.598 GB/s
    • maybe we don't need common-mode correction in the drp and could instead do it offline?  Gabriel points out that this is different from what is currently being done.  cpo has a gut feeling it's still OK.
      • have asked Silke for her opinion about leaving out common mode in drp
      • (from Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!), so I think what you suggest, Chris, makes perfect sense.  However, Silke has a more practical perspective, and I am curious what she thinks.

      • (from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that a LCLS2 behavior? I thought for LCLS1 we did calculate common-mode even when switching/mixed gain, but ignore the lower-gain pixels.

      • another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
  • Gabriel will investigate
    • multiple streams
      • cpo thinks multiple-events maps best to multiple-streams
      • multiple-segments should be done in one stream?
      • the writing of data with GPU-direct-storage might be in a different stream?
    • combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
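As a concrete illustration of the pedestal/gain step timed above, here is a minimal CUDA sketch (placeholder names, not Gabriel's actual kernel). It assumes float32 per-pixel pedestal and gain constants are already resident on the GPU and, per the discussion above, leaves common-mode correction out.

```cpp
// Minimal sketch (placeholder names, not Gabriel's actual kernel): per-pixel
// pedestal subtraction and gain correction for nSegments of 352x384 uint16
// raw data.  Per-pixel float32 pedestal/gain constants are assumed to be
// resident on the GPU already; common-mode correction is deliberately omitted.
#include <cstdint>
#include <cuda_runtime.h>

constexpr int ROWS       = 352;
constexpr int COLS       = 384;
constexpr int SEG_PIXELS = ROWS * COLS;

__global__ void calibrate(const uint16_t* __restrict__ raw,
                          const float*    __restrict__ pedestal,
                          const float*    __restrict__ gain,
                          float*          __restrict__ calib,
                          int nSegments)
{
    int idx   = blockIdx.x * blockDim.x + threadIdx.x;
    int total = nSegments * SEG_PIXELS;
    if (idx >= total) return;

    // pedestal/gain are per-pixel constants, indexed the same way as the raw data
    calib[idx] = (static_cast<float>(raw[idx]) - pedestal[idx]) * gain[idx];
    // common-mode correction would need a per-row/per-ASIC reduction (a second
    // pass or a second kernel) and is left out here, per the discussion above
}

// example launch: one thread per pixel, all segments in a single launch
void runCalibrate(const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                  float* d_calib, int nSeg, cudaStream_t stream)
{
    int total   = nSeg * SEG_PIXELS;
    int threads = 256;
    int blocks  = (total + threads - 1) / threads;
    calibrate<<<blocks, threads, 0, stream>>>(d_raw, d_ped, d_gain, d_calib, nSeg);
}
```

A kernel like this is almost certainly memory-bound, so the improving GB/s with larger segment counts in the numbers above is likely mostly a matter of giving the GPU enough work per launch.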

June 3, 2024

  • Gabriel looked at launching calibrate/compress as one kernel or two.  Two kernels feel better: more modular (independent choice of #blocks, #threads, and shared-memory use).  Continuing to look at streaming (see the two-kernel stream sketch after this list).
  • cpo should ask TID if they have "TDet" firmware that can receive timing packets.  Maybe we don't need this and can just reuse existing TDet firmware?
  • have asked TID if the gpu/cpu switchyard exists.  Ric would like to see if we can run the daq entirely on the cpu with the datagpu driver.
  • need a timing fiber
  • lower priority: ideally need the Mellanox IB issue fixed
  • could also ideally test a second node by adding a kcu1500
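To make the "two kernels" option concrete, here is a sketch of calibrate and compress launched back-to-back on one CUDA stream. The names are placeholders: calibrate is the kernel from the sketch in the previous section, and lcCompress merely stands in for the real LC compression entry point.

```cpp
// Sketch of the two-kernel option (placeholder names, not the actual drp code):
// launching both kernels on the same stream keeps them in order without host
// synchronization, while each keeps its own #blocks/#threads/shared-memory
// configuration.
#include <cstdint>
#include <cuda_runtime.h>

constexpr int SEG_PIXELS = 352 * 384;

__global__ void calibrate(const uint16_t*, const float*, const float*, float*, int);
__global__ void lcCompress(const float*, uint8_t*, size_t*, int);

void processEvent(const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                  float* d_calib, uint8_t* d_comp, size_t* d_compSize,
                  int nSeg, cudaStream_t stream)
{
    // kernel 1: pedestal/gain correction, one thread per pixel
    int threads = 256;
    int blocks  = (nSeg * SEG_PIXELS + threads - 1) / threads;
    calibrate<<<blocks, threads, 0, stream>>>(d_raw, d_ped, d_gain, d_calib, nSeg);

    // kernel 2: compression, with its own (arbitrary, illustrative) launch
    // configuration and shared-memory size
    int    cThreads = 512;
    size_t shmem    = cThreads * sizeof(float);
    lcCompress<<<nSeg, cThreads, shmem, stream>>>(d_calib, d_comp, d_compSize, nSeg);

    // the stream serializes the two kernels; separate events can go on
    // separate streams so their work overlaps (one event per stream)
}
```

Multiple events then map naturally onto multiple streams, as suggested in the May notes, and GPU-direct-storage writes could sit on yet another stream.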

July 5, 2024

Met with Stefano and Ric (Gabriel was absent) to set up a roadmap of future projects.

Priorities:

  • get gpu working in daq-tst-dev06
  • get first-pass TDet firmware from Mudit

drp-executable infrastructure?
 - makefile support (gabriel?)
 - switch to spack (gabriel?)
 - gpu-direct to weka (gabriel?)
 - dgram creation/handling not optimal for gpu? use cpu? (gabriel)
 - reverse BEB-packet iteration not optimal for gpu?
 - tid fpga->gpu data transfer
 - slow transitions (e.g. configure) go to cpu
 - fast transitions (l1accept, slowupdate) go to gpu (see the dispatch sketch after this list)
 - (later) have to make the teb/meb work with "trigger primitives"
   o has to be on the CPU to communicate with IB via libfabric
   o do need the meb buffer-number stuff to work "soon-ish"
   o full data needs to go from GPU to CPU for meb purposes on a fraction of the L1s
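The slow/fast transition split above could be as simple as the following dispatch sketch; the Transition enum, Datagram, and GpuPipeline types are hypothetical placeholders, not the real drp interfaces.

```cpp
// Hypothetical sketch of the slow/fast transition split: L1Accept and
// SlowUpdate go to the GPU pipeline, everything else (Configure, etc.) is
// handled on the CPU.  All types and names here are placeholders.
#include <cstdio>

struct Datagram { /* header + payload, details omitted */ };

struct GpuPipeline {
    void enqueue(const Datagram&) { std::printf("queued to GPU pipeline\n"); }
};

static void handleOnCpu(const Datagram&) { std::printf("handled on CPU\n"); }

enum class Transition { Configure, Unconfigure, BeginRun, EndRun, Enable, Disable,
                        L1Accept, SlowUpdate };

void handleTransition(Transition tr, const Datagram& dg, GpuPipeline& gpu)
{
    switch (tr) {
    case Transition::L1Accept:
    case Transition::SlowUpdate:
        gpu.enqueue(dg);      // fast path: calibration/compression on the GPU
        break;
    default:
        handleOnCpu(dg);      // slow path: Configure and friends stay on the CPU
        break;
    }
}
```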

algorithms? (stefano)
 - LC, cuSZ (work on cpu decompression in psana)
 - ROI
 - others like radial average with masks (look at pyFAI), cube? not that many? (a masked radial-average sketch follows this list)
 - peakfinders (peakfinder8 might already run on a GPU)
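For the "radial average with masks" item, one possible GPU formulation is sketched below (an illustration only, not a plan of record): each pixel atomically accumulates its calibrated value into a precomputed radial bin, skipping masked pixels.

```cpp
// Sketch of a masked radial average on the GPU (illustration only): each
// thread handles one pixel, skips it if masked, and atomically adds its
// calibrated value and a count into its radial bin.  The per-pixel bin index
// depends only on detector geometry, so it can be precomputed once on the CPU.
#include <cstdint>

__global__ void radialAccumulate(const float*   __restrict__ calib,
                                 const uint8_t* __restrict__ mask,      // 1 = use pixel
                                 const int*     __restrict__ binIndex,  // per-pixel radial bin
                                 float*         __restrict__ binSum,
                                 int*           __restrict__ binCount,
                                 int nPixels)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nPixels || !mask[idx]) return;

    int b = binIndex[idx];
    atomicAdd(&binSum[b], calib[idx]);
    atomicAdd(&binCount[b], 1);
}

// the average per bin is binSum[b] / binCount[b], computed in a tiny
// follow-up kernel or on the CPU after copying the bins back
```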
