
To Do List

Met with Stefano and Ric (Gabriel was absent) to set up a roadmap of future projects.

Priorities:

  • get first-pass TDet firmware from Mudit
  • drp-executable infrastructure?
    • makefile support (Gabriel?).  Ric feels this will be fairly straightforward
      • switch to spack (Gabriel?).  Our make system may depend on conda.  Gabriel thinks the switch to gpu/spack may be straightforward.
    • GPU-direct to Weka (Gabriel?)
    • dgram creation/handling not optimal for the GPU? use the CPU? (Ric)
    • reverse BEB-packet iteration not optimal for the GPU? (low priority)
    • slow transitions (e.g. Configure) go to the CPU; fast transitions (L1Accept, SlowUpdate) go to the GPU (Ric)
    • have to make the TEB/MEB work with "trigger primitives" (Ric)
      • has to be on the CPU to communicate with IB via libfabric
      • do need the MEB buffer-number stuff to work "soon-ish"
      • full data needs to go from the GPU to the CPU for MEB purposes on a fraction of the L1s (see the sketch after this list)
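
A minimal sketch of the last point above, assuming the full event data already sits in a device buffer and that a simple 1-in-N prescale decides which L1Accepts are copied back in full for the MEB; the names (handleL1Accept, MON_PRESCALE, the buffers) are illustrative, not the real drp interfaces:

    // Hedged sketch: ship full event data from GPU to host for a fraction of L1Accepts
    // so the MEB can see complete events.  Buffer names and the prescale are
    // illustrative only; the real drp buffer management is different.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    constexpr unsigned MON_PRESCALE = 100;   // hypothetical: 1 in 100 L1Accepts goes to the MEB

    void handleL1Accept(uint64_t pulseId,
                        const uint8_t* d_event,   // full event data already on the GPU
                        uint8_t*       h_monBuf,  // pinned host buffer destined for the MEB
                        size_t         bytes,
                        cudaStream_t   stream)
    {
        // Trigger primitives would be computed on the GPU here and always sent to the TEB.
        // Only a sampled fraction of events is copied back in full for monitoring.
        if (pulseId % MON_PRESCALE == 0) {
            cudaMemcpyAsync(h_monBuf, d_event, bytes, cudaMemcpyDeviceToHost, stream);
            // The CPU side (libfabric/IB) forwards h_monBuf to the MEB once the copy completes.
        }
    }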

Algorithms? (Stefano)

  • managing calibration constants on the GPU
    • the CPU has access to these (for the portion of the detector that the drp node sees) in psana via drp-python, as a Python dictionary in det.calibconst
    • need to "copy the Python dictionary" to the GPU with cudaMemcpy: pedestal/gain/bad-pixels.  Have to figure out how to do this elegantly; the constants can change at every BeginRun transition.  (See the sketch after this list.)
    • the dictionary is detector-specific.
    • could be a std::map?  Ric says that std::map may not be implemented on the GPU.
  • LC, cuSZ (work on CPU decompression in psana)
  • ROI (trivial)
  • others like radial average with masks (standalone and pyFAI)
  • cube
  • peakfinders (GPU peakfinder8 is in pyFAI)
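
A minimal sketch of the calibration-constants idea above, assuming the Python layer has already flattened det.calibconst into contiguous host arrays (pedestal, gain, bad-pixel mask); flat per-pixel device arrays also sidestep the std::map-on-GPU question. The struct and function names are made up for illustration:

    // Hedged sketch: device-side calibration constants refreshed at every BeginRun.
    // Flat arrays (one value per pixel) replace the Python dictionary on the GPU.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    struct CalibConstGpu {
        float*   d_pedestal = nullptr;
        float*   d_gain     = nullptr;
        uint8_t* d_badPixel = nullptr;
        size_t   nPixels    = 0;
    };

    // Called at BeginRun with host arrays extracted from det.calibconst on the Python side.
    void uploadCalibConst(CalibConstGpu& c,
                          const float*   h_pedestal,
                          const float*   h_gain,
                          const uint8_t* h_badPixel,
                          size_t         nPixels)
    {
        if (nPixels != c.nPixels) {   // (re)allocate only if the detector size changed
            cudaFree(c.d_pedestal);  cudaFree(c.d_gain);  cudaFree(c.d_badPixel);
            cudaMalloc(&c.d_pedestal, nPixels * sizeof(float));
            cudaMalloc(&c.d_gain,     nPixels * sizeof(float));
            cudaMalloc(&c.d_badPixel, nPixels * sizeof(uint8_t));
            c.nPixels = nPixels;
        }
        cudaMemcpy(c.d_pedestal, h_pedestal, nPixels * sizeof(float),   cudaMemcpyHostToDevice);
        cudaMemcpy(c.d_gain,     h_gain,     nPixels * sizeof(float),   cudaMemcpyHostToDevice);
        cudaMemcpy(c.d_badPixel, h_badPixel, nPixels * sizeof(uint8_t), cudaMemcpyHostToDevice);
    }

In this layout a calibration kernel just indexes the three arrays by pixel, which is what the kernel sketch further down assumes.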

LC performance

Update May 29: upon further analysis (correcting for an error in the data-saving format), LC-GPU compression speeds on an A5000 for segments of 352x384 pixels are:

segments | GB/s
1        | 6
2        | 12
4        | 21
6        | 31
8        | 38
16       | 60
8x16     | 102
100x16   | 111

May 14 and 20, 2024

  • Stefano showed 68GB/s LC compression speed with 7.5x compression (raw uint16 vs. compressed float32) for 1 segment.  88GB/s for 16 segments.
    • Stefano will get the raw kernel launches to Gabriel so Gabriel can incorporate in his scripts
    • Stefano will post his results to confluence and we will share with Martin
    • Stefano will also continue to work on cuSZ
    • Stefano has verified that different data compresses at the same 68GB/s for 1 segment
    • cupy in spack supports up to CUDA 12.1 but not 12.3; Robert is working on it
  • Gabriel's results with ped/gain/common-mode calibration using 2-byte "fake" raw data (see the kernel sketch after this list):
    • 1 segment (352x384) - 11-12 us -> 22.755 GB/s
    • 6 segments (6x352x384) - 40 us -> 40.109 GB/s
    • 12 segments (12x352x384) - 72 us -> 44.598 GB/s
    • maybe we don't need common mode in the drp?  do it offline?  Gabriel points out that this is different from what is currently being done.  cpo has a gut feeling it's still OK.
      • have asked Silke for her opinion about leaving out common mode in drp
      • (From Valerio) In my opinion it could work, and pedestal/gain is not going to improve with time (while common mode can!) so I think what you suggest, Chris, makes perfect sense. However, Silke has a more practical perspective, and I am curious what she thinks

      • (from Silke) I agree with Valerio. For many detectors, the gain from the common mode is not as crucial and for a given type of physics/detector setup it may not even be necessary. While there have been times where gain/pedestal did improve 'later', those are the cases we will try to monitor and avoid. "common mode only when in fixed gain": is that a LCLS2 behavior? I thought for LCLS1 we did calculate common-mode even when switching/mixed gain, but ignore the lower-gain pixels.

      • another advantage of no-common-mode in drp: calibrations become more easily "reversible" (can be redone with better constants later)
  • Gabriel will investigate
    • multiple streams
      • cpo thinks multiple-events maps best to multiple-streams
      • multiple-segments should be done in one stream?
      • the writing of data with GPU-direct-storage might be in a different stream?
    • combining LC with the calibration (should this be one kernel? or kernels-launching-kernels (dynamic parallelism)? or (less likely) split up across streams?)
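
A minimal sketch of the kind of pedestal/gain kernel being timed above, launched into a per-event stream along the lines of the multiple-events-to-multiple-streams idea; common mode is deliberately left out, and all names are illustrative rather than existing drp code:

    // Hedged sketch: pedestal subtraction + gain correction for one event
    // (all segments flattened into a single pixel array), one stream per event.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    __global__ void calibKernel(const uint16_t* __restrict__ raw,
                                const float*    __restrict__ pedestal,
                                const float*    __restrict__ gain,
                                float*          __restrict__ calib,
                                size_t nPixels)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nPixels)
            calib[i] = (float(raw[i]) - pedestal[i]) * gain[i];   // no common mode in the drp
    }

    // Launch one event's calibration in its own stream so several events overlap on the GPU.
    void calibrateEvent(const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                        float* d_calib, size_t nPixels, cudaStream_t eventStream)
    {
        const int threads = 256;
        const int blocks  = int((nPixels + threads - 1) / threads);
        calibKernel<<<blocks, threads, 0, eventStream>>>(d_raw, d_ped, d_gain, d_calib, nPixels);
    }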

June 3, 2024

  • Gabriel looked at launching calibrate/compress in one kernel or two.  Two kernels feels better: more modular (#blocks, #threads, use of shared memory).  Continuing to look at streaming.  (See the sketch after this list.)
  • cpo should ask TID if they have "TDet" firmware that can receive timing packets.  Maybe we don't need this and can just reuse the existing TDet firmware?
  • have asked TID if the gpu/cpu switchyard exists.  Ric would like to see if we can run the daq entirely on the CPU with the datagpu driver.
  • need a timing fiber
  • lower priority: ideally need the Mellanox IB fixed
  • could also ideally test a second node by adding a KCU1500
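
A minimal sketch of the two-kernel option, assuming calibrate and compress are launched back-to-back into the same stream so stream ordering replaces a host-side sync; compressKernel here is only a placeholder for the real LC launch, not its actual interface:

    // Hedged sketch: two kernels in one stream, each keeping its own
    // block/thread/shared-memory configuration.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    __global__ void calibKernel(const uint16_t* raw, const float* ped, const float* gain,
                                float* out, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (float(raw[i]) - ped[i]) * gain[i];
    }

    // Placeholder for the real LC compression kernel; it only exists to show ordering.
    __global__ void compressKernel(const float* in, uint8_t* out, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = uint8_t(in[i]);   // not real compression
    }

    void calibrateAndCompress(const uint16_t* d_raw, const float* d_ped, const float* d_gain,
                              float* d_calib, uint8_t* d_compressed, size_t nPixels,
                              cudaStream_t stream)
    {
        const int threads = 256;
        const int blocks  = int((nPixels + threads - 1) / threads);
        calibKernel<<<blocks, threads, 0, stream>>>(d_raw, d_ped, d_gain, d_calib, nPixels);
        // In-stream ordering guarantees compression sees the calibrated pixels.
        compressKernel<<<blocks, threads, 0, stream>>>(d_calib, d_compressed, nPixels);
    }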

Aug 5, 2024

  • Stefano is achieving 140GB/s with his standalone angular integration (includes masking) for a 16Mpixel image.
  • Stefano will work on pyFAI
  • Stefano and cpo talked with Jiannan and Robert:
    • suggested using graphs for launching LC kernels (Stefano will look at this; see the sketch after this list)
    • were worried about "outliers" getting clipped in LC.  But we talked with Martin and he said there is no clipping.
    • priorities for Robert/Jiannan: make sure the cupy bug is fixed so Stefano can benchmark cuSZ, streams with cuSZ, support CPU decompression
  • Ric is working with Mudit
    • having difficulty enabling PCIe bifurcation on dev06
      • we think we need to have two GPUs to take advantage of bifurcation
      • notional idea of KCU/GPU structure: have multiple KCUs (~6 existing KCUs to get 50GB/s into the GPU, only if the PCIe bus can do it).  PCIe bandwidths are here: https://en.wikipedia.org/wiki/PCI_Express
    • the Supermicro update utility seems to report non-bifurcated when we think it is?
    • suggested Mudit write a pgpread-style program for the GPU: Mudit's stuff doesn't work with the GPU yet
    • Ric working on drp-gpu code (targeting the CPU side first)
    • tried running Mudit's firmware on gpu001 (non-bifurcated).  Saw events coming into the CPU with pgpread.  Turning on bifurcation worked on gpu001.  The dev06 bifurcation issue seems to be specific to that machine (Matt claims it used to work though)
    • Jeremy has an example of CUDA graphs if that ends up being useful
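
A minimal sketch of the stream-capture flavor of CUDA graphs suggested for the LC kernel launches: capture the per-event kernel sequence once, instantiate it, then replay the executable graph per event to cut launch overhead. The kernel is a placeholder, and the CUDA 12 cudaGraphInstantiate signature is assumed:

    // Hedged sketch: build a CUDA graph by capturing kernel launches from a
    // (non-default) stream, then replay it per event.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    __global__ void compressKernel(const float* in, uint8_t* out, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = uint8_t(in[i]);   // placeholder, not real compression
    }

    cudaGraphExec_t buildCompressGraph(const float* d_in, uint8_t* d_out, size_t nPixels,
                                       cudaStream_t stream)   // must be a non-default stream
    {
        const int threads = 256;
        const int blocks  = int((nPixels + threads - 1) / threads);

        cudaGraph_t     graph;
        cudaGraphExec_t graphExec;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        compressKernel<<<blocks, threads, 0, stream>>>(d_in, d_out, nPixels);
        // ...further LC kernels would be captured here in the same stream...
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(&graphExec, graph, 0);   // CUDA 12 runtime signature
        cudaGraphDestroy(graph);
        return graphExec;
    }

    // Per event: one cudaGraphLaunch(graphExec, stream) replaces many individual launches.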

Aug 12, 2024

  • Stefano working on cuSZ (Robert has provided the cupy fix), pyFAI, and pi-slice azimuthal integration with a standalone kernel
  • Ric getting to the point where he needs Mudit's stuff to work.
    • Makefile system is kludgy
    • Can get through phase 1 of Configure
    • Any interaction with the firmware messes up DMAs to the CPU.  Register map assumed to be the same as the interlard test?  Only the mode-select address is known for certain.

