...

  • 12 GB/s from pyFAI?  https://www.silx.org/doc/pyFAI/latest/usage/tutorial/Parallelization/MultiGPU.html (see conclusion)
  • Stefano is getting worrisome numbers for standalone angular integration (as low as 3 GB/s, but up to 50 GB/s if we batch)
    • will run the profiler
    • pyFAI is not yet running on GPU: will work on that
  • Gabriel moved the GPU nodes to the FEE alcove xpm10. 
    • 1 free connection to xpm10 in the BOS
    • all streams are blocked on DMA single-event buffers, so we need to create a separate stream to transfer calibration constants on BeginRun
      • maybe make this "maintenance" stream permanent?
    • couldn't launch kernels on anything except the default stream because the CUDA context hadn't been bound to the CPU threads; async GPU operations failed silently, with no error (see the stream/context sketch after this list)
    • set up a basic skeleton that transfers the calibration constants to a gpuMemPool (transitions, constants; maybe on the BeginRun transition?) and launches a dummy calibration on L1Accepts
    • fetched calibration constants using both the full detector serial number and a subset (2 segments)
    • calibration-constant dictionary hierarchy: pedestal and gain arrays (one per gain range); still need status/geometry/…  Could we do something general, e.g. map a Python dictionary to a std::map? (See the calibration-store sketch after this list.)
  • Fundamental idea in CPU-DRP is a MemPool with a file-descriptor
    • changing it affects many things
    • put many file-descriptors in there?  not easy.
    • GPUMemPool and CPUMemPool
      • both inherit from a base MemPool, which owns the pebble/transition buffers
      • the derived classes handle the file descriptor(s) and DMA buffers
      • the CUDA context broke the symmetry between the CPU and GPU sides (see the MemPool class sketch after this list)
    • separated the MemPool into a PGP side and a pebble side
      • got the CPU-DRP to run with the new structure
      • tried it on the GPU-DRP, but issues arose; made progress over the weekend
      • still missing a lot of code for multiple file descriptors
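
A minimal stream/context sketch of the points above, assuming a shared driver-API context and a long-lived "maintenance" stream; the function and variable names (workerThread, maintStream, the pedestal buffer) are hypothetical, not the actual drp code:

// Bind the shared CUDA context in each worker thread, then use a dedicated
// "maintenance" stream for calibration-constant uploads so they never wait
// behind the per-event DMA streams.
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

static void checkCu(CUresult r, const char* what)
{
    if (r != CUDA_SUCCESS) { std::fprintf(stderr, "%s failed: %d\n", what, int(r)); std::abort(); }
}

static void checkCuda(cudaError_t e, const char* what)
{
    if (e != cudaSuccess) { std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(e)); std::abort(); }
}

// Each worker thread must make the shared context current before launching
// work on non-default streams; otherwise async operations fail silently.
static void workerThread(CUcontext ctx, cudaStream_t maintStream,
                         const float* hostPeds, float* devPeds, size_t nBytes)
{
    checkCu(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

    // BeginRun-style transfer on the maintenance stream: it is independent of
    // the event streams that block on the DMA single-event buffers.
    checkCuda(cudaMemcpyAsync(devPeds, hostPeds, nBytes,
                              cudaMemcpyHostToDevice, maintStream),
              "cudaMemcpyAsync(pedestals)");
    checkCuda(cudaStreamSynchronize(maintStream), "cudaStreamSynchronize");
}

int main()
{
    checkCu(cuInit(0), "cuInit");
    CUdevice dev;   checkCu(cuDeviceGet(&dev, 0), "cuDeviceGet");
    CUcontext ctx;  checkCu(cuCtxCreate(&ctx, 0, dev), "cuCtxCreate");

    // Long-lived "maintenance" stream, created once and kept for the whole run.
    cudaStream_t maintStream;
    checkCuda(cudaStreamCreateWithFlags(&maintStream, cudaStreamNonBlocking),
              "cudaStreamCreateWithFlags");

    const size_t nPix = 192 * 384;              // illustrative pixel count only
    std::vector<float> hostPeds(nPix, 0.f);
    float* devPeds = nullptr;
    checkCuda(cudaMalloc(&devPeds, nPix * sizeof(float)), "cudaMalloc");

    std::thread t(workerThread, ctx, maintStream, hostPeds.data(), devPeds,
                  nPix * sizeof(float));
    t.join();

    cudaFree(devPeds);
    cudaStreamDestroy(maintStream);
    cuCtxDestroy(ctx);
    return 0;
}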
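
A calibration-store sketch of the "map a Python dictionary to a std::map" idea: a small store that keys device-resident arrays by constant name (pedestals, one gain array per gain range, later status/geometry). The class and member names (CalibConstStore, DeviceArray) are hypothetical:

#include <cuda_runtime.h>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// A named device-resident calibration array.
struct DeviceArray
{
    float* d_ptr  = nullptr;
    size_t nElems = 0;
};

class CalibConstStore
{
public:
    // Upload one named constant (e.g. "pedestals", "gain_range0") to the GPU,
    // typically on the BeginRun transition, using the maintenance stream.
    // The caller must keep 'host' alive until the stream is synchronized.
    void upload(const std::string& name, const std::vector<float>& host, cudaStream_t stream)
    {
        DeviceArray arr;
        arr.nElems = host.size();
        if (cudaMalloc(&arr.d_ptr, arr.nElems * sizeof(float)) != cudaSuccess)
            throw std::runtime_error("cudaMalloc failed for " + name);
        cudaMemcpyAsync(arr.d_ptr, host.data(), arr.nElems * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        m_consts[name] = arr;
    }

    // Look up a constant by name, mirroring the Python dictionary access.
    const DeviceArray& at(const std::string& name) const { return m_consts.at(name); }

    ~CalibConstStore()
    {
        for (auto& kv : m_consts)
            cudaFree(kv.second.d_ptr);
    }

private:
    std::map<std::string, DeviceArray> m_consts;
};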
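
A rough MemPool class sketch of the split described above: a base class owning the pebble/transition buffers, with CPU/GPU derived classes owning the PGP file descriptor(s) and DMA buffers. The names and members are illustrative guesses, not the real drp structure:

#include <cuda.h>
#include <cstddef>
#include <utility>
#include <vector>

// Base class: owns what is common to both sides (pebble and transition buffers).
class MemPool
{
public:
    virtual ~MemPool() = default;

protected:
    std::vector<char> m_pebble;       // per-event pebble buffers
    std::vector<char> m_transitions;  // transition datagram buffers
};

// CPU side: PGP file descriptor(s) plus host DMA single-event buffers.
class CpuMemPool : public MemPool
{
public:
    explicit CpuMemPool(std::vector<int> pgpFds) : m_fds(std::move(pgpFds)) {}

private:
    std::vector<int>   m_fds;         // eventually more than one descriptor
    std::vector<void*> m_dmaBuffers;  // host single-event DMA buffers
};

// GPU side: same responsibilities, plus the CUDA context that broke the symmetry.
class GpuMemPool : public MemPool
{
public:
    GpuMemPool(std::vector<int> pgpFds, CUcontext ctx)
        : m_fds(std::move(pgpFds)), m_context(ctx) {}

private:
    std::vector<int>   m_fds;
    std::vector<void*> m_dmaBuffers;  // device-visible single-event DMA buffers
    CUcontext          m_context;
};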

Nov. 25, 2024

  • Gabriel battled with a bug where the logging broke when mpi4py initializes
    • worked around it by setting an environment variable that stopped mpi4py initialization
  • changed the buffer size given to the datagpu driver so it can hold a full detector segment (since we don't yet have a tdetsim.service file)
  • found that we could have dynamic kernel parameters defined by virtual functions
    • grid and block size (potentially 3D, but currently using a grid size of 144 and a block size of 192 (size of 1 ASIC))
    • running with 4 pixels per thread sequentially (from an earlier study); this factor of 4 brings us up to the size of a full segment (4 ASICs); see the launch-parameter sketch after this list
  • worried about the 5 µs kernel-launch overhead
    • CUDA graphs reduce this overhead (see the graph-capture sketch after this list)
  • the Python dictionary is mapped to a GPU structure
  • waiting for Ric's latest
  • on the to-do list: cuFile (goal is 5 GB/s per GPU; 2 GPUs, so 10 GB/s per node) and CUDA graphs (see the cuFile sketch after this list)
  • seeing messages about getting cached memory for the DMA.  ask TID about this.
  • Stefano is running pyFAI on the GPU
    • much worse performance (24 GB/s for 4k x 4k) than standalone (126 GB/s); Gabriel mentions that pyFAI may do additional tasks (e.g. solid-angle correction)
    • pixel "interpolation/smoothing" is not the reason
    • for 1 segment (384x384) we get 6 GB/s standalone and 1 GB/s with pyFAI
    • sparse-matrix-matrix with 10 events was 3 GB/s
    • sparse-matrix-vector with 16 events was 50 GB/s (see the cuSPARSE SpMV sketch after this list)
  • profiler work for standalone pyFAI is in progress
    • the standalone version uses cuSPARSE called by CuPy
  • cuSZ work is also in progress
  • possible future projects: optimize standalone angular integration, peakfinder8
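
A launch-parameter sketch of the virtual-function idea: a virtual method supplies the grid/block sizes (the 144 x 192 figures quoted above) and the kernel processes 4 pixels per thread, so one launch covers a full 4-ASIC segment. The grid-stride access pattern, the pedestal-subtraction body, and the Detector class name are assumptions for illustration:

#include <cuda_runtime.h>
#include <cstdint>

__global__ void calibKernel(const std::uint16_t* raw, const float* peds,
                            float* out, unsigned nPixels)
{
    const unsigned stride = gridDim.x * blockDim.x;
    unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread handles 4 pixels sequentially, so a one-ASIC-sized launch
    // (144 * 192 threads) covers a full 4-ASIC segment.
    for (int i = 0; i < 4 && idx < nPixels; ++i, idx += stride)
        out[idx] = float(raw[idx]) - peds[idx];   // stand-in pedestal subtraction
}

class Detector
{
public:
    virtual ~Detector() = default;

    // Launch geometry can be overridden per detector type (potentially 3D).
    virtual dim3 gridSize()  const { return dim3(144); }
    virtual dim3 blockSize() const { return dim3(192); }

    void launch(const std::uint16_t* d_raw, const float* d_peds, float* d_out,
                unsigned nPixels, cudaStream_t stream) const
    {
        calibKernel<<<gridSize(), blockSize(), 0, stream>>>(d_raw, d_peds, d_out, nPixels);
    }
};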
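
A graph-capture sketch of the CUDA-graphs idea for the 5 µs launch-overhead worry: capture the per-event kernel sequence into a graph once, then replay it with a single cudaGraphLaunch per event (assumes CUDA 11.4+ for cudaGraphInstantiateWithFlags). The dummy kernel is a placeholder:

#include <cuda_runtime.h>

__global__ void dummyCalib(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // stand-in for the real calibration work
}

// 'stream' must be a non-default stream; the legacy default stream cannot be captured.
void runWithGraph(float* d_data, int n, cudaStream_t stream, int nEvents)
{
    cudaGraph_t     graph;
    cudaGraphExec_t graphExec;

    // Capture the per-event work once.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    dummyCalib<<<(n + 191) / 192, 192, 0, stream>>>(d_data, n);
    // ...additional kernels / async copies would be captured here...
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&graphExec, graph, 0);

    // Replay: one graph launch per event instead of one launch per kernel.
    for (int ev = 0; ev < nEvents; ++ev)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}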
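
A cuFile sketch of the basic GPUDirect Storage sequence behind the cuFile to-do item: open an O_DIRECT file, register it and a device buffer, and write straight from GPU memory to disk. The path and sizes are arbitrary and most error handling is omitted; this shows only the API shape, not the planned drp integration:

#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    const size_t nBytes = 64 << 20;            // 64 MiB illustrative chunk
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, nBytes);

    cuFileDriverOpen();

    // O_DIRECT is required for the GPUDirect Storage path.
    int fd = open("/tmp/gds_test.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    CUfileDescr_t descr;
    std::memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);
    cuFileBufRegister(d_buf, nBytes, 0);

    // Write the device buffer directly to the file at offset 0.
    ssize_t written = cuFileWrite(handle, d_buf, nBytes, 0, 0);
    std::printf("cuFileWrite returned %zd bytes\n", written);

    cuFileBufDeregister(d_buf);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    cudaFree(d_buf);
    return 0;
}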
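
A cuSPARSE SpMV sketch of the sparse-matrix-vector formulation of the angular integration (the 50 GB/s case above): a CSR bin-membership matrix is multiplied against the flattened image with the cuSPARSE generic API (assumes CUDA 11.2+ for CUSPARSE_SPMV_ALG_DEFAULT). The tiny 2-bin/4-pixel matrix is made-up toy data; the real binning matrix would come from pyFAI or the standalone code:

#include <cuda_runtime.h>
#include <cusparse.h>
#include <cstdio>
#include <vector>

int main()
{
    // Toy CSR "binning" matrix: 2 azimuthal bins x 4 pixels, each row averaging
    // the pixels that fall in that bin (weights sum to 1 per row).
    const int rows = 2, cols = 4, nnz = 4;
    std::vector<int>   hOff = {0, 2, 4};
    std::vector<int>   hCol = {0, 1, 2, 3};
    std::vector<float> hVal = {0.5f, 0.5f, 0.5f, 0.5f};
    std::vector<float> hImg = {1.f, 3.f, 5.f, 7.f};   // flattened "image"
    std::vector<float> hOut(rows, 0.f);

    int *dOff, *dCol;  float *dVal, *dImg, *dOut;
    cudaMalloc(&dOff, hOff.size() * sizeof(int));
    cudaMalloc(&dCol, hCol.size() * sizeof(int));
    cudaMalloc(&dVal, hVal.size() * sizeof(float));
    cudaMalloc(&dImg, hImg.size() * sizeof(float));
    cudaMalloc(&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dOff, hOff.data(), hOff.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dCol, hCol.data(), hCol.size() * sizeof(int),   cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, hVal.data(), hVal.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dImg, hImg.data(), hImg.size() * sizeof(float), cudaMemcpyHostToDevice);

    cusparseHandle_t handle;
    cusparseCreate(&handle);

    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateCsr(&matA, rows, cols, nnz, dOff, dCol, dVal,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, cols, dImg, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, rows, dOut, CUDA_R_32F);

    // y = 1.0 * A * x + 0.0 * y : one integrated profile per SpMV call.
    float alpha = 1.0f, beta = 0.0f;
    size_t bufSize = 0;  void* dBuf = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, matA, vecX, &beta, vecY,
                            CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, matA, vecX, &beta, vecY,
                 CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("integrated bins: %f %f\n", hOut[0], hOut[1]);   // expect 2.0 and 6.0

    cusparseDestroySpMat(matA);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroy(handle);
    cudaFree(dOff); cudaFree(dCol); cudaFree(dVal);
    cudaFree(dImg); cudaFree(dOut); cudaFree(dBuf);
    return 0;
}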