...
- 12GB/s from pyFAI? https://www.silx.org/doc/pyFAI/latest/usage/tutorial/Parallelization/MultiGPU.html (see conclusion)
- Stefano getting worrisome numbers for standalone angular integration (as low as 3GB/s but up to 50GB/s if we batch)
- will run the profiler
- pyFAI is not yet running on GPU: will work on that
- Gabriel moved the GPU nodes to the FEE alcove xpm10.
- 1 free connection to xpm10 in the BOS
- all streams are blocked on dma single-event-buffers, so need to create a separate stream to transfer calibration constants on beginrun
- maybe make this "maintenance" stream permanent?
- couldn't launch kernels on anything except the default stream because the cuda context hadn't been bound to the CPU threads; async GPU operations failed silently, with no error reported
- set up a basic skeleton that transfers the calibration constants to a gpuMemPool (transitions and constants, maybe on the beginrun transition?) and launches a dummy calibration on L1Accepts
- fetched calibration constants using both the full serial number and a subset (2 segments)
- calibration constant dictionary hierarchy: pedestal and gain arrays (one per gain range); status/geometry are also needed. could we do something general, e.g. map a python dictionary to a std::map?
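A minimal sketch of the "something general": flatten the nested dict of numpy arrays into one contiguous buffer plus an offset table, so the whole constant set ships to the GPU in a single transfer. Array names and shapes below are hypothetical, and the real drp side is C++ (where the offset table would become a std::map):

```python
import numpy as np

def flatten_calib(calib):
    """Pack a {name: ndarray} calibration dict into one contiguous
    float32 buffer plus an offset table, so all constants can be
    shipped to the GPU in a single host-to-device copy."""
    offsets = {}
    pos = 0
    for name, arr in calib.items():
        offsets[name] = (pos, arr.shape)
        pos += arr.size
    buf = np.empty(pos, dtype=np.float32)
    for name, arr in calib.items():
        start, shape = offsets[name]
        buf[start:start + arr.size] = arr.ravel()
    return buf, offsets

# hypothetical shapes: 2 segments of 352x384 pixels, 3 gain ranges
calib = {
    "pedestal": np.zeros((3, 2, 352, 384)),
    "gain":     np.ones((3, 2, 352, 384)),
    "status":   np.zeros((2, 352, 384)),
}
buf, offsets = flatten_calib(calib)
```

A kernel would then index into the single device buffer via the offsets instead of walking a per-array pointer structure.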
- Fundamental idea in CPU-DRP is a MemPool with a file-descriptor
- changing it affects many things
- put many file-descriptors in there? not easy.
- GPUMemPool and CPUMemPool
- both inherit from base MemPool which inherits pebble/transitions
- derived classes handle file-descriptor and dma buffers
- the cuda context broke the symmetry between the cpu/gpu sides
- separated mempool into a PGP side and a pebble side.
- got cpu-drp to run with the new structure
- tried on gpu-drp, but issues arose. made progress over weekend.
- missing a lot of code for multiple file descriptors
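The restructuring above might be sketched like this (a Python stand-in for the C++ classes; the names and members are illustrative, not the actual drp code):

```python
class MemPool:
    """Base pool: pebble and transition buffers, shared by both the
    CPU and GPU variants."""
    def __init__(self, n_pebbles, pebble_size=4096):
        self.pebbles = [bytearray(pebble_size) for _ in range(n_pebbles)]
        self.transitions = []

class CpuMemPool(MemPool):
    """CPU side: owns a single PGP file descriptor and its DMA buffers."""
    def __init__(self, n_pebbles, fd):
        super().__init__(n_pebbles)
        self.fd = fd
        self.dma_buffers = []

class GpuMemPool(MemPool):
    """GPU side: owns multiple file descriptors (e.g. one per PGP
    device/lane) plus device-side DMA buffers."""
    def __init__(self, n_pebbles, fds):
        super().__init__(n_pebbles)
        self.fds = list(fds)
        self.dma_buffers = []
```

The point of the split is that the base class stays file-descriptor-agnostic, so code touching only pebbles/transitions is shared, while each derived class decides how many descriptors it manages.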
Nov. 25, 2024
- Gabriel battled with a bug where the logging broke when mpi4py initializes
- worked around it by setting an environment variable that stopped mpi4py initialization
- changed the buffer size given to the datagpu driver to be able to hold a full detector segment (since we don't yet have a tdetsim.service file)
- found that we could have dynamic kernel parameters defined by virtual functions
- grid and block size (potentially 3D, but currently a grid size of 144 and a block size of 192, the size of 1 asic)
- running with 4 pixels per thread sequentially (from an earlier study); this factor of 4 brings us up to the size of a full segment (4 asics)
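Taking the notes' numbers at face value (reading grid x block as covering one ASIC's worth of pixels, which is an assumption), the coverage arithmetic per launch is:

```python
def pixels_covered(grid, block, pixels_per_thread=1):
    # total pixels touched by one kernel launch: each thread handles
    # `pixels_per_thread` pixels sequentially
    return grid * block * pixels_per_thread

asic = pixels_covered(144, 192)        # one pixel per thread: one ASIC
segment = pixels_covered(144, 192, 4)  # 4 pixels per thread: 4 ASICs
assert segment == 4 * asic
```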
- worry about the 5us kernel-launch overhead
- cuda-graphs reduce this overhead.
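Why the launch overhead is worrying, in one line of arithmetic (the kernel count and event rate below are made up for illustration, not measurements); a cuda graph replays the whole kernel sequence with a single launch, shrinking this fraction:

```python
def launch_overhead_fraction(kernels_per_event, event_rate_hz, launch_us=5.0):
    """Fraction of the per-event time budget consumed purely by
    kernel-launch latency (a back-of-the-envelope model)."""
    budget_us = 1e6 / event_rate_hz          # time available per event
    return kernels_per_event * launch_us / budget_us

# e.g. 3 kernels per event at an illustrative 10 kHz per-stream rate:
# 15 us of launch latency against a 100 us budget
frac = launch_overhead_fraction(3, 10_000)   # 0.15
```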
- python dictionary is mapped to a gpu-structure
- waiting for Ric's latest
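One way to picture the dict-to-GPU-structure mapping: mirror the Python dict as a fixed-layout ctypes struct whose raw bytes can be copied to the device. The field names and offset values here are hypothetical, not the actual layout:

```python
import ctypes

class CalibDesc(ctypes.Structure):
    """Fixed-layout descriptor a GPU kernel could consume
    (hypothetical fields)."""
    _fields_ = [
        ("n_segments",    ctypes.c_uint32),
        ("n_gain_ranges", ctypes.c_uint32),
        ("pedestal_off",  ctypes.c_uint64),  # offsets into a constants buffer
        ("gain_off",      ctypes.c_uint64),
    ]

def desc_from_dict(d):
    return CalibDesc(d["n_segments"], d["n_gain_ranges"],
                     d["pedestal_off"], d["gain_off"])

desc = desc_from_dict({"n_segments": 2, "n_gain_ranges": 3,
                       "pedestal_off": 0, "gain_off": 811008})
raw = bytes(desc)  # the byte image that would be copied to the device
```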
- on the to-do list: cufile (goal is 5GB/s per GPU, 2 GPUs, 10GB/s/node), cuda-graphs
- seeing messages about getting cached memory for the DMA. ask TID about this.
- Stefano running pyFAI on GPU
- much worse performance (24GB/s for 4kx4k) than standalone (126GB/s). Gabriel mentions that pyFAI may do additional tasks (e.g. solid-angle correction)
- pixel "interpolation/smoothing" is not the reason
- for 1 segment 384x384 get 6GB/s standalone and 1GB/s with pyFAI
- Sparse-matrix-matrix with 10 events was 3GB/s
- Sparse-matrix-vector with 16 events was 50GB/s
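The two formulations are numerically equivalent: azimuthal integration is a sparse matrix applied to the flattened image, and batching events stacks the image vectors into a dense matrix, turning N SpMVs into one SpMM. A CPU sketch with SciPy (the GPU version uses cuSPARSE via CuPy; the one-bin-per-pixel matrix here is a simplification of pyFAI's CSR matrix, which also carries pixel-splitting weights):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
npix, nbins, nevents = 384 * 384, 500, 16

# hypothetical integration matrix: each pixel contributes to exactly
# one radial bin with unit weight
bins = rng.integers(0, nbins, npix)
A = sparse.csr_matrix(
    (np.ones(npix), (bins, np.arange(npix))), shape=(nbins, npix))

events = rng.random((npix, nevents))   # one flattened image per column

# per-event SpMV vs one batched SpMM over all events
one_by_one = np.stack([A @ events[:, i] for i in range(nevents)], axis=1)
batched = A @ events
assert np.allclose(one_by_one, batched)
```

Which form is faster depends on the sparse kernel and batch size, which is presumably what the two measurements above are probing.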
- profiler work for standalone pyFAI is in-progress
- the standalone version uses cusparse called via cupy
- cuSZ is also in-progress
- possible future projects: optimize standalone angular integration, peakfinder8