...
- 12GB/s from pyFAI? https://www.silx.org/doc/pyFAI/latest/usage/tutorial/Parallelization/MultiGPU.html (see conclusion)
- Stefano getting worrisome numbers for standalone angular integration (as low as 3GB/s but up to 50GB/s if we batch)
- will run the profiler
- pyFAI is not yet running on GPU: will work on that
- Gabriel moved the GPU nodes to the FEE alcove xpm10.
- 1 free connection to xpm10 in the BOS
- all streams are blocked on dma single-event-buffers, so need to create a separate stream to transfer calibration constants on beginrun
- maybe make this "maintenance" stream permanent?
- couldn't launch kernels on anything except the default stream because the cuda context hadn't been bound to the CPU threads; async GPU operations failed silently, with no error reported
- set up a basic skeleton that transfers the calibration constants to a gpuMemPool (transitions and constants, maybe on the beginrun transition?) and launches a dummy calibration on L1Accepts
- fetched calibration constants using both the full serial number and a subset (2 segments)
- calibration constant dictionary hierarchy: pedestal and gain arrays (one per gain range); status/geometry are also needed. could we do something general, e.g. map a python dictionary to a std::map?
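A minimal sketch of the "something general": flatten the nested dict of numpy arrays into one contiguous buffer plus an offset table, so the whole constant set ships to the GPU in a single transfer. Array names and shapes below are hypothetical, and the real drp side is C++ (where the offset table would become a std::map):

```python
import numpy as np

def flatten_calib(calib):
    """Pack a {name: ndarray} calibration dict into one contiguous
    float32 buffer plus an offset table, so all constants can be
    shipped to the GPU in a single host-to-device copy."""
    offsets = {}
    pos = 0
    for name, arr in calib.items():
        offsets[name] = (pos, arr.shape)
        pos += arr.size
    buf = np.empty(pos, dtype=np.float32)
    for name, arr in calib.items():
        start, shape = offsets[name]
        buf[start:start + arr.size] = arr.ravel()
    return buf, offsets

# hypothetical shapes: 2 segments of 352x384 pixels, 3 gain ranges
calib = {
    "pedestal": np.zeros((3, 2, 352, 384)),
    "gain":     np.ones((3, 2, 352, 384)),
    "status":   np.zeros((2, 352, 384)),
}
buf, offsets = flatten_calib(calib)
```

A kernel would then index into the single device buffer via the offsets instead of walking a per-array pointer structure.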
- Fundamental idea in CPU-DRP is a MemPool with a file-descriptor
- changing it affects many things
- put many file-descriptors in there? not easy.
- GPUMemPool and CPUMemPool
- both inherit from base MemPool which inherits pebble/transitions
- derived classes handle file-descriptor and dma buffers
- the cuda context broke the symmetry between the cpu/gpu sides
- separated mempool into a PGP side and a pebble side.
- got cpu-drp to run with the new structure
- tried on gpu-drp, but issues arose. made progress over weekend.
- missing a lot of code for multiple file descriptors
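The restructuring above might be sketched like this (a Python stand-in for the C++ classes; the names and members are illustrative, not the actual drp code):

```python
class MemPool:
    """Base pool: pebble and transition buffers, shared by both the
    CPU and GPU variants."""
    def __init__(self, n_pebbles, pebble_size=4096):
        self.pebbles = [bytearray(pebble_size) for _ in range(n_pebbles)]
        self.transitions = []

class CpuMemPool(MemPool):
    """CPU side: owns a single PGP file descriptor and its DMA buffers."""
    def __init__(self, n_pebbles, fd):
        super().__init__(n_pebbles)
        self.fd = fd
        self.dma_buffers = []

class GpuMemPool(MemPool):
    """GPU side: owns multiple file descriptors (e.g. one per PGP
    device/lane) plus device-side DMA buffers."""
    def __init__(self, n_pebbles, fds):
        super().__init__(n_pebbles)
        self.fds = list(fds)
        self.dma_buffers = []
```

The point of the split is that the base class stays file-descriptor-agnostic, so code touching only pebbles/transitions is shared, while each derived class decides how many descriptors it manages.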
Nov. 25, 2024
- Gabriel battled with a bug where the logging broke when mpi4py initializes
- worked around it by setting an environment variable that stopped mpi4py initialization
- changed the buffer size given to the datagpu driver to be able to hold a full detector segment (since we don't yet have a tdetsim.service file)
- found that we could have dynamic kernel parameters defined by virtual functions
- grid and block size (potentially 3D, but currently a grid size of 144 and a block size of 192, the size of 1 asic)
- running with 4 pixels per thread sequentially (from an earlier study); this factor of 4 brings us up to the size of a full segment (4 asics)
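Taking the notes' numbers at face value (reading grid x block as covering one ASIC's worth of pixels, which is an assumption), the coverage arithmetic per launch is:

```python
def pixels_covered(grid, block, pixels_per_thread=1):
    # total pixels touched by one kernel launch: each thread handles
    # `pixels_per_thread` pixels sequentially
    return grid * block * pixels_per_thread

asic = pixels_covered(144, 192)        # one pixel per thread: one ASIC
segment = pixels_covered(144, 192, 4)  # 4 pixels per thread: 4 ASICs
assert segment == 4 * asic
```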
- worry about the 5us kernel-launch overhead
- cuda-graphs reduce this overhead.
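Why the launch overhead is worrying, in one line of arithmetic (the kernel count and event rate below are made up for illustration, not measurements); a cuda graph replays the whole kernel sequence with a single launch, shrinking this fraction:

```python
def launch_overhead_fraction(kernels_per_event, event_rate_hz, launch_us=5.0):
    """Fraction of the per-event time budget consumed purely by
    kernel-launch latency (a back-of-the-envelope model)."""
    budget_us = 1e6 / event_rate_hz          # time available per event
    return kernels_per_event * launch_us / budget_us

# e.g. 3 kernels per event at an illustrative 10 kHz per-stream rate:
# 15 us of launch latency against a 100 us budget
frac = launch_overhead_fraction(3, 10_000)   # 0.15
```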
- python dictionary is mapped to a gpu-structure
- waiting for Ric's latest
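One way to picture the dict-to-GPU-structure mapping: mirror the Python dict as a fixed-layout ctypes struct whose raw bytes can be copied to the device. The field names and offset values here are hypothetical, not the actual layout:

```python
import ctypes

class CalibDesc(ctypes.Structure):
    """Fixed-layout descriptor a GPU kernel could consume
    (hypothetical fields)."""
    _fields_ = [
        ("n_segments",    ctypes.c_uint32),
        ("n_gain_ranges", ctypes.c_uint32),
        ("pedestal_off",  ctypes.c_uint64),  # offsets into a constants buffer
        ("gain_off",      ctypes.c_uint64),
    ]

def desc_from_dict(d):
    return CalibDesc(d["n_segments"], d["n_gain_ranges"],
                     d["pedestal_off"], d["gain_off"])

desc = desc_from_dict({"n_segments": 2, "n_gain_ranges": 3,
                       "pedestal_off": 0, "gain_off": 811008})
raw = bytes(desc)  # the byte image that would be copied to the device
```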
- on the to-do list: cufile (goal is 5GB/s per GPU, 2 GPUs, 10GB/s/node), cuda-graphs
- seeing messages about getting cached memory for the DMA. ask TID about this.
- Stefano running pyFAI on GPU
- much worse performance (24GB/s for 4kx4k) than standalone (126GB/s). Gabriel mentions that pyFAI may do additional tasks (e.g. solid-angle correction)
- pixel "interpolation/smoothing" is not the reason
- for 1 segment 384x384 get 6GB/s standalone and 1GB/s with pyFAI
- Sparse-matrix-matrix with 10 events was 3GB/s
- Sparse-matrix-vector with 16 events was 50GB/s
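The two formulations are numerically equivalent: azimuthal integration is a sparse matrix applied to the flattened image, and batching events stacks the image vectors into a dense matrix, turning N SpMVs into one SpMM. A CPU sketch with SciPy (the GPU version uses cuSPARSE via CuPy; the one-bin-per-pixel matrix here is a simplification of pyFAI's CSR matrix, which also carries pixel-splitting weights):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
npix, nbins, nevents = 384 * 384, 500, 16

# hypothetical integration matrix: each pixel contributes to exactly
# one radial bin with unit weight
bins = rng.integers(0, nbins, npix)
A = sparse.csr_matrix(
    (np.ones(npix), (bins, np.arange(npix))), shape=(nbins, npix))

events = rng.random((npix, nevents))   # one flattened image per column

# per-event SpMV vs one batched SpMM over all events
one_by_one = np.stack([A @ events[:, i] for i in range(nevents)], axis=1)
batched = A @ events
assert np.allclose(one_by_one, batched)
```

Which form is faster depends on the sparse kernel and batch size, which is presumably what the two measurements above are probing.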
- profiler work for standalone pyFAI is in-progress
- the standalone version uses cusparse called via cupy
- cuSZ is also in-progress
- possible future projects: optimize standalone angular integration, peakfinder8