Timing tests
#2024-05-14
Calibration (Appropriate Types)
- Switch from 32-bit to 16-bit integers
- Output image is in float32
Single Detector Segment - Block Radix Calibration Results (by row):
  Number of rows: 352, entries per row: 384, sorting keys per thread: 4, threads: 96
  Execution time: 0.01188
- Single detector segment: ~11.9 us → roughly 22.755 GB/s
Six Detector Segments - Block Radix Calibration Results (by row):
  Number of rows: 2112, entries per row: 384, sorting keys per thread: 4, threads: 96
  Execution time: 0.0404352
- Six detector segments: ~40.4 us → roughly 40.109 GB/s
Twelve Detector Segments - Block Radix Calibration Results (by row):
  Number of rows: 4224, entries per row: 384, sorting keys per thread: 4, threads: 96
  Execution time: 0.0727408
- Twelve detector segments: ~72.7 us → roughly 44.598 GB/s
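A quick consistency check on the bandwidth figures (assumption: they count only the 16-bit raw input, i.e. 2 bytes per pixel, and the reported execution times are in ms):
- Single segment: 352 × 384 × 2 B = 270,336 B; 270,336 B / 11.88 us ≈ 22.8 GB/s
- Six segments: 2112 × 384 × 2 B ≈ 1.62 MB; 1.62 MB / 40.44 us ≈ 40.1 GB/s
- Twelve segments: 4224 × 384 × 2 B ≈ 3.24 MB; 3.24 MB / 72.74 us ≈ 44.6 GB/s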
#2024-05-08 and 09
Just Common-Mode
- It may be viable to use fewer threads, with each thread doing more work, for common-mode, which is the costly operation
- Using BLOCK_LOAD_WARP_TRANSPOSE and equivalents (a sketch of how the load algorithm is selected follows the three result sets below):
Block Radix Common-Mode Results (352 rows, 384 entries per row):
  Keys per thread: 1, threads: 384  ->  Execution time: 0.0375776
  Keys per thread: 2, threads: 192  ->  Execution time: 0.0227328
  Keys per thread: 4, threads: 96   ->  Execution time: 0.0158656
- Results with BLOCK_LOAD_TRANSPOSE:
Block Radix Common-Mode Results (352 rows, 384 entries per row):
  Keys per thread: 1, threads: 384  ->  Execution time: 0.0372272
  Keys per thread: 2, threads: 192  ->  Execution time: 0.0224752
  Keys per thread: 4, threads: 96   ->  Execution time: 0.0156112
  Keys per thread: 8, threads: 48   ->  Execution time: 0.0159136
  Keys per thread: 16, threads: 24  ->  Execution time: 0.014384
- With BLOCK_LOAD_DIRECT:
Block Radix Common-Mode Results (352 rows, 384 entries per row):
  Keys per thread: 1, threads: 384  ->  Execution time: 0.0373696
  Keys per thread: 2, threads: 192  ->  Execution time: 0.0221168
  Keys per thread: 4, threads: 96   ->  Execution time: 0.0154032
  Keys per thread: 8, threads: 48   ->  Execution time: 0.0160192
  Keys per thread: 16, threads: 24  ->  Execution time: 0.0146352
Including Full Calibration
- As before, the pedestal/gain correction (the actual add/multiply) costs less than common-mode, but if each thread processes too many entries (with this memory access pattern) the time goes back up
- Each thread is responsible for KEYS_PER_THREAD pedestal and gain corrections and loops through these.
- Each thread processes the following indices from a row of the image: [threadIdx.x*KEYS_PER_THREAD, threadIdx.x*KEYS_PER_THREAD + KEYS_PER_THREAD)
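A minimal sketch of this per-thread loop (the function and parameter names are placeholders, not the actual code; only the indexing pattern is taken from the notes above):

// Sketch: each thread applies the pedestal/gain correction to KEYS_PER_THREAD
// consecutive entries of its block's row.
template <int KEYS_PER_THREAD>
__device__ void calibrateRowSegment(const int* __restrict__ raw,    // one row of raw data
                                    const float* __restrict__ ped,  // pedestals for that row
                                    const float* __restrict__ gain, // gains for that row
                                    float* __restrict__ out)        // calibrated output row
{
    const int base = threadIdx.x * KEYS_PER_THREAD;
    for (int k = 0; k < KEYS_PER_THREAD; ++k) {
        const int col = base + k;   // indices [base, base + KEYS_PER_THREAD)
        out[col] = (raw[col] - ped[col]) * gain[col];
    }
}

With this layout each thread touches KEYS_PER_THREAD consecutive addresses, so neighbouring threads no longer read adjacent elements; this is presumably the memory-access-pattern effect that makes the larger KEYS_PER_THREAD configurations slower again.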
Block Radix Calibration Results (352 rows, 384 entries per row):
  Keys per thread: 1, threads: 384  ->  Execution time: 0.0386048
  Keys per thread: 2, threads: 192  ->  Execution time: 0.0236032
  Keys per thread: 4, threads: 96   ->  Execution time: 0.0177088
  Keys per thread: 8, threads: 48   ->  Execution time: 0.0183696
  Keys per thread: 16, threads: 24  ->  Execution time: 0.020056
#2024-05-07
Sort/Median Timing on more "realistic" sizes
- Use slightly more relevant sizes
- Using 384 entries → length of one row for a single epix10ka panel
Block Radix Median Results:  Median: 49, Execution time: 0.011864
Thrust Sort Median Results:  Median: 49, Execution time: 0.275501
- Using 352 entries → length of one column for a single epix10ka panel
Block Radix Median Results:  Median: 52, Execution time: 0.0113584
Thrust Sort Median Results:  Median: 52, Execution time: 0.248819
Common-mode proxy using just block radix
- Split rows across blocks and attempt a common-mode proxy → each block has 384 threads; 352 blocks run
Block Radix Common-Mode Results:  Median: 49 (final row), Execution time: 0.0384512
- Called as: blockRadixCommonMode<numentries,1,int><<<nrows,numentries,numentries*sizeof(int)>>>(devArr, devMedian, devResult);
- Data loaded by row: BlockLoadT(temp.load).Load(data+row, thread_keys, numItems);
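The full kernel is not reproduced in these notes. A hypothetical reconstruction of the row-wise common-mode proxy, following the same cub pattern as the blockRadixMedian kernel in the 2024-05-06 entry below; the row-offset arithmetic and the per-row median output are assumptions:

#include <cub/cub.cuh>

template <int THREADS, int ITEMS_X_THREAD, class T>
__global__ void blockRadixCommonModeSketch(const T* __restrict__ data,
                                           T* __restrict__ median_out, // assumed: one slot per row
                                           T* __restrict__ result)
{
    typedef cub::BlockLoad<T, THREADS, ITEMS_X_THREAD, cub::BLOCK_LOAD_TRANSPOSE>   BlockLoadT;
    typedef cub::BlockRadixSort<T, THREADS, ITEMS_X_THREAD>                         BlockRadixSortT;
    typedef cub::BlockStore<T, THREADS, ITEMS_X_THREAD, cub::BLOCK_STORE_TRANSPOSE> BlockStoreT;

    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockRadixSortT::TempStorage sort;
        typename BlockStoreT::TempStorage store;
    } temp;

    extern __shared__ T sorted[];   // dynamic shared memory: one sorted row
    __shared__ T median;

    const int rowLen = THREADS * ITEMS_X_THREAD;
    const T* rowPtr = data + (size_t)blockIdx.x * rowLen;   // assumes row-major input, one block per row

    T thread_keys[ITEMS_X_THREAD];
    BlockLoadT(temp.load).Load(rowPtr, thread_keys);
    __syncthreads();
    BlockRadixSortT(temp.sort).Sort(thread_keys);
    __syncthreads();
    BlockStoreT(temp.store).Store(sorted, thread_keys);
    __syncthreads();

    if (!threadIdx.x) {
        // Same even/odd median handling as blockRadixMedian (2024-05-06)
        const int idx = rowLen/2 - 1;
        median = (rowLen % 2 == 0) ? (sorted[idx] + sorted[idx+1]) / 2 : sorted[idx];
        median_out[blockIdx.x] = median;
    }
    __syncthreads();

    // Subtract the row median from every entry owned by this thread
    for (int k = 0; k < ITEMS_X_THREAD; ++k) {
        const int col = threadIdx.x * ITEMS_X_THREAD + k;
        result[(size_t)blockIdx.x * rowLen + col] = rowPtr[col] - median;
    }
}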
Calibration proxy
- Adding a pedestal subtraction and calibration step has only a small effect on the overall time
- Assume all pixels are good (i.e. no mask)
Block Radix Calibration Results:  Execution time: 0.0386928
#2024-05-06
Median calculation methods
- Initial attempt at comparing different sort methods with the aim of calculating a median
- Using cub, which has block-wide primitives (among others) and provides some routines as well - use BlockRadixSort
- Also using thrust::sort - higher level than cub and somewhat easier to use.
- Try comparing using a single block for now (i.e. gridDim=1). Both methods should be straightforward to adapt to parallel block execution.
- thrust seems to be intended to be called mostly from the host side. That said, inspecting the source shows that at least some of the functionality is compiled for both host and device, and the tests below compiled and ran.
- Packages like cupy rely on thrust - hence the comparison. It may not be the most relevant in the long run if we want to remove the CPU from the equation entirely.
Method 1 - cub::BlockRadixSort
template <int THREADS, int ITEMS_X_THREAD, class T>
__global__ void blockRadixMedian(const T* __restrict__ data,
                                 T* __restrict__ median_out,
                                 T* __restrict__ sorted_out)
{
    typedef cub::BlockLoad<T, THREADS, ITEMS_X_THREAD, cub::BLOCK_LOAD_TRANSPOSE>   BlockLoadT;
    typedef cub::BlockStore<T, THREADS, ITEMS_X_THREAD, cub::BLOCK_STORE_TRANSPOSE> BlockStoreT;
    typedef cub::BlockRadixSort<T, THREADS, ITEMS_X_THREAD>                         BlockRadixSortT;

    __shared__ union {
        typename BlockLoadT::TempStorage load;
        typename BlockStoreT::TempStorage store;
        typename BlockRadixSortT::TempStorage sort;
    } temp;

    size_t numItems = THREADS*ITEMS_X_THREAD;

    extern __shared__ T sorted[];
    __shared__ T median;
    if (!threadIdx.x) {
        median = 0;
    }

    T thread_keys[ITEMS_X_THREAD];

    // Copy data over
    BlockLoadT(temp.load).Load(data, thread_keys);
    __syncthreads();

    // Sort data
    BlockRadixSortT(temp.sort).Sort(thread_keys);
    __syncthreads();

    // Store in output
    BlockStoreT(temp.store).Store(sorted, thread_keys);
    __syncthreads();

    sorted_out[threadIdx.x] = sorted[threadIdx.x];

    if (!threadIdx.x) {
        size_t idx = numItems/2 - 1;
        if (numItems % 2 == 0) {
            median = (sorted[idx] + sorted[idx+1])/2;
        } else {
            median = sorted[idx];
        }
        *median_out = median;
    }
}
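For reference, a hypothetical host-side launch for the 128-value, single-block test in the Results section below (buffer names are placeholders):

// 1 block of 128 threads, 1 item per thread; dynamic shared memory holds the sorted copy
blockRadixMedian<128, 1, int><<<1, 128, 128 * sizeof(int)>>>(devData, devMedian, devSorted);
cudaDeviceSynchronize();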
Method 2 - thrust::sort
template <int THREADS, int ITEMS_X_THREAD, class T>
__global__ void thrustSortMedian(const T* __restrict__ data,
                                 T* __restrict__ median_out,
                                 T* __restrict__ sorted_out)
{
    extern __shared__ T sorted[];
    __shared__ T median;

    size_t numItems = THREADS*ITEMS_X_THREAD;

    // Copy data to mutable shared memory
    sorted[threadIdx.x] = data[threadIdx.x];
    __syncthreads();

    // Sort data
    if (!threadIdx.x) { // CDP?
        thrust::sort(thrust::device, sorted, sorted + numItems);
    }
    __syncthreads();

    // Store in output
    sorted_out[threadIdx.x] = sorted[threadIdx.x];

    if (!threadIdx.x) {
        size_t idx = numItems/2 - 1;
        if (numItems % 2 == 0) {
            median = (sorted[idx] + sorted[idx+1])/2;
        } else {
            median = sorted[idx];
        }
        *median_out = median;
    }
}
Results
- Used the same timing method as the timing loop below (2024-04-30 entry)
- Tested the time to sort 128 values in a single block
- thrust is slower - CDP overhead? There are probably obvious improvements here that I'm missing.
Block Radix Median Results:  Median: 47, Execution time: 0.0085376
Thrust Sort Median Results:  Median: 47, Execution time: 0.12403
#2024-04-30
Kernel Launches
- Attempt to measure the overhead of launching CUDA kernels
- Internet perusing suggests ~5 us (various developer forums)
- Launching empty kernels as a proxy
- Want to see the additional overhead of using CUDA dynamic parallelism (CDP - kernels launching kernels)
__global__ void testOverheadKernel() {}

__global__ void testOverheadKernelCDP() {
    testOverheadKernel<<<gridDim,blockDim>>>();
}

/// Timing loop -- Equivalent for all kernels tested
cudaEvent_t start, stop;
float run_time_ms;
long double kernelLaunch_ms = 0;
for (size_t i=0; i < NUM_ITER; ++i) {
    run_time_ms = 0;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    testOverheadKernel<<<...>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // Block CPU execution until event recorded
    cudaEventElapsedTime(&run_time_ms, start, stop);
    kernelLaunch_ms += run_time_ms;
}
kernelLaunch_ms /= NUM_ITER;
- Preliminary results show ~5 us launch overhead (consistent)
- CDP seems to be costly
Empty Kernel Launches (Overhead timing) - in ms:
  testOverheadKernel (overhead proxy):          0.0054336
  testOverheadKernelCDP (CDP cost + overhead):  0.0249248
    (subtracting the overhead above):           0.0194912
- Overhead for CDP gets worse with subsequent calls??
Empty Kernel Launches (Overhead timing) - in ms:
  testOverheadKernel (overhead proxy):                                 0.00512
  testOverheadKernelCDP (CDP cost + overhead):                         0.0253376
    (subtracting the overhead above):                                  0.0202176
  testOverheadKernelCDP_Twice (launch two kernels):                    0.59792
    (subtracting the launch of one):                                   0.572582
  testOverheadKernelCDP_Twice_sync (launch two kernels, synchronous):  0.600832
    (subtracting the launch of one):                                   0.575494
  # NOTE: kernels are empty, so perhaps no difference should be expected here vs. the previous case
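The _Twice variants are not listed in these notes; a plausible reconstruction of the non-synchronous one (an assumption, mirroring the CDP pattern above) would simply launch the empty child kernel twice from the parent:

__global__ void testOverheadKernelCDP_Twice() {
    // Assumed: the parent launches the empty child kernel twice back-to-back
    testOverheadKernel<<<gridDim, blockDim>>>();
    testOverheadKernel<<<gridDim, blockDim>>>();
}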
Datagram Creation
- Test the (time) cost of datagram construction on the GPU - it is not easily vectorized.
- Test cases:
  - Otherwise empty kernel calls a __device__ function which constructs a datagram, including allocating memory (createDgramMalloc)
  - Otherwise empty kernel calls a __device__ function which constructs a datagram but does NOT allocate memory (createDgramNoMalloc)
  - Kernel constructs a datagram, including allocating memory (createDgramKernelMalloc)
  - Kernel constructs a datagram but does NOT allocate memory (createDgramKernelNoMalloc)
- Preliminary results:
Datagram Creation Timing - in ms:
  createDgramMalloc (device function, allocates):        0.0123296
  createDgramNoMalloc (device function, NO allocation):  0.0057856
  createDgramKernelMalloc (kernel, allocates):           0.296768
  createDgramKernelNoMalloc (kernel, NO allocation):     0.117888
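For reference, a hypothetical sketch of the two patterns being compared (the Dgram layout and function names here are placeholders, not the actual data structures):

#include <cstdint>

struct Dgram { uint64_t timestamp; uint32_t env; uint32_t payloadSize; };  // placeholder layout

__device__ Dgram* createDgramMallocSketch(uint64_t ts) {
    // Device-side allocation: each call pays the device-heap malloc cost
    Dgram* dg = static_cast<Dgram*>(malloc(sizeof(Dgram)));
    if (dg) { dg->timestamp = ts; dg->env = 0; dg->payloadSize = 0; }
    return dg;
}

__device__ void createDgramNoMallocSketch(Dgram* dg, uint64_t ts) {
    // Pre-allocated buffer: only the field writes remain
    dg->timestamp = ts;
    dg->env = 0;
    dg->payloadSize = 0;
}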