Timing tests

#2024-05-14

Calibration (Appropriate Types)

  • Switched from 32-bit to 16-bit integers for the input data
  • The output image is float32
************************** Single Detector Segment *****************************
********************************************************************************
                        Block Radix Calibration Results:
                               (Results by Row)
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.01188
  • Single detector segment result: 11.88 us → roughly 22.755 GB/s
**************************** Six Detector Segments *****************************
********************************************************************************
                        Block Radix Calibration Results:
                               (Results by Row)
********************************************************************************
Using number of rows: 2112
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.0404352
  • Six detector segments result: 40.44 us → roughly 40.109 GB/s
************************** Twelve Detector Segment *****************************
********************************************************************************
                        Block Radix Calibration Results:
                               (Results by Row)
********************************************************************************
Using number of rows: 4224
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.0727408
  • Twelve detector segments result: 72.74 us → roughly 44.598 GB/s
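  • These GB/s figures appear consistent with counting only the 16-bit input reads (an interpretation; the measurement code is not recorded here). For the single segment:
      352 rows × 384 entries × 2 B (int16) = 270,336 B
      270,336 B / 11.88 us ≈ 22.755 GB/s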

#2024-05-08 and 09

Just Common-Mode

  • It may be viable to use fewer threads, each doing more work, for common-mode, which is the costly operation
  • Using BLOCK_LOAD_WARP_TRANSPOSE and its store/sort equivalents (the snippet below shows where this parameter enters)
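The variants compared below differ only in the cub load-algorithm template parameter (and the matching store parameter); a minimal sketch of where it enters, in the same typedef style as the median kernel further down:

typedef cub::BlockLoad<
    int, THREADS, KEYS_PER_THREAD,
    cub::BLOCK_LOAD_WARP_TRANSPOSE  // or BLOCK_LOAD_TRANSPOSE / BLOCK_LOAD_DIRECT
    > BlockLoadT;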
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 1
	Using # of threads: 384
Execution time: 0.0375776
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 2
	Using # of threads: 192
Execution time: 0.0227328
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.0158656
  • Results with BLOCK_LOAD_TRANSPOSE
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 1
	Using # of threads: 384
Execution time: 0.0372272
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 2
	Using # of threads: 192
Execution time: 0.0224752
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.0156112
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 8
	Using # of threads: 48
Execution time: 0.0159136
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 16
	Using # of threads: 24
Execution time: 0.014384
  • With BLOCK_LOAD_DIRECT 
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 1
	Using # of threads: 384
Execution time: 0.0373696
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 2
	Using # of threads: 192
Execution time: 0.0221168
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.0154032
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 8
	Using # of threads: 48
Execution time: 0.0160192
********************************************************************************
                        Block Radix Common-Mode Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 16
	Using # of threads: 24
Execution time: 0.0146352

Including Full Calibration

  • As before, the pedestal/gain correction (the actual add/multiply) costs less than the common-mode, but as each thread processes too many entries (with this memory access pattern), the time goes back up
  • Each thread is responsible for KEYS_PER_THREAD pedestal and gain corrections and loops through them; a sketch follows this list
    • Each thread processes the following indices from a row of the image: [threadIdx.x*KEYS_PER_THREAD, threadIdx.x*KEYS_PER_THREAD + KEYS_PER_THREAD)
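A minimal sketch of that per-thread loop, assuming one block per row (peds, gains, calib_out, and ENTRIES_PER_ROW are hypothetical names; the real kernel fuses this with the common-mode sort):

// Hypothetical fragment: each thread applies KEYS_PER_THREAD
// pedestal/gain corrections over its contiguous index range
const size_t rowStart = (size_t)blockIdx.x * ENTRIES_PER_ROW;
for (int i = 0; i < KEYS_PER_THREAD; ++i) {
    size_t idx = rowStart + (size_t)threadIdx.x * KEYS_PER_THREAD + i;
    calib_out[idx] = (data[idx] - peds[idx]) * gains[idx];
}

With this indexing, consecutive threads read addresses KEYS_PER_THREAD elements apart, so loads coalesce worse as KEYS_PER_THREAD grows; that may be why the times climb again past 4 keys per thread.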
********************************************************************************
                        Block Radix Calibration Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 1
	Using # of threads: 384
Execution time: 0.0386048
********************************************************************************
                        Block Radix Calibration Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 2
	Using # of threads: 192
Execution time: 0.0236032
********************************************************************************
                        Block Radix Calibration Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 4
	Using # of threads: 96
Execution time: 0.0177088
********************************************************************************
                        Block Radix Calibration Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 8
	Using # of threads: 48
Execution time: 0.0183696
********************************************************************************
                        Block Radix Calibration Results:
********************************************************************************
Using number of rows: 352
Using number of entries per row: 384
	Sorting keys per thread: 16
	Using # of threads: 24
Execution time: 0.020056


#2024-05-07

Sort/Median Timing on more "realistic" sizes

  • Use slightly more relevant sizes than the earlier 128-value test
  • Using 384 entries → length of one row for a single epix10ka panel
*********************************************************************************
                           Block Radix Median Results:
*********************************************************************************
Median: 49
Execution time: 0.011864
*********************************************************************************
                           Thrust Sort Median Results:
*********************************************************************************
Median: 49
Execution time: 0.275501
  • Using 352 entries → length of one column for a single epix10ka panel
*********************************************************************************
                           Block Radix Median Results:
*********************************************************************************
Median: 52
Execution time: 0.0113584
*********************************************************************************
                           Thrust Sort Median Results:
*********************************************************************************
Median: 52
Execution time: 0.248819

Common-mode proxy using just block radix

  • Split rows across blocks and attempt a common-mode proxy → each block has 384 threads; 352 blocks run
*********************************************************************************
                       Block Radix Common-Mode Results:
*********************************************************************************
Median: 49 (Final row)
Execution time: 0.0384512
    • Called as: blockRadixCommonMode<numentries,1,int><<<nrows,numentries,numentries*sizeof(int)>>>(devArr, devMedian, devResult); 
      • Data loaded by row: BlockLoadT(temp.load).Load(data+row, thread_keys, numItems);

Calibration proxy

  • Adding a pedestal subtraction and calibration step has only a small effect on the overall time
    • Assumes all pixels are good (i.e. no mask)
********************************************************************************
                        Block Radix Calibration Results:
********************************************************************************
Execution time: 0.0386928



#2024-05-06

Median calculation methods

  • Initial attempt at comparing different sort methods with the aim of calculating a median
    • cub provides block-wide primitives (among others) as well as some ready-made routines - use BlockRadixSort here
    • Also using thrust::sort - higher level than cub and somewhat easier to use
  • Compare using a single block for now (i.e. gridDim=1). Both methods should be straightforward to adapt to parallel block execution.
  • thrust seems to be intended to be called mostly from the host side. That said, inspecting the source shows that at least some of the functionality is compiled for both host and device, and the tests below compiled and ran.
    • Packages like cupy rely on thrust - hence the comparison.
    • It may not be the most relevant in the long run if we want to remove the CPU from the equation entirely.

Method 1 - cub::BlockRadixSort

#include <cub/cub.cuh>

template <int THREADS, int ITEMS_X_THREAD, class T>
__global__
void blockRadixMedian(const T* __restrict__ data, T* __restrict__ median_out,
                            T* __restrict__ sorted_out) {
    typedef cub::BlockLoad<
        T, THREADS, ITEMS_X_THREAD, cub::BLOCK_LOAD_TRANSPOSE> BlockLoadT;
    typedef cub::BlockStore<
        T, THREADS, ITEMS_X_THREAD, cub::BLOCK_STORE_TRANSPOSE> BlockStoreT;
    typedef cub::BlockRadixSort<
        T, THREADS, ITEMS_X_THREAD> BlockRadixSortT;
    
    __shared__ union { 
        typename BlockLoadT::TempStorage      load;
        typename BlockStoreT::TempStorage     store;
        typename BlockRadixSortT::TempStorage sort;
    } temp;

    size_t numItems = THREADS*ITEMS_X_THREAD;
    extern __shared__ T sorted[];
    
    __shared__ T median;
    if (!threadIdx.x) {
        median = 0;
    }
    T thread_keys[ITEMS_X_THREAD];

    // Copy data over
    BlockLoadT(temp.load).Load(data, thread_keys);
    __syncthreads();
    
    // Sort data
    BlockRadixSortT(temp.sort).Sort(thread_keys);
    __syncthreads();

    // Store in output
    BlockStoreT(temp.store).Store(sorted, thread_keys);
    __syncthreads();
    // Strided copy so the full sorted array is written out even when
    // ITEMS_X_THREAD > 1 (i.e. numItems > THREADS)
    for (size_t i = threadIdx.x; i < numItems; i += THREADS) {
        sorted_out[i] = sorted[i];
    }

    if (!threadIdx.x) {
        // Even count: average the two middle elements;
        // odd count: take the single middle element
        if (numItems % 2 == 0) {
            size_t idx = numItems/2 - 1;
            median = (sorted[idx] + sorted[idx+1])/2;
        } else {
            median = sorted[numItems/2];
        }
        *median_out = median;
    }
}
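For reference, a hypothetical host-side launch of Method 1, matching the single-block, 128-value test in the results further down (devData, devMedian, devSorted are assumed device allocations; the dynamic shared-memory argument sizes the extern sorted[] array):

constexpr int THREADS = 128;
constexpr int ITEMS = 1;
blockRadixMedian<THREADS, ITEMS, int>
    <<<1, THREADS, THREADS * ITEMS * sizeof(int)>>>(devData, devMedian, devSorted);
cudaDeviceSynchronize();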

Method 2 - thrust::sort 

#include <thrust/sort.h>
#include <thrust/execution_policy.h>

template <int THREADS, int ITEMS_X_THREAD, class T>
__global__
void thrustSortMedian(const T* __restrict__ data, T* __restrict__ median_out,
                            T* __restrict__ sorted_out) {
    extern __shared__ T sorted[];
    __shared__ T median;

    size_t numItems = THREADS*ITEMS_X_THREAD;

    // Copy data to mutable shared memory (strided so all numItems
    // entries are covered when ITEMS_X_THREAD > 1)
    for (size_t i = threadIdx.x; i < numItems; i += THREADS) {
        sorted[i] = data[i];
    }
    __syncthreads();
     
    // Sort data on a single thread; depending on the Thrust version and
    // build flags this may launch a child kernel (CDP) or fall back to a
    // sequential sort
    if (!threadIdx.x) {
        thrust::sort(thrust::device, sorted, sorted + numItems);
    }
    __syncthreads();

    // Store in output (strided, as above)
    for (size_t i = threadIdx.x; i < numItems; i += THREADS) {
        sorted_out[i] = sorted[i];
    }

    if (!threadIdx.x) {
        // Even count: average the two middle elements;
        // odd count: take the single middle element
        if (numItems % 2 == 0) {
            size_t idx = numItems/2 - 1;
            median = (sorted[idx] + sorted[idx+1])/2;
        } else {
            median = sorted[numItems/2];
        }
        *median_out = median;
    }
}

Results

  • Used same timing method as the timing loop below
  • Tested the time to sort 128 values in a single block
  • thrust is slower - CDP overhead? There are probably obvious improvements here that I'm missing.
*********************************************************************************
                           Block Radix Median Results:
*********************************************************************************
Median: 47
Execution time: 0.0085376
*********************************************************************************
                           Thrust Sort Median Results:
*********************************************************************************
Median: 47
Execution time: 0.12403


#2024-04-30

Kernel Launches

  • Attempt testing overhead for launching CUDA kernels
    • Internet perusing suggests ~5 us (various developer forums)
  • Launching empty kernels as a proxy
  • Want to see the additional overhead for making use of CUDA dynamic parallelism (CDP - Kernels launching kernels)
__global__
void testOverheadKernel() {}

__global__
void testOverheadKernelCDP() {
    testOverheadKernel<<<gridDim,blockDim>>>();
}
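Device-side launches like the one in testOverheadKernelCDP generally require compiling with relocatable device code; a typical invocation (file name hypothetical; exact flags depend on the CUDA version and build setup):

nvcc -rdc=true -o overhead_test overhead_test.cu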

/// Timing loop -- Equivalent for all kernels tested
cudaEvent_t start, stop;
float run_time_ms;
long double kernelLaunch_ms = 0;
// Create the events once outside the loop (and destroy them afterwards),
// rather than re-creating them every iteration without a matching destroy
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (size_t i=0; i < NUM_ITER; ++i) {
    run_time_ms = 0;
    cudaEventRecord(start, 0);
    testOverheadKernel<<<...>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop); // Block CPU execution until event recorded
    cudaEventElapsedTime(&run_time_ms, start, stop);
    kernelLaunch_ms += run_time_ms;
}
cudaEventDestroy(start);
cudaEventDestroy(stop);
kernelLaunch_ms /= NUM_ITER;
  • Preliminary results show ~5 us launch overhead (consistent)
  • CDP seems to be costly
************************************************
Empty Kernel Launches (Overhead timing) - in ms
************************************************
testOverheadKernel (Overhead proxy): 0.0054336
testOverheadKernelCDP (CDP cost+overhead): 0.0249248
	 (subtract overhead above): 0.0194912
  • Overhead for CDP gets worse with subsequent calls??
************************************************
Empty Kernel Launches (Overhead timing) - in ms
************************************************
testOverheadKernel (Overhead proxy): 0.00512
testOverheadKernelCDP (CDP cost+overhead): 0.0253376
	 (subtract overhead above): 0.0202176
testOverheadKernelCDP_Twice (Launch two kernels): 0.59792
	 (subtract launch of 1): 0.572582
testOverheadKernelCDP_Twice_sync (Launch two kernels, synchronous): 0.600832     # NOTE: kernels are empty, so perhaps don't expect difference here vs. previous
	 (subtract launch of 1): 0.575494


Datagram Creation

  • Test (time) the cost of datagram construction on the GPU - not easily vectorized.
  • Test cases (a sketch of the kernel-side malloc variant follows this list):
    • Otherwise empty kernel calls a __device__ function which constructs a datagram, including allocating memory (createDgramMalloc)
    • Otherwise empty kernel calls a __device__ function which constructs a datagram but does NOT allocate memory (createDgramNoMalloc)
    • Kernel constructs a datagram, including allocating memory (createDgramKernelMalloc)
    • Kernel constructs a datagram but does NOT allocate memory (createDgramKernelNoMalloc)
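The kernels themselves aren't recorded here; a minimal sketch of what the kernel-side malloc variant could look like (the Dgram fields are placeholders, not the real layout):

struct Dgram {
    // Placeholder fields; the real datagram layout is not recorded here
    unsigned seq;
    unsigned env;
};

__global__
void createDgramKernelMalloc(Dgram** out) {
    // Device-side malloc is what separates the *Malloc variants from the
    // *NoMalloc ones, and plausibly accounts for most of their extra cost
    Dgram* dg = (Dgram*)malloc(sizeof(Dgram));
    dg->seq = 0;
    dg->env = 0;
    *out = dg;
}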
  • Preliminary results: 
*********************************************
Datagram Creation Timing - in ms
*********************************************
createDgramMalloc (device function, allocates): 0.0123296
createDgramNoMalloc (device function, NO allocation): 0.0057856
createDgramKernelMalloc (Kernel, allocates): 0.296768
createDgramKernelNoMalloc (Kernel, NO allocation): 0.117888