dynamic shared memory: extra launch parameters: stream and amount of shared memory to allocate dynamically

instruction replay overhead: "sum" (different columns have different denominators) of the next 3 columns:
 o replays because multiple cache lines had to be fetched per global memory access instruction (e.g. because of cache-line misalignment)
 o replays because multiple cache lines had to be fetched per local memory access instruction (NOTE: this is "LOCAL MEMORY CACHE REPLAY OVERHEAD")
 o shared memory bank conflicts

global memory store efficiency: measure of requested stored bytes vs. "real" stored bytes (should only be <100% if we have cache-line misalignments)

local memory overhead: measures local memory accesses (stack traffic, register spill traffic)

warp execution efficiency: measure of branch divergence (percentage of threads that are active in a warp)

global memory load efficiency: measure of requested loaded bytes vs. "real" loaded bytes (should only be <100% if we have cache-line misalignments)

achieved occupancy: this comes from the "tail" in the numerology of the number of waves of blocks

instructions issued: number of warp instructions issued to all SMs; compare to 1.15 GHz * #SMs * duration (maximum of 1). NOTE: internally Fermi really runs 2 half-warps over 2 clocks, but the above math still works out for the simple-minded. NOTE: "executed" counts the first issue only; "issued" includes replays.
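The first item above can be illustrated with a launch-configuration sketch. This is a minimal, hypothetical example (the kernel name `scale` and the sizes are made up, not from these notes): the third launch parameter is the dynamic shared-memory size in bytes, and the fourth is the stream.

```cuda
// Hypothetical kernel: the extern __shared__ array is sized at launch
// time by the third launch-configuration parameter, not at compile time.
__global__ void scale(float *data, float factor)
{
    extern __shared__ float tile[];       // dynamic shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i] * factor; // stage through shared memory
    __syncthreads();
    data[i] = tile[threadIdx.x];
}

void launch(float *d_data, int n)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    size_t smemBytes = threads * sizeof(float);  // one float per thread

    // <<<grid, block, dynamic smem bytes, stream>>>
    scale<<<blocks, threads, smemBytes, stream>>>(d_data, 2.0f);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```

Both extra parameters are optional; omitting them gives zero dynamic shared memory and the default stream.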
We have put a version of GPAW in a local SVN repository on suncatgpu01. To access it, use the following:
svn co svn://localhost svngpaw
You can put whatever you want for the last argument (local directory name).