
...

  • how to use constant memory
    • constant memory: broadcasts the same 4 bytes to all threads of a warp if the request is uniform; there is a performance penalty if threads read different addresses. Must be declared explicitly with __constant__ (on Kepler, "immediates" are stored in constant memory if large enough, otherwise embedded in the instruction).
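A minimal sketch of the `__constant__` usage described above (the coefficient table and kernel names are illustrative, not from the discussion):

```cuda
#include <cstdio>

// Hypothetical coefficient table. __constant__ data is cached and is
// broadcast to a whole warp when all threads read the same address.
__constant__ float coeffs[16];

__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[0];  // uniform access: one broadcast, no penalty
}

int main() {
    float host_coeffs[16] = {2.0f};
    // Constant memory is filled from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));
    // ... allocate in/out with cudaMalloc, then launch scale<<<blocks, threads>>>(...)
    return 0;
}
```

If threads of a warp index `coeffs` with different, data-dependent indices, the accesses serialize, which is the penalty the note refers to.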
  • how to use texture memory
    • textures: when used in a "2D or 3D" manner, each fetch can only load 4 bytes. On Kepler you can use __ldg(). The memory is laid out in a space-filling ("snake") order to give better locality for multi-dimensional accesses. Awkward with double precision because of the 4-byte fetch size.
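On Kepler (sm_35+) the read-only data (texture) cache can be reached without binding a texture, via `__ldg()`; a minimal sketch (the saxpy kernel is illustrative):

```cuda
// __restrict__ plus __ldg() routes the read-only loads of x through
// the texture / read-only data cache on Kepler.
__global__ void saxpy_ldg(int n, float a, const float * __restrict__ x,
                          float * __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] += a * __ldg(&x[i]);
}
```

Marking the pointers `const __restrict__` often lets the compiler pick this path on its own; `__ldg()` just makes it explicit.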
  • what does the 150GB/s mem bandwidth number mean?
    • it is sum of read/write bandwidth (each is 75GB/s)
  • optimization tricks: pre-fetch etc.
    • we get 85GB/s out of 150GB/s on the 2075. Use cudaDMA?
    • Philippe measures 84% of memory bandwidth (154GB/s) on a K20
  • what does a queued warp do? (does it pre-fetch the memory)
    • yes, but can do better (e.g. cudaDMA)
  • reducing number of registers in kernel (does compiler typically do this optimally?)
    • can control register usage using launch bounds
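Register usage can be capped per kernel with `__launch_bounds__` (or globally with `nvcc --maxrregcount`); a minimal sketch (the kernel and its bounds are illustrative):

```cuda
// Promise at most 256 threads per block, and ask for at least 4 resident
// blocks per SM; the compiler then limits register use so that fits.
__global__ void __launch_bounds__(256, 4)
my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}
```

The trade-off is that squeezing registers can force spills to local memory, so it is worth re-checking the local memory overhead column after changing the bounds.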
  • how to tell with NVVP whether we're memory- or flops-limited
    • Philippe just counts instructions and measures MB/s by running the code (no NVVP). He has some special code that counts instructions for him in complicated cases.
  • understanding the nvvp columns
    • ECC decreases memory performance ~20% (118GB/s for the 2075)
    • 106GB/s is "quite good"
    • ~90% of peak is the highest achievable
    • maybe we should turn off ECC? Will lose statistics.
      Code Block
      dynamic shared memory:  extra launch parameters: stream and amount of shared memory to allocate dynamically 
      instruction replay overhead: 
        "sum" (different columns have different denominators) of the next 3 columns: 
        o replays because multiple cache lines had to be fetched per global memory access instruction (e.g. because of cache-line misalignment) 
        o replays because multiple cache lines had to be fetched per local memory access instruction (NOTE: this is "LOCAL MEMORY CACHE REPLAY OVERHEAD") 
        o shared memory bank conflicts 
      global memory store efficiency:  measure of stored bytes vs. "real" stored bytes (should only be <100% if we have cache-line misalignments) 
      local memory overhead: measures local memory accesses (stack traffic, register spill traffic) 
      warp execution efficiency:  measure of branch divergence (percentage of threads that are active in a warp) 
      global memory load efficiency:  measure of loaded bytes vs. "real" loaded bytes (should only be <100% if we have cache-line misalignments) 
      achieved occupancy: reduced by the "tail" effect from the number of waves of blocks 
      instructions issued: number of warp instructions issued to all SMs.  compare to 1.15GHz * #SMs * duration (maximum ratio of 1) 
       
      NOTE: internally Fermi really runs 2 half-warps over 2 clocks, but the above math still works out for the simple-minded. 
       
      NOTE: "executed" counts each instruction once; "issued" includes replays 
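The dynamic-shared-memory note above refers to the third and fourth launch-configuration parameters; a minimal sketch (kernel and sizes are illustrative):

```cuda
extern __shared__ float tile[];  // size is supplied at launch time

__global__ void uses_dynamic_smem(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    // reverse the tile within the block, just to use the shared buffer
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch syntax: <<<grid, block, dynamicSharedBytes, stream>>>
// uses_dynamic_smem<<<64, 256, 256 * sizeof(float), stream>>>(in, out);
```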
      
  • best way to associate right GPU with right core (e.g. "taskset", "numactl")
  • ask about zher speedup numbers: for 4Kx4K, why does gemm improve by 30x but zher only by 6x?
    • gemm at large sizes is compute-limited, which the GPU does well; zher is memory-limited.
  • using automake with cuda and c in one library?
    • no good answer
  • nvidia-proxy allocation: free up memory?
    • proxy doesn't provide a good way to free up memory

...