
...

  • how to use constant memory
    • constant memory: broadcasts the same 4 bytes to all threads of a warp if the request is uniform; there is a performance penalty if threads read different addresses. Must be declared explicitly with __constant__ (on Kepler, "immediates" are stored in constant memory if large enough, otherwise embedded in the instruction).
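A minimal sketch of the `__constant__` usage described above (the coefficient table and kernel names are illustrative, not from the discussion):

```cuda
#include <cstdio>

// Hypothetical coefficient table. __constant__ data is cached and is
// broadcast to a whole warp when all threads read the same address.
__constant__ float coeffs[16];

__global__ void scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[0];  // uniform access: one broadcast, no penalty
}

int main() {
    float host_coeffs[16] = {2.0f};
    // Constant memory is filled from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));
    // ... allocate in/out with cudaMalloc, then launch scale<<<blocks, threads>>>(...)
    return 0;
}
```

If threads of a warp index `coeffs` with different, data-dependent indices, the accesses serialize, which is the penalty the note refers to.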
  • how to use texture memory
    • textures: when used in a "2D or 3D" manner, each fetch can only load 4 bytes. On Kepler you can use __ldg(). The memory is laid out in a space-filling ("snake") order to give better locality for multi-dimensional accesses. Awkward with double precision because of the 4-byte fetch size.
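On Kepler (sm_35+) the read-only data (texture) cache can be reached without binding a texture, via `__ldg()`; a minimal sketch (the saxpy kernel is illustrative):

```cuda
// __restrict__ plus __ldg() routes the read-only loads of x through
// the texture / read-only data cache on Kepler.
__global__ void saxpy_ldg(int n, float a, const float * __restrict__ x,
                          float * __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] += a * __ldg(&x[i]);
}
```

Marking the pointers `const __restrict__` often lets the compiler pick this path on its own; `__ldg()` just makes it explicit.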
  • what does the 150GB/s mem bandwidth number mean?
    • it is sum of read/write bandwidth (each is 75GB/s)
  • optimization tricks: pre-fetch etc.
    • we get 85GB/s out of 150GB/s on the 2075. Use cudaDMA?
    • Philippe measures 84% of memory bandwidth (154GB/s) on a K20
  • what does a queued warp do? (does it pre-fetch the memory)
    • yes, but can do better (e.g. cudaDMA)
  • reducing number of registers in kernel (does compiler typically do this optimally?)
    • can control register usage using launch bounds
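Register usage can be capped per kernel with `__launch_bounds__` (or globally with `nvcc --maxrregcount`); a minimal sketch (the kernel and its bounds are illustrative):

```cuda
// Promise at most 256 threads per block, and ask for at least 4 resident
// blocks per SM; the compiler then limits register use so that fits.
__global__ void __launch_bounds__(256, 4)
my_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}
```

The trade-off is that squeezing registers can force spills to local memory, so it is worth re-checking the local memory overhead column after changing the bounds.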
  • how to tell with NVVP whether we're memory- or flops-limited
    • Philippe just counts instructions and measures MB/s by running the code (no NVVP). He has some special code that counts instructions for him in complicated cases.
  • understanding the nvvp columns
    • ECC decreases memory performance ~20% (118GB/s for the 2075)
    • 106GB/s is "quite good"
    • ~90% of peak is the highest achievable
    • maybe we should turn off ECC? Will lose statistics.
      Code Block
      dynamic shared memory:  extra launch parameters: stream and amount of shared memory to allocate dynamically 
      instruction replay overhead: 
        "sum" (different columns have different denominators) of the next 3 columns: 
        o replays because multiple cache lines had to be fetched per global memory access instruction (e.g. because of cache-line misalignment) 
        o replays because multiple cache lines had to be fetched per local memory access instruction (NOTE: this is "LOCAL MEMORY CACHE REPLAY OVERHEAD") 
        o shared memory bank conflicts 
      global memory store efficiency:  measure of stored bytes vs. "real" stored bytes (should only be <100% if we have cache-line misalignments) 
      local memory overhead: measures local memory accesses (stack traffic, register spill traffic) 
      warp execution efficiency:  measure of branch divergence (percentage of threads that are active in a warp) 
      global memory load efficiency:  measure of loaded bytes vs. "real" loaded bytes (should only be <100% if we have cache-line misalignments) 
      achieved occupancy: reduced by the "tail" effect from the number of waves of blocks 
      instructions issued: number of warp instructions issued to all SMs.  compare to 1.15GHz * #SMs * duration (maximum ratio of 1) 
       
      NOTE: internally Fermi really runs 2 half-warps over 2 clocks, but the above math still works out for the simple-minded. 
       
      NOTE: "executed" counts each instruction once; "issued" includes replays 
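The dynamic-shared-memory note above refers to the third and fourth launch-configuration parameters; a minimal sketch (kernel and sizes are illustrative):

```cuda
extern __shared__ float tile[];  // size is supplied at launch time

__global__ void uses_dynamic_smem(const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    // reverse the tile within the block, just to use the shared buffer
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launch syntax: <<<grid, block, dynamicSharedBytes, stream>>>
// uses_dynamic_smem<<<64, 256, 256 * sizeof(float), stream>>>(in, out);
```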
      
  • best way to associate right GPU with right core (e.g. "taskset", "numactl")
  • ask about zher speedup numbers: for 4Kx4K, why does gemm improve by 30x but zher only by 6x?
    • gemm at large sizes is compute-limited, which the GPU does well; zher is memory-limited.
  • using automake with cuda and c in one library?
    • no good answer
  • nvidia-proxy allocation: free up memory?
    • proxy doesn't provide a good way to free up memory

...