  • how to use constant memory
  • how to use texture memory
  • what does the 150 GB/s memory bandwidth number mean?
    • it is the sum of read and write bandwidth (75 GB/s each)
  • optimization tricks: pre-fetch etc.
    • we get 85 GB/s out of 150 GB/s on the 2075. use cudaDMA?
    • Philippe measures 84% of memory bandwidth (154 GB/s) on K20
  • what does a queued warp do? (does it pre-fetch memory?)
    • yes, but can do better (e.g. cudaDMA)
  • reducing the number of registers in a kernel (does the compiler typically do this optimally?)
    • can control register usage using launch bounds (__launch_bounds__)
  • how to learn with nvvp whether we're memory- or flops-limited
    • Philippe just counts instructions and measures MB/s by running the code (no nvvp). He has some special code that counts instructions for him in complicated cases.
  • understanding the nvvp columns
  • best way to associate right GPU with right core (e.g. "taskset", "numactl")
  • ask about zher speedup numbers: for 4k x 4k, why does gemm improve by 30x but zher only by 6x?
    • gemm with large sizes is compute-limited, which the GPU does well. zher is memory-limited.
  • using automake with cuda and c in one library?
  • swapping out priority: free up memory?
  • proxy GPU allocation only works on K20?
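
For the bandwidth items above, a minimal sketch of how a "fraction of peak" figure can be computed, assuming bytes read and bytes written are both counted (consistent with the read-plus-write reading of the 150 GB/s number). The function names and the 1 GB copy are hypothetical; only the 85 GB/s and 150 GB/s figures come from the notes.

```python
# Effective memory bandwidth, counting both reads and writes.
# Function names and the example transfer size are illustrative assumptions.

def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Total bytes moved per second, in GB/s (1 GB = 1e9 bytes)."""
    return (bytes_read + bytes_written) / seconds / 1e9


def fraction_of_peak(measured_gbs, peak_gbs):
    """Measured bandwidth as a fraction of the quoted peak."""
    return measured_gbs / peak_gbs


# Hypothetical copy kernel: reads 1 GB and writes 1 GB.
# The elapsed time is chosen so the result matches the 85 GB/s in the notes.
elapsed = 2.0 / 85.0
measured = effective_bandwidth_gbs(1e9, 1e9, elapsed)
print(f"measured: {measured:.1f} GB/s")
print(f"fraction of 150 GB/s peak: {fraction_of_peak(measured, 150.0):.0%}")
```

In practice the elapsed time would come from timing the kernel itself; the arithmetic stays the same.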
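
For the gemm vs. zher question above, a rough back-of-the-envelope estimate of arithmetic intensity (flops per byte of memory traffic) shows why one is compute-limited and the other memory-limited. This is a sketch under stated assumptions, none of which are in the notes: double-complex data for both routines, 8 real flops per complex multiply-add, and each matrix element moved to or from memory only once.

```python
# Rough arithmetic intensity (flops per byte) for gemm and zher.
# Assumptions: double-complex elements, ideal data reuse, approximate flop counts.

BYTES_PER_ELEM = 16   # one double-complex value
FLOPS_PER_CMA = 8     # complex multiply (6 real flops) + complex add (2)


def gemm_intensity(n):
    """C = A*B + C with n x n matrices: n^3 complex multiply-adds."""
    flops = FLOPS_PER_CMA * n**3
    bytes_moved = BYTES_PER_ELEM * 4 * n * n      # read A, B, C; write C
    return flops / bytes_moved


def zher_intensity(n):
    """A = alpha*x*x^H + A, updating one triangle of the Hermitian matrix A."""
    elems = n * (n + 1) // 2                      # elements in one triangle
    flops = FLOPS_PER_CMA * elems                 # about one multiply-add each
    bytes_moved = BYTES_PER_ELEM * (2 * elems + n)  # read + write triangle, read x
    return flops / bytes_moved


n = 4096
print(f"gemm: {gemm_intensity(n):.1f} flop/byte")
print(f"zher: {zher_intensity(n):.2f} flop/byte")
```

Under these assumptions gemm intensity grows linearly with n, while zher stays at roughly a quarter of a flop per byte regardless of size, so zher can only speed up by about the ratio of GPU to CPU memory bandwidth.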