how to use constant memory
how to use texture memory
optimization tricks: pre-fetch etc.
- we get 85 GB/s out of 150 GB/s on the 2075. would cudaDMA help?
what does a queued warp do? (does it pre-fetch the memory?)
- yes
reducing number of registers in kernel (does compiler typically do this optimally?)
- register usage can be controlled with launch bounds (the __launch_bounds__ kernel qualifier)
how to tell with nvvp whether we're memory or flops limited
- Philippe just counts instructions and measures MB/s by running the code (no nvvp). He has some special code that counts instructions for him in complicated cases.
understanding the nvvp columns
best way to associate right GPU with right core (e.g. "taskset", "numactl")
ask about zher speedup numbers: for 4k x 4k, why does gemm improve by 30x but zher by only 6x?
- gemm at large sizes is compute limited, which the GPU does well. zher is memory limited.
using automake with CUDA and C in one library?
swapping out priority: free up memory?
proxy gpu allocation only works on K20?
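
A rough way to see the answer to the gemm versus zher question above is to compare arithmetic intensity (flops per byte of memory traffic). The sketch below is only an estimate under stated assumptions: double-complex data (16 bytes per element), 8 real flops per complex multiply-add, and ideal traffic in which each matrix element is moved once per read or write. The function names are made up for illustration, and the peak flop rate passed to attainable_gflops is a placeholder to be replaced with the real figure for the card; only the 150 GB/s bandwidth comes from the notes above.

```python
BYTES_PER_ELEMENT = 16   # assumption: double-precision complex
FLOPS_PER_CMADD = 8      # assumption: one complex multiply-add = 8 real flops


def gemm_intensity(n):
    """Flops per byte for an n x n gemm with ideal (minimum) memory traffic."""
    flops = FLOPS_PER_CMADD * n ** 3
    # read A, B and C once each, write C once
    bytes_moved = 4 * n * n * BYTES_PER_ELEMENT
    return flops / bytes_moved


def her_intensity(n):
    """Flops per byte for an n x n Hermitian rank-1 update (one triangle)."""
    elems = n * (n + 1) // 2
    flops = FLOPS_PER_CMADD * elems
    # read and write the triangle once, read the vector x once
    bytes_moved = (2 * elems + n) * BYTES_PER_ELEMENT
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Simple roofline bound: limited by compute or by memory bandwidth."""
    return min(peak_gflops, intensity * bandwidth_gbs)


if __name__ == "__main__":
    n = 4096
    peak = 500.0   # hypothetical peak GFLOP/s, not taken from the notes
    bw = 150.0     # GB/s, the bandwidth figure quoted in the notes
    for name, ai in (("gemm", gemm_intensity(n)), ("zher", her_intensity(n))):
        print(name, "flops/byte:", ai,
              "bound (GFLOP/s):", attainable_gflops(ai, peak, bw))
```

Under these assumptions gemm at n = 4096 performs about 512 flops per byte, so it can run near the compute peak. The rank-1 update stays near 0.25 flops per byte regardless of n, so memory bandwidth caps it well below that peak.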

...