To-Do List
1/22/2013
- libxc on gpu (lin)
- work on automake stuff on Thursday
- ping Miguel
- AJ tries simple new-setup Pt system with rmm-diis
- use same code with different setups or vice-versa
- generate residual compared to converged
- cpo compares FFTMixer to dacapo
- rpa (jun)
- merge trunk and print pointers to understand crashes
- rewrite code for ZHERK (see the cuBLAS sketch after this list)
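In case the rewrite ends up wrapping the library routine, a minimal sketch of the ZHERK call through cuBLAS; the names (zherk_update, d_A, d_C) are placeholders, not our actual code:

    /* Hedged sketch: Hermitian rank-k update C := alpha*A*A^H + beta*C.
       All names here are illustrative. */
    #include <cublas_v2.h>
    #include <cuComplex.h>

    void zherk_update(cublasHandle_t handle,
                      const cuDoubleComplex *d_A, int n, int k,
                      cuDoubleComplex *d_C)
    {
        const double alpha = 1.0, beta = 0.0;
        /* fills the lower triangle of the n x n matrix C from the n x k A */
        cublasZherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    n, k, &alpha, d_A, n, &beta, d_C, n);
    }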
Questions for Nvidia
- how to use constants memory
- constant memory: broadcasts the same 4 bytes to all threads of a warp if the request is uniform; there is a performance penalty if it isn't. Must be called out explicitly with __constant__ (on Kepler, "immediates" are stored in constant memory if large enough, otherwise encoded in the instruction). See the sketch below.
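A minimal sketch of the broadcast pattern, with a made-up 16-entry coefficient table:

    /* __constant__ data lives in constant memory; a warp reading one
       uniform address gets it as a single broadcast. */
    __constant__ double coeffs[16];            /* illustrative table */

    __global__ void apply(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            /* every thread of the warp reads the same element:
               a uniform access, served by one broadcast */
            out[i] = coeffs[blockIdx.x % 16] * in[i];
    }

    /* host side: cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs)); */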
- how to use texture memory
- textures: if used in a "2D or 3D" manner, can only fetch 4-byte elements. On Kepler, __ldg() can be used instead. The memory is ordered in a strange way ("snake") to allow better accesses to multi-dimensional data. Ugly with double precision because of the 4-byte element size. See the sketch below.
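A minimal __ldg() sketch for Kepler (sm_35 and up); the kernel is illustrative:

    /* __ldg() loads through the read-only (texture) cache without
       binding a texture, and handles 8-byte doubles directly. */
    __global__ void scale(const double * __restrict__ in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0 * __ldg(&in[i]);   /* read-only cached load */
    }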
- what does the 150GB/s mem bandwidth number mean?
- it is the sum of read and write bandwidth (each is 75GB/s)
- optimization tricks: pre-fetch etc.
- we get 85GB/s out of 150GB/s on 2075. use cudaDMA?
- philippe measures 84% of peak memory bandwidth (154GB/s) on K20
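For reference, a minimal sketch of how such achieved-bandwidth numbers can be measured: time a streaming copy kernel with CUDA events and count both the read and the write, matching the read+write convention above (sizes are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void copy(const double *in, double *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main()
    {
        const int n = 1 << 24;                 /* 16M doubles, illustrative */
        double *in, *out;
        cudaMalloc((void **)&in,  n * sizeof(double));
        cudaMalloc((void **)&out, n * sizeof(double));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0);
        copy<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        /* each element is read once and written once */
        double gbs = 2.0 * n * sizeof(double) / (ms * 1e6);
        printf("achieved bandwidth: %.1f GB/s\n", gbs);
        return 0;
    }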
- what does a queued warp do? (does it pre-fetch the memory)
- yes, but can do better (e.g. cudaDMA)
- reducing the number of registers in a kernel (does the compiler typically do this optimally?)
- register usage can be controlled with launch bounds (sketch below)
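A minimal __launch_bounds__ sketch; the kernel body is illustrative:

    /* __launch_bounds__(maxThreadsPerBlock, minBlocksPerSM) lets the
       compiler cap register use so the desired occupancy is reachable. */
    __global__ void
    __launch_bounds__(256, 4)   /* <= 256 threads/block, aim for 4 blocks/SM */
    heavy_kernel(double *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * data[i] + 1.0;
    }

    /* whole-file alternative: nvcc --maxrregcount=N */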
- how to tell with nvvp whether we're memory- or flops-limited
- philippe just counts instructions and measures MB/s by running the code (no nvvp). He has some special code that counts instructions for him in complicated cases.
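The same check by hand, as a hedged back-of-envelope; the 2075 peak numbers below are assumed (roughly 515 GFLOP/s double precision, 150 GB/s), and the kernel counts are placeholders:

    #include <stdio.h>

    int main(void)
    {
        double peak_flops = 515e9;   /* assumed Tesla 2075 DP peak */
        double peak_bw    = 150e9;   /* assumed Tesla 2075 bandwidth, B/s */
        double balance    = peak_flops / peak_bw;   /* ~3.4 flop/byte */

        /* plug in the counted flops and moved bytes of a kernel */
        double kernel_flops = 1e9, kernel_bytes = 8e9;   /* placeholders */

        if (kernel_flops / kernel_bytes < balance)
            printf("memory limited (%.2f < %.2f flop/byte)\n",
                   kernel_flops / kernel_bytes, balance);
        else
            printf("flops limited\n");
        return 0;
    }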
- understanding the nvvp columns (see the column notes below)
- ECC decreases memory performance 20%. (118GB/s for 2075)
- 106GB/s is "quite good"
- 90% is the highest
- maybe we should turn off ECC? Will lose statistics.
- dynamic shared memory: extra launch parameters are the stream and the amount of shared memory to allocate dynamically (see the sketch below)
- instruction replay overhead: a "sum" (different columns have different denominators) of the next 3 columns:
  o replays because a global memory access instruction needed to fetch multiple cache lines (e.g. because of cache-line misalignment)
  o replays because a local memory access instruction needed to fetch multiple cache lines (NOTE: this is "local memory cache replay overhead")
  o shared memory bank conflicts
- global memory store efficiency: measure of stored bytes vs. "real" stored bytes (should only be <100% if we have cache-line misalignments)
- local memory overhead: measures local memory accesses (stack traffic, register spill traffic)
- warp execution efficiency: measure of branch divergence (percentage of threads that are active in a warp)
- global memory load efficiency: measure of loaded bytes vs. "real" loaded bytes (should only be <100% if we have cache-line misalignments)
- achieved occupancy: comes from the "tail" in the numerology of the number of waves of blocks
- instructions issued: number of warp instructions issued to all SMs; compare to 1.15GHz * #SMs * duration (maximum of 1). NOTE: internally Fermi really runs 2 half-warps over 2 clocks, but the above math still works out for the simple-minded. NOTE: "executed" counts only the first issue; "issued" includes replays.
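A minimal sketch of the dynamic-shared-memory launch from the first point above; the reduction kernel is illustrative and assumes a power-of-two block size:

    __global__ void reduce(const double *in, double *out, int n)
    {
        extern __shared__ double buf[];       /* sized at launch time */
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        buf[tid] = (i < n) ? in[i] : 0.0;
        __syncthreads();
        /* tree reduction; assumes blockDim.x is a power of two */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];
    }

    /* 3rd launch parameter = dynamic shared bytes, 4th = stream:
       reduce<<<blocks, threads, threads * sizeof(double), stream>>>(in, out, n); */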
- best way to associate right GPU with right core (e.g. "taskset", "numactl")
- if the numactl settings are OK, the OS should take care of that; we still have to get the taskset / CUDA context creation pairing right (hedged sketch below)
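One possible device-selection sketch, assuming GPU i sits on socket i and a fixed CORES_PER_SOCKET (both machine-dependent assumptions):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <cuda_runtime.h>

    #define CORES_PER_SOCKET 6        /* assumption, machine dependent */

    /* pick the GPU on the same socket as the core we were pinned to
       with taskset/numactl; call before any CUDA context is created */
    static void pick_local_gpu(void)
    {
        int socket = sched_getcpu() / CORES_PER_SOCKET;
        cudaSetDevice(socket);
    }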
- ask about zher speedup numbers: for 4kx4k, why does gemm improve by 30x but zher only by 6x?
- gemm with large sizes is compute limited (O(n^3) flops on O(n^2) data), which the GPU does well; zher is memory limited (only O(n^2) flops on O(n^2) data)
- using automake with cuda and c in one library?
- no good answer
- nvidia-proxy allocation: free up memory?
- proxy doesn't provide a good way to free up memory
...