
...

constant memory
texture memory
optimization tricks: pre-fetch etc.
what does a queued warp do? (does it pre-fetch the memory)
reducing number of registers in kernel (does compiler typically do this optimally?)
how to tell with nvvp whether we're memory- or flops-limited
understanding the nvvp columns
ask about zher speedup numbers: for 4kx4k why does gemm improve by x30 but zher improves by x6?
12/18/2012
libxc on gpu (lin)
- use common work file for CPU/GPU
digest RPA timing measurements (lin)
paper (jun)
redo timing measurements (jun)
understand timing measurements more fully (jun)
dacapo density mixing vs. GPAW (cpo)

understand nvidia zgemm speedup plot (jun/cpo)
- ANSWER: without threading, the GPU is x29 faster. With 6 OpenMP threads we get x5, which agrees with nvidia
understand why zher is x6 better on GPU but we see x24 with RPA (will put device sync in code) (jun/cpo)
- ANSWER: the CPU is memory-bandwidth limited (so faster with 1 core). This accounts for roughly x2, and the other x2 comes from overlapping CPU/GPU computation.
does cuda5 improve ZHER? (jun/cpo) ANSWER: no improvement
libxc on gpu (lin)
- use common work file for CPU/GPU
digest RPA timing measurements (lin)
think about moving lambda calc to GPU (jun) (ANSWER: no need, 10 or 20% improvement, best case)
try multiple surfaces with jacapo/gpaw-pw (aj)
paper (jun)
try calling dacapo density mixing from GPAW (cpo)
make sure all libxc self-tests run
why doesn't marcin's na.py converge, even with fixed density?
can the alphas for the nt_G really be used for the D's?
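
Rough arithmetic check of the speedup ANSWERs above (zgemm x29 vs x5, zher x6 vs x24). This is only a sketch: it assumes the CPU zgemm scales ideally with the number of OpenMP threads, and that the two "roughly x2" factors for zher simply multiply. Neither assumption comes from the measurements, and the function names are made up for illustration.

```python
# Sanity check of the speedup figures quoted in the notes.
# Assumption (not measured): CPU zgemm scales ideally with OpenMP threads.

def expected_gpu_speedup(single_thread_speedup, n_threads):
    """GPU speedup relative to an ideally scaling n_threads CPU run."""
    return single_thread_speedup / n_threads

def combined_speedup(*factors):
    """Product of independent speedup factors (assumed to multiply)."""
    result = 1.0
    for factor in factors:
        result *= factor
    return result

# zgemm: x29 against 1 CPU thread, compared against 6 OpenMP threads.
zgemm_vs_6_threads = expected_gpu_speedup(29, 6)

# zher in RPA: x6 baseline, roughly x2 from the CPU being bandwidth limited,
# roughly x2 from overlapping CPU/GPU computation.
rpa_zher = combined_speedup(6, 2, 2)

print(round(zgemm_vs_6_threads, 1), rpa_zher)
```

Under these assumptions the zgemm estimate comes out a little under x5 and the zher estimate at x24, in line with the numbers recorded above.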

...
