...
- how large blocks fit onto the SM (esp. wrt shared memory)
- nersc gpu cluster (dirac) usable?
- email addresses of fft/zher guy
- occupancy spreadsheet (see device-limit query sketch below)
- registers per thread
- threads per block
- shared memory per block
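A minimal sketch (assuming device 0, error checking omitted) of querying the per-block/per-SM limits that feed the occupancy spreadsheet:

```
/* Query the resource limits that go into the occupancy spreadsheet.
 * Assumes device 0; error checking omitted for brevity. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs:                     %d\n",        prop.multiProcessorCount);
    printf("registers per block:     %d\n",        prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("max threads per block:   %d\n",        prop.maxThreadsPerBlock);
    printf("max threads per SM:      %d\n",        prop.maxThreadsPerMultiProcessor);
    printf("warp size:               %d\n",        prop.warpSize);
    return 0;
}
```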
- access:
- shared memory 10 clock ticks
- global memory: 400-800 clock ticks
- L1 cache/shared memory size can be traded off (16/48 kB)
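A minimal sketch of setting the 16/48 kB split, either device-wide or per kernel (myKernel is a placeholder name):

```
/* Trade L1 cache against shared memory (16/48 kB split). */
#include <cuda_runtime.h>

__global__ void myKernel(float *data);   /* placeholder kernel */

void configure_cache(void)
{
    /* Device-wide default: 48 kB shared memory / 16 kB L1. */
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    /* Per-kernel override for a kernel that uses little shared memory. */
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}
```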
- number of streams: 16 for 2.0 devices
- zher improves 50% in cuda 5 ("enable atomics")
- can try GEAM or ZHERK instead of ZHER (ZHERK sketch below)
- blas2 functions memory bound, blas3 compute bound
- can maybe get access to zher source if we really need it
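A minimal sketch (cuBLAS v2 API; the pointer names d_A/d_C and the batching of update vectors into columns are assumptions) of replacing k rank-1 ZHER updates with one rank-k ZHERK call, i.e. moving from memory-bound BLAS2 to compute-bound BLAS3:

```
/* C := alpha*A*A^H + beta*C on the device: one ZHERK instead of k ZHER calls.
 * d_A is n x k (column major, one update vector per column), d_C is n x n
 * Hermitian; both are assumed to be device pointers. */
#include <cublas_v2.h>
#include <cuComplex.h>

void rank_k_update(cublasHandle_t handle,
                   const cuDoubleComplex *d_A, cuDoubleComplex *d_C,
                   int n, int k)
{
    const double alpha = 1.0, beta = 1.0;
    cublasZherk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                n, k, &alpha, d_A, n, &beta, d_C, n);
}
```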
- 3d ffts might be good
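A minimal cuFFT sketch (in-place forward transform, double complex, error checks omitted) for a 3D grid:

```
/* In-place forward 3D FFT (double complex) with cuFFT.
 * d_data is assumed to be an nx*ny*nz array already resident on the device. */
#include <cufft.h>

void fft3d_forward(cufftDoubleComplex *d_data, int nx, int ny, int nz)
{
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);           /* Z2Z = double complex */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);   /* in place */
    cufftDestroy(plan);
}
```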
- ucla gpu cluster available?
- openACC only available for certain compilers (pgi, cray)
- new kepler/cuda5 features:
- dynamic parallelism (see sketch after this list)
- hyper-Q for streams
- slower clock speed, more cores
- mem 5x faster (maybe because of more reg space)
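A minimal dynamic-parallelism sketch (needs compute capability 3.5 and nvcc -arch=sm_35 -rdc=true; kernel names and the scaling work are placeholders):

```
/* CUDA 5 / Kepler dynamic parallelism: a kernel launches another kernel
 * directly from the device, with no host round trip. */
__global__ void childKernel(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;                     /* placeholder work */
}

__global__ void parentKernel(double *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                   /* wait for the child grid */
    }
}
```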
- startup time with multi-gpus is a known problem
- nvvp is dying; eclipse-based nsight is the replacement
- much better information with nsight
- VASP person says we will win with 100x100x100 grid
- use single precision for pre-conditioning
- unified virtual address space (UVA) for transferring data
- P2P for multi-gpus within a process (see sketch below)
- IPC for multi-gpus between processes
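A minimal P2P sketch for two GPUs inside one process (device IDs 0/1 and pointer names are assumptions; error checks omitted):

```
/* Copy a buffer from GPU 0 to GPU 1 within one process.
 * With UVA (64-bit, compute capability >= 2.0) the copy can go directly
 * over PCIe without staging through the host. */
#include <cuda_runtime.h>

void copy_gpu0_to_gpu1(void *d_dst_on_1, const void *d_src_on_0, size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);    /* can device 1 reach device 0? */

    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);          /* let device 1 access device 0 */
    }

    cudaMemcpyPeer(d_dst_on_1, 1, d_src_on_0, 0, bytes);
}
```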