...
- how large blocks fit onto the SM (esp. wrt shared memory)
- nersc gpu cluster (dirac) usable?
- email addresses of fft/zher guy
- occupancy spreadsheet (see device-limit query sketch below)
- registers per thread
- threads per block
- shared memory per block
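A minimal sketch (assuming device 0, error checking omitted) of querying the per-block/per-SM limits that feed the occupancy spreadsheet:

```
/* Query the resource limits that go into the occupancy spreadsheet.
 * Assumes device 0; error checking omitted for brevity. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs:                     %d\n",        prop.multiProcessorCount);
    printf("registers per block:     %d\n",        prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("max threads per block:   %d\n",        prop.maxThreadsPerBlock);
    printf("max threads per SM:      %d\n",        prop.maxThreadsPerMultiProcessor);
    printf("warp size:               %d\n",        prop.warpSize);
    return 0;
}
```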
- access:
- shared memory 10 clock ticks
- global memory: 400-800 clock ticks
- L1 cache/shared memory size can be traded off (16/48 kB)
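A minimal sketch of setting the 16/48 kB split, either device-wide or per kernel (myKernel is a placeholder name):

```
/* Trade L1 cache against shared memory (16/48 kB split). */
#include <cuda_runtime.h>

__global__ void myKernel(float *data);   /* placeholder kernel */

void configure_cache(void)
{
    /* Device-wide default: 48 kB shared memory / 16 kB L1. */
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

    /* Per-kernel override for a kernel that uses little shared memory. */
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
}
```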
- number of streams: 16 for 2.0 devices
- zher improves 50% in cuda 5 ("enable atomics")
- can try GEAM or ZHERK instead of ZHER (ZHERK sketch below)
- blas2 functions memory bound, blas3 compute bound
- can maybe get access to zher source if we really need it
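A minimal sketch (cuBLAS v2 API; the pointer names d_A/d_C and the batching of update vectors into columns are assumptions) of replacing k rank-1 ZHER updates with one rank-k ZHERK call, i.e. moving from memory-bound BLAS2 to compute-bound BLAS3:

```
/* C := alpha*A*A^H + beta*C on the device: one ZHERK instead of k ZHER calls.
 * d_A is n x k (column major, one update vector per column), d_C is n x n
 * Hermitian; both are assumed to be device pointers. */
#include <cublas_v2.h>
#include <cuComplex.h>

void rank_k_update(cublasHandle_t handle,
                   const cuDoubleComplex *d_A, cuDoubleComplex *d_C,
                   int n, int k)
{
    const double alpha = 1.0, beta = 1.0;
    cublasZherk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                n, k, &alpha, d_A, n, &beta, d_C, n);
}
```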
- 3d ffts might be good
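A minimal cuFFT sketch (in-place forward transform, double complex, error checks omitted) for a 3D grid:

```
/* In-place forward 3D FFT (double complex) with cuFFT.
 * d_data is assumed to be an nx*ny*nz array already resident on the device. */
#include <cufft.h>

void fft3d_forward(cufftDoubleComplex *d_data, int nx, int ny, int nz)
{
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);           /* Z2Z = double complex */
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);   /* in place */
    cufftDestroy(plan);
}
```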
- ucla gpu cluster available?
- openACC only available for certain compilers (pgi, cray)
- new kepler/cuda5 features:
- dynamic parallelism (see sketch after this list)
- hyper-Q for streams
- slower clock speed, more cores
- mem 5x faster (maybe because of more reg space)
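A minimal dynamic-parallelism sketch (needs compute capability 3.5 and nvcc -arch=sm_35 -rdc=true; kernel names and the scaling work are placeholders):

```
/* CUDA 5 / Kepler dynamic parallelism: a kernel launches another kernel
 * directly from the device, with no host round trip. */
__global__ void childKernel(double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;                     /* placeholder work */
}

__global__ void parentKernel(double *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();                   /* wait for the child grid */
    }
}
```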
- startup time with multi-gpus is a known problem
- nvvp is dying; eclipse-based nsight is the replacement
- much better information with nsight
- VASP person says we will win with 100x100x100 grid
- use single precision for pre-conditioning
- unified virtual address space (UVA) for transferring data
- P2P for multi-gpus within a process (see sketch below)
- IPC for multi-gpus between processes
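A minimal P2P sketch for two GPUs inside one process (device IDs 0/1 and pointer names are assumptions; error checks omitted):

```
/* Copy a buffer from GPU 0 to GPU 1 within one process.
 * With UVA (64-bit, compute capability >= 2.0) the copy can go directly
 * over PCIe without staging through the host. */
#include <cuda_runtime.h>

void copy_gpu0_to_gpu1(void *d_dst_on_1, const void *d_src_on_0, size_t bytes)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);    /* can device 1 reach device 0? */

    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);          /* let device 1 access device 0 */
    }

    cudaMemcpyPeer(d_dst_on_1, 1, d_src_on_0, 0, bytes);
}
```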