What We Learned at GTC 2013
- met main developer of espresso-GPU
- FD stencils: tricky; they are memory-bandwidth limited, so well-written CPU code is about as good as GPU code
- new CUDA Python from Continuum Analytics
- more about GPUDirect and RDMA (waiting on driver software)
- can use UVA with our current hardware/software (see the sketch after this list)
- GPUDirect and RDMA are broken when the transfer path crosses QPI
- MVAPICH supports the above (more cutting-edge than OpenMPI)
- LSF support for GPU clusters
- improved "metrics" for nvprof and racecheck analysis in cuda-memcheck in cuda 5.5
- Jun won't have to take the square root in quite so ugly a way with the new cuBLAS
- no GPU ScaLAPACK (could be implemented by having ScaLAPACK call GPU LAPACK routines)
- multi-GPU cuBLAS is upcoming
- molecular dynamics is easier to port to GPUs than DFT
- GEMM is inefficient for "narrow" matrices (optimization in progress)
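A minimal sketch of the UVA point above, assuming a CUDA 4.0+ toolkit and a Fermi-class or newer card (buffer size and variable names are illustrative): with unified virtual addressing, host and device pointers share one address space, so cudaMemcpy can take cudaMemcpyDefault and the runtime infers the transfer direction from the pointers.

    /* uva_copy.cu - hedged sketch of UVA-based copies; sizes and names are illustrative */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t n = 1 << 20;
        double *host_buf, *dev_buf;

        /* pinned host memory participates in the unified virtual address space */
        cudaMallocHost((void **)&host_buf, n * sizeof(double));
        cudaMalloc((void **)&dev_buf, n * sizeof(double));

        for (size_t i = 0; i < n; ++i) host_buf[i] = (double)i;

        /* with UVA the direction argument can simply be cudaMemcpyDefault;
           the runtime works out host-to-device from the pointer values */
        cudaMemcpy(dev_buf, host_buf, n * sizeof(double), cudaMemcpyDefault);
        cudaMemcpy(host_buf, dev_buf, n * sizeof(double), cudaMemcpyDefault);

        printf("last element after round trip: %f\n", host_buf[n - 1]);

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        return 0;
    }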
What We Learned at GTC 2012
- how large blocks fit onto an SM (especially with respect to shared memory)
- is the NERSC GPU cluster (Dirac) usable?
- email addresses of the FFT/ZHER people
- occupancy spreadsheet; inputs:
  - registers per thread
  - threads per block
  - shared memory per block
- access latencies:
  - shared memory: ~10 clock ticks
  - global memory: ~400-800 clock ticks
- L1 cache/shared memory split can be traded off (16/48 kB; see the cache-config sketch after this list)
- number of concurrent streams: 16 for compute capability 2.0 devices
- ZHER improves by 50% in CUDA 5 ("enable atomics")
- can try GEAM or ZHERK instead of ZHER (see the ZHERK sketch after this list)
- BLAS level-2 functions are memory bound, level-3 functions are compute bound
- can maybe get access to the ZHER source if we really need it
- 3D FFTs might perform well
- is the UCLA GPU cluster available?
- OpenACC only available for certain compilers (PGI, Cray)
- new Kepler/CUDA 5 features:
  - dynamic parallelism (see the sketch after this list)
  - Hyper-Q for streams
  - slower clock speed, more cores
  - memory 5x faster (maybe because of more register space)
- startup time with multiple GPUs is a known problem
- nvvp is dying: replaced by the Eclipse-based Nsight
- much better information with Nsight
- the VASP person says we will win with a 100x100x100 grid
- use single precision for preconditioning
- read the "Best Practices" guide to understand nvvp profiling
- unified virtual address space (UVA) for transferring data
- P2P for multiple GPUs within a process (see the sketch after this list)
- IPC for multiple GPUs across processes
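A minimal sketch of the L1 cache/shared memory trade-off noted above (Fermi-class assumption; the kernel and its body are placeholders): the 16/48 kB split can be requested device-wide or per kernel.

    /* cache_config.cu - hedged sketch of trading L1 cache against shared memory */
    #include <cuda_runtime.h>

    __global__ void stencil_kernel(const double *in, double *out, int n)
    {
        /* placeholder body; a real stencil would use the shared memory it asked for */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            out[i] = 0.5 * (in[i - 1] + in[i + 1]);
    }

    int main(void)
    {
        /* device-wide preference: 48 kB shared / 16 kB L1 */
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        /* or per kernel: 48 kB L1 / 16 kB shared for this one kernel */
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);

        /* ... allocate buffers, launch stencil_kernel, etc. ... */
        return 0;
    }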
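A hedged sketch of the ZHERK-instead-of-ZHER idea (n, k and the device pointers are illustrative, not from our code): accumulating k rank-1 updates into one rank-k ZHERK call turns a memory-bound BLAS-2 operation into a compute-bound BLAS-3 one, which matches the BLAS-2/BLAS-3 note above.

    /* zherk_sketch.cu - hedged sketch; n, k and the buffers are illustrative */
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const int n = 1024;   /* order of the Hermitian matrix C           */
        const int k = 32;     /* number of accumulated rank-1 updates      */
        cuDoubleComplex *d_A, *d_C;
        cudaMalloc((void **)&d_A, sizeof(cuDoubleComplex) * n * k);
        cudaMalloc((void **)&d_C, sizeof(cuDoubleComplex) * n * n);

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* C := alpha * A * A^H + beta * C, updating only the lower triangle;
           one ZHERK call replaces k separate ZHER (rank-1) updates */
        const double alpha = 1.0, beta = 1.0;
        cublasZherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    n, k, &alpha, d_A, n, &beta, d_C, n);

        cublasDestroy(handle);
        cudaFree(d_A);
        cudaFree(d_C);
        return 0;
    }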
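A minimal sketch of the dynamic-parallelism feature listed above (requires a compute capability 3.5 Kepler card and CUDA 5; kernel names are made up): a kernel can launch child kernels directly from the device.

    /* dynpar_sketch.cu - hedged sketch; build with:
       nvcc -arch=sm_35 -rdc=true dynpar_sketch.cu -lcudadevrt */
    #include <cstdio>

    __global__ void child_kernel(int parent_block)
    {
        printf("child launched by parent block %d\n", parent_block);
    }

    __global__ void parent_kernel(void)
    {
        /* with dynamic parallelism a kernel launches further kernels itself */
        if (threadIdx.x == 0)
            child_kernel<<<1, 4>>>(blockIdx.x);
    }

    int main(void)
    {
        parent_kernel<<<2, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }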
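A hedged sketch of the P2P point for multiple GPUs within one process (assumes two GPUs on the same PCIe root complex, i.e. not separated by QPI; buffer size is illustrative): once peer access is enabled, cudaMemcpyPeer copies directly between the devices without staging through the host.

    /* p2p_sketch.cu - hedged sketch of peer-to-peer copies between two GPUs in one process */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t bytes = 1 << 26;      /* 64 MB, illustrative */
        int can_access = 0;

        cudaDeviceCanAccessPeer(&can_access, 0, 1);
        if (!can_access) {
            printf("GPUs 0 and 1 cannot reach each other via P2P (e.g. across QPI)\n");
            return 1;
        }

        double *buf0, *buf1;
        cudaSetDevice(0);
        cudaMalloc((void **)&buf0, bytes);
        cudaDeviceEnablePeerAccess(1, 0);  /* second argument is a reserved flag, must be 0 */

        cudaSetDevice(1);
        cudaMalloc((void **)&buf1, bytes);
        cudaDeviceEnablePeerAccess(0, 0);

        /* direct device-to-device copy over PCIe, no host staging */
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }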