What We Learned at GTC 2013
- met the main developer of espresso-GPU
- FD stencils: tricky, memory-bandwidth limited; well-written CPU code is as good as GPUs
- new CUDA Python from Continuum Analytics
- more about GPUDirect, RDMA (waiting for driver software)
- can use UVA with our current hardware/software (see the UVA sketch after this list)
- GPUDirect and RDMA are broken by QPI
- MVAPICH support for the above (more cutting-edge than Open MPI)
- LSF support for GPU clusters
- improved "metrics" for nvprof and racecheck analysis in cuda-memcheck in cuda 5.5
- jun won't have to take the sqrt in quite so ugly a way with new cublas
- no GPU scalapack (could implement by having scalapack call GPU lapack routines)
- upcoming multi-gpu cublas
- molecular dynamics easier than DFT on GPU
- GEMM is inefficient for "narrow" vectors (optimization in progress)
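Re: UVA above - a minimal sketch of what it buys us, assuming a 64-bit build on compute capability >= 2.0 hardware (buffer names and sizes are made up for illustration). With UVA the runtime can infer the copy direction from the pointer values alone:

    // uva_copy.cu - minimal UVA sketch (hypothetical buffer names/sizes)
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t n = 1 << 20;               // 1M doubles, arbitrary
        double *host_buf = NULL, *dev_buf = NULL;

        // pinned host allocation; under UVA it shares one address space with the GPU
        cudaMallocHost(&host_buf, n * sizeof(double));
        cudaMalloc(&dev_buf, n * sizeof(double));

        // with UVA the runtime works out the copy direction from the pointers
        cudaMemcpy(dev_buf, host_buf, n * sizeof(double), cudaMemcpyDefault);
        cudaMemcpy(host_buf, dev_buf, n * sizeof(double), cudaMemcpyDefault);

        // cudaPointerGetAttributes reports where a pointer lives
        cudaPointerAttributes attr;
        cudaPointerGetAttributes(&attr, dev_buf);
        printf("dev_buf lives on device %d\n", attr.device);

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        return 0;
    }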
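Re: the narrow-GEMM item - one possible workaround (our idea, not anything NVIDIA promised): when the right-hand operand is a single column, the GEMM is really a GEMV and can be issued as one. Function and variable names below are hypothetical:

    // narrow "GEMM": y = A * x issued as ZGEMV instead of an m x m x 1 ZGEMM
    #include <cublas_v2.h>
    #include <cuComplex.h>

    // A is m x m (column-major) and x is a single column, all on the device
    void apply_matrix(cublasHandle_t handle, int m,
                      const cuDoubleComplex *A, const cuDoubleComplex *x,
                      cuDoubleComplex *y)
    {
        const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
        const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

        // BLAS2 call: memory bound, but sidesteps the narrow-GEMM inefficiency
        cublasZgemv(handle, CUBLAS_OP_N, m, m, &one, A, m, x, 1, &zero, y, 1);
    }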
Quantum ESPRESSO GPU Information (from Filippo Spiga at GTC 2013)
- 2-3x speedup for small systems
- MPI + OpenMP + GPU
- tried on 2 GPUs per node
- for 8 CPUs + 8 GPUs, maybe run 8 MPI processes (not done by anybody; may run into a memory bottleneck)
- multiple jobs using the same GPU is OK
- espresso enforces its own memory management
- prints a warning if it runs out of memory
- relies on the faster kernels; for the slower kernels maybe use 1 MPI rank
- uses MAGMA for diagonalization
- matrices on the order of 10000 x 10000
- if the system is larger, need to switch to ScaLAPACK instead of MAGMA
- might have to disable some GPUs to maximize bandwidth (pick the ones with the most bandwidth) - no P2P (see the device-selection sketch after this list)
- the GPU code is a "plug-in"
- has support for "screenings"
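Re: picking the GPUs with the most bandwidth - a sketch of ranking the visible devices by theoretical memory bandwidth and binding to the best one; which GPUs actually get disabled would also depend on node topology, which this ignores:

    // pick_gpu.cu - rank visible GPUs by theoretical memory bandwidth (sketch)
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        int best = 0;
        double best_gbs = 0.0;
        for (int d = 0; d < ndev; ++d) {
            int clock_khz = 0, bus_bits = 0;
            cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, d);
            cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, d);
            // DDR: 2 transfers per clock; kHz * bus width in bytes -> GB/s
            double gbs = 2.0 * clock_khz * 1e3 * (bus_bits / 8.0) / 1e9;
            printf("GPU %d: %.1f GB/s theoretical\n", d, gbs);
            if (gbs > best_gbs) { best_gbs = gbs; best = d; }
        }

        cudaSetDevice(best);   // this rank now uses the highest-bandwidth GPU
        return 0;
    }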
future development possibilities:
- spin magnetization
- potential long-term gain: splitting off the real (gamma-point) case (breaks the structure of the code)
- porting the "phonon" portion of the code
- for PW: people look at EXX (max) but "closed source"?
Filippo Spiga, main person
Girotto
What We Learned at GTC 2012
- how large blocks fit onto the SM (esp. wrt shared memory)
- NERSC GPU cluster (Dirac) usable?
- email addresses of the FFT/ZHER guy
- occupancy spreadsheet:
  - registers per thread
  - threads per block
  - shared memory per block
- memory access latency:
  - shared memory: ~10 clock ticks
  - global memory: 400-800 clock ticks
- L1 cache / shared memory size can be traded off (16/48 kB; see the cache-config sketch at the end of this list)
- number of streams: 16 for compute capability 2.0 devices
- ZHER improves 50% in CUDA 5 ("enable atomics")
- can try GEAM or ZHERK instead of ZHER (see the ZHERK sketch at the end of this list)
- BLAS2 functions are memory bound, BLAS3 functions are compute bound
- can maybe get access to the ZHER source if we really need it
- 3D FFTs might be good
- UCLA GPU cluster available?
- OpenACC only available for certain compilers (PGI, Cray)
- new Kepler/CUDA 5 features:
  - dynamic parallelism
  - Hyper-Q for streams
  - slower clock speed, more cores
  - memory 5x faster (maybe because of more register space)
- startup time with multiple GPUs is a known problem
- nvvp is dying: replaced by the Eclipse-based tools
- much better information with Nsight
- VASP person says we will win with 100x100x100 grid
- use single precision for pre-conditioning
- read the "best-practices" manual to understand nvvp profiling
- unified virtual addressing (UVA) for transferring data
- P2P for multiple GPUs within a process (see the P2P/IPC sketch below)
- IPC for multiple GPUs between processes
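Re: the 16/48 kB L1/shared trade-off - it can be set device-wide or per kernel; minimal sketch (the kernel name is hypothetical):

    // cache_config.cu - trading L1 cache against shared memory (sketch)
    #include <cuda_runtime.h>

    __global__ void stencil_kernel(float *out, const float *in) { /* ... */ }

    void configure() {
        // whole-device default: prefer 48 kB shared / 16 kB L1
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // or per kernel: this one would rather have 48 kB of L1
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
    }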
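Re: GEAM/ZHERK instead of ZHER - the idea, as we understood it, is to batch k rank-1 Hermitian updates C += x_i * x_i^H into one rank-k ZHERK on an n x k matrix, moving the work from BLAS2 to BLAS3. Sketch with made-up names and leading dimensions:

    // zherk_instead_of_zher.cu - one rank-k update instead of k rank-1 updates (sketch)
    #include <cublas_v2.h>
    #include <cuComplex.h>

    // C (n x n, Hermitian, lower triangle stored) += X * X^H, with X n x k, all on the device
    void herk_update(cublasHandle_t handle, int n, int k,
                     const cuDoubleComplex *X, cuDoubleComplex *C)
    {
        const double alpha = 1.0;
        const double beta  = 1.0;   // accumulate into the existing C

        // BLAS3 call: compute bound, unlike k separate ZHER (BLAS2) calls
        cublasZherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                    n, k, &alpha, X, n, &beta, C, n);
    }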
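Re: P2P vs IPC - within one process, peer access is enabled between two devices and cudaMemcpyPeer (or plain cudaMemcpy under UVA) moves data GPU-to-GPU; between processes, a device pointer is exported/imported with cudaIpcGetMemHandle / cudaIpcOpenMemHandle. Minimal sketch of the single-process P2P half (device numbers 0 and 1 are assumptions):

    // p2p_sketch.cu - GPU-to-GPU copy within one process (sketch, assumes devices 0 and 1)
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 1 << 20;   // arbitrary buffer size
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        if (!can01 || !can10) { printf("no P2P path between GPUs 0 and 1\n"); return 1; }

        float *buf0 = NULL, *buf1 = NULL;
        cudaSetDevice(0);
        cudaMalloc(&buf0, bytes);
        cudaDeviceEnablePeerAccess(1, 0);   // device 0 may now access device 1's memory

        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);
        cudaDeviceEnablePeerAccess(0, 0);

        // device-to-device copy; goes directly over the bus when P2P is enabled,
        // otherwise the runtime stages it through the host
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }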