GPUs for GPAW

To-Do List

5/23/2012

understand x/c kernel bottleneck with nvvp (lin)
trying cufft to see what we gain (lin)
use VO as convergence test case (aj)
look at special-metric-weight convergence (aj)
think about FFT cutoff (aj)
GEAM, ZHERK (jun)
build on hopper and cpu/gpu (jun)
work on libxc (cpo)

5/9/2012

rpbe kernel (lin)
- does cudaMemcpyAsync need cudaMallocHost (pinned host memory)?
- fix stream behavior and try with 1,2,4,8,16 streams
- understand stream behaviour with nvvp
zher streams (jun)
- in benchmark, have separately variable nstream/nw
- can we see whether we have 4 or 16 streams?
- understand stream behaviour with nvvp
density mixing (aj)
work on libxc (cpo)

5/2/2012

looking at EXX bottleneck (rewriting) (jun)
use cuda streams for small RPA systems (jun)
libxc integration (cpo)
understand MKL benchmark (jun/cpo)
pycuda (cpo)
understand RPBE kernel: (lin)
- understand "double" problem
- vary np, block_size, nstreams
- loop testfunc many times
- longer term: look at jussi/samuli kernel for ideas

4/25/2012

looking at EXX bottleneck (rewriting) (jun)
postpone work on ZHER stuff until we have news from INCITE (jun)
talk to Frank about computing time applications (cpo)
understand MKL benchmark (jun/cpo)
libxc integration (cpo)

4/18/2012

look at reduced-scope libxc example plus RPBE (lin)
if there is time, benchmark the RPBE kernel (lin)
zher performance improvement with multiple streams (jun)
make INCITE version work (jun/cpo)
move to libxc 1.2 (cpo)

4/11/2012

libxc parallelization (lin)
libxc integration (cpo)
understand missing time in cublas mode (jun/cpo)
how to put the gemm in PW mode in a fairly neat way (lin/cpo)
start working on multiple-alpha kernel (MAZHER) (jun/cpo)
work on INCITE proposal (jun/cpo)

3/28/2012

gemm (lin)
run pt3x3 (cpo)
libxc (cpo, and lin if he finishes gemm)
cher/fft (jun)
fix gpu allocation (cpo)
circular dependency problem with monkhorst_pack (cpo)
mpi failure with cuzher (cpo)

3/21/2012

batch queue for GPU machine (cpo)
fft/gemm/gemv (lin/jun/cpo)
single precision cher instead of zher? (jun/cpo)
new libxc (cpo)
fix libfftw detection (cpo)
improve zher in cuda (long project, jun/cpo)
move "expand" from python into C, post to mailing list? (lin)
look at spin paired (cpo)
run pt3x3 (cpo)

3/14/2012

pycuda compatibility (cpo)
private svn (cpo)
try nvvp/transpose (or C60 with more grid points) for >5 minutes (lin)
send mail to nvidia or list to understand why nvvp profile cuts off after 5 minutes (lin)
understand bottleneck in get_wfs (jun)
implement fft/gemv (cpo)
is there a cuda library for trace like zgeev (cpo)
run a 3x3x3 system to see if bottlenecks stay the same (cpo)
driver hang status (cpo)
understand how to fix gs.py bottlenecks in more detail (lin/cpo) using gpaw profiler:
- pseudo density: density.py: self.calculate_pseudo_density(wfs) (cpo)
- projections: overlap.py: wfs.pt.integrate(psit_nG, P_ani, kpt.q) (cpo)
- RMM-DIIS: eigensolvers/rmm_diis.py: lots of lines (cpo)
- projections: eigensolvers/rmm_diis.py: wfs.pt.integrate(dpsit_xG, P_axi, kpt.q) (lin)
- calc_h_matrix: eigensolvers/eigensolver.py: H_nn = self.operator.calculate_matrix_elements, hamiltonian.xc.correct_hamiltonian_matrix (lin)
- rotate_psi: eigensolvers/eigensolver.py (lin)

Accessing suncatgpu01 SVN

We have put a version of GPAW in a local SVN repository on suncatgpu01. To access it, use the following:

svn co svn://localhost svngpaw

You can put whatever you want for the last argument (local directory name).

General Topics

Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
(Everyone) Understand gpaw (read paper)
- what other steps could we parallelize?
- Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
(Everyone) Go through CUDA tutorial here.
- Understand blocks/threads/warps and how they map onto GPU hardware (details of which can be seen with "deviceQuery" command)
(Lin) Find tool to measure:
- memory bandwidth usage
- gpu flops usage
(Jun) :
- Parallelize LCAO/planewave/RPA (zher performance?)? non-rmm-diis eigensolver?
- merge with trunk?
(cpo) :
- Understand code flow
- Understand where the ~23 cuda kernels are used
- Understand which bottlenecks we need to tackle
Do another gpu-gpaw install (to learn)
Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
Can multiple CPU processes win by using the same GPU?
Understand pycuda
Understand gpaw interface to cuda (c/cuda subdirectory)
Read CUDA programming manual here.
Do all gpaw self-tests pass with GPUs?
Can we get bigger bang-per-buck with GeForce instead of Tesla? (don’t need GPUDirect, maybe live with less memory/bandwidth? double precision worse)
Understand cuda better:
- Does Samuli use pinned memory correctly?
- run/understand a cuda bandwidth benchmark
- Could we use GPUdirect for MPI data transfer?
Does the GPU performance scale with the product of gridpoints*bands? Might be a combinatorial effect with the bands, linear with the grid points?
Duplicate Samuli results
Update to most recent version in svn
Understand where gpaw scaling maxes out for Pt 3x4x3
Why is CO on 2 GPUs slower than on 8 CPUs?
Can we do something less precise in the vacuum area? (fewer grid points?)
Do we need a fatter interconnect for GPUs?
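
One way to approach the gridpoints*bands scaling question above is to time runs with one quantity held fixed and fit the slope in log-log space. The sketch below assumes that approach. The function name scaling_exponent is hypothetical and the timings are synthetic, for illustration only, not measurements.

```python
import math


def scaling_exponent(sizes, times):
    """Least-squares slope of log(time) versus log(size).

    A slope near 1 suggests linear scaling, near 2 quadratic, and so on.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic timings for illustration only, not measurements:
# time grows linearly with grid points and quadratically with bands.
gridpoints = [32**3, 48**3, 64**3]
bands = [64, 128, 256]
t_vs_grid = [1e-6 * g for g in gridpoints]   # bands held fixed
t_vs_bands = [1e-4 * b**2 for b in bands]    # grid held fixed

if __name__ == "__main__":
    print("grid exponent: ", round(scaling_exponent(gridpoints, t_vs_grid), 2))
    print("bands exponent:", round(scaling_exponent(bands, t_vs_bands), 2))
```

With real timings, exponents that differ between the grid sweep and the band sweep would indicate that the cost is not simply proportional to the product.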

What We Learned at GTC

what we learned at gpu conference:

how large blocks fit onto the SM (esp. wrt shared memory)
nersc gpu cluster (dirac) usable?
email addresses of fft/zher guy
occupancy spreadsheet
- registers per thread
- threads per block
- shared memory per block
access:
- shared memory 10 clock ticks
- global memory: 400-800 clock ticks
L1-cache/shared memory size can be traded off (16/48kB)
number of streams: 16 for 2.0 devices
zher improves 50% in cuda 5 ("enable atomics")
can try GEAM or ZHERK instead of ZHER
blas2 functions memory bound, blas3 compute bound
can maybe get access to zher source if we really need it
3d ffts might be good
ucla gpu cluster available?
openACC only available for certain compilers (pgi, cray)
new kepler/cuda5 features:
- dynamic parallelism
- Hyper-Q for streams
- slower clock speed, more cores
- mem 5x faster (maybe because of more reg space)
startup time with multi-gpus is a known problem
nvvp is dying: eclipse
- much better information with nsight
VASP person says we will win with 100x100x100 grid
use single precision for pre-conditioning
read the "best-practices" manual to understand nvvp profiling
universal address space for transferring data
- P2P for multi-gpus within a process
- IPC for multi-gpus between processes
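
The occupancy spreadsheet inputs listed above (registers per thread, threads per block, shared memory per block) can be combined in a rough estimate like the sketch below. The function name occupancy is hypothetical. The default limits are assumed values for compute capability 2.0 devices, and the allocation granularity that the real spreadsheet accounts for is ignored.

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads_per_sm=1536, max_blocks_per_sm=8,
              regs_per_sm=32768, smem_per_sm=48 * 1024, warp_size=32):
    """Rough fraction of an SM's warp slots that can be resident at once.

    Defaults are assumed compute capability 2.0 limits; allocation
    granularity for registers and shared memory is ignored.
    """
    warps_per_block = -(-threads_per_block // warp_size)  # ceiling division
    max_warps_per_sm = max_threads_per_sm // warp_size

    # Each resource caps the number of blocks that fit on one SM.
    limits = [max_blocks_per_sm, max_warps_per_sm // warps_per_block]
    if regs_per_thread > 0:
        limits.append(regs_per_sm // (regs_per_thread * warps_per_block * warp_size))
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)

    blocks = min(limits)
    return blocks * warps_per_block / max_warps_per_sm


if __name__ == "__main__":
    # Same kernel shape with increasing register pressure.
    for regs in (20, 32, 63):
        print(regs, "regs/thread ->", round(occupancy(256, regs, 4096), 2))
    # Effect of choosing the 16 kB shared memory configuration.
    print("16 kB smem ->", round(occupancy(256, 20, 4096, smem_per_sm=16 * 1024), 2))
```

Passing smem_per_sm=16 * 1024 models the 16/48kB L1/shared memory trade-off mentioned in the list.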
