
To-Do List

3/14/2012
  • private svn (cpo)
  • run nvvp on the transpose case (or on C60 with more grid points) for more than 5 minutes (lin)
  • send mail to NVIDIA or the mailing list to understand why the nvvp profile cuts off after 5 minutes (lin)
  • understand the bottleneck in get_wfs (jun)
  • implement fft/gemv (cpo)
  • is there a CUDA library routine for trace, like zgeev? (cpo)
  • run a 3x3x3 system to see if bottlenecks stay the same (cpo)
  • driver hang status (cpo)
  • understand how to fix the gs.py bottlenecks in more detail (lin/cpo) using the gpaw profiler (see the profiling sketch after this list):
    • pseudo density: density.py: self.calculate_pseudo_density(wfs) (cpo)
    • projections: overlap.py: wfs.pt.integrate(psit_nG, P_ani, kpt.q) (cpo)
    • RMM-DIIS: eigensolvers/rmm_diis.py: lots of lines (cpo)
    • projections: eigensolvers/rmm_diis.py: wfs.pt.integrate(dpsit_xG, P_axi, kpt.q) (lin)
    • calc_h_matrix: eigensolvers/eigensolver.py: H_nn = self.operator.calculate_matrix_elements, hamiltonian.xc.correct_hamiltonian_matrix (lin)
    • rotate_psi: eigensolvers/eigensolver.py (lin)
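
As a starting point for the profiling items above, here is a minimal sketch that wraps a small GPAW ground-state run in Python's built-in cProfile. The H2 test system and calculator parameters are placeholders (not our benchmark systems), exact ASE calls may differ slightly between versions, and the gpaw-internal profiler may give finer-grained timers than this.

import cProfile
import pstats

from ase import Atoms
from gpaw import GPAW

# Placeholder test system -- swap in the real benchmark (e.g. the Pt slab).
atoms = Atoms('H2', positions=[(0, 0, 0), (0, 0, 0.74)])
atoms.center(vacuum=3.0)
atoms.calc = GPAW(h=0.2, txt='h2.txt')   # or atoms.set_calculator(...) on older ASE

profiler = cProfile.Profile()
profiler.enable()
atoms.get_potential_energy()   # runs the SCF loop (density, projections, RMM-DIIS, ...)
profiler.disable()

# Sort by cumulative time and look for calculate_pseudo_density, integrate,
# calculate_matrix_elements, rotate_psi, etc. from the list above.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)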

General Topics

  • Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
  • (Everyone) Understand gpaw (read paper)
    • what other steps could we parallelize?
    • Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
  • (Everyone) Go through CUDA tutorial here.
    • Understand blocks/threads/warps and how they map onto GPU hardware (details of which can be seen with the "deviceQuery" command; see the pycuda device-query sketch at the end of this page)
  • (Lin) Find a tool to measure:
    • memory bandwidth usage
    • GPU flops usage
  • (Jun):
    • Parallelize LCAO/planewave/RPA (zher performance?)? Non-RMM-DIIS eigensolver?
    • Merge with trunk?
  • (cpo):
    • Understand code flow
    • Understand where the ~23 cuda kernels are used
    • Understand which bottlenecks we need to tackle
  • Do another gpu-gpaw install (to learn)
  • Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
  • Can multiple CPU processes win by using the same GPU?
  • Understand pycuda (see the minimal kernel sketch at the end of this page)
  • Understand gpaw interface to cuda (c/cuda subdirectory)
  • Read CUDA programming manual here.
  • Do all gpaw self-tests pass with GPUs?
  • Can we get more bang for the buck with GeForce instead of Tesla? (We don't need GPUDirect and might live with less memory/bandwidth, but double-precision performance is worse.)
  • Understand CUDA better:
    • Does Samuli use pinned memory correctly?
    • run/understand a CUDA bandwidth benchmark (see the pinned-memory bandwidth sketch at the end of this page)
    • Could we use GPUDirect for MPI data transfer?
  • Does GPU performance scale with the product of grid points and bands? It might be combinatorial in the number of bands but linear in the number of grid points.
  • Duplicate Samuli results
  • Update to most recent version in svn
  • Understand where gpaw scaling maxes out for Pt 3x4x3
  • Why is CO on 2 GPUs slower than on 8 CPUs?
  • Can we do something less precise in the vacuum area? (fewer grid points?)
  • Do we need a fatter interconnect for GPUs?
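
For the blocks/threads/warps item above: a minimal sketch, assuming pycuda is installed (as it is for the GPU branch), that prints the same hardware limits "deviceQuery" reports, straight from Python.

import pycuda.driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    attrs = dev.get_attributes()
    cc = dev.compute_capability()
    print('Device %d: %s (compute capability %d.%d)' % (i, dev.name(), cc[0], cc[1]))
    print('  multiprocessors:     %d' % attrs[drv.device_attribute.MULTIPROCESSOR_COUNT])
    print('  warp size:           %d' % attrs[drv.device_attribute.WARP_SIZE])
    print('  max threads / block: %d' % attrs[drv.device_attribute.MAX_THREADS_PER_BLOCK])
    print('  total memory:        %d MB' % (dev.total_memory() // (1024 * 1024)))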
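
For the "understand pycuda" item: a minimal sketch of the compile-and-launch pattern (CUDA C source compiled at runtime, launched on data held in a GPUArray). The scale kernel is a made-up example, not one of the kernels in GPAW's c/cuda directory.

import numpy as np
import pycuda.autoinit                       # creates a context on GPU 0
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(double *a, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= factor;
}
""")
scale = mod.get_function('scale')

a = gpuarray.to_gpu(np.arange(1024, dtype=np.float64))
scale(a.gpudata, np.float64(2.0), np.int32(a.size),
      block=(256, 1, 1), grid=(4, 1))        # 4 blocks x 256 threads = 1024 elements
print(a.get()[:4])                           # -> [0. 2. 4. 6.]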
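
For the pinned-memory and bandwidth-benchmark items: a minimal sketch, again via pycuda, that times host-to-device copies from pageable versus page-locked (pinned) buffers. Buffer size and repeat count are arbitrary; compare the two numbers against the bandwidthTest figures from the CUDA SDK.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

nbytes = 64 * 1024 * 1024                    # 64 MB test buffer
gpu_buf = drv.mem_alloc(nbytes)

def htod_bandwidth(host_array, repeats=10):
    # Time host->device copies with CUDA events; returns GB/s.
    start, stop = drv.Event(), drv.Event()
    start.record()
    for _ in range(repeats):
        drv.memcpy_htod(gpu_buf, host_array)
    stop.record()
    stop.synchronize()
    seconds = start.time_till(stop) * 1e-3   # time_till() is in milliseconds
    return repeats * nbytes / seconds / 1e9

pageable = np.zeros(nbytes, dtype=np.uint8)
pinned = drv.pagelocked_empty(nbytes, dtype=np.uint8)
pinned[:] = 0

print('pageable: %.2f GB/s' % htod_bandwidth(pageable))
print('pinned:   %.2f GB/s' % htod_bandwidth(pinned))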