GPUs for GPAW

(Everyone) Understand GPAW (read the paper)
- What other steps could we parallelize?
- Can we improve the existing parallelization? (e.g. use ideas from Todd's GPU papers)
(Everyone) Go through CUDA tutorial here.
- Understand blocks/threads/warps and how they map onto the GPU hardware (the hardware details can be seen with the "deviceQuery" command)
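
As a rough aid for the blocks/threads/warps item, here is a minimal sketch of the arithmetic behind a 1-D kernel launch. The function name `launch_geometry` and the default of 256 threads per block are illustrative assumptions, and the warp size of 32 should be confirmed against the "deviceQuery" output for the actual card.

```python
import math

WARP_SIZE = 32  # assumed value; confirm with deviceQuery on the actual card


def launch_geometry(n_elements, threads_per_block=256):
    """Return (blocks, warps_per_block, idle_threads) for a 1-D launch.

    Hypothetical helper: one thread per element, blocks rounded up so
    that every element is covered.
    """
    blocks = math.ceil(n_elements / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Threads in the last block that have no element to work on.
    idle_threads = blocks * threads_per_block - n_elements
    return blocks, warps_per_block, idle_threads


if __name__ == "__main__":
    print(launch_geometry(1_000_000))
```

The idle-thread count is why kernels normally guard their body with a bounds check on the global index.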
(Lin) Find a tool to measure:
- memory bandwidth usage
- GPU FLOPS usage
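
Whichever tool is chosen, its numbers can be sanity-checked by hand from a kernel's timing. The sketch below shows the arithmetic only; the function names are hypothetical, and the byte counts, FLOP counts and peak values have to come from the kernel being measured and from the card's datasheet.

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Effective memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return (bytes_read + bytes_written) / seconds / 1e9


def achieved_gflops(flop_count, seconds):
    """Achieved floating-point rate in GFLOP/s."""
    return flop_count / seconds / 1e9


def fraction_of_peak(achieved, peak):
    """Fraction of the hardware peak actually reached (same units for both)."""
    return achieved / peak


if __name__ == "__main__":
    # Placeholder numbers, not measurements: a 128^3 grid of doubles,
    # read once and written once, in a hypothetical 1 ms kernel.
    n_bytes = 128**3 * 8
    print(effective_bandwidth_gbs(n_bytes, n_bytes, 1e-3))
```

Comparing the two fractions of peak indicates whether a kernel is limited by memory traffic or by arithmetic.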
(Jun):
- Parallelize LCAO/planewave/RPA?
- Merge with trunk?
(cpo):
- Understand code flow
Understand PyCUDA
Understand the GPAW interface to CUDA (c/cuda subdirectory)
Read the CUDA programming manual here.
Do all GPAW self-tests pass with GPUs?
Can we get more bang per buck with GeForce instead of Tesla? (We don’t need GPUDirect, and could maybe live with less memory/bandwidth.)
Understand CUDA better
- Does Samuli use pinned memory correctly?
- Run/understand a CUDA bandwidth benchmark
- Could we use GPUDirect for MPI data transfer?
Reproduce Samuli's results
Update to the most recent version in svn
Do another GPU GPAW install (to learn)
Understand where gpaw scaling maxes out for Pt 3x2x3
Understand Pt 3x4x3 CPU/GPU difference
Why is CO on 2 GPUs slower than on 8 CPUs?
Does GPU performance scale with the product gridpoints*bands? Or is it combinatorial in the bands and only linear in the gridpoints?
Can we do something less precise in the vacuum region? (fewer grid points?)
Do we need a fatter interconnect for GPUs?
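
One way to approach the gridpoints*bands scaling question above is to time pairs of runs that differ in only one of the two quantities and estimate the power-law exponent for each. This is a minimal sketch under that assumption; `scaling_exponent` and `synthetic_time` are hypothetical names, and the synthetic model (linear in gridpoints, quadratic in bands) is a placeholder used to exercise the arithmetic, not a measured GPAW result.

```python
import math


def scaling_exponent(x1, t1, x2, t2):
    """Exponent p in t ~ x**p, from two runs that differ only in x."""
    return math.log(t2 / t1) / math.log(x2 / x1)


def synthetic_time(gridpoints, bands):
    """Placeholder timing model, NOT a measurement: t = c * G * B**2."""
    return 1e-9 * gridpoints * bands**2


if __name__ == "__main__":
    # Vary gridpoints at fixed bands, then bands at fixed gridpoints.
    p_grid = scaling_exponent(10_000, synthetic_time(10_000, 64),
                              80_000, synthetic_time(80_000, 64))
    p_band = scaling_exponent(64, synthetic_time(10_000, 64),
                              256, synthetic_time(10_000, 256))
    print(p_grid, p_band)
```

With real timings, an exponent near 1 for both quantities would support scaling with the plain product, while a band exponent clearly above 1 would point to the superlinear behaviour suspected above.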