...
- Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
- (Everyone) Understand gpaw (read paper)
- What other steps could we parallelize?
- Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
- (Everyone) Go through CUDA tutorial here.
- Understand blocks/threads/warps and how they map onto the GPU hardware (details can be seen with the "deviceQuery" command; see the device-properties sketch after this list)
- (Lin) Find tool to measure:
- memory bandwidth usage
- gpu flops usage
- (Jun):
- Parallelize LCAO/plane-wave/RPA (zher performance?)? A non-RMM-DIIS eigensolver?
- Merge with trunk?
- (cpo):
- Understand code flow
- Understand where the ~23 cuda kernels are used
- Understand which bottlenecks we need to tackle
- Do another gpu-gpaw install (to learn)
- Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
- Can multiple CPU processes win by using the same GPU?
- Understand pycuda
- Understand gpaw interface to cuda (c/cuda subdirectory)
- Read CUDA programming manual here.
- Do all gpaw self-tests pass with GPUs?
- Can we get a bigger bang for the buck with GeForce instead of Tesla? (We don't need GPUDirect and could maybe live with less memory/bandwidth, but double-precision performance is worse.)
- Understand cuda better:
- Does Samuli use pinned memory correctly?
- Run/understand a CUDA bandwidth benchmark (see the pinned-memory bandwidth sketch after this list)
- Could we use GPUDirect for MPI data transfers? (see the CUDA-aware MPI sketch after this list)
- Does GPU performance scale with the product of grid points and bands? There might be a combinatorial effect with the bands and a linear one with the grid points.
- Duplicate Samuli's results
- Update to the most recent version in svn
- Understand where gpaw scaling maxes out for Pt 3x4x3
- Why is CO on 2 GPUs slower than on 8 CPUs?
- Can we do something less precise in the vacuum area? (fewer grid points?)
- Do we need a fatter interconnect for GPUs?
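
For the blocks/threads/warps item above: a minimal sketch (ours, not from the gpaw tree) that queries the same hardware limits "deviceQuery" prints, using the CUDA runtime API:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        for (int i = 0; i < ndev; i++) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            printf("device %d: %s (compute capability %d.%d)\n", i, p.name, p.major, p.minor);
            printf("  multiprocessors:         %d\n", p.multiProcessorCount);
            printf("  warp size:               %d\n", p.warpSize);
            printf("  max threads per block:   %d\n", p.maxThreadsPerBlock);
            printf("  shared memory per block: %zu bytes\n", p.sharedMemPerBlock);
            printf("  registers per block:     %d\n", p.regsPerBlock);
        }
        return 0;
    }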
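
For the pinned-memory/bandwidth items above: a hedged sketch of a host-to-device bandwidth comparison between pageable (malloc) and pinned (cudaMallocHost) buffers. The buffer size is an arbitrary choice for illustration, not what Samuli's code does:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Time one host-to-device copy and return the bandwidth in GB/s. */
    static float h2d_bandwidth(void *host, void *dev, size_t bytes)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return (bytes / 1e9f) / (ms / 1e3f);
    }

    int main(void)
    {
        const size_t bytes = 64 << 20;       /* 64 MB test buffer */
        void *dev, *pageable, *pinned;
        cudaMalloc(&dev, bytes);
        pageable = malloc(bytes);
        cudaMallocHost(&pinned, bytes);      /* page-locked (pinned) allocation */
        printf("pageable H2D: %.1f GB/s\n", h2d_bandwidth(pageable, dev, bytes));
        printf("pinned   H2D: %.1f GB/s\n", h2d_bandwidth(pinned, dev, bytes));
        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(dev);
        return 0;
    }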
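
For the GPUDirect/MPI question above: with an MPI library built with CUDA support (e.g. a CUDA-aware Open MPI build), device pointers can be passed straight to MPI calls and the library can use GPUDirect underneath. A minimal sketch under that assumption; this is not how gpaw currently moves data:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;
        double *d_buf;
        cudaMalloc(&d_buf, n * sizeof(double));

        /* With a CUDA-aware MPI the device pointer can be passed directly;
           otherwise we would have to cudaMemcpy to a host buffer first. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }
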
What We Learned at GTC
- How large blocks fit onto an SM (especially with respect to shared memory)
- Is the NERSC GPU cluster (Dirac) usable?
- Email addresses of the FFT/zher contacts
- Occupancy spreadsheet inputs (see the kernel-attributes sketch after this list):
- registers per thread
- threads per block
- shared memory per block
- memory access latency:
- shared memory: ~10 clock cycles
- global memory: ~400-800 clock cycles
- L1 cache/shared memory sizes can be traded off (16/48 kB split; see the cache-config sketch after this list)
- number of concurrent streams: 16 for compute-capability-2.0 devices
- zher improves ~50% in CUDA 5 ("enable atomics")
- can try GEAM or ZHERK instead of ZHER (see the ZHERK sketch after this list)
- BLAS2 functions are memory bound, BLAS3 functions are compute bound
- can maybe get access to zher source if we really need it
- 3D FFTs might be good
- Is the UCLA GPU cluster available?
- OpenACC is only available for certain compilers (PGI, Cray)
- new Kepler/CUDA 5 features:
- dynamic parallelism
- Hyper-Q for streams
- slower clock speed, more cores
- memory 5x faster (maybe because of more register space)
- startup time with multiple GPUs is a known problem
- nvvp is on its way out in favor of the Eclipse-based tools
- Nsight gives much better information
- universal (unified) address space for transferring data
- P2P for multiple GPUs within a process (see the P2P sketch after this list)
- IPC for multiple GPUs between processes
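
For the occupancy-spreadsheet item above: the per-kernel numbers it asks for can be read at run time with cudaFuncGetAttributes (or at compile time with nvcc -Xptxas -v). A sketch with a hypothetical kernel standing in for one of the gpaw kernels:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Hypothetical kernel standing in for one of the gpaw CUDA kernels. */
    __global__ void example_kernel(double *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= 2.0;
    }

    int main(void)
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, example_kernel);
        printf("registers per thread:        %d\n", attr.numRegs);
        printf("static shared mem per block: %zu bytes\n", attr.sharedSizeBytes);
        printf("max threads per block:       %d\n", attr.maxThreadsPerBlock);
        return 0;
    }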
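
For the L1/shared-memory trade-off item: the 16/48 kB split is chosen per device or per kernel. A minimal sketch of both calls; the kernel name is hypothetical:

    #include <cuda_runtime.h>

    /* Hypothetical stencil kernel; body is a placeholder. */
    __global__ void stencil_kernel(double *x)
    {
        if (threadIdx.x == 0)
            x[0] = 0.0;
    }

    int main(void)
    {
        /* Whole-device preference: give shared memory 48 kB and L1 16 kB. */
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        /* Or override per kernel: this one prefers the 48 kB L1 cache instead. */
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
        return 0;
    }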
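
For the GEAM/ZHERK item: the idea is to replace k rank-1 ZHER updates (BLAS2, memory bound) with one rank-k ZHERK update (BLAS3, compute bound) over the same k vectors packed as columns of a matrix. A hedged cuBLAS sketch; the sizes and names are placeholders, not gpaw's actual call sites:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    /* Hypothetical sizes: n-element vectors, k of them batched into one call.
       d_X holds the k vectors as columns (n x k), d_C is the n x n Hermitian matrix. */
    void rank_k_update(cublasHandle_t handle, cuDoubleComplex *d_X, int n, int k,
                       cuDoubleComplex *d_C)
    {
        const double alpha = 1.0, beta = 1.0;

        /* k separate ZHER calls (BLAS2) would look like:
           for (int j = 0; j < k; j++)
               cublasZher(handle, CUBLAS_FILL_MODE_LOWER, n, &alpha,
                          d_X + (size_t)j * n, 1, d_C, n);
           One ZHERK call (BLAS3) over the same k columns: */
        cublasZherk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, n, k,
                    &alpha, d_X, n, &beta, d_C, n);
    }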
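
For the P2P item: within one process, peer access lets one GPU copy directly from another without staging on the host. A minimal sketch assuming two devices that report peer capability:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int can01 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);   /* can GPU 0 read GPU 1's memory? */
        if (!can01) {
            printf("no P2P between devices 0 and 1\n");
            return 0;
        }

        const size_t bytes = 1 << 20;
        double *d0, *d1;
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        /* flags argument must be 0 */
        cudaMalloc(&d0, bytes);
        cudaSetDevice(1);
        cudaMalloc(&d1, bytes);

        /* Copy directly between the two GPUs, without staging on the host. */
        cudaMemcpyPeer(d0, 0, d1, 1, bytes);
        return 0;
    }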