
To-Do List

3/14/2012
  • private svn (cpo)
  • run nvvp on the transpose case (or on C60 with more grid points) for more than 5 minutes (lin)
  • send mail to NVIDIA or the mailing list to understand why the nvvp profile cuts off after 5 minutes (lin)
  • understand the bottleneck in get_wfs (jun)
  • implement fft/gemv (cpo)
  • is there a CUDA library routine for trace, like zgeev? (cpo)
  • run a 3x3x3 system to see if bottlenecks stay the same (cpo)
  • driver hang status (cpo)
  • understand how to fix the gs.py bottlenecks in more detail (lin/cpo) using the gpaw profiler (see the profiling sketch after this list):
    • pseudo density: density.py: self.calculate_pseudo_density(wfs) (cpo)
    • projections: overlap.py: wfs.pt.integrate(psit_nG, P_ani, kpt.q) (cpo)
    • RMM-DIIS: eigensolvers/rmm_diis.py: lots of lines (cpo)
    • projections: eigensolvers/rmm_diis.py: wfs.pt.integrate(dpsit_xG, P_axi, kpt.q) (lin)
    • calc_h_matrix: eigensolvers/eigensolver.py: H_nn = self.operator.calculate_matrix_elements, hamiltonian.xc.correct_hamiltonian_matrix (lin)
    • rotate_psi: eigensolvers/eigensolver.py (lin)
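
As a starting point for the profiling items above, here is a minimal sketch that wraps a small GPAW ground-state run in Python's built-in cProfile. The H2 test system and calculator parameters are placeholders (not our benchmark systems), exact ASE calls may differ slightly between versions, and the gpaw-internal profiler may give finer-grained timers than this.

import cProfile
import pstats

from ase import Atoms
from gpaw import GPAW

# Placeholder test system -- swap in the real benchmark (e.g. the Pt slab).
atoms = Atoms('H2', positions=[(0, 0, 0), (0, 0, 0.74)])
atoms.center(vacuum=3.0)
atoms.calc = GPAW(h=0.2, txt='h2.txt')   # or atoms.set_calculator(...) on older ASE

profiler = cProfile.Profile()
profiler.enable()
atoms.get_potential_energy()   # runs the SCF loop (density, projections, RMM-DIIS, ...)
profiler.disable()

# Sort by cumulative time and look for calculate_pseudo_density, integrate,
# calculate_matrix_elements, rotate_psi, etc. from the list above.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)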

General Topics

  • Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
  • (Everyone) Understand gpaw (read paper)
    • what other steps could we parallelize?
    • Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
  • (Everyone) Go through CUDA tutorial here.
    • Understand blocks/threads/warps and how they map onto GPU hardware (details of which can be seen with the "deviceQuery" command; see the pycuda device-query sketch at the end of this page)
  • (Lin) Find a tool to measure:
    • memory bandwidth usage
    • GPU flops usage
  • (Jun):
    • Parallelize LCAO/planewave/RPA (zher performance?)? Non-RMM-DIIS eigensolver?
    • Merge with trunk?
  • (cpo):
    • Understand code flow
    • Understand where the ~23 cuda kernels are used
    • Understand which bottlenecks we need to tackle
  • Do another gpu-gpaw install (to learn)
  • Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
  • Can multiple CPU processes win by using the same GPU?
  • Understand pycuda (see the minimal kernel sketch at the end of this page)
  • Understand gpaw interface to cuda (c/cuda subdirectory)
  • Read CUDA programming manual here.
  • Do all gpaw self-tests pass with GPUs?
  • Can we get more bang for the buck with GeForce instead of Tesla? (We don't need GPUDirect and might live with less memory/bandwidth, but double-precision performance is worse.)
  • Understand CUDA better:
    • Does Samuli use pinned memory correctly?
    • run/understand a CUDA bandwidth benchmark (see the pinned-memory bandwidth sketch at the end of this page)
    • Could we use GPUDirect for MPI data transfer?
  • Does GPU performance scale with the product of grid points and bands? It might be combinatorial in the number of bands but linear in the number of grid points.
  • Duplicate Samuli results
  • Update to most recent version in svn
  • Understand where gpaw scaling maxes out for Pt 3x4x3
  • Why is CO on 2 GPUs slower than on 8 CPUs?
  • Can we do something less precise in the vacuum area? (fewer grid points?)
  • Do we need a fatter interconnect for GPUs?
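
For the blocks/threads/warps item above: a minimal sketch, assuming pycuda is installed (as it is for the GPU branch), that prints the same hardware limits "deviceQuery" reports, straight from Python.

import pycuda.driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    attrs = dev.get_attributes()
    cc = dev.compute_capability()
    print('Device %d: %s (compute capability %d.%d)' % (i, dev.name(), cc[0], cc[1]))
    print('  multiprocessors:     %d' % attrs[drv.device_attribute.MULTIPROCESSOR_COUNT])
    print('  warp size:           %d' % attrs[drv.device_attribute.WARP_SIZE])
    print('  max threads / block: %d' % attrs[drv.device_attribute.MAX_THREADS_PER_BLOCK])
    print('  total memory:        %d MB' % (dev.total_memory() // (1024 * 1024)))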
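
For the "understand pycuda" item: a minimal sketch of the compile-and-launch pattern (CUDA C source compiled at runtime, launched on data held in a GPUArray). The scale kernel is a made-up example, not one of the kernels in GPAW's c/cuda directory.

import numpy as np
import pycuda.autoinit                       # creates a context on GPU 0
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(double *a, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= factor;
}
""")
scale = mod.get_function('scale')

a = gpuarray.to_gpu(np.arange(1024, dtype=np.float64))
scale(a.gpudata, np.float64(2.0), np.int32(a.size),
      block=(256, 1, 1), grid=(4, 1))        # 4 blocks x 256 threads = 1024 elements
print(a.get()[:4])                           # -> [0. 2. 4. 6.]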
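
For the pinned-memory and bandwidth-benchmark items: a minimal sketch, again via pycuda, that times host-to-device copies from pageable versus page-locked (pinned) buffers. Buffer size and repeat count are arbitrary; compare the two numbers against the bandwidthTest figures from the CUDA SDK.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

nbytes = 64 * 1024 * 1024                    # 64 MB test buffer
gpu_buf = drv.mem_alloc(nbytes)

def htod_bandwidth(host_array, repeats=10):
    # Time host->device copies with CUDA events; returns GB/s.
    start, stop = drv.Event(), drv.Event()
    start.record()
    for _ in range(repeats):
        drv.memcpy_htod(gpu_buf, host_array)
    stop.record()
    stop.synchronize()
    seconds = start.time_till(stop) * 1e-3   # time_till() is in milliseconds
    return repeats * nbytes / seconds / 1e9

pageable = np.zeros(nbytes, dtype=np.uint8)
pinned = drv.pagelocked_empty(nbytes, dtype=np.uint8)
pinned[:] = 0

print('pageable: %.2f GB/s' % htod_bandwidth(pageable))
print('pinned:   %.2f GB/s' % htod_bandwidth(pinned))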