To-Do List

5/2/2012
  • looking at EXX bottleneck (rewriting) (jun)
  • use cuda streams for small systems (jun); see the streams sketch below
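
A minimal sketch of the cuda-streams idea for small systems, in host C: when each piece of work is too small to fill the GPU, queueing independent chunks on separate streams lets transfers (and kernels) overlap instead of serializing. Stream count and buffer sizes are placeholders, and the per-chunk kernel launch is only indicated by a comment.

/* Hedged sketch: overlap independent small chunks of work with CUDA streams.
   Buffer sizes and stream count are placeholders. */
#include <cuda_runtime.h>
#include <stdio.h>

#define NSTREAMS 4

int main(void)
{
    cudaStream_t streams[NSTREAMS];
    float *d_buf[NSTREAMS], *h_buf[NSTREAMS];
    size_t nbytes = 1 << 20;   /* small per-chunk payload */
    int i;

    for (i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&d_buf[i], nbytes);
        /* pinned host memory is required for truly asynchronous copies */
        cudaHostAlloc((void **)&h_buf[i], nbytes, cudaHostAllocDefault);
    }

    /* each stream gets its own copy-in / compute / copy-out pipeline,
       so independent chunks can overlap on the device */
    for (i = 0; i < NSTREAMS; i++) {
        cudaMemcpyAsync(d_buf[i], h_buf[i], nbytes,
                        cudaMemcpyHostToDevice, streams[i]);
        /* launch the per-chunk kernel on streams[i] here */
        cudaMemcpyAsync(h_buf[i], d_buf[i], nbytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();

    for (i = 0; i < NSTREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_buf[i]);
        cudaFreeHost(h_buf[i]);
    }
    printf("queued %d overlapping chunks\n", NSTREAMS);
    return 0;
}
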
4/25/2012
  • looking at EXX bottleneck (rewriting) (jun)
  • postpone work on ZHER stuff until we have news from INCITE (jun)
  • talk to Frank about computing time applications (cpo)
  • understand MKL benchmark (jun/cpo)
  • libxc integration (cpo); see the libxc/RPBE sketch below
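
A minimal sketch of the libxc integration from the C side: initialize the RPBE exchange functional and evaluate energy density and potential on a few points. This assumes the array-based libxc interface of the 1.2 era (xc_gga_exc_vxc over np points); the density and gradient values are made up, and the exact signature has shifted between libxc releases.

/* Hedged sketch: evaluate the RPBE exchange functional on a few density
   points through libxc's C interface. Assumes the array-based API of the
   libxc 1.2 era; the input values are placeholders. */
#include <stdio.h>
#include <xc.h>
#include <xc_funcs.h>

int main(void)
{
    xc_func_type func;
    double rho[3]   = {0.1, 0.2, 0.3};     /* electron density n(r) */
    double sigma[3] = {0.01, 0.02, 0.03};  /* |grad n|^2 */
    double exc[3], vrho[3], vsigma[3];
    int i;

    if (xc_func_init(&func, XC_GGA_X_RPBE, XC_UNPOLARIZED) != 0) {
        fprintf(stderr, "functional init failed\n");
        return 1;
    }

    /* energy density per particle and its derivatives at 3 grid points */
    xc_gga_exc_vxc(&func, 3, rho, sigma, exc, vrho, vsigma);
    xc_func_end(&func);

    for (i = 0; i < 3; i++)
        printf("rho=%g  exc=%g  vrho=%g\n", rho[i], exc[i], vrho[i]);
    return 0;
}
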
4/18/2012
  • look at reduced-scope libxc example plus RPBE (lin)
  • if there is time, benchmark the RPBE kernel (lin)
  • zher performance improvement with multiple streams (jun); see the multi-stream ZHER sketch after this list
  • make INCITE version work (jun/cpo)
  • move to libxc 1.2 (cpo)
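
For the multi-stream ZHER item above, a sketch (cuBLAS v2 API) of queueing several independent Hermitian rank-1 updates on separate streams so small updates can overlap on the device. Matrix size, fill mode, and data are placeholders, not GPAW's actual layout.

/* Hedged sketch: overlap several small ZHER rank-1 updates by binding each
   cuBLAS call to its own stream. Sizes and data are placeholders. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>

#define NSTREAMS 4

int main(void)
{
    const int n = 256;                  /* small matrix: one ZHER cannot fill the GPU */
    cublasHandle_t handle;
    cudaStream_t streams[NSTREAMS];
    cuDoubleComplex *d_A[NSTREAMS], *d_x[NSTREAMS];
    const double alpha = 1.0;           /* ZHER takes a real alpha */
    int i;

    cublasCreate(&handle);
    for (i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&d_A[i], (size_t)n * n * sizeof(cuDoubleComplex));
        cudaMalloc((void **)&d_x[i], (size_t)n * sizeof(cuDoubleComplex));
        cudaMemset(d_A[i], 0, (size_t)n * n * sizeof(cuDoubleComplex));
        cudaMemset(d_x[i], 0, (size_t)n * sizeof(cuDoubleComplex));
    }

    /* A_i += alpha * x_i * x_i^H, each update queued on its own stream */
    for (i = 0; i < NSTREAMS; i++) {
        cublasSetStream(handle, streams[i]);
        cublasZher(handle, CUBLAS_FILL_MODE_LOWER, n, &alpha,
                   d_x[i], 1, d_A[i], n);
    }
    cudaDeviceSynchronize();

    for (i = 0; i < NSTREAMS; i++) {
        cudaStreamDestroy(streams[i]);
        cudaFree(d_A[i]);
        cudaFree(d_x[i]);
    }
    cublasDestroy(handle);
    printf("queued %d zher updates\n", NSTREAMS);
    return 0;
}

This only helps if the individual updates are small enough that a single ZHER cannot saturate the device on its own.
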
4/11/2012
  • libxc parallelization (lin)
  • libxc integration (cpo)
  • understand missing time in cublas mode (jun/cpo)
  • how to put the gemm in PW mode in a fairly neat way (lin/cpo)
  • start working on multiple-alpha kernel (MAZHER) (jun/cpo); see the ZGEMM-based sketch after this list
  • work on INCITE proposal (jun/cpo)
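
For the multiple-alpha kernel (MAZHER) item above, one candidate formulation (an assumption, not a decided design): since sum_i alpha_i * x_i * x_i^H = (X * diag(alpha)) * X^H for real alpha_i, the whole batch of rank-1 updates collapses into a single ZGEMM after scaling the columns of X. A sketch with made-up sizes:

/* Hedged sketch: a "multiple-alpha ZHER",
       A += sum_i alpha_i * x_i * x_i^H,
   written as one ZGEMM: scale column i of X by alpha_i, then A += Y * X^H.
   Sizes and data are placeholders. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1000;   /* vector length (grid points / plane waves) */
    const int k = 64;     /* number of vectors (bands) */
    cublasHandle_t handle;
    cuDoubleComplex *d_X, *d_Y, *d_A;
    const cuDoubleComplex one = make_cuDoubleComplex(1.0, 0.0);
    double alpha[64];
    int i;

    for (i = 0; i < k; i++)
        alpha[i] = 1.0 / (i + 1);       /* per-band weights, made up */

    cublasCreate(&handle);
    cudaMalloc((void **)&d_X, (size_t)n * k * sizeof(cuDoubleComplex));
    cudaMalloc((void **)&d_Y, (size_t)n * k * sizeof(cuDoubleComplex));
    cudaMalloc((void **)&d_A, (size_t)n * n * sizeof(cuDoubleComplex));
    cudaMemset(d_X, 0, (size_t)n * k * sizeof(cuDoubleComplex));
    cudaMemset(d_A, 0, (size_t)n * n * sizeof(cuDoubleComplex));

    /* Y = X, then scale column i of Y by alpha_i */
    cudaMemcpy(d_Y, d_X, (size_t)n * k * sizeof(cuDoubleComplex),
               cudaMemcpyDeviceToDevice);
    for (i = 0; i < k; i++)
        cublasZdscal(handle, n, &alpha[i], d_Y + (size_t)i * n, 1);

    /* A += Y * X^H : one big ZGEMM instead of k small ZHER calls */
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_C, n, n, k,
                &one, d_Y, n, d_X, n, &one, d_A, n);
    cudaDeviceSynchronize();

    cudaFree(d_X); cudaFree(d_Y); cudaFree(d_A);
    cublasDestroy(handle);
    return 0;
}

Whether this beats a dedicated MAZHER kernel will depend on the batch size; it at least gives a baseline to benchmark the looped cublasZher calls against.
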
3/28/2012
  • gemm (lin)
  • run pt3x3 (cpo)
  • libxc (cpo, and lin if he finishes gemm)
  • cher/fft (jun)
  • fix gpu allocation (cpo)
  • circular dependency problem with monkhorst_pack (cpo)
  • mpi failure with cuzher (cpo)
3/21/2012
  • batch queue for GPU machine (cpo)
  • fft/gemm/gemv (lin/jun/cpo)
  • single precision cher instead of zher? (jun/cpo)
  • new libxc (cpo)
  • fix libfftw detection (cpo)
  • improve zher in cuda (long project, jun/cpo)
  • move "expand" from python into C, post to mailing list? (lin)
  • look at spin paired (cpo)
  • run pt3x3 (cpo)
3/14/2012
  • pycuda compatibility (cpo)
  • private svn (cpo)
  • try nvvp/transpose (or C60 with more grid points) for >5 minutes (lin)
  • send mail to nvidia or the mailing list to understand why the nvvp profile cuts off after 5 minutes (lin)
  • understand bottleneck in get_wfs (jun)
  • implement fft/gemv (cpo); see the cuFFT/ZGEMV sketch after this list
  • is there a cuda library for trace like zgeev (cpo)
  • run a 3x3x3 system to see if bottlenecks stay the same (cpo)
  • driver hang status (cpo)
  • understand how to fix gs.py bottlenecks in more detail using the gpaw profiler (lin/cpo):
    • pseudo density: density.py: self.calculate_pseudo_density(wfs) (cpo)
    • projections: overlap.py: wfs.pt.integrate(psit_nG, P_ani, kpt.q) (cpo)
    • RMM-DIIS: eigensolvers/rmm_diis.py: lots of lines (cpo)
    • projections: eigensolvers/rmm_diis.py: wfs.pt.integrate(dpsit_xG, P_axi, kpt.q) (lin)
    • calc_h_matrix: eigensolvers/eigensolver.py: H_nn = self.operator.calculate_matrix_elements, hamiltonian.xc.correct_hamiltonian_matrix (lin)
    • rotate_psi: eigensolvers/eigensolver.py (lin)
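
For the fft/gemv item above, a sketch of the two building blocks in host C: a 3D cuFFT of one band followed by its projections computed as a single cublasZgemv against a matrix of stored projector vectors. Grid size, projector count, and data layout are placeholders; production code would batch bands and use ZGEMM.

/* Hedged sketch: 3D FFT of one band plus projections as a matrix-vector
   product. Sizes and layout are placeholders. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cufft.h>

int main(void)
{
    const int nx = 32, ny = 32, nz = 32;       /* real-space grid */
    const int ngrid = nx * ny * nz;
    const int nproj = 16;                      /* number of projector functions */
    cufftHandle plan;
    cublasHandle_t handle;
    cufftDoubleComplex *d_psi;                 /* one band on the grid */
    cuDoubleComplex *d_proj, *d_P;             /* projectors (ngrid x nproj), result */
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    cudaMalloc((void **)&d_psi,  (size_t)ngrid * sizeof(cufftDoubleComplex));
    cudaMalloc((void **)&d_proj, (size_t)ngrid * nproj * sizeof(cuDoubleComplex));
    cudaMalloc((void **)&d_P,    (size_t)nproj * sizeof(cuDoubleComplex));
    cudaMemset(d_psi,  0, (size_t)ngrid * sizeof(cufftDoubleComplex));
    cudaMemset(d_proj, 0, (size_t)ngrid * nproj * sizeof(cuDoubleComplex));

    /* in-place 3D FFT of the band */
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_FORWARD);

    /* projections P = proj^H * psi (the conjugate transpose gives <p_i|psi>) */
    cublasCreate(&handle);
    cublasZgemv(handle, CUBLAS_OP_C, ngrid, nproj,
                &one, d_proj, ngrid, (cuDoubleComplex *)d_psi, 1,
                &zero, d_P, 1);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cublasDestroy(handle);
    cudaFree(d_psi); cudaFree(d_proj); cudaFree(d_P);
    return 0;
}
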

Accessing suncatgpu01 SVN

We have put a version of GPAW in a local SVN repository on suncatgpu01. To access it, use the following:

svn co svn://localhost svngpaw

You can put whatever you want for the last argument (local directory name).

General Topics

  • Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
  • (Everyone) Understand gpaw (read paper)
    • what other steps could we parallelize?
    • Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
  • (Everyone) Go through CUDA tutorial here.
    • Understand blocks/threads/warps and how they map onto GPU hardware (details can be seen with the "deviceQuery" command)
  • (Lin) Find tool to measure:
    • memory bandwidth usage
    • gpu flops usage
  • (Jun) :
    • Parallelize LCAO/planewave/RPA (zher performance?)? non-rmm-diis eigensolver?
    • merge with trunk?
  • (cpo) :
    • Understand code flow
    • Understand where the ~23 cuda kernels are used
    • Understand which bottlenecks we need to tackle
  • Do another gpu-gpaw install (to learn)
  • Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
  • Can multiple CPU processes win by using the same GPU?
  • Understand pycuda
  • Understand gpaw interface to cuda (c/cuda subdirectory)
  • Read CUDA programming manual here.
  • Do all gpaw self-tests pass with GPUs?
  • Can we get bigger bang-per-buck with GeForce instead of Tesla? (we don't need GPUDirect and could maybe live with less memory/bandwidth, but double precision is worse)
  • Understand cuda better:
    • Does Samuli use pinned memory correctly?
    • run/understand a cuda bandwidth benchmark (see the bandwidth sketch at the end of this list)
    • Could we use GPUDirect for MPI data transfer?
  • Does GPU performance scale with the product gridpoints*bands? It might scale combinatorially with the bands and linearly with the grid points.
  • Duplicate Samuli results
  • Update to most recent version in svn
  • Understand where gpaw scaling maxes out for Pt 3x4x3
  • Why is CO on 2 GPUs slower than on 8 CPUs?
  • Can we do something less precise in the vacuum area? (fewer grid points?)
  • Do we need a fatter interconnect for GPUs?
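
For the pinned-memory and bandwidth-benchmark items above, a minimal host-to-device bandwidth check in the spirit of the CUDA SDK bandwidthTest, comparing a pageable malloc buffer with a pinned cudaHostAlloc buffer. Buffer size and repeat count are arbitrary.

/* Hedged sketch: compare pageable vs pinned host-to-device copy bandwidth. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static float copy_bandwidth_gbs(const void *src, void *dst, size_t nbytes, int reps)
{
    cudaEvent_t start, stop;
    float ms;
    int i;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (i = 0; i < reps; i++)
        cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)nbytes * reps / (ms * 1e6f);   /* GB/s */
}

int main(void)
{
    const size_t nbytes = 64 << 20;   /* 64 MB */
    const int reps = 20;
    void *pageable = malloc(nbytes);
    void *pinned, *device;

    memset(pageable, 0, nbytes);
    cudaHostAlloc(&pinned, nbytes, cudaHostAllocDefault);   /* pinned host buffer */
    cudaMalloc(&device, nbytes);

    printf("pageable H->D: %.1f GB/s\n",
           copy_bandwidth_gbs(pageable, device, nbytes, reps));
    printf("pinned   H->D: %.1f GB/s\n",
           copy_bandwidth_gbs(pinned, device, nbytes, reps));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(device);
    return 0;
}

Comparing the two numbers is also a quick way to check whether pinned memory is being used where it matters (cf. the Samuli question above).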