To-Do List
10/23/2012
- libxc on gpu (lin)
- put libxc in samuli branch at "low-level" (libxc.py?)
- multi-alpha zher, at a lower priority (jun)
- reduce registers? prefetch?
- explore the parameter space: tile-size
- paper (jun)
- try calling dacapo density mixing from GPAW (cpo)
- get journal recommendations from Nichols (cpo)
10/4/2012
- libxc on gpu (lin)
- PBEsol-X
- put libxc in samuli branch at "low-level" (libxc.py?)
- solve the zeroing problem and the stride problem
- multi-alpha zher (jun)
- reduce registers? prefetch?
- explore the parameter space: tile-size
- paper (jun)
- create infrastructure for running convergence tests (aj)
- try calling dacapo density mixing from GPAW (cpo)
9/25/2012
- libxc on gpu (lin)
- test PBEsol
- clean up existing code (delete commented-out lines, unused code)
- put in p_d_gga and p_d_mgga, for consistency
- have one beautiful program that runs an LDA/GGA/MGGA functional on both CPU and GPU and times them (see the timing sketch below)
- think about integrating with samuli
- multi-alpha zher (jun)
- reduce registers? prefetch?
- explore the parameter space: tile-size
- paper (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- work on understanding jacapo density mixing
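A minimal sketch of the CPU side of that one timing program, assuming the vectorized libxc entry point xc_gga_exc_vxc (the batched signature of later libxc releases; the 1.x point-wise API differs) and a hypothetical grid size NP:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <xc.h>

  #define NP 1000000   /* hypothetical grid size; the real program would take GPAW's grid */

  int main(void)
  {
      double *rho    = (double*)malloc(NP * sizeof(double));
      double *sigma  = (double*)malloc(NP * sizeof(double));
      double *exc    = (double*)malloc(NP * sizeof(double));
      double *vrho   = (double*)malloc(NP * sizeof(double));
      double *vsigma = (double*)malloc(NP * sizeof(double));
      for (int i = 0; i < NP; i++) { rho[i] = 0.1; sigma[i] = 0.01; }

      xc_func_type f;
      xc_func_init(&f, XC_GGA_X_PBE, XC_UNPOLARIZED);

      clock_t t0 = clock();
      xc_gga_exc_vxc(&f, NP, rho, sigma, exc, vrho, vsigma);
      printf("PBE-X on %d points: %.3f s (CPU)\n", NP,
             (double)(clock() - t0) / CLOCKS_PER_SEC);

      xc_func_end(&f);
      return 0;
  }

The GPU side would run the ported kernel on the same arrays and report both the wall time and the maximum deviation from the CPU result.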
9/18/2012
- libxc on gpu (lin)
- focus on tpss_x (summarize pattern for moving functional to gpu)
- ask samuli if there are functionals he would like us to move?
- figure out how to get nested param-size (will change "p" struct for this; in general it would be a function to deep-copy params; see the sketch below)
- figure out how to get p_d into the functional (will change "p" struct for this)
- kinetic functionals
- understand PBE instruction replays and constant memory
- think about cleanup of p
- summarize pattern for moving functional to gpu
- better pattern for p_d?
- think about integrating with samuli
- multi-alpha zher (jun)
- reduce registers? prefetch?
- explore the parameter space: tile-size
- paper (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- work on understanding jacapo density mixing
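On the p/p_d deep-copy question above, a minimal sketch assuming "p" is a struct with one nested allocation (names are hypothetical, not libxc's real layout): copy the nested data first, patch the pointer in a host-side staging copy, then copy the struct itself. A small read-only parameter block could instead live in __constant__ memory, which also bears on the instruction-replay item.

  #include <cuda_runtime.h>

  /* Hypothetical stand-in for the functional parameter struct "p":
     a flat memcpy of it would leave coef dangling on the device. */
  typedef struct {
      int n_coef;
      double *coef;
  } params_t;

  params_t *deep_copy_params_to_device(const params_t *p_h)
  {
      params_t staged = *p_h;                      /* shallow host copy */
      cudaMalloc((void**)&staged.coef, p_h->n_coef * sizeof(double));
      cudaMemcpy(staged.coef, p_h->coef,
                 p_h->n_coef * sizeof(double), cudaMemcpyHostToDevice);
      params_t *p_d;
      cudaMalloc((void**)&p_d, sizeof(params_t));
      cudaMemcpy(p_d, &staged, sizeof(params_t), cudaMemcpyHostToDevice);
      return p_d;                                  /* usable inside kernels */
  }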
9/5/2012 and 9/12/2012
- libxc on gpu (lin)
- do mgga (summarize pattern for moving functional to gpu)
- figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
- figure out how to get p_d into the functional (will change "p" struct for this)
- kinetic functionals
- understand PBE instruction replays and constant memory
- think about cleanup of p
- summarize pattern for moving functional to gpu
- better pattern for p_d?
- think about integrating with samuli
- multi-alpha zher (jun)
- run nvvp
- look at occupancy calculator (get registers from nvvp; see the register-tuning sketch below)
- think of new ideas to speed-up
- explore the parameter space: threads-per-block, tile-size
- paper (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- work on understanding jacapo density mixing
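For the register/occupancy items above, one knob worth sketching: __launch_bounds__ caps registers per thread so more blocks can be resident per SM, and compiling with nvcc -Xptxas -v makes ptxas print the per-thread register count to feed the occupancy calculator. The kernel body here is only a stand-in.

  #include <cuda_runtime.h>

  #define TPB 128        /* threads per block: the tile-size knob */
  #define MIN_BLOCKS 4   /* desired resident blocks per SM        */

  __global__ void __launch_bounds__(TPB, MIN_BLOCKS)
  tile_update(int n, double a, const double *x, double *y)
  {
      int i = blockIdx.x * TPB + threadIdx.x;
      if (i < n)
          y[i] += a * x[i];   /* stand-in for the real zher tile update */
  }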
8/28/2012
- libxc on gpu (lin)
- do mgga (summarize pattern for moving functional to gpu)
- figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
- figure out how to get p_d into the functional (will change "p" struct for this)
- kinetic functionals
- understand PBE instruction replays and constant memory
- think about cleanup of p
- summarize pattern for moving functional to gpu
- better pattern for p_d?
- think about integrating with samuli
- multi-alpha zher (jun)
- understand current code
- understand nvidia suggestions
- fix timing of cublas vs. source-code zher and run benchmark (see the event-timing sketch below)
- paper (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- work on understanding jacapo density mixing
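A sketch of the cublas side of that benchmark using CUDA events (the source-code zher kernel can be wrapped identically for a fair comparison); note that cublasZher takes a real-valued alpha:

  #include <cublas_v2.h>

  float time_cublas_zher(cublasHandle_t h, int n, double alpha,
                         const cuDoubleComplex *x_d, cuDoubleComplex *A_d)
  {
      cudaEvent_t t0, t1;
      cudaEventCreate(&t0);
      cudaEventCreate(&t1);
      cudaEventRecord(t0);
      cublasZher(h, CUBLAS_FILL_MODE_LOWER, n, &alpha, x_d, 1, A_d, n);
      cudaEventRecord(t1);
      cudaEventSynchronize(t1);   /* events time the GPU work itself, */
      float ms;                   /* not just the asynchronous launch */
      cudaEventElapsedTime(&ms, t0, t1);
      cudaEventDestroy(t0);
      cudaEventDestroy(t1);
      return ms;
  }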
8/21/2012
- libxc on gpu (lin)
- performance plot for RPBE (lin)
- do mgga (summarize pattern for moving functional to gpu)
- understand crash for large number of grid points
- figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
- figure out how to get p_d into the functional (will change "p" struct for this)
- read thru func_aux
- kinetic functionals
- time PBE
- look at nvvp to understand bottleneck
- think about cleanup of p
- summarize pattern for moving functional to gpu
- better pattern for p_d?
- think about integrating with samuli
- multi-alpha zher (jun)
- paper (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- add Na2O4 calculation to AJ infrastructure
- understand default jacapo/gpaw parameters/algorithms/initial-values
8/15/2012
- libxc on gpu (lin)
- performance plot for RPBE (lin)
- work on either the mgga or the copying of "p"
- understand crash for large number of grid points
- read thru func_aux
- time PBE
- look at nvvp to understand bottleneck
- think about cleanup of p
- summarize pattern for moving functional to gpu
- better pattern for p_d?
- evaluate possible gpu purchase (jun)
- multi-alpha zher (jun)
- paper and speeding up more (FFT?) (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- add Na2O4 calculation to AJ infrastructure
- understand default jacapo/gpaw parameters/algorithms/initial-values
8/8/2012
- libxc on gpu (lin)
- performance plot for RPBE (lin)
- work on either the mgga or the copying of "p"
- evaluate possible gpu purchase (jun)
- multi-alpha zher (jun)
- paper and speeding up more (FFT?) (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- add Na2O4 calculation to AJ infrastructure
- understand default jacapo/gpaw parameters/algorithms/initial-values
7/11/2012
- libxc on gpu (lin)
- evaluate possible gpu purchase (jun)
- multi-alpha zher (jun)
- create infrastructure for running convergence tests (aj)
- help with all the above (cpo)
- add Na2O4 calculation to AJ infrastructure
- understand default jacapo/gpaw parameters/algorithms/initial-values
6/27/2012
- libxc on gpu (lin)
- more convergence test cases (aj)
- think about FFT cutoff (aj)
- xsede machines
- generate benchmark strong-scaling plots for exx/rpa for forge (jun)
- create proposal rough draft (jun)
- finish libxc (cpo)
6/20/2012
- libxc on gpu (lin)
- more convergence test cases (aj)
- think about FFT cutoff (aj)
- xsede machines
- install software on forge (cpo)
- generate benchmark strong-scaling plots for exx/rpa for gordon/forge (no swapping!) (jun)
- finish libxc (cpo)
6/13/2012
- libxc on gpu (lin)
- more convergence test cases (aj)
- think about FFT cutoff (aj)
- xsede machines
- install software on forge (cpo)
- understand gordon error (cpo)
- generate benchmark strong-scaling plots for exx/rpa for forge (no swapping!) (jun)
- finish libxc (cpo)
6/6/2012
- try libxc on gpu (lin)
- more convergence test cases (aj)
- think about FFT cutoff (aj)
- see if we get 50% speedup with new zher code (jun)
- xsede machines
- install software (jun/cpo)
- generate benchmark strong-scaling plots for exx/rpa for forge (no swapping!) (jun)
- work on libxc (cpo)
5/30/2012
- understand x/c kernel bottleneck with nvvp (lin)
- try cufft to see what we gain (lin)
- more convergence test cases (aj)
- think about FFT cutoff (aj)
- GEAM, ZHERK (jun)
- xsede machines (jun/cpo)
- generate benchmark strong-scaling plots for exx/rpa (no swapping!)
- use std err to look for node-to-node "time variations"
- work on libxc (cpo)
5/23/2012
- understand x/c kernel bottleneck with nvvp (lin)
- try cufft to see what we gain (lin)
- use VO as convergence test case (aj)
- look at special-metric-weight convergence (aj)
- think about FFT cutoff (aj)
- GEAM, ZHERK (jun)
- build on hopper and xsede machines (jun/cpo)
- generate benchmark strong-scaling plots for exx/rpa (no swapping!)
- use std err to look for node-to-node "time variations"
- work on libxc (cpo)
5/9/2012
- rpbe kernel (lin)
- does cudaMemcpyAsync need cudaMallocHost? (see the pinned-memory sketch below)
- fix stream behavior and try with 1, 2, 4, 8, 16 streams
- understand stream behavior with nvvp
- zher streams (jun)
- in benchmark, have separately variable nstream/nw (see the multi-stream sketch below)
- can we see whether we have 4 or 16 streams?
- understand stream behavior with nvvp
- density mixing (aj)
- work on libxc (cpo)
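On the cudaMallocHost question above: yes, cudaMemcpyAsync only overlaps copies with compute when the host buffer is pinned (cudaMallocHost or cudaHostAlloc); with pageable memory the copy silently degrades to a blocking transfer. A minimal pairing, with hypothetical names:

  #include <cuda_runtime.h>

  void pinned_async_upload(double **h_buf, double **d_buf, size_t n,
                           cudaStream_t s)
  {
      cudaMallocHost((void**)h_buf, n * sizeof(double));  /* pinned host side */
      cudaMalloc((void**)d_buf, n * sizeof(double));
      /* ... fill *h_buf ... */
      cudaMemcpyAsync(*d_buf, *h_buf, n * sizeof(double),
                      cudaMemcpyHostToDevice, s);         /* truly async now */
  }

And a sketch of the nstream/nw benchmark structure, assuming nw independent rank-1 updates with per-update device pointers (hypothetical layout); the nvvp timeline then shows how many streams actually overlap:

  #include <cublas_v2.h>

  #define NSTREAMS 4   /* vary 1, 2, 4, 8, 16 in the benchmark */

  void zher_multistream(cublasHandle_t h, int n, int nw, const double *alpha,
                        cuDoubleComplex **x_d, cuDoubleComplex **A_d)
  {
      cudaStream_t s[NSTREAMS];
      for (int i = 0; i < NSTREAMS; i++)
          cudaStreamCreate(&s[i]);
      for (int w = 0; w < nw; w++) {                /* round-robin the work */
          cublasSetStream(h, s[w % NSTREAMS]);
          cublasZher(h, CUBLAS_FILL_MODE_LOWER, n, &alpha[w],
                     x_d[w], 1, A_d[w], n);
      }
      cudaDeviceSynchronize();
      for (int i = 0; i < NSTREAMS; i++)
          cudaStreamDestroy(s[i]);
  }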
5/2/2012
- look at EXX bottleneck (rewriting) (jun)
- use cuda streams for small RPA systems (jun)
- libxc integration (cpo)
- understand MKL benchmark (jun/cpo)
- pycuda (cpo)
- understand RPBE kernel (lin):
- understand "double" problem
- vary np, block_size, nstreams
- loop testfunc many times
- longer term: look at jussi/samuli kernel for ideas
4/25/2012
- look at EXX bottleneck (rewriting) (jun)
- postpone work on ZHER stuff until we have news from INCITE (jun)
- talk to Frank about computing time applications (cpo)
- understand MKL benchmark (jun/cpo)
- libxc integration (cpo)
4/18/2012
- look at reduced-scope libxc example plus RPBE (lin)
- if there is time, benchmark the RPBE kernel (lin)
- zher performance improvement with multiple streams (jun)
- make INCITE version work (jun/cpo)
- move to libxc 1.2 (cpo)
4/11/2012
- libxc parallelization (lin)
- libxc integration (cpo)
- understand missing time in cublas mode (jun/cpo)
- how to put the gemm in PW mode in a fairly neat way (lin/cpo)
- start working on multiple-alpha kernel (MAZHER) (jun/cpo); see the kernel sketch below
- work on INCITE proposal (jun/cpo)
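A rough sketch of the MAZHER idea: fold all nw rank-1 updates A += alpha[w] * x[w] x[w]^H into a single pass over A, so each matrix element is read and written once instead of nw times. Layout and names are assumptions, not the final kernel design.

  #include <cuComplex.h>

  __global__ void mazher(int n, int nw, const double *alpha,
                         const cuDoubleComplex *x,   /* nw vectors, stride n  */
                         cuDoubleComplex *A)         /* column-major, lda = n */
  {
      int i = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
      int j = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
      if (i >= n || j >= n || i < j)
          return;                                      /* lower triangle only */
      cuDoubleComplex a = A[j * n + i];
      for (int w = 0; w < nw; w++) {
          cuDoubleComplex t = cuCmul(x[w * n + i], cuConj(x[w * n + j]));
          a = cuCadd(a, make_cuDoubleComplex(alpha[w] * cuCreal(t),
                                             alpha[w] * cuCimag(t)));
      }
      A[j * n + i] = a;
  }

Tiling the x vectors through shared memory would be the natural next step once this naive version is validated against nw separate zher calls.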
3/28/2012
- gemm (lin)
- run pt3x3 (cpo)
- libxc (cpo, and lin if he finishes gemm)
- cher/fft (jun)
- fix gpu allocation (cpo)
- circular dependency problem with monkhorst_pack (cpo)
- mpi failure with cuzher (cpo)
3/21/2012
- batch queue for GPU machine (cpo)
- fft/gemm/gemv (lin/jun/cpo)
- single precision cher instead of zher? (jun/cpo)
- new libxc (cpo)
- fix libfftw detection (cpo)
- improve zher in cuda (long project, jun/cpo)
- move "expand" from python into C, post to mailing list? (lin)
- look at spin paired (cpo)
- run pt3x3 (cpo)
3/14/2012
- pycuda compatibility (cpo)
- private svn (cpo)
- try nvvp/transpose (or C60 with more grid points) for >5 minutes (lin)
- send mail to nvidia or list to understand why nvvp profile cuts off after 5 minutes (lin)
- understand bottleneck in get_wfs (jun)
- implement fft/gemv (cpo)
- is there a cuda library for trace, like zgeev? (cpo)
- run a 3x3x3 system to see if bottlenecks stay the same (cpo)
- driver hang status (cpo)
- understand how to fix gs.py bottlenecks in more detail (lin/cpo) using gpaw profiler:
- pseudo density: density.py: self.calculate_pseudo_density(wfs) (cpo)
- projections: overlap.py: wfs.pt.integrate(psit_nG, P_ani, kpt.q) (cpo)
- RMM-DIIS: eigensolvers/rmm_diis.py: lots of lines (cpo)
- projections: eigensolvers/rmm_diis.py: wfs.pt.integrate(dpsit_xG, P_axi, kpt.q) (lin)
- calc_h_matrix: eigensolvers/eigensolver.py: H_nn = self.operator.calculate_matrix_elements, hamiltonian.xc.correct_hamiltonian_matrix (lin)
- rotate_psi: eigensolvers/eigensolver.py (lin)
Accessing suncatgpu01 SVN
We have put a version of GPAW in a local SVN repository on suncatgpu01. To access it, use the following:
svn co svn://localhost svngpaw
You can put whatever you want for the last argument (local directory name).
General Topics
- Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
- (Everyone) Understand gpaw (read paper)
- what other steps could we parallelize?
- Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
- (Everyone) Go through CUDA tutorial here.
- Understand blocks/threads/warps and how they map onto GPU hardware (details of which can be seen with the "deviceQuery" command; see the snippet at the end of this page)
- (Lin) Find tool to measure:
- memory bandwidth usage
- gpu flops usage
- (Jun):
- Parallelize LCAO/planewave/RPA (zher performance?)? non-rmm-diis eigensolver?
- merge with trunk?
- (cpo):
- Understand code flow
- Understand where the ~23 cuda kernels are used
- Understand which bottlenecks we need to tackle
- Do another gpu-gpaw install (to learn)
- Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
- Can multiple CPU processes win by using the same GPU?
- Understand pycuda
- Understand gpaw interface to cuda (c/cuda subdirectory)
- Read CUDA programming manual here.
- Do all gpaw self-tests pass with GPUs?
- Can we get bigger bang-per-buck with GeForce instead of Tesla? (don’t need GPUDirect, maybe live with less memory/bandwidth? double precision worse)
- Understand cuda better:
- Does Samuli use pinned memory correctly?
- run/understand a cuda bandwidth benchmark
- Could we use GPUdirect for MPI data transfer?
- Does GPU performance scale with the product gridpoints*bands? It might be combinatorial in the bands and only linear in the grid points.
- Duplicate Samuli results
- Update to most recent version in svn
- Understand where gpaw scaling maxes out for Pt 3x4x3
- Why is CO on 2 GPUs slower than on 8 CPUs?
- Can we do something less precise in the vacuum area? (fewer grid points?)
- Do we need a fatter interconnect for GPUs?
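The deviceQuery fields that matter for the blocks/threads/warps mapping can also be pulled programmatically:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main(void)
  {
      cudaDeviceProp p;
      cudaGetDeviceProperties(&p, 0);
      printf("%s: %d SMs, warp size %d, max %d threads/block, %d regs/block\n",
             p.name, p.multiProcessorCount, p.warpSize,
             p.maxThreadsPerBlock, p.regsPerBlock);
      return 0;
  }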