
To-Do List

Questions for Nvidia
  • how to use constant memory (see the sketch after this list)
  • how to use texture memory
  • what does the 150GB/s mem bandwidth number mean?
    • it is the sum of read and write bandwidth
  • optimization tricks: pre-fetch etc.
    • we get 85 GB/s out of 150 GB/s on the 2075. use cudaDMA?
  • what does a queued warp do? (does it pre-fetch the memory)
    • yes
  • reducing number of registers in kernel (does compiler typically do this optimally?)
    • can control register usage using launch bounds (see the sketch after this list)
  • how to learn with nvvp whether we're memory- or flops-limited
    • philippe just counts instructions and measures MB/s by running code (no NVVP). He has some special code that counts instructions for him in complicated cases.
  • understanding the nvvp columns
  • best way to associate the right GPU with the right CPU core (e.g. "taskset", "numactl")
  • ask about zher speedup numbers: for 4kx4k why does gemm improve by x30 but zher improves by x6?
    • gemm with large sizes is compute limited, which GPU does well. zher is memory limited.
  • using automake with cuda and c in one library?
  • swapping out priority: free up memory?
  • proxy gpu allocation only works on K20?
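A minimal CUDA sketch of the constant-memory and launch-bounds answers above. The kernel and array names (scale_kernel, coeff) are hypothetical: constant memory is declared at file scope with __constant__ and filled from the host with cudaMemcpyToSymbol, and __launch_bounds__(max threads per block, min blocks per multiprocessor) tells the compiler to cap register usage so the requested occupancy is reachable.

  // Illustrative sketch -- kernel and array names are made up.
  #include <cuda_runtime.h>
  #include <stdio.h>

  #define NCOEFF 16

  // Constant memory: cached and broadcast to all threads in a warp.
  __constant__ double coeff[NCOEFF];

  // Cap register usage: at most 256 threads per block, and enough
  // registers left over to fit at least 4 blocks per multiprocessor.
  __global__ void __launch_bounds__(256, 4)
  scale_kernel(double *x, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          x[i] *= coeff[i % NCOEFF];
  }

  int main(void)
  {
      int n = 1 << 20;
      double h_coeff[NCOEFF];
      for (int i = 0; i < NCOEFF; ++i)
          h_coeff[i] = 1.0 + i;

      // Fill constant memory from the host.
      cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));

      double *d_x;
      cudaMalloc(&d_x, n * sizeof(double));
      cudaMemset(d_x, 0, n * sizeof(double));

      scale_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
      cudaDeviceSynchronize();
      printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

      cudaFree(d_x);
      return 0;
  }

The same register cap can be applied to a whole compilation unit with nvcc's -maxrregcount flag, but __launch_bounds__ works per kernel, which matches the per-kernel tuning discussed above.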
1/8/2013
  • libxc on gpu (lin)
    • work on automake stuff
    • get the cleaned-up ifdef version from Miguel
  • digest RPA timing measurements (lin)
  • AJ tries simple new-setup Ru system with rmm-diis
    • generate temperature residual plot
    • generate residual compared to converged
  • cpo compares FFTMixer to dacapo
  • paper (jun)
  • redo timing measurements (jun/lin)
  • understand new GPU box memory slowness (cpo)
12/18/2012
  • libxc on gpu (lin)
    • use common work file for CPU/GPU
  • digest RPA timing measurements (lin)
  • paper (jun)
  • redo timing measurements (jun)
  • understand timing measurements more fully (jun)
  • dacapo density mixing vs. GPAW (cpo)
12/11/2012
  • understand nvidia zgemm speedup plot (jun/cpo)
    • ANSWER: without threading, x29 faster on GPU. With 6-thread OpenMP we get x5, which agrees with nvidia
  • understand why zher is x6 better on GPU but we see x24 with RPA (will put device sync in code) (jun/cpo)
    • ANSWER: CPU is memory bandwidth limited (so faster with 1 core). This accounts for roughly x2, and the other x2 comes from overlapping CPU/GPU computation.
  • does cuda5 improve ZHER? (jun/cpo) ANSWER: no improvement
  • libxc on gpu (lin)
    • use common work file for CPU/GPU
  • digest RPA timing measurements (lin)
  • think about moving lambda calc to GPU (jun) (ANSWER: no need, 10 or 20% improvement, best case)
  • try multiple surfaces with jacapo/gpaw-pw (aj)
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • make sure all libxc self-tests run
  • why doesn't marcin's na.py converge, even with fixed density?
  • can the alphas for the nt_G really be used for the D's?
12/4/2012
  • understand nvidia zher speedup plot (jun/cpo)
  • libxc on gpu (lin)
    • use CUDA5
    • use common functional file for CPU/GPU
    • use common work file for CPU/GPU
    • read samuli old talk
    • run 3x4x3 pt system
  • RPA timing measurements (lin)
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • try multiple surfaces with jacapo/gpaw-pw (aj)
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • install GPAW on Keeneland (cpo)
  • make sure all libxc self-tests run
  • move suncatgpu01 to CUDA5 (cpo)
11/27/2012
  • come up with list of items to ask about at nvidia mtgs
  • libxc on gpu (lin)
    • read samuli old talk
    • run 3x4x3 pt system
    • run PBE0
    • fix linking undefined symbol
    • make sure all self-tests run
    • put paramsize fix in for mgga and lda
    • test libxc 2.0.0
  • RPA timing measurements (lin)
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • try multiple surfaces with jacapo/gpaw-pw (aj)
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • install GPAW on Keeneland (cpo)
  • "patch" file for libxc (only the memsets?) (cpo)
  • move suncatgpu01 to CUDA5 (cpo)
  • figure out how to softlink lda_c_pw.cuh (cpo)
11/20/2012
  • libxc on gpu (lin)
    • fix the zeroing (is there a cudamemset?)
    • make sure all self-tests run
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • try multiple surfaces with jacapo/gpaw-pw (aj)
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • install GPAW on Keeneland (cpo)
  • merge libxc-gpu and libxc (patch memsets, and zero-ing in work) (cpo)
  • "patch" file for libxc (only the memsets?) (cpo)
  • move suncatgpu01 to CUDA5 (cpo)
11/13/2012
  • libxc on gpu (lin)
    • fix the zeroing (is there a cudamemset?)
    • double check timing for LCAO results
    • make sure all self-tests run
    • commit to svn
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • install GPAW on Keeneland (cpo)
  • merge libxc-gpu and libxc (patch memsets, and zero-ing in work) (cpo)
11/6/2012
  • libxc on gpu (lin)
    • decide what to do about the hacks (with print statements)
    • copy less of the scratch data to GPU
    • run the self-tests
    • see if performance is better/worse
    • check that unmodified libxc still works
    • commit to svn
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • install GPAW on Keeneland (cpo)
  • merge libxc-gpu and libxc (patch memsets, and zero-ing in work) (cpo)
10/30/2012
  • libxc on gpu (lin)
    • remove print statements
    • merge libxc-gpu and libxc
    • copy less of the scratch data to GPU
    • run the self-tests
    • do the memsets for lda/mgga
    • see if performance is better/worse
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • install GPAW on Keeneland (cpo)
10/23/2012
  • libxc on gpu (lin)
    • remove print statements
    • test spin-polarized
    • understand why H numbers are different than gpugpaw_v2
    • merge libxc-gpu and libxc
  • multi-alpha zher at a lower priority (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • try calling dacapo density mixing from GPAW (cpo)
  • get journal recommendations from Nichols (cpo)
10/4/2012
  • libxc on gpu (lin)
    • PBEsol-X
    • put libxc in samuli branch at "low-level" (libxc.py?)
    • solve zero-ing problem and stride problem
  • multi-alpha zher (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • create infrastructure for running convergence tests (aj)
  • try calling dacapo density mixing from GPAW (cpo)
9/25/2012
  • libxc on gpu (lin)
    • test PBEsol
    • cleanup existing code (delete commented lines, unused code)
    • put in p_d_gga and p_d_mgga, for consistency
    • have one beautiful program that runs an lda/gga/mgga functional on both CPU and GPU and times them.
    • think about integrating with samuli
  • multi-alpha zher (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • work on understanding jacapo density mixing
9/18/2012
  • libxc on gpu (lin)
    • focus on tpss_x (summarize pattern for moving functional to gpu)
    • ask samuli if there are functionals he would like us to move?
    • figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
    • figure out how to get p_d into the functional (will change "p" struct for this)
    • kinetic functionals
    • understand PBE instruction replays and constant memory
    • think about cleanup of p
    • summarize pattern for moving functional to gpu
    • better pattern for p_d?
    • think about integrating with samuli
  • multi-alpha zher (jun)
    • reduce registers? prefetch?
    • explore the parameter space: tile-size
  • paper (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • work on understanding jacapo density mixing
9/5/2012 and 9/12/2012
  • libxc on gpu (lin)
    • do mgga (summarize pattern for moving functional to gpu)
    • figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
    • figure out how to get p_d into the functional (will change "p" struct for this)
    • kinetic functionals
    • understand PBE instruction replays and constant memory
    • think about cleanup of p
    • summarize pattern for moving functional to gpu
    • better pattern for p_d?
    • think about integrating with samuli
  • multi-alpha zher (jun)
    • run nvvp
    • look at occupancy calculator (get registers from nvvp)
    • think of new ideas to speed-up
    • explore the parameter space: threads-per-block, tile-size
  • paper (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • work on understanding jacapo density mixing
8/28/2012
  • libxc on gpu (lin)
    • do mgga (summarize pattern for moving functional to gpu)
    • figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
    • figure out how to get p_d into the functional (will change "p" struct for this)
    • kinetic functionals
    • understand PBE instruction replays and constant memory
    • think about cleanup of p
    • summarize pattern for moving functional to gpu
    • better pattern for p_d?
    • think about integrating with samuli
  • multi-alpha zher (jun)
    • understand current code
    • understand nvidia suggestions
  • fix timing of cublas vs. source-code zher and run benchmark
  • paper (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • work on understanding jacapo density mixing
8/21/2012
  • libxc on gpu (lin)
    • performance plot for RPBE (lin)
    • do mgga (summarize pattern for moving functional to gpu)
    • understand crash for large number of grid points
    • figure out how to get nested param-size (will change "p" struct for this, in general it would be a function to deep-copy params)
    • figure out how to get p_d into the functional (will change "p" struct for this)
    • read thru func_aux
    • kinetic functionals
    • time PBE
    • look at nvvp to understand bottleneck
    • think about cleanup of p
    • summarize pattern for moving functional to gpu
    • better pattern for p_d?
    • think about integrating with samuli
  • multi-alpha zher (jun)
  • paper (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • add Na2O4 calculation to AJ infrastructure
    • understand default jacapo/gpaw parameters/algorithms/initial-values
8/15/2012
  • libxc on gpu (lin)
    • performance plot for RPBE (lin)
    • work on either the mgga or the copying of "p"
    • understand crash for large number of grid points
    • read thru func_aux
    • time PBE
    • look at nvvp to understand bottleneck
    • think about cleanup of p
    • summarize pattern for moving functional to gpu
    • better pattern for p_d?
  • evaluate possible gpu purchase (jun)
  • multi-alpha zher (jun)
  • paper and speeding up more (FFT?) (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • add Na2O4 calculation to AJ infrastructure
    • understand default jacapo/gpaw parameters/algorithms/initial-values
8/8/2012
  • libxc on gpu (lin)
    • performance plot for RPBE (lin)
    • work on either the mgga or the copying of "p"
  • evaluate possible gpu purchase (jun)
  • multi-alpha zher (jun)
  • paper and speeding up more (FFT?) (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • add Na2O4 calculation to AJ infrastructure
    • understand default jacapo/gpaw parameters/algorithms/initial-values
7/11/2012
  • libxc on gpu (lin)
  • evaluate possible gpu purchase (jun)
  • multi-alpha zher (jun)
  • create infrastructure for running convergence tests (aj)
  • help with all the above (cpo)
    • add Na2O4 calculation to AJ infrastructure
    • understand default jacapo/gpaw parameters/algorithms/initial-values
6/27/2012
  • libxc on gpu (lin)
  • more convergence test cases (aj)
  • think about FFT cutoff (aj)
  • xsede machines
    • generate benchmark strong-scaling plots for exx/rpa for forge (jun)
    • create proposal rough draft (jun)
  • finish libxc (cpo)
6/20/2012
  • libxc on gpu (lin)
  • more convergence test cases (aj)
  • think about FFT cutoff (aj)
  • xsede machines
    • install software on forge (cpo)
    • generate benchmark strong-scaling plots for exx/rpa for gordon/forge (no swapping!) (jun)
  • finish libxc (cpo)
6/13/2012
  • libxc on gpu (lin)
  • more convergence test cases (aj)
  • think about FFT cutoff (aj)
  • xsede machines
    • install software on forge (cpo)
    • understand gordon error (cpo)
    • generate benchmark strong-scaling plots for exx/rpa for forge (no swapping!) (jun)
  • finish libxc (cpo)
6/13/2012
  • try libxc on gpu (lin)
  • more convergence test cases (aj)
  • think about FFT cutoff (aj)
  • see if we get 50% speedup with new zher code (jun)
  • xsede machines
    • install software (jun/cpo)
    • generate benchmark strong-scaling plots for exx/rpa for forge (no swapping!) (jun)
  • work on libxc (cpo)
5/30/2012
  • understand x/c kernel bottleneck with nvvp (lin)
  • try cufft to see what we gain (lin)
  • more convergence test cases (aj)
  • think about FFT cutoff (aj)
  • GEAM, ZHERK (jun)
  • xsede machines (jun/cpo)
    • generate benchmark strong-scaling plots for exx/rpa (no swapping!)
    • use std err to look for node-to-node "time variations"
  • work on libxc (cpo)
5/23/2012
  • understand x/c kernel bottleneck with nvvp (lin)
  • try cufft to see what we gain (lin)
  • use VO as convergence test case (aj)
  • look at special-metric-weight convergence (aj)
  • think about FFT cutoff (aj)
  • GEAM, ZHERK (jun)
  • build on hopper and xsede machines (jun/cpo)
    • generate benchmark strong-scaling plots for exx/rpa (no swapping!)
    • use std err to look for node-to-node "time variations"
  • work on libxc (cpo)
5/9/2012
  • rpbe kernel (lin)
    • does cudaMemcpyAsync need cudaMallocHost? (see the sketch after this list)
    • fix stream behavior and try with 1,2,4,8,16 streams
    • understand stream behaviour with nvvp
  • zher streams (jun)
    • in benchmark, have separately variable nstream/nw
    • can we see whether we have 4 or 16 streams?
    • understand stream behaviour with nvvp
  • density mixing (aj)
  • work on libxc (cpo)
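A minimal sketch for the cudaMemcpyAsync/cudaMallocHost question above (buffer names, sizes, and stream count are arbitrary). cudaMemcpyAsync only runs truly asynchronously when the host buffer is page-locked, i.e. allocated with cudaMallocHost (or registered with cudaHostRegister); with plain malloc memory the copy effectively degrades to a synchronous transfer and cannot overlap across streams.

  // Sketch: chunked async host->device copies on multiple streams.
  // The host buffer must be pinned (cudaMallocHost) for real overlap.
  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(void)
  {
      const size_t n = 1 << 24;
      const int nstreams = 4;
      const size_t chunk = n / nstreams;

      double *h_buf;                      // pinned (page-locked) host buffer
      cudaMallocHost(&h_buf, n * sizeof(double));
      for (size_t i = 0; i < n; ++i)
          h_buf[i] = 1.0;

      double *d_buf;
      cudaMalloc(&d_buf, n * sizeof(double));

      cudaStream_t streams[nstreams];
      for (int s = 0; s < nstreams; ++s)
          cudaStreamCreate(&streams[s]);

      // Split the transfer across streams; each chunk can overlap
      // with copies/kernels issued in the other streams.
      for (int s = 0; s < nstreams; ++s)
          cudaMemcpyAsync(d_buf + s * chunk, h_buf + s * chunk,
                          chunk * sizeof(double),
                          cudaMemcpyHostToDevice, streams[s]);

      cudaDeviceSynchronize();
      printf("copied %zu MB\n", n * sizeof(double) >> 20);

      for (int s = 0; s < nstreams; ++s)
          cudaStreamDestroy(streams[s]);
      cudaFree(d_buf);
      cudaFreeHost(h_buf);
      return 0;
  }

In nvvp one can then check that the per-stream copies actually overlap; with a malloc'ed host buffer the same timeline shows the copies serialized.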
5/2/2012
  • looking at EXX bottleneck (rewriting) (jun)
  • use cuda streams for small RPA systems (jun)
  • libxc integration (cpo)
  • understand MKL benchmark (jun/cpo)
  • pycuda (cpo)
  • understand RPBE kernel: (lin)
    • understand "double" problem
    • vary np, block_size, nstreams
    • loop testfunc many times
    • longer term: look at jussi/samuli kernel for ideas
4/25/2012
  • looking at EXX bottleneck (rewriting) (jun)
  • postpone work on ZHER stuff until we have news from INCITE (jun)
  • talk to Frank about computing time applications (cpo)
  • understand MKL benchmark (jun/cpo)
  • libxc integration (cpo)
4/18/2012
  • look at reduced-scope libxc example plus RPBE (lin)
  • if there is time, benchmark the RPBE kernel (lin)
  • zher performance improvement with multiple streams (jun)
  • make INCITE version work (jun/cpo)
  • move to libxc 1.2 (cpo)
4/11/2012
  • libxc parallelization (lin)
  • libxc integration (cpo)
  • understand missing time in cublas mode (jun/cpo)
  • how to put the gemm in PW mode in a fairly neat way (lin/cpo)
  • start working on multiple-alpha kernel (MAZHER) (jun/cpo)
  • work on INCITE proposal (jun/cpo)
3/28/2012
  • gemm (lin)
  • run pt3x3 (cpo)
  • libxc (cpo, and lin if he finishes gemm)
  • cher/fft (jun)
  • fix gpu allocation (cpo)
  • circular dependency problem with monkhorst_pack (cpo)
  • mpi failure with cuzher (cpo)
3/21/2012
  • batch queue for GPU machine (cpo)
  • fft/gemm/gemv (lin/jun/cpo)
  • single precision cher instead of zher? (jun/cpo)
  • new libxc (cpo)
  • fix libfftw detection (cpo)
  • improve zher in cuda (long project, jun/cpo)
  • move "expand" from python into C, post to mailing list? (lin)
  • look at spin paired (cpo)
  • run pt3x3 (cpo)
3/14/2012
  • pycuda compatibility (cpo)
  • private svn (cpo)
  • try nvvp/transpose (or C60 with more grid points) for >5 minutes (lin)
  • send mail to nvidia or list to understand why nvvp profile cuts off after 5 minutes (lin)
  • understand bottleneck in get_wfs (jun)
  • implement fft/gemv (cpo)
  • is there a cuda library for trace like zgeev (cpo)
  • run a 3x3x3 system to see if bottlenecks stay the same (cpo)
  • driver hang status (cpo)
  • understand how to fix gs.py bottlenecks in more detail (lin/cpo) using gpaw profiler:
    • pseudo density: density.py: self.calculate_pseudo_density(wfs) (cpo)
    • projections: overlap.py: wfs.pt.integrate(psit_nG, P_ani, kpt.q) (cpo)
    • RMM-DIIS: eigensolvers/rmm_diis.py: lots of lines (cpo)
    • projections: eigensolvers/rmm_diis.py: wfs.pt.integrate(dpsit_xG, P_axi, kpt.q) (lin)
    • calc_h_matrix: eigensolvers/eigensolver.py: H_nn = self.operator.calculate_matrix_elements, hamiltonian.xc.correct_hamiltonian_matrix (lin)
    • rotate_psi: eigensolvers/eigensolver.py (lin)

Accessing suncatgpu01 SVN

We have put a version of GPAW in a local SVN repository on suncatgpu01. To access it, use the following:

svn co svn://localhost svngpaw

You can put whatever you want for the last argument (local directory name).

General Topics

  • Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
  • (Everyone) Understand gpaw (read paper)
    • what other steps could we parallelize?
    • Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
  • (Everyone) Go through CUDA tutorial here.
    • Understand blocks/threads/warps and how they map onto GPU hardware (details of which can be seen with "deviceQuery" command)
  • (Lin) Find tool to measure:
    • memory bandwidth usage
    • gpu flops usage
  • (Jun) :
    • Parallelize LCAO/planewave/RPA (zher performance?)? non-rmm-diis eigensolver?
    • merge with trunk?
  • (cpo) :
    • Understand code flow
    • Understand where the ~23 cuda kernels are used
    • Understand which bottlenecks we need to tackle
  • Do another gpu-gpaw install (to learn)
  • Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
  • Can multiple CPU processes win by using the same GPU?
  • Understand pycuda
  • Understand gpaw interface to cuda (c/cuda subdirectory)
  • Read CUDA programming manual here.
  • Do all gpaw self-tests pass with GPUs?
  • Can we get bigger bang-per-buck with GeForce instead of Tesla? (don’t need GPUDirect, maybe live with less memory/bandwidth? double precision worse)
  • Understand cuda better:
    • Does Samuli use pinned memory correctly?
    • run/understand a cuda bandwidth benchmark (see the sketch at the end of this list)
    • Could we use GPUDirect for MPI data transfer?
  • Does GPU performance scale with the product gridpoints*bands? There might be a combinatorial effect with the bands but linear scaling with the grid points.
  • Duplicate Samuli results
  • Update to most recent version in svn
  • Understand where gpaw scaling maxes out for Pt 3x4x3
  • Why is CO on 2 GPUs slower than on 8 CPUs?
  • Can we do something less precise in the vacuum area? (fewer grid points?)
  • Do we need a fatter interconnect for GPUs?
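A minimal sketch of a host-to-device bandwidth benchmark for the pinned-memory items above (buffer size and iteration count are arbitrary). Note this measures PCIe transfer bandwidth, not the ~150 GB/s on-device memory bandwidth from the Nvidia questions; the point is to compare a pageable (malloc) host buffer against a pinned (cudaMallocHost) one, timed with CUDA events.

  // Sketch of a host->device bandwidth benchmark, pageable vs pinned.
  #include <cuda_runtime.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static float time_h2d(void *dst, const void *src, size_t bytes, int iters)
  {
      cudaEvent_t t0, t1;
      cudaEventCreate(&t0);
      cudaEventCreate(&t1);
      cudaEventRecord(t0);
      for (int i = 0; i < iters; ++i)
          cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
      cudaEventRecord(t1);
      cudaEventSynchronize(t1);
      float ms;
      cudaEventElapsedTime(&ms, t0, t1);
      cudaEventDestroy(t0);
      cudaEventDestroy(t1);
      return ms / iters;                  // average ms per copy
  }

  int main(void)
  {
      const size_t bytes = 64 << 20;      // 64 MB
      const int iters = 20;

      void *d_buf, *pageable, *pinned;
      cudaMalloc(&d_buf, bytes);
      pageable = malloc(bytes);
      cudaMallocHost(&pinned, bytes);
      memset(pageable, 1, bytes);         // fault the pages in
      memset(pinned, 1, bytes);

      float ms_pageable = time_h2d(d_buf, pageable, bytes, iters);
      float ms_pinned   = time_h2d(d_buf, pinned, bytes, iters);

      double gb = bytes / 1e9;
      printf("pageable: %.2f GB/s\n", gb / (ms_pageable / 1e3));
      printf("pinned:   %.2f GB/s\n", gb / (ms_pinned / 1e3));

      free(pageable);
      cudaFreeHost(pinned);
      cudaFree(d_buf);
      return 0;
  }

Pinned transfers typically run near the PCIe limit while pageable ones are noticeably slower, which is one quick way to check whether pinned memory is being used correctly.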