  • Stanford CUDA course: http://code.google.com/p/stanford-cs193g-sp2010/
  • (Everyone) Understand gpaw (read paper)
    • what other steps could we parallelize?
    • Can we do existing parallelization better? (e.g. use ideas in Todd's GPU papers)
  • (Everyone) Go through CUDA tutorial here.
    • Understand blocks/threads/warps and how they map onto the GPU hardware (details of which can be seen with the "deviceQuery" command); see the indexing sketch after this list
  • (Lin) Find tool to measure:
    • memory bandwidth usage
    • gpu flops usage
  • (Jun):
    • Parallelize LCAO/plane-wave/RPA (zher performance?); non-RMM-DIIS eigensolver?
    • merge with trunk?
  • (cpo):
    • Understand code flow
    • Understand where the ~23 cuda kernels are used
    • Understand which bottlenecks we need to tackle
  • Do another gpu-gpaw install (to learn)
  • Understand Pt 3x4x3 CPU/GPU difference versus 3x2x3 (performance scaling with system size)
  • Can multiple CPU processes win by using the same GPU?
  • Understand pycuda
  • Understand gpaw interface to cuda (c/cuda subdirectory)
  • Read CUDA programming manual here.
  • Do all gpaw self-tests pass with GPUs?
  • Can we get a bigger bang-per-buck with GeForce instead of Tesla? (We don't need GPUDirect; maybe we can live with less memory/bandwidth? Double-precision performance is worse.)
  • Understand cuda better:
    • Does Samuli use pinned memory correctly?
    • run/understand a CUDA bandwidth benchmark (see the pinned-memory bandwidth sketch after this list)
    • Could we use GPUDirect for MPI data transfer?
  • Does GPU performance scale with the product of grid points and bands? It might scale combinatorially with the number of bands but linearly with the number of grid points.
  • Duplicate Samuli results
  • Update to most recent version in svn
  • Understand where gpaw scaling maxes out for Pt 3x4x3
  • Why is CO on 2 GPUs slower than on 8 CPUs?
  • Can we do something less precise in the vacuum area? (fewer grid points?)
  • Do we need a fatter interconnect for GPUs?
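
A minimal CUDA indexing sketch for the blocks/threads/warps item above (the kernel and sizes are illustrative, not taken from the gpaw code): each thread computes one global index from its block and thread coordinates, threads run in warps of 32, and the hardware limits involved are the ones deviceQuery reports.

    // indexing_sketch.cu -- build with: nvcc indexing_sketch.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale_add(const double *x, double *y, double a, int n)
    {
        // Global index = block offset + thread offset within the block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                        // guard the partial last block
            y[i] += a * x[i];
    }

    int main()
    {
        const int n = 1 << 20;
        double *x, *y;
        cudaMalloc((void **)&x, n * sizeof(double));
        cudaMalloc((void **)&y, n * sizeof(double));
        cudaMemset(x, 0, n * sizeof(double));
        cudaMemset(y, 0, n * sizeof(double));

        // 256 threads per block = 8 warps; enough blocks to cover all n elements.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_add<<<blocks, threads>>>(x, y, 2.0, n);
        cudaDeviceSynchronize();

        printf("launched %d blocks of %d threads\n", blocks, threads);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }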
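
A pinned-memory bandwidth sketch for the benchmark item above (buffer size and names are illustrative): it times the same host-to-device copy once from a pageable malloc buffer and once from a page-locked cudaMallocHost buffer, which is the difference that decides whether transfers reach full PCIe bandwidth and can run asynchronously.

    // bw_sketch.cu -- pageable vs pinned host-to-device bandwidth
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    static float copy_ms(const void *host, void *dev, size_t bytes)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        const size_t bytes = 64 << 20;            // 64 MB test buffer
        void *dev, *pageable, *pinned;
        cudaMalloc(&dev, bytes);
        pageable = malloc(bytes);
        cudaMallocHost(&pinned, bytes);           // page-locked (pinned) host memory

        float ms_pageable = copy_ms(pageable, dev, bytes);
        float ms_pinned   = copy_ms(pinned, dev, bytes);
        printf("pageable: %.2f GB/s   pinned: %.2f GB/s\n",
               bytes / ms_pageable / 1e6, bytes / ms_pinned / 1e6);

        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(dev);
        return 0;
    }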

What We Learned at GTC

What we learned at the GPU Technology Conference:

  • how large blocks fit onto the SM (esp. wrt shared memory)
  • nersc gpu cluster (dirac) usable?
  • email addresses of fft/zher guy
  • occupancy spreadsheet (its inputs can be queried per kernel; see the kernel-attributes sketch after this list):
    • registers per thread
    • threads per block
    • shared memory per block
  • access latency:
    • shared memory: ~10 clock ticks
    • global memory: 400-800 clock ticks
  • L1 cache/shared memory split can be traded off (16/48 kB); see the cache-config sketch after this list
  • number of streams: 16 for compute capability 2.0 devices (see the streams sketch after this list)
  • zher improves 50% in cuda 5 ("enable atomics")
  • can try GEAM or ZHERK instead of ZHER (see the rank-k update sketch after this list)
  • BLAS2 functions are memory bound, BLAS3 functions compute bound
  • can maybe get access to zher source if we really need it
  • 3d ffts might be good
  • ucla gpu cluster available?
  • OpenACC is only available for certain compilers (PGI, Cray)
  • new Kepler/CUDA 5 features:
    • dynamic parallelism
    • Hyper-Q for streams
    • slower clock speed, more cores
    • memory 5x faster (maybe because of more register space)
  • startup time with multiple GPUs is a known problem
  • nvvp is being phased out in favor of the Eclipse-based Nsight
    • much better information with Nsight
  • VASP person says we will win with 100x100x100 grid
  • use single precision for pre-conditioning
  • read the "best-practices" manual to understand nvvp profiling
  • unified virtual addressing for transferring data (see the peer-to-peer copy sketch after this list):
    • P2P for multiple GPUs within a process
    • IPC for multiple GPUs between processes
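
A kernel-attributes sketch for the occupancy spreadsheet inputs listed above (the dummy kernel is illustrative, not from the gpaw code): cudaFuncGetAttributes reports registers per thread and static shared memory per block for a compiled kernel, and the device properties give the per-block resource limits the spreadsheet compares against.

    // occupancy_inputs.cu -- reads the per-kernel numbers the spreadsheet needs
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(double *a)       // stand-in for a real kernel
    {
        a[threadIdx.x] *= 2.0;
    }

    int main()
    {
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, dummy_kernel);
        printf("registers per thread:           %d\n", attr.numRegs);
        printf("static shared memory per block: %zu bytes\n", (size_t)attr.sharedSizeBytes);
        printf("max threads per block:          %d\n", attr.maxThreadsPerBlock);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("registers available per block:  %d\n", prop.regsPerBlock);
        printf("shared memory per block:        %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }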
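
A cache-config sketch for the 16/48 kB L1/shared trade-off noted above (the kernel name is illustrative): on compute capability 2.x devices the 64 kB of on-chip memory per SM can be split either way, set device-wide or overridden per kernel.

    // cache_config.cu -- choose the L1 vs shared-memory split
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void stencil_kernel(double *a)
    {
        a[threadIdx.x] += 1.0;
    }

    int main()
    {
        // Device-wide preference: 48 kB shared / 16 kB L1.
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // Per-kernel override: 48 kB L1 / 16 kB shared for this kernel only.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);

        double *a;
        cudaMalloc((void **)&a, 256 * sizeof(double));
        cudaMemset(a, 0, 256 * sizeof(double));
        stencil_kernel<<<1, 256>>>(a);
        cudaDeviceSynchronize();
        cudaFree(a);
        printf("done\n");
        return 0;
    }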
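
A streams sketch for the concurrency note above (stream count and chunk size are illustrative): the work is split into chunks, each bound to its own stream, so host-to-device copies, kernels, and device-to-host copies from different chunks can overlap; this only works when the host buffers are pinned.

    // streams_overlap.cu -- overlap transfers and kernels with multiple streams
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(double *d, int n, double a)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= a;
    }

    int main()
    {
        const int nstreams = 4;
        const int chunk = 1 << 18;                // elements per stream
        double *host, *dev;
        cudaMallocHost((void **)&host, nstreams * chunk * sizeof(double));  // pinned: required for async copies
        cudaMalloc((void **)&dev, nstreams * chunk * sizeof(double));

        cudaStream_t streams[nstreams];
        for (int s = 0; s < nstreams; ++s)
            cudaStreamCreate(&streams[s]);

        for (int s = 0; s < nstreams; ++s) {
            double *h = host + s * chunk;
            double *d = dev + s * chunk;
            cudaMemcpyAsync(d, h, chunk * sizeof(double),
                            cudaMemcpyHostToDevice, streams[s]);
            scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk, 0.5);
            cudaMemcpyAsync(h, d, chunk * sizeof(double),
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < nstreams; ++s)
            cudaStreamDestroy(streams[s]);
        cudaFreeHost(host);
        cudaFree(dev);
        printf("done\n");
        return 0;
    }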
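
A rank-k update sketch for the GEAM/ZHERK item above (matrix sizes are illustrative, the GEAM variant is not shown, and this is an assumption about how the zher calls could be batched, not the gpaw implementation): k rank-1 ZHER updates C += x_j x_j^H are mathematically equivalent to one ZHERK call C += X X^H on the matrix X whose columns are the vectors, which moves the work from a memory-bound BLAS2 routine to a compute-bound BLAS3 one.

    // zherk_sketch.cu -- build with: nvcc zherk_sketch.cu -lcublas
    #include <cstdio>
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 512, k = 64;   // k vectors of length n (illustrative sizes)
        cuDoubleComplex *X, *C;
        cudaMalloc((void **)&X, (size_t)n * k * sizeof(cuDoubleComplex));  // columns = vectors
        cudaMalloc((void **)&C, (size_t)n * n * sizeof(cuDoubleComplex));
        cudaMemset(X, 0, (size_t)n * k * sizeof(cuDoubleComplex));
        cudaMemset(C, 0, (size_t)n * n * sizeof(cuDoubleComplex));

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 1.0;

        // BLAS2 route: k memory-bound rank-1 updates, C += x_j * x_j^H.
        for (int j = 0; j < k; ++j)
            cublasZher(handle, CUBLAS_FILL_MODE_UPPER, n,
                       &alpha, X + (size_t)j * n, 1, C, n);

        // BLAS3 route: one compute-bound rank-k update, C += X * X^H.
        cublasZherk(handle, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                    n, k, &alpha, X, n, &beta, C, n);

        cudaDeviceSynchronize();
        cublasDestroy(handle);
        cudaFree(X);
        cudaFree(C);
        printf("done\n");
        return 0;
    }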
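
A peer-to-peer copy sketch for the UVA/P2P/IPC item above (buffer size is illustrative; the IPC path between processes is not shown): within one process, peer access can be enabled between two GPUs, and cudaMemcpyPeer then moves data directly between them, falling back to staging through host memory when P2P is unavailable.

    // p2p_copy.cu -- direct GPU-to-GPU copy within one process
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev < 2) { printf("need at least two GPUs\n"); return 0; }

        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 0, 1);
        printf("GPU 0 can access GPU 1 directly: %d\n", can_access);

        const size_t bytes = 16 << 20;
        double *buf0, *buf1;
        cudaSetDevice(0);
        cudaMalloc((void **)&buf0, bytes);
        if (can_access)
            cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaMalloc((void **)&buf1, bytes);

        // Direct copy over the PCIe bus when P2P is enabled; otherwise the
        // runtime stages the transfer through host memory.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }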