What We Learned at GTC 2013

  • met the main developer of espresso-GPU
  • FD stencils: tricky, memory-bandwidth limited; CPUs are as good as GPUs if the code is well written
  • new CUDA Python from Continuum Analytics
  • more about GPUDirect, RDMA (waiting for driver software)
  • can use UVA with our current hardware/software
  • GPUDirect and RDMA broken by QPI
  • MVAPICH support for the above (more cutting-edge than Open MPI)
  • LSF support for GPU clusters
  • improved "metrics" for nvprof and racecheck analysis in cuda-memcheck in cuda 5.5
  • jun won't have to take the sqrt in quite so ugly a way with new cublas
  • no GPU scalapack (could implement by having scalapack call GPU lapack routines)
  • upcoming multi-gpu cublas
  • molecular dynamics easier than DFT on GPU
  • GEMM is inefficient for "narrow" vectors (optimization in progress)
  • "Datatype" idea in MPI for moving strided data between nodes

Quantum ESPRESSO GPU Information (from Filippo Spiga at GTC 2013)

  • 2-3x speedup for small systems
  • MPI + OpenMP + GPU
  • tried on 2 GPUs per node
  • for 8 CPUs + 8 GPUs, maybe run 8 MPI processes (not done by anybody; may run into a memory bottleneck)
  • multiple jobs using same GPU OK
    • espresso enforces memory management
    • prints warning if it runs out of memory
  • relies on faster kernels; slower kernels maybe use 1 MPI rank
  • uses MAGMA for diagonalization
  • matrices on the order of 10000x10000
  • if the system is larger, need to switch to ScaLAPACK instead of MAGMA
  • might have to disable some GPUs to maximize bandwidth (pick the ones with
    the most bandwidth; see the sketch after this list)
  • no P2P
  • GPU code is a "plug-in"
  • have support for "screenings"
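
A minimal sketch of the "pick the GPUs with the most bandwidth" idea, using the CUDA runtime API to rank the visible devices by theoretical peak memory bandwidth (the factor of 2 assumes double-data-rate memory; nothing here comes from the Quantum ESPRESSO code itself):

    /* Rank the visible GPUs by theoretical peak memory bandwidth so a job
     * can be restricted to the fastest ones (e.g. via CUDA_VISIBLE_DEVICES).
     * Peak bandwidth ~ 2 (DDR) * memory clock * bus width in bytes. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        for (int d = 0; d < ndev; d++) {
            struct cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);

            /* memoryClockRate is reported in kHz, memoryBusWidth in bits */
            double gbps = 2.0 * prop.memoryClockRate * 1e3
                              * (prop.memoryBusWidth / 8.0) / 1e9;

            printf("device %d (%s): ~%.0f GB/s peak memory bandwidth\n",
                   d, prop.name, gbps);
        }
        return 0;
    }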

future development possibilities:

  • spin magnetization
  • potential long-term gain: split real gamma (breaks the structure of the code)
  • porting the "phonon" portion of the code
  • for PW: people look at EXX, but is it "closed source"?

people:
Filippo Spiga
Girotto

What We Learned at GTC 2012

  • how large blocks fit onto the SM (esp. wrt shared memory)
  • NERSC GPU cluster (Dirac) usable?
  • email addresses of the FFT/ZHER guy
  • occupancy spreadsheet
    • registers per thread
    • threads per block
    • shared memory per block
  • access:
    • shared memory: 10 clock ticks
    • global memory: 400-800 clock ticks
  • L1 cache/shared memory size can be traded off (16/48 kB)
  • number of streams: 16 for 2.0 devices
  • ZHER improves 50% in CUDA 5 ("enable atomics")
  • can try GEAM or ZHERK instead of ZHER
  • BLAS2 functions are memory-bound, BLAS3 functions compute-bound
  • can maybe get access to the ZHER source if we really need it
  • 3D FFTs might be good
  • UCLA GPU cluster available?
  • OpenACC only available for certain compilers (PGI, Cray)
  • new Kepler/CUDA 5 features:
    • dynamic parallelism
    • Hyper-Q for streams
    • slower clock speed, more cores
    • memory 5x faster (maybe because of more register space)
  • startup time with multiple GPUs is a known problem
  • nvvp is dying: replaced by the Eclipse-based Nsight
    • much better information with Nsight
  • a VASP person says we will win with a 100x100x100 grid
  • use single precision for pre-conditioning
  • read the "best-practices" manual to understand nvvp profiling
  • unified virtual address space (UVA) for transferring data (see the sketch after this list)
    • P2P for multiple GPUs within a process
    • IPC for multiple GPUs between processes
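
A minimal sketch of the UVA/P2P item above for two GPUs owned by a single process (device numbers 0 and 1 are assumed, and error checking is omitted for brevity):

    /* With unified virtual addressing, one process can copy directly between
     * two GPUs; enabling peer access lets the copy bypass host memory where
     * the hardware allows it. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t nbytes = 1 << 20;   /* 1 MiB, arbitrary */
        float *buf0 = NULL, *buf1 = NULL;

        cudaSetDevice(0);
        cudaMalloc((void **)&buf0, nbytes);

        cudaSetDevice(1);
        cudaMalloc((void **)&buf1, nbytes);

        /* Ask whether device 1 can access device 0's memory, and enable it. */
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 1, 0);
        if (can_access)
            cudaDeviceEnablePeerAccess(0, 0);   /* current device is 1 */

        /* Under UVA the runtime infers which device owns each pointer,
         * so cudaMemcpyDefault is enough for the device-to-device copy. */
        cudaMemcpy(buf1, buf0, nbytes, cudaMemcpyDefault);

        printf("peer access %s; copied %zu bytes from device 0 to device 1\n",
               can_access ? "enabled" : "unavailable", nbytes);

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }

For the cross-process case in the last bullet, the buffer would instead be exported with cudaIpcGetMemHandle and opened in the other process with cudaIpcOpenMemHandle.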