You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 18 Next »

big steps:

  1. standalone main calls "work" kernel we call with GPU pointers (completed)
  2. standalone main calls exc_vxc (RPBE only) interface with GPU pointers
  3. gpaw CPU version uses exc_vxc on GPU (we give exc_vxc GPU pointers)
  4. gpaw GPU version uses exc_vxc on GPU

plan for step(2):

  • starting to work in libxc source
  • start with work_gga_x.c. make work_gga_x a "shell".
  • try using nvcc for everything

questions:

  • we may run out of memory when putting more stuff on GPU
  • can gga.c call a "kernel pointer" or does work_gga_x become a "shell" that calls kernel?

to make an XC(gga_type) pointer "p" on the device:

  • need the size of params
  • swap out the info/params pointers for device pointers
  • p gpu-initialization happens at func_init time

to make "work" functions into a kernel:

  • need a _global_ in the work
  • need a _device_ in the rpbe

Porting libxc: Lessons Learned

  • use nvcc for all mixed host/gpu code
  • originally had problems linking with nvcc. maybe messed up by "-x cu"? used gcc instead. but later nvcc linked OK for top level executable.
  • nvcc does C++ mangling. need extern "C" in some cases.

Things we need to deal with:

  • can't call external _device_ function with nvcc-compiled code? (workaround: use .cuh files with statically declared functions)
  • what to do about k functionals? (multiple includes of work_gga_x.c)
  • kludged local "static/global" variables (statics inside kernel don't work?)
  • make the copying of "p" beautiful
    • size of params problem (might be other similar problems here)
    • where do we do the copy?
  • gpu calling back to host code (and potentially vice-versa)
  • p sometimes needs to contain GPU function pointers and perhaps associated data (e.g. gga calling lda)
  • stride problem for spin indices
  • memset in gga.c etc.

Process for RPBE:

  • use nvcc for everything (./configure CC=nvcc CFLAGS="-arch=sm_20")
  • rename gga_x_rpbe.c to .cu, also in src/Makefile
  • added _device_ to gga_x_rpbe.c, and "extern C" to "info" struct
  • included work_gga_x.cu in the gga_x_rpbe.cu with _global_
  • removed the memset in gga.c

CUDA Problems

  • Unique symbol problem, including the same .cub in multiple files can't set breakpoint. Need cuda5 for linkers?
  • Slow startup, 30 sec wait.
  • sometimes, we can't step into a function (if it is a file included in a ".cuh"? (multi-layer include))
(cuda-gdb) 
200	/tmp/tmpxft_00001dc8_00000000-7_gga_c_pbe.cpp3.i: No such file or directory.
	in /tmp/tmpxft_00001dc8_00000000-7_gga_c_pbe.cpp3.i



(cuda-gdb) 
func (xs=0xdddddddddddddddd, p=warning: Variable is not live at this point. Value is undetermined.
0x0, order=warning: Variable is not live at this point. Value is undetermined.
0, rs=warning: Variable is not live at this point. Value is undetermined.
0, zeta=-1.4568159901474629e+144, xt=warning: Variable is not live at this point. Value is undetermined.
0, f=warning: Variable is not live at this point. Value is undetermined.
0x0, dfdrs=0xdddddddddddddddd, dfdz=0xdddddddddddddddd, dfdxt=0xdddddddddddddddd, dfdxs=warning: Variable is not live at this point. Value is undetermined.
0x0, d2fdrs2=0xdddddddddddddddd, d2fdrsz=0xdddddddddddddddd, d2fdrsxt=0xdddddddddddddddd, d2fdrsxs=0xdddddddddddddddd, d2fdz2=0xdddddddddddddddd, d2fdzxt=0xdddddddddddddddd, d2fdzxs=0xdddddddddddddddd, d2fdxt2=0xdddddddddddddddd, d2fdxtxs=0xdddddddddddddddd, d2fdxs2=0xdddddddd) at gga_c_pbe.cu:271
271	}

Recipe for Moving libxc Functional to GPU

This requires that the functional use the common "work.c" mechanism. Some functionals (e.g. tpss_c seem to not do this yet).

  • if one doesn't exist, create an "init" function
  • add copy_gga (or copy_lda/copy_mgga as appropriate) to the init function
  • declare "static XC(mgga_type) p_d" and "xc_mgga_type copy_mgga(xc_mggatype* h, size_t sizeofparams)" at the top of the .cu file
  • change "static FLOAT" to "static const _device_ FLOAT"
  • change routines for equation routines from "static void" to "static inline _device_ void"
  • at "extern "C" const" in front of XC(func_info_type) declaration
  • change the functional number (first parameter in XC(func_info_type)). We typically add _GPU to the end
  • change the #define for the functional number at the top of the file. We typically add 1000
  • save the functional as a .cu file
  • changed "#include "work_mgga_x.c"" to "#include "work_mgga_x.cuh""
  • some functionals will call external functions on the GPU like XC(rho2dzeta). May need to cobble together
    declaration header files for these.

Recipe for Moving General "Work" routine to GPU

This need to be done once for LDA/GGA/MGGA x/c for the common "work.c" file (total of 6 of these)

  • copy "static void work_mgga_x" declaration (including arguments) to the bottom. this will become cpu wrapper for the gpu kernel.
  • in this new "wrapper" routine
    • change "static void" declaration to "extern "C" static void" declaration
    • define block_size and n_blocks (based on number of points)
    • call the gpuwork kernel
    • pass p_d as the first argument
  • in the "kernel" routine
    • change kernel "static void work_mgga_x" declaration to "static void _global_ gpuwork"
    • delete "ip" loop over number of points
    • add "int idx = blockIdx.x*blockDim.x + threadIdx.x"
    • add "if (idx<np)"
    • pointers are normally incremented at the bottom. move all the stuff after "end_ip_loop:" immediately after the above "if" statement
    • add "*idx" at the end of each pointer assignment
  • save work as new copy with .cuh extension
  • No labels