big steps:
- standalone main calls "work" kernel we call with GPU pointers (completed)
- standalone main calls exc_vxc (RPBE only) interface with GPU pointers
- gpaw CPU version uses exc_vxc on GPU (we give exc_vxc GPU pointers)
- gpaw GPU version uses exc_vxc on GPU
plan for step(2):
- starting to work in libxc source
- start with work_gga_x.c. make work_gga_x a "shell".
- try using nvcc for everything
questions:
- we may run out of memory when putting more stuff on GPU
- can gga.c call a "kernel pointer" or does work_gga_x become a "shell" that calls kernel?
to make an XC(gga_type) pointer "p" on the device:
- need the size of params
- swap out the info/params pointers for device pointers
- p gpu-initialization happens at func_init time
to make "work" functions into a kernel:
- need a _global_ in the work
- need a _device_ in the rpbe
Porting libxc: Lessons Learned
- use nvcc for all mixed host/gpu code
- originally had problems linking with nvcc. maybe messed up by "-x cu"? used gcc instead. but later nvcc linked OK for top level executable.
- nvcc does C++ mangling. need extern "C" in some cases.
Things we need to deal with:
- can't call external _device_ function with nvcc-compiled code? (workaround: use .cuh files with statically declared functions)
- what to do about k functionals? (multiple includes of work_gga_x.c)
- kludged local "static/global" variables (statics inside kernel don't work?)
- make the copying of "p" beautiful
- size of params problem (might be other similar problems here)
- where do we do the copy?
- gpu calling back to host code (and potentially vice-versa)
- p sometimes needs to contain GPU function pointers and perhaps associated data (e.g. gga calling lda)
- stride problem for spin indices
- memset in gga.c etc.
Process for RPBE:
- use nvcc for everything (./configure CC=nvcc CFLAGS="-arch=sm_20")
- rename gga_x_rpbe.c to .cu, also in src/Makefile
- added _device_ to gga_x_rpbe.c, and "extern C" to "info" struct
- included work_gga_x.cu in the gga_x_rpbe.cu with _global_
- removed the memset in gga.c
CUDA Problems
- Unique symbol problem, including the same .cub in multiple files can't set breakpoint. Need cuda5 for linkers?
- Slow startup, 30 sec wait.
- sometimes, we can't step into a function (if it is a file included in a ".cuh"? (multi-layer include))
(cuda-gdb) 200 /tmp/tmpxft_00001dc8_00000000-7_gga_c_pbe.cpp3.i: No such file or directory. in /tmp/tmpxft_00001dc8_00000000-7_gga_c_pbe.cpp3.i (cuda-gdb) func (xs=0xdddddddddddddddd, p=warning: Variable is not live at this point. Value is undetermined. 0x0, order=warning: Variable is not live at this point. Value is undetermined. 0, rs=warning: Variable is not live at this point. Value is undetermined. 0, zeta=-1.4568159901474629e+144, xt=warning: Variable is not live at this point. Value is undetermined. 0, f=warning: Variable is not live at this point. Value is undetermined. 0x0, dfdrs=0xdddddddddddddddd, dfdz=0xdddddddddddddddd, dfdxt=0xdddddddddddddddd, dfdxs=warning: Variable is not live at this point. Value is undetermined. 0x0, d2fdrs2=0xdddddddddddddddd, d2fdrsz=0xdddddddddddddddd, d2fdrsxt=0xdddddddddddddddd, d2fdrsxs=0xdddddddddddddddd, d2fdz2=0xdddddddddddddddd, d2fdzxt=0xdddddddddddddddd, d2fdzxs=0xdddddddddddddddd, d2fdxt2=0xdddddddddddddddd, d2fdxtxs=0xdddddddddddddddd, d2fdxs2=0xdddddddd) at gga_c_pbe.cu:271 271 }
Recipe for Moving libxc Functional to GPU
This requires that the functional use the common "work.c" mechanism. Some functionals (e.g. tpss_c seem to not do this yet).
- if one doesn't exist, create an "init" function
- add copy_gga (or copy_lda/copy_mgga as appropriate) to the init function
- declare "static XC(mgga_type) p_d" and "xc_mgga_type copy_mgga(xc_mggatype* h, size_t sizeofparams)" at the top of the .cu file
- change "static FLOAT" to "static const _device_ FLOAT"
- change routines for equation routines from "static void" to "static inline _device_ void"
- at "extern "C" const" in front of XC(func_info_type) declaration
- change the functional number (first parameter in XC(func_info_type)). We typically add _GPU to the end
- change the #define for the functional number at the top of the file. We typically add 1000
- save the functional as a .cu file
- changed "#include "work_mgga_x.c"" to "#include "work_mgga_x.cuh""
Recipe for Moving General "Work" routine to GPU
This need to be done once for LDA/GGA/MGGA x/c for the common "work.c" file (total of 6 of these)
- copy "static void work_mgga_x" declaration (including arguments) to the bottom. this will become cpu wrapper for the gpu kernel.
- in this new "wrapper" routine
- change "static void" declaration to "extern "C" static void" declaration
- define block_size and n_blocks (based on number of points)
- call the gpuwork kernel
- pass p_d as the first argument
- in the "kernel" routine
- change kernel "static void work_mgga_x" declaration to "static void _global_ gpuwork"
- delete "ip" loop over number of points
- add "int idx = blockIdx.x*blockDim.x + threadIdx.x"
- add "if (idx<np)"
- pointers are normally incremented at the bottom. move all the stuff after "end_ip_loop:" immediately after the above "if" statement
- add "*idx" at the end of each pointer assignment
- save work as new copy with .cuh extension