Ideas for RPA Crash
- verify VBIOS settings same as nvidia (yes)
- try without "setenv CUDA_DEVICE_WAITS_ON_EXCEPTION 1" (still fails)
- run crash_test.c at keeneland (fails there too)
- change bios settings as recommended by colfax (still fails)
- swap C2075 with M2090 (problem follows the M2090!)
- remove IB card (still fails)
- try random matrix data instead of fixed data (fails with random, works with fixed)
- run colfax memory test
- reduce power consumption with "nvidia-smi --perf-limit=P4" (also tried P9): fixes the crashes, but data is still corrupted
- try 1 gpu per node, to see if gpus are "fighting", or if cooling/power is a problem (still fails)
- look for particular set of gpus that fail
- switch to cudaMemcpy in crash_test.c (see the cudaMemcpy round-trip sketch after this list)
- make matrix bigger in crash_test.c
- run crash_test.c on 1 gpu per node (still fails)
- compare suncat-gpu-test (doesn't fail) and suncat-gpu (fails):
- C2075 vs. M2090
- 7 vs. 8 gpus
- IB card
- cooling
- power cables? (c13 vs. c19)
- run nbody gpu test, as suggested by colfax (doesn't fail)
- read gpu temps via ipmi: code fails with temps around 69C (lower than the C2075, where nvidia-smi reports 88C)
- read rack temps (67-72 at inlet, 86-94 at outlet)
- run crash_test.py with P9 (no failures)
- run 32-gpu N-N with 3 gpus per node in exclusive mode (rack 1 still warm at the outlets: 88, 88, 92 top to bottom on the front panel). Still saw nan's in dbgcrash_fast/try19.
- run with rack doors open, or change rack cooling behavior
- check that PyArg_ParseTuple types match between python/C (see the PyArg_ParseTuple sketch after this list)
- cuda4
- check power
- small C version of crash_test.py
- look at ipmi errors on gpu24/26: nothing
- look in /var/log/messages for errors from driver
- security scans
- run gpu hardware tests (with colfax software?)
- gcc instead of icc
- small file crash (yes, crashed after 2 days)
- keeneland (saw 1 nan failure and 1 kernel launch failure)
- reproduce the crash with a simple gemm test (see the gemm sketch after this list)
- does it crash on 1 node? (yes on suncat-gpu with 4 cores (gpu20 and gpu26), but not on suncat-gpu-test)
- mpi errors?
- try magma GEMM (still crashes)
- eliminate IB fork warning (still crashes)
- race condition between cublasDestroy/cublasCreate? (no, happens after first create)
- study with valgrind (dbgcrash_fast/try11,13,14 show some uninitialized data in mpisum)
- study with cuda-memcheck (see dbgcrash_fast/try12): looks clean for the 32-node job even when the data gets corrupted (the many warnings about python numerical overflows indicate the failure has occurred)
- understand imprecise exceptions (device-side exceptions can be reported asynchronously on a later API call, so the reported error may not point at the faulting kernel)
- run at nvidia with 8x M2090, a Tyan motherboard, and cuda 5.0 (works!)
- hardware problem (check for common node, too many jobs in the logfiles)
- read the code: cudaMemcpy memory overrun?
- discontiguous numpy array? (put in asserts; see the contiguity-assert sketch after this list)
- did get a memory error when running racecheck
- no errors from cuda-memcheck heap check
- read code to look for race conditions in cukernels.cu (even though problem existed before the addition of those kernels)
- ran cuda-memcheck racecheck: only saw errors from cublas
- check ecc enabled. looked with "nvidia-smi -q"
    Ecc Mode
        Current : Enabled
        Pending : Enabled
...
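
Sketch for the cudaMemcpy version of crash_test.c (referenced above). This is only an assumed shape of the test: the matrix size, the double precision, and the fixed fill value are guesses, not the real crash_test.c parameters.

    /* hypothetical cudaMemcpy round-trip check; n and the fill value are
       assumptions, not the real crash_test.c parameters */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
        fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e)); exit(1); } } while (0)

    int main(void)
    {
        const size_t n = 4096;                       /* assumed size */
        const size_t bytes = n * n * sizeof(double);
        double *h_in = malloc(bytes), *h_out = malloc(bytes), *d;
        size_t i;

        for (i = 0; i < n * n; i++)
            h_in[i] = 1.0;                           /* fixed-data variant */

        CHECK(cudaMalloc((void **)&d, bytes));
        CHECK(cudaMemcpy(d, h_in, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(h_out, d, bytes, cudaMemcpyDeviceToHost));

        for (i = 0; i < n * n; i++)
            if (h_out[i] != h_in[i]) {
                fprintf(stderr, "mismatch at %zu: %g\n", i, h_out[i]);
                return 1;
            }
        printf("round trip ok\n");
        CHECK(cudaFree(d));
        free(h_in);
        free(h_out);
        return 0;
    }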
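
Sketch for the PyArg_ParseTuple check. The function name and argument list below are made up; the point is just that each format character has to match the C type exactly ("i" -> int, "l" -> long, "d" -> double, "O" -> PyObject *), since a mismatch silently writes through a wrong-sized pointer and corrupts memory.

    /* illustrative wrapper, not the real binding */
    #include <Python.h>

    static PyObject *gemm_wrapper(PyObject *self, PyObject *args)
    {
        PyObject *a_obj, *b_obj;
        int n;          /* "i" must point at an int, not a long/size_t */
        double alpha;   /* "d" must point at a double, not a float     */

        if (!PyArg_ParseTuple(args, "OOid", &a_obj, &b_obj, &n, &alpha))
            return NULL;

        /* ... hand a_obj/b_obj/n/alpha to the gpu code here ... */
        Py_RETURN_NONE;
    }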
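
Sketch for the simple gemm crash test: a standalone loop that repeats the same multiply and flags the first wrong or nan entry. The size, precision, iteration count, and use of the cublas v2 API are assumptions; error checking is mostly omitted. With both inputs filled with 1.0, every output entry should be exactly n (the sum of n ones is exact in double precision), so any deviation signals corruption.

    /* hypothetical minimal gemm loop; all parameters are illustrative */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 2048;                       /* assumed size */
        const size_t bytes = (size_t)n * n * sizeof(double);
        double *h = malloc(bytes), *d_a, *d_b, *d_c;
        double alpha = 1.0, beta = 0.0;
        cublasHandle_t handle;
        int iter, i;

        for (i = 0; i < n * n; i++)
            h[i] = 1.0;                           /* every C entry should come out as n */

        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h, bytes, cudaMemcpyHostToDevice);
        cublasCreate(&handle);

        for (iter = 0; iter < 1000; iter++) {
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, d_a, n, d_b, n, &beta, d_c, n);
            cudaMemcpy(h, d_c, bytes, cudaMemcpyDeviceToHost);  /* reuse h for the result */
            for (i = 0; i < n * n; i++)
                if (isnan(h[i]) || h[i] != (double)n) {
                    fprintf(stderr, "iter %d: bad value %g at index %d\n", iter, h[i], i);
                    return 1;
                }
        }
        printf("all iterations ok\n");
        cublasDestroy(handle);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h);
        return 0;
    }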
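
Sketch for the discontiguous-numpy-array asserts. The numpy C API macros are real; the surrounding helper and the expected dtype are illustrative. The python-side equivalent is asserting a.flags.c_contiguous on the arrays before the call.

    /* hypothetical checks at the top of the C wrapper; a strided slice
       passed from python would trip these, and a raw cudaMemcpy of its
       data pointer would otherwise copy the wrong bytes */
    #include <Python.h>
    #include <numpy/arrayobject.h>
    #include <assert.h>

    static void check_array(PyArrayObject *a)
    {
        assert(PyArray_ISCONTIGUOUS(a));        /* C-contiguous     */
        assert(PyArray_ISALIGNED(a));           /* properly aligned */
        assert(PyArray_TYPE(a) == NPY_DOUBLE);  /* expected dtype   */
    }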