Ideas for RPA Crash
- verify VBIOS settings same as nvidia (yes)
- try without "setenv CUDA_DEVICE_WAITS_ON_EXCEPTION 1" (still fails)
- run crash_test.c at keeneland (fails there too)
- change bios settings as recommended by colfax (still fails)
- swap C2075 with M2090 (problem follows the M2090!)
- remove IB card (still fails)
- try random matrix data instead of fixed data (fails with random, works with fixed)
- run colfax memory test
- reduce power consumption with "nvidia-smi --perf-limit=P4" (also tried P9): fixes the crashes, but data is still corrupted
- try 1 gpu per node, to see if gpus are "fighting", or if cooling/power is a problem (still fails)
- look for particular set of gpus that fail
- switch to cudaMemcpy in crash_test.c (see the cudaMemcpy round-trip sketch after this list)
- make matrix bigger in crash_test.c
- run crash_test.c on 1 gpu per node (still fails)
- compare suncat-gpu-test (doesn't fail) and suncat-gpu (fails):
- C2075 vs. M2090
- 7 vs. 8 gpus
- IB card
- cooling
- power cables? (c13 vs. c19)
- run nbody gpu test, as suggested by colfax (doesn't fail)
- read gpu temps via ipmi: code fails with temps around 69C (lower than the C2075, where nvidia-smi reports 88C)
- read rack temps (67-72 at inlet, 86-94 at outlet)
- run crash_test.py with P9 (no failures)
- run 32-gpu N-N with 3 gpus per node in exclusive mode (rack 1 still warm at the outlets: 88, 88, 92 top to bottom on the front panel). Still saw nan's in dbgcrash_fast/try19.
- run with rack doors open, or change rack cooling behavior
- check that PyArg_ParseTuple types match between python/C (see the PyArg_ParseTuple sketch after this list)
- cuda4
- check power
- small C version of crash_test.py
- look at ipmi errors on gpu24/26: nothing
- look in /var/log/messages for errors from driver
- security scans
- run gpu hardware tests (with colfax software?)
- gcc instead of icc
- small file crash (yes, crashed after 2 days)
- keeneland (saw 1 nan failure and 1 kernel launch failure)
- reproduce the crash with a simple gemm test (see the gemm sketch after this list)
- does it crash on 1 node? (yes on suncat-gpu with 4 cores (gpu20 and gpu26), but not on suncat-gpu-test)
- mpi errors?
- try magma GEMM (still crashes)
- eliminate IB fork warning (still crashes)
- race condition between cublasDestroy/cublasCreate? (no, happens after first create)
- study with valgrind (dbgcrash_fast/try11,13,14 show some uninitialized data in mpisum)
- study with cuda-memcheck (see dbgcrash_fast/try12): looks clean for the 32-node job even when the data gets corrupted (the many warnings about python numerical overflows indicate the failure has occurred)
- understand imprecise exceptions (device-side exceptions can be reported asynchronously on a later API call, so the reported error may not point at the faulting kernel)
- run at nvidia with 8x M2090, a Tyan motherboard, and cuda 5.0 (works!)
- hardware problem (check for common node, too many jobs in the logfiles)
- read the code: cudaMemcpy memory overrun?
- discontiguous numpy array? (put in asserts; see the contiguity-assert sketch after this list)
- did get a memory error when running racecheck
- no errors from cuda-memcheck heap check
- read code to look for race conditions in cukernels.cu (even though problem existed before the addition of those kernels)
- ran cuda-memcheck racecheck: only saw errors from cublas
- check ecc enabled. looked with "nvidia-smi -q"
    Ecc Mode
        Current : Enabled
        Pending : Enabled
...
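
Sketch for the cudaMemcpy version of crash_test.c (referenced above). This is only an assumed shape of the test: the matrix size, the double precision, and the fixed fill value are guesses, not the real crash_test.c parameters.

    /* hypothetical cudaMemcpy round-trip check; n and the fill value are
       assumptions, not the real crash_test.c parameters */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    #define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
        fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e)); exit(1); } } while (0)

    int main(void)
    {
        const size_t n = 4096;                       /* assumed size */
        const size_t bytes = n * n * sizeof(double);
        double *h_in = malloc(bytes), *h_out = malloc(bytes), *d;
        size_t i;

        for (i = 0; i < n * n; i++)
            h_in[i] = 1.0;                           /* fixed-data variant */

        CHECK(cudaMalloc((void **)&d, bytes));
        CHECK(cudaMemcpy(d, h_in, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(h_out, d, bytes, cudaMemcpyDeviceToHost));

        for (i = 0; i < n * n; i++)
            if (h_out[i] != h_in[i]) {
                fprintf(stderr, "mismatch at %zu: %g\n", i, h_out[i]);
                return 1;
            }
        printf("round trip ok\n");
        CHECK(cudaFree(d));
        free(h_in);
        free(h_out);
        return 0;
    }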
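
Sketch for the PyArg_ParseTuple check. The function name and argument list below are made up; the point is just that each format character has to match the C type exactly ("i" -> int, "l" -> long, "d" -> double, "O" -> PyObject *), since a mismatch silently writes through a wrong-sized pointer and corrupts memory.

    /* illustrative wrapper, not the real binding */
    #include <Python.h>

    static PyObject *gemm_wrapper(PyObject *self, PyObject *args)
    {
        PyObject *a_obj, *b_obj;
        int n;          /* "i" must point at an int, not a long/size_t */
        double alpha;   /* "d" must point at a double, not a float     */

        if (!PyArg_ParseTuple(args, "OOid", &a_obj, &b_obj, &n, &alpha))
            return NULL;

        /* ... hand a_obj/b_obj/n/alpha to the gpu code here ... */
        Py_RETURN_NONE;
    }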
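
Sketch for the simple gemm crash test: a standalone loop that repeats the same multiply and flags the first wrong or nan entry. The size, precision, iteration count, and use of the cublas v2 API are assumptions; error checking is mostly omitted. With both inputs filled with 1.0, every output entry should be exactly n (the sum of n ones is exact in double precision), so any deviation signals corruption.

    /* hypothetical minimal gemm loop; all parameters are illustrative */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
        const int n = 2048;                       /* assumed size */
        const size_t bytes = (size_t)n * n * sizeof(double);
        double *h = malloc(bytes), *d_a, *d_b, *d_c;
        double alpha = 1.0, beta = 0.0;
        cublasHandle_t handle;
        int iter, i;

        for (i = 0; i < n * n; i++)
            h[i] = 1.0;                           /* every C entry should come out as n */

        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h, bytes, cudaMemcpyHostToDevice);
        cublasCreate(&handle);

        for (iter = 0; iter < 1000; iter++) {
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                        &alpha, d_a, n, d_b, n, &beta, d_c, n);
            cudaMemcpy(h, d_c, bytes, cudaMemcpyDeviceToHost);  /* reuse h for the result */
            for (i = 0; i < n * n; i++)
                if (isnan(h[i]) || h[i] != (double)n) {
                    fprintf(stderr, "iter %d: bad value %g at index %d\n", iter, h[i], i);
                    return 1;
                }
        }
        printf("all iterations ok\n");
        cublasDestroy(handle);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h);
        return 0;
    }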
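
Sketch for the discontiguous-numpy-array asserts. The numpy C API macros are real; the surrounding helper and the expected dtype are illustrative. The python-side equivalent is asserting a.flags.c_contiguous on the arrays before the call.

    /* hypothetical checks at the top of the C wrapper; a strided slice
       passed from python would trip these, and a raw cudaMemcpy of its
       data pointer would otherwise copy the wrong bytes */
    #include <Python.h>
    #include <numpy/arrayobject.h>
    #include <assert.h>

    static void check_array(PyArrayObject *a)
    {
        assert(PyArray_ISCONTIGUOUS(a));        /* C-contiguous     */
        assert(PyArray_ISALIGNED(a));           /* properly aligned */
        assert(PyArray_TYPE(a) == NPY_DOUBLE);  /* expected dtype   */
    }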