...

switch to cudaMemcpy in crash_test.c
make matrix bigger in crash_test.c
run crash_test.c on 2 gpus per node (still fails)
compare suncat-gpu-test (doesn't fail) and suncat-gpu (fails):
- C2075 vs. M2090
- 7 vs. 8 gpus
- IB card
- cooling
- power cables? (c13 vs. c19)
run nbody gpu test (seems to need display, so doesn't work)
read gpu temps (nvidia-smi doesn't show them, neither does ipmi)
read rack temps (67-72 at inlet, 86-94 at outlet)
run crash_test.py with P9 (no failures)
run 32-gpu N-N with 3 gpus per node in exclusive mode (rack 1 still warm on the outputs: 88, 88, 92, top to bottom on the front panel; still saw NaNs in dbgcrash_fast/try19)
run with rack doors open, or change rack cooling behavior
check that PyArg_ParseTuple types match between python/C
cuda4
check power
small c version of crash_test.py
look at ipmi errors on gpu24/26: nothing
look in /var/log/messages for errors from driver
security scans
run gpu hardware tests (with colfax software?)
gcc instead of icc
small file crash (yes, crashed after 2 days)
keeneland (saw 1 NaN failure and 1 kernel launch failure)
simple gemm test crash
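
One way to make the "simple gemm test" self-checking is to compare each result against a host-side reference and flag NaNs. A minimal sketch in pure Python; the function names are made up for illustration, and the GPU call (cuBLAS or MAGMA) that would produce `result` is not shown:

```python
import math

def gemm(a, b):
    """Naive matrix product of two lists of rows (host-side reference)."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def find_bad_entries(result, reference, tol=1e-9):
    """Return (row, col, value) for entries that are NaN or far from the reference."""
    bad = []
    for i, (row, ref_row) in enumerate(zip(result, reference)):
        for j, (x, ref) in enumerate(zip(row, ref_row)):
            # NaN never compares equal, so test for it explicitly first.
            if math.isnan(x) or abs(x - ref) > tol * max(1.0, abs(ref)):
                bad.append((i, j, x))
    return bad

if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    reference = gemm(a, b)
    # Stand-in for a corrupted GPU result: one entry replaced by NaN.
    result = [row[:] for row in reference]
    result[0][1] = float("nan")
    print(find_bad_entries(result, reference))
```

Recording the position of each bad entry, not just whether one exists, would show whether corruption hits whole blocks or isolated values.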
does it crash on 1 node? (yes on suncat-gpu 4 cores (gpu20 and gpu26), but not on suncat-gpu-test)
mpi errors?
magma GEMM? (still crashes)
eliminate fork warning (still crashes)
race condition between cublasDestroy/cublasCreate? (no, happens after first create)
valgrind (dbgcrash_fast/try11,13,14 show some uninitialized data in mpisum)
cuda-memcheck (see dbgcrash_fast/try12) looks clean for the 32-node job, even when the data gets corrupted (many warnings about python numerical overflows)
understand imprecise exceptions
lung work
hardware problem (check for common node, too many jobs in the logfiles)
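
The common-node check can be sketched as a small tally over the failed jobs. This assumes the node list for each failed job has already been pulled out of the logfiles into a dict; the job and node names below are invented placeholders, not real results:

```python
from collections import Counter

def node_failure_counts(failed_jobs):
    """Count how many failed jobs each node took part in."""
    counts = Counter()
    for nodes in failed_jobs.values():
        counts.update(set(nodes))  # count a node once per job
    return counts

def nodes_in_every_failure(failed_jobs):
    """Return the sorted list of nodes that appear in all failed jobs."""
    node_sets = [set(nodes) for nodes in failed_jobs.values()]
    if not node_sets:
        return []
    return sorted(set.intersection(*node_sets))

if __name__ == "__main__":
    # Hypothetical input: job name -> nodes the job ran on.
    failed_jobs = {
        "job_a": ["node01", "node02"],
        "job_b": ["node02", "node03"],
        "job_c": ["node02"],
    }
    print(node_failure_counts(failed_jobs).most_common())
    print(nodes_in_every_failure(failed_jobs))
```

An empty intersection would argue against a single bad node, though a node that appears in most failures could still be worth a closer look.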
read the code: cudaMemcpy memory overrun?
discontiguous numpy array? (put in asserts)
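
A possible shape for those asserts, as a sketch only: the helper name is made up, and it goes through the standard buffer protocol so it runs without numpy. numpy arrays export that protocol, and they also expose the same information directly as `arr.flags['C_CONTIGUOUS']`.

```python
def assert_c_contiguous(buf, name="array"):
    """Raise AssertionError unless buf is a single C-contiguous block.

    Returns the number of bytes that would be safe to hand to a raw copy.
    """
    view = memoryview(buf)
    assert view.c_contiguous, (
        f"{name} is not C-contiguous "
        f"(shape={view.shape}, strides={view.strides})"
    )
    return view.nbytes

if __name__ == "__main__":
    data = bytearray(16)
    print(assert_c_contiguous(data, "data"))
    # A strided slice shares memory with data but skips every other byte,
    # so a raw copy of its nominal size would read the wrong elements.
    strided = memoryview(data)[::2]
    try:
        assert_c_contiguous(strided, "strided")
    except AssertionError as err:
        print(err)
```

Calling a check like this just before each pointer is passed to the C layer would rule out a sliced or transposed view being copied as if it were dense.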
did get a memory error when running racecheck
no errors from heap check
read code to look for race conditions in cukernels.cu (even though problem existed before the addition of those kernels)
ran cuda-memcheck racecheck: only saw errors from cublas
check ecc enabled: looked with "nvidia-smi -q":
    Ecc Mode
        Current : Enabled
        Pending : Enabled
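
To repeat this check on every node without reading the report by eye, the saved output could be parsed. A rough sketch, assuming the report uses exactly the "Ecc Mode" / "Current" / "Pending" layout shown above (other driver versions may print it differently); the function names are illustrative:

```python
def parse_ecc_modes(report):
    """Return one (current, pending) tuple per 'Ecc Mode' block in the report."""
    modes = []
    lines = [line.strip() for line in report.splitlines()]
    for i, line in enumerate(lines):
        if line.lower() != "ecc mode":
            continue
        fields = {}
        # Assumption: the two lines after the header are Current and Pending.
        for follow in lines[i + 1:i + 3]:
            key, sep, value = follow.partition(":")
            if sep:
                fields[key.strip().lower()] = value.strip()
        modes.append((fields.get("current"), fields.get("pending")))
    return modes

def ecc_fully_enabled(report):
    """True only if at least one block was found and all are Enabled/Enabled."""
    modes = parse_ecc_modes(report)
    return bool(modes) and all(m == ("Enabled", "Enabled") for m in modes)

if __name__ == "__main__":
    sample = "Ecc Mode\n    Current : Enabled\n    Pending : Enabled\n"
    print(parse_ecc_modes(sample))
    print(ecc_fully_enabled(sample))
```

Treating a report with no "Ecc Mode" block as a failure avoids passing a node silently when the output format does not match the assumption.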

...