Ideas for RPA Crash

  • verify VBIOS settings same as nvidia (yes)
  • try without "setenv CUDA_DEVICE_WAITS_ON_EXCEPTION 1" (still fails)
  • run crash_test.c at keeneland (fails there too)
  • change bios settings as recommended by colfax (still fails)
  • swap C2075 with M2090 (problem follows the M2090!)
  • remove IB card (still fails)
  • try random matrix data instead of fixed data (fails with random, works with fixed)
  • run colfax memory test
  • reduce power consumption with "nvidia-smi --perf-limit=P4" (also tried P9; fixes crashes, but still get corrupt data)
  • try 1 gpu per node, to see if gpus are "fighting", or if cooling/power is a problem (still fails)
  • look for particular set of gpus that fail
  • switch to cudaMemcpy in crash_test.c
  • make matrix bigger in crash_test.c
  • run crash_test.c on 1 gpu per node (still fails)
  • compare suncat-gpu-test (doesn't fail) and suncat-gpu (fails):
    • C2075 vs. M2090
    • 7 vs. 8 gpus
    • IB card
    • cooling
    • power cables? (c13 vs. c19)
  • run nbody gpu test, as suggested by colfax (doesn't fail)
  • read gpu temps via ipmi: code fails with temps around 69C (lower than the C2075, where nvidia-smi reports 88C)
  • read rack temps (67-72 at inlet, 86-94 at outlet)
  • run crash_test.py with P9 (no failures)
  • run 32-gpu N-N with 3 gpus per node in exclusive mode (rack 1 still warm on the outputs: 88, 88, 92, top to bottom on the front panel). Still saw NaNs in dbgcrash_fast/try19.
  • run with rack doors open, or change rack cooling behavior
  • check that PyArg_ParseTuple types match between Python/C (see the interface sketch after this list)
  • cuda4
  • check power
  • small C version of crash_test.py
  • look at ipmi errors on gpu24/26: nothing
  • look in /var/log/messages for errors from driver
  • security scans
  • run gpu hardware tests (with colfax software?)
  • gcc instead of icc
  • small file crash (yes, crashed after 2 days)
  • keeneland (saw 1 nan failure and 1 kernel launch failure)
  • simple gemm test crash (see the GEMM stress-test sketch after this list)
  • does it crash on 1 node? (yes on suncat-gpu with 4 cores (gpu20 and gpu26), but not on suncat-gpu-test)
  • mpi errors?
  • try magma GEMM (still crashes)
  • eliminate IB fork warning (still crashes)
  • race condition between cublasDestroy/cublasCreate? (no, happens after first create)
  • study with valgrind (dbgcrash_fast/try11,13,14 show some uninitialized data in mpisum)
  • study with cuda-memcheck (see dbgcrash_fast/try12): looks clean for the 32-node job, even when the data gets corrupted (the many warnings about Python numerical overflows show the failure has occurred)
  • understand imprecise exceptions
  • run @nvidia with 8*M2090, Tyan motherboard, and cuda 5.0 (works!)
  • hardware problem (check for common node, too many jobs in the logfiles)
  • read the code: cudaMemcpy memory overrun?
  • discontiguous numpy array? (put in asserts; see the interface sketch after this list)
  • did get a memory error when running racecheck
  • no errors from cuda-memcheck heap check
  • read code to look for race conditions in cukernels.cu (even though problem existed before the addition of those kernels)
  • ran cuda-memcheck racecheck: only saw errors from cublas
  • check that ECC is enabled; looked with "nvidia-smi -q":
    Ecc Mode
        Current : Enabled
        Pending : Enabled
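
For the "simple gemm test crash" and "cudaMemcpy memory overrun" items above, a minimal stress-test sketch is shown below. This is not the actual crash_test.c: the file name, matrix size, iteration count, and build line are assumptions. It only illustrates the shape of such a test: fill matrices with random data (fixed data did not reproduce the failure), loop GEMM on the device, copy the result back with checked cudaMemcpy calls, and scan for NaNs.

    /* gemm_stress.c -- hypothetical standalone GEMM stress test, not the real
     * crash_test.c.  Matrix size, iteration count and build line are assumptions.
     * Build (assumed): nvcc -o gemm_stress gemm_stress.c -lcublas
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    static void check_cuda(cudaError_t err, const char *what)
    {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
            exit(1);
        }
    }

    static void check_cublas(cublasStatus_t st, const char *what)
    {
        if (st != CUBLAS_STATUS_SUCCESS) {
            fprintf(stderr, "%s: cublas status %d\n", what, (int)st);
            exit(1);
        }
    }

    int main(void)
    {
        const int n = 2048;                 /* assumed matrix size */
        const int iters = 10000;            /* assumed iteration count */
        const size_t bytes = (size_t)n * n * sizeof(double);

        double *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);
        for (size_t i = 0; i < (size_t)n * n; i++) {
            a[i] = drand48();               /* random data reproduced the failure */
            b[i] = drand48();
        }

        double *da, *db, *dc;
        check_cuda(cudaMalloc((void **)&da, bytes), "cudaMalloc a");
        check_cuda(cudaMalloc((void **)&db, bytes), "cudaMalloc b");
        check_cuda(cudaMalloc((void **)&dc, bytes), "cudaMalloc c");
        check_cuda(cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice), "H2D a");
        check_cuda(cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice), "H2D b");

        cublasHandle_t handle;
        check_cublas(cublasCreate(&handle), "cublasCreate");
        const double alpha = 1.0, beta = 0.0;

        for (int it = 0; it < iters; it++) {
            check_cublas(cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                                     &alpha, da, n, db, n, &beta, dc, n),
                         "cublasDgemm");
            check_cuda(cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost), "D2H c");

            /* scan the result for NaNs -- the symptom seen in dbgcrash_fast */
            for (size_t i = 0; i < (size_t)n * n; i++)
                if (isnan(c[i])) {
                    fprintf(stderr, "NaN at iteration %d, element %zu\n", it, i);
                    return 2;
                }
        }

        check_cublas(cublasDestroy(handle), "cublasDestroy");
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(a); free(b); free(c);
        printf("no NaNs after %d iterations\n", iters);
        return 0;
    }

One copy of this left looping on each gpu for a day or two would mimic the "small file crash" observation above (failure only appearing after ~2 days).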

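For the PyArg_ParseTuple and discontiguous-array items, a hypothetical sketch of the kind of interface checks meant is shown below. The module, function, and argument names (interface_check, gemm_wrapper, a, b, n) are made up rather than taken from the real bindings; the point is that the format characters passed to PyArg_ParseTuple must match the C variable types exactly, and that raw data pointers should only be handed to cudaMemcpy/cublas for contiguous arrays of the expected dtype.

    /* interface_check.c -- hypothetical sketch of the Python/C boundary checks;
     * module, function and argument names are made up, not the real bindings. */
    #include <Python.h>
    #include <numpy/arrayobject.h>

    static PyObject *gemm_wrapper(PyObject *self, PyObject *args)
    {
        PyArrayObject *a, *b;
        int n;

        /* The format string must match the C types exactly: "O!" checks that
         * the objects really are numpy arrays, and "i" must land in a C int --
         * parsing into a mismatched type silently corrupts memory. */
        if (!PyArg_ParseTuple(args, "O!O!i",
                              &PyArray_Type, &a, &PyArray_Type, &b, &n))
            return NULL;

        /* Refuse discontiguous views or unexpected dtypes before handing raw
         * pointers to cudaMemcpy/cublas. */
        if (!PyArray_ISCONTIGUOUS(a) || !PyArray_ISCONTIGUOUS(b))
            return PyErr_Format(PyExc_ValueError, "arrays must be contiguous");
        if (PyArray_TYPE(a) != NPY_DOUBLE || PyArray_TYPE(b) != NPY_DOUBLE)
            return PyErr_Format(PyExc_TypeError, "arrays must be float64");

        /* ... copy PyArray_DATA(a)/PyArray_DATA(b) to the GPU and call GEMM ... */
        Py_RETURN_NONE;
    }

    static PyMethodDef methods[] = {
        {"gemm_wrapper", gemm_wrapper, METH_VARARGS, "type-checked GEMM entry point"},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "interface_check", NULL, -1, methods
    };

    PyMODINIT_FUNC PyInit_interface_check(void)
    {
        import_array();    /* initialize the numpy C API */
        return PyModule_Create(&moduledef);
    }

A mismatched format character (e.g. "l" parsed into a C int, or an int argument received into a wider variable) does not necessarily fail loudly; it can clobber neighbouring stack variables and show up later as corrupt matrix data, which is why this is worth ruling out.
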
...