...

  • switch to cudaMemcpy in crash_test.c
  • make matrix bigger in crash_test.c
  • run crash_test.c on 2 gpus per node (still fails)
  • compare suncat-gpu-test (doesn't fail) and suncat-gpu (fails):
    • C2075 vs. M2090
    • 7 vs. 8 gpus
    • IB card
    • cooling
    • power cables? (c13 vs. c19)
  • run nbody gpu test (seems to need display, so doesn't work)
  • read gpu temps (nvidia-smi doesn't show them, neither does ipmi)
  • read rack temps (67-72 at inlet, 86-94 at outlet)
  • run crash_test.py with P9 (no failures)
  • run 32-gpu N-N with 3 gpus per node in exclusive mode (rack 1 still warm on the outputs: 88, 88, 92, top to bottom on the front panel). Still saw NaNs in dbgcrash_fast/try19.
  • run with rack doors open, or change rack cooling behavior
  • check that PyArg_ParseTuple types match between python/C
  • cuda4
  • check power
  • small c version of crash_test.py
  • look at ipmi errors on gpu24/26: nothing
  • look in /var/log/messages for errors from driver
  • security scans
  • run gpu hardware tests (with colfax software?)
  • gcc instead of icc
  • small file crash (yes, crashed after 2 days)
  • keeneland (saw 1 NaN failure and 1 kernel launch failure)
  • simple gemm test crash
  • does it crash on 1 node? (yes on suncat-gpu 4 cores (gpu20 and gpu26), but not on suncat-gpu-test)
  • mpi errors?
  • magma GEMM? (still crashes)
  • eliminate fork warning (still crashes)
  • race condition between cublasDestroy/cublasCreate? (no, happens after first create)
  • valgrind (dbgcrash_fast/try11,13,14 show some uninitialized data in mpisum)
  • cuda-memcheck (see dbgcrash_fast/try12) looks clean for the 32-node job, even when the data gets messed up (many warnings about python numerical overflows)
  • understand imprecise exceptions
  • lung work
  • hardware problem (check for common node, too many jobs in the logfiles)
  • read the code: cudaMemcpy memory overrun?
  • discontiguous numpy array? (put in asserts)
  • did get a memory error when running racecheck
  • no errors from heap check
  • read code to look for race conditions in cukernels.cu (even though problem existed before the addition of those kernels)
  • ran cuda-memcheck racecheck: only saw errors from cublas
  • check that ECC is enabled: looked with "nvidia-smi -q"
    Ecc Mode
    Current : Enabled
    Pending : Enabled
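
The ECC check above could be scripted so it can be repeated on every node. This is a minimal sketch, assuming the "Ecc Mode" block in the "nvidia-smi -q" text looks like the excerpt in the note (a header line followed by "Current" and "Pending" lines). The function names parse_ecc_modes and ecc_fully_enabled are hypothetical, and the output layout may differ between driver versions.

```python
import re


def parse_ecc_modes(smi_text):
    """Collect one {'Current': ..., 'Pending': ...} dict per 'Ecc Mode' block.

    Assumes the layout shown in the note: an 'Ecc Mode' header line followed
    directly by 'Current : <value>' and 'Pending : <value>' lines.
    """
    blocks = []
    current = None  # dict for the block being read, or None outside a block
    for line in smi_text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("ecc mode"):
            current = {}
            blocks.append(current)
            continue
        if current is not None:
            match = re.match(r"(Current|Pending)\s*:\s*(\S+)$", stripped)
            if match:
                current[match.group(1)] = match.group(2)
            else:
                # Any other line ends the block.
                current = None
    return blocks


def ecc_fully_enabled(smi_text):
    """True only if at least one block was found and all report Enabled/Enabled."""
    blocks = parse_ecc_modes(smi_text)
    return bool(blocks) and all(
        b.get("Current") == "Enabled" and b.get("Pending") == "Enabled"
        for b in blocks
    )


# The excerpt recorded in the note above.
sample = """    Ecc Mode
    Current : Enabled
    Pending : Enabled
"""
print(parse_ecc_modes(sample))
print(ecc_fully_enabled(sample))
```

To use it on a node, capture the text of "nvidia-smi -q" (for example with subprocess) and pass it to ecc_fully_enabled.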

...
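
For the "discontiguous numpy array? (put in asserts)" item, one possible form of the check is sketched below. It is not the code actually added to the project. It uses the standard-library memoryview, which works on any object exporting the buffer protocol (numpy arrays do, and they also expose the same information as arr.flags['C_CONTIGUOUS']). The helper name require_c_contiguous is hypothetical.

```python
def require_c_contiguous(buf, name="buffer"):
    """Raise ValueError if buf is not one C-contiguous block of memory.

    Returns the size in bytes, which is the most that a raw copy
    (e.g. a cudaMemcpy on the C side) should read from the pointer.
    """
    view = memoryview(buf)
    if not view.c_contiguous:
        raise ValueError(
            f"{name} is not C-contiguous; make a contiguous copy before "
            "handing its pointer to C"
        )
    return view.nbytes


# A plain bytearray is contiguous; a strided slice of it is not.
data = bytearray(16)
print(require_c_contiguous(data, "data"))

strided = memoryview(data)[::2]
try:
    require_c_contiguous(strided, "strided")
except ValueError as err:
    print("rejected:", err)
```

Calling the check just before each Python-to-C handoff would catch a strided view before the C side copies the wrong bytes.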