This card has more fiber I/O (16): https://www.amd.com/en/products/accelerators/alveo/v80/a-v80-p64g-pq-g.html, but it is a Zynq system-on-chip, so it is harder to boot and is not currently supported. This one has even more I/O and is supported: https://www.bittware.com/products/xup-vv8/

June 18, 2024

Debugging session with Jeremy, Ryan, Gabriel, CPO and Ric
  - The issue with test_dma.cu failing on rdsrv403 is that the frame is too big for the buffer
    - Change PrbsTx.PacketLength to 0xff from 0xfff
    - test_dma then works on rdsrv403
    - Ryan found this because either AxiGpuAsyncCore.WriteAxiErrorVal or ReadAxiErrorVal had the value 0x4
  - We modified test_dma.cu to write the AxiPcieCore.AxiVersion.scratchpad register with 0xdeadbeef
    - We saw 0xdeadbeef appear in the GUI
    - So GPU to KCU writes seem to be working
      - One worry: we don't know whether the path to the KCU goes through the CPU
  - We modified test_dma.cu to replace the spin on the handshake location with a
    getchar() so we can dump after we trigger the read
    - We see that all the data is zeros rather than junk or random data
    - This explains why the spin never returns
    - AxiGpuAsyncCore.ReadAxiErrorVal shows an error value of 0x3 after
      attempting to transfer one frame
    - PrbsTx.Busy is 0
  - Jeremy determined that on our system we're using the closed-source NVIDIA
    driver rather than the open-source one installed on rdsrv403
    - He'll set us up with the open-source driver
  - Ryan points out that the two GPU cards are different
    - CPO will lend them gpu002's A5000 to try in rdsrv403
    - rdsrv403 looks to have only 1 root complex with a number of hubs and
      bridges, different from our nodes'
  - If the problem is the root complex, it's not clear that we can rearrange the
    cards in our nodes to be on the same root complex due to slots and space constraints
    - CPO suggests moving to the HSD box in Lab 3 in that case because it has
      a much larger PCIe bus
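
For future debugging, one alternative to swapping the handshake spin for getchar() is a bounded wait that gives up after a timeout, followed by a check of whether the buffer is still all zeros. The sketch below is not taken from test_dma.cu. The names wait_for_handshake and all_zero are hypothetical, and it assumes the handshake word and the data buffer are readable from the host as 32-bit words, which may not match how the test maps its memory.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>

// Hypothetical helper (not from test_dma.cu): poll a handshake word until it
// becomes nonzero or the timeout expires. Returns true if the handshake was
// seen, false on timeout. Assumes the word is host-readable.
bool wait_for_handshake(const volatile std::uint32_t* handshake,
                        std::chrono::milliseconds timeout) {
    assert(handshake != nullptr);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (std::chrono::steady_clock::now() < deadline) {
        if (*handshake != 0) {
            return true;
        }
        // Back off briefly so the poll loop does not saturate a CPU core.
        std::this_thread::sleep_for(std::chrono::microseconds(100));
    }
    return *handshake != 0;  // one last check at the deadline
}

// Hypothetical helper: returns true if every word in the buffer is zero,
// which is the pattern we saw in the dump (zeros, not junk or random data).
bool all_zero(const volatile std::uint32_t* buf, std::size_t n_words) {
    for (std::size_t i = 0; i < n_words; ++i) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}
```

On a timeout, the caller could dump the buffer and report all_zero() before reading the AxiGpuAsyncCore error registers, so a run that never receives data ends with a diagnosis instead of hanging.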