June 17, 2024

With Ryan, Larry, Mudit, Ric, Gabriel, cpo

Topics to discuss:

line 96 (https://github.com/slaclab/axi-pcie-devel/blob/3f5a268226f2fe4324add6d68063b27c140ad4b9/software/gpu/src/test_dma.cu#L96) talks to the KCU: it puts a GPU memory addr into the KCU (a GPU addr, which is good!)
rdsrv415 has a PCIe chassis that we could consider
the IOMMU allows doing this with wide-open security settings
line 123 is for the GPU; Ryan says "permissions"
lines 132 and 137 are for specific registers
line 152 is a GPU register
line 159 is the GPU writing to the KCU
line 164 is the GPU polling
line 173 transfers data back to the KCU from the GPU
line 159 tells the KCU we are done with the buffer

hwWritePtr is a GPU pointer
hwWriteStart is a KCU pointer
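
A rough sketch of the GPU-side handshake implied by the notes above, assuming hwWritePtr lives in GPU memory (DMA'd by the KCU) and hwWriteStart is a KCU register mapped into the GPU's address space; everything else (names, how the mapping gets set up) is placeholder, not the actual test_dma.cu code:

    // Device-side sketch only: poll the KCU-written pointer, then ack the buffer.
    #include <cstdint>

    __global__ void waitAndAck(volatile uint32_t *hwWritePtr,   // GPU pointer, advanced by the KCU DMA
                               volatile uint32_t *hwWriteStart, // KCU register, BAR mapped into GPU space
                               uint32_t expected)
    {
        // "gpu polling": spin until the KCU says a new frame is in GPU memory
        while (*hwWritePtr != expected) { /* spin */ }

        // ... consume the frame that now sits in GPU memory ...

        // "tells the kcu we are done with the buffer": write back to the KCU register
        *hwWriteStart = expected;
    }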

Larry says the GPU is PCIe Gen4 x16, 50 GB/s
Gen4 C1100 bifurcated 8+8 matches the GPU: 25 GB/s for 8 lanes, 50 GB/s for 2 x 8 lanes?  Can PGP also bring in 50 GB/s?

Larry mentioned the possibility of using a C1100 (https://www.xilinx.com/products/accelerators/varium/c1100.html) with a bifurcated PCIe bus to increase data into the GPU to hopefully 50 GB/s.  But it looks like it only supports 2 QSFP28, which are 100 Gb/s each?  So only ~20 GB/s in total per C1100?  So to support the 1 TB/s produced by the largest epixUHR we would need ~50 C1100s, and either put them into ~25 GPUs or do 50 C1100s into 50 GPUs.  And Dionisio says the 35 kHz epixUHR will use PGP4 with only 15 Gb/s per lane, which is ~12 GB/s per C1100.  So bifurcated PCIe is not necessary.
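
For reference, the arithmetic behind the paragraph above, using only the figures quoted in the discussion (the ~20 GB/s usable per C1100 is the notes' assumption; 2 x 100 Gb/s works out to 25 GB/s before overhead):

    // Back-of-the-envelope numbers from the discussion above, not measurements.
    constexpr double qsfp28_Gbps    = 100.0;                         // per QSFP28 port
    constexpr double c1100_raw_GBps = 2.0 * qsfp28_Gbps / 8.0;       // 2 ports -> 25 GB/s raw
    constexpr double c1100_use_GBps = 20.0;                          // usable figure assumed in the notes
    constexpr double epixUHR_GBps   = 1000.0;                        // largest epixUHR, ~1 TB/s
    constexpr double cards_needed   = epixUHR_GBps / c1100_use_GBps; // ~50 C1100s (~25 GPUs at 2 per GPU)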

This card has more fiber IO (16): https://www.amd.com/en/products/accelerators/alveo/v80/a-v80-p64g-pq-g.html but is a Versal system-on-chip, so harder to boot and not currently supported.  This one has even more IO, https://www.bittware.com/products/xup-vv8/, and is supported.

June 18, 2024

Debugging session with Jeremy, Ryan, Gabriel, CPO and Ric
  - The issue with test_dma.cu failing on rdsrv403 is that the frame is too big for the buffer
    - Change PrbsTx.PacketLength to 0xff from 0xfff
    - test_dma then works on rdsrv403
    - Ryan found this from either AxiGpuAsyncCore.WriteAxiErrorVal or ReadAxiErrorVal having value 0x4
  - We modified test_dma.cu to write the AxiPcieCore.AxiVersion.scratchpad register with 0xdeadbeef
    - We saw 0xdeadbeef appear in the GUI
    - So GPU to KCU writes seem to be working
      - There is some worry that the path to the KCU might be going through the CPU; we don't know either way
  - We modified test_dma.cu to replace the spin on the handshake location with a getchar() so we can dump the buffer after we trigger the read (both this and the scratchpad change are sketched after this list)
    - We see that all the data is zeros rather than junk or random data
    - This explains why the spin never returns
    - AxiGpuAsyncCore.ReadAxiErrorVal shows an error value of 0x3 after attempting to transfer one frame
    - PrbsTx.Busy is 0
    - (Didn't catch the meaning of the error code other than that the transfer failed)
  - Jeremy determined that on gpu001 we're using the closed source nvidia driver rather than the open source one installed on rdsrv403
    - He'll set us up with the open source driver
  - Ryan points out that the two GPU cards are different
    - CPO will lend them gpu002's A5000 to try in rdsrv403
    - rdsrv403 looks to have only 1 root complex with a number of hubs and bridges, different from our nodes'
  - If the problem is the root complex, it's not clear that we can rearrange the cards in our nodes to be on the same root complex due to slots and space constraints
    - CPO suggests moving to the HSD box in Lab 3 in that case because it has a much larger PCIe bus
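
A sketch of the two test_dma.cu modifications mentioned in the list above (how the scratchpad register gets mapped, and the buffer/word-count names, are placeholders; only AxiPcieCore.AxiVersion.scratchpad, 0xdeadbeef, and the getchar() idea come from the session):

    #include <cstdio>
    #include <cstddef>
    #include <cstdint>

    // (1) GPU writes a known value to the mapped AxiVersion scratchpad register;
    //     seeing 0xdeadbeef in the GUI showed GPU-to-KCU writes work (though we
    //     aren't sure the path avoids the CPU).
    __global__ void writeScratchpad(volatile uint32_t *scratchpad)
    {
        *scratchpad = 0xdeadbeef;
    }

    // (2) Host side: replace the spin on the handshake word with a keypress, then
    //     dump the receive buffer; in our test it was all zeros, which is why the
    //     original spin never returned.
    void dumpAfterTrigger(const uint32_t *hostCopyOfGpuBuf, size_t words)
    {
        printf("trigger the read, then press Enter to dump the buffer\n");
        getchar();
        for (size_t i = 0; i < words; ++i)
            printf("%4zu: 0x%08x\n", i, hostCopyOfGpuBuf[i]);
    }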

June 24, 2024

We learned that there will be two separate free-lists: one for CPU, one for GPU
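
A minimal sketch of what the two free-lists could look like on the software side (names and data structures are ours, not the driver's): buffers destined for host memory and buffers destined for GPU memory get recycled through separate lists.

    #include <cstdint>
    #include <queue>

    struct DmaFreeLists {
        std::queue<uint32_t> cpuBuffers;  // indices of free CPU (host) DMA buffers
        std::queue<uint32_t> gpuBuffers;  // indices of free GPU DMA buffers
    };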

Two firmware requests:

  • LCLS would like a GPU-enabled TDet firmware (timing system triggers a fake-payload of user-defined size)
    • generates fake payloads 
    • Ryan said that TID needs to come up with a block-diagram for this
  • For a real detector: LCLS would like multiple detector lanes (8 for the KCU) built by the Batching EB in the kcu1500 firmware
    • this is separate from the BEB on the FEB, which joins timing to data
    • a "partial" event builder (detector segments still split across KCUs)
    • this is currently done for epixHR (we think the old pre-existing timing-stream is disabled in the KCU BEB)
    • Mudit could modify the existing epixHR kcu1500 firmware, but eliminate the timing system and expand to 8 data lanes
    • could we eliminate the front-end BEB to avoid a level of tail-iteration?  Ryan thinks we could maybe even avoid the KCU tail-iteration somehow

There are two different uses of the word "batching": batching-event-builder (which cpo thinks of as being just an event builder) and batching in the sense of giving the KCU one "wakeup" call for multiple events (a "batch").

Example of two-level batching event builder (BEB) in the FEB and KCU:

FEB "a" event-build for event 1 with two detector segments a,b and a timestamp t1: t1,d1a,d1b event-built into "f1a"
FEB "b" event-build for event 1 with two detector segments c,d and a timestamp t1: t1,d1c,d1d event-built into "f1b"

KCU batch event-build for event 1: f1a,f1b (static size for HR, UHR, but variable for sparkPIX).  Note that this is a partial event-build because other detector segments will be connected to other KCUs.  Existing LCLS software does that high-level event-build.
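
A sketch of the two-level structure in that example (type and field names are ours, not the firmware's format):

    #include <cstdint>
    #include <vector>

    struct FebFrame {                                // built by one FEB: timestamp plus its detector segments
        uint64_t timestamp;                          // t1
        std::vector<std::vector<uint8_t>> segments;  // e.g. {d1a, d1b} for FEB "a" -> f1a
    };

    struct KcuBatch {                                // built by the KCU BEB: all FEB frames it sees for one event
        std::vector<FebFrame> febFrames;             // {f1a, f1b}; static size for HR/UHR, variable for sparkPIX
    };                                               // still partial: other segments land on other KCUs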

Note: we will always get a packet from sparkPIX, even with an empty payload
