June 17, 2024
With Ryan, Larry, Mudit, Ric, Gabriel, cpo
Topics to discuss:
- can/should we remove the cpu from the l1accept path
- batching, and keeping events in time-order
- clarify which lines involve cpu/gpu/kcu
- do we "throw the switch"?
- test payloads of user-defined size triggered by L1
- multi kcu to one gpu
- root complex issue (https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems)
- can we use unified memory?
line 96 (https://github.com/slaclab/axi-pcie-devel/blob/3f5a268226f2fe4324add6d68063b27c140ad4b9/software/gpu/src/test_dma.cu#L96) talks to kcu: puts gpu memory addr into kcu. a gpu addr, which is good!
rdsrv415 has a pcie chassis that we could consider
the iommu allows us to do stuff with wide-open security settings
line 123 is for GPU, ryan says "permissions"
line 132 and 137 are for specific registers
line 152 is a gpu register
line 159 is the gpu writing to the kcu
line 164 gpu polling
line 173 transfers data back to kcu from gpu
line 159 tells the kcu we are done with the buffer
hwWritePtr is a GPU pointer
hwWriteStart is a KCU pointer
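Pulling the notes above together, a minimal sketch of the handshake as we understand it (hwWritePtr/hwWriteStart are real names from test_dma.cu; the kernel shape, types, and the ack value are guesses for illustration, not the actual code):

```cuda
#include <cstdint>

// Hedged sketch of the test_dma handshake (illustration only):
//  - hwWritePtr lives in GPU memory; the KCU DMAs data in and bumps this word
//  - hwWriteStart is a KCU register mapped into the GPU's address space
__global__ void waitAndAck(volatile uint32_t *hwWritePtr,   // GPU pointer (see note above)
                           volatile uint32_t *hwWriteStart) // KCU pointer (see note above)
{
    while (*hwWritePtr == 0)      // "gpu polling" (line 164)
        ;
    *hwWriteStart = 1;            // gpu writes to the kcu: buffer done (lines 159/173)
}
```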
Larry says gpu gen4 16lane 50GB/s
gen4 c1100 bifurcated 8+8 matches the GPU, 25GB/s for 8 lanes, 50GB/s for 2 x 8 lanes? can PGP also bring in 50GB/s?
Larry mentioned the possibility of using C1100 (https://www.xilinx.com/products/accelerators/varium/c1100.html) with a bifurcated pcie bus to increase data into the GPU to hopefully 50GB/s. But it looks like it only supports 2 QSFP28 which are 100Gb/s each? So only 20GB/s in total per C1100? So to support the 1TB/s produced by the largest epixUHR we would need ~50 C1100 and either put them into ~25 GPUs, or do 50 C1100 into 50 GPUs. And Dionisio says the 35kHz epixUHR will use PGP4 with only 15Gb/s per lane, which is ~12GB/s per C1100. So bifurcated pcie is not necessary.
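Back-of-envelope arithmetic behind the paragraph above (my numbers; the ~20% line-coding/protocol overhead is an assumption, not from the meeting):

```latex
% C1100 input bandwidth:
2 \times 100~\text{Gb/s (QSFP28)} = 200~\text{Gb/s} \approx 25~\text{GB/s raw} \approx 20~\text{GB/s usable}
% cards needed for the largest epixUHR:
1~\text{TB/s} \,/\, 20~\text{GB/s} \approx 50~\text{C1100s}
% PGP4 case: 8 lanes per card at 15 Gb/s per lane:
8 \times 15~\text{Gb/s} = 120~\text{Gb/s} \approx 12~\text{GB/s per C1100}
```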
This card has more fiber IO (16): https://www.amd.com/en/products/accelerators/alveo/v80/a-v80-p64g-pq-g.html but is a Zynq system-on-chip, so harder to boot and not currently supported. This one has even more IO https://www.bittware.com/products/xup-vv8/ and is supported.
June 18, 2024
Debugging session with Jeremy, Ryan, Gabriel, CPO and Ric
- The issue with test_dma.cu failing on rdsrv403 is that the frame is too big for the buffer
- Change PrbsTx.PacketLength to 0xff from 0xfff
- test_dma then works on rdsrv403
- Ryan found this from either AxiGpuAsyncCore.WriteAxiErrorVal or ReadAxiErrorVal having value 0x4
- We modified test_dma.cu to write the AxiPcieCore.AxiVersion.scratchpad register with 0xdeadbeef
- We saw 0xdeadbeef appear in the GUI
- So GPU to KCU writes seem to be working
- There is some worry that we don't know whether the path to the KCU goes through the CPU
- We modified test_dma.cu to replace the spin on the handshake location with a getchar() so we can dump the buffer after we trigger the read (sketched after this list)
- We see that all the data is zeros rather than junk or random data
- This explains why the spin never returns
- AxiGpuAsyncCore.ReadAxiErrorVal shows an error value of 0x3 after attempting to transfer one frame
- PrbsTx.Busy is 0
- (Didn't catch the meaning of the error code other than that the transfer failed)
- Jeremy determined that on gpu001 we're using the closed source nvidia driver rather than the open source one installed on rdsrv403
- He'll set us up with the open source driver
- Ryan points out that the two GPU cards are different
- CPO will lend them gpu002's A5000 to try in rdsrv403
- rdsrv403 looks to have only 1 root complex with a number of hubs and bridges, different from our nodes'
- If the problem is the root complex, it's not clear that we can rearrange the cards in our nodes to be on the same root complex due to slots and space constraints
- CPO suggests moving to the HSD box in Lab 3 in that case because it has a much larger PCIe bus
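A sketch of the getchar() debugging hack described above (the function and buffer names are hypothetical; this is not the committed test_dma.cu):

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Pause instead of spinning on the handshake word, so we can trigger the
// KCU->GPU transfer by hand and then dump whatever landed in the buffer.
void dumpAfterTrigger(const void *gpuBuf, size_t bufSize)
{
    printf("Trigger the DMA read, then press enter...\n");
    getchar();                                  // replaces the device-side spin

    uint8_t *host = new uint8_t[bufSize];
    cudaMemcpy(host, gpuBuf, bufSize, cudaMemcpyDeviceToHost);
    for (size_t i = 0; i < 16 && i < bufSize; ++i)
        printf("%02x ", host[i]);               // all zeros = the DMA never arrived
    printf("\n");
    delete[] host;
}
```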
June 24, 2024
We learned that there will be two separate free-lists: one for CPU, one for GPU
Two firmware requests:
- LCLS would like a GPU-enabled TDet firmware (timing system triggers a fake-payload of user-defined size)
- generates fake payloads
- Ryan said that TID needs to come up with a block-diagram for this
- For the real detector: LCLS would like multiple detector lanes (8 for the kcu) built by the Batching EB in the kcu1500 firmware
- this is separate from the BEB on the FEB which joins timing to data
- a "partial" event builder (detectors still split
- this is currently done for epixHR (we think the old pre-existing timing-stream is disabled in the KCU BEB)
- Mudit could modify the existing epixHR kcu1500 firmware, but eliminate timing system and expand to 8 data lanes
- could we eliminate the front-end BEB to avoid a level of tail-iteration? Ryan thinks maybe we could even avoid the kcu tail-iteration somehow
There are two different uses of the word "batching": batching-event-builder (which cpo thinks of as being just an event builder) and batching in the sense of giving the KCU one "wakeup" call for multiple events (a "batch").
Example of two-level batching event builder (BEB) in the FEB and KCU:
FEB "a" event-build for event 1 with two detector segments a,b and a timestamp t1: t1,d1a,d1b event-built into "f1a"
FEB "b" event-build for event 1 with two detector segments c,d and a timestamp t1: t1,d1c,d1d event-built into "f1b"
KCU batch event-build for event 1: f1a,f1b (static size for HR, UHR, but variable for sparkPIX). Note that this is a partial event-build because other detector segments will be connected to other KCUs. Existing LCLS software does that high-level event-build.
Note: we will always get a packet from sparkPIX, even with an empty payload
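A toy illustration of the two-level event build above (struct names and sizes are invented, not real firmware formats; sparkPIX payloads would be variable-size):

```cuda
#include <cstdint>

struct Segment  { uint8_t payload[1024]; };   // one detector segment (d1a, d1b, ...)
struct FebEvent {
    uint64_t timestamp;                       // t1, attached by the FEB's BEB
    Segment  seg[2];                          // e.g. f1a = {t1, d1a, d1b}
};
struct KcuBatch { FebEvent feb[2]; };         // KCU BEB joins f1a,f1b for event 1
// Still only a *partial* event build: other detector segments sit on other
// KCUs, and existing LCLS software does the final high-level event build.
```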
July 1, 2024
- we think gpu can write to kcu, but Jeremy will try to confirm that this is really the case. we think kcu to gpu is broken.
- Mudit will work on TDet firmware
- Jeremy and TID will use rdsrv419
- LCLS should find another machine where we can control the root-complex topology better. lab3 daq-tst-dev06?
- could be that the CPU datadev driver is incompatible with the GPU datadev driver
- should probably find a different machine. a fee-alcove machine?
- chris and gabriel on vacation July 6-20
- Jeremy unavailable July 8-10
July 8, 2024
- Jeremy and CPO are on vacation
- Chris and I have set up daq-tst-dev06 in Lab 3 for testing with a GPU & KCU
  - The KCU appears as /dev/datagpu_1
  - After some debugging to get the software to recognize /dev/datagpu_1 (as opposed to _0), the interCardGui comes up and shows sensible values
  - test_dma also runs, but the AxiVersion.ScratchPad register does not go from 0x0 to 0xdeadbeef
  - Neither does test_dma see any DMAs
- Larry urges to move axi-pcie-devel forward to v4.1.0 rather than working with v4.0.0
- Mudit has created v4.2.0 (CPU/GPU DMA switch?) but it has not been tested with hardware yet
- It is unclear whether Jeremy has gotten a baseline working system yet
- Ryan agreed that my hacks of test_dma.cu and _Root.py to target /dev/datagpu_1 should be sufficient
- Ryan would like us all to work on one machine and get confidence in it before we branch out to different machines
- I suggested that we could pull out the current /dev/datagpu_0 from dev06 so that the KCU of interest becomes _0
  - Ryan & Larry suggested pulling the other ones out as well
- Ryan suggests setting iommu=no in the BIOS as well as on the command line
  - I later found that there is no iommu parameter in dev06's BIOS
- There's no experience with multiple KCU cards and the datagpu driver - it is untested
- Larry is aware of some issue with the usual datadev driver and multiple KCUs
- Work on rdsrv419
- A GPU is installed
- CUDA is installed
- A KCU is installed in PCIe slot 03:00
July 15, 2024
- CPO and Ryan on vacation
- Let's wait on contacting NVIDIA about the GPUDirect issues until Chris gets back
- Ric to look into the BIOS settings again to see if we can glean something
- Jeremy is finding that when he executes KCU register reads with various methods, he always gets back the Version value from register offset 0
  - So far, he's baffled
July 22, 2024
- rdsrv419 scratchpad read/write tests: Jeremy/Ryan don't understand the failure mode: Seems to only read the register at offset 0 (the version register). Lower priority since it is for register-space stuff which we don't need.
- Jeremy wants to work on implementing the cuda-graph control mode (one for each event buffer). cuda-graphs can't launch themselves, but there are "scheduling kernels" that can do "tail launches" (see the sketch after this list).
- Larry update on configuring rdsrv415: can build the kernel driver, but it complains about an incompatibility between the kernel driver and the user application. Ryan will work with Larry on this.
- Report from Ric on dev06 and gpu001:
- looked at differences with rdsrv419. made as many of them the same as possible, but didn't help.
- gpu001: noticed the gpu001 drivers were proprietary (dkms). Tried to install open drivers but cuda magically updated from 12.3 to 12.4. Status is the same: can write scratchpad register but dma's still don't work.
- Gabriel suggested access control ("ACS") settings, but so far hasn't helped.
- dev06: pulled out all the other cards from dev06, but scratchpad register write and dma don't work.
- changed bios settings (disabling the access-control services ("ACS") parameter) and that allowed scratchpad to work!
- mudit tdet firmware:
- firmware is done with a switch to move data between cpu/gpu
- learning about scripts from Matt/Ric to test. needs a machine to test: just to CPU for now (until we get GPU dma working)
- matt suggests using kcu in dev02 as an xpm and drive a kcu in dev06 and use the datadev_gpu driver there to test. cpo will work with mudit to get this working.
- some ideas for debugging DMA issue
- test gpu-direct with standard gpu-gpu (cpo/ric/Gabriel). use dev06 for this.
- test gpu-direct with kcu-kcu (can control both sides). TID has a "pip" tool for this (perhaps was used for Ryan Coffee's stuff or dune in 2019?). https://github.com/slaclab/axi-pcie-core/blob/main/protocol/pip/rtl/AxiPciePipCore.vhd. Larry/Mudit could work on this? But Larry gone until Aug 7 or so. use rdsrv416 for this (mudit's machine).
- pcie bus analyzer? Ryan thinks it doesn't exist.
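A hedged sketch of the tail-launch mechanism Jeremy described above (cudaGraphLaunch with cudaStreamGraphTailLaunch is the real CUDA 12 device-graph-launch API; the scheduling logic and names here are invented):

```cuda
#include <cuda_runtime.h>

// A graph can't relaunch itself, but a "scheduling kernel" running as the last
// node of a device-launchable graph can tail-launch the next graph instance.
__global__ void schedulerKernel(cudaGraphExec_t nextGraph, volatile int *moreWork)
{
    if (*moreWork)
        cudaGraphLaunch(nextGraph, cudaStreamGraphTailLaunch);
}

// Host-side setup (sketch): the graph must be instantiated for device launch
// and uploaded before the first launch:
//   cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch);
//   cudaGraphUpload(exec, stream);
//   cudaGraphLaunch(exec, stream);
```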
Aug 12, 2024
- Ric having some trouble with the CPU dma locking up when gpu registers are touched
  - Mudit says Ric is running software with the wrong address map and will provide a fix
  - Found driver offset 0xA00000 that was 0xD00000 in firmware; this fixed the issue
  - Mudit is going to move the new firmware back to 0xA00000
  - Mudit working on other issues
- Jeremy found 26.9GB/s between host and gpu. Will provide help to Mudit as needed.
- Some discussion of multiple fpga's feeding one GPU to get 50GB/s
- worried about pcie switches/bandwidth
- fpga firmware feels ok
- event-building on the gpu is another concern
- Jeremy looked at the cuda-graph stuff, and the GPU's can "tail launch" themselves
- Ric will try Jeremy's cuda-graph branch that demonstrates buffer release
- Jeremy/Mudit have 4 buffers (configurable) per dma-lane
- can we have some flexibility in how we map buffers to (parallel) cuda streams? (in cpo's head a cuda-stream is the gpu analog of a cpu-thread)
- we want dma to overlap with stream kernels
- Jeremy will think about it. Update: with the 4 buffers we think that there will be overlap between dma/kernels (sketched below)
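A minimal sketch of the buffer-to-stream mapping discussed above, assuming 4 one-event-buffers with one cuda stream each (all names and the processEvent kernel are hypothetical):

```cuda
#include <cstdint>
#include <cuda_runtime.h>

const int NBUF = 4;                 // matches the 4 buffers per dma-lane
cudaStream_t streams[NBUF];

__global__ void processEvent(const uint8_t *buf) { /* FEX etc. */ }

void initStreams()
{
    for (int i = 0; i < NBUF; ++i)
        cudaStreamCreate(&streams[i]);
}

// Called as each one-event-buffer fills: kernels for buffer i run in stream i
// while the KCU is still DMA'ing into buffer i+1 -- the overlap we want.
void onBufferFilled(int ibuf, const uint8_t *gpuBuf)
{
    processEvent<<<64, 256, 0, streams[ibuf]>>>(gpuBuf);
}
```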
Aug 19, 2024
- Mudit believes all the issues have been solved with the TDet-gpu firmware by changing a pcie-bifurcation issue. Data can flow to both CPU and GPU.
- DMA lane going to GPU is only on the primary device (/dev/datagpu_1) but not on the other (/dev/datagpu_0). Should we change this? Might need it if we have a second GPU in the node (which we plan to do)
- Jeremy did a clean-up pass on the software on the cuda-graphs branch. There is a gpuasynclib in the cuda-graphs that Ric can use.
- Jeremy will try to add some performance measurements/plots
Aug 26, 2024
- Ric's testing of Mudit's TDet-gpu firmware still in progress
- BER for 10Gb/s links is large: worrisome
- Gabriel continues to look at GPU direct storage
- XPP is going to get a 4MP epix with 288 fibers in summer 2025, which we will read out with the CPU at 120Hz, using 36 C1100's.
- need 4 chassis to bring all the fibers in
- but will buy a big chassis to receive all the fibers and test future GPU readout
- Larry points out that we need an H100 ($30-$40K) to get pcie gen5 to test the 50GB/s/GPU input bandwidth. And we also need to check compatibility with gpu-direct (which is used by the C1100)
- Matt's suggested chassis SYS-521GE supermicro
- supports pcie 5 (C1100s are still pcie 4 but that's OK)
XPP 4Mpixel high-rate detector (35kHz eventually): 250GB/s over 288 fibers requires 36 C1100's (8 fibers per C1100), which requires 9 GPUs: 5 chassis with 2 GPUs each, 4 C1100s per GPU. Eventual goal is 50GB/s into each GPU; ~4 C1100s per GPU -> 50/4 = 12.5GB/s per C1100.

epixUHR (288 fibers)
||||  ||||  ||||     (pgp fiber lanes, 8 pgp fiber lanes per C1100)
beb   beb   beb      (batching event builder in fpga pgp card)
fpga1 fpga2 fpga3
||    ||    ||       (dma masters)
----v       v
gpu1        gpu2

- Ryan/Jeremy point out that one DMA master can fill N DMA "1-event-buffers", so we only need 1 DMA master per pgp card? no known advantage to having multiple dma masters per pgp card (support reading out multiple independent detectors at the same time?)
- the one DMA master puts a different event (built by the beb) into each 1-event-buffer (important design decision: round-robin or load-balancing? this choice can simplify time-ordering and software-event-building when we have multiple pgp cards)
- multiple one-event-buffers is how we get overlap between DMA and data processing
- data from one DMA one-event-buffer goes to one cuda stream
- multiple pgp card proposal: each pgp card will fill its own one-event-buffers on the GPU (a natural extension of the single-pgp-card case with multiple one-event-buffers)
- if we have multiple pgp cards, the multiple one-event-buffers need to be event-built in software
- there should be no mechanism to drop only a "piece" of the detector (apart from fiber bit errors). dead time will only drop whole events, and different pgpcards' event-counts will stay in-sync with the other pgpcards.
- cpo would vote for round-robin dma-master-to-one-event-buffer transfers since it simplifies the software event builder and perhaps also the time ordering. a disadvantage of round-robin vs. load-balancing is that if peak-finders take different times on different events (for example) then load-balancing is more efficient (although the streams share one gpu and we think one stream will fill up a (current-generation) gpu).
- gpu software has to put events from different cuda streams in time order
- does the FEX (two parts: trigger decision, data reduction) run in the above cuda-streams? cpo thought yes, ric thought no. ideally the trigger decision happens first.
  - ric proposal: a stream for event-build/time-order, then another stream for trigger data computation (e.g. #peaks), another stream for data reduction (run only if the event passes the trigger), then another stream for file writing.
  - cpo was hoping for fewer streams for simplicity, but cpo thinks we need some streams to overlap teb trigger work with the other tasks done on the gpu, and that we would have enough parallelism from the existing streams (one per one-event-buffer).
NOTE: If the C1100 has the pcie "bifurcation" feature that the KCU has, then we would need to utilize both halves (with 1 or 2 DMA masters?) of the bifurcated pcie, which reduces the amount of beb event-building that we do. Or could we round-robin events across the bifurcated pcie? (could make it more difficult to put events in time-order).
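A sketch of why cpo favors round-robin (everything here is hypothetical pseudocode for the software event builder, not an implementation): with round-robin, event k from every pgp card lands in one-event-buffer k % NBUF, so the builder can walk the buffers in lockstep and time-ordering falls out for free.

```cuda
#include <cstdint>

const int NBUF = 4, NCARDS = 2;

struct EventPiece { uint64_t timestamp; const uint8_t *data; };
EventPiece buf[NCARDS][NBUF];          // filled by each card's DMA master

void consume(const EventPiece &p) { (void)p; }  // downstream processing (stub)

void buildEvent(uint64_t k)            // software event build for event k
{
    int i = k % NBUF;                  // round-robin: same slot on every card
    for (int c = 0; c < NCARDS; ++c)
        consume(buf[c][i]);            // whole events only; pieces never dropped
    // with load-balancing we would instead have to match timestamps here
}
```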
Sept 9, 2024
- Jeremy tried to do some profiling with Nvidia-nsight but it doesn't seem to work with gpu-launched cuda-graphs
- Jeremy will submit a pull-request to merge the cuda-graphs branch
- By default context-synchronization (cuCtxSynchronize) uses a spin loop; perhaps we should use an event-based mechanism? (see the sketch after this list)
- Gabriel did performance measurements for ethernet weka file-writing. Performance looks identical between IB and ethernet (2GB/s for a single writer to 9GB/s for multiple writers), so we could move drp nodes to all ethernet which may allow us to use gpu-direct-storage.
- The 4Mpixel epixUHR for XPP won't run until late 2026 (and maybe will go in the hutch in summer of 2026). Will use CPU while it runs at 120Hz.
- For multiple pgp cards going to one gpu, simplest model is just to add more dma-streams going to their own cuda-streams (already have this for one card).
- An open question is how much event-building we should do in cuda
- currently only have 1 dma stream working in the kcu, should we next target 2 dma streams?
- possible action items:
- cpo will draw a beautiful non-ascii diagram and schedule a gpu dataflow-mtg that includes tid/dionisio/matt/datasystems
- when ric has time he can tweak software to enable a second DMA buffer
- should work be done on bifurcated pcie? (conversation between Ryan, mudit and Larry)
- enable 2 kcu's to go to 4 DMA buffers (2 buffers per kcu)? needs work.
- this can be tested with a bifurcated kcu
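A sketch of the event-based alternative to the default spin (cudaSetDeviceFlags and cudaDeviceScheduleBlockingSync are the real runtime API; the driver-API analog would be CU_CTX_SCHED_BLOCKING_SYNC; whether this helps our latency is untested):

```cuda
#include <cuda_runtime.h>

int main()
{
    // Must be set before the context is created; afterwards synchronization
    // calls (cudaDeviceSynchronize, cudaEventSynchronize, ...) put the host
    // thread to sleep instead of busy-waiting in a spin loop.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    // ... normal work ...
    return 0;
}
```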
Sept 23, 2024
GPU dataflow discussion slides:
- daq-tst-dev06 information (to determine if we can test two pgp cards with 1 GPU):
  - https://www.supermicro.com/products/system/4U/4029/PCIe-Root-Architecture.cfm
  - https://www.supermicro.com/manuals/superserver/4U/MNL-2107.pdf
  - https://www.supermicro.com/manuals/motherboard/C606_602/MNL-X9DRG-of-platform.pdf
- "dmidecode -t system" suggests our model is SYS-4029GP-TRT. lspci sees "PEX 8747" pci bridge chips, consistent with the X9DRG-O-PCIE-P pcie board in the supermicro SYS-4029GP-TRT.
- Can we use GPU-Direct across two pcie ports plugged into the same CPU as shown in this diagram? Section 2.4 of https://docs.nvidia.com/cuda/gpudirect-rdma/index.html suggests that GPU Direct will work through a CPU, albeit at reduced performance, but should be good for development.
- Test it out with GPU/KCU in the same pcie switch, then move the KCU to another pcie switch.
Sept. 30, 2024
- Mudit verified that the handout of data from pgp card to gpu with multiple one-event-buffers happens in round-robin fashion
- We have switched to round-robin from detector to pgp card (following Matt's suggestion)
- this requires new KCU firmware. because of this we won't round-robin yet for the Nov. 8,11 epixUHR beamtime. but hopefully after that we will switch to round-robin
- Larry would like to delegate this new firmware to Mudit who kindly agrees to take this on
- cpo looked at the pcie architecture of daq-tst-dev06 to see if we can run multiple kcu cards per GPU (see notes above from Sept. 23 mtg). It looks like it should be possible.
- Mudit can test first and if it does indeed work (should be simple) then can begin development/testing of multiple KCU-per-GPU.
- launched purchase of the H100 ($30K?) and SYS-521GE ($20K?)
- cpo is interacting with Yemi to get a meeting with Colfax/supermicro to ask about GPU direct and whether this chassis is the best choice for our application (invite Matt to mtg as well)
- Ric is having trouble with the system "surviving" (hard problem).
- When it breaks cpo/ric/mudit will meet in zoom to try to understand why.
- Mudit said there was problem with I2C initialization of clocks, but after that fix Ric continues to see problems.
- sudo for cuMemHostRegister with CU_MEMHOSTREGISTER_IOMEMORY option:
- Ric tried the "cuda runtime api" (cudaHostRegister; in general these have a "cuda" prefix to routines) instead of the "driver api" ("cu" prefix to routines), but the behavior remained the same.
- see https://github.com/slaclab/axi-pcie-devel/blob/341b81cc4c73031775e1bf17d56b4d2485a378d2/software/gpu/src/test_dma.cu#L147
- TID feels like it will be difficult to wrap this and put it in the datagpu (or other) driver
- Ric thought dmaMapRegister did something similar in the driver, but TID said it doesn't; gpuAddNvidiaMemory does.
- I think what I (Ric) said was that I looked into finding a way to add the cuMemHostRegister(swFpgaRegs, 0x100000, CU_MEMHOSTREGISTER_IOMEMORY); call (or do its equivalent) to the dmaMapRegister functionality (on the driver side, in Dma_MapReg()?), or to provide it as a separate ioctl() call that would be called from user code after it calls dmaMapRegister, as needed, but I couldn't find a way to deal with the building issues
- Ric can work around this for now with setuid
- lower-priority, non-ideal fallback: seeing if we can use reduced "setuid capabilities" that can be raised/lowered as the program runs
- idea from Gabriel: Regarding the idea of using the effective UID to switch to an unprivileged user: there is also the concept of a "saved user id" (SUID) which is implemented for these sorts of scenarios where you're switching back and forth for work at different privilege levels. I don't think there are direct get/set functions for it, but there is e.g. getresuid for retrieving all three values (real, effective, saved). If I understand correctly this may allow you to get back to root if needed (given SUID == root UID). I've never tried to use this to get root back after dropping, so mileage may vary. (see the sketch after this list)
- Will ask Nvidia what best practice is
- Jeremy will compose the question and maybe look through the documentation to see if there is "user space" guidance; cpo will send it to Nvidia.
- Larry said axi-pcie-devel/firmware/targets/SwDriverInterCardTest/hdl/SwDriverInterCardTest can be used to inject data into the KCU. Ideally we would marry this to Mudit's TDetGPU firmware.
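A sketch of Gabriel's saved-uid idea above (untested, "mileage may vary" as he says; assumes a setuid-root binary, so real uid = the user and effective = saved = root at startup):

```cuda
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>

int main()
{
    uid_t ruid, euid, suid;
    getresuid(&ruid, &euid, &suid);  // real, effective, saved

    setresuid(-1, ruid, -1);         // drop the effective uid to the real user;
                                     // the saved uid stays root, so we can return
    /* ... unprivileged running ... */

    if (setresuid(-1, 0, -1) != 0)   // climb back to root via the saved uid
        perror("setresuid");
    /* ... privileged call, e.g. cuMemHostRegister(..., CU_MEMHOSTREGISTER_IOMEMORY) ... */
    return 0;
}
```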
Oct 7, 2024
- For sudo issue (higher priority):
- add a call to datagpu driver to do the write-enable (per buffer, per event)
- Jeremy/Ric replace the cuMemHostRegister call with a driver call in the .cu file
- this will go through a context switch which isn't ideal (per event) but Ryan feels is cleaner. We can revisit if it turns into a performance bottleneck (making the register visible directly to user code).
- could batch the releases, but this may require more one-event buffers
- signal the CPU with cudaEvents when a graph/kernel completes (see the sketch after this list). might need a thread per kernel which might do a spin-wait; could we run out of cores?
- it is still an option to embed cuMemHostRegister in the driver to make the FPGA register visible to the CPU.
- Ryan will make the driver change, Jeremy will make the cuda change, Ric will test.
- higher priority: robustness issues:
- Mudit has kindly agreed to meet on zoom with ric/cpo when Ric encounters robustness issues
- low priority: multiple KCU's talking to 1 GPU
- Mudit should chip away at this when he has time
- cpo will talk to manufacturers to make sure H100 and supermicro chassis purchase will work for us (e.g. GPU Direct, pcie5, power/cooling, 10 FPGA cards and 2 GPUs)
- Gautam will look into whether H100 supports GPU Direct. Preliminary indication looks good: https://forums.developer.nvidia.com/t/gpudirect-rdma-with-nvidia-a100-for-pcie/215032 (this is Ampere A100 vs. Hopper H100). H100NVL will be replaced with H200NVL.
- wanted 10 C1100 single-width half-length cards and 2 H100NVL (eventually). Gautam says that only 10 will fit: 2 slots are smaller.
- Gautam points out that C1100 is only pcie4x8
- Gautam says power/cooling should be OK
- first purchase will be 1 chassis and 1 H100
- Gautam says we should perhaps consider a 4U machine? SYS-421GE-TNRT: https://www.supermicro.com/en/products/system/GPU/4U/SYS-421GE-TNRT. No double-wide in some slots.
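A sketch of the cudaEvents idea from the sudo-issue list above (real API calls; the thread-per-kernel question remains open, and cudaEventBlockingSync makes the wait sleep rather than spin, which bears on the run-out-of-cores worry):

```cuda
#include <cuda_runtime.h>

// One host thread per in-flight kernel/graph (sketch): record an event after
// the work and block on it, then release the one-event buffer back to the KCU.
void waitForCompletion(cudaStream_t stream)
{
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync); // sleep, don't spin
    cudaEventRecord(done, stream);    // fires when prior work in stream finishes
    cudaEventSynchronize(done);       // host thread blocks here
    /* ... release the buffer (per-buffer write-enable) ... */
    cudaEventDestroy(done);
}
```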
Oct 14, 2024
- Good progress on sudo requirement
- Fixed the 0xa00000 to 0xd00000 register offset
- chris will work with ric to create a more detailed software dataflow picture that will be shared with everyone
- mudit will continue to help ric debug issues
- jeremy will work on multi-kcu example sending to one gpu
- jeremy loads the correct driver on boot up on daq-tst-dev06
Oct 21, 2024
- Jeremy thinks the dkms (dynamic kernel module support) driver is the way to go: using the open-source nvidia driver, get it installed with dkms. He would also like to do this with aes-stream-drivers. dkms automatically rebuilds the modules every time the kernel gets updated (it searches for all dkms modules). In progress.
- driver procedure:
- install correct open-source nvidia version with yum (maybe omit rmmod/insmod lines from comp_and_load_drivers.sh)
- datagpu driver will automatically track that using dkms
- maybe we should do this with an rpm when it's stable?
- TID already doing this for Debian, expand to other OS's.
- will use dkms longterm
- Jeremy will continue to chip away at multiple kcu's driving one gpu
- Mudit will provide Ric with header/error information
- Ric has transitions and L1Accepts flowing correctly (big milestone!)
- how do we deal with getting the one-event-buffer pointer correctly set when software starts up? NOTE: this is not a problem if we have one gpu stream per one-event-buffer, so perhaps low priority. options:
- wait until we have one stream per one-event-buffer (but may confuse other applications in the future)
- hit user-reset more automatically (or reset somehow when the device is opened)
- have a register that software could read to learn which one is next?
- test this by running with a single one-event buffer to see if we still need user-reset?
- request: /proc/datagpu* that has information about gpu buffers (analogous to /proc/datadev*). Jeremy thinks this is doable and will add it to his list.
- Would be interesting to know what user-reset does; Mudit can look into it.
- cpo found this white paper that suggests (but does not definitively state) H100 supports GPUDirect: https://resources.nvidia.com/en-us-tensor-core. Will reach out to nvidia contacts.
Oct 28, 2024
- Ric thinks Jeremy has an approach for the one-event-buffer "pointer" problem on Oct 21.
- Jeremy has indeed fixed this and some other corner-case issues
- Jeremy has updated intercard test to use 0xd00000 offset
- Ric thinks Jeremy has more info in /proc/datagpu* as described on Oct 21.
- this is another pull request to aes-stream-drivers
- Jeremy did this, but wants to have a header to avoid hardwired numbers, and to have more readable/maintainable code
- Mudit is continuing to work with Jeremy to give Ric the header structure (e.g. the "index" may be the index of the one-event-buffer that was filled)
- Jeremy wants to open pull request to aes-stream-drivers for Ryan's ioctl.
- Jeremy has two branches on axi-pcie-devel: master and multi-device, and plans to merge the latter into the former
- this has been tested in daq-tst-dev06 and appears to work!
- the example is currently on a branch (multi-device). Like test_dma, but it creates an object that can be instantiated multiple times. Still called "test_dma". Will be merged into master.
- ric suggests giving this executable a new name to preserve the original test_dma. this was agreed to by all.
- Gabriel will work on adding detector calibration to Ric's framework when his queue empties
- cpo communicated with an nvidia expert: he says the H100 supports GPUDirect storage, so cpo assumes it presumably supports GPUDirect as well.
- cpo will pull the trigger on the h100/supermicro order
- ric points out that having the daq be able to handle transitions from multiple kcu cards on the CPU side could be a lot of work
- matt suggests we already have to handle this on the GPU (e.g. for disable) so maybe do it for all transitions?
Nov. 4, 2024
Topics:
- jeremy driver-on-boot?
- has jeremy merged axi-pcie-devel test_dma?
- jeremy restarting threads
- data returned by dma is not a timing header
  - the structure that is used to create a cuda context is thrown away
  - if done correctly then the timing header is seen
  - related to the contention problem with 3 or 4 streams
    - works with 1 or 2 streams
- mudit: single vs. bifurcated pci bus