...

Recently, a problem in UED turned out to be caused by the provider that libfabric chose to use. The symptom was an intermittent, silent failure to transfer a large (O(1 MB)) buffer from the EPIX DRP to the MEB, and it seemed to be affected by whether recording was enabled. The failure occurred whether the two processes were running on the same node or on different nodes. The DRP indicated that the transfer was successful, but the MEB never received a completion signal (CQ) and consequently timed out building the event.

The highest-performing network hardware on the UED machines uses 100 Gbit/s mlx5 interfaces from Mellanox. Although these are capable of running InfiniBand, we run Ethernet over them. By default, libfabric chooses the verbs provider for these interfaces, so the above parameters were created to force the tcp provider to be selected. This was done in ued.cnf with the line:

kwargs   = 'ep_fabric="172.21.36.0/24",ep_domain=enp129s0'

The tcp provider is a partial replacement (the other is udp) for the deprecated sockets provider that was originally used to commission running the DAQ over Ethernet. The sockets provider behaves similarly to verbs but differently from tcp. To get UED going again, I switched to using sockets, which appears to solve the problem:

kwargs   = 'ep_provider=sockets,ep_domain=enp129s0'

(I now think the ep_fabric specification was redundant.) I have a guess as to what is different about tcp vs sockets that caused the problem, but it will need further investigation. A fair bit of development has occurred on the tcp provider since the libfabric release that we use (1.10.1) appeared, so I need to look more closely at the differences between tcp and sockets to understand the cause of the problem.
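
To make the difference between the two kwargs lines explicit, here is a minimal Python sketch that splits each string into key/value pairs and prints what each one specifies. The parse_kwargs helper is hypothetical and written only for this illustration; it is not the DAQ's own parser, and it assumes that values contain no commas.

```python
def parse_kwargs(kwargs: str) -> dict:
    """Split a 'key=value,key=value' string into a dict.

    Hypothetical helper for illustration only; assumes values hold no commas.
    Optional double quotes around a value are stripped.
    """
    result = {}
    for item in kwargs.split(','):
        key, _, value = item.partition('=')
        result[key.strip()] = value.strip().strip('"')
    return result


# The two settings quoted above from ued.cnf
old = parse_kwargs('ep_fabric="172.21.36.0/24",ep_domain=enp129s0')
new = parse_kwargs('ep_provider=sockets,ep_domain=enp129s0')

for name, cfg in (('old', old), ('new', new)):
    print(name,
          'provider:', cfg.get('ep_provider', '(not set)'),
          'fabric:', cfg.get('ep_fabric', '(not set)'),
          'domain:', cfg.get('ep_domain', '(not set)'))
```

Both settings keep ep_domain=enp129s0. The new one names the provider explicitly with ep_provider and drops ep_fabric.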
