Recently, a problem in UED turned out to be caused by the provider that libfabric chose to use. The symptom was an intermittent, silent failure to transfer a large (O(1 MB)) buffer from the EPIX DRP to the MEB, and it seemed to depend on whether recording was enabled. It occurred whether the two processes ran on the same node or on different nodes. The DRP indicated that the transfer was successful, but the MEB never received a completion on its completion queue (CQ) and consequently timed out building the event.
The highest-performing network interfaces on the UED machines are 100 Gbit/s mlx5 devices from Mellanox. Although these are capable of running InfiniBand, we run Ethernet over them. By default, libfabric chooses the `verbs` provider for these interfaces, so the above parameters were created to make it possible to force selection of the `tcp` provider. This was done in `ued.cnf` with the line:
```
kwargs = 'ep_fabric="172.21.36.0/24",ep_domain=enp129s0'
```
The `tcp` provider is a partial replacement for the deprecated `sockets` provider (the other is `udp`), which was originally used to commission running the DAQ over Ethernet. `sockets` behaves similarly to `verbs` but differently from `tcp`. To get UED going again, I switched to `sockets`, which appears to solve the problem:
```
kwargs = 'ep_provider=sockets,ep_domain=enp129s0'
```
(I now think the `ep_fabric` specification was redundant.) I have a guess as to what is different about `tcp` vs `sockets` that caused the problem, but confirming it will require further investigation of the differences between the two providers. A fair bit of development has occurred on the `tcp` provider since the libfabric release that we use (1.10.1) appeared.
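To make the change between the two `kwargs` settings above explicit, here is a minimal Python sketch that splits each string into key/value pairs and reports which keys differ. The helper `parse_kwargs` is hypothetical and written only for illustration. It is not the DAQ's actual parser, and it assumes that no value contains a comma.

```python
def parse_kwargs(kwargs: str) -> dict:
    """Split a comma-separated key=value string into a dict.

    Hypothetical helper for illustration only; assumes values contain
    no commas and strips optional surrounding double quotes.
    """
    result = {}
    for item in kwargs.split(","):
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = value.strip().strip('"')
    return result


# The original setting, which resulted in the tcp provider being used.
tcp_cfg = parse_kwargs('ep_fabric="172.21.36.0/24",ep_domain=enp129s0')

# The workaround setting, which names the sockets provider explicitly.
sockets_cfg = parse_kwargs('ep_provider=sockets,ep_domain=enp129s0')

# Keys present in one setting but not the other.
changed = sorted(set(tcp_cfg) ^ set(sockets_cfg))
print(changed)  # ['ep_fabric', 'ep_provider']
```

Both settings keep `ep_domain=enp129s0`; the workaround drops `ep_fabric` and names the provider directly with `ep_provider`.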