
Recently, a problem in UED turned out to be caused by the provider that libfabric chose to use.  The symptom was the intermittent, silent failure of transferring a large (O(1 MB)) buffer from the EPIX DRP to the MEB, and it seemed to be affected by whether recording was enabled.  This occurred whether the two processes were running on the same node or on different nodes.  The DRP indicated that the transfer was successful, but the MEB never received a completion (CQ) signal and consequently timed out building the event.

The highest-performing network hardware on the UED machines consists of 100 Gbit/s mlx5 interfaces from Mellanox.  Although these are capable of running InfiniBand, we run Ethernet over them.  By default, libfabric chooses the verbs provider for these interfaces, so the above parameters were created to be able to force the tcp provider to be selected instead.  This was done in ued.cnf with the line:

Code Block
languagepy
kwargs   = 'ep_fabric="172.21.36.0/24",ep_domain=enp129s0'

The tcp provider (together with udp) is a partial replacement for the deprecated sockets provider, which was originally used to commission running the DAQ over Ethernet.  sockets behaves similarly to verbs but differently from tcp.  To get UED going again, I switched to using sockets, which appears to solve the problem:

Code Block
languagepy
kwargs   = 'ep_provider=sockets,ep_domain=enp129s0'

(I now think the ep_fabric specification was redundant.)  A fair bit of development has occurred on the tcp provider since the libfabric release that we use (1.10.1) appeared.  I need to further investigate the differences between tcp and sockets to understand the cause of the problem.
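The kwargs strings in the snippets above are comma-separated key=value lists.  As a minimal sketch of how such a string decomposes (this parser is hypothetical, for illustration only; it is not the DAQ's actual parsing code):

```python
def parse_ep_kwargs(kwargs: str) -> dict:
    """Split a comma-separated key=value string, such as the endpoint
    kwargs shown above, into a dict.  Hypothetical sketch; the real
    DAQ code may handle these differently."""
    result = {}
    for item in kwargs.split(','):
        key, _, value = item.partition('=')
        # Drop any surrounding quotes, as in ep_fabric="172.21.36.0/24"
        result[key.strip()] = value.strip().strip('"')
    return result

print(parse_ep_kwargs('ep_provider=sockets,ep_domain=enp129s0'))
# → {'ep_provider': 'sockets', 'ep_domain': 'enp129s0'}
```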