Run "systemctl daemon-reload" to pick up the changed .conf file above.  Omar still needs to find out how to tell RHEL7 to find the .ko files in /cds/sw/package/daq/modules/ (i.e., this is still a work in progress).

Libfabric

Libfabric supplies a program called fi_info that lists the available providers for transferring data between nodes and processes.  The list is sorted from highest to lowest performing.  Options can be given to filter by various features or capabilities.  On systems with InfiniBand, the verbs provider is returned as the most performant interface.  On systems without InfiniBand, the tcp provider is listed as the most performant.
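
As a rough sketch of the filtering described above, the following Python helper assembles an fi_info command line restricted to a provider, domain, or fabric.  The -p, -d and -f options are the ones I recall from the fi_info man page; please confirm them with "fi_info --help" on the installed version.  The function names fi_info_command and run_fi_info are made up for this illustration and are not part of our code.

```python
import shutil
import subprocess


def fi_info_command(provider=None, domain=None, fabric=None):
    """Build an fi_info argument list, adding a filter for each value given."""
    cmd = ["fi_info"]
    if provider:
        cmd += ["-p", provider]  # assumed flag: restrict to one provider
    if domain:
        cmd += ["-d", domain]    # assumed flag: restrict to one domain
    if fabric:
        cmd += ["-f", fabric]    # assumed flag: restrict to one fabric
    return cmd


def run_fi_info(**filters):
    """Run fi_info and return its text output, or None if it is not on PATH."""
    cmd = fi_info_command(**filters)
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, run_fi_info(provider="tcp") should show only the entries for the tcp provider on a node where fi_info is installed; the values that appear there are the ones accepted by the kwargs described below.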

Our code follows this same pattern, so absent constraining parameters, libfabric chooses the highest performing interface it has access to.  The following kwargs exist to narrow the selection:

  • ep_domain: Forces the use of a particular domain (physical network interface)
  • ep_fabric: Forces the use of a particular fabric
  • ep_provider: Forces the use of a particular provider

The valid kwarg values are those listed by the fi_info command.  More information about the various providers can be found on the libfabric GitHub site, in the README and in the Provider Feature Matrix on the Wiki tab.  These are updated with each release.
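
To illustrate how the three selectors combine into the single kwargs string used in a .cnf file, here is a minimal sketch.  The helper name build_ep_kwargs is hypothetical (it does not exist in our code), and it passes values through unchanged, so any quoting a value may need is left to the caller.

```python
def build_ep_kwargs(provider=None, domain=None, fabric=None):
    """Join the endpoint selectors that are set into one comma-separated string."""
    selectors = (
        ("ep_provider", provider),
        ("ep_fabric", fabric),
        ("ep_domain", domain),
    )
    # Selectors left as None are skipped, so libfabric remains free to choose them.
    return ",".join(f"{key}={value}" for key, value in selectors if value)


# Values taken from the UED example further down this page.
kwargs = build_ep_kwargs(provider="sockets", domain="enp129s0")
```

Leaving every argument unset yields an empty string, which corresponds to letting libfabric pick the highest performing interface on its own.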

Recently, there was a problem in UED that turned out to be due to the provider libfabric chose to use.  The highest performing network hardware on the UED machines is the 100 Gbit/sec mlx5 interfaces from Mellanox.  Although these are capable of running InfiniBand, we run Ethernet over them.  By default, libfabric chooses the verbs provider for these interfaces, so the above parameters were created to make it possible to force the tcp provider to be selected.  This was done in ued.cnf with the line:

Code Block
languagepy
kwargs   = 'ep_fabric="172.21.36.0/24",ep_domain=enp129s0'

The tcp provider is a replacement for the deprecated sockets provider that was originally used to commission running the DAQ over Ethernet.  The sockets provider behaves similarly to verbs but differently from tcp.  To get UED going again, I switched to the sockets provider, which appears to solve the problem:

Code Block
languagepy
kwargs   = 'ep_provider=sockets,ep_domain=enp129s0'

(I now think the fabric specification was redundant.)  I have a guess as to what is different about tcp vs sockets that caused the problem, but it needs further investigation.