This page summarizes known problems and troubleshooting information to help run experiments at a MHz rate.

1. Node configurations 

To achieve a MHz rate, SMD0 (MPI rank 0), the core that does the heavy lifting, needs to be assigned a node exclusively. This allows the entire node to read smalldata in parallel at full bandwidth.
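One way to confirm the placement is to have every rank report its rank and hostname, then check that no other rank shares a host with rank 0. The following is a minimal sketch, assuming input lines of the form "<rank> <hostname>"; the function name rank0_is_exclusive and the input format are hypothetical and not part of psana or Open MPI.

```shell
# Hypothetical helper: reads "<rank> <hostname>" lines on stdin and
# succeeds only if the host of rank 0 carries no other rank.
rank0_is_exclusive() {
    awk '
        $1 == 0 { host0 = $2 }        # remember the host of rank 0
        { count[$2]++ }               # count ranks per host
        END {
            if (host0 == "") exit 2   # rank 0 not found in the input
            exit (count[host0] == 1 ? 0 : 1)
        }
    '
}
```

With Open MPI, the input could come from something like `mpirun -n 4 sh -c 'echo "$OMPI_COMM_WORLD_RANK $(hostname)"' | rank0_is_exclusive`, where the rank count is only an example. The exit status is 0 when rank 0 is alone on its node, 1 when it shares the node, and 2 when rank 0 is missing from the input.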

2. Open MPI failed to TCP connect

We observe that when a job runs on more than a few nodes (> 20) and the nodes assigned to it come from mixed pools (cmp, mon, or eb nodes mixed in), the job may fail with the following message:

...

This happens because each node has multiple network interfaces (tcp, ib, etc.), but not every node has all of them enabled. If the selected interface is missing on a node, no connection can be made to that node.
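To see which interface on a node carries an address in a given subnet, the output of `ip -o -4 addr show` can be filtered by address prefix. This is a minimal sketch for diagnosis only; the helper name ifaces_on_subnet is hypothetical, and it assumes the usual one-line-per-address layout of `ip -o` output (index, interface name, "inet", address).

```shell
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and prints
# the names of interfaces whose IPv4 address starts with the given prefix.
ifaces_on_subnet() {
    awk -v prefix="$1" '$3 == "inet" && index($4, prefix) == 1 { print $2 }'
}
```

Running `ip -o -4 addr show | ifaces_on_subnet 172.21.164.` on each node assigned to the job lists the matching interfaces. An empty result on a node would suggest that the node has no interface on that subnet.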

Solution

1. Restrict to 172.21.164 subnet 

export OMPI_MCA_btl_tcp_if_include=172.21.164.90/1072

2. Exclude 172.21.152 subnet

export OMPI_MCA_btl_tcp_if_exclude=eno1
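Open MPI treats btl_tcp_if_include and btl_tcp_if_exclude as mutually exclusive, so only one of the two settings above should be exported for a given job. The sketch below is a pre-launch check under that assumption; the function name check_tcp_if_settings is hypothetical.

```shell
# Hypothetical pre-launch check: refuse to continue when both the include
# and the exclude setting are present in the environment.
check_tcp_if_settings() {
    if [ -n "${OMPI_MCA_btl_tcp_if_include:-}" ] && [ -n "${OMPI_MCA_btl_tcp_if_exclude:-}" ]; then
        echo "error: set only one of btl_tcp_if_include / btl_tcp_if_exclude" >&2
        return 1
    fi
    # Report the effective settings so they appear in the job log.
    echo "include=${OMPI_MCA_btl_tcp_if_include:-unset} exclude=${OMPI_MCA_btl_tcp_if_exclude:-unset}"
}
```

Calling check_tcp_if_settings in the job script before mpirun prints the active setting, or returns a non-zero status if both variables are set.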

2. Node configurations 

...