
This page summarizes known problems and troubleshooting information that help run experiments at a MHz rate.

1. Open MPI failed to TCP connect

We observe that when running with more than a few nodes (> 20) and the nodes assigned to your job come from mixed pools (cmp, mon, or eb nodes mixed in), the job may fail with the following message:

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: drp-srcf-eb010
  PID:        130465
  Message:    connect() to 172.21.164.90:1055 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------

This happens because nodes have multiple network interfaces assigned (TCP, InfiniBand, etc.), but not every node has all of them enabled. When Open MPI selects an interface that is missing on one of the nodes, no connection can be established.
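To see which interfaces and subnets are actually configured, you can compare nodes from different pools; this is a minimal sketch, and the hostnames below are placeholders for nodes in your allocation:

# Compare the network interfaces on two nodes from different pools
# (hostnames are examples only; substitute nodes assigned to your job)
for host in drp-srcf-cmp001 drp-srcf-eb010; do
    echo "== $host =="
    ssh "$host" ip -brief addr show
done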

Solution

1. Restrict to the 172.21.164 subnet

export OMPI_MCA_btl_tcp_if_include=172.21.164.0/24

2. Alternatively, exclude the interface on the 172.21.152 subnet (eno1)

export OMPI_MCA_btl_tcp_if_exclude=eno1
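The same restriction can also be passed directly on the mpirun command line instead of via environment variables; a minimal sketch (the rank count and script name are placeholders, and the CIDR should match your setup):

# Equivalent to the export above, scoped to a single mpirun invocation
mpirun --mca btl_tcp_if_include 172.21.164.0/24 -n 16 python my_analysis.py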

2. Node configurations 

To achieve a MHz rate, the heavy-lifting SMD0 core (MPI rank 0) needs to be assigned exclusively to its own node. This allows the full bandwidth of that node to be used for reading smalldata in parallel.
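One way to arrange this with Open MPI is a hostfile that gives the SMD0 node a single slot, so rank 0 lands there alone while the remaining ranks fill the other nodes. This is a minimal sketch; the node names, slot counts, rank count, and script name are placeholders:

# hostfile: rank 0 (SMD0) gets a node to itself, remaining ranks share the others
drp-srcf-cmp001 slots=1
drp-srcf-cmp002 slots=64
drp-srcf-cmp003 slots=64

# launch with the hostfile
mpirun --hostfile hostfile -n 129 python my_analysis.py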
