This page summarizes known problems and troubleshooting information to help run experiments at a MHz rate.

1. Node configurations 

To achieve a MHz rate, SMD0 (MPI rank 0), the core that does the heavy lifting, must be assigned exclusively to its own node. This allows that node to read smalldata in parallel at full bandwidth.
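One way to reserve a node for rank 0 under Open MPI is a hostfile whose first entry offers a single slot, so that rank 0 should land there alone while the remaining ranks fill the other nodes. The sketch below only generates such a hostfile. The helper name make_hostfile, the host names, the slot count, and the script name are hypothetical, and it assumes the job is launched with mpirun --hostfile rather than through a site-specific batch wrapper.

```shell
# make_hostfile SMD0_NODE SLOTS_PER_NODE NODE...
# Prints an Open MPI hostfile: the first node gets one slot (for rank 0),
# every other node gets SLOTS_PER_NODE slots.
make_hostfile() {
    smd0_node=$1
    slots=$2
    shift 2
    printf '%s slots=1\n' "$smd0_node"
    for node in "$@"; do
        printf '%s slots=%s\n' "$node" "$slots"
    done
}

# Example with hypothetical host names and script:
#   make_hostfile drp-node-a 64 drp-node-b drp-node-c > hostfile
#   mpirun --hostfile hostfile -np 129 python my_analysis.py
```

With 64 slots on each of the two remaining nodes, the example asks for 1 + 2 × 64 = 129 ranks in total.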

2. Open MPI failed to TCP connect

When a job runs on more than a few nodes (> 20) and the nodes assigned to it come from mixed pools (cmp, mon, or eb nodes mixed in), it may fail with the following message:

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: drp-srcf-eb010
  PID:        130465
  Message:    connect() to 172.21.164.90:1055 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------

This happens because a node can have multiple network interfaces (tcp, ib, etc.), but not every node has all of them enabled. When Open MPI selects an interface that is missing on a peer node, no connection can be made.
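To see which nodes lack an interface on a given subnet, the address list of each node can be checked before launching the job. The helper below is a minimal sketch: the name has_subnet is hypothetical, and it assumes the iproute2 `ip` command is available on the nodes and that its one-line output (`ip -o -4 addr show`) puts the address in the fourth field.

```shell
# has_subnet PREFIX
# Reads `ip -o -4 addr show` output on stdin and succeeds if any IPv4
# address on the node starts with PREFIX (for example 172.21.164).
has_subnet() {
    awk -v prefix="$1." '
        { split($4, addr, "/") }                  # $4 looks like 172.21.164.90/24
        index(addr[1], prefix) == 1 { found = 1 } # address begins with the prefix
        END { exit !found }                       # exit 0 only if a match was seen
    '
}

# Run on each node assigned to the job:
#   ip -o -4 addr show | has_subnet 172.21.164 \
#       || echo "$(hostname): no interface on 172.21.164"
```

Nodes that print the message are the ones likely to trigger the connect failure when that subnet is selected.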

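Solution

Use one of the following settings. Open MPI treats btl_tcp_if_include and btl_tcp_if_exclude as mutually exclusive, so only one of them should be set.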

1. Restrict to the 172.21.164 subnet

export OMPI_MCA_btl_tcp_if_include=172.21.164.0/24

2. Exclude the 172.21.152 subnet (the eno1 interface)

export OMPI_MCA_btl_tcp_if_exclude=eno1
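Because only one of the two variables should be active at a time, a small wrapper can help avoid leaving a stale value from an earlier attempt in the environment. This is a sketch only: the function name use_tcp_interfaces is hypothetical, and the /24 prefix length in the usage comment is an assumption about the site network, not something stated above.

```shell
# use_tcp_interfaces MODE VALUE
# MODE is "include" or "exclude"; VALUE is a comma-separated list of
# interface names or CIDR subnets. Sets one Open MPI variable and clears
# the other so the two are never set together.
use_tcp_interfaces() {
    case $1 in
        include)
            unset OMPI_MCA_btl_tcp_if_exclude
            export OMPI_MCA_btl_tcp_if_include="$2"
            ;;
        exclude)
            unset OMPI_MCA_btl_tcp_if_include
            export OMPI_MCA_btl_tcp_if_exclude="$2"
            ;;
        *)
            echo "usage: use_tcp_interfaces include|exclude VALUE" >&2
            return 1
            ;;
    esac
}

# Usage, before calling mpirun:
#   use_tcp_interfaces exclude eno1
#   use_tcp_interfaces include 172.21.164.0/24   # prefix length is an assumption
```

The same parameters can also be passed on the command line, for example `mpirun --mca btl_tcp_if_exclude eno1 ...`, which keeps the choice visible in the job script.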

