
This page summarizes known problems and troubleshooting information to help run experiments at a MHz rate.

1. Node configurations 

To achieve a MHz rate, the heavy-lifting SMD0 core (MPI rank 0) needs a node to itself, so that it can read smalldata in parallel at the node's full bandwidth. To dedicate a node to SMD0, run

#!/bin/bash

#SBATCH --partition=anagpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --output=%j.log


# setup node configurations
./setup_nodes.sh


# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m mpi4py.run" allows mpi to exit if one rank has an exception
mpirun python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
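The setup_nodes.sh helper is site-specific and not shown on this page. As a rough illustration of the idea only, the sketch below (hypothetical node names and hostfile name, not the real script) builds an Open MPI hostfile that gives the first allocated node a single slot, reserving it for SMD0 (rank 0), and packs 3 ranks onto each remaining node:

```shell
# Hypothetical sketch of the node-pinning step; the real setup_nodes.sh may differ.
# In a live job the node list would come from SLURM, e.g.:
#   nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
nodes="drp-srcf-cmp001 drp-srcf-cmp002 drp-srcf-cmp003"   # example allocation
first=$(echo "$nodes" | awk '{print $1}')
{
  echo "$first slots=1"                                   # SMD0 (rank 0) alone on the first node
  for n in $(echo "$nodes" | cut -d' ' -f2-); do
    echo "$n slots=3"                                     # matches --ntasks-per-node=3
  done
} > hostfile.txt
# mpirun would then be launched with: mpirun --hostfile hostfile.txt ...
```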

2. Open MPI failed to TCP connect

We have observed that when a job runs on more than a few nodes (> 20) and the nodes assigned to it come from mixed pools (cmp, mon, or eb nodes mixed together), it may fail with the following message:

--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.
  Local host: drp-srcf-eb010
  PID:        130465
  Message:    connect() to 172.21.164.90:1055 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------

This happens because each node has multiple network interfaces (TCP, InfiniBand, etc.), but not every interface is enabled on every node; the connection fails when a peer node is missing the interface that Open MPI selected.
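To see which interfaces actually exist on the nodes of an allocation, a quick diagnostic can help (the srun fan-out shown in the comment is an assumption about how you would run it across nodes; the command itself is plain Linux):

```shell
# Print this host's network interfaces; a node missing the interface
# Open MPI selected will stand out. Across an allocation you could run:
#   srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(ls /sys/class/net | tr "\n" " ")"'
echo "$(hostname): $(ls /sys/class/net | tr '\n' ' ')"
```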

Solution

Prior to running your job, apply either 1) or 2).

1. Restrict to the 172.21.164 subnet 

export OMPI_MCA_btl_tcp_if_include=172.21.164.0/24

2. Exclude the interface on the 172.21.152 subnet (eno1)

export OMPI_MCA_btl_tcp_if_exclude=eno1
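Either export can be placed directly in the batch script before the mpirun line; Open MPI reads OMPI_MCA_* variables from the environment at startup and applies them to the launched ranks. A sketch using the exclude variant from above:

```shell
# In the SLURM batch script, set the MCA parameter before launching.
# Open MPI picks up OMPI_MCA_* environment variables automatically.
export OMPI_MCA_btl_tcp_if_exclude=eno1
# mpirun python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
```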

