Slurm with SLURM_HOSTFILE usage

1. MPI loses TCP connection

Steps to reproduce the problem (seen intermittently):

SLURM_HOSTFILE=slurm_hosts srun -o $$.log --partition=anaq --exclusive ./run_slac.sh
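
For reference, slurm_hosts here is a plain-text hostfile with one node name per line; Slurm uses it to choose which nodes the step runs on, and with -m arbitrary the task layout follows the order in the file (repeating a name places more tasks on that node). A hypothetical sketch (node names are examples only):

drp-srcf-eb010
drp-srcf-cmp013
drp-srcf-cmp014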

Error message:

--------------------------------------------------------------------------

WARNING: Open MPI failed to TCP connect to a peer MPI process.  This

should not happen.


Your Open MPI job may now hang or fail.


  Local host: drp-srcf-eb010

  PID:        130465

  Message:    connect() to 172.21.164.90:1055 failed

  Error:      Operation now in progress (115)

--------------------------------------------------------------------------

Solution:

From this post, this appears to be a known problem when a node is configured with multiple network interfaces. From the error message, drp-srcf-eb010 probably binds to 172.21.152.107 and tries to connect to another node on the 172.21.164 subnet, but there is no route from that node back to 172.21.152.107. There are two solutions:

1. Restrict to 172.21.164 subnet 

export OMPI_MCA_btl_tcp_if_include=172.21.164.0/22

2. Exclude 172.21.152 subnet

From drp-srcf-eb010,

monarin@drp-srcf-eb010 ~ $ ifconfig

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000

        inet 172.21.152.107  netmask 255.255.252.0  broadcast 172.21.155.255

        inet6 fe80::3eec:efff:fe43:df4c  prefixlen 64  scopeid 0x20<link>

        ether 3c:ec:ef:43:df:4c  txqueuelen 1000  (Ethernet)

        RX packets 623533315  bytes 877839059185 (817.5 GiB)

        RX errors 0  dropped 0  overruns 0  frame 0

        TX packets 240940243  bytes 351017518046 (326.9 GiB)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092

        inet 172.21.164.107  netmask 255.255.252.0  broadcast 172.21.167.255

        inet6 fe80::e42:a103:27:f470  prefixlen 64  scopeid 0x20<link>

Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).

        infiniband 00:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)

        RX packets 630508148  bytes 676594617662 (630.1 GiB)

        RX errors 0  dropped 0  overruns 0  frame 0

        TX packets 795093132  bytes 1777834901566 (1.6 TiB)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536

        inet 127.0.0.1  netmask 255.0.0.0

        inet6 ::1  prefixlen 128  scopeid 0x10<host>

        loop  txqueuelen 1000  (Local Loopback)

        RX packets 46583012  bytes 31092559674 (28.9 GiB)

        RX errors 0  dropped 0  overruns 0  frame 0

        TX packets 46583012  bytes 31092559674 (28.9 GiB)

        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Forcing nodes not to bind to eno1 also works:

export OMPI_MCA_btl_tcp_if_exclude=eno1
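
Note that setting btl_tcp_if_exclude replaces Open MPI's default exclude list, so it is generally advisable to keep the loopback interface excluded as well. A sketch of the adjusted setting, plus a quick check (using the peer address from the error message above) that traffic to the 172.21.164 subnet leaves via ib0 rather than eno1:

export OMPI_MCA_btl_tcp_if_exclude=lo,eno1

# confirm the outgoing interface for the peer subnet
ip route get 172.21.164.90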

Previously Known Solution:

Removing eb and mon nodes from slurm_hosts mitigates the problem (the error is seen when either eb or mon nodes are requested in combination with cmp nodes).

2. self BTL is not recognized when using only self,vader

Steps to reproduce the problem:

In run_slac.sh

export OMPI_MCA_btl=self,vader

Error message:

--------------------------------------------------------------------------

At least one pair of MPI processes are unable to reach each other for

MPI communications.  This means that no Open MPI device has indicated

that it can be used to communicate between these processes.  This is

an error; Open MPI requires that all MPI processes be able to reach

each other.  This error can sometimes be the result of forgetting to

specify the "self" BTL.


  Process 1 ([[38932,26011],0]) is on host: drp-srcf-eb011

  Process 2 ([[38932,26011],1024]) is on host: drp-srcf-cmp013

  BTLs attempted: self


Your MPI job is now going to abort; sorry.

--------------------------------------------------------------------------

Traceback (most recent call last):

  File "./test_fex_cfd1.py", line 3, in <module>

    from psana import *

  File "/cds/home/m/monarin/lcls2/psana/psana/__init__.py", line 1, in <module>

    from .datasource import DataSource

  File "/cds/home/m/monarin/lcls2/psana/psana/datasource.py", line 15, in <module>

    prometheus_jobid = MPI.COMM_WORLD.bcast(prometheus_jobid, root=0)

  File "mpi4py/MPI/Comm.pyx", line 1569, in mpi4py.MPI.Comm.bcast

  File "mpi4py/MPI/msgpickle.pxi", line 723, in mpi4py.MPI.PyMPI_bcast

  File "mpi4py/MPI/msgpickle.pxi", line 724, in mpi4py.MPI.PyMPI_bcast

mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error

Solution:

Use self,tcp instead. The vader BTL provides shared-memory transport only between processes on the same node, so with only self,vader the processes on drp-srcf-eb011 and drp-srcf-cmp013 have no way to reach each other; the tcp BTL is needed for inter-node communication.
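
A minimal sketch of the fix in run_slac.sh:

export OMPI_MCA_btl=self,tcp

If shared-memory transport within a node is still wanted, vader can in principle be kept alongside tcp (self,vader,tcp); self,vader alone cannot connect processes on different hosts.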

3. "Sorry!" message appears at gather

To reproduce the problem:

use a high number of cores, e.g. -n 2177.

Error message:

--------------------------------------------------------------------------

Sorry!  You were supposed to get help about:

    accept failed

But I couldn't open the help file:

    /cds/sw/ds/ana/conda2/inst/envs/ps-4.5.5/share/openmpi/help-mpi-btl-tcp.txt: Too many open files.  Sorry!

--------------------------------------------------------------------------

Solution:

Currently, removing the gather (np.array of cn_evts and cn_with_peaks) mitigates the problem. Other options to pursue:

  1. Increase the open-file limit (ulimit -n N); the default is currently 1024 (see the sketch after this list).
  2. Check the open-file limit that Slurm propagates to job steps.
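
A minimal sketch for items 1 and 2 above, assuming the hard limit on the nodes allows raising the soft limit to 4096 and that Slurm is left at its default limit propagation (PropagateResourceLimits=ALL):

# in run_slac.sh, before launching the MPI job: raise the soft open-file limit
ulimit -n 4096

# check what limit a job step actually sees on the compute nodes
srun --partition=anaq bash -c 'ulimit -n'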