Slurm with SLURM_HOSTFILE usage
1. MPI lost TCP connection
Steps to reproduce the problem (seen intermittently):
SLURM_HOSTFILE=slurm_hosts srun -o $$.log --partition=anaq --exclusive ./run_slac.sh
Error message:
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: drp-srcf-eb010
  PID:        130465
  Message:    connect() to 172.21.164.90:1055 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
Solution:
From this post, this appears to be a known problem when a node is configured with multiple network interfaces. From the error message, drp-srcf-eb010 probably binds to 172.21.152.107 (eno1) and tries to connect to another node on the 172.21.164 subnet, but there is no route from that node back to 172.21.152.107. There are two solutions:
1. Restrict Open MPI to the 172.21.164 subnet (a /22, per the netmask shown by ifconfig below):
export OMPI_MCA_btl_tcp_if_include=172.21.164.0/22
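A sketch of the full invocation with the filter exported in the submitting shell; srun forwards exported environment variables to the job step (and hence run_slac.sh) by default:

export OMPI_MCA_btl_tcp_if_include=172.21.164.0/22
SLURM_HOSTFILE=slurm_hosts srun -o $$.log --partition=anaq --exclusive ./run_slac.sh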
2. Exclude the 172.21.152 subnet
On drp-srcf-eb010:
monarin@drp-srcf-eb010 ~ $ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 172.21.152.107  netmask 255.255.252.0  broadcast 172.21.155.255
        inet6 fe80::3eec:efff:fe43:df4c  prefixlen 64  scopeid 0x20<link>
        ether 3c:ec:ef:43:df:4c  txqueuelen 1000  (Ethernet)
        RX packets 623533315  bytes 877839059185 (817.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 240940243  bytes 351017518046 (326.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet 172.21.164.107  netmask 255.255.252.0  broadcast 172.21.167.255
        inet6 fe80::e42:a103:27:f470  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 00:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 630508148  bytes 676594617662 (630.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 795093132  bytes 1777834901566 (1.6 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 46583012  bytes 31092559674 (28.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 46583012  bytes 31092559674 (28.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Forcing nodes not to bind to eno1 also works:
export OMPI_MCA_btl_tcp_if_exclude=eno1
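Note that setting btl_tcp_if_exclude replaces Open MPI's default exclude list, which normally covers the loopback interface, so the Open MPI FAQ suggests excluding it explicitly as well. A possible variant (not tested here):

export OMPI_MCA_btl_tcp_if_exclude=lo,eno1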
Previously Known Solution:
Removing eb and mon nodes from slurm_hosts mitigates the problem (the error is seen when either eb or mon nodes are requested in combination with cmp nodes).
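For reference, a hypothetical slurm_hosts restricted to cmp nodes. Slurm's SLURM_HOSTFILE expects one hostname per line (a host may be repeated to place multiple tasks on it); the node names below are only illustrative:

drp-srcf-cmp013
drp-srcf-cmp014
drp-srcf-cmp015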
2. self in MCA btl is not recognized when using only self,vader
Steps to reproduce the problem:
In run_slac.sh
export OMPI_MCA_btl=self,vader
Error message:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38932,26011],0]) is on host: drp-srcf-eb011
  Process 2 ([[38932,26011],1024]) is on host: drp-srcf-cmp013
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_fex_cfd1.py", line 3, in <module>
    from psana import *
  File "/cds/home/m/monarin/lcls2/psana/psana/__init__.py", line 1, in <module>
    from .datasource import DataSource
  File "/cds/home/m/monarin/lcls2/psana/psana/datasource.py", line 15, in <module>
    prometheus_jobid = MPI.COMM_WORLD.bcast(prometheus_jobid, root=0)
  File "mpi4py/MPI/Comm.pyx", line 1569, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 723, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 724, in mpi4py.MPI.PyMPI_bcast
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
Solution:
Use self,tcp instead. vader is a shared-memory BTL and only reaches processes on the same node, so the inter-node communication above (drp-srcf-eb011 to drp-srcf-cmp013) needs the tcp BTL; see the example below.
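For example, replacing the export in run_slac.sh shown above:

export OMPI_MCA_btl=self,tcp

Listing self,vader,tcp should also work if shared-memory transport within a node is still wanted; tcp is what provides the missing inter-node path.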
3. "Sorry" message appears at gather
To reproduce the problem:
Use -n 2177 (a high number of cores), as in the example below.
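A sketch of the reproduction command, assuming the same slurm_hosts and run_slac.sh as in section 1:

SLURM_HOSTFILE=slurm_hosts srun -n 2177 -o $$.log --partition=anaq --exclusive ./run_slac.sh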
Error message:
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    accept failed
But I couldn't open the help file:
    /cds/sw/ds/ana/conda2/inst/envs/ps-4.5.5/share/openmpi/help-mpi-btl-tcp.txt: Too many open files.  Sorry!
--------------------------------------------------------------------------
Solution:
Currently, removing the gather (np.array of cn_evts and cn_with_peaks) mitigates the problem. Other things to try:
- Increase the open-file limit (ulimit -n N); the current default is 1024.
- Check the limit on open files inside Slurm jobs (see the sketch below).
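A minimal sketch for both items, assuming the anaq partition from above and that run_slac.sh is the launch script; 4096 is an arbitrary example value:

# Check the soft open-file limit as seen inside a Slurm job step
srun --partition=anaq bash -c 'ulimit -n'

# Raise the soft limit (up to the hard limit) before launching MPI,
# e.g. near the top of run_slac.sh
ulimit -n 4096

Slurm typically propagates the submitting shell's resource limits (PropagateResourceLimits), so raising the limit before calling srun may also be enough, depending on site configuration.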