Slurm with SLURM_HOSTFILE usage
1. MPI lost TCP connection
Steps to reproduce the problem (seen intermittently):
SLURM_HOSTFILE=slurm_hosts srun -o $$.log --partition=anaq --exclusive ./run_slac.sh
Error message:
--------------------------------------------------------------------------
WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
should not happen.

Your Open MPI job may now hang or fail.

  Local host: drp-srcf-eb010
  PID:        130465
  Message:    connect() to 172.21.164.90:1055 failed
  Error:      Operation now in progress (115)
--------------------------------------------------------------------------
Solution:
From this post, this appears to be a known problem when a node is configured with multiple network interfaces. From the error message, drp-srcf-eb010 probably binds to 172.21.152.107 (eno1) and tries to connect to another node on the 172.21.164 subnet, but there is no route from that node back to 172.21.152.107. There are two solutions:
1. Restrict Open MPI to the 172.21.164 subnet (a /22, per the netmask shown by ifconfig below):
export OMPI_MCA_btl_tcp_if_include=172.21.164.0/22
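A sketch of the full invocation with the filter exported in the submitting shell; srun forwards exported environment variables to the job step (and hence run_slac.sh) by default:

export OMPI_MCA_btl_tcp_if_include=172.21.164.0/22
SLURM_HOSTFILE=slurm_hosts srun -o $$.log --partition=anaq --exclusive ./run_slac.sh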
2. Exclude the 172.21.152 subnet
On drp-srcf-eb010:
monarin@drp-srcf-eb010 ~ $ ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 172.21.152.107  netmask 255.255.252.0  broadcast 172.21.155.255
        inet6 fe80::3eec:efff:fe43:df4c  prefixlen 64  scopeid 0x20<link>
        ether 3c:ec:ef:43:df:4c  txqueuelen 1000  (Ethernet)
        RX packets 623533315  bytes 877839059185 (817.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 240940243  bytes 351017518046 (326.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 4092
        inet 172.21.164.107  netmask 255.255.252.0  broadcast 172.21.167.255
        inet6 fe80::e42:a103:27:f470  prefixlen 64  scopeid 0x20<link>
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
        infiniband 00:00:11:07:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
        RX packets 630508148  bytes 676594617662 (630.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 795093132  bytes 1777834901566 (1.6 TiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 46583012  bytes 31092559674 (28.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 46583012  bytes 31092559674 (28.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
Forcing nodes not to bind to eno1 also works:
export OMPI_MCA_btl_tcp_if_exclude=eno1
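Note that setting btl_tcp_if_exclude replaces Open MPI's default exclude list, which normally covers the loopback interface, so the Open MPI FAQ suggests excluding it explicitly as well. A possible variant (not tested here):

export OMPI_MCA_btl_tcp_if_exclude=lo,eno1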
Previously Known Solution:
Removing eb and mon nodes from slurm_hosts mitigates the problem (the error is seen when either eb or mon nodes are requested in combination with cmp nodes).
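For reference, a hypothetical slurm_hosts restricted to cmp nodes. Slurm's SLURM_HOSTFILE expects one hostname per line (a host may be repeated to place multiple tasks on it); the node names below are only illustrative:

drp-srcf-cmp013
drp-srcf-cmp014
drp-srcf-cmp015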
2. self in MCA btl is not recognized when using only self,vader
Steps to reproduce the problem:
In run_slac.sh
export OMPI_MCA_btl=self,vader
Error message:
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[38932,26011],0]) is on host: drp-srcf-eb011
  Process 2 ([[38932,26011],1024]) is on host: drp-srcf-cmp013
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "./test_fex_cfd1.py", line 3, in <module>
    from psana import *
  File "/cds/home/m/monarin/lcls2/psana/psana/__init__.py", line 1, in <module>
    from .datasource import DataSource
  File "/cds/home/m/monarin/lcls2/psana/psana/datasource.py", line 15, in <module>
    prometheus_jobid = MPI.COMM_WORLD.bcast(prometheus_jobid, root=0)
  File "mpi4py/MPI/Comm.pyx", line 1569, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 723, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 724, in mpi4py.MPI.PyMPI_bcast
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
Solution:
Use self,tcp instead. vader is a shared-memory BTL and only reaches processes on the same node, so the inter-node communication above (drp-srcf-eb011 to drp-srcf-cmp013) needs the tcp BTL; see the example below.
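For example, replacing the export in run_slac.sh shown above:

export OMPI_MCA_btl=self,tcp

Listing self,vader,tcp should also work if shared-memory transport within a node is still wanted; tcp is what provides the missing inter-node path.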
3. "Sorry" message appears at gather
To reproduce the problem:
Use -n 2177 (a high number of cores), as in the example below.
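A sketch of the reproduction command, assuming the same slurm_hosts and run_slac.sh as in section 1:

SLURM_HOSTFILE=slurm_hosts srun -n 2177 -o $$.log --partition=anaq --exclusive ./run_slac.sh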
Error message:
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    accept failed
But I couldn't open the help file:
    /cds/sw/ds/ana/conda2/inst/envs/ps-4.5.5/share/openmpi/help-mpi-btl-tcp.txt: Too many open files.  Sorry!
--------------------------------------------------------------------------
Solution:
Currently, removing the gather (np.array of cn_evts and cn_with_peaks) mitigates the problem. Other things to try:
- Increase the open-file limit (ulimit -n N); the current default is 1024.
- Check the limit on open files inside Slurm jobs (see the sketch below).
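A minimal sketch for both items, assuming the anaq partition from above and that run_slac.sh is the launch script; 4096 is an arbitrary example value:

# Check the soft open-file limit as seen inside a Slurm job step
srun --partition=anaq bash -c 'ulimit -n'

# Raise the soft limit (up to the hard limit) before launching MPI,
# e.g. near the top of run_slac.sh
ulimit -n 4096

Slurm typically propagates the submitting shell's resource limits (PropagateResourceLimits), so raising the limit before calling srun may also be enough, depending on site configuration.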