This page summarizes known problems and troubleshooting information for running experiments at a MHz rate.
1. Node configurations
To achieve a MHz rate, we need to consider the following:
- The heavy-lifting SMD0 core (MPI rank 0) needs to be assigned to an individual node exclusively. This allows the entire node to read smalldata in parallel at the full bandwidth.
- The number of Srv cores (data writing) should not exceed a limit. We currently see 8 as the upper limit for data-writing cores.
For more detail on node types, see the PSANA2 diagram.
To assign node configurations, run
Code Block |
---|
#!/bin/bash
#SBATCH --partition=anagpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --output=%j.log
# setup node configurations
./setup_nodes.sh
# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m mpi4py.run" allows mpi to exit if one rank has an exception
mpirun python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
|
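The upper limit on Srv cores noted in the list above can be checked before a job is submitted. The following is a minimal sketch, not part of psana2: the helper name `check_srv_cores` and the variable `MAX_SRV_CORES` are hypothetical, and the value 8 is simply the limit currently observed on this page.

```shell
#!/bin/bash
# Observed upper limit for data-writing (Srv) cores, taken from the note above.
MAX_SRV_CORES=8

# check_srv_cores N
# Hypothetical helper (not part of psana2): succeeds when N is within the limit,
# otherwise prints a message to stderr and returns 1.
check_srv_cores() {
    if [ "$1" -le "$MAX_SRV_CORES" ]; then
        return 0
    fi
    echo "Requested $1 Srv cores; observed upper limit is $MAX_SRV_CORES" >&2
    return 1
}
```

Calling `check_srv_cores 4 || exit 1` near the top of a batch script would stop the job before `mpirun` starts if the request is too large.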
2. Open MPI failed to TCP connect
We observe that when a job runs on more than a few nodes (> 20) and the nodes assigned to it come from mixed pools (cmp, mon, or eb nodes mixed in), it may fail with the following message:
Code Block |
---|
-------------------------------------------------------------------------- |
WARNING: Open MPI failed to TCP connect to a peer MPI process. This |
should not happen. |
Your Open MPI job may now hang or fail. |
Local host: drp-srcf-eb010 |
PID: 130465 |
Message: connect() to 172.21.164.90:1055 failed |
Error: Operation now in progress (115) |
-------------------------------------------------------------------------- |
This happens because a node can have multiple network interfaces (tcp, ib, etc.), but not every node has all of them enabled. When Open MPI selects an interface that is missing on a peer node, the connection to that node fails.
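To see whether a given node actually has an address on the subnet Open MPI is trying to use, the output of `ip -o -4 addr show` can be inspected. The sketch below is an illustration only: `has_subnet` is a hypothetical helper, and the subnet 172.21.164 is taken from the error message above.

```shell
#!/bin/bash
# has_subnet PREFIX
# Hypothetical helper: reads `ip -o -4 addr show` output on stdin and succeeds
# if any IPv4 address falls under PREFIX (e.g. 172.21.164).
has_subnet() {
    awk -v prefix="$1" '
        { split($4, addr, "/") }                      # $4 has the form address/prefixlen
        index(addr[1], prefix ".") == 1 { found = 1 }
        END { exit !found }
    '
}

# Report for the local node; skipped when the `ip` tool is unavailable.
if command -v ip >/dev/null 2>&1; then
    if ip -o -4 addr show | has_subnet 172.21.164; then
        echo "$(hostname): has an interface on 172.21.164"
    else
        echo "$(hostname): no interface on 172.21.164"
    fi
fi
```

Running this once per node inside the allocation (for example through `srun --ntasks-per-node=1`) should show which nodes lack the interface, assuming `ip` is installed on them.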
Solution
Prior to running your job, apply either 1) or 2).
1. Restrict to 172.21.164 subnet
Code Block |
---|
export OMPI_MCA_btl_tcp_if_include=172.21.164.90/1072 |
2. Exclude 172.21.152 subnet
Code Block |
---|
export OMPI_MCA_btl_tcp_if_exclude=eno1 |
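Open MPI accepts MCA parameters either as `OMPI_MCA_`-prefixed environment variables or as `--mca` options on the `mpirun` command line, so the same setting can be scoped to a single launch instead of the whole shell. The wrapper below is a sketch under that assumption: `run_excluding_iface` is a hypothetical name, and the example call reuses the interface and script path that appear earlier on this page.

```shell
#!/bin/bash
# run_excluding_iface IFACE CMD [ARGS...]
# Hypothetical wrapper: passes the exclusion as an mpirun --mca option, which
# Open MPI treats like the OMPI_MCA_btl_tcp_if_exclude environment variable.
run_excluding_iface() {
    local iface="$1"
    shift
    mpirun --mca btl_tcp_if_exclude "$iface" "$@"
}

# Example call (commented out so that sourcing this file does not launch a job):
# run_excluding_iface eno1 python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
```

Open MPI's documentation advises also listing the loopback interface when `btl_tcp_if_exclude` is overridden, because the override replaces the default exclusion list. A value such as `lo,eno1` may therefore be worth trying; this has not been verified on these nodes.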