This page summarizes known problems and troubleshooting information that help with running experiments at a MHz rate.
1. Node configurations
To achieve a MHz rate, we need to consider the following:
...
```bash
#!/bin/bash
#SBATCH --partition=anagpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --output=%j.log

# setup node configurations
./setup_nodes.sh

# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m mpi4py.run" allows mpi to exit if one rank has an exception
mpirun python -u -m mpi4py.run /reg/g/psdm/tutorials/examplePython/mpiDataSource.py
```
2. Open MPI failed to TCP connect
When a job runs on more than about 20 nodes and the nodes assigned to it come from mixed pools (cmp, mon, or eb nodes mixed in), it may fail with the following message:
...
This happens because a node can have multiple network interfaces (tcp, ib, etc.), but not every node has all of them enabled. If Open MPI selects an interface that is missing on one of the nodes, no connection can be made to that node.
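To make the mismatch concrete, the following is a minimal Python sketch (not part of the original page) that compares the interfaces reported by each node and shows which nodes lack a given one. The host names, the interface names (`eth0`, `ib0`), and the helper functions are all hypothetical and chosen only for illustration. On a real cluster you would collect each node's set yourself, for example by calling `local_interfaces()` on every node.

```python
import socket


def local_interfaces():
    """Return the names of the network interfaces on this host."""
    # socket.if_nameindex() yields (index, name) pairs for each interface.
    return {name for _, name in socket.if_nameindex()}


def common_interfaces(per_node):
    """Return the interfaces that every node in `per_node` reports."""
    sets = [set(ifaces) for ifaces in per_node.values()]
    return set.intersection(*sets) if sets else set()


def nodes_missing(per_node, iface):
    """Return the sorted host names that do not have `iface`."""
    return sorted(host for host, ifaces in per_node.items() if iface not in ifaces)


# Hypothetical inventory, made up for illustration only.
example = {
    "cmp-node": {"lo", "eth0", "ib0"},
    "mon-node": {"lo", "eth0", "ib0"},
    "eb-node": {"lo", "eth0"},
}

print(sorted(common_interfaces(example)))  # ['eth0', 'lo']
print(nodes_missing(example, "ib0"))       # ['eb-node']
```

In this made-up inventory, a job that selected `ib0` could not reach `eb-node`, which mirrors the failure described above. Only interfaces present on every node are safe to select.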
Solution
Prior to running your job, apply either 1) or 2).
...