This page summarizes known problems and troubleshooting information to help with running experiments at a MHz rate.

1. Node configurations 

To achieve a MHz rate, we need to consider the following:


Code Block

#!/bin/bash
#SBATCH --partition=anagpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=3
#SBATCH --output=%j.log

# setup node configurations

# "-u" flushes print statements which can otherwise be hidden if mpi hangs
# "-m" allows mpi to exit if one rank has an exception
mpirun python -u -m /reg/g/psdm/tutorials/examplePython/

2. Open MPI failed to TCP connect

When a job runs on more than a few nodes (> 20) and the nodes assigned to it come from mixed pools (cmp, mon, or eb nodes mixed in), it may fail with the following message:


This happens because a node can have multiple network interfaces (tcp, ib, etc.), but not every node has all of them enabled. If Open MPI selects an interface that is missing on one of the nodes, no connection can be made to that node.
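To check whether the nodes in a job share a usable interface, the per-node interface lists can be intersected. The following is a minimal sketch, not a procedure from this page: the helper name common_ifaces and the sample interface names (eth0, ib0) are hypothetical, and it assumes each node's list was collected beforehand (for example with "ls /sys/class/net" on that node).

```shell
#!/bin/bash
# Hypothetical helper: print, comma-separated, the interfaces that appear in
# EVERY argument. Each argument is one node's space-separated interface list.
# The loopback interface "lo" is dropped because it cannot reach other nodes.
common_ifaces() {
    local n=$#
    local list
    for list in "$@"; do
        # One name per line, deduplicated within a node ($list is split on purpose)
        printf '%s\n' $list | sort -u
    done | sort | uniq -c \
         | awk -v n="$n" '$1 == n && $2 != "lo" { print $2 }' \
         | paste -sd, -
}

# Sample data (made up): the second node has no ib0 interface
common_ifaces "lo eth0 ib0" "lo eth0" "ib0 eth0 lo"   # prints: eth0
```

An empty result means no single interface is available on all nodes. If the result is non-empty, one possible use (an assumption, not necessarily the fix this page intends) is to pass it to Open MPI's btl_tcp_if_include MCA parameter, for example by exporting OMPI_MCA_btl_tcp_if_include before calling mpirun.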


Prior to running your job, apply either 1) or 2).
