We may not need to use OpenMPI with Infiniband if we can get similar performance running psana2 with MPI communications over Ethernet. These connections are needed only for transferring small data (11 GB in this test) from Smd0 to the EventBuilder and BigData nodes. Here we show the performance of reading 123 GB from 16 files using 7 drp nodes (113 cores: 1 Smd0, 12 EventBuilders, 100 BigData cores).
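As a rough illustration of the core layout quoted above, the sketch below splits 113 ranks into the three roles. The rank ordering (rank 0 as Smd0, then EventBuilders, then BigData) and the function name `assign_roles` are assumptions made for this example; the actual mapping inside psana2 is not described on this page.

```python
from collections import Counter


def assign_roles(n_ranks, n_eventbuilders):
    """Map each rank to a role name (hypothetical ordering, for illustration only)."""
    if n_ranks < n_eventbuilders + 2:
        raise ValueError("need at least one Smd0 rank and one BigData rank")
    roles = {}
    for rank in range(n_ranks):
        if rank == 0:
            roles[rank] = "Smd0"          # single reader of the small data
        elif rank <= n_eventbuilders:
            roles[rank] = "EventBuilder"  # ranks 1..n_eventbuilders
        else:
            roles[rank] = "BigData"       # all remaining ranks
    return roles


if __name__ == "__main__":
    roles = assign_roles(113, 12)
    print(dict(Counter(roles.values())))
    # {'Smd0': 1, 'EventBuilder': 12, 'BigData': 100}
```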

 


Conclusion:

Using OpenMPI with Infiniband: Rate 39.5 kHz (Total Time: 253 s)

Using MPICH from conda on Ethernet: Rate 39.7 kHz (Total Time: 252 s) 
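To put the two results side by side, the short calculation below derives the relative difference in rate and the aggregate read throughput from the figures quoted above (123 GB read, rates and total times as listed). It assumes the total time covers the full 123 GB read; the helper names are made up for this example.

```python
DATA_GB = 123  # total data read in the test, from the description above

# (rate in kHz, total time in seconds) as quoted in the conclusion
RUNS = {
    "OpenMPI with Infiniband": (39.5, 253),
    "MPICH on Ethernet": (39.7, 252),
}


def throughput_gb_per_s(data_gb, seconds):
    """Aggregate read throughput, assuming the whole data set was read in `seconds`."""
    return data_gb / seconds


def relative_difference(a, b):
    """Relative difference of b with respect to a."""
    return (b - a) / a


if __name__ == "__main__":
    for name, (rate_khz, seconds) in RUNS.items():
        gbps = throughput_gb_per_s(DATA_GB, seconds)
        print(f"{name}: {rate_khz} kHz, {gbps:.3f} GB/s")
    rate_ib = RUNS["OpenMPI with Infiniband"][0]
    rate_eth = RUNS["MPICH on Ethernet"][0]
    print(f"rate difference: {relative_difference(rate_ib, rate_eth):.1%}")
```

Both runs come out just under 0.5 GB/s, and the rates differ by about half a percent.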


Note 1: Below are plots from Grafana showing incoming and outgoing traffic.

...

MPICH on Ethernet: no noticeable peaks

 


To run the test:

OpenMPI with Infiniband: 

  • Clone the psana environment, then remove mpi4py, mpich, and mpi.
  • Build openmpi on drp nodes (drp-tst-dev011 was used for this test). No special flags are needed; just use --prefix to choose where the build is installed.

...

  • An existing pre-built copy is located at ~monarin/tmp/4.0.0-rhel7.
  • Build mpi4py using this openmpi (see the recipe in relmanage/recipe).

...

cd ~/tmp
git clone https://github.com/mpi4py/mpi4py.git
cd mpi4py
export PATH=/cds/home/m/monarin/tmp/4.0.0-rhel7/bin:$PATH
which mpicc
python setup.py install --single-version-externally-managed --record=record.txt

...

  • Run it with, for example:

~/tmp/4.0.0-rhel7/bin/mpirun --hostfile openmpi_hosts --mca btl_openib_allow_ib 1 run_slac.sh

...

conda activate ps-1.2.2-openmpi

python dev_bd.py


MPICH on Ethernet:

(ps-2.1.2) monarin@drp-tst-dev011 (master *) psana2 $ /reg/g/psdm/sw/conda2/inst/envs/ps-2.1.2/bin/mpirun -f mpich_hosts ./run_slac.sh

...

source $HOME/lcls2/setup_env.sh

python dev_bd.py


Note:

Below is another test that runs MPICH on Infiniband using the -iface flag. This test was done under ps-3.1.8.

mpirun -f mpich_hosts -iface ib0 ./run_slac.sh

(ps-3.1.8) monarin@drp-tst-dev011 (master *) psana2 $ cat run_slac.sh 

#!/bin/bash
source $HOME/lcls2/setup_env.sh
conda activate ps-3.1.8
python test_send_recv.py

(ps-3.1.8) monarin@drp-tst-dev011 (master *) psana2 $ cat test_send_recv.py 

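from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
myhost = MPI.Get_processor_name()

n = 100000  # number of messages sent from rank 0 to rank 1

if rank == 0:
    # Each message holds 1,000,000 C ints (about 4 MB).
    data = np.arange(1000000, dtype='i')
    for i in range(n):
        comm.send(data, dest=1)
    # An all-zero array tells the receiver to stop.
    comm.send(np.zeros(1000000, dtype='i'), dest=1)
elif rank == 1:
    while True:
        data = comm.recv(source=0)
        print(f'rank{rank} recv {data.shape} {np.sum(data)}')
        if np.sum(data) == 0:
            break
# Ranks above 1 receive nothing; they skip the loop instead of blocking in recv.

print(f'rank{rank} on host {myhost} done')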
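The script above uses the lowercase `comm.send`/`comm.recv`, which pickle each array. For estimating link bandwidth, a variant using the buffer-based `comm.Send`/`comm.Recv` with timing may be easier to interpret. The following is only a sketch under stated assumptions: the message count is reduced to 100, the helper `total_payload_bytes` is introduced for this example, and it was not part of the original test.

```python
try:
    from mpi4py import MPI
except ImportError:  # lets the helper below be used without MPI installed
    MPI = None


def total_payload_bytes(n_messages, n_elems, itemsize=4):
    """Bytes moved by n_messages arrays of n_elems items of `itemsize` bytes."""
    return n_messages * n_elems * itemsize


def main(n_messages=100, n_elems=1_000_000):
    if MPI is None:
        print("mpi4py is not available; skipping the timing run")
        return
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    if comm.Get_size() < 2:
        print("need at least 2 ranks (rank 0 sends, rank 1 receives)")
        return

    # Rank 0 fills the buffer; every other rank only needs space to receive.
    if rank == 0:
        buf = np.arange(n_elems, dtype="i")
    else:
        buf = np.empty(n_elems, dtype="i")

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(n_messages):
        if rank == 0:
            comm.Send(buf, dest=1)      # buffer-based send, no pickling
        elif rank == 1:
            comm.Recv(buf, source=0)
    comm.Barrier()
    elapsed = MPI.Wtime() - t0

    if rank == 1:
        gigabytes = total_payload_bytes(n_messages, n_elems, buf.itemsize) / 1e9
        print(f"received {gigabytes:.2f} GB in {elapsed:.2f} s "
              f"({gigabytes / elapsed:.2f} GB/s)")


if __name__ == "__main__":
    main()
```

It would be launched the same way as the original script, for example through run_slac.sh with `mpirun -f mpich_hosts -iface ib0`. For scale, the original test moves 100000 messages of about 4 MB each, roughly 400 GB in total.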


Checking the traffic on IB

https://pswww.slac.stanford.edu/system/grafana/d/C81U-s_mz/drp-dev-io?orgId=1&var-IBmetrics=node_infiniband_port_constraint_errors_transmitted_total&var-job=drpdev&var-group=dev&var-nname=drp-tst-dev&var-node=All
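As a local complement to the Grafana dashboard, the byte counters that Linux exposes under /sys/class/net can be read before and after a run to see whether traffic went through ib0. This is a minimal sketch, assuming the node runs Linux and has an interface named ib0; the function names are made up for this example.

```python
from pathlib import Path


def read_iface_bytes(iface, sysfs_root="/sys/class/net"):
    """Return (rx_bytes, tx_bytes) for `iface`, or None if the counters cannot be read."""
    stats = Path(sysfs_root) / iface / "statistics"
    try:
        rx = int((stats / "rx_bytes").read_text())
        tx = int((stats / "tx_bytes").read_text())
    except (OSError, ValueError):
        return None
    return rx, tx


def traffic_delta(before, after):
    """Bytes received and transmitted between two snapshots."""
    return after[0] - before[0], after[1] - before[1]


if __name__ == "__main__":
    snapshot = read_iface_bytes("ib0")
    if snapshot is None:
        print("ib0 counters are not available on this host")
    else:
        print(f"ib0 rx_bytes={snapshot[0]} tx_bytes={snapshot[1]}")
```

Taking one snapshot before the mpirun command and one after, then passing both to `traffic_delta`, gives the bytes that crossed the interface during the test.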

Note on Slurm

To submit a Slurm job, use either of the following two methods:

sbatch submit_slac.sh

cat submit_slac.sh

#!/bin/bash
#SBATCH --partition=anagpu
#SBATCH --job-name=psana2-test
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
#SBATCH --output=%j.log

# -u flushes print statements which can otherwise be hidden if mpi hangs
t_start=$(date +%s)
srun python -u ./test_mpi.py
t_end=$(date +%s)

echo PSJobCompleted TotalElapsed $((t_end-t_start))

or

srun --partition=anagpu --ntasks=4 --ntasks-per-node=4 python ./test_mpi.py
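The batch script writes a final line of the form "PSJobCompleted TotalElapsed <seconds>" to the job log (%j.log). A small parser like the one below could pull that number out when comparing runs. It is a sketch only; the sample log text is invented for illustration and is not output from a real job.

```python
import re

# Matches the summary line echoed at the end of submit_slac.sh.
ELAPSED_PATTERN = re.compile(r"^PSJobCompleted TotalElapsed (\d+)\s*$", re.MULTILINE)


def total_elapsed_seconds(log_text):
    """Return the elapsed seconds reported in a job log, or None if the line is absent."""
    match = ELAPSED_PATTERN.search(log_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Invented sample text standing in for the contents of a %j.log file.
    sample_log = "some job output\nPSJobCompleted TotalElapsed 42\n"
    print(total_elapsed_seconds(sample_log))
```

A log without the summary line (for example, from a job that was killed before finishing) yields None rather than raising an error.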