Smd0 sends data to and receives data from EventBuilder cores. We can overlap these communications with two computation tasks (chunk building and message repacking).

Test setup

We used 16 smd files artificially generated for this test.

Test data: /cds/data/drpsrcf/users/monarin/xtcdata/10M16n/

Test code: 

https://github.com/monarin/psana-nersc/blob/master/psana2/test_psana2_perf.py

https://github.com/monarin/psana-nersc/blob/master/psana2/run_slac.sh

Running the test (below shows 3 cores):

`which mpirun` -n 3 --hostfile openmpi_hosts --mca btl_openib_allow_ib 1 ./run_slac.sh

(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ cat openmpi_hosts 

drp-srcf-eb003 slots=1

drp-srcf-eb011 slots=50

To use more eventbuilder cores, update the number of eventbuilder cores in run_slac.sh and increase the -n value passed to mpirun accordingly.

Blocking Performance

The table below shows the performance with blocking MPI calls, as a baseline for the improvements in the following sections:


                 eb=1                eb=2                eb=4                eb=8                eb=16
TASK             total(ms)  #occurs  total(ms)  #occurs  total(ms)  #occurs  total(ms)  #occurs  total(ms)  #occurs
SMD0GOTCHUNK       2028.15     1086    2009.68     1086     1981.5     1086    1992.47     1086    1995.75     1086
SMD0GOTEB           575.47     1087      45.39     1087      45.85     1087      45.11     1087      46.17     1087
SMD0GOTREPACK        264.1     1087     298.13     1087     272.72     1087     298.47     1087     279.94     1087
SMD0DONEWITHEB     3535.77     1087    5023.36     1087    4849.88     1087    4972.46     1087    4895.14     1087
SMD0GOTSTEPHIST      64.02     1087      60.18     1087      63.05     1087      64.49     1087      69.19     1087
SMD0GOTSTEP          85.66     1087      84.72     1087      86.01     1087      83.84     1087      86.55     1087
total:             6553.16             7521.47             7299.01             7456.84             7372.74
rate (MHz)            1.53                1.33                1.37                1.34                1.36

Overlapping with Send

By replacing Send with Isend, we allow Smd0 to move on right after initiating a send to an eventbuilder core instead of waiting for the send to complete. With this overlap, the total wall time improves from 7.4 to 4.4 seconds with 16 eventbuilder cores.
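
The overlap can be sketched as follows. This is a minimal illustration and not the actual Smd0 code: comm is assumed to be an mpi4py communicator (e.g. MPI.COMM_WORLD), while build_chunk, the round-robin choice of destination, and the function name are hypothetical.

```python
def send_chunks_overlapped(comm, build_chunk, n_chunks, eb_ranks, tag=0):
    """Build and send n_chunks chunks, overlapping each send with later work.

    comm:        assumed to be an mpi4py communicator (e.g. MPI.COMM_WORLD)
    build_chunk: hypothetical callable, chunk index -> picklable chunk
    eb_ranks:    eventbuilder ranks, used round-robin here for simplicity
    """
    pending = []
    for i in range(n_chunks):
        chunk = build_chunk(i)              # computation on Smd0
        dest = eb_ranks[i % len(eb_ranks)]  # illustrative choice of destination
        # isend returns a request right away, so the next build_chunk call
        # can run while this message is still in flight.
        pending.append(comm.isend(chunk, dest=dest, tag=tag))
    # Every outstanding request must be completed before Smd0 finishes.
    for req in pending:
        req.wait()
    return len(pending)
```

With a blocking send in place of isend, each iteration would stall until the message is handed off, which is the time that shows up under SMD0DONEWITHEB in the blocking table.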


                 eb=1                eb=2                eb=4                eb=8                eb=16
TASK             total(ms)  #occurs  total(ms)  #occurs  total(ms)  #occurs  total(ms)  #occurs  total(ms)  #occurs
SMD0GOTCHUNK       1995.07     1086    1993.69     1086     1975.4     1086     1964.6     1086    1983.13     1086
SMD0GOTEB          5841.78     1087    2779.48     1087    1964.36     1087    1916.66     1087    1832.13     1087
SMD0GOTREPACK       297.86     1087     258.33     1087     295.82     1087     306.37     1087     345.58     1087
SMD0DONEWITHEB       57.76     1087      61.78     1087      62.37     1087      61.95     1087      60.76     1087
SMD0GOTSTEPHIST      78.34     1087      85.66     1087      87.96     1087      80.87     1087      81.09     1087
SMD0GOTSTEP          86.38     1087       89.5     1087      91.87     1087      89.96     1087      88.43     1087
total:             8357.18             5268.44             4477.77             4420.42             4391.13
rate (MHz)            1.20                1.90                2.23                2.26                2.28
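
The rate rows and the gain from Isend can be recomputed from the totals in the two tables. The sketch below assumes the test set contains 10 million events; this is inferred from the data directory name (10M16n) and reproduces the reported rates, but it is not stated explicitly on this page.

```python
N_EVENTS = 10_000_000  # assumption: inferred from the "10M16n" directory name

# Total wall time (ms) per number of eventbuilder cores, copied from the tables.
blocking_ms = {1: 6553.16, 2: 7521.47, 4: 7299.01, 8: 7456.84, 16: 7372.74}
isend_ms = {1: 8357.18, 2: 5268.44, 4: 4477.77, 8: 4420.42, 16: 4391.13}


def rate_mhz(total_ms, n_events=N_EVENTS):
    """Event rate in MHz for n_events processed in total_ms milliseconds."""
    seconds = total_ms * 1e-3
    return n_events / seconds / 1e6


def speedup(n_eb):
    """Ratio of blocking to Isend wall time for n_eb eventbuilder cores."""
    return blocking_ms[n_eb] / isend_ms[n_eb]


for n_eb in sorted(blocking_ms):
    print(f"eb={n_eb:2d}  blocking {rate_mhz(blocking_ms[n_eb]):.2f} MHz  "
          f"isend {rate_mhz(isend_ms[n_eb]):.2f} MHz  "
          f"speedup {speedup(n_eb):.2f}x")
```

Note that the ratio is below 1 for eb=1: with a single eventbuilder core the Isend version takes longer (8357.18 ms vs. 6553.16 ms), with most of the time spent in SMD0GOTEB.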

Conclusions / Known Issues

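Overlapping Send with other computation tasks improves performance when two or more eventbuilder cores are used (e.g. from 1.36 to 2.28 MHz with 16 eventbuilder cores). With a single eventbuilder core, the Isend version is slower (1.20 vs. 1.53 MHz).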

In addition to overlapping Send, we can also perform computation tasks while Smd0 waits for an eventbuilder core to come back (Irecv).
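
A possible shape for this second overlap is sketched below. It is an untested idea rather than existing Smd0 code: comm is assumed to be an mpi4py communicator, the eventbuilder core is assumed to announce that it is ready by sending its own rank, and build_chunk and the function name are hypothetical.

```python
def wait_for_eb_while_building(comm, build_chunk, tag=0):
    """Post a receive for the next ready eventbuilder, then build a chunk.

    comm:        assumed to be an mpi4py communicator (e.g. MPI.COMM_WORLD)
    build_chunk: hypothetical callable returning the next chunk to send
    Returns (eb_rank, chunk), where eb_rank is the payload of the message,
    assumed here to be the rank of the eventbuilder core that is ready.
    """
    # irecv returns immediately; in mpi4py the source defaults to ANY_SOURCE,
    # so whichever eventbuilder core reports first is matched.
    req = comm.irecv(tag=tag)
    # Computation proceeds while the request is outstanding.
    chunk = build_chunk()
    # Block only now, if no eventbuilder core has come back yet.
    eb_rank = req.wait()
    return eb_rank, chunk
```

With the Isend version, SMD0GOTEB accounts for roughly 1.8 of the 4.4 seconds at eb=16, so this wait is the largest remaining item besides chunk building.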
