Overlapping communication with computation

Smd0 sends data to and receives data from EventBuilder cores. We can overlap this communication with two computation tasks: chunk building and message repacking.

Test setup

We used 16 smd files artificially generated for this test.

Test data: /cds/data/drpsrcf/users/monarin/xtcdata/10M16n/

Test code:

https://github.com/monarin/psana-nersc/blob/master/psana2/test_psana2_perf.py

https://github.com/monarin/psana-nersc/blob/master/psana2/run_slac.sh

Running the test (the example below uses 3 cores):

`which mpirun` -n 3 --hostfile openmpi_hosts --mca btl_openib_allow_ib 1 ./run_slac.sh

(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ cat openmpi_hosts

drp-srcf-eb003 slots=1

drp-srcf-eb011 slots=50

To use more eventbuilder cores, update the number of eventbuilder cores in run_slac.sh.

Blocking Performance

Below is the blocking MPI performance, given as a baseline for comparison with the improvements shown later on this page:

	eb=1		eb=2		eb=4		eb=8		eb=16
TASK	total(ms)	#occurs	total(ms)	#occurs	total(ms)	#occurs	total(ms)	#occurs	total(ms)	#occurs
SMD0GOTCHUNK	2028.15	1086	2009.68	1086	1981.5	1086	1992.47	1086	1995.75	1086
SMD0GOTEB	575.47	1087	45.39	1087	45.85	1087	45.11	1087	46.17	1087
SMD0GOTREPACK	264.1	1087	298.13	1087	272.72	1087	298.47	1087	279.94	1087
SMD0DONEWITHEB	3535.77	1087	5023.36	1087	4849.88	1087	4972.46	1087	4895.14	1087
SMD0GOTSTEPHIST	64.02	1087	60.18	1087	63.05	1087	64.49	1087	69.19	1087
SMD0GOTSTEP	85.66	1087	84.72	1087	86.01	1087	83.84	1087	86.55	1087
total:	6553.16	6553.16	7521.47	7521.47	7299.01	7299.01	7456.84	7456.84	7372.74	7372.74
rate (MHz)	1.53		1.33		1.37		1.34		1.36
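
The rate row can be reproduced from the total row. The sketch below assumes the test set holds 10 million events; this count is inferred from the "10M16n" directory name and is not stated on this page. The variable names are illustrative only.

```python
# Assumption: 10 million events in the test set (inferred from "10M16n").
N_EVENTS = 10_000_000

# Blocking-run totals in ms, keyed by number of eventbuilder cores
# (copied from the "total:" row of the table above).
blocking_total_ms = {1: 6553.16, 2: 7521.47, 4: 7299.01, 8: 7456.84, 16: 7372.74}


def rate_mhz(n_events, total_ms):
    """Return the event rate in MHz (events per microsecond)."""
    total_us = total_ms * 1e3
    return n_events / total_us


blocking_rate = {eb: round(rate_mhz(N_EVENTS, ms), 2)
                 for eb, ms in blocking_total_ms.items()}

for eb, rate in blocking_rate.items():
    print(f"eb={eb:<2d} rate={rate:.2f} MHz")
```

Under that assumption the computed values agree with the rate row to two decimal places.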

Overlapping with Send

By replacing Send with Isend, we allow Smd0 to move on immediately after initiating a send to an eventbuilder core. With this overlap, the total wall time with 16 eventbuilder cores improves from 7.4 to 4.0 seconds.
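
The overlap pattern can be sketched without MPI. The code below is not the psana2 implementation and does not call MPI at all: a thread-pool future stands in for the request object that a nonblocking send returns, and `fake_send` and `build_chunk` are hypothetical placeholders for the transfer and for chunk building. In the overlapped loop, the next chunk is built while the previous transfer is still in flight, and the wait happens only just before the next transfer is started.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_send(chunk, delay=0.01):
    """Hypothetical stand-in for a transfer to an eventbuilder core."""
    time.sleep(delay)
    return len(chunk)  # number of bytes "sent"


def build_chunk(i, size=1024):
    """Hypothetical stand-in for Smd0 chunk building."""
    return bytes([i % 256]) * size


def run_blocking(n_chunks):
    """Build a chunk, then wait for its transfer before building the next."""
    sent = 0
    for i in range(n_chunks):
        sent += fake_send(build_chunk(i))
    return sent


def run_overlapped(n_chunks):
    """Build chunk i while the transfer of chunk i-1 is still in flight."""
    sent = 0
    pending = None  # plays the role of the request from a nonblocking send
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i in range(n_chunks):
            chunk = build_chunk(i)          # overlaps with the pending transfer
            if pending is not None:
                sent += pending.result()    # analogous to waiting on the request
            pending = pool.submit(fake_send, chunk)
        if pending is not None:
            sent += pending.result()        # drain the last transfer
    return sent


blocking_bytes = run_blocking(4)
overlapped_bytes = run_overlapped(4)
print(blocking_bytes, overlapped_bytes)
```

Both loops transfer the same data; only the point at which the sender waits differs. With a real nonblocking send, the send buffer must also stay unmodified until the wait completes, which is why the sketch builds each chunk into a fresh object.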

	eb=1		eb=2		eb=4		eb=8		eb=16
TASK	total(ms)	#occurs	total(ms)	#occurs	total(ms)	#occurs	total(ms)	#occurs	total(ms)	#occurs
SMD0GOTCHUNK	1964.37	1086	2035.83	1086	2015.74	1086	1992.94	1086	2004.79	1086
SMD0GOTEB	5695.49	1087	2800	1087	1748.11	1087	1676.01	1087	1619.14	1087
SMD0GOTREPACK	244.85	1087	212.7	1087	235.5	1087	198.57	1087	186.95	1087
SMD0DONEWITHEB	48.9	1087	50.04	1087	52.74	1087	53.23	1087	51.64	1087
SMD0GOTSTEPHIST	76.27	1087	79.68	1087	83.65	1087	83.27	1087	82.98	1087
SMD0GOTSTEP	87.37	1087	86.9	1087	90.34	1087	92.26	1087	88.62	1087
total:	8117.26	8117.26	5265.15	5265.15	4226.07	4226.07	4096.28	4096.28	4034.12	4034.12
rate (MHz)	1.23		1.90		2.37		2.44		2.48
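
To compare the two runs directly, the ratio of the blocking total to the Isend total can be computed per eventbuilder count. This is a small sketch using only the "total:" rows of the two tables on this page; the dictionary names are illustrative.

```python
# Totals in ms, keyed by number of eventbuilder cores
# (copied from the "total:" rows of the blocking and Isend tables).
blocking_total_ms = {1: 6553.16, 2: 7521.47, 4: 7299.01, 8: 7456.84, 16: 7372.74}
isend_total_ms = {1: 8117.26, 2: 5265.15, 4: 4226.07, 8: 4096.28, 16: 4034.12}

# Ratio > 1 means the Isend run finished faster than the blocking run.
speedup = {eb: round(blocking_total_ms[eb] / isend_total_ms[eb], 2)
           for eb in blocking_total_ms}

for eb, ratio in speedup.items():
    print(f"eb={eb:<2d} blocking/isend = {ratio:.2f}")
```

By this measure the Isend run is slower than the blocking run with a single eventbuilder core (ratio below 1) and faster from two cores upward, with the ratio leveling off near 1.8 at 8 and 16 cores.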

Conclusions/ Known Issues

We gain performance by overlapping Send with other computation tasks (total wall time drops from 7.4 to 4.0 seconds with 16 eventbuilder cores). However, the Isend/Irecv code crashes with the current real experiment data (tmoc00118, run=463). We need to investigate this issue before continuing this work.

In addition to overlapping the send, we can also perform computational tasks while Smd0 waits for an eventbuilder core to come back (Irecv). This implementation should be explored after the issue mentioned above is solved.
