...
Below is the blocking-MPI performance, shown as a baseline for the improvements in the following sections:
| TASK | eb=1 total (ms) | eb=1 #occurs | eb=2 total (ms) | eb=2 #occurs | eb=4 total (ms) | eb=4 #occurs | eb=8 total (ms) | eb=8 #occurs | eb=16 total (ms) | eb=16 #occurs |
|---|---|---|---|---|---|---|---|---|---|---|
| SMD0GOTCHUNK | 2028.15 | 1086 | 2009.68 | 1086 | 1981.5 | 1086 | 1992.47 | 1086 | 1995.75 | 1086 |
| SMD0GOTEB | 575.47 | 1087 | 45.39 | 1087 | 45.85 | 1087 | 45.11 | 1087 | 46.17 | 1087 |
| SMD0GOTREPACK | 264.1 | 1087 | 298.13 | 1087 | 272.72 | 1087 | 298.47 | 1087 | 279.94 | 1087 |
| SMD0DONEWITHEB | 3535.77 | 1087 | 5023.36 | 1087 | 4849.88 | 1087 | 4972.46 | 1087 | 4895.14 | 1087 |
| SMD0GOTSTEPHIST | 64.02 | 1087 | 60.18 | 1087 | 63.05 | 1087 | 64.49 | 1087 | 69.19 | 1087 |
| SMD0GOTSTEP | 85.66 | 1087 | 84.72 | 1087 | 86.01 | 1087 | 83.84 | 1087 | 86.55 | 1087 |
| total | 6553.16 | | 7521.47 | | 7299.01 | | 7456.84 | | 7372.74 | |
| rate (MHz) | 1.53 | | 1.33 | | 1.37 | | 1.34 | | 1.36 | |
Overlapping with Send
By replacing Send with Isend, we allow Smd0 to move on after initiating a send to an eventbuilder core. With this overlap, the total wall time improves from 7.4 s to 4.4 s with 16 eventbuilder cores.
[The per-task timing table for the Isend version is garbled in the source; the per-task totals for eb=1..16 are not recoverable.]

Conclusions/ Known Issues
We gain some performance by overlapping Send with other computation tasks. However, the Isend/Irecv version of the code crashes on current real experiment data (tmoc00118, run=463). This issue needs to be investigated before continuing the work.
In addition to overlapping the send, we can also perform computational tasks while Smd0 waits for an eventbuilder core to report back (Irecv).