We explored whether allowing the disk-reading task to overlap with the computation task (chunk building) on Smd0 can speed up the processing rate.
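The idea of overlapping reading with chunk building can be sketched as a small producer/consumer pair. This is only an illustration and not the code in dev_smd0.py or test_multitasking_threads.py: the names reader, build_chunks, and demo, the 1 MiB read size, and the queue depth are all hypothetical, and a byte count stands in for the real chunk-building work.

```python
import os
import queue
import tempfile
import threading


def reader(path, read_size, chunks):
    """Read the file in fixed-size pieces and hand each piece to the consumer."""
    try:
        with open(path, "rb") as f:
            while True:
                data = f.read(read_size)
                if not data:
                    break
                chunks.put(data)
    finally:
        chunks.put(None)  # sentinel: tells the consumer there is no more data


def build_chunks(path, read_size=1 << 20, depth=2):
    """Process pieces in the main thread while the reader fetches the next ones."""
    # Bounded queue: the reader can run at most `depth` pieces ahead.
    chunks = queue.Queue(maxsize=depth)
    t = threading.Thread(target=reader, args=(path, read_size, chunks))
    t.start()
    n_bytes = 0
    while True:
        data = chunks.get()
        if data is None:
            break
        n_bytes += len(data)  # placeholder for the real chunk-building work
    t.join()
    return n_bytes


def demo(size=5_000_000):
    """Write a temporary file of `size` bytes and run the overlapped pipeline on it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(size))
        path = tmp.name
    try:
        return build_chunks(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print("bytes processed:", demo())
```

In CPython, a blocking file read releases the GIL, so the reader thread can wait on the disk while the main thread works on the previous piece. Whether this helps depends on how much computation there is to hide behind the I/O, which is what the measurements below look at.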

Test setup

Location of test data:

smd_dir = '/cds/data/drpsrcf/users/monarin/xtcdata/10M60n/xtcdata/smalldata'

Parameters for dev_smd0.py

os.environ['PS_SMD_MAX_RETRIES'] = '0'

os.environ['PS_SMD_N_EVENTS'] = '10000'

os.environ['PS_SMD_CHUNKSIZE'] = '16777216'

os.environ['PS_SMD0_NUM_THREADS'] = '32'

Parameters for test_multitasking_threads.py

cdef uint64_t buf_size = 100000000

cdef uint64_t read_size = 5000000

Location of test scripts:

https://github.com/monarin/psana-nersc/blob/master/psana2/dev_smd0.py

https://github.com/monarin/divelite/blob/master/cython/parallel/test_multitasking_threads.py (formerly tst.py)

Smd0 performance w/o overlapping I/O

The output below shows the performance of Smd0 (no eventbuilder cores connected) when it reads and builds chunks synchronously. The command-line argument sets the number of smalldata files read (#Smdfiles).

(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 1
found EndRun
total search time: 0.041627854108810425
#Smdfiles: 1 #Events: 10000403 Elapsed Time (s): 0.31 Rate (MHz): 32.35 Bandwidth(GB/s):2.4585713863431047
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 2
found EndRun
total search time: 0.07852859795093536
#Smdfiles: 2 #Events: 10000403 Elapsed Time (s): 0.59 Rate (MHz): 16.95 Bandwidth(GB/s):2.5766052518112645
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 4
found EndRun
total search time: 0.1361357718706131
#Smdfiles: 4 #Events: 10000403 Elapsed Time (s): 0.81 Rate (MHz): 12.36 Bandwidth(GB/s):3.757219106209765
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 8
found EndRun
total search time: 0.2327030450105667
#Smdfiles: 8 #Events: 10000403 Elapsed Time (s): 1.20 Rate (MHz): 8.36 Bandwidth(GB/s):5.085212222769722
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 16
found EndRun
total search time: 0.24143944680690765
#Smdfiles: 16 #Events: 10000403 Elapsed Time (s): 1.97 Rate (MHz): 5.07 Bandwidth(GB/s):6.160083997216601
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 32
found EndRun
total search time: 0.27678677439689636
#Smdfiles: 32 #Events: 10000403 Elapsed Time (s): 3.38 Rate (MHz): 2.96 Bandwidth(GB/s):7.200784223247125
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) psana2 $ python dev_smd0.py 52
found EndRun
total search time: 0.3047977387905121
#Smdfiles: 52 #Events: 10000403 Elapsed Time (s): 4.81 Rate (MHz): 2.08 Bandwidth(GB/s):8.223357074851291
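As a cross-check on the log above, the rate column can be recomputed from the event count and the elapsed times. The sketch below assumes that Rate (MHz) is simply the number of events divided by the elapsed time. That formula is an assumption about dev_smd0.py rather than something taken from its source, and rate_mhz is a hypothetical helper name. The recomputed values can differ slightly from the logged ones because the printed elapsed times are rounded to two decimals.

```python
def rate_mhz(n_events, elapsed_s):
    """Event rate in MHz, assuming rate = events / elapsed time."""
    return n_events / elapsed_s / 1e6


N_EVENTS = 10000403  # "#Events" from the log above

# Elapsed time (s) per number of smalldata files, copied from the log above.
elapsed_by_files = {1: 0.31, 2: 0.59, 4: 0.81, 8: 1.20, 16: 1.97, 32: 3.38, 52: 4.81}

rates = {n: rate_mhz(N_EVENTS, t) for n, t in elapsed_by_files.items()}

for n, r in rates.items():
    print(f"#Smdfiles: {n:2d}  Rate (MHz): {r:.2f}")
```

Since the event count is the same in every run, the rate falls in direct proportion to the elapsed time as more files are read.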

I/O Performance w/o computation

We compare these results with a script that only performs disk reading (no computation).

#Files: 1 Elapsed(s): 0.26s. Bandwidth(GB/s):2.97
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) parallel $ python tst.py 2
#Files: 2 Elapsed(s): 0.32s. Bandwidth(GB/s):4.70
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) parallel $ python tst.py 4
#Files: 4 Elapsed(s): 0.40s. Bandwidth(GB/s):7.53
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) parallel $ python tst.py 8
#Files: 8 Elapsed(s): 0.62s. Bandwidth(GB/s):9.88
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) parallel $ python tst.py 16
#Files: 16 Elapsed(s): 1.32s. Bandwidth(GB/s):9.20
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) parallel $ python tst.py 32
#Files: 32 Elapsed(s): 2.36s. Bandwidth(GB/s):10.29
(ps-4.3.2) monarin@drp-srcf-eb003 (master *) parallel $ python tst.py 52
#Files: 52 Elapsed(s): 3.64s. Bandwidth(GB/s):10.86

Note that eliminating the computation task does not bring a large improvement (for 52 files, the elapsed time drops from 4.81 s to 3.64 s), which indicates that Smd0 is already largely I/O limited in this case.
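One rough way to quantify this is the ratio of the read-only elapsed time to the Smd0 elapsed time for each file count. The sketch below uses only the elapsed times from the two logs above. Treating that ratio as "the share of Smd0 time spent on I/O" is an interpretation, not a measurement, and each number comes from a single run.

```python
# Elapsed times (s) copied from the two logs above, indexed by number of files.
n_files = [1, 2, 4, 8, 16, 32, 52]
smd0_elapsed = [0.31, 0.59, 0.81, 1.20, 1.97, 3.38, 4.81]       # read + build chunks
read_only_elapsed = [0.26, 0.32, 0.40, 0.62, 1.32, 2.36, 3.64]  # read only

# Ratio of read-only time to Smd0 time; closer to 1 means more I/O bound.
io_fraction = [r / s for r, s in zip(read_only_elapsed, smd0_elapsed)]

for n, frac in zip(n_files, io_fraction):
    print(f"#Files: {n:2d}  read-only / Smd0 elapsed: {frac:.2f}")
```

For 52 files the read-only run takes roughly three quarters of the Smd0 time, so even a perfect overlap of reading and chunk building could not do better than the read-only time.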



