We used Cython multithreading (via OpenMP prange) in Smd0 to read data and collect event timestamps in parallel. We noticed an overhead of roughly 0.3 ms per batch from thread synchronization. The table below shows the time (ms) Smd0 needs to finish yielding one batch, with and without multithreading. Although we expect this time to shrink roughly linearly as Smd0 yields smaller batches, this does not hold in the multithreaded case: for batch size = 10,000 it took 2.31 ms, while for batch size = 1 it was still around 1 ms (compare the second row of the table, with prange, to the first row, without multithreading). A minimal sketch of the parallel pattern is shown after the table.

 

Time Spent per Batch (ms) \ BATCH_SIZE   10000    1000     100      1
Average Time w/o multithreading           6.20     0.63     0.07     0.0009
Average Time w/ prange                    2.31     1.09     0.99     0.94
Max Time w/o multithreading              15.49    13.79    13.45    13.34
Max Time w/ prange                        4.28     2.69     3.11     9.18
Min Time w/o multithreading               1.54     0.16     0.01     0.0 ...
Min Time w/ prange                        1.31     0.92     0.78     0.67
Std. Time w/o multithreading              6.56     2.44     0.77     0.08
Std. Time w/ prange                       1.39     0.30     0.14     0.13
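
The pattern behind these measurements is roughly the following. This is a minimal sketch only, not the actual Smd0 code; collect_timestamps and the raw_ts/out arrays are hypothetical stand-ins for Smd0's event data.

from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def collect_timestamps(unsigned long[::1] raw_ts, unsigned long[::1] out, bint use_prange):
    # Copy event timestamps from raw_ts into out, either serially or with an
    # OpenMP-backed prange loop. Every iteration is independent, so the
    # parallel version needs no locking inside the nogil region.
    cdef Py_ssize_t i
    cdef Py_ssize_t n = raw_ts.shape[0]
    if use_prange:
        for i in prange(n, nogil=True, schedule='static'):
            out[i] = raw_ts[i]
    else:
        for i in range(n):
            out[i] = raw_ts[i]

When batches are small, the cost of such a loop is dominated by entering and leaving the prange region rather than by the work itself, which matches the behaviour in the table above.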

 

We investigated how much time Cython needs to enter and exit a prange loop, which activates and joins the OpenMP threads (note that the threads themselves are created only once by Cython). With the simple code below, it takes about 0.3 ms per call to complete.

from cython.parallel import prange
import numpy as np
from posix.time cimport timeval, gettimeofday

def do_prange(int n, int batch_size):
    cdef int i, j
    cdef int sum_i = 0
    cdef timeval tv_st, tv_en
    cdef unsigned long ut_st, ut_en
    gettimeofday(&tv_st, NULL)
    # Entering prange activates the OpenMP threads; leaving it joins them.
    # sum_i is treated as a reduction variable because of the in-place add.
    for i in prange(n, nogil=True, schedule='static'):
        for j in range(batch_size):
            sum_i += 1
    gettimeofday(&tv_en, NULL)
    # Convert both timestamps to microseconds and print the elapsed time.
    ut_st = 1000000 * tv_st.tv_sec + tv_st.tv_usec
    ut_en = 1000000 * tv_en.tv_sec + tv_en.tv_usec
    print(f'{ut_st} {ut_en} {ut_en - ut_st} {sum_i}')

This seems to be the fixed cost of entering and exiting the prange region (activating and joining the OpenMP threads), and would account for the thread-synchronization overhead observed in Smd0.
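
For completeness: prange only runs in parallel if the extension is compiled with OpenMP enabled. A minimal build sketch, assuming setuptools and gcc, with the code above saved as a hypothetical do_prange.pyx:

# setup.py (build sketch; file and module names are placeholders)
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "do_prange",
    sources=["do_prange.pyx"],
    extra_compile_args=["-fopenmp"],  # enable OpenMP so prange uses threads
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))

After building (for example with python setup.py build_ext --inplace), calling do_prange(4, 10000) from Python prints the start and end timestamps in microseconds, the elapsed time, and the final sum.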

 

 
