Multithreading Performance

We used Cython multithreading (via OpenMP) in Smd0 to read in parallel and collect event timestamps. We noticed an overhead of about 0.3 ms from thread synchronization. The table below shows the time (ms) Smd0 takes to finish yielding one batch, with and without multithreading. Although we expect this time to decrease linearly as Smd0 yields smaller batches, that does not hold with multithreading. For a batch size of 10,000 the average is 2.3 ms, while for a batch size of 1 it is still around 1 ms (compare the second row of the table with the first row, where multithreading is not used).


Time Spent per Batch (ms) / BATCH_SIZE   10000    1000     100       1
Average Time w/o multithreading           6.20    0.63    0.07  0.0009
Average Time w/ prange                    2.31    1.09    0.99    0.94
Max Time w/o multithreading              15.49   13.79   13.45   13.34
Max Time w/ prange                        4.28    2.69    3.11    9.18
Min Time w/o multithreading               1.54    0.16    0.01  0.0 ...
Min Time w/ prange                        1.31    0.92    0.78    0.67
Std. Time w/o multithreading              6.56    2.44    0.77    0.08
Std. Time w/ prange                       1.39    0.30    0.14    0.13
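
To make the comparison between the two sets of rows easier to read, the short sketch below divides the average time without multithreading by the average time with prange for each batch size. The numbers are copied from the table; the function and variable names are only illustrative and are not part of Smd0.

```python
# Average time per batch (ms), copied from the table above.
BATCH_SIZES = [10000, 1000, 100, 1]
AVG_SERIAL_MS = [6.20, 0.63, 0.07, 0.0009]  # without multithreading
AVG_PRANGE_MS = [2.31, 1.09, 0.99, 0.94]    # with prange


def speedups(serial_ms, prange_ms):
    """Return serial/prange time ratios; values above 1 mean prange is faster."""
    return [s / p for s, p in zip(serial_ms, prange_ms)]


if __name__ == "__main__":
    for size, ratio in zip(BATCH_SIZES, speedups(AVG_SERIAL_MS, AVG_PRANGE_MS)):
        verdict = "prange faster" if ratio > 1 else "prange slower"
        print(f"batch size {size:>5}: serial/prange = {ratio:.4f} ({verdict})")
```

On these averages only the 10,000 column gives a ratio above 1; for the smaller batch sizes the roughly constant prange cost dominates.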


Simple Cython parallel performance test

We measured how long Cython needs to enter and exit a loop that activates and joins threads (note that Cython creates the threads only once). With the simple code below, this takes about 0.3 ms.

from cython.parallel import prange
from posix.time cimport timeval, gettimeofday

def do_prange(int n, int batch_size):
    cdef int i, j
    cdef int sum_i = 0
    cdef timeval tv_st, tv_en
    cdef unsigned long ut_st, ut_en

    # Time the whole parallel region, including thread activation and join.
    gettimeofday(&tv_st, NULL)
    for i in prange(n, nogil=True, schedule='static'):
        for j in range(batch_size):
            sum_i += 1  # in-place update, so Cython treats sum_i as a reduction
    gettimeofday(&tv_en, NULL)

    # Convert both timestamps to microseconds and print start, end, elapsed, sum.
    ut_st = 1000000 * tv_st.tv_sec + tv_st.tv_usec
    ut_en = 1000000 * tv_en.tv_sec + tv_en.tv_usec
    print(f'{ut_st} {ut_en} {ut_en - ut_st} {sum_i}')

Simple std::thread performance test

The same seems to be the case when using std::thread in C++. Running the code below also takes about 0.3 ms.

#include <iostream>
#include <thread>
#include <vector>
#include <sys/time.h>

using namespace std;

// Trivial thread body, so the measurement is dominated by thread launch and join.
void func(int tid) {
    int a;
    a = tid;
    //cout << "Launched by thread " << tid << endl;
}

int main() {
    vector<thread> th;
    int nr_threads = 10;
    struct timeval tv_st, tv_en;

    gettimeofday(&tv_st, NULL);

    // Launch a group of threads
    for (int i = 0; i < nr_threads; ++i) {
        th.push_back(thread(func, i));
    }

    // Join the threads with the main thread
    for (auto &t : th) {
        t.join();
    }

    gettimeofday(&tv_en, NULL);

    // Convert both timestamps to microseconds and print start, end, elapsed
    unsigned long ut_st, ut_en;
    ut_st = 1000000 * tv_st.tv_sec + tv_st.tv_usec;
    ut_en = 1000000 * tv_en.tv_sec + tv_en.tv_usec;
    cout << ut_st << " " << ut_en << " " << ut_en - ut_st << endl;

    return 0;
}


Conclusion

We learned that multithreading in Smd0 has an overhead of about 0.3 ms, which can hurt performance when the batch size is set too small (e.g. 1). However, this is unlikely to be the case in actual event processing, where the batch size will be around 1000 or more.
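
As a rough check on where the overhead stops mattering, the sketch below models each case as a fixed cost plus a per-event cost, fits a straight line through the batch size 1 and batch size 10,000 averages from the table, and solves for the batch size at which the two lines cross. The linear model and the helper names are assumptions made for illustration; the measured averages at the intermediate batch sizes do not lie exactly on these lines.

```python
def fit_line(x0, y0, x1, y1):
    """Return (intercept, slope) of the straight line through two points."""
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return intercept, slope


def break_even(serial, prange):
    """Batch size at which both (intercept, slope) models predict equal time."""
    (a_s, b_s), (a_p, b_p) = serial, prange
    return (a_p - a_s) / (b_s - b_p)


# Average time per batch (ms) at batch sizes 1 and 10000, from the table.
serial_model = fit_line(1, 0.0009, 10000, 6.20)
prange_model = fit_line(1, 0.94, 10000, 2.31)

if __name__ == "__main__":
    print(f"modelled fixed cost with prange: {prange_model[0]:.3f} ms per batch")
    print(f"break-even batch size: {break_even(serial_model, prange_model):.0f}")
```

Under these assumptions the crossing point falls between the 1000 and 10000 columns of the table, which agrees with the measured averages (prange is slower at 1000 and faster at 10000).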


