...

2024-02-06 Test of milano216 host with perf

...

Description

Using command:

perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses     python test-scaling-subproc.py <parameter>

...

2024-02-07 Test of milano216 host with command perf

...

Description

Running perf with mpirun on a single CPU and on 80 CPUs:

...

perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 80 python  Detector/examples/test-scaling-mpi.py
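
The report that perf stat prints for a command like this has one line per counter: a count with thousands separators, then the event name with a `:u` suffix. A minimal sketch of turning such a report into a dictionary and deriving two ratios from it is given here. The helper name `parse_perf_stat` and the regular expression are illustrative assumptions, not part of the test scripts; the sample lines are copied from the single-process report in the Results section.

```python
import re

# Matches a counter line such as
# "     4,448,830,552      cache-references:u        (50.00%)"
# Group 1 is the count, group 2 the event name without the ":u" modifier.
COUNTER_RE = re.compile(r'^\s*([\d,]+)\s+([A-Za-z0-9_-]+)(?::\w+)?')

def parse_perf_stat(text):
    """Return {event_name: count} for the counter lines found in text.

    Hypothetical helper: lines that do not look like a counter line
    (timing summary, blank lines) are ignored.
    """
    counts = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(',', ''))
    return counts

sample = """
     4,448,830,552      cache-references:u                                            (50.00%)
        90,374,312      cache-misses:u            #    2.031 % of all cache refs      (50.00%)
   222,814,516,280      cycles:u                                                      (50.02%)
   426,700,282,993      instructions:u            #    1.92  insn per cycle           (50.01%)

      73.297275789 seconds time elapsed
"""

counts = parse_perf_stat(sample)
ipc = counts['instructions'] / counts['cycles']
miss_pct = 100 * counts['cache-misses'] / counts['cache-references']
print('insn per cycle: %.2f  cache-miss ratio: %.3f %%' % (ipc, miss_pct))
```

The two derived values should agree with the annotations perf prints after the `#` sign, which gives a quick consistency check on the parsing.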

Code of the time_consuming_algorithm:
import numpy as np
from time import time

def random_standard(shape=(40,60), mu=200, sigma=25, dtype=np.float64):
    a = mu + sigma*np.random.standard_normal(shape)
    return np.require(a, dtype)

def random_arrays(sh2d = (8*512,1024), dtype=np.float64):
    sh3d = (3,) + sh2d
    return random_standard(shape=sh2d, mu=10, sigma=2, dtype=dtype),\
           random_standard(shape=sh3d, mu=20, sigma=3, dtype=dtype)

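def time_consuming_algorithm():
    # array generation and mask creation are NOT included in the measured time
    a, b = random_arrays()
    gr1 = a>=11
    gr2 = (a>9) & (a<11)
    gr3 = a<=9
    t0_sec = time()
    # only the three masked in-place subtractions are timed
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    return time() - t0_sec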
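
A single call returns one timing, which can fluctuate from call to call. As a rough sketch of how repeated timings could be collected and summarized outside of MPI, the example below repeats the same masked subtraction on a smaller array so that it runs quickly. The names `timed_subtraction` and `summarize`, the reduced shape, and the use of `perf_counter` in place of `time` are assumptions made for illustration, not the code used in the test.

```python
import numpy as np
from time import perf_counter

def timed_subtraction(sh2d=(512, 1024), dtype=np.float64):
    """Masked subtraction as in time_consuming_algorithm, on a smaller
    (hypothetical) shape; returns the elapsed time in seconds."""
    rng = np.random.default_rng()
    a = (10 + 2 * rng.standard_normal(sh2d)).astype(dtype)
    b = (20 + 3 * rng.standard_normal((3,) + sh2d)).astype(dtype)
    gr1 = a >= 11
    gr2 = (a > 9) & (a < 11)
    gr3 = a <= 9
    t0_sec = perf_counter()  # monotonic clock, swapped in for time()
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    return perf_counter() - t0_sec

def summarize(nrepeats=20):
    """Return (min, median, max) of nrepeats timings, in seconds."""
    dts = np.array([timed_subtraction() for _ in range(nrepeats)])
    return dts.min(), np.median(dts), dts.max()

tmin, tmed, tmax = summarize()
print('min:%.6f median:%.6f max:%.6f sec' % (tmin, tmed, tmax))
```

The median is less sensitive than the mean to occasional slow calls, which is why it is reported here next to the extremes.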

Code of the event loop:
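import numpy as np
import psutil
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

hostname = get_hostname()  # helper that is not shown in this excerpt
cpu_num = psutil.Process().cpu_num()  # CPU this process is currently running on
print('rank:%02d cpu_num:%03d size:%02d' % (rank, cpu_num, size))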

ranks = (0, 10, 20, 30, 40, 50, 60, 70)
SAVE_FIGS = True
SHOW_FIGS = False
nevents = 100
arrts = np.zeros((nevents, size), dtype=np.float64)

for nevt in range(nevents):
    dt_sec = time_consuming_algorithm()
    arrts[nevt, rank] = dt_sec  # each rank fills only its own column
    cpu_num = psutil.Process().cpu_num()
    # always report events processed on a CPU in the WEKA range [16,23]
    if 16 <= cpu_num <= 23:
        print('rank:%02d cpu_num:%03d nevt:%03d time:%.6f CPU_NUM IN WEKA RANGE [16,23]' % (rank, cpu_num, nevt, dt_sec))
    # otherwise print every 10th event only
    if nevt % 10 > 0:
        continue
    print('rank:%02d cpu_num:%03d nevt:%03d time:%.6f' % (rank, cpu_num, nevt, dt_sec))
...
# some graphics for the array arrts
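
In the loop above each rank fills only column `rank` of its own copy of arrts, so the per-rank copies have to be combined before all ranks can be compared. The elided part of the script is not shown, so the following is only a sketch of one possible approach, emulated with plain NumPy: because every copy is zero outside its own column, an element-wise sum of the copies gives the full table. In an MPI run the same sum could be obtained with a reduction such as mpi4py's `comm.Reduce(sendbuf, recvbuf, op=MPI.SUM, root=0)`. The job size and the timing values below are simulated, not measured.

```python
import numpy as np

nevents, size = 100, 8  # hypothetical small job, not the 80-rank run
rng = np.random.default_rng(0)

# Emulate what each rank holds after the event loop:
# an array of zeros except for the rank's own column.
per_rank = []
for rank in range(size):
    arrts = np.zeros((nevents, size), dtype=np.float64)
    arrts[:, rank] = rng.normal(1.0, 0.05, nevents)  # simulated times, sec
    per_rank.append(arrts)

# Element-wise sum over ranks; the zero columns do not disturb
# the one column each rank actually filled.
combined = np.sum(per_rank, axis=0)

# Per-rank summary that could feed a plot of time versus rank.
mean_per_rank = combined.mean(axis=0)
median_per_rank = np.median(combined, axis=0)
for rank in range(size):
    print('rank:%02d mean:%.6f median:%.6f sec'
          % (rank, mean_per_rank[rank], median_per_rank[rank]))
```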

Results

Results for perf with mpirun on a single CPU and on 80 CPUs:
ana-4.0.59-py3 [dubrovin@sdfmilan216:~/LCLS/con-py3]$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 1 python  Detector/examples/test-scaling-mpi.py 
...
Performance counter stats for 'mpirun -n 1 python Detector/examples/test-scaling-mpi.py':

     4,448,830,552      cache-references:u                                            (50.00%)
        90,374,312      cache-misses:u            #    2.031 % of all cache refs      (50.00%)
   222,814,516,280      cycles:u                                                      (50.02%)
   426,700,282,993      instructions:u            #    1.92  insn per cycle           (50.01%)
    58,876,394,584      branches:u                                                    (50.01%)
     2,343,687,188      branch-misses:u           #    3.98% of all branches          (50.01%)
           635,183      faults:u                                                    
                 0      migrations:u                                                
           635,183      page-faults:u                                               
     2,158,358,417      L1-dcache-load-misses:u                                       (50.00%)
         5,694,036      L1-icache-load-misses:u                                       (49.99%)
         4,282,821      dTLB-load-misses:u                                            (49.99%)
           890,671      iTLB-load-misses:u                                            (50.00%)

      73.297275789 seconds time elapsed

      69.795728000 seconds user
       2.318007000 seconds sys

ana-4.0.59-py3 [dubrovin@sdfmilan216:~/LCLS/con-py3]$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 80 python  Detector/examples/test-scaling-mpi.py
...
 Performance counter stats for 'mpirun -n 80 python Detector/examples/test-scaling-mpi.py':

   349,526,509,383      cache-references:u                                            (50.01%)
     5,932,480,814      cache-misses:u            #    1.697 % of all cache refs      (50.00%)
18,768,444,974,036      cycles:u                                                      (50.00%)
33,983,153,714,284      instructions:u            #    1.81  insn per cycle           (49.99%)
 4,684,730,635,234      branches:u                                                    (49.99%)
   186,649,297,019      branch-misses:u           #    3.98% of all branches          (50.00%)
        52,121,421      faults:u                                                    
                 0      migrations:u                                                
        52,121,421      page-faults:u                                               
   171,500,392,922      L1-dcache-load-misses:u                                       (50.00%)
       267,672,856      L1-icache-load-misses:u                                       (50.00%)
       339,145,247      dTLB-load-misses:u                                            (50.01%)
        69,780,394      iTLB-load-misses:u                                            (50.01%)

      92.952500273 seconds time elapsed

    6501.353593000 seconds user
     410.844719000 seconds sys
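
To compare the two runs, the totals of the 80-process run can be divided by 80 and set against the single-process run. The sketch below does this arithmetic with the numbers copied from the two reports above. It assumes that the work is spread evenly over the 80 processes and that the share contributed by mpirun itself is negligible. The counters are also user-space only (`:u`) and were multiplexed (about 50% coverage), so the ratios are indicative rather than exact.

```python
nproc = 80

# Values copied from the two perf stat reports above.
single = dict(elapsed=73.297275789, user=69.795728, sys=2.318007,
              instructions=426_700_282_993, cycles=222_814_516_280)
multi = dict(elapsed=92.952500273, user=6501.353593, sys=410.844719,
             instructions=33_983_153_714_284, cycles=18_768_444_974_036)

# Wall-clock time is compared directly; everything else per process.
ratios = {'elapsed': multi['elapsed'] / single['elapsed']}
for key in ('user', 'sys', 'instructions', 'cycles'):
    ratios[key] = (multi[key] / nproc) / single[key]

for key, value in ratios.items():
    print('%-12s 80-process / single-process (per process): %.3f' % (key, value))
```

With these inputs the instruction count per process is essentially unchanged, while cycles per process grow by about 5%, user time per process by about 16%, system time per process by roughly a factor of 2.2, and the elapsed time by about 27%.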

...