previous page: Scaling behavior of psana1 - Part 1 - det.calib method in multicore processing with mpi

2024-02-06 Test of milano216 host with perf

Description

Using command:

perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses python test-scaling-subproc.py <parameter>

where <parameter> selects the number of CPUs under test, e.g. <parameter> = -1, -2, -8, -13, -16, -17, -18 stands for tests on 1, 8, 16, 32, 56, 64, and 128 CPUs.

Code Block
titleCode of time_consuming_algorithm
collapsetrue
import numpy as np
from time import time, sleep

def random_standard(shape=(40,60), mu=200, sigma=25, dtype=np.float64):
    a = mu + sigma*np.random.standard_normal(shape)
    return np.require(a, dtype)

def random_arrays(sh2d = (8*512,1024), dtype=np.float64):
    sh3d = (3,) + sh2d
    return random_standard(shape=sh2d, mu=10, sigma=2, dtype=dtype),\
           random_standard(shape=sh3d, mu=20, sigma=3, dtype=dtype)

def time_consuming_algorithm():
    t01 = time()
    a, b = random_arrays()
    t02 = time()
    gr1 = a>=11
    gr2 = (a>9) & (a<11)
    gr3 = a<=9
    t03 = time()
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    t04 = time()
    return (t01, t02, t03, t04)
Code Block
titleCode of the event loop in test-scaling-subproc.py
collapsetrue
def do_algo(cpu=0, cmt='v0'):

    hostname = get_hostname()
    #cpu_num = psutil.Process().cpu_num()
    print('requested cpu:%03d' % cpu)

    SAVE_FIGS = True
    SHOW_FIGS = False
    nevents = 100
    ntpoints = 6
    arrts = np.zeros((nevents,ntpoints), dtype=np.float64)
    t05_old = time()

    for nevt in range(nevents):
        t00 = time()
        times = time_consuming_algorithm()
        cpu_num = psutil.Process().cpu_num()
        #if cpu_num >=16 and cpu_num <=23:
        #    print('cpu_num:%03d nevt:%03d time:%.6f CPU_NUM IN WEKA RANGE [16,23]' % (cpu_num, nevt, dt_sec))
        t05 = time()
        times = (t00,) + times + (t05,)
        arrts[nevt,:] = times
        dt_evt = t05 - t05_old
        t05_old = t05
        if nevt%10>0: continue
        dt_alg = times[4] - times[3]
        dt_in  = times[4] - times[1]
        print('cpu_num:%03d nevt:%03d  times  (sec)' % (cpu_num, nevt), \
             ' random arrs: %.6f' % (times[2] - times[1]), \
             ' indeces: %.6f'     % (times[3] - times[2]), \
             ' alg: %.6f'         % (times[4] - times[3]), \
             ' inside algo: %.6f' % (times[4] - times[1]), \
             ' per event: %.6f'   % dt_evt)
...
further code is for saving results and graphics
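
Each row of arrts holds six time stamps per event (t00, t01, t02, t03, t04, t05), so the time spent in each stage is the difference between adjacent columns. The sketch below is an illustration only and not part of test-scaling-subproc.py; the function names and the stage labels are assumptions. It accepts any sequence of rows, including a numpy array.

```python
# Hypothetical labels for the five intervals between the six time stamps
# (t00, t01, t02, t03, t04, t05) stored per event in arrts.
STAGES = ('enter', 'random arrs', 'indices', 'alg', 'exit')

def stage_durations(arrts):
    """Return one list per event with the differences of adjacent time stamps."""
    return [[float(b) - float(a) for a, b in zip(row[:-1], row[1:])]
            for row in arrts]

def mean_durations(arrts):
    """Map each stage label to its mean duration (sec) over all events."""
    rows = stage_durations(arrts)
    nevents = len(rows)
    return {name: sum(r[i] for r in rows) / nevents
            for i, name in enumerate(STAGES)}

if __name__ == '__main__':
    # Two fake events with made-up time stamps, for illustration only.
    fake = [[0.0, 0.1, 0.4, 0.5, 0.9, 1.0],
            [1.0, 1.1, 1.4, 1.5, 1.9, 2.0]]
    for name, dt in mean_durations(fake).items():
        print('%-12s %.6f sec' % (name, dt))
```

The 'random arrs', 'indices', and 'alg' intervals correspond to the differences printed by the event loop (times[2]-times[1], times[3]-times[2], times[4]-times[3]).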

Results

Code Block
titleResults of the test for 1,8,16,32,56,64,128 CPU
collapsetrue
ana-4.0.59-py3 [dubrovin@sdfmilan216:~/LCLS/con-py3]$


1 CPU
======
 Performance counter stats for 'python test-scaling-subproc.py -1':

     4,522,410,200      cache-references:u                                            (62.49%)
       112,207,635      cache-misses:u            #    2.481 % of all cache refs      (62.51%)
   224,402,878,245      cycles:u                                                      (62.51%)
   428,582,543,872      instructions:u            #    1.91  insn per cycle           (62.51%)
    59,430,436,824      branches:u                                                    (62.50%)
     2,353,206,592      branch-misses:u           #    3.96% of all branches          (62.50%)
           657,277      faults:u
                 0      migrations:u
           657,277      page-faults:u
     2,169,783,808      L1-dcache-load-misses:u                                       (62.50%)
         7,173,374      L1-icache-load-misses:u                                       (62.49%)

      70.762930452 seconds time elapsed

      66.918003000 seconds user
       2.380196000 seconds sys


8 CPU
======
 Performance counter stats for 'python test-scaling-subproc.py -2':

    35,293,654,947      cache-references:u                                            (62.50%)
       675,772,563      cache-misses:u            #    1.915 % of all cache refs      (62.50%)
 1,863,835,416,629      cycles:u                                                      (62.50%)
 3,408,694,078,315      instructions:u            #    1.83  insn per cycle           (62.50%)
   470,729,321,611      branches:u                                                    (62.50%)
    18,710,029,709      branch-misses:u           #    3.97% of all branches          (62.50%)
         4,759,204      faults:u
                 0      migrations:u
         4,759,204      page-faults:u
    17,164,781,068      L1-dcache-load-misses:u                                       (62.50%)
        42,407,266      L1-icache-load-misses:u                                       (62.50%)

      82.107165073 seconds time elapsed

     600.726489000 seconds user
      28.169314000 seconds sys


16 CPU
======
 Performance counter stats for 'python test-scaling-subproc.py -8':

    71,125,012,043      cache-references:u                                            (62.50%)
     2,509,743,885      cache-misses:u            #    3.529 % of all cache refs      (62.50%)
 4,256,512,072,612      cycles:u                                                      (62.50%)
 6,815,210,853,848      instructions:u            #    1.60  insn per cycle           (62.50%)
   940,797,592,651      branches:u                                                    (62.50%)
    37,401,077,277      branch-misses:u           #    3.98% of all branches          (62.50%)
         9,874,603      faults:u
                 0      migrations:u
         9,874,603      page-faults:u
    34,764,585,133      L1-dcache-load-misses:u                                       (62.50%)
        82,908,203      L1-icache-load-misses:u                                       (62.50%)

      98.180409648 seconds time elapsed

    1370.175346000 seconds user
     121.864448000 seconds sys


32 CPU
======
 Performance counter stats for 'python test-scaling-subproc.py -13':

   140,229,421,945      cache-references:u                                            (62.50%)
     5,022,345,750      cache-misses:u            #    3.582 % of all cache refs      (62.50%)
 8,558,410,936,114      cycles:u                                                      (62.50%)
13,628,360,184,584      instructions:u            #    1.59  insn per cycle           (62.50%)
 1,881,291,550,548      branches:u                                                    (62.50%)
    74,783,808,615      branch-misses:u           #    3.98% of all branches          (62.50%)
        19,579,143      faults:u
                 0      migrations:u
        19,579,143      page-faults:u
    68,615,480,748      L1-dcache-load-misses:u                                       (62.50%)
       163,094,161      L1-icache-load-misses:u                                       (62.50%)

      99.279801084 seconds time elapsed

    2763.979749000 seconds user
     246.852789000 seconds sys


ana-4.0.59-py3 [dubrovin@sdfmilan216:~/LCLS/con-py3]$


56 CPU
======
 Performance counter stats for 'python test-scaling-subproc.py -16':

   245,664,589,385      cache-references:u                                            (62.50%)
     5,986,128,102      cache-misses:u            #    2.437 % of all cache refs      (62.50%)
13,462,198,820,573      cycles:u                                                      (62.50%)
23,847,765,747,744      instructions:u            #    1.77  insn per cycle           (62.50%)
 3,290,927,488,525      branches:u                                                    (62.50%)
   130,897,170,304      branch-misses:u           #    3.98% of all branches          (62.50%)
        35,494,247      faults:u
                 0      migrations:u
        35,494,247      page-faults:u
   119,933,873,577      L1-dcache-load-misses:u                                       (62.50%)
       288,403,921      L1-icache-load-misses:u                                       (62.50%)

     108.453630713 seconds time elapsed

    5381.177612000 seconds user
     333.903330000 seconds sys


64 CPU
======
 Performance counter stats for 'python test-scaling-subproc.py -17':

   281,639,175,978      cache-references:u                                            (62.50%)
     8,968,404,974      cache-misses:u            #    3.184 % of all cache refs      (62.50%)
16,140,364,752,053      cycles:u                                                      (62.50%)
27,256,133,511,829      instructions:u            #    1.69  insn per cycle           (62.50%)
 3,761,710,111,186      branches:u                                                    (62.50%)
   149,569,155,086      branch-misses:u           #    3.98% of all branches          (62.50%)
        39,148,442      faults:u
                 0      migrations:u
        39,148,442      page-faults:u
   137,584,278,754      L1-dcache-load-misses:u                                       (62.50%)
       330,750,296      L1-icache-load-misses:u                                       (62.50%)

     120.688547006 seconds time elapsed

    6274.688233000 seconds user
     484.406164000 seconds sys


120 CPU

=======
 Performance counter stats for 'python test-scaling-subproc.py -18':

   532,229,037,371      cache-references:u                                            (62.50%)
    14,227,944,434      cache-misses:u            #    2.673 % of all cache refs      (62.50%)
29,404,359,241,173      cycles:u                                                      (62.50%)
51,095,884,028,391      instructions:u            #    1.74  insn per cycle           (62.50%)
 7,053,547,766,317      branches:u                                                    (62.50%)
   280,479,284,507      branch-misses:u           #    3.98% of all branches          (62.50%)
        73,250,012      faults:u                                                    
                 0      migrations:u                                                
        73,250,012      page-faults:u                                               
   260,078,672,869      L1-dcache-load-misses:u                                       (62.50%)
       618,858,635      L1-icache-load-misses:u                                       (62.50%)

     119.736692035 seconds time elapsed

   11628.275939000 seconds user
     843.423292000 seconds sys

Summary

| number of CPU | cache-references | cache-misses | cycles | instructions | branches | branch-misses | faults | page-faults | L1-dcache-load-misses | L1-icache-load-misses | L1-icache Ratio N/1 | cmt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4,522,410,200 | 112,207,635 | 224,402,878,245 | 428,582,543,872 | 59,430,436,824 | 2,353,206,592 | 657,277 | 657,277 | 2,169,783,808 | 7,173,374 | 1 | |
| 8 | 35,293,654,947 | 675,772,563 | | | | 18,710,029,709 | | | 17,164,781,068 | 42,407,266 | 5.9 | |
| 16 | 71,125,012,043 | 2,509,743,885 | | | | 37,401,077,277 | | | 34,764,585,133 | 82,908,203 | 11.6 | |
| 32 | 140,229,421,945 | 5,022,345,750 | | | | 74,783,808,615 | | | 68,615,480,748 | 163,094,161 | 22.7 | |
| 56 | 245,664,589,385 | 5,986,128,102 | | | | 130,897,170,304 | | | 119,933,873,577 | 288,403,921 | 40.2 | |
| 64 | 281,639,175,978 | 8,968,404,974 | | | | 149,569,155,086 | | | 137,584,278,754 | 330,750,296 | 46.1 | |
| 120 | 532,229,037,371 | 14,227,944,434 | 29,404,359,241,173 | 51,095,884,028,391 | 7,053,547,766,317 | 280,479,284,507 | 73,250,012 | 73,250,012 | 260,078,672,869 | 618,858,635 | 86.2 | |
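
The Ratio N/1 column divides a counter measured on N CPUs by the same counter measured on 1 CPU; a ratio close to N means the counter grows in proportion to the number of processes. A minimal sketch of that arithmetic follows, using the L1-icache-load-misses values from the summary table. The function name ratio_to_single is an assumption made for illustration.

```python
# L1-icache-load-misses per number of CPUs, copied from the summary table.
L1_ICACHE_MISSES = {
    1: 7_173_374,
    8: 42_407_266,
    16: 82_908_203,
    32: 163_094_161,
    56: 288_403_921,
    64: 330_750_296,
    120: 618_858_635,
}

def ratio_to_single(counts):
    """Return {ncpu: counts[ncpu] / counts[1]} for every entry."""
    base = counts[1]
    return {ncpu: value / base for ncpu, value in counts.items()}

if __name__ == '__main__':
    for ncpu, ratio in ratio_to_single(L1_ICACHE_MISSES).items():
        # ratio / ncpu below 1 means the counter grows more slowly
        # than the number of CPUs.
        print('%4d CPU  ratio N/1: %6.1f  ratio per CPU: %.2f'
              % (ncpu, ratio, ratio / ncpu))
```

The same function applies to any other column of the table once its values are placed in a dict keyed by the number of CPUs.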













2024-02-07 Test of milano216 host with command perf

Description

Running perf with mpirun on a single CPU and on 80 CPUs:

perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 1 python  Detector/examples/test-scaling-mpi.py

perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 80 python  Detector/examples/test-scaling-mpi.py

Code Block
titleCode of the time_consuming_algorithm
collapsetrue
import numpy as np
from time import time

def random_standard(shape=(40,60), mu=200, sigma=25, dtype=np.float64):
    a = mu + sigma*np.random.standard_normal(shape)
    return np.require(a, dtype)

def random_arrays(sh2d = (8*512,1024), dtype=np.float64):
    sh3d = (3,) + sh2d
    return random_standard(shape=sh2d, mu=10, sigma=2, dtype=dtype),\
           random_standard(shape=sh3d, mu=20, sigma=3, dtype=dtype)

def time_consuming_algorithm():
    a, b = random_arrays()
    gr1 = a>=11
    gr2 = (a>9) & (a<11)
    gr3 = a<=9
    t0_sec = time()
    a[gr1] -= b[0, gr1]
    a[gr2] -= b[1, gr2]
    a[gr3] -= b[2, gr3]
    return time() - t0_sec
Code Block
titleCode of the event loop in Detector/examples/test-scaling-mpi.py
collapsetrue
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

hostname = get_hostname()
cpu_num = psutil.Process().cpu_num()
print('rank:%02d cpu_num:%03d size:%02d' % (rank, cpu_num, size))

ranks = (0, 10, 20, 30, 40, 50, 60, 70)
SAVE_FIGS = True
SHOW_FIGS = False
nevents = 100
arrts = np.zeros((nevents, size), dtype=np.float64)

for nevt in range(nevents):
    dt_sec = time_consuming_algorithm()
    arrts[nevt,rank] = dt_sec  # dt_sec = time()-t0_sec
    cpu_num = psutil.Process().cpu_num()
    if cpu_num >=16 and cpu_num <=23:
        print('rank:%02d cpu_num:%03d nevt:%03d time:%.6f CPU_NUM IN WEKA RANGE [16,23]' % (rank, cpu_num, nevt, dt_sec))
    if nevt%10>0: continue
    print('rank:%02d cpu_num:%03d nevt:%03d time:%.6f' % (rank, cpu_num, nevt, dt_sec))
...
some graphics for array arrts

Results

Code Block
titleResults for perf with mpirun on a single and 80 CPUs
collapsetrue
ana-4.0.59-py3 [dubrovin@sdfmilan216:~/LCLS/con-py3]$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 1 python  Detector/examples/test-scaling-mpi.py 
...
Performance counter stats for 'mpirun -n 1 python Detector/examples/test-scaling-mpi.py':

     4,448,830,552      cache-references:u                                            (50.00%)
        90,374,312      cache-misses:u            #    2.031 % of all cache refs      (50.00%)
   222,814,516,280      cycles:u                                                      (50.02%)
   426,700,282,993      instructions:u            #    1.92  insn per cycle           (50.01%)
    58,876,394,584      branches:u                                                    (50.01%)
     2,343,687,188      branch-misses:u           #    3.98% of all branches          (50.01%)
           635,183      faults:u                                                    
                 0      migrations:u                                                
           635,183      page-faults:u                                               
     2,158,358,417      L1-dcache-load-misses:u                                       (50.00%)
         5,694,036      L1-icache-load-misses:u                                       (49.99%)
         4,282,821      dTLB-load-misses:u                                            (49.99%)
           890,671      iTLB-load-misses:u                                            (50.00%)

      73.297275789 seconds time elapsed

      69.795728000 seconds user
       2.318007000 seconds sys

ana-4.0.59-py3 [dubrovin@sdfmilan216:~/LCLS/con-py3]$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,branch-misses,faults,migrations,page-faults,L1-dcache-load-misses,L1-icache-load-misses,dTLB-load-misses,iTLB-load-misses     mpirun -n 80 python  Detector/examples/test-scaling-mpi.py
...
 Performance counter stats for 'mpirun -n 80 python Detector/examples/test-scaling-mpi.py':

   349,526,509,383      cache-references:u                                            (50.01%)
     5,932,480,814      cache-misses:u            #    1.697 % of all cache refs      (50.00%)
18,768,444,974,036      cycles:u                                                      (50.00%)
33,983,153,714,284      instructions:u            #    1.81  insn per cycle           (49.99%)
 4,684,730,635,234      branches:u                                                    (49.99%)
   186,649,297,019      branch-misses:u           #    3.98% of all branches          (50.00%)
        52,121,421      faults:u                                                    
                 0      migrations:u                                                
        52,121,421      page-faults:u                                               
   171,500,392,922      L1-dcache-load-misses:u                                       (50.00%)
       267,672,856      L1-icache-load-misses:u                                       (50.00%)
       339,145,247      dTLB-load-misses:u                                            (50.01%)
        69,780,394      iTLB-load-misses:u                                            (50.01%)

      92.952500273 seconds time elapsed

    6501.353593000 seconds user
     410.844719000 seconds sys

Summary

| number of mpi cores | cache-references | cache-misses | cycles | instructions | branches | branch-misses | faults | page-faults | L1-dcache-load-misses | L1-icache-load-misses | dTLB-load-misses | iTLB-load-misses | cmt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4,448,830,552 | 90,374,312 | 222,814,516,280 | 426,700,282,993 | 58,876,394,584 | 2,343,687,188 | 635,183 | 635,183 | 2,158,358,417 | 5,694,036 | 4,282,821 | 890,671 | |
| 80 | 349,526,509,383 | 5,932,480,814 | 18,768,444,974,036 | 33,983,153,714,284 | 4,684,730,635,234 | 186,649,297,019 | 52,121,421 | 52,121,421 | 171,500,392,922 | 267,672,856 | 339,145,247 | 69,780,394 | |
| Ratio (80)/(1) | 79.4 | 65.7 | 84.1 | 79.6 | 79.5 | 79.7 | 82.0 | 82.0 | 79.3 | 47.0 | 79.2 | 78.4 | |
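
perf prints derived annotations next to some raw counters ("insn per cycle", "% of all cache refs", "% of all branches"). Recomputing them from the counts is a quick consistency check when values are copied by hand into a table. The sketch below does this for the 1-rank and 80-rank runs; the dict layout and the function name derived_metrics are assumptions made for illustration.

```python
# Raw counters copied from the perf reports for 1 and 80 MPI ranks.
COUNTS = {
    1: {
        'cache-references': 4_448_830_552,
        'cache-misses': 90_374_312,
        'cycles': 222_814_516_280,
        'instructions': 426_700_282_993,
        'branches': 58_876_394_584,
        'branch-misses': 2_343_687_188,
    },
    80: {
        'cache-references': 349_526_509_383,
        'cache-misses': 5_932_480_814,
        'cycles': 18_768_444_974_036,
        'instructions': 33_983_153_714_284,
        'branches': 4_684_730_635_234,
        'branch-misses': 186_649_297_019,
    },
}

def derived_metrics(c):
    """Recompute the annotations perf prints next to the raw counters."""
    return {
        'insn_per_cycle': c['instructions'] / c['cycles'],
        'cache_miss_pct': 100.0 * c['cache-misses'] / c['cache-references'],
        'branch_miss_pct': 100.0 * c['branch-misses'] / c['branches'],
    }

if __name__ == '__main__':
    for nranks, counters in COUNTS.items():
        m = derived_metrics(counters)
        print('%2d ranks: %.2f insn per cycle, %.3f %% cache misses, '
              '%.2f %% branch misses'
              % (nranks, m['insn_per_cycle'], m['cache_miss_pct'],
                 m['branch_miss_pct']))
```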


2024-02-09 Test of milano216 host with command perf

Description

Use the same command with a different list of counters, for example:

perf stat -e stalled-cycles-backend,stalled-cycles-frontend,ls_l1_d_tlb_miss.all,l1_dtlb_misses,l1_data_cache_fills_all,bp_l1_tlb_miss_l2_tlb_miss.if2m,bp_l1_tlb_miss_l2_tlb_miss,l2_dtlb_misses,l2_itlb_misses  python test-scaling-subproc.py -8

The perf output is converted to a dict, and the results are presented in a table.
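
The conversion code itself is not shown on this page. As a rough sketch of how the text report of perf stat could be turned into a dict, the example below uses two regular expressions; the function name parse_perf_stat, the patterns, and the key names are assumptions, and the patterns are written only against the report layout shown in the Results sections above.

```python
import re

# Counter lines look like
#   "     4,448,830,552      cache-references:u      (50.00%)"
# The count may contain thousands separators; an optional modifier
# such as ":u" after the event name is dropped.
_COUNTER_RE = re.compile(r'^\s*(\d[\d,]*)\s+([A-Za-z_][\w.\-]*)(?::\w+)?(?=\s|$)')

# Timing lines look like "      73.297275789 seconds time elapsed".
_SECONDS_RE = re.compile(r'^\s*(\d+\.\d+)\s+seconds\s+(time elapsed|user|sys)\b')

def parse_perf_stat(text):
    """Return {event name: int count} plus {'seconds ...': float} entries."""
    result = {}
    for line in text.splitlines():
        m = _COUNTER_RE.match(line)
        if m:
            result[m.group(2)] = int(m.group(1).replace(',', ''))
            continue
        m = _SECONDS_RE.match(line)
        if m:
            result['seconds ' + m.group(2)] = float(m.group(1))
    return result
```

Lines that match neither pattern, such as the "Performance counter stats for ..." header, are ignored.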

Summary

(*) The number of CPUs excludes those used by the weka FS.

| number of CPU | stalled-cycles-backend | ←Ratio N/1 | stalled-cycles-frontend | ←Ratio N/1 | ls_l1_d_tlb_miss.all | ←Ratio N/1 | l1_dtlb_misses | ←Ratio N/1 | l1_data_cache_fills_all | ←Ratio N/1 | bp_l1_tlb_miss_l2_tlb_miss.if2m | ←Ratio N/1 | bp_l1_tlb_miss_l2_tlb_miss | ←Ratio N/1 | l2_dtlb_misses | ←Ratio N/1 | l2_itlb_misses | ←Ratio N/1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 143,828614 | 1 | 230,987724 | 1 | 33,227437 | 1 | 32,845193 | 1 | 2179,469714 | 1 | 3,701 | 1 | 769,309 | 1 | 4,833384 | 1 | 719,026 | 1 |
| 8 | 2105,881833 | 15 | 3421,108359 | 15 | 172,779030 | 5.2 | 173,508212 | 5.3 | 18216,564874 | 8.3 | 25,606 | 6.9 | 6124,897 | 7.9 | 31,719300 | 6.5 | 5591,821 | 7.8 |
| 16 | 8796,313234 | 61 | 8018,691890 | 35 | 327,892753 | 9.9 | 326,337183 | 9.9 | 34551,341060 | 15.8 | 55,331 | 14.8 | 12467,976 | 16 | 68,227221 | 14 | 10605,352 | 14.7 |
| 24* | 10413,149941 | 72 | 10519,490870 | 46 | 491,673248 | 14.8 | 490,566093 | 14.9 | 51539,384297 | 23.6 | 78,433 | 21 | 17889,621 | 23 | 96,922469 | 20 | 15177,116 | 21.1 |
| 32 | 17251,055297 | 120 | 13858,554955 | 60 | 671,047247 | 20.2 | 666,230997 | 20.3 | 68736,842168 | 31.5 | 105,874 | 29 | 23936,978 | 31 | 135,322250 | 28 | 21599,940 | 30.0 |
| 56* | 17892,504080 | 124 | 24120,493158 | 104 | 1136,778538 | 34.2 | 1135,448325 | 34.6 | 120696,775952 | 55.3 | 178,082 | 48 | 42679,843 | 55 | 234,498254 | 48 | 38164,171 | 53 |
| 64 | 27304,844238 | 190 | 27697,522017 | 120 | 1258,999729 | 37.9 | 1258,031354 | 38.3 | 141469,109046 | 64.9 | 201,330 | 54 | 50957,218 | 66 | 258,609632 | 53 | 43825,042 | 60.9 |
| 120* | 45388,735746 | 316 | 46279,264661 | 200 | 2382,065820 | 71.6 | 2376,507106 | 73.3 | 264016,453328 | 121 | 375,699 | 102 | 93410,817 | 121 | 488,308155 | 101 | 78261,952 | 109 |


The perf command, with a 5 sec accumulation time, is submitted in a subprocess repeatedly in a loop; the response parameters are parsed into a dict and stored in a list. In total, 100 loops are executed over 500 sec. After loops 10 and 50 (i.e. twice), the host is loaded by the "simulated time consuming job" on a different number of CPUs, which runs for about 2 min, or ~20 loops. At the end of the loop, plots and a table of parameters vs time are generated.
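
The monitoring script is not included on this page. The sketch below shows one way such a sampling loop could look, assuming a Linux host with perf installed and permission for system-wide counting. The exact perf invocation used by the author is not given, so the `-a` option combined with `sleep` as the timed workload is an assumption, as are the function names perf_sample and monitor.

```python
import subprocess

def perf_sample(events, seconds=5):
    """Run one system-wide 'perf stat' sample and return its text report.

    Assumption: 'perf stat -a -e <events> sleep <seconds>' is used to count
    on all CPUs for the given time; perf writes its report to stderr.
    """
    cmd = ['perf', 'stat', '-a', '-e', ','.join(events), 'sleep', str(seconds)]
    return subprocess.run(cmd, capture_output=True, text=True).stderr

def monitor(events, nloops=100, seconds=5, sample=perf_sample):
    """Collect nloops consecutive samples as (loop index, report) pairs.

    The sample argument can be replaced by a stand-in function,
    e.g. for a dry run on a host without perf.
    """
    records = []
    for loop in range(nloops):
        records.append((loop, sample(events, seconds)))
    return records
```

With nloops=100 and seconds=5 the loop takes about 500 sec, matching the description above; each stored report can then be parsed into a dict before plotting.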

Results

References