First Iteration

Submitted the following script on s3df multiple times (also an identical script with constraint "CrowdStrike_off"):

#!/bin/bash

#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

mpirun python mfxl1028222.py

Contents of mfxl1028222.py:

import time
startup_begin = time.time()
from psana import *
import sys

ds = MPIDataSource('exp=mfxl1028222:run=90:smd')
det = Detector('epix10k2M')
ngood=0
for nevt,evt in enumerate(ds.events()):
    calib = det.calib(evt)
    if calib is not None:
        ngood+=1
    if nevt==0:
        startup_end = time.time()
        start = time.time()
tottime = time.time()-start
print('processed',ngood,tottime,tottime/(ngood-1)) # we ignored first event so -1
print('startup',startup_end-startup_begin)
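
The timing pattern in the script above (startup time up to the first event, then a per-event time that excludes the first event) can be sketched without psana. This is a minimal sketch, not the script that was run: the helper name time_events and its arguments are hypothetical, the startup clock starts at the function call rather than before the imports, and the per-event time divides by the number of events after the first rather than by ngood-1. It also guards against runs with zero or one event, where the original would fail.

```python
import time

def time_events(events, process):
    """Hypothetical helper: return (n_good, startup_s, per_event_s).

    events  : any iterable of events
    process : callable returning None for events to be ignored
    """
    t0 = time.time()
    n_events = 0
    n_good = 0
    startup = None
    start = None
    for n_events, evt in enumerate(events, start=1):
        if process(evt) is not None:
            n_good += 1
        if n_events == 1:
            # the first event includes one-time setup cost, so time it separately
            startup = time.time() - t0
            start = time.time()
    if start is None:
        # no events at all
        return 0, None, None
    n_later = n_events - 1
    per_event = (time.time() - start) / n_later if n_later > 0 else None
    return n_good, startup, per_event

n_good, startup, per_event = time_events(range(5), lambda e: e if e % 2 == 0 else None)
print('processed', n_good, 'startup', startup, 'per event', per_event)
```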

Ran this to get job runtimes:

sacct -j 42602286,42602287,42602516,42602519,42602573,42602576,42602682,42603207,42696097,42696120,42696193,42696194,42696537,42696539,42696567,42696568,42696605,42696606,42696667,42696670,42696744,42696745,42696794,42696797,42714615,42714616,42714723,42714724,42714996,42714998,42715300,42715302,42715657,42715658,42724310,42724317,42725442,42725447,42730341,42730353,42731038,42731045,42738739,42738750,42739483,42739491,42741266,42741272 --format=elapsedraw,constraint,reqcpus,nodelist,start | grep Crowd

See this output:

       120     CrowdStrike_off      120      sdfrome047 2024-03-22T10:20:21 
       127      CrowdStrike_on      120      sdfrome027 2024-03-22T10:20:23 
       121      CrowdStrike_on      120      sdfrome006 2024-03-22T10:33:34 
       114     CrowdStrike_off      120      sdfrome079 2024-03-22T10:33:36 
       122     CrowdStrike_off      120      sdfrome039 2024-03-22T10:37:38 
        92      CrowdStrike_on      120      sdfrome006 2024-03-22T10:37:39 
       131      CrowdStrike_on      120      sdfrome022 2024-03-22T10:43:11 
       125     CrowdStrike_off      120      sdfrome080 2024-03-22T10:50:08 
       110     CrowdStrike_off      120      sdfrome120 2024-03-25T17:43:20 
       139      CrowdStrike_on      120      sdfrome023 2024-03-25T17:43:29 
        89      CrowdStrike_on      120      sdfrome023 2024-03-25T17:47:06 
       112     CrowdStrike_off      120      sdfrome109 2024-03-25T17:47:08 
       108     CrowdStrike_off      120      sdfrome111 2024-03-25T17:52:35 
       137      CrowdStrike_on      120      sdfrome003 2024-03-25T17:52:40 
        88      CrowdStrike_on      120      sdfrome003 2024-03-25T17:55:22 
        69     CrowdStrike_off      120      sdfrome111 2024-03-25T17:55:30 
        67     CrowdStrike_off      120      sdfrome111 2024-03-25T17:57:47 
        79      CrowdStrike_on      120      sdfrome003 2024-03-25T17:57:47 
        75      CrowdStrike_on      120      sdfrome003 2024-03-25T17:59:37 
        68     CrowdStrike_off      120      sdfrome111 2024-03-25T17:59:39 
       127     CrowdStrike_off      120      sdfrome115 2024-03-25T18:03:19 
       125      CrowdStrike_on      120      sdfrome004 2024-03-25T18:03:22 
        82      CrowdStrike_on      120      sdfrome004 2024-03-25T18:07:34 
       128     CrowdStrike_off      120      sdfrome115 2024-03-25T18:07:34 
       118     CrowdStrike_off      120      sdfrome119 2024-03-26T07:17:31 
       122      CrowdStrike_on      120      sdfrome028 2024-03-26T07:17:38 
       133      CrowdStrike_on      120      sdfrome003 2024-03-26T07:26:39 
        85     CrowdStrike_off      120      sdfrome119 2024-03-26T07:26:39 
       113     CrowdStrike_off      120      sdfrome075 2024-03-26T07:46:04 
       128      CrowdStrike_on      120      sdfrome004 2024-03-26T07:46:06 
       128      CrowdStrike_on      120      sdfrome010 2024-03-26T08:06:06 
       113     CrowdStrike_off      120      sdfrome116 2024-03-26T08:06:08 
       124     CrowdStrike_off      120      sdfrome121 2024-03-26T08:38:10 
       138      CrowdStrike_on      120      sdfrome024 2024-03-26T08:38:12 
       117     CrowdStrike_off      120      sdfrome085 2024-03-26T11:00:23 
       125      CrowdStrike_on      120      sdfrome011 2024-03-26T11:00:25 
       124      CrowdStrike_on      120      sdfrome012 2024-03-26T11:06:10 
       146     CrowdStrike_off      120      sdfrome088 2024-03-26T11:06:12 
       116     CrowdStrike_off      120      sdfrome091 2024-03-26T11:31:02 
       121      CrowdStrike_on      120      sdfrome015 2024-03-26T11:31:05 
        74      CrowdStrike_on      120      sdfrome015 2024-03-26T11:34:28 
        79     CrowdStrike_off      120      sdfrome091 2024-03-26T11:34:29 
       121     CrowdStrike_off      120      sdfrome098 2024-03-26T12:15:02 
       120      CrowdStrike_on      120      sdfrome016 2024-03-26T12:15:04 
        84      CrowdStrike_on      120      sdfrome016 2024-03-26T12:19:06 
        82     CrowdStrike_off      120      sdfrome098 2024-03-26T12:19:08 
       129     CrowdStrike_off      120      sdfrome100 2024-03-26T12:28:32 
       144      CrowdStrike_on      120      sdfrome025 2024-03-26T12:28:34 

Ran this analysis script on the above output (saved as junk.out):

import numpy as np
import matplotlib.pyplot as plt

nodes = []     # nodes already seen
ontimes = []   # elapsed seconds for CrowdStrike_on jobs
offtimes = []  # elapsed seconds for CrowdStrike_off jobs
with open('junk.out','r') as f:
    for line in f:
        fields=line.split()
        node = fields[3]
        # keep only the first run on each node; later runs may read cached data
        if node in nodes:
            print('skipping duplicate node run to avoid caching issues:',node)
            continue
        nodes.append(node)
        on = 'on' in fields[1]
        elapsed = int(fields[0])
        if on:
            ontimes.append(elapsed)
        else:
            offtimes.append(elapsed)

mean = []
err_on_mean = []
for times in [offtimes,ontimes]:
    print(times)
    mean.append(np.mean(times))
    # np.std defaults to the population standard deviation (ddof=0)
    err_on_mean.append(np.std(times)/np.sqrt(len(times)))
# errors on the two means added in quadrature
diff_err = np.sqrt(err_on_mean[0]**2+err_on_mean[1]**2)
diff = mean[1]-mean[0]
print('Fractional change:',diff/mean[0],'+-',diff_err/mean[0])
plt.hist([ontimes,offtimes])
plt.show()

See the following output:

(ana-4.0.59-py3) [cpo@sdfiana002 problems]$ python junk3.py
skipping duplicate node run to avoid caching issues: sdfrome006
skipping duplicate node run to avoid caching issues: sdfrome023
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome004
skipping duplicate node run to avoid caching issues: sdfrome115
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome119
skipping duplicate node run to avoid caching issues: sdfrome004
skipping duplicate node run to avoid caching issues: sdfrome015
skipping duplicate node run to avoid caching issues: sdfrome091
skipping duplicate node run to avoid caching issues: sdfrome016
skipping duplicate node run to avoid caching issues: sdfrome098
[120, 114, 122, 125, 110, 112, 108, 127, 118, 113, 113, 124, 117, 146, 116, 121]
[127, 121, 131, 139, 137, 125, 122, 128, 138, 125, 124, 121, 120]
Fractional change: 0.07062716926305589 +- 0.023752642242268553

With the following plot:

This suggests we see a (7.1+-2.4)% performance penalty from CrowdStrike.
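
As a cross-check, the fractional change and its uncertainty can be recomputed from the two runtime lists printed above using only the Python standard library. This is a sketch: the function name frac_change is hypothetical. It uses the population standard deviation (as np.std does by default) and, for comparison, the sample standard deviation, which is the more common choice for a standard error and gives a slightly larger uncertainty.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# runtimes in seconds, copied from the analysis output above
offtimes = [120, 114, 122, 125, 110, 112, 108, 127, 118, 113, 113, 124, 117, 146, 116, 121]
ontimes = [127, 121, 131, 139, 137, 125, 122, 128, 138, 125, 124, 121, 120]

def frac_change(off, on, spread=pstdev):
    """Hypothetical helper: fractional change of mean(on) relative to mean(off)."""
    err_off = spread(off) / sqrt(len(off))   # error on the mean, CrowdStrike_off
    err_on = spread(on) / sqrt(len(on))      # error on the mean, CrowdStrike_on
    diff = mean(on) - mean(off)
    diff_err = sqrt(err_off**2 + err_on**2)  # added in quadrature
    return diff / mean(off), diff_err / mean(off)

frac, err = frac_change(offtimes, ontimes)
frac_s, err_s = frac_change(offtimes, ontimes, spread=stdev)
print(f'population std: ({100*frac:.1f} +- {100*err:.1f})%')
print(f'sample std:     ({100*frac_s:.1f} +- {100*err_s:.1f})%')
print(f'significance:   {frac/err:.1f} sigma')
```

Note that the quoted uncertainty treats the CrowdStrike_off mean in the denominator as exact.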

Second Iteration

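Added a filesystem-cache-flush command from Yee to the batch script, so that repeat runs on the same node need not be discarded, to try to increase the statistics. The jobs were submitted with this loop: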

for i in $(seq 1 100);
do
    sbatch junk.sh
    sbatch junk1.sh
done

Batch script for the CrowdStrike_on jobs:

#!/bin/bash

#SBATCH --dependency=singleton
#SBATCH --job-name=cson
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

echo "***cson " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py
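
If the cache flush works as intended, the duplicate-node filter from the first iteration should no longer be needed. A minimal sketch of the analysis without that filter follows. It assumes the same sacct column layout as in the first iteration; the function name summarize is hypothetical. The four sample lines are copied from the first-iteration output only to show the input format and are not second-iteration results.

```python
from math import sqrt
from statistics import mean, pstdev

def summarize(lines):
    """Hypothetical helper: return (fractional change, its error, times by constraint)."""
    times = {'on': [], 'off': []}
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        key = 'on' if fields[1].endswith('_on') else 'off'
        times[key].append(int(fields[0]))  # elapsedraw, in seconds
    off, on = times['off'], times['on']
    err_off = pstdev(off) / sqrt(len(off))
    err_on = pstdev(on) / sqrt(len(on))
    diff = mean(on) - mean(off)
    diff_err = sqrt(err_off**2 + err_on**2)
    return diff / mean(off), diff_err / mean(off), times

# format illustration only (lines taken from the first-iteration sacct output)
sample = [
    '120     CrowdStrike_off      120      sdfrome047 2024-03-22T10:20:21',
    '127      CrowdStrike_on      120      sdfrome027 2024-03-22T10:20:23',
    '121      CrowdStrike_on      120      sdfrome006 2024-03-22T10:33:34',
    '114     CrowdStrike_off      120      sdfrome079 2024-03-22T10:33:36',
]
frac, err, times = summarize(sample)
print('Fractional change:', frac, '+-', err)
```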
