These scripts and log files can be found in ~cpo/problems/crowdstrike/.

First Iteration

Submitted the following script on s3df multiple times (along with an identical script with constraint "CrowdStrike_off"):

...

Second Iteration

Added a filesystem-cache-flush command from Yee and increased the statistics: 100 jobs of each type (CrowdStrike on/off). Only one crowdstrike_on job and one crowdstrike_off job ran at a time, to avoid leaning too heavily on the filesystem. Unfortunately this means all the "on" jobs ran on sdfrome007 (100 jobs) while the "off" jobs ran on sdfrome039 (34 jobs), sdfrome037 (25 jobs), sdfrome042 (39 jobs), and sdfrome087 (2 jobs). The first "on" job time was 133 seconds (slurm job id 42810224) and the first "off" job time was 119 seconds. This looks consistent with the distribution of all job times (see plot below), suggesting that data caching wasn't a big effect once Yee's cache-flush command was added (previously some jobs ran as quickly as 90 seconds).
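
The cache-flush tool itself isn't shown on this page. A common way to drop the Linux page cache, which is presumably what free-pagecache does (an assumption, not the tool's actual source), is:

Code Block
# Hypothetical stand-in for /sdf/group/scs/tools/free-pagecache (assumption):
# flush dirty pages to disk, then ask the kernel to drop the page cache.
sync
echo 1 > /proc/sys/vm/drop_caches   # requires root; 3 also drops dentries/inodes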

...

Code Block
#!/bin/bash

#SBATCH --dependency=singleton
#SBATCH --job-name=cson
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

echo "***cson " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py
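
The submission commands aren't recorded here; below is a minimal sketch of how the 100 job pairs could have been queued (the script names cson.sh/csoff.sh are hypothetical). The --dependency=singleton flag makes slurm hold each job until any earlier job with the same name and user has finished, which is what serializes the "on" and "off" streams:

Code Block
#!/bin/bash
# Hypothetical submission loop (script names are assumptions, not from this page).
# --dependency=singleton inside each batch script ensures at most one "on" and
# one "off" job run at any time.
for i in $(seq 1 100); do
    sbatch cson.sh    # the CrowdStrike_on script above
    sbatch csoff.sh   # identical except --job-name=csoff and --constraint=CrowdStrike_off
done

The mfxl1028222.py script timed by each job: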

Code Block
import time
startup_begin = time.time()
from psana import *
import sys

ds = MPIDataSource('exp=mfxl1028222:run=90:smd')
det = Detector('epix10k2M')
ngood=0
for nevt,evt in enumerate(ds.events()):
    calib = det.calib(evt)
    if calib is not None:
        ngood+=1
    if nevt==0:
        startup_end = time.time()
        start = time.time()
tottime = time.time()-start
#print('processed',ngood,tottime,tottime/(ngood-1)) # we ignored first event so -1
#print('startup',startup_end-startup_begin)
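
The script below parses each job's log to collect the wall time and node name. Based on the echo in the batch script and the output format of bash's time builtin, each log is assumed to contain lines like these (illustrative values, not a real log):

Code Block
***cson sdfrome007
real    2m13.042s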
Code Block
import glob
logs = glob.glob('iter2/*.log')
logs.sort() # put them in time order
nodes = []
ontimes = []
offtimes = []
onnodes = []
offnodes = []
#print('***',logs)

def nodecount(nodelist):
    uniquenodes = set(nodelist)
    for n in uniquenodes:
        print(n,nodelist.count(n))

for log in logs:
    f = open(log,'r')
    on = False
    for line in f:
        if '***' in line:
            if 'cson' in line:
                on=True
            node = line.split()[1]
        if 'real' in line:
            timestr = line.split()[1]
            hours_minutes = timestr.split('m')
            minutes = float(hours_minutes[0])
            seconds = float(hours_minutes[1][:-1])
            time = minutes*60+seconds
    #if node in nodes:
    #    print('skipping duplicate node',node)
    #    continue
    nodes.append(node)
    if on:
        ontimes.append(time)
        onnodes.append(node)
    else:
        offtimes.append(time)
        offnodes.append(node)
import numpy as np
mean = []
err_on_mean = []
for times in [offtimes,ontimes]:
    #print(times)
    mean.append(np.mean(times))
    err_on_mean.append(np.std(times)/np.sqrt(len(times)))
diff_err = np.sqrt(err_on_mean[0]**2+err_on_mean[1]**2)
diff = mean[1]-mean[0]
print('*** offnodes job count:')
nodecount(offnodes)
print('*** onnodes job count:')
nodecount(onnodes)
print('Fractional change:',diff/mean[0],'+-',diff_err/mean[0])
import matplotlib.pyplot as plt
plt.hist([ontimes,offtimes])
plt.show()

Output:

*** offnodes job count:
sdfrome039 34
sdfrome087 2
sdfrome037 25
sdfrome042 39
*** onnodes job count:
sdfrome007 100
Fractional change: 0.15766088705149858 +- 0.01466574691536575

Image Added
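
For reference, here is the statistics calculation from the script above as a worked example (the numbers are illustrative, not measured values): the error on each mean is std/sqrt(N), the fractional change is (mean_on - mean_off)/mean_off, and the two errors are combined in quadrature.

Code Block
import numpy as np
# Illustrative numbers only, not results from the measurements above.
mean_off, err_off = 120.0, 1.5   # mean "off" job time (s) and error on the mean
mean_on,  err_on  = 139.0, 1.6   # mean "on" job time (s) and error on the mean
diff = mean_on - mean_off
diff_err = np.sqrt(err_on**2 + err_off**2)   # errors combined in quadrature
print('Fractional change:', diff/mean_off, '+-', diff_err/mean_off)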

Third Iteration

*** offnodes job count:
sdfrome035 14
sdfrome114 35
sdfrome087 6
sdfrome042 42
sdfrome073 1
sdfrome036 2
*** onnodes job count:
sdfrome019 34
sdfrome004 2
sdfrome021 64
Fractional change: 0.2588939230105063 +- 0.016541549294602324

Image Added

Fourth Iteration

*** offnodes job count:
sdfrome042 48
sdfrome043 14
sdfrome111 1
sdfrome039 27
sdfrome086 10
*** onnodes job count:
sdfrome016 100
Fractional change: 0.2359417044882193 +- 0.015870310667490246

Image Added

Update 2024-09-15


We repeated the test on the roma partition, 105 iterations each with the constraints CrowdStrike_on/CrowdStrike_off alternating. The test was performed during a period of low utilization of the roma partition, with no competing network or storage contention.

Measured runtime for psana analysis of mfxl1028222 run=29:smd on an exclusive node with 120 cores.

Note: the previous measurements were done with run=90:smd. We chose run=29:smd because it has more events and therefore takes longer, minimizing effects related to job startup.
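
The rerun script isn't shown; presumably the only change from the psana script above is the run number (an assumption):

Code Block
# Assumed change for the 2024-09-15 rerun: same script as above, pointed at run 29.
ds = MPIDataSource('exp=mfxl1028222:run=29:smd')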
Image Added

Fractional change: 0.24461288024797354 +- 0.001079561972505891