These scripts and log files can be found in ~cpo/problems/crowdstrike/.
First Iteration
Submitted the following script on S3DF multiple times (also an identical script with the constraint "CrowdStrike_off"):
...
This suggests a (7.1 ± 2.3)% performance penalty from CrowdStrike.
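The quoted penalty, like the "Fractional change" numbers in the later iterations, is the difference of mean runtimes divided by the "off" mean, with the standard errors on the two means added in quadrature. A minimal sketch of that calculation (the function name and the sample timings are illustrative, not from the measurement):

```python
import numpy as np

def fractional_penalty(off_times, on_times):
    """Fractional slowdown of 'on' vs. 'off' runs.

    Uncertainty on each mean is std/sqrt(N); the two uncertainties
    are added in quadrature before dividing by the 'off' mean.
    """
    off = np.asarray(off_times, dtype=float)
    on = np.asarray(on_times, dtype=float)
    mean_off, mean_on = off.mean(), on.mean()
    err_off = off.std() / np.sqrt(len(off))
    err_on = on.std() / np.sqrt(len(on))
    diff_err = np.hypot(err_off, err_on)  # quadrature sum
    return (mean_on - mean_off) / mean_off, diff_err / mean_off

# Illustrative (made-up) timings in seconds:
frac, err = fractional_penalty([119.0, 121.0, 120.0], [128.0, 130.0, 129.0])
```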
Second Iteration
Caching
Added a filesystem-cache-flush command from Yee to eliminate page-cache effects, and ran more jobs to improve the statistics:
Code Block:
for i in $(seq 1 100); do
    sbatch junk.sh
    sbatch junk1.sh
done
...
Ran 100 jobs of each type (CrowdStrike on/off), with only one "on" job and one "off" job running at a time to avoid leaning too heavily on the filesystem. Unfortunately this means all the "on" jobs ran on sdfrome007 (100 jobs), while the "off" jobs ran on sdfrome039 (34 jobs), sdfrome037 (25 jobs), sdfrome042 (39 jobs), and sdfrome087 (2 jobs). The first "on" job time was 133 seconds (slurm job id 42810224) and the first "off" job time was 119 seconds. This looks pretty consistent with the distribution of all job times (see plot below), suggesting that data caching wasn't a big effect after adding Yee's cache-flush command (previously some jobs ran as quickly as 90 seconds).
Code and Results
Code Block:
for i in $(seq 1 100); do
    sbatch junk.sh
    sbatch junk1.sh
done
Code Block:
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=cson
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21
echo "***cson " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py
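Only junk.sh (the CrowdStrike_on script) is shown above; junk1.sh is presumably identical apart from the job name, the echoed marker, and the constraint. A sketch, assuming nothing else differs (the "csoff" name is a guess, not taken from the actual script):

```shell
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=csoff
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_off
#SBATCH --account=lcls:prjdat21
echo "***csoff " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py
```

`--dependency=singleton` holds each submission until no other job with the same name (from the same user) is running, so the 100 queued copies of each script execute one at a time, giving one "on" stream and one "off" stream.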
Code Block:
import time
startup_begin = time.time()
from psana import *
import sys
ds = MPIDataSource('exp=mfxl1028222:run=90:smd')
det = Detector('epix10k2M')
ngood=0
for nevt,evt in enumerate(ds.events()):
    calib = det.calib(evt)
    if calib is not None:
        ngood+=1
    if nevt==0:
        startup_end = time.time()
        start = time.time()
tottime = time.time()-start
#print('processed',ngood,tottime,tottime/(ngood-1)) # we ignored the first event, so -1
#print('startup',startup_end-startup_begin)
Code Block:
import glob
logs = glob.glob('iter2/*.log')
logs.sort() # put them in time order
nodes = []
ontimes = []
offtimes = []
onnodes = []
offnodes = []
#print('***',logs)

def nodecount(nodelist):
    uniquenodes = set(nodelist)
    for n in uniquenodes:
        print(n,nodelist.count(n))

for log in logs:
    f = open(log,'r')
    on = False
    for line in f:
        if '***' in line:
            if 'cson' in line:
                on=True
            node = line.split()[1]
        if 'real' in line:
            timestr = line.split()[1]
            minutes_seconds = timestr.split('m')  # e.g. "2m13.456s"
            minutes = float(minutes_seconds[0])
            seconds = float(minutes_seconds[1][:-1])  # strip trailing "s"
            time = minutes*60+seconds
            #if node in nodes:
            #    print('skipping duplicate node',node)
            #    continue
            nodes.append(node)
            if on:
                ontimes.append(time)
                onnodes.append(node)
            else:
                offtimes.append(time)
                offnodes.append(node)

import numpy as np
mean = []
err_on_mean = []
for times in [offtimes,ontimes]:
    #print(times)
    mean.append(np.mean(times))
    err_on_mean.append(np.std(times)/np.sqrt(len(times)))
diff_err = np.sqrt(err_on_mean[0]**2+err_on_mean[1]**2)
diff = mean[1]-mean[0]
print('*** offnodes job count:')
nodecount(offnodes)
print('*** onnodes job count:')
nodecount(onnodes)
print('Fractional change:',diff/mean[0],'+-',diff_err/mean[0])
import matplotlib.pyplot as plt
plt.hist([ontimes,offtimes])
plt.show()
Output:
*** offnodes job count:
sdfrome039 34
sdfrome087 2
sdfrome037 25
sdfrome042 39
*** onnodes job count:
sdfrome007 100
Fractional change: 0.15766088705149858 +- 0.01466574691536575
Third Iteration
*** offnodes job count:
sdfrome035 14
sdfrome114 35
sdfrome087 6
sdfrome042 42
sdfrome073 1
sdfrome036 2
*** onnodes job count:
sdfrome019 34
sdfrome004 2
sdfrome021 64
Fractional change: 0.2588939230105063 +- 0.016541549294602324
Fourth Iteration
*** offnodes job count:
sdfrome042 48
sdfrome043 14
sdfrome111 1
sdfrome039 27
sdfrome086 10
*** onnodes job count:
sdfrome016 100
Fractional change: 0.2359417044882193 +- 0.015870310667490246
Update 2024-09-15
We repeated the test on the roma partition, 105 iterations each with the constraints CrowdStrike_on/CrowdStrike_off alternating. This test was performed during a period of low utilization of the roma partition, with no competing network or storage contention.
Measured runtime for psana analysis of mfxl1028222 run=29:smd on an exclusive node with 120 cores.
Note: the previous measurements were done with run=90:smd. We chose run=29:smd because it has more events and therefore takes longer, minimizing effects related to job startup.
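The benefit of a longer run can be seen with purely hypothetical numbers (none of these are measured values): a fixed startup cost is a smaller fraction of a longer job's wall time, so it biases the on/off comparison less.

```python
# Hypothetical numbers for illustration only (not measured values).
startup = 10.0                      # assumed fixed startup overhead, seconds
short_job, long_job = 120.0, 600.0  # assumed total wall times, seconds

# Fraction of each job's wall time spent in startup:
short_frac = startup / short_job    # startup dominates more of a short job
long_frac = startup / long_job      # and much less of a long one
```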
...
Fractional change: 0.24461288024797354 +- 0.001079561972505891