These scripts and log files can be found in ~cpo/problems/crowdstrike/.
First Iteration
Submitted the following script multiple times on S3DF (along with an identical script using the constraint "CrowdStrike_off"):
#!/bin/bash
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

mpirun python mfxl1028222.py
import time
startup_begin = time.time()
from psana import *
import sys

ds = MPIDataSource('exp=mfxl1028222:run=90:smd')
det = Detector('epix10k2M')
ngood = 0
for nevt, evt in enumerate(ds.events()):
    calib = det.calib(evt)
    if calib is not None:
        ngood += 1
    if nevt == 0:
        startup_end = time.time()
        start = time.time()
tottime = time.time() - start
print('processed', ngood, tottime, tottime/(ngood-1))  # we ignored first event so -1
print('startup', startup_end - startup_begin)
Ran this to get job runtimes:
sacct -j 42602286,42602287,42602516,42602519,42602573,42602576,42602682,42603207,42696097,42696120,42696193,42696194,42696537,42696539,42696567,42696568,42696605,42696606,42696667,42696670,42696744,42696745,42696794,42696797,42714615,42714616,42714723,42714724,42714996,42714998,42715300,42715302,42715657,42715658,42724310,42724317,42725442,42725447,42730341,42730353,42731038,42731045,42738739,42738750,42739483,42739491,42741266,42741272 --format=elapsedraw,constraint,reqcpus,nodelist,start | grep Crowd
See this output (columns: elapsed seconds, constraint, requested CPUs, node list, start time):
120 CrowdStrike_off 120 sdfrome047 2024-03-22T10:20:21
127 CrowdStrike_on 120 sdfrome027 2024-03-22T10:20:23
121 CrowdStrike_on 120 sdfrome006 2024-03-22T10:33:34
114 CrowdStrike_off 120 sdfrome079 2024-03-22T10:33:36
122 CrowdStrike_off 120 sdfrome039 2024-03-22T10:37:38
92 CrowdStrike_on 120 sdfrome006 2024-03-22T10:37:39
131 CrowdStrike_on 120 sdfrome022 2024-03-22T10:43:11
125 CrowdStrike_off 120 sdfrome080 2024-03-22T10:50:08
110 CrowdStrike_off 120 sdfrome120 2024-03-25T17:43:20
139 CrowdStrike_on 120 sdfrome023 2024-03-25T17:43:29
89 CrowdStrike_on 120 sdfrome023 2024-03-25T17:47:06
112 CrowdStrike_off 120 sdfrome109 2024-03-25T17:47:08
108 CrowdStrike_off 120 sdfrome111 2024-03-25T17:52:35
137 CrowdStrike_on 120 sdfrome003 2024-03-25T17:52:40
88 CrowdStrike_on 120 sdfrome003 2024-03-25T17:55:22
69 CrowdStrike_off 120 sdfrome111 2024-03-25T17:55:30
67 CrowdStrike_off 120 sdfrome111 2024-03-25T17:57:47
79 CrowdStrike_on 120 sdfrome003 2024-03-25T17:57:47
75 CrowdStrike_on 120 sdfrome003 2024-03-25T17:59:37
68 CrowdStrike_off 120 sdfrome111 2024-03-25T17:59:39
127 CrowdStrike_off 120 sdfrome115 2024-03-25T18:03:19
125 CrowdStrike_on 120 sdfrome004 2024-03-25T18:03:22
82 CrowdStrike_on 120 sdfrome004 2024-03-25T18:07:34
128 CrowdStrike_off 120 sdfrome115 2024-03-25T18:07:34
118 CrowdStrike_off 120 sdfrome119 2024-03-26T07:17:31
122 CrowdStrike_on 120 sdfrome028 2024-03-26T07:17:38
133 CrowdStrike_on 120 sdfrome003 2024-03-26T07:26:39
85 CrowdStrike_off 120 sdfrome119 2024-03-26T07:26:39
113 CrowdStrike_off 120 sdfrome075 2024-03-26T07:46:04
128 CrowdStrike_on 120 sdfrome004 2024-03-26T07:46:06
128 CrowdStrike_on 120 sdfrome010 2024-03-26T08:06:06
113 CrowdStrike_off 120 sdfrome116 2024-03-26T08:06:08
124 CrowdStrike_off 120 sdfrome121 2024-03-26T08:38:10
138 CrowdStrike_on 120 sdfrome024 2024-03-26T08:38:12
117 CrowdStrike_off 120 sdfrome085 2024-03-26T11:00:23
125 CrowdStrike_on 120 sdfrome011 2024-03-26T11:00:25
124 CrowdStrike_on 120 sdfrome012 2024-03-26T11:06:10
146 CrowdStrike_off 120 sdfrome088 2024-03-26T11:06:12
116 CrowdStrike_off 120 sdfrome091 2024-03-26T11:31:02
121 CrowdStrike_on 120 sdfrome015 2024-03-26T11:31:05
74 CrowdStrike_on 120 sdfrome015 2024-03-26T11:34:28
79 CrowdStrike_off 120 sdfrome091 2024-03-26T11:34:29
121 CrowdStrike_off 120 sdfrome098 2024-03-26T12:15:02
120 CrowdStrike_on 120 sdfrome016 2024-03-26T12:15:04
84 CrowdStrike_on 120 sdfrome016 2024-03-26T12:19:06
82 CrowdStrike_off 120 sdfrome098 2024-03-26T12:19:08
129 CrowdStrike_off 120 sdfrome100 2024-03-26T12:28:32
144 CrowdStrike_on 120 sdfrome025 2024-03-26T12:28:34
Ran this analysis script on the above output (saved to the file junk.out):
f = open('junk.out', 'r')
nodes = []
ontimes = []
offtimes = []
for line in f:
    fields = line.split()
    node = fields[3]
    if node in nodes:
        print('skipping duplicate node run to avoid caching issues:', node)
        continue
    nodes.append(node)
    on = 'on' in fields[1]
    time = int(fields[0])
    if on:
        ontimes.append(time)
    else:
        offtimes.append(time)

import numpy as np
mean = []
err_on_mean = []
for times in [offtimes, ontimes]:
    print(times)
    mean.append(np.mean(times))
    err_on_mean.append(np.std(times)/np.sqrt(len(times)))
diff_err = np.sqrt(err_on_mean[0]**2 + err_on_mean[1]**2)
diff = mean[1] - mean[0]
print('Fractional change:', diff/mean[0], '+-', diff_err/mean[0])

import matplotlib.pyplot as plt
plt.hist([ontimes, offtimes])
plt.show()
See the following output:
(ana-4.0.59-py3) [cpo@sdfiana002 problems]$ python junk3.py
skipping duplicate node run to avoid caching issues: sdfrome006
skipping duplicate node run to avoid caching issues: sdfrome023
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome004
skipping duplicate node run to avoid caching issues: sdfrome115
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome119
skipping duplicate node run to avoid caching issues: sdfrome004
skipping duplicate node run to avoid caching issues: sdfrome015
skipping duplicate node run to avoid caching issues: sdfrome091
skipping duplicate node run to avoid caching issues: sdfrome016
skipping duplicate node run to avoid caching issues: sdfrome098
[120, 114, 122, 125, 110, 112, 108, 127, 118, 113, 113, 124, 117, 146, 116, 121]
[127, 121, 131, 139, 137, 125, 122, 128, 138, 125, 124, 121, 120]
Fractional change: 0.07062716926305589 +- 0.023752642242268553
With the following plot:
This suggests a (7.1 +- 2.4)% performance penalty from CrowdStrike.
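As a quick cross-check (not part of the original analysis), the fractional change and its uncertainty can be reproduced directly from the two de-duplicated lists printed above:

import numpy as np

# Elapsed times (seconds) copied from the script output above.
offtimes = [120, 114, 122, 125, 110, 112, 108, 127, 118, 113, 113, 124, 117, 146, 116, 121]
ontimes  = [127, 121, 131, 139, 137, 125, 122, 128, 138, 125, 124, 121, 120]

mean_off, mean_on = np.mean(offtimes), np.mean(ontimes)
# Standard error on each mean; the two samples are independent, so the
# errors combine in quadrature for the difference of means.
err_off = np.std(offtimes) / np.sqrt(len(offtimes))
err_on  = np.std(ontimes) / np.sqrt(len(ontimes))
diff_err = np.hypot(err_off, err_on)

print((mean_on - mean_off) / mean_off)  # ~0.0706: ~7.1% slower with CrowdStrike on
print(diff_err / mean_off)              # ~0.0238: ~2.4% uncertainty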
Second Iteration
Caching
Added a filesystem-cache-flush command from Yee and increased the statistics to 100 jobs of each type (CrowdStrike on/off). Ran only one CrowdStrike_on job and one CrowdStrike_off job at a time to avoid leaning too heavily on the filesystem. Unfortunately this means all the "on" jobs ran on sdfrome007 (100 jobs), while the "off" jobs ran on sdfrome039 (34 jobs), sdfrome037 (25 jobs), sdfrome042 (39 jobs), and sdfrome087 (2 jobs). The first "on" job time was 133 seconds (Slurm job id 42810224) and the first "off" job time was 119 seconds. These look consistent with the distribution of all job times (see plot below), suggesting that data caching has not been a big effect since adding Yee's cache-flush command (previously some jobs ran as quickly as 90 seconds).
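The source of /sdf/group/scs/tools/free-pagecache isn't reproduced here; a minimal sketch of what such a cache-flush wrapper typically does on Linux (an assumption about the tool, not its actual contents):

#!/bin/bash
# Hypothetical sketch, assuming the tool wraps the kernel's drop_caches interface.
# sync writes dirty pages back to disk; writing 3 to drop_caches then frees the
# page cache plus reclaimable dentries and inodes. Requires root privileges.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches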
Code and Results
for i in $(seq 1 100); do
    sbatch junk.sh
    sbatch junk1.sh
done
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=cson
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

echo "***cson " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py
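junk1.sh wasn't captured in these notes. Since Slurm's --dependency=singleton runs at most one job per job name at a time (which is what serialized the jobs as described above), it presumably differs from junk.sh only in the job name, the constraint, and the log marker. A hypothetical reconstruction:

#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=csoff
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_off
#SBATCH --account=lcls:prjdat21

# "***csoff" is a guess: any marker containing "***" but not "cson" is
# classified as an "off" run by the log-parsing script below.
echo "***csoff " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py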
import time
startup_begin = time.time()
from psana import *
import sys

ds = MPIDataSource('exp=mfxl1028222:run=90:smd')
det = Detector('epix10k2M')
ngood = 0
for nevt, evt in enumerate(ds.events()):
    calib = det.calib(evt)
    if calib is not None:
        ngood += 1
    if nevt == 0:
        startup_end = time.time()
        start = time.time()
tottime = time.time() - start
#print('processed', ngood, tottime, tottime/(ngood-1))  # we ignored first event so -1
#print('startup', startup_end - startup_begin)
import glob
logs = glob.glob('*.log')
logs.sort()  # put them in time order
nodes = []
ontimes = []
offtimes = []
print(logs)
for log in logs:
    f = open(log, 'r')
    on = False
    for line in f:
        if '***' in line:
            if 'cson' in line:
                on = True
            node = line.split()[1]
        if 'real' in line:
            timestr = line.split()[1]
            hours_minutes = timestr.split('m')
            minutes = float(hours_minutes[0])
            seconds = float(hours_minutes[1][:-1])
            time = minutes*60 + seconds
            #if node in nodes:
            #    print('skipping duplicate node', node)
            #    continue
            nodes.append(node)
            if on:
                ontimes.append(time)
            else:
                offtimes.append(time)

import numpy as np
mean = []
err_on_mean = []
for times in [offtimes, ontimes]:
    print(times)
    mean.append(np.mean(times))
    err_on_mean.append(np.std(times)/np.sqrt(len(times)))
diff_err = np.sqrt(err_on_mean[0]**2 + err_on_mean[1]**2)
diff = mean[1] - mean[0]
print('Fractional change:', diff/mean[0], '+-', diff_err/mean[0])

import matplotlib.pyplot as plt
plt.hist([ontimes, offtimes])
plt.show()
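For reference, the "real" line parsed above is produced by bash's time keyword (format "real XmY.YYYs"; the variable name hours_minutes in the script is a misnomer, as the split actually yields minutes and seconds). A worked example of the same parsing on a made-up value:

# Worked example of the timing parse above (sample line, not from a real log).
line = 'real\t2m13.450s'
timestr = line.split()[1]              # '2m13.450s'
minutes, seconds = timestr.split('m')  # '2', '13.450s'
total = float(minutes)*60 + float(seconds[:-1])  # strip the trailing 's'
print(total)  # 133.45 seconds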
Output:
*** offnodes job count:
sdfrome039 34
sdfrome087 2
sdfrome037 25
sdfrome042 39
*** onnodes job count:
sdfrome007 100
Fractional change: 0.15766088705149858 +- 0.01466574691536575
Third Iteration
*** offnodes job count:
sdfrome035 14
sdfrome114 35
sdfrome087 6
sdfrome042 42
sdfrome073 1
sdfrome036 2
*** onnodes job count:
sdfrome019 34
sdfrome004 2
sdfrome021 64
Fractional change: 0.2588939230105063 +- 0.016541549294602324
Fourth Iteration
*** offnodes job count:
sdfrome042 48
sdfrome043 14
sdfrome111 1
sdfrome039 27
sdfrome086 10
*** onnodes job count:
sdfrome016 100
Fractional change: 0.2359417044882193 +- 0.015870310667490246