First Iteration
Submitted the following script on S3DF multiple times (along with an identical script using the constraint "CrowdStrike_off"):
```bash
#!/bin/bash
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

mpirun python mfxl1028222.py
```
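The "CrowdStrike_off" companion script is not reproduced in the source; it is described as identical except for the constraint. A sketch under that assumption:

```shell
#!/bin/bash
# Assumed companion script: same as the CrowdStrike_on script above,
# with only the --constraint line changed.
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_off
#SBATCH --account=lcls:prjdat21

mpirun python mfxl1028222.py
```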
```python
import time
startup_begin = time.time()
from psana import *
import sys

ds = MPIDataSource('exp=mfxl1028222:run=90:smd')
det = Detector('epix10k2M')

ngood = 0
for nevt, evt in enumerate(ds.events()):
    calib = det.calib(evt)
    if calib is not None:
        ngood += 1
    if nevt == 0:
        # Record startup (imports plus first event) separately, and start
        # the per-event clock after the first event
        startup_end = time.time()
        start = time.time()
tottime = time.time() - start
print('processed', ngood, tottime, tottime/(ngood-1))  # we ignored first event so -1
print('startup', startup_end - startup_begin)
```
Ran this to get job runtimes:
```bash
sacct -j 42602286,42602287,42602516,42602519,42602573,42602576,42602682,42603207,42696097,42696120,42696193,42696194,42696537,42696539,42696567,42696568,42696605,42696606,42696667,42696670,42696744,42696745,42696794,42696797,42714615,42714616,42714723,42714724,42714996,42714998,42715300,42715302,42715657,42715658,42724310,42724317,42725442,42725447,42730341,42730353,42731038,42731045,42738739,42738750,42739483,42739491,42741266,42741272 --format=elapsedraw,constraint,reqcpus,nodelist,start | grep Crowd
```
See this output:
```
120 CrowdStrike_off 120 sdfrome047 2024-03-22T10:20:21
127 CrowdStrike_on  120 sdfrome027 2024-03-22T10:20:23
121 CrowdStrike_on  120 sdfrome006 2024-03-22T10:33:34
114 CrowdStrike_off 120 sdfrome079 2024-03-22T10:33:36
122 CrowdStrike_off 120 sdfrome039 2024-03-22T10:37:38
 92 CrowdStrike_on  120 sdfrome006 2024-03-22T10:37:39
131 CrowdStrike_on  120 sdfrome022 2024-03-22T10:43:11
125 CrowdStrike_off 120 sdfrome080 2024-03-22T10:50:08
110 CrowdStrike_off 120 sdfrome120 2024-03-25T17:43:20
139 CrowdStrike_on  120 sdfrome023 2024-03-25T17:43:29
 89 CrowdStrike_on  120 sdfrome023 2024-03-25T17:47:06
112 CrowdStrike_off 120 sdfrome109 2024-03-25T17:47:08
108 CrowdStrike_off 120 sdfrome111 2024-03-25T17:52:35
137 CrowdStrike_on  120 sdfrome003 2024-03-25T17:52:40
 88 CrowdStrike_on  120 sdfrome003 2024-03-25T17:55:22
 69 CrowdStrike_off 120 sdfrome111 2024-03-25T17:55:30
 67 CrowdStrike_off 120 sdfrome111 2024-03-25T17:57:47
 79 CrowdStrike_on  120 sdfrome003 2024-03-25T17:57:47
 75 CrowdStrike_on  120 sdfrome003 2024-03-25T17:59:37
 68 CrowdStrike_off 120 sdfrome111 2024-03-25T17:59:39
127 CrowdStrike_off 120 sdfrome115 2024-03-25T18:03:19
125 CrowdStrike_on  120 sdfrome004 2024-03-25T18:03:22
 82 CrowdStrike_on  120 sdfrome004 2024-03-25T18:07:34
128 CrowdStrike_off 120 sdfrome115 2024-03-25T18:07:34
118 CrowdStrike_off 120 sdfrome119 2024-03-26T07:17:31
122 CrowdStrike_on  120 sdfrome028 2024-03-26T07:17:38
133 CrowdStrike_on  120 sdfrome003 2024-03-26T07:26:39
 85 CrowdStrike_off 120 sdfrome119 2024-03-26T07:26:39
113 CrowdStrike_off 120 sdfrome075 2024-03-26T07:46:04
128 CrowdStrike_on  120 sdfrome004 2024-03-26T07:46:06
128 CrowdStrike_on  120 sdfrome010 2024-03-26T08:06:06
113 CrowdStrike_off 120 sdfrome116 2024-03-26T08:06:08
124 CrowdStrike_off 120 sdfrome121 2024-03-26T08:38:10
138 CrowdStrike_on  120 sdfrome024 2024-03-26T08:38:12
117 CrowdStrike_off 120 sdfrome085 2024-03-26T11:00:23
125 CrowdStrike_on  120 sdfrome011 2024-03-26T11:00:25
124 CrowdStrike_on  120 sdfrome012 2024-03-26T11:06:10
146 CrowdStrike_off 120 sdfrome088 2024-03-26T11:06:12
116 CrowdStrike_off 120 sdfrome091 2024-03-26T11:31:02
121 CrowdStrike_on  120 sdfrome015 2024-03-26T11:31:05
 74 CrowdStrike_on  120 sdfrome015 2024-03-26T11:34:28
 79 CrowdStrike_off 120 sdfrome091 2024-03-26T11:34:29
121 CrowdStrike_off 120 sdfrome098 2024-03-26T12:15:02
120 CrowdStrike_on  120 sdfrome016 2024-03-26T12:15:04
 84 CrowdStrike_on  120 sdfrome016 2024-03-26T12:19:06
 82 CrowdStrike_off 120 sdfrome098 2024-03-26T12:19:08
129 CrowdStrike_off 120 sdfrome100 2024-03-26T12:28:32
144 CrowdStrike_on  120 sdfrome025 2024-03-26T12:28:34
```
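Each record has five whitespace-separated fields in the order requested from sacct: elapsedraw (seconds), constraint, reqcpus, nodelist, start. A quick sanity check on the first record above:

```python
# Fields follow the sacct --format order: elapsedraw, constraint, reqcpus, nodelist, start
record = "120 CrowdStrike_off 120 sdfrome047 2024-03-22T10:20:21"
fields = record.split()

elapsed = int(fields[0])   # wall-clock runtime of the job, in seconds
constraint = fields[1]     # CrowdStrike_on or CrowdStrike_off
node = fields[3]           # node the job ran on

assert elapsed == 120
assert constraint == "CrowdStrike_off"
assert node == "sdfrome047"
```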
Ran this analysis script on the above output (saved as junk.out):
```python
import numpy as np

# Parse the sacct output; fields are elapsedraw, constraint, reqcpus, nodelist, start
f = open('junk.out', 'r')
nodes = []
ontimes = []
offtimes = []
for line in f:
    fields = line.split()
    node = fields[3]
    # Keep only the first run on each node: a repeat run on the same node
    # may benefit from warm filesystem caches
    if node in nodes:
        print('skipping duplicate node run to avoid caching issues:', node)
        continue
    nodes.append(node)
    on = 'on' in fields[1]  # true only for CrowdStrike_on
    time = int(fields[0])
    if on:
        ontimes.append(time)
    else:
        offtimes.append(time)

# Mean runtime and standard error on the mean for each sample
mean = []
err_on_mean = []
for times in [offtimes, ontimes]:
    print(times)
    mean.append(np.mean(times))
    err_on_mean.append(np.std(times)/np.sqrt(len(times)))

# Fractional slowdown (on vs. off), with the two errors added in quadrature
diff = mean[1] - mean[0]
diff_err = np.sqrt(err_on_mean[0]**2 + err_on_mean[1]**2)
print('Fractional change:', diff/mean[0], '+-', diff_err/mean[0])

import matplotlib.pyplot as plt
plt.hist([ontimes, offtimes])
plt.show()
```
See the following output:
```
(ana-4.0.59-py3) [cpo@sdfiana002 problems]$ python junk3.py
skipping duplicate node run to avoid caching issues: sdfrome006
skipping duplicate node run to avoid caching issues: sdfrome023
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome111
skipping duplicate node run to avoid caching issues: sdfrome004
skipping duplicate node run to avoid caching issues: sdfrome115
skipping duplicate node run to avoid caching issues: sdfrome003
skipping duplicate node run to avoid caching issues: sdfrome119
skipping duplicate node run to avoid caching issues: sdfrome004
skipping duplicate node run to avoid caching issues: sdfrome015
skipping duplicate node run to avoid caching issues: sdfrome091
skipping duplicate node run to avoid caching issues: sdfrome016
skipping duplicate node run to avoid caching issues: sdfrome098
[120, 114, 122, 125, 110, 112, 108, 127, 118, 113, 113, 124, 117, 146, 116, 121]
[127, 121, 131, 139, 137, 125, 122, 128, 138, 125, 124, 121, 120]
Fractional change: 0.07062716926305589 +- 0.023752642242268553
```
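The fractional change can be reproduced without numpy from the two lists printed above, using the same formulas as the analysis script (population standard deviation over sqrt(N) for each standard error, errors added in quadrature):

```python
import math
from statistics import mean, pstdev

# Runtimes (seconds) from the output above: CrowdStrike off, then on
offtimes = [120, 114, 122, 125, 110, 112, 108, 127, 118, 113, 113, 124, 117, 146, 116, 121]
ontimes = [127, 121, 131, 139, 137, 125, 122, 128, 138, 125, 124, 121, 120]

# Standard error of each mean; pstdev matches np.std's default ddof=0
err_off = pstdev(offtimes) / math.sqrt(len(offtimes))
err_on = pstdev(ontimes) / math.sqrt(len(ontimes))

diff = mean(ontimes) - mean(offtimes)
diff_err = math.sqrt(err_off**2 + err_on**2)

frac = diff / mean(offtimes)
frac_err = diff_err / mean(offtimes)
print('Fractional change:', frac, '+-', frac_err)  # reproduces 0.0706... +- 0.0237...
```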
With the following plot (a histogram of the CrowdStrike-on and CrowdStrike-off runtimes):
This suggests a (7.1 ± 2.4)% runtime penalty from CrowdStrike.
Second Iteration
Added a filesystem-cache-flush command from Yee and resubmitted many job pairs to try to increase the statistics:
```bash
for i in $(seq 1 100); do
    sbatch junk.sh
    sbatch junk1.sh
done
```
```bash
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=cson
#SBATCH --partition=roma
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=120
#SBATCH --output=%j.log
#SBATCH --constraint=CrowdStrike_on
#SBATCH --account=lcls:prjdat21

echo "***cson " `hostname`
/sdf/group/scs/tools/free-pagecache
time mpirun python mfxl1028222.py
```