...

  • Can limit ourselves to 8000 hsd samples (~1us for both ions/electrons)
    • will do development to make slow-ions fit in the 1us window
  • At 1MHz 8000 samples is 16GB/s so too big for one drp node with 4-6GB/s limit (especially with 2 hsd's per node)
    • can reduce with fex by at least a factor of 8 (20 non-contiguous areas of 50 samples each, changing per-event) to get within 4GB/s per drp node (Taran confirmed that we can do this with the 1us window, factor of 8 applies to both electrons and ions)
    • at 100kHz 8000 samples would work from a data volume perspective
  • For the FZP, the 2048 samples from the piranha contain one contiguous "blob" of ~200 pixels that should be used for the outer product; its position changes on a per-event basis.
    • locate the blob by finding the highest pixel and applying a fixed-size window
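The data-volume figures in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 2 bytes per hsd sample (an assumption, chosen because it reproduces the 16GB/s quoted above) and 1 GB = 1e9 bytes; the function name is made up for illustration.

```python
BYTES_PER_SAMPLE = 2  # assumption: 16-bit hsd samples, consistent with 16GB/s at 1MHz


def data_rate_gb_per_s(n_samples, event_rate_hz, n_hsd=1, reduction=1.0):
    """Data volume in GB/s (1 GB = 1e9 bytes) for n_hsd digitizers."""
    return n_samples * BYTES_PER_SAMPLE * event_rate_hz * n_hsd / reduction / 1e9


# One hsd, full 8000-sample window at 1MHz
full_1mhz = data_rate_gb_per_s(8000, 1_000_000)

# fex keeps 20 regions of 50 samples each, i.e. 1000 of the 8000 samples
fex_samples = 20 * 50
fex_reduction = 8000 / fex_samples

# Two hsd's per drp node at 1MHz with the fex reduction applied
fex_node_1mhz = data_rate_gb_per_s(8000, 1_000_000, n_hsd=2, reduction=fex_reduction)

# Two hsd's per drp node at 100kHz without any reduction
full_node_100khz = data_rate_gb_per_s(8000, 100_000, n_hsd=2)

print(f"1 hsd, 1MHz, full window : {full_1mhz:.1f} GB/s")
print(f"fex reduction factor     : {fex_reduction:.0f}")
print(f"2 hsd, 1MHz, fex         : {fex_node_1mhz:.1f} GB/s")
print(f"2 hsd, 100kHz, full      : {full_node_100khz:.1f} GB/s")
```

Under these assumptions the fex case lands at the 4GB/s per-node figure quoted above, with no headroom against the lower end of the 4-6GB/s limit.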

We want these 6 outer-products:

  • electron-electron hsd-hsd outer product (symmetric, same hsd, can save a factor of 2)
  • ion-ion hsd-hsd outer product (symmetric, same hsd, can save a factor of 2)
  • electron-ion hsd-hsd outer product
  • (most important) electron-fzp outer product (fzp is piranha: 2048)
  • ion-fzp outer product (fzp is piranha: 2048)
  • (most important) fzp-fzp outer product (symmetric, can save a factor of 2)
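As an illustration of the six products listed above, here is a minimal NumPy sketch for a single event. The array sizes are scaled down from 8000 and 2048 to keep it small, the waveforms are random stand-ins, and all variable names are made up for the example. The last part shows one way to realise the "factor of 2" for a symmetric product by computing only the upper triangle.

```python
import numpy as np

rng = np.random.default_rng(0)
N_HSD, N_FZP = 80, 20  # scaled down from 8000 and 2048 for illustration

# Random stand-ins for one event's waveforms
electron = rng.standard_normal(N_HSD).astype(np.float32)
ion = rng.standard_normal(N_HSD).astype(np.float32)
fzp = rng.standard_normal(N_FZP).astype(np.float32)

# The six outer products for this event
products = {
    "electron-electron": np.outer(electron, electron),  # symmetric
    "ion-ion": np.outer(ion, ion),                      # symmetric
    "electron-ion": np.outer(electron, ion),
    "electron-fzp": np.outer(electron, fzp),
    "ion-fzp": np.outer(ion, fzp),
    "fzp-fzp": np.outer(fzp, fzp),                      # symmetric
}

for name, mat in products.items():
    print(f"{name:18s} shape={mat.shape}")

# For a symmetric product only the upper triangle is needed:
# N*(N+1)/2 values instead of N*N.
iu = np.triu_indices(N_HSD)
ee_packed = electron[iu[0]] * electron[iu[1]]
print("packed size:", ee_packed.size, "full size:", N_HSD * N_HSD)
```

The packed form trades simple 2-D indexing for roughly half the storage and multiplications; whether that is worthwhile depends on how the accumulated matrix is consumed downstream.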

Performance with Fex Data

We tested the 6 outer products outlined above on s3df, accumulating the results back into the full-size matrices (3 of 8000 x 8000, 2 of 8000 x 2048, and 1 of 2048 x 2048). The performance per core is around 400 Hz. We scale this up to 1MHz with 2048 cores (18 milano nodes).
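The accumulation step described above can be sketched as follows for the electron-fzp matrix. This is not the code in test_fast_outer_filling.py; it is a hypothetical illustration that assumes the fex output is available as unique sample indices plus values (20 non-overlapping regions of 50 samples), and that the fzp blob is selected with a fixed 200-pixel window around the highest pixel. The function names, the synthetic blob and the window centring are all assumptions.

```python
import numpy as np

N_HSD, N_FZP = 8000, 2048
FZP_WINDOW = 200  # assumption: fixed window matching the ~200 pixel blob


def fzp_window(fzp, width=FZP_WINDOW):
    """Indices of a fixed-size window around the highest pixel, clipped to the array."""
    peak = int(np.argmax(fzp))
    start = min(max(peak - width // 2, 0), fzp.size - width)
    return np.arange(start, start + width)


def accumulate_sparse(acc, idx_a, val_a, idx_b, val_b):
    """Add outer(val_a, val_b) into acc at rows idx_a and columns idx_b.

    Assumes the indices within idx_a and within idx_b are unique.
    """
    acc[np.ix_(idx_a, idx_b)] += np.outer(val_a, val_b)


rng = np.random.default_rng(1)

# Synthetic fex output: 20 distinct 50-sample regions out of 8000 samples
slots = np.sort(rng.choice(N_HSD // 50, size=20, replace=False))
hsd_idx = (slots[:, None] * 50 + np.arange(50)).ravel()
hsd_val = rng.standard_normal(hsd_idx.size).astype(np.float32)

# Synthetic fzp spectrum with a blob between pixels 700 and 900
fzp = rng.random(N_FZP).astype(np.float32)
fzp[700:900] += 5.0
fzp_idx = fzp_window(fzp)

# Full-size accumulator; only a 1000 x 200 patch is touched per event
electron_fzp = np.zeros((N_HSD, N_FZP), dtype=np.float32)
accumulate_sparse(electron_fzp, hsd_idx, hsd_val, fzp_idx, fzp[fzp_idx])

print("touched elements:", hsd_idx.size * fzp_idx.size, "of", electron_fzp.size)
```

The in-place `+=` through `np.ix_` is only correct when no index repeats, which holds here because the regions do not overlap.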


The script for this performance test is test_fast_outer_filling.py; it was submitted with submit_s3df.sh.

Performance with reduced full data

NumPy Outer Products

Code Block
languagebash
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=(2048,),b.shape=(59400,) dtype=<class 'numpy.int16'>
a,b min: 0.07 max: 0.09 avg: 0.08 rate:12.04Hz
b,b min: 2.03 max: 2.48 avg: 2.31 rate:0.43Hz
total min: 2.10 max: 2.57 avg: 2.40 rate:0.42Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=(2048,),b.shape=(59400,) dtype=<class 'numpy.float32'>
a,b min: 0.14 max: 0.17 avg: 0.16 rate:6.31Hz
b,b min: 3.88 max: 4.64 avg: 4.36 rate:0.23Hz
total min: 4.02 max: 4.82 avg: 4.52 rate:0.22Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=(2048,),b.shape=(59400,) dtype=<class 'numpy.float64'>
a,b min: 0.28 max: 0.33 avg: 0.31 rate:3.26Hz
b,b min: 7.45 max: 8.67 avg: 8.15 rate:0.12Hz
total min: 7.73 max: 9.00 avg: 8.46 rate:0.12Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=(2048,),b.shape=(8000,) dtype=<class 'numpy.int16'>
a,b min: 0.01 max: 0.01 avg: 0.01 rate:77.62Hz
b,b min: 0.04 max: 0.05 avg: 0.05 rate:19.64Hz
total min: 0.06 max: 0.07 avg: 0.06 rate:15.68Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=(2048,),b.shape=(8000,) dtype=<class 'numpy.float32'>
a,b min: 0.02 max: 0.03 avg: 0.03 rate:38.00Hz
b,b min: 0.09 max: 0.11 avg: 0.10 rate:9.87Hz
total min: 0.11 max: 0.14 avg: 0.13 rate:7.84Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=(2048,),b.shape=(8000,) dtype=<class 'numpy.float64'>
a,b min: 0.04 max: 0.05 avg: 0.05 rate:19.58Hz
b,b min: 0.17 max: 0.21 avg: 0.20 rate:5.00Hz
total min: 0.22 max: 0.27 avg: 0.25 rate:3.98Hz
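For readers who want to repeat this kind of measurement, the following is a minimal timing sketch that prints output in a similar layout. It is not the test_np_outer.py used above (that script is not shown here); the function names, repeat count and random inputs are assumptions, and small sizes are used in the example call so it finishes quickly.

```python
import time

import numpy as np


def time_outer(a, b, n_repeat=5):
    """Return (min, max, avg) wall-clock seconds for np.outer(a, b)."""
    times = []
    for _ in range(n_repeat):
        t0 = time.perf_counter()
        np.outer(a, b)
        times.append(time.perf_counter() - t0)
    return min(times), max(times), sum(times) / len(times)


def report(n_vls=2048, n_wf=8000, dtype=np.float32, n_repeat=5):
    """Time the vls/wf ("a,b") and wf/wf ("b,b") outer products for one dtype."""
    rng = np.random.default_rng(0)
    a = (rng.random(n_vls) * 100).astype(dtype)
    b = (rng.random(n_wf) * 100).astype(dtype)
    print(f"a.shape={a.shape},b.shape={b.shape} dtype={dtype}")
    averages = {}
    for label, (x, y) in {"a,b": (a, b), "b,b": (b, b)}.items():
        tmin, tmax, tavg = time_outer(x, y, n_repeat)
        averages[label] = tavg
        print(f"{label} min: {tmin:.4f} max: {tmax:.4f} avg: {tavg:.4f} rate:{1 / tavg:.2f}Hz")
    return averages


# Small sizes so the example runs quickly; use n_wf=8000 or 59400 to compare with the logs
averages = report(n_vls=256, n_wf=1000, dtype=np.float32, n_repeat=3)
```

Absolute numbers depend heavily on the machine and on memory allocation for the result matrix, so only ratios between dtypes and sizes are likely to carry over.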

PyTorch Outer Products

Code Block
languagebash
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=torch.Size([2048]),b.shape=torch.Size([59400]) dtype=torch.int16
a,b min: 0.02 max: 0.04 avg: 0.03 rate:29.94Hz
b,b min: 0.35 max: 0.85 avg: 0.67 rate:1.49Hz
total min: 0.37 max: 0.89 avg: 0.71 rate:1.42Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=torch.Size([2048]),b.shape=torch.Size([59400]) dtype=torch.float32
a,b min: 0.04 max: 0.07 avg: 0.06 rate:17.47Hz
b,b min: 0.59 max: 1.55 avg: 1.19 rate:0.84Hz
total min: 0.63 max: 1.62 avg: 1.25 rate:0.80Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=torch.Size([2048]),b.shape=torch.Size([59400]) dtype=torch.float64
a,b min: 0.06 max: 0.13 avg: 0.10 rate:10.24Hz
b,b min: 1.06 max: 2.89 avg: 2.13 rate:0.47Hz
total min: 1.12 max: 3.02 avg: 2.23 rate:0.45Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=torch.Size([2048]),b.shape=torch.Size([8000]) dtype=torch.int16
a,b min: 0.01 max: 0.02 avg: 0.01 rate:73.23Hz
b,b min: 0.02 max: 0.03 avg: 0.02 rate:45.16Hz
total min: 0.03 max: 0.04 avg: 0.04 rate:27.93Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=torch.Size([2048]),b.shape=torch.Size([8000]) dtype=torch.float32
a,b min: 0.01 max: 0.02 avg: 0.02 rate:57.11Hz
b,b min: 0.02 max: 0.04 avg: 0.03 rate:30.44Hz
total min: 0.04 max: 0.06 avg: 0.05 rate:19.86Hz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ python test_np_outer.py 
a.shape=torch.Size([2048]),b.shape=torch.Size([8000]) dtype=torch.float64
a,b min: 0.02 max: 0.03 avg: 0.02 rate:40.98Hz
b,b min: 0.03 max: 0.07 avg: 0.06 rate:17.91Hz
total min: 0.05 max: 0.10 avg: 0.08 rate:12.46Hz

Python script with psana2

Note: Timing values shown below are from single precision variables

Code Block
languagebash
_products.py 0 1 2048 59400
nevent: 1 vls: (2048,) inner: (59400,) outer: (59400,) dt:15.693s. load:0.037s calc:15.656s outer vls/wf:0.29s, wf/wf:6.68s add vls/wf:0.29s wf/wf:8.39s tt:15.693s. rate:0.06Hz
nevent: 2 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.523s. load:0.004s calc:10.519s outer vls/wf:0.29s, wf/wf:7.69s add vls/wf:0.09s wf/wf:2.45s tt:26.216s. rate:0.08Hz
nevent: 3 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.526s. load:0.003s calc:10.523s outer vls/wf:0.29s, wf/wf:7.70s add vls/wf:0.09s wf/wf:2.44s tt:36.742s. rate:0.08Hz
nevent: 4 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.445s. load:0.003s calc:10.442s outer vls/wf:0.34s, wf/wf:7.57s add vls/wf:0.09s wf/wf:2.44s tt:47.187s. rate:0.08Hz
nevent: 5 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.627s. load:0.003s calc:10.624s outer vls/wf:0.33s, wf/wf:7.74s add vls/wf:0.09s wf/wf:2.46s tt:57.814s. rate:0.09Hz
nevent: 6 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.420s. load:0.003s calc:10.417s outer vls/wf:0.33s, wf/wf:7.56s add vls/wf:0.09s wf/wf:2.44s tt:68.234s. rate:0.09Hz
nevent: 7 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.580s. load:0.003s calc:10.577s outer vls/wf:0.35s, wf/wf:7.69s add vls/wf:0.09s wf/wf:2.45s tt:78.814s. rate:0.09Hz
nevent: 8 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.469s. load:0.005s calc:10.463s outer vls/wf:0.33s, wf/wf:7.59s add vls/wf:0.09s wf/wf:2.46s tt:89.283s. rate:0.09Hz
nevent: 8 tt:92.53516s. rate:0.011566894771298394kHz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ mpirun -n 3 python run_outer_products.py 0 1 2048 59400
nevent: 1 vls: (2048,) outer: (59400,) dt:9.86s. load:0.04s calc:9.82s tt:9.86s. rate:0.10Hz
nevent: 2 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.36s tt:16.22s. rate:0.12Hz
nevent: 3 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.36s tt:22.59s. rate:0.13Hz
nevent: 4 vls: (2048,) outer: (59400,) dt:6.38s. load:0.00s calc:6.38s tt:28.97s. rate:0.14Hz
nevent: 5 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.37s tt:35.34s. rate:0.14Hz
nevent: 6 vls: (2048,) outer: (59400,) dt:6.34s. load:0.00s calc:6.33s tt:41.68s. rate:0.14Hz
nevent: 7 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.36s tt:48.04s. rate:0.15Hz
nevent: 8 vls: (2048,) outer: (59400,) dt:6.35s. load:0.00s calc:6.34s tt:54.39s. rate:0.15Hz
nevent: 8 tt:56.44s. rate:0.00706kHz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ mpirun -n 3 python run_outer_products.py 0 1 2048 8000
nevent: 1 vls: (2048,) outer: (8000,) dt:0.30s. load:0.03s calc:0.26s tt:0.30s. rate:3.39Hz
nevent: 2 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.46s. rate:4.31Hz
nevent: 3 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.63s. rate:4.76Hz
nevent: 4 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.80s. rate:5.01Hz
nevent: 5 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.97s. rate:5.18Hz
nevent: 6 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:1.13s. rate:5.30Hz
nevent: 7 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:1.30s. rate:5.39Hz
nevent: 8 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:1.47s. rate:5.45Hz
nevent: 8 tt:4.29s. rate:0.00054kHz

Reducing wf length from 59400 to 8000,

...

the 2 outer products ran 40x faster.
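As a rough cross-check of this speed-up, the number of matrix elements produced per event can be compared for the two waveform lengths. The sketch below assumes the cost scales with the element count of the two products (vls/wf and wf/wf), which is a simplification; the measured per-event times (6.37s and 0.17s) are taken from the steady-state `dt` values in the logs above.

```python
n_vls = 2048
n_long, n_short = 59400, 8000

# Element counts of the two outer products for each waveform length
elements_long = n_vls * n_long + n_long**2
elements_short = n_vls * n_short + n_short**2

vls_wf_ratio = n_long / n_short          # vls/wf product scales linearly with wf length
wf_wf_ratio = (n_long / n_short) ** 2    # wf/wf product scales quadratically
expected_ratio = elements_long / elements_short

# Steady-state per-event times reported in the logs (seconds)
measured_ratio = 6.37 / 0.17

print(f"vls/wf element ratio : {vls_wf_ratio:.1f}")
print(f"wf/wf element ratio  : {wf_wf_ratio:.1f}")
print(f"combined expectation : {expected_ratio:.1f}")
print(f"measured speed-up    : {measured_ratio:.1f}")
```

Under this assumption the element-count estimate and the measurement bracket the quoted figure of about 40x, with the wf/wf product dominating in both cases.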

Mona should test:

  • the "sparse" outer product from fex
  • full 8000x8000 (plus 8000x2048 for fzp):
    • we can do 8000 samples from 2 hsd channels per drp node at 100kHz (3.2GB/s) from a data-volume perspective but need to check from a CPU perspective.

Memory Usage

The full-matrix memory usage with MPI looks like 8GB for 60 cores with 8000 samples.

...

Code Block
languagebash
>>> b,vls,hsd=(8,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.643232768
>>> b,vls,hsd=(4,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.321616384
>>> b,vls,hsd=(2,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.160808192

Observed memory usage

Single precision

1 core: 1.3GB

10 cores: 7.9GB (0.8GB per core)

50 cores: 17.3GB (0.35GB per core)

Double precision

50 cores: 50.8GB (1GB per core)
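The estimate from the interpreter session above can be wrapped in a small helper and set against the 50-core observations. This is a sketch only: the function name is made up, the formula is copied term by term from the session (two vector terms plus the vls x hsd and hsd x hsd matrices), and the comparison assumes every core holds its own full copy of the matrices.

```python
def full_matrix_gb(bytes_per_elem, n_vls=2048, n_hsd=8000):
    """Per-core estimate in GB, same terms as the expression used in the session above."""
    vectors = (n_vls * bytes_per_elem * 2) + (n_hsd * bytes_per_elem * 2)
    vls_hsd = n_vls * n_hsd * bytes_per_elem
    hsd_hsd = n_hsd * n_hsd * bytes_per_elem
    return (vectors + vls_hsd + hsd_hsd) / 1e9


# Observed totals for 50 cores, taken from the figures listed above (GB)
observed_50_cores = {"single (4 bytes)": (4, 17.3), "double (8 bytes)": (8, 50.8)}

estimates = {}
for label, (nbytes, observed_gb) in observed_50_cores.items():
    estimates[label] = 50 * full_matrix_gb(nbytes)
    print(f"{label}: estimated {estimates[label]:.1f} GB, observed {observed_gb} GB")
```

The single-precision estimate is close to the observed total, while the double-precision observation is noticeably higher than this simple formula predicts, so other per-process allocations are presumably contributing.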

In the future, at 1MHz, we hope to have a better detector for the fzp that also runs at 1MHz.

...