TMO Outer Product

Discussions with Taran, Mona on Feb. 9, 2023

We can limit ourselves to 8000 hsd samples (~1us) per event.
At 1MHz, 8000 samples is 16GB/s, which is too much for one drp node with its 4-6GB/s limit (especially with 2 hsd's per node)
- fex can reduce this by at least a factor of 8, which brings it within 4GB/s per drp node
- at 100kHz, 8000 samples would work from a data-volume perspective
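
The data-rate figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes 2 bytes per hsd sample (an assumption, chosen because it matches the 16GB/s figure); the function and parameter names are illustrative only.

```python
BYTES_PER_SAMPLE = 2  # assumption: 16-bit hsd samples


def hsd_rate_gb_per_s(n_samples, rate_hz, n_hsd=1, fex_reduction=1):
    """Data rate in GB/s for n_hsd digitizers at the given event rate."""
    n_bytes_per_s = n_samples * BYTES_PER_SAMPLE * rate_hz * n_hsd
    return n_bytes_per_s / fex_reduction / 1e9


one_hsd_1mhz = hsd_rate_gb_per_s(8000, 1e6)
two_hsd_1mhz_fex = hsd_rate_gb_per_s(8000, 1e6, n_hsd=2, fex_reduction=8)
two_hsd_100khz = hsd_rate_gb_per_s(8000, 1e5, n_hsd=2)

print(f"1 hsd  @ 1MHz:            {one_hsd_1mhz:.1f} GB/s")
print(f"2 hsd's @ 1MHz, fex / 8:  {two_hsd_1mhz_fex:.1f} GB/s")
print(f"2 hsd's @ 100kHz:         {two_hsd_100khz:.1f} GB/s")
```

Under that assumption, one hsd at 1MHz gives 16GB/s, two hsd's with a factor-of-8 fex reduction give 4GB/s, and two hsd's at 100kHz give 3.2GB/s.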

We want these outer products:

hsd-hsd outer product (an hsd waveform with itself, so the result is symmetric and only half of it is needed, saving a factor of 2)
hsd-fzp outer product (fzp is a piranha: 2048 values)
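
As an illustration of the two products, here is a minimal NumPy sketch. It is not the code in run_outer_products.py; the function names are hypothetical, and the arrays are small random stand-ins for the 8000-sample hsd waveform and the 2048-value fzp.

```python
import numpy as np


def hsd_hsd_upper(wf):
    """Return the upper triangle (diagonal included) of outer(wf, wf), flattened."""
    rows, cols = np.triu_indices(wf.size)
    return wf[rows] * wf[cols]


def hsd_fzp_outer(wf, fzp):
    """Return the full (len(wf), len(fzp)) outer product."""
    return np.outer(wf, fzp)


# Small stand-in arrays; the notes use 8000 hsd samples and 2048 fzp values.
rng = np.random.default_rng(0)
wf = rng.standard_normal(6).astype(np.float32)
fzp = rng.standard_normal(4).astype(np.float32)

upper = hsd_hsd_upper(wf)        # n*(n+1)/2 unique elements
cross = hsd_fzp_outer(wf, fzp)   # no symmetry to exploit here
print(upper.shape, cross.shape)  # (21,) (6, 4)
```

Storing only the upper triangle keeps n(n+1)/2 elements instead of n*n. At 8000 samples the index arrays from `np.triu_indices` are themselves large, so a production version would probably compute the triangle in blocks.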

Note: the timing values shown below were measured with single-precision variables.

(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ mpirun -n 3 python run_outer_products.py 0 1 2048 59400
nevent: 1 vls: (2048,) outer: (59400,) dt:9.86s. load:0.04s calc:9.82s tt:9.86s. rate:0.10Hz
nevent: 2 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.36s tt:16.22s. rate:0.12Hz
nevent: 3 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.36s tt:22.59s. rate:0.13Hz
nevent: 4 vls: (2048,) outer: (59400,) dt:6.38s. load:0.00s calc:6.38s tt:28.97s. rate:0.14Hz
nevent: 5 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.37s tt:35.34s. rate:0.14Hz
nevent: 6 vls: (2048,) outer: (59400,) dt:6.34s. load:0.00s calc:6.33s tt:41.68s. rate:0.14Hz
nevent: 7 vls: (2048,) outer: (59400,) dt:6.37s. load:0.00s calc:6.36s tt:48.04s. rate:0.15Hz
nevent: 8 vls: (2048,) outer: (59400,) dt:6.35s. load:0.00s calc:6.34s tt:54.39s. rate:0.15Hz
nevent: 8 tt:56.44s. rate:0.00706kHz
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ mpirun -n 3 python run_outer_products.py 0 1 2048 8000
nevent: 1 vls: (2048,) outer: (8000,) dt:0.30s. load:0.03s calc:0.26s tt:0.30s. rate:3.39Hz
nevent: 2 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.46s. rate:4.31Hz
nevent: 3 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.63s. rate:4.76Hz
nevent: 4 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.80s. rate:5.01Hz
nevent: 5 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:0.97s. rate:5.18Hz
nevent: 6 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:1.13s. rate:5.30Hz
nevent: 7 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:1.30s. rate:5.39Hz
nevent: 8 vls: (2048,) outer: (8000,) dt:0.17s. load:0.00s calc:0.16s tt:1.47s. rate:5.45Hz
nevent: 8 tt:4.29s. rate:0.00054kHz

Reducing the waveform length from 59400 to 8000 samples made the 2 outer products run about 40x faster (steady-state dt of 6.37s vs. 0.17s per event).
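
A rough cross-check of that speedup: if the cost is proportional to the number of multiplications, the hsd-hsd product scales with the square of the waveform length and the hsd-fzp product scales linearly. The sketch below assumes the full (non-symmetric) hsd-hsd matrix is computed and ignores memory-bandwidth effects; both are assumptions, not something the logs confirm.

```python
def n_multiplications(n_hsd, n_fzp=2048):
    """Element count of the hsd-hsd plus hsd-fzp outer products."""
    return n_hsd * n_hsd + n_hsd * n_fzp


expected = n_multiplications(59400) / n_multiplications(8000)
observed = 6.37 / 0.17  # steady-state dt values from the logs above

print(f"expected from element count: {expected:.1f}x")
print(f"observed from dt:            {observed:.1f}x")
```

The element-count estimate comes out near 45x and the measured ratio near 37x, so the observed speedup is in line with simple scaling.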

Mona should test:

the "sparse" outer product from fex
the full 8000x8000 product (plus 8000x2048 for fzp):
- from a data-volume perspective we can do 8000 samples from 2 hsd channels per drp node at 100kHz (3.2GB/s), but this still needs to be checked from a CPU perspective.
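
One possible reading of the "sparse" outer product is sketched below: fex keeps only some samples (for example those inside peak windows), so only the rows and columns for the kept sample indices need to be computed. This is an assumption about what "sparse" means here, and the function names and the idx/val layout are hypothetical, not the actual fex output format.

```python
import numpy as np


def add_sparse_hsd_hsd(acc, idx, val):
    """Add outer(val, val) into the rows/columns of acc selected by idx.

    idx must hold unique sample indices, otherwise repeated entries
    would be written once instead of summed.
    """
    acc[np.ix_(idx, idx)] += np.outer(val, val)


def add_sparse_hsd_fzp(acc, idx, val, fzp):
    """Add outer(val, fzp) into the rows of acc selected by idx."""
    acc[idx, :] += np.outer(val, fzp)


# Toy sizes; the notes use 8000 hsd samples and 2048 fzp values.
n_hsd, n_fzp = 16, 4
idx = np.array([2, 3, 4, 10, 11])  # sample indices kept by fex (made up)
val = np.array([1.0, 3.0, 2.0, 5.0, 4.0], dtype=np.float32)
fzp = np.arange(n_fzp, dtype=np.float32)

hh = np.zeros((n_hsd, n_hsd), dtype=np.float32)
hf = np.zeros((n_hsd, n_fzp), dtype=np.float32)
add_sparse_hsd_hsd(hh, idx, val)
add_sparse_hsd_fzp(hf, idx, val, fzp)
```

With k kept samples the hsd-hsd update touches k*k elements instead of 8000*8000, so the CPU cost depends on how sparse the fex output is.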

Memory Usage

Full-matrix memory usage with MPI looks like 8GB for 60 cores at 8000 samples.

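Estimated memory usage in GB for each data type of the variables (b = bytes per element: 8 for float64, 4 for float32, 2 for int16). The terms are two copies each of the vls and hsd vectors, the vls-hsd outer product, and the full hsd-hsd outer product: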

>>> b,vls,hsd=(8,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.643232768
>>> b,vls,hsd=(4,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.321616384
>>> b,vls,hsd=(2,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.160808192

Observed memory usage

Single precision

1 core: 1.3GB

10 cores: 7.9GB (0.8GB per core)

50 cores: 17.3GB (0.35GB per core)

Double precision

50 cores: 50.8GB (1GB per core)

In the future, when running at 1MHz, we will hopefully have a better detector for fzp that also runs at 1MHz.
