Discussions with Taran, Mona on Feb. 9, 2023
- Can limit ourselves to 8000 hsd samples (~1us)
- At 1MHz, 8000 samples is 16GB/s, which is too big for one drp node with a 4-6GB/s limit (especially with 2 hsd's per node)
- can reduce with fex by at least a factor of 8 to get within 4GB/s per drp node
- at 100kHz 8000 samples would work from a data volume perspective
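The data-volume numbers above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 2 bytes per sample (int16), which is the value that reproduces the 16GB/s figure; the function name `data_rate_gb_s` is hypothetical, and reading the 4GB/s figure as "2 hsd's per node after an 8x fex reduction" is also an assumption.

```python
def data_rate_gb_s(n_samples, rate_hz, bytes_per_sample=2, n_hsd=1):
    """Raw data rate in GB/s (1 GB = 1e9 bytes)."""
    return n_samples * bytes_per_sample * rate_hz * n_hsd / 1e9


# One hsd, 8000 samples per event at 1MHz (assumed int16 samples).
full_1mhz = data_rate_gb_s(8000, 1_000_000)

# Two hsd's on one drp node at 1MHz, reduced by the fex factor of 8.
fex_1mhz_node = data_rate_gb_s(8000, 1_000_000, n_hsd=2) / 8

# Two hsd's on one drp node at 100kHz, no reduction.
full_100khz_node = data_rate_gb_s(8000, 100_000, n_hsd=2)

print(f"1MHz, 1 hsd, raw:       {full_1mhz:.1f} GB/s")
print(f"1MHz, 2 hsd, fex / 8:   {fex_1mhz_node:.1f} GB/s")
print(f"100kHz, 2 hsd, raw:     {full_100khz_node:.1f} GB/s")
```

Under these assumptions the three cases come out at 16, 4, and 3.2 GB/s, matching the figures quoted in these notes.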
We want these outer-products:
- hsd-hsd outer product (symmetric, same hsd, save a factor of 2)
- hsd-fzp outer product (fzp is piranha: 2048)
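The two accumulations listed above could be sketched in NumPy as follows. This is an illustration only, not the code in `run_outer_products.py`: the function name, the event format, and the tiny array sizes are assumptions. The hsd-hsd product is accumulated on the upper triangle only, which is one way to exploit the symmetry.

```python
import numpy as np


def accumulate_outer_products(events, n_hsd, n_fzp):
    """Sum hsd-hsd and hsd-fzp outer products over (wf, fzp) events."""
    upper = np.zeros((n_hsd, n_hsd))    # upper triangle of hsd-hsd sum
    cross = np.zeros((n_hsd, n_fzp))    # full hsd-fzp sum
    rows, cols = np.triu_indices(n_hsd)  # includes the diagonal
    for wf, fzp in events:
        # Only the n*(n+1)/2 unique products of the symmetric matrix.
        upper[rows, cols] += wf[rows] * wf[cols]
        cross += np.outer(wf, fzp)
    # Mirror the strict upper triangle to recover the full matrix.
    full = upper + np.triu(upper, k=1).T
    return full, cross


rng = np.random.default_rng(0)
n_hsd, n_fzp = 8, 4  # small stand-ins for 8000 and 2048
events = [(rng.standard_normal(n_hsd), rng.standard_normal(n_fzp))
          for _ in range(3)]
hsd_hsd, hsd_fzp = accumulate_outer_products(events, n_hsd, n_fzp)
print(hsd_hsd.shape, hsd_fzp.shape)
```

The triangle trick halves the number of multiplications in principle, but fancy indexing has its own overhead in NumPy, so whether it is actually faster than a plain `np.outer` at 8000 samples would need to be measured.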
(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ mpirun -n 3 python run_outer_products.py 0 1 2048 59400
nevent: 1 vls: (2048,) inner: (59400,) outer: (59400,) dt:15.693s. load:0.037s calc:15.656s outer vls/wf:0.29s, wf/wf:6.68s add vls/wf:0.29s wf/wf:8.39s tt:15.693s. rate:0.06Hz
nevent: 2 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.523s. load:0.004s calc:10.519s outer vls/wf:0.29s, wf/wf:7.69s add vls/wf:0.09s wf/wf:2.45s tt:26.216s. rate:0.08Hz
nevent: 3 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.526s. load:0.003s calc:10.523s outer vls/wf:0.29s, wf/wf:7.70s add vls/wf:0.09s wf/wf:2.44s tt:36.742s. rate:0.08Hz
nevent: 4 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.445s. load:0.003s calc:10.442s outer vls/wf:0.34s, wf/wf:7.57s add vls/wf:0.09s wf/wf:2.44s tt:47.187s. rate:0.08Hz
nevent: 5 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.627s. load:0.003s calc:10.624s outer vls/wf:0.33s, wf/wf:7.74s add vls/wf:0.09s wf/wf:2.46s tt:57.814s. rate:0.09Hz
nevent: 6 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.420s. load:0.003s calc:10.417s outer vls/wf:0.33s, wf/wf:7.56s add vls/wf:0.09s wf/wf:2.44s tt:68.234s. rate:0.09Hz
nevent: 7 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.580s. load:0.003s calc:10.577s outer vls/wf:0.35s, wf/wf:7.69s add vls/wf:0.09s wf/wf:2.45s tt:78.814s. rate:0.09Hz
nevent: 8 vls: (2048,) inner: (59400,) outer: (59400,) dt:10.469s. load:0.005s calc:10.463s outer vls/wf:0.33s, wf/wf:7.59s add vls/wf:0.09s wf/wf:2.46s tt:89.283s. rate:0.09Hz
nevent: 8 tt:92.53516s. rate:0.011566894771298394kHz

(ps-4.5.26) monarin@drp-srcf-eb003 (master *) tmolw8819 👁)$ mpirun -n 3 python run_outer_products.py 0 1 2048 8000
nevent: 1 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.495s. load:0.035s calc:0.461s outer vls/wf:0.04s, wf/wf:0.18s add vls/wf:0.05s wf/wf:0.19s tt:0.495s. rate:2.02Hz
nevent: 2 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.323s. load:0.004s calc:0.319s outer vls/wf:0.06s, wf/wf:0.21s add vls/wf:0.01s wf/wf:0.05s tt:0.818s. rate:2.44Hz
nevent: 3 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.323s. load:0.003s calc:0.321s outer vls/wf:0.06s, wf/wf:0.21s add vls/wf:0.01s wf/wf:0.05s tt:1.142s. rate:2.63Hz
nevent: 4 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.319s. load:0.003s calc:0.316s outer vls/wf:0.06s, wf/wf:0.20s add vls/wf:0.01s wf/wf:0.05s tt:1.461s. rate:2.74Hz
nevent: 5 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.324s. load:0.003s calc:0.321s outer vls/wf:0.06s, wf/wf:0.21s add vls/wf:0.01s wf/wf:0.05s tt:1.784s. rate:2.80Hz
nevent: 6 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.319s. load:0.003s calc:0.316s outer vls/wf:0.06s, wf/wf:0.20s add vls/wf:0.01s wf/wf:0.05s tt:2.103s. rate:2.85Hz
nevent: 7 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.323s. load:0.003s calc:0.320s outer vls/wf:0.06s, wf/wf:0.21s add vls/wf:0.01s wf/wf:0.05s tt:2.426s. rate:2.89Hz
nevent: 8 vls: (2048,) inner: (8000,) outer: (8000,) dt:0.322s. load:0.005s calc:0.317s outer vls/wf:0.05s, wf/wf:0.20s add vls/wf:0.01s wf/wf:0.05s tt:2.748s. rate:2.91Hz
nevent: 8 tt:9.95117s. rate:0.00124389665434137kHz
Reducing wf length from 59400 to 8000,
- wf/wf outer product is 36.6x faster
- vls/wf outer product is 5.5x faster
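For context, the measured speedups can be compared with what pure operation counting would predict: the wf/wf outer product scales with the square of the waveform length, and the vls/wf product scales linearly. The sketch below uses representative per-event timings picked from the log; which events the quoted 36.6x and 5.5x were derived from is an assumption.

```python
n_long, n_short = 59400, 8000
ratio = n_long / n_short

# Expected from operation counts alone.
expected_wf_wf = ratio ** 2   # n x n products
expected_vls_wf = ratio       # 2048 x n products

# Representative "outer" timings in seconds, taken from the log
# (59400-sample run vs 8000-sample run).
measured_wf_wf = 7.69 / 0.21
measured_vls_wf = 0.33 / 0.06

print(f"wf/wf:  expected {expected_wf_wf:.1f}x, measured {measured_wf_wf:.1f}x")
print(f"vls/wf: expected {expected_vls_wf:.1f}x, measured {measured_vls_wf:.1f}x")
```

Both measured speedups fall short of the operation-count estimate (roughly 55x and 7.4x), which may point to fixed per-call overhead or memory-bandwidth effects, though the log alone does not say which.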
Mona should test:
- the "sparse" outer product from fex
- full 8000x8000 (plus 8000x2048 for fzp):
- we can do 8000 samples from 2 hsd channels per drp node at 100kHz (3.2GB/s) from a data-volume perspective but need to check from a CPU perspective.
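One possible shape for the "sparse" test is sketched below. It assumes, hypothetically, that fex hands back only the sample indices that passed the threshold together with their values; the actual fex output format is not described in these notes, and `add_sparse_outer` is an illustrative name. Only the rows and columns touched by those samples are updated in the accumulator.

```python
import numpy as np


def add_sparse_outer(acc, idx, vals):
    """Add outer(vals, vals) into acc at rows/columns idx.

    idx must hold unique sample indices (assumed fex output);
    vals are the waveform values at those indices.
    """
    acc[np.ix_(idx, idx)] += np.outer(vals, vals)
    return acc


n_hsd = 16                        # small stand-in for 8000
idx = np.array([2, 3, 4, 10])     # samples kept by fex (made up)
vals = np.array([1.0, -2.0, 0.5, 3.0])

acc = np.zeros((n_hsd, n_hsd))
add_sparse_outer(acc, idx, vals)

# Cross-check against the dense outer product of the zero-padded waveform.
wf = np.zeros(n_hsd)
wf[idx] = vals
print(np.allclose(acc, np.outer(wf, wf)))
```

With k surviving samples the update costs k*k multiplications instead of 8000*8000, so the benefit depends entirely on how sparse the fex output is in practice.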
Full-matrix memory usage with MPI: looks like 8GB for 60 cores with 8000 samples.
Estimates of memory usage in GB based on the data type of the variables (float64, float32, and int16, i.e. b = 8, 4, and 2 bytes per element):
>>> b,vls,hsd=(8,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.643232768
>>> b,vls,hsd=(4,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.321616384
>>> b,vls,hsd=(2,2048,8000)
>>> ((vls*b*2)+(hsd*b*2)+(vls*hsd*b)+(hsd*hsd*b))/1e9
0.160808192
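The same estimate can be written as a small helper that takes the element size from the NumPy dtype. This is a sketch: `accumulator_gb` is a hypothetical name, and scaling the per-rank figure to 60 ranks assumes every MPI rank holds its own full copy of the matrices, which these notes do not state.

```python
import numpy as np


def accumulator_gb(dtype, vls=2048, hsd=8000):
    """Memory in GB for the vectors and accumulated outer-product matrices."""
    b = np.dtype(dtype).itemsize  # bytes per element
    return ((vls * b * 2) + (hsd * b * 2)
            + (vls * hsd * b) + (hsd * hsd * b)) / 1e9


n_ranks = 60  # assumed: one full copy per rank
for dtype in (np.float64, np.float32, np.int16):
    per_rank = accumulator_gb(dtype)
    print(f"{np.dtype(dtype).name}: {per_rank:.3f} GB per rank, "
          f"{n_ranks * per_rank:.1f} GB for {n_ranks} ranks")
```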
In the future at 1MHz, we will hopefully have a better detector for fzp that also runs at 1MHz.