...
We tested the 6 outer products outlined above on drp-srcf-eb002.
Without filling the full matrices, the best rate is 1.25 kHz for all 6 operations and about 4 kHz for the most important ones.
With filling, we accumulate the results back into the full-size matrices (three of 8000 x 8000, two of 8000 x 2048, and one of 2048 x 2048) on s3df. The performance per core is around 400 Hz, which scales to about 1 MHz with 2500 cores (20 nodes).
Code Block | ||
---|---|---|
| ||
(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00077 fzp: 0.00003 total:0.00080
Rate (kHz) : 20 blobs 1.30 fzp: 33.29 total:1.25
(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00022 fzp: 0.00003 total:0.00025
Rate (kHz) : 20 blobs 4.64 fzp: 33.45 total:4.07 |
With the full matrices filled:
Code Block | ||
---|---|---|
| ||
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00671 fzp: 0.00006 total:0.00677
Rate (kHz) : 20 blobs 0.15 fzp: 17.06 total:0.15
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00058 fzp: 0.00006 total:0.00064
Rate (kHz) : 20 blobs 1.71 fzp: 16.71 total:1.55 |
Python script for the results above (with matrix filling):
Code Block | ||||
---|---|---|---|---|
| ||||
import numpy as np
import time
import torch
dtype = np.float32
ctor = np
n_samples = 50
n_blobs = 20
n_fzp_samples = 200
ehsd = ctor.random.rand(1000).reshape((n_blobs, n_samples)).astype(dtype)
ihsd = ctor.random.rand(1000).reshape((n_blobs, n_samples)).astype(dtype)
fzp = ctor.random.rand(n_fzp_samples).astype(dtype)
n_events = 10
tt = ctor.zeros((n_events,3))
o_ehsd_ehsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ihsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_ihsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_fzp_fzp = ctor.zeros((n_fzp_samples, n_fzp_samples), dtype=dtype)
for i in range(n_events):
    t0 = time.monotonic()
    # Per-blob outer products, written into the preallocated output arrays
    for i_blob, (_ehsd, _ihsd) in enumerate(zip(ehsd, ihsd)):
        o_ehsd_ehsd[i_blob,:] = ctor.outer(_ehsd, _ehsd)
        o_ihsd_ihsd[i_blob,:] = ctor.outer(_ihsd, _ihsd)
        o_ehsd_ihsd[i_blob,:] = ctor.outer(_ehsd, _ihsd)
        o_ehsd_fzp[i_blob,:] = ctor.outer(_ehsd, fzp)
        o_ihsd_fzp[i_blob,:] = ctor.outer(_ihsd, fzp)
    t1 = time.monotonic()
    o_fzp_fzp[:] = ctor.outer(fzp, fzp)
    t2 = time.monotonic()
    # Columns: blob outer products, fzp outer product, total
    tt[i, :] = [t1-t0, t2-t1, t2-t0]
print(f'{ehsd.shape=},{fzp.shape=} {dtype=}')
mean_tt = np.mean(tt, axis=0)
print(f'Elapsed Time (s): {n_blobs} blobs {mean_tt[0]:.5f} fzp: {mean_tt[1]:.5f} total:{mean_tt[2]:.5f}')
rate = (n_events/np.sum(tt, axis=0))*1e-3
print(f'Rate (kHz) : {n_blobs} blobs {rate[0]:.2f} fzp: {rate[1]:.2f} total:{rate[2]:.2f}')
|
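For comparison, the per-blob loop in the script above can also be expressed as batched outer products. The sketch below is an alternative formulation using `np.einsum`, not the method that was benchmarked on this page. The helper name `batched_outer` is hypothetical, and no claim is made about its speed relative to the loop; it would need to be timed on the same hardware.

```python
import numpy as np

dtype = np.float32
n_blobs, n_samples, n_fzp_samples = 20, 50, 200

# Same shapes as in the test script; a seeded generator is used here
# only to make the example reproducible.
rng = np.random.default_rng(0)
ehsd = rng.random((n_blobs, n_samples), dtype=dtype)
ihsd = rng.random((n_blobs, n_samples), dtype=dtype)
fzp = rng.random(n_fzp_samples, dtype=dtype)


def batched_outer(ehsd, ihsd, fzp):
    """Compute the six outer products without an explicit Python loop.

    Hypothetical helper: 'b' indexes the blob, 'i' and 'j' index samples.
    """
    o_ehsd_ehsd = np.einsum('bi,bj->bij', ehsd, ehsd)
    o_ihsd_ihsd = np.einsum('bi,bj->bij', ihsd, ihsd)
    o_ehsd_ihsd = np.einsum('bi,bj->bij', ehsd, ihsd)
    # fzp is shared by all blobs, so it carries no 'b' index.
    o_ehsd_fzp = np.einsum('bi,j->bij', ehsd, fzp)
    o_ihsd_fzp = np.einsum('bi,j->bij', ihsd, fzp)
    o_fzp_fzp = np.outer(fzp, fzp)
    return (o_ehsd_ehsd, o_ihsd_ihsd, o_ehsd_ihsd,
            o_ehsd_fzp, o_ihsd_fzp, o_fzp_fzp)


outs = batched_outer(ehsd, ihsd, fzp)
for name, arr in zip(
        ['ehsd_ehsd', 'ihsd_ihsd', 'ehsd_ihsd', 'ehsd_fzp', 'ihsd_fzp', 'fzp_fzp'],
        outs):
    print(f'{name}: shape={arr.shape} dtype={arr.dtype}')
```

Each slice `outs[k][b]` should match what `np.outer` gives for blob `b` in the loop version, so the two formulations can be checked against each other before comparing timings.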
Memory usage on one node with 50 processes
#Cores: 2500 myhost='sdfmilan003' Time: 26.39476s Rate: 947.16kHz |
The script for this performance test is test_fast_outer_filling.py; it was submitted with submit_slac.sh.
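The figures quoted on this page can be cross-checked with a short back-of-envelope calculation. The sketch below compares the projected rate (400 Hz per core on 2500 cores) with the measured 947.16 kHz, and estimates the size of one set of full-size matrices. The 4 bytes per element is an assumption based on the `float32` dtype used in the test script; the page does not state the dtype of the full-size matrices, nor whether each process holds its own copy.

```python
# Figures quoted on this page
per_core_hz = 400.0      # approximate per-core rate with filling
n_cores = 2500
n_nodes = 20
measured_khz = 947.16    # reported for 2500 cores

projected_khz = per_core_hz * n_cores / 1e3
cores_per_node = n_cores // n_nodes
measured_per_core_hz = measured_khz * 1e3 / n_cores

print(f'Projected rate: {projected_khz:.2f} kHz on {n_cores} cores')
print(f'Cores per node: {cores_per_node}')
print(f'Measured per-core rate: {measured_per_core_hz:.1f} Hz')

# Full-size matrices: 3 of 8000 x 8000, 2 of 8000 x 2048, 1 of 2048 x 2048
shapes = [(8000, 8000)] * 3 + [(8000, 2048)] * 2 + [(2048, 2048)]
bytes_per_element = 4    # ASSUMPTION: float32, as in the test script

n_elements = sum(rows * cols for rows, cols in shapes)
total_bytes = n_elements * bytes_per_element

print(f'Elements per matrix set: {n_elements}')
print(f'Size per matrix set: {total_bytes / 2**30:.3f} GiB')
```

If every process kept a private copy of all six matrices, the per-node footprint would be this size multiplied by the number of processes on the node; that multiplication is left out here because the page does not confirm how the matrices are shared.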
Performance with reduced full data
...