...
We tested the 6 outer products outlined above on drp-srcf-eb002.
Without filling out the full matrices, the best rate is 1.25 kHz for all 6 operations and 4 kHz for the most important ones.
...
language | bash |
---|
...
and accumulate the results back to the full-size matrices (3 of 8000 x 8000, 2 of 8000 x 2048, and 1 of 2048 x 2048) on s3df. The performance per core is around 400 Hz. We scale this up to about 1 MHz with 2048 cores (18 milano nodes).
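As a rough cross-check of the figures above, the aggregate rate and the size of the full matrices can be worked out directly. This is only a back-of-the-envelope sketch: it assumes perfectly linear scaling across cores and `float32` storage (the dtype used in the test script on this page); neither assumption is confirmed for the s3df runs.

```python
# Back-of-the-envelope check of the scaling and memory figures quoted above.
per_core_rate_hz = 400   # "around 400 Hz" per core, from the text
n_cores = 2048           # from the text

# Assumption: rate scales linearly with the number of cores.
aggregate_rate_hz = per_core_rate_hz * n_cores
print(f"aggregate rate: {aggregate_rate_hz / 1e6:.2f} MHz")

# Full-size accumulator matrices listed in the text: (count, rows, cols)
matrices = [(3, 8000, 8000), (2, 8000, 2048), (1, 2048, 2048)]
bytes_per_element = 4    # assumption: float32

n_elements = sum(count * rows * cols for count, rows, cols in matrices)
total_bytes = n_elements * bytes_per_element
print(f"elements: {n_elements:,}")
print(f"memory for one set of matrices: {total_bytes / 2**30:.2f} GiB")
```

Under these assumptions the aggregate comes out at roughly 0.8 MHz, i.e. of order 1 MHz, and one set of full matrices occupies a bit under 1 GiB. Whether each process holds its own copy is not stated here.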
...
The script for this performance test is
...
With filling of the full matrices:
```
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00671 fzp: 0.00006 total:0.00677
Rate (kHz) : 20 blobs 0.15 fzp: 17.06 total:0.15
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00058 fzp: 0.00006 total:0.00064
Rate (kHz) : 20 blobs 1.71 fzp: 16.71 total:1.55
```
Python script for the results above (with matrix filling):
```python
import numpy as np
import time
import torch  # unused while ctor = np

dtype = np.float32
ctor = np  # array module used for allocation and outer products

n_samples = 50
n_blobs = 20
n_fzp_samples = 200

# Random stand-in data: one row per blob for ehsd/ihsd, one vector for fzp
ehsd = ctor.random.rand(n_blobs * n_samples).reshape((n_blobs, n_samples)).astype(dtype)
ihsd = ctor.random.rand(n_blobs * n_samples).reshape((n_blobs, n_samples)).astype(dtype)
fzp = ctor.random.rand(n_fzp_samples).astype(dtype)

n_events = 10
# Per-event timings: [blob outer products, fzp outer product, total]
tt = ctor.zeros((n_events, 3))

# Preallocated output matrices that are filled on every event
o_ehsd_ehsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ihsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_ihsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_fzp_fzp = ctor.zeros((n_fzp_samples, n_fzp_samples), dtype=dtype)

for i in range(n_events):
    t0 = time.monotonic()
    # Five outer products per blob
    for i_blob, (_ehsd, _ihsd) in enumerate(zip(ehsd, ihsd)):
        o_ehsd_ehsd[i_blob, :] = ctor.outer(_ehsd, _ehsd)
        o_ihsd_ihsd[i_blob, :] = ctor.outer(_ihsd, _ihsd)
        o_ehsd_ihsd[i_blob, :] = ctor.outer(_ehsd, _ihsd)
        o_ehsd_fzp[i_blob, :] = ctor.outer(_ehsd, fzp)
        o_ihsd_fzp[i_blob, :] = ctor.outer(_ihsd, fzp)
    t1 = time.monotonic()
    # Sixth outer product: fzp with itself, once per event
    o_fzp_fzp[:] = ctor.outer(fzp, fzp)
    t2 = time.monotonic()
    tt[i, :] = [t1 - t0, t2 - t1, t2 - t0]

print(f'{ehsd.shape=},{fzp.shape=} {dtype=}')
mean_tt = np.mean(tt, axis=0)
print(f'Elapsed Time (s): {n_blobs} blobs {mean_tt[0]:.5f} fzp: {mean_tt[1]:.5f} total:{mean_tt[2]:.5f}')
# Events per second, converted to kHz
rate = (n_events / np.sum(tt, axis=0)) * 1e-3
print(f'Rate (kHz) : {n_blobs} blobs {rate[0]:.2f} fzp: {rate[1]:.2f} total:{rate[2]:.2f}')
```
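For comparison, the per-blob Python loop in the script above could in principle be replaced by batched outer products. The sketch below uses `np.einsum` for this; it is not part of the tested script, the random inputs are stand-ins, and whether it is actually faster on the hardware used here would have to be measured.

```python
import numpy as np

# Stand-in inputs with the same shapes and dtype as in the test script.
rng = np.random.default_rng(0)
n_blobs, n_samples, n_fzp_samples = 20, 50, 200
ehsd = rng.random((n_blobs, n_samples), dtype=np.float32)
ihsd = rng.random((n_blobs, n_samples), dtype=np.float32)
fzp = rng.random(n_fzp_samples, dtype=np.float32)

# Batched outer products: index b runs over blobs, i and j over samples.
o_ehsd_ehsd = np.einsum('bi,bj->bij', ehsd, ehsd)
o_ihsd_ihsd = np.einsum('bi,bj->bij', ihsd, ihsd)
o_ehsd_ihsd = np.einsum('bi,bj->bij', ehsd, ihsd)
# fzp is shared by all blobs, so it carries no blob index.
o_ehsd_fzp = np.einsum('bi,j->bij', ehsd, fzp)
o_ihsd_fzp = np.einsum('bi,j->bij', ihsd, fzp)
o_fzp_fzp = np.outer(fzp, fzp)

# Cross-check two of the results against the explicit per-blob loop.
ref_ee = np.stack([np.outer(e, e) for e in ehsd])
ref_ef = np.stack([np.outer(e, fzp) for e in ehsd])
assert np.allclose(o_ehsd_ehsd, ref_ee)
assert np.allclose(o_ehsd_fzp, ref_ef)
```

The output shapes match the preallocated arrays in the script, so the results could be assigned into them directly if this variant were to be timed.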
Memory usage on one node with 50 processes
_filling.py and was submitted with submit_s3df.sh
Performance with reduced full data
...