...

We tested the 6 outer products outlined above on drp-srcf-eb002.

Without filling out full matrices, the best rate is 1.25 kHz for all 6 operations and 4 kHz for the most important ones.

We also accumulated the results back into the full-size matrices (3 of 8000 x 8000, 2 of 8000 x 2048, and 1 of 2048 x 2048) on s3df. The per-core rate is around 400 Hz. We scaled this up to about 1 MHz with 2500 cores (20 nodes).
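As a rough sanity check on these numbers, the sketch below works out the ideal aggregate rate from the quoted per-core rate, and the memory footprint of the six full-size matrices. It assumes perfectly linear scaling across cores and float32 storage (the dtype used in test_fast_outer.py). Neither assumption is stated for the s3df run.

```python
# Back-of-the-envelope check; assumes linear scaling and float32 storage.
per_core_rate_hz = 400      # quoted per-core rate
n_cores = 2500              # quoted core count

ideal_rate_hz = per_core_rate_hz * n_cores
print(f"Ideal aggregate rate: {ideal_rate_hz / 1e6:.2f} MHz")

# (count, rows, cols) for the six full-size matrices listed above
full_matrices = [(3, 8000, 8000), (2, 8000, 2048), (1, 2048, 2048)]
bytes_per_element = 4       # float32 (assumption)

n_elements = sum(count * rows * cols for count, rows, cols in full_matrices)
total_bytes = n_elements * bytes_per_element
print(f"Elements: {n_elements:,}  Memory: {total_bytes / 1e9:.2f} GB per copy")
```

Under these assumptions one full set of accumulators takes a bit under 1 GB, so memory per node grows quickly if every process keeps its own copy.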

Code Block
languagebash
(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py 
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00077 fzp: 0.00003 total:0.00080
Rate (kHz)      : 20 blobs 1.30 fzp: 33.29 total:1.25
(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py 
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00022 fzp: 0.00003 total:0.00025
Rate (kHz)      : 20 blobs 4.64 fzp: 33.45 total:4.07

With filling full matrices:

Code Block
languagebash
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00671 fzp: 0.00006 total:0.00677
Rate (kHz)      : 20 blobs 0.15 fzp: 17.06 total:0.15
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00058 fzp: 0.00006 total:0.00064
Rate (kHz)      : 20 blobs 1.71 fzp: 16.71 total:1.55

Python script for the results above (with matrix filling):

Code Block
languagepy
titletest_fast_outer.py
import numpy as np
import time
import torch

dtype = np.float32

ctor = np

n_samples = 50
n_blobs = 20
n_fzp_samples = 200
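# Random test inputs: one row of n_samples per blob, plus a single fzp waveform
ehsd = ctor.random.rand(n_blobs * n_samples).reshape((n_blobs, n_samples)).astype(dtype)
ihsd = ctor.random.rand(n_blobs * n_samples).reshape((n_blobs, n_samples)).astype(dtype)
fzp = ctor.random.rand(n_fzp_samples).astype(dtype)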

n_events = 10
# Per-event timings: [blob outer products, fzp outer product, total]
tt = ctor.zeros((n_events, 3))
# Preallocated output buffers, one block per blob
o_ehsd_ehsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ihsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_ihsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_fzp_fzp = ctor.zeros((n_fzp_samples, n_fzp_samples), dtype=dtype)
for i in range(n_events):
    t0 = time.monotonic()
    # Five outer products per blob, written into the preallocated buffers
    for i_blob, (_ehsd, _ihsd) in enumerate(zip(ehsd, ihsd)):
        o_ehsd_ehsd[i_blob,:] = ctor.outer(_ehsd, _ehsd)
        o_ihsd_ihsd[i_blob,:] = ctor.outer(_ihsd, _ihsd)
        o_ehsd_ihsd[i_blob,:] = ctor.outer(_ehsd, _ihsd)
        o_ehsd_fzp[i_blob,:] = ctor.outer(_ehsd, fzp)
        o_ihsd_fzp[i_blob,:] = ctor.outer(_ihsd, fzp)
    t1 = time.monotonic()
    # The fzp x fzp outer product is computed once per event
    o_fzp_fzp[:] = ctor.outer(fzp, fzp)
    t2 = time.monotonic()
    tt[i, :] = [t1-t0, t2-t1, t2-t0]

print(f'{ehsd.shape=},{fzp.shape=} {dtype=}')
mean_tt = np.mean(tt, axis=0)
print(f'Elapsed Time (s): {n_blobs} blobs {mean_tt[0]:.5f} fzp: {mean_tt[1]:.5f} total:{mean_tt[2]:.5f}')
rate = (n_events/np.sum(tt, axis=0))*1e-3
print(f'Rate (kHz)      : {n_blobs} blobs {rate[0]:.2f} fzp: {rate[1]:.2f} total:{rate[2]:.2f}')
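The per-blob loop in the script can also be written as batched outer products. The following is a minimal sketch using np.einsum, with the same array shapes as the script. It is an alternative formulation, not the code that produced the timings above, and whether it is faster on drp-srcf-eb002 has not been measured here. The seeded random generator is only there to make the example reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blobs, n_samples, n_fzp_samples = 20, 50, 200
ehsd = rng.random((n_blobs, n_samples), dtype=np.float32)
ihsd = rng.random((n_blobs, n_samples), dtype=np.float32)
fzp = rng.random(n_fzp_samples, dtype=np.float32)

# Batched outer products: 'bi,bj->bij' pairs row b of the first operand
# with row b of the second, giving one (n_samples, n_samples) block per blob.
o_ehsd_ehsd = np.einsum('bi,bj->bij', ehsd, ehsd)
o_ihsd_ihsd = np.einsum('bi,bj->bij', ihsd, ihsd)
o_ehsd_ihsd = np.einsum('bi,bj->bij', ehsd, ihsd)
# fzp is shared by all blobs, so it carries no batch index.
o_ehsd_fzp = np.einsum('bi,j->bij', ehsd, fzp)
o_ihsd_fzp = np.einsum('bi,j->bij', ihsd, fzp)
o_fzp_fzp = np.outer(fzp, fzp)

# Cross-check one blob against the per-blob np.outer call used in the script.
assert np.allclose(o_ehsd_ihsd[3], np.outer(ehsd[3], ihsd[3]))
print(o_ehsd_ehsd.shape, o_ehsd_fzp.shape, o_fzp_fzp.shape)
```

Each einsum call allocates a new output array, whereas the script writes into preallocated buffers, so a fair timing comparison would need to account for that difference.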

Memory usage on one node with 50 processes

#Cores: 2500 myhost='sdfmilan003' Time: 26.39476s Rate: 947.16kHz

The script for this performance test is test_fast_outer_filling.py; it was submitted with submit_slac.sh.

Performance with reduced full data

...