...

We tested the 6 outer products outlined above on drp-srcf-eb002.

Without filling the full matrices, the best total rate is 1.25 kHz for all 6 operations and about 4 kHz for the most important ones.

The results are then accumulated back into the full-size matrices (three of 8000 x 8000, two of 8000 x 2048, and one of 2048 x 2048) on S3DF. The per-core rate is around 400 Hz. We scale this up to about 1 MHz with 2500 cores (20 nodes).
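The arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: it assumes float32 storage (the dtype used in test_fast_outer.py), ideal linear scaling across cores, and that each process holds its own copy of all six accumulators. The variable names are illustrative only.

```python
import numpy as np

# Full-size matrix shapes quoted in the text above.
shapes = [(8000, 8000)] * 3 + [(8000, 2048)] * 2 + [(2048, 2048)]

# Assumption: float32 storage, matching dtype in test_fast_outer.py.
itemsize = np.dtype(np.float32).itemsize

total_elements = sum(rows * cols for rows, cols in shapes)
total_bytes = total_elements * itemsize

# Approximate per-core rate and core count quoted in the text.
per_core_rate_hz = 400
n_cores = 2500
# Assumption: ideal linear scaling with the number of cores.
aggregate_rate_hz = per_core_rate_hz * n_cores

# Assumption: 50 processes per node, each with its own accumulators.
n_procs_per_node = 50
node_bytes = total_bytes * n_procs_per_node

print(f"elements per accumulator set : {total_elements:,}")
print(f"memory per accumulator set   : {total_bytes / 1e9:.2f} GB")
print(f"memory for {n_procs_per_node} processes      : {node_bytes / 1e9:.1f} GB")
print(f"aggregate rate (ideal)       : {aggregate_rate_hz / 1e6:.2f} MHz")
```

Under these assumptions one set of accumulators takes a little under 1 GB, so the per-node footprint is dominated by how many processes each hold a private copy.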

Code Block

language	bash

(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py 
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00077 fzp: 0.00003 total:0.00080
Rate (kHz)      : 20 blobs 1.30 fzp: 33.29 total:1.25
(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py 
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00022 fzp: 0.00003 total:0.00025
Rate (kHz)      : 20 blobs 4.64 fzp: 33.45 total:4.07

With filling the full matrices:

Code Block

language	bash

ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00671 fzp: 0.00006 total:0.00677
Rate (kHz)      : 20 blobs 0.15 fzp: 17.06 total:0.15
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00058 fzp: 0.00006 total:0.00064
Rate (kHz)      : 20 blobs 1.71 fzp: 16.71 total:1.55

Python script for the results above (with matrix filling):

Code Block

language	py
title	test_fast_outer.py

import numpy as np
import time
import torch  # not used while ctor = np

dtype = np.float32

ctor = np

n_samples = 50
n_blobs = 20
n_fzp_samples = 200
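# Size the random inputs from the parameters above instead of a hard-coded 1000
ehsd = ctor.random.rand(n_blobs * n_samples).reshape((n_blobs, n_samples)).astype(dtype)
ihsd = ctor.random.rand(n_blobs * n_samples).reshape((n_blobs, n_samples)).astype(dtype)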
fzp = ctor.random.rand(n_fzp_samples).astype(dtype)

n_events = 10
tt = ctor.zeros((n_events,3))
o_ehsd_ehsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ihsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_ihsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_fzp_fzp = ctor.zeros((n_fzp_samples, n_fzp_samples), dtype=dtype)      
for i in range(n_events):
    t0 = time.monotonic()
    # Five per-blob outer products, written into the preallocated outputs
    for i_blob, (_ehsd, _ihsd) in enumerate(zip(ehsd, ihsd)):
        o_ehsd_ehsd[i_blob,:] = ctor.outer(_ehsd, _ehsd)
        o_ihsd_ihsd[i_blob,:] = ctor.outer(_ihsd, _ihsd)
        o_ehsd_ihsd[i_blob,:] = ctor.outer(_ehsd, _ihsd)
        o_ehsd_fzp[i_blob,:] = ctor.outer(_ehsd, fzp)
        o_ihsd_fzp[i_blob,:] = ctor.outer(_ihsd, fzp)
    t1 = time.monotonic()
    # Sixth outer product: fzp with itself (once per event, not per blob)
    o_fzp_fzp[:] = ctor.outer(fzp, fzp)
    t2 = time.monotonic()
    # Columns: per-blob products, fzp product, total
    tt[i, :] = [t1-t0, t2-t1, t2-t0]


print(f'{ehsd.shape=},{fzp.shape=} {dtype=}')
mean_tt = np.mean(tt, axis=0)
print(f'Elapsed Time (s): {n_blobs} blobs {mean_tt[0]:.5f} fzp: {mean_tt[1]:.5f} total:{mean_tt[2]:.5f}')
rate = (n_events/np.sum(tt, axis=0))*1e-3
print(f'Rate (kHz)      : {n_blobs} blobs {rate[0]:.2f} fzp: {rate[1]:.2f} total:{rate[2]:.2f}')
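As a possible variation on the per-blob loop in the script above, the outer products for all blobs can be computed in one NumPy broadcasting expression. The sketch below only checks that the batched form reproduces the looped np.outer results for the same array shapes; it has not been benchmarked here, and whether it is faster on drp-srcf-eb002 or S3DF is an open question. The names `ref_ee`, `bat_ee`, `ref_ef` and `bat_ef` are illustrative and do not appear in the original script.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blobs, n_samples, n_fzp_samples = 20, 50, 200
ehsd = rng.random((n_blobs, n_samples), dtype=np.float32)
fzp = rng.random(n_fzp_samples, dtype=np.float32)

# Reference: one np.outer call per blob, as in test_fast_outer.py
ref_ee = np.stack([np.outer(row, row) for row in ehsd])
ref_ef = np.stack([np.outer(row, fzp) for row in ehsd])

# Batched: broadcasting adds the second operand along a new trailing axis
bat_ee = ehsd[:, :, None] * ehsd[:, None, :]    # (n_blobs, n_samples, n_samples)
bat_ef = ehsd[:, :, None] * fzp[None, None, :]  # (n_blobs, n_samples, n_fzp_samples)

print(bat_ee.shape, bat_ef.shape)
print(np.allclose(ref_ee, bat_ee), np.allclose(ref_ef, bat_ef))
```

The same pattern applies to the ihsd products; the fzp-with-fzp product has no blob axis, so np.outer(fzp, fzp) is already a single call.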

Memory usage on one node with 50 processes

#Cores: 2500 myhost='sdfmilan003' Time: 26.39476s Rate: 947.16kHz

The script for this performance test is test_fast_outer_filling.py; it was submitted with submit_slac.sh.
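For comparison with the roughly 400 Hz per-core figure quoted earlier, the measured line above can be broken down as follows. This is a small sketch using only the numbers printed in that line; treating the rate as events per second, and comparing against 400 Hz as the single-core reference, are assumptions.

```python
# Values copied from the "#Cores: 2500 ..." output line above.
n_cores = 2500
elapsed_s = 26.39476
measured_rate_hz = 947.16e3

# Approximate single-core rate quoted earlier on this page (assumed reference).
per_core_reference_hz = 400

per_core_measured_hz = measured_rate_hz / n_cores
scaling_efficiency = per_core_measured_hz / per_core_reference_hz

# Assumption: the rate counts events per second over the whole elapsed time.
events_total = measured_rate_hz * elapsed_s

print(f"per-core rate      : {per_core_measured_hz:.1f} Hz")
print(f"scaling efficiency : {scaling_efficiency:.1%}")
print(f"events processed   : {events_total / 1e6:.1f} M")
```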

Performance with reduced full data

...
