Page History

...

We tested the 6 outer products outlined above on drp-srcf-eb002. The

Without filling out full matrices, the best rate is 1.25 kHz for all 6 operations 4kHz for the most important ones.

Code Block

language	bash

(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py 
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00077 fzp: 0.00003 total:0.00080
Rate (kHz)      : 20 blobs 1.30 fzp: 33.29 total:1.25
(ps-4.5.26) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py ) monarin@drp-srcf-eb002 (master *) tmolw8819 👁)$ python test_fast_outer.py 
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00022 fzp: 0.00003 total:0.00025
Rate (kHz)      : 20 blobs 4.64 fzp: 33.45 total:4.07

With filling full matrices,

Code Block

language	bash

ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.00671 fzp: 0.00006 total:0.00677
Rate (kHz)      : 20 blobs 0.15 fzp: 17.06 total:0.15
ehsd.shape=(20, 50),fzp.shape=(200,) dtype=<class 'numpy.float32'>
Elapsed Time (s): 20 blobs 0.0002200058 fzp: 0.0000300006 total:0.0002500064
Rate (kHz)      : 20 blobs 41.6471 fzp: 3316.4571 total:41.0755

Python script for the results above (with matrix filling):

Code Block

language	py
title	test_fast_outer.py

import numpy as np
import time
import torch

dtype = np.float32

ctor = np

n_samples = 50
n_blobs = 20
n_fzp_samples = 200
ehsd = ctor.random.rand(1000).reshape((n_blobs, n_samples)).astype(dtype)
ihsd = ctor.random.rand(1000).reshape((n_blobs, n_samples)).astype(dtype)
fzp = ctor.random.rand(n_fzp_samples).astype(dtype)

n_events = 10
tt = ctor.zeros((n_events,3))
o_ehsd_ehsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ihsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_ihsd = ctor.zeros((n_blobs, n_samples, n_samples), dtype=dtype)
o_ehsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_ihsd_fzp = ctor.zeros((n_blobs, n_samples, n_fzp_samples), dtype=dtype)
o_fzp_fzp = ctor.zeros((n_fzp_samples, n_fzp_samples), dtype=dtype)      
for i in range(n_events):
    t0 = time.monotonic()
    for i_blob, (_ehsd, _ihsd) in enumerate(zip(ehsd, ihsd)):
        o_ehsd_ehsd[i_blob,:] = ctor.outer(_ehsd, _ehsd)
        o_ihsd_ihsd[i_blob,:] = ctor.outer(_ihsd, _ihsd)
        o_ehsd_ihsd[i_blob,:] = ctor.outer(_ehsd, _ihsd)
        o_ehsd_fzp[i_blob,:] = ctor.outer(_ehsd, fzp)
        o_ihsd_fzp[i_blob,:] = ctor.outer(_ihsd, fzp)
    t1 = time.monotonic()
    o_fzp_fzp[:] = ctor.outer(fzp, fzp)
    t2 = time.monotonic()
    tt[i, :] = [t1-t0, t2-t1, t2-t0]
    

print(f'{ehsd.shape=},{fzp.shape=} {dtype=}')
mean_tt = np.mean(tt, axis=0)
print(f'Elapsed Time (s): {n_blobs} blobs {mean_tt[0]:.5f} fzp: {mean_tt[1]:.5f} total:{mean_tt[2]:.5f}')
rate = (n_events/np.sum(tt, axis=0))*1e-3
print(f'Rate (kHz)      : {n_blobs} blobs {rate[0]:.2f} fzp: {rate[1]:.2f} total:{rate[2]:.2f}')

Memory usage on one node with 50 processes

Image Added

Performance with reduced full data

...

Page tree

Versions Compared

Old Version 12

New Version 13

Key

Memory usage on one node with 50 processes

Performance with reduced full data