A project with NERSC

Jupyter Notebooks and Interactivity


To keep the Notebook interactive during a data monitoring session, our goal is to offload any time-consuming processing to processes outside of the Notebook.

We keep processing in the Notebook lightweight by:

  • Using asynchronous code in the Notebook where possible, so that local processing happens only when necessary and is kept to a minimum.
  • Performing CPU-intensive tasks such as data analysis in a separate process (see the sketch after this list) to avoid overloading the Notebook kernel process during data monitoring.
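
As a minimal sketch of the second point, CPU-intensive work can be pushed into a worker process with the standard library; analyze() and the sample input here are hypothetical stand-ins for real analysis code:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def analyze(chunk):
        # Hypothetical stand-in for CPU-intensive analysis.
        return sum(x * x for x in chunk)

    async def analyze_off_kernel(chunk):
        # Run the heavy function in a worker process so the kernel
        # stays free to handle user execution requests.
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor(max_workers=1) as pool:
            return await loop.run_in_executor(pool, analyze, chunk)

    # In a Notebook cell: result = await analyze_off_kernel(range(10**6))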

Data collection


     A Jupyter Notebook kernel cannot keep up directly with the inflow of data from a psana2 cluster without sacrificing interactivity for the user.  The kernel is a single process that must also handle user execution requests from the Notebook.  Instead of using the Jupyter Notebook kernel as the mechanism to collect data directly from psana2, an additional component serves that need, and the kernel then pulls data as needed from that intermediate source.

     Additionally, because psana2 may have many SRV processes (set by $PS_SRV_NODES) accumulating different data at once, some type of data aggregation is also needed to assemble a final coherent data product.

Gathering data from psana2


     The launch script used to start psana2 incorporates a callback mechanism to package collected raw data and send the packaged data to another process that can store or further process it.
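
     A hedged sketch of this callback wiring follows, using psana2's smalldata callback hook.  The experiment, run, and detector names are placeholders, publish() stands in for whatever transport the launch script uses (Redis, Dask, a socket), and the exact smalldata arguments should be checked against the psana2 version in use:

        import numpy as np
        from psana import DataSource

        def publish(name, arr):
            # Placeholder: forward the packaged array to Redis, Dask,
            # a socket, etc. (see the following sections).
            pass

        def batch_callback(batch):
            # Called by psana2 with a batch of accumulated smalldata;
            # package each array and hand it to the consumer process.
            for name, arr in batch.items():
                publish(name, np.asarray(arr))

        ds = DataSource(exp='xpptut15', run=580)     # placeholder exp/run
        smd = ds.smalldata(batch_size=100, callbacks=[batch_callback])
        for run in ds.runs():
            det = run.Detector('detname')            # placeholder detector
            for evt in run.events():
                smd.event(evt, raw=det.raw.calib(evt))
        smd.done()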

     Examples include Redis and Dask (a Dask handoff is sketched below; Redis is covered in the next section), but this is not an exhaustive list of options for storing or further processing data.
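
     For the Dask path, a minimal sketch could hand each packaged array to a cluster for further processing; the scheduler address and the process() function are placeholders:

        from dask.distributed import Client

        client = Client('tcp://scheduler:8786')   # placeholder address

        def process(arr):
            # Hypothetical further processing done on a Dask worker.
            return arr.mean()

        def publish(name, arr):
            # Submit the array to the cluster for further processing;
            # the returned future can be collected later.
            return client.submit(process, arr)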


Caching/Storing data from psana2 with Redis


     Each raw data array can be stored as bytes in a Redis key/value store, which can either be fully in memory for smaller amounts of data or partly backed to disk, depending on how Redis is configured.

     The callback function used with psana2 can package each array into a format compatible with Redis, creating a key that encodes the type information needed to unpack the data later, as sketched below.
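
     A minimal sketch of such a scheme follows; the key layout (detector name, dtype, shape, event number) is illustrative, and a local Redis server is assumed:

        import numpy as np
        import redis

        r = redis.Redis(host='localhost', port=6379)   # assumed server

        def store_array(det_name, evt_num, arr):
            # Encode dtype and shape in the key so the Notebook side
            # can reconstruct the array from raw bytes.
            shape = 'x'.join(str(d) for d in arr.shape)
            key = f'{det_name}:{arr.dtype.name}:{shape}:{evt_num}'
            r.set(key, arr.tobytes())
            return key

        def load_array(key):
            # Recover dtype and shape from the key, then rebuild.
            _, dtype, shape, _ = key.split(':')
            flat = np.frombuffer(r.get(key), dtype=dtype)
            return flat.reshape([int(d) for d in shape.split('x')])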

     

Rate limiting before or during monitoring

Data may be arriving too fast to keep up with from a user-interaction standpoint; in that case, there are ways to reduce the data flow.

Data flow can be adjusted from either the psana2 (server/cluster) side or the Jupyter Notebook (client) side.

Rate limiting from the Jupyter Notebook

From the Jupyter Notebook, data is requested periodically and then rendered.  The rate at which data is requested can be adjusted by modifying the amount of sleep time between data requests.
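
For example, a background polling loop in the Notebook might look like the following, where fetch_latest() and render() are hypothetical hooks for pulling data (e.g. from Redis) and drawing it:

    import asyncio

    async def monitor(fetch_latest, render, interval=1.0):
        # fetch_latest() and render() are placeholder hooks; the
        # interval argument is the rate-limiting knob.
        while True:
            data = fetch_latest()
            if data is not None:
                render(data)
            await asyncio.sleep(interval)

    # In a Notebook cell (the kernel already runs an event loop):
    # task = asyncio.create_task(monitor(fetch_latest, render, interval=2.0))
    # ... later, to stop monitoring: task.cancel()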

Rate limiting from your psana2 code (established prior to monitoring)

As part of your event processing loop, you can include logic that determines how often a data event should be sent.

This can be as simple as a counter that you increment and check, publishing only every 10th or 100th value instead of publishing for every data event.
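
Sketched below as part of a psana2 event loop, reusing the placeholder run, det, and publish names from the earlier sketch:

    PUBLISH_EVERY = 100    # publish every 100th event; tune as needed

    nevents = 0
    for evt in run.events():                 # psana2 event loop
        img = det.raw.calib(evt)             # placeholder detector call
        nevents += 1
        if nevents % PUBLISH_EVERY == 0:
            publish('img', img)              # hand off to Redis/Dask/etc.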


Visualization/Notebook Rendering

     The matplotlib library was used for rendering image data to the Notebook, taking advantage of the ipywidgets Jupyter interactive widgets framework and ipympl (matplotlib JupyterLab extension).
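
     A minimal rendering sketch with ipympl enabled follows; the image size is a placeholder, and render() pairs with the polling loop above:

        # In a JupyterLab cell, with ipympl installed:
        %matplotlib widget

        import numpy as np
        import matplotlib.pyplot as plt

        fig, ax = plt.subplots()
        im = ax.imshow(np.zeros((512, 512)))   # placeholder image size

        def render(img):
            # Update the existing image in place instead of re-plotting,
            # keeping the interactive figure responsive.
            im.set_data(img)
            im.set_clim(float(img.min()), float(img.max()))
            fig.canvas.draw_idle()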


