NOTE: This page describes how one can save HDF5 data by "coding it by hand". psana now provides a more automated way to do this that most users use, as shown here.

This script lives in /reg/g/psdm/tutorials/examplePython/userSmallHDF5.py and uses "h5py" (which is documented at http://www.h5py.org).

The first pattern is used when you want to save all of your small data at the end of the run (i.e., you know how many numbers you are going to save):

Code Block
import numpy as np
import psana
ds = psana.DataSource('exp=xpptut15:run=54:smd')
cspad = psana.Detector('cspad')

# Accumulate one small number per event: the sum over the full detector.
cspad_sums = []
NUMEVENTS = 3
for idx, evt in enumerate(ds.events()):
    if idx >= NUMEVENTS: break
    calib = cspad.calib(evt)
    if calib is None: continue
    cspad_sums.append(np.sum(calib))

# The list is complete, so write it out in one shot at the end of the run.
import h5py
h5out = h5py.File("userSmallData.h5", 'w')
h5out['cspad_sums'] = cspad_sums
h5out.close()
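
The assignment h5out['cspad_sums'] = cspad_sums is h5py shorthand for creating a fixed-size dataset from the list. A minimal sketch of the explicit equivalent (the stand-in values here are hypothetical; in the script above cspad_sums is the list built in the loop):

Code Block
import numpy as np
import h5py

cspad_sums = [1.0, 2.0, 3.0]  # stand-in for the sums computed above

h5out = h5py.File("userSmallData.h5", 'w')
# Equivalent to h5out['cspad_sums'] = cspad_sums: h5py converts the
# list to a numpy array and creates a dataset of matching shape/dtype.
h5out.create_dataset('cspad_sums', data=np.asarray(cspad_sums))
h5out.close()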

In the second pattern we do not assume we know the final size of the dataset: we use HDF5 chunked storage and the resize function to grow a dataset. One might use this pattern when all the data for a run cannot be stored in memory. The example below writes two datasets: one holds small data (the sum of a full cspad) and the other holds large data (a full cspad saved as float32). When using chunked storage for datasets whose size is not known ahead of time, poorly chosen chunk sizes can lead to poor performance. Documentation at http://www.hdfgroup.org covers chunked storage in its advanced topics.

The example below specifies a chunk size of 2048 elements for the small data and 12 elements for the large. Each large element is about 9MB, so each chunk of the large dataset is about 100MB. If you run this example over a large number of events, you will notice it takes slightly longer to process every 12th event: that is when a chunk of the large dataset gets filled and flushed to disk. If you run the example as is, over just 3 events, you will notice that the output file is still quite large, about 100MB, because HDF5 writes only complete chunks, never partial ones. This script lives in /reg/g/psdm/tutorials/examplePython/userLargeHDF5.py:

Code Block
import numpy as np
import psana
import h5py

NUM_EVENTS_TO_WRITE = 3

ds = psana.DataSource('exp=xpptut15:run=54:smd')

h5out = h5py.File("userData.h5", 'w')
# Start both datasets empty; maxshape=(None, ...) allows them to grow.
smallDataSet = h5out.create_dataset('cspad_sums', (0,), dtype='f8',
                                    chunks=(2048,), maxshape=(None,))
# One chunk holds 12 full cspads (12 x 32 x 185 x 388 float32, ~100MB).
largeDataSet = h5out.create_dataset('cspads', (0,32,185,388), dtype='f4',
                                    chunks=(12,32,185,388),
                                    maxshape=(None,32,185,388))
cspad = psana.Detector('cspad', ds.env())

for idx, evt in enumerate(ds.events()):
    if idx >= NUM_EVENTS_TO_WRITE: break
    calib = cspad.calib(evt)
    if calib is None: continue
    # Grow each dataset by one event, then fill in the new slot.
    smallDataSet.resize((idx+1,))
    largeDataSet.resize((idx+1,32,185,388))
    smallDataSet[idx] = np.sum(calib)
    largeDataSet[idx,:] = calib[:]

h5out.close()
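
The ~100MB chunk size quoted above follows directly from the dataset's shape and dtype. A minimal sketch of the arithmetic (pure numpy, no psana needed):

Code Block
import numpy as np
# Bytes in one cspad event: 32 x 185 x 388 float32 values.
event_bytes = 32 * 185 * 388 * np.dtype('f4').itemsize
print(event_bytes / 2**20)        # ~8.8 MB per event
# A chunk of the large dataset holds 12 events.
print(12 * event_bytes / 2**20)   # ~105 MB per chunk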

Good tools for inspecting an HDF5 file are h5ls and h5dump. For example, doing:

 h5ls -d -r userSmallData.h5

shows the dataset and its values:

Code Block
[cpo@psana1511]$ h5ls -d -r userSmallData.h5 
/                        Group
/cspad_sums              Dataset {3}
    Data:
        (0) 23773.12109375, 135712.25, 65513.67578125
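
h5ls and h5dump are convenient for a quick look, but you can also read the files back from Python. A minimal sketch, assuming you have already produced userData.h5 with the script above; note that with chunked storage h5py can pull out a single event without reading the whole 'cspads' dataset into memory:

Code Block
import h5py
h5in = h5py.File("userData.h5", 'r')
print(h5in['cspad_sums'][:])      # all the small data as a numpy array
first_event = h5in['cspads'][0]   # just one event's cspad, shape (32,185,388)
print(first_event.shape, first_event.dtype)
h5in.close()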

A more advanced tutorial on saving data to an HDF5 file can be found on the page: More Advanced Tutorial on Saving Output in Hdf5