...

In the second pattern we do not assume we know the final size of the dataset: we use the HDF5 chunked storage and resize functions to grow a dataset. One might use this pattern if all the data for a run can't be stored in memory. The example below writes two datasets: one holds small data, the sum of a full cspad, and the other holds large data, a full cspad saved as float32. When using chunked storage for datasets whose size is not known ahead of time, poor chunk sizes can lead to poor performance. The documentation at http://www.hdfgroup.org covers chunked storage in the advanced topics.

The example below specifies a chunk size of 2048 elements for the small data and 12 elements for the large. Each large element is about 9MB (32 x 185 x 388 float32 values), so each chunk of the large dataset is about 100MB. If you run this example over a large number of events, you will notice that it takes slightly longer to process every 12th event. This is when a chunk of the large dataset gets filled and flushed to disk. If you run the example as it is, over just 3 events, you will notice that the output file is still quite large, about 100MB - that is because HDF5 does not write partial chunks, only complete chunks. This script lives in /reg/g/psdm/tutorials/examplePython/userSmallHDF5_2.py:

Code Block
languagepython
 
import numpy as np
import psana
import h5py
 
NUM_EVENTS_TO_WRITE=3
 
ds = psana.DataSource('exp=xpptut15:run=54:smd')
 
h5out = h5py.File("userData.h5", 'w')
# both datasets start empty; maxshape=(None,...) makes the first axis resizable
smallDataSet = h5out.create_dataset('cspad_sums', (0,), dtype='f8',
                                    chunks=(2048,), maxshape=(None,))
largeDataSet = h5out.create_dataset('cspads',(0,32,185,388), dtype='f4', 
                                    chunks=(12,32,185,388), 
                                    maxshape=(None,32,185,388))
cspad = psana.Detector('cspad', ds.env())
 
for idx, evt in enumerate(ds.events()):
    if idx >= NUM_EVENTS_TO_WRITE: break
    calib = cspad.calib(evt)
    if calib is None: continue
    # grow both datasets by one event, then fill the new slot
    smallDataSet.resize((idx+1,))
    largeDataSet.resize((idx+1,32,185,388))
    smallDataSet[idx] = np.sum(calib)
    largeDataSet[idx,:] = calib[:]
 
h5out.close()
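
As a quick sanity check on the chunking, you can reopen the file with h5py and print each dataset's shape and chunk layout. This is a minimal sketch, not part of the tutorial script; it assumes the userData.h5 file written above is in the current directory:

Code Block
languagepython
 
import h5py
 
h5in = h5py.File("userData.h5", 'r')
for name in ('cspad_sums', 'cspads'):
    dset = h5in[name]
    # dset.chunks is the chunk shape fixed at creation time
    chunkBytes = dset.dtype.itemsize
    for dim in dset.chunks:
        chunkBytes *= dim
    print("{0}: shape={1} chunks={2} ({3:.1f} MB per chunk)".format(
        name, dset.shape, dset.chunks, chunkBytes / 1.0e6))
h5in.close()

For the large dataset this should report a chunk of about 110MB (12 x 32 x 185 x 388 elements x 4 bytes each), consistent with the roughly 100MB output file seen above even for a 3-event run.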

Good tools for inspecting an h5 file are h5ls and h5dump. For example:
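
The commands below are an illustrative sketch (the file name assumes the userData.h5 written by the script above): h5ls -r recursively lists every dataset in the file, and h5dump -H prints only the header metadata, such as dataset shapes, dtypes and chunk layout, without dumping the data itself.

Code Block
languagebash
 
# list every dataset in the file, with its shape (illustrative file name)
h5ls -r userData.h5
# print only the header metadata (shapes, dtypes, chunk layout), not the data
h5dump -H userData.h5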

...