Using h5py to Save Data

Below is a Python example that gathers some data from an experiment and saves it to a hdf5 file for future use. The gathered data will be used to select which shots to use for future analysis. It demonstrates the use of h5py to write and read compound data types to hdf5 files.

The data used is a CXI tutorial run from /reg/d/psdm/cxi/cxitut13/.

gatherSave.py

import sys
import numpy as np
import h5py
import psana

def gatherData(maxEventsToProcess=None):
    '''Goes through a specific experiment and run, 
    gathers specific values from the data,
    returns a two numpy arrays:
       dataArray: the gathered data
       timeArray: the event times for each row of the gathered data
                  has two fields - 'seconds' 'nanoseconds', both integers
    '''
    dataSource = psana.DataSource("exp=cxitut13:run=0022")
    srcDiode = psana.Source('DetInfo(CxiDg4.0:Ipimb.0)')
    srcGasDet = psana.Source('BldInfo(FEEGasDetEnergy)')
    epicsStore = dataSource.env().epicsStore()

    # define data type for data array and time array
    dataDtype = np.dtype([('aG',np.float), ('aD',np.float), ('aH',np.float),  
                          ('aX',np.float), ('aY',np.float), ('aZ',np.float)])

    timeDtype = np.dtype([('seconds',np.uint32),('nanoseconds',np.uint32)])

    # to minimize realloc of arrays, set the number of elements in a  block size
    blockSize = 100000
    dataArray = np.zeros(blockSize, dtype=dataDtype)
    timeArray = np.zeros(blockSize, dtype=timeDtype)

    # motor calibration
    EncoderScale = 0.005 
    Xoffset = 17050.
    Yoffset = 1056510.
    Zoffset = 1830400.
    
    nextArrayRow = 0
    for eventNumber, evt in enumerate(dataSource.events()):
        if maxEventsToProcess is not None and nextArrayRow >= maxEventsToProcess:
            break
        diode = evt.get(psana.Ipimb.DataV2,srcDiode)
        gasDet = evt.get(psana.Bld.BldDataFEEGasDetEnergy,srcGasDet)
        # check that all data is present in event before adding to dataArray
        if diode is None or gasDet is None:
            continue
        # grow arrays if they are to small
        if nextArrayRow >= timeArray.size:
            timeArray.resize(timeArray.size + blockSize)
            dataArray.resize(dataArray.size + blockSize)
        timeArray[nextArrayRow] = evt.get(psana.EventId).time()
        diodeSum = diode.channel0Volts() + diode.channel1Volts() + \
                   diode.channel2Volts() + diode.channel3Volts()
        dataArray[nextArrayRow]['aD'] = diodeSum
        dataArray[nextArrayRow]['aG'] = gasDet.f_11_ENRC()
        sampleX = epicsStore.getPV('CXI:SC1:MZM:08:ENCPOSITIONGET').value(0)
        sampleY = epicsStore.getPV('CXI:SC1:MZM:09:ENCPOSITIONGET').value(0)
        sampleZ = epicsStore.getPV('CXI:SC1:MZM:10:ENCPOSITIONGET').value(0)
        dataArray[nextArrayRow]['aX'] = sampleX * EncoderScale - Xoffset
        dataArray[nextArrayRow]['aY'] = sampleY * EncoderScale - Yoffset
        dataArray[nextArrayRow]['aZ'] = sampleZ * EncoderScale - Zoffset
        dataArray[nextArrayRow]['aH'] = \
                             epicsStore.getPV('CXI:KB1:MMS:07.RBV').value(0)
        if nextArrayRow % 1000 == 0:
            print "event %8d diode.channel0Volts=%f gasDet.f_11_ENCR=%f" % \
                (eventNumber, diode.channel0Volts(),gasDet.f_11_ENRC())
        nextArrayRow += 1

    # shrink arrays to number of events we stored data from
    timeArray.resize(nextArrayRow)
    dataArray.resize(nextArrayRow)

    return dataArray, timeArray

def H5WriteDataAndTime(dataArray, timeArray, h5filename):
    f = h5py.File(h5filename,'w')
    dataDtype = dataArray.dtype
    dataDset = f.create_dataset('data', dataArray.shape, dataDtype)
    dataDset[:] = dataArray
    timeDtype = timeArray.dtype
    timeDset = f.create_dataset('time',timeArray.shape, timeDtype)
    timeDset[:] = timeArray
    f.close()

def H5ReadDataAndTime(h5filename):
    f = h5py.File(h5filename)
    dataDset = f['data']
    dataArray = dataDset[:]
    timeDset = f['time']
    timeArray = timeDset[:]
    f.close()
    return dataArray, timeArray

if __name__ == '__main__':
    maxArraySize = None
    if len(sys.argv) > 1:
        maxArraySize = int(sys.argv[1])
    dataArray,timeArray = gatherData(maxArraySize)
    H5WriteDataAndTime(dataArray, timeArray, "saved_output.h5")
    dataArray,timeArray = H5ReadDataAndTime("saved_output.h5")

To use this script:

place it in your release directory
run ipython

enter the commands:

import gatherSave

dataArray,timeArray = gatherSave.gatherData()

gatherSave.H5WriteDataAndTime(dataArray, timeArray, "saved_output.h5")

dataArray,timeArray = H5ReadDataAndTime("saved_output.h5")

The function gatherData() is one that needs to be modified for different datasets. H5WriteDataAndTime and H5ReadDataAndTime will not.

dataArray is a numpy array with 6 named fields that gather different values from the events, epics pv's , a value from the gas detector, and the voltage sum of a Diode. The fields have names like 'aD' (the Diode sum) and 'aG' for the gas detector value.

One can work with the data using numpy features as follows:

logicalIndex = dataArray['aG'] > 0.84   # a mask that is 1 when 'aG' is greater than 0.84
data['aD'][logicalIndex]                # the mask is used to get diode values when 'aG' is > than 0.84
logicalIndex.nonzero()[0]               # turn the mask into a list of positions, see the documentation on 
                                        # the numpy function nonzero 
                  # http://docs.scipy.org/doc/numpy/reference/generated/numpy.nonzero.html

# likewise one can do
import numpy as np
np.where(data['aG'] > 0.84)[0]     # the [0] is necessary to get the indicies along the first axis

Details

Below we discuss how things are done in the example.

Creation of Numpy Array with Named Fields

Numpy arrays are very efficient data structures in Python. This example creates two numpy arrays to store the event data. These arrays have named fields which provides a dictionary style access to the data. Note the two numpy dtypes (data types) created that define these arrays:

    # define data type for data array and time array
    dataDtype = np.dtype([('aG',np.float), ('aD',np.float), ('aH',np.float),  
                          ('aX',np.float), ('aY',np.float), ('aZ',np.float)])
    timeDtype = np.dtype([('seconds',np.uint32),('nanoseconds',np.uint32)])

A 32 bit unsigned int is used for the seconds and nanoseconds for the timeArray as this is how these fields are stored in the xtc, but one could use np.int or np.uint as well.

For more information on numpy dtypes visit the documentation: Numpy Dtypes

Checking that all Data is Present in the Event

One needs to check that the data one wants from an event is present. In the example there are two counters - eventNumber and nextArrayRow. eventCounter keeps track of which event we are reading and is only used for printing a status message. nextArrayRow is a zero-up counter of events that include both the diode and gasDet data. If both are not present - we go on to the next event and do not increment nextArrayRow.

Using Array Blocks to Read in Data

Manipulations using numpy arrays are most efficient when the final size of the array is known ahead of time. Although numpy arrays have an append method, using it for each event can lead to a great deal of reallocation's of the arrays. Therefore we start with arrays that hold 100,000 elements, and grow them by 100,000 if we need to. In the end we shrink the arrays down to the size that is used. Note the lines:

    # to minimize realloc of arrays, set the number of elements in a  block size
    blockSize = 100000
    dataArray = np.zeros(blockSize, dtype=dataDtype)
    timeArray = np.zeros(blockSize, dtype=timeDtype)
    ...
        # grow arrays if they are to small
        if nextArrayRow >= timeArray.size:
            timeArray.resize(timeArray.size + blockSize)
            dataArray.resize(dataArray.size + blockSize)
    ...
    # shrink arrays to number of events we stored data from
    timeArray.resize(nextArrayRow)
    dataArray.resize(nextArrayRow)

Writing an HDF5 File of Compound Data

Once we have created the numpy array of named fields, it is straightforward to make a hdf5 file with one dataset that contains the numpy array. See the h5py documentation for more information.

import h5py   

f = h5py.File(h5filename,'w')
dataDtype = dataArray.dtype
dataDset = f.create_dataset('data', dataArray.shape, dataDtype)
dataDset[:] = dataArray
timeDtype = timeArray.dtype
timeDset = f.create_dataset('time',timeArray.shape, timeDtype)
timeDset[:] = timeArray
f.close()

It is important to call the close() method of the h5py.File object when done.

Confluence and Jira now require federated login. Read more.

Page tree

Details

Creation of Numpy Array with Named Fields

Checking that all Data is Present in the Event

Using Array Blocks to Read in Data

Writing an HDF5 File of Compound Data