Interfaces

Python interfaces to HDF5 include h5py and pytables (for which the module name is tables). While we actively use and support h5py, we will not discuss pytables.

H5py

Many of our datasets are one dimensional arrays of a compound type. A compound type is a well defined object in the HDF5 library; it is very much like a C struct. In hdfview (a useful tool for viewing HDF5 files provided by The HDF Group) these datasets often look like two dimensional arrays, but the columns are really the fields of the compound type. When you load a dataset with h5py, it returns a numpy array (with extra metadata attached) that is close to the original HDF5 layout: a 1D array with a numpy dtype that exactly represents the compound type in the HDF5 file.
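
To make the layout concrete, here is a minimal sketch that builds a numpy structured array by hand. The field names and values are made up for illustration; the array only stands in for what h5py returns when a compound dataset is read.

```python
import numpy as np

# Hypothetical compound type with three fields, similar to a C struct.
point_dtype = np.dtype([('x', '<f4'), ('y', '<f4'), ('label', '<u2')])

# A 1D array with one record per row; the "columns" are the fields.
points = np.array([(0.0, 1.5, 7), (2.0, 3.5, 9)], dtype=point_dtype)

n_dims = points.ndim               # 1, not 2
field_names = points.dtype.names   # ('x', 'y', 'label')
y_column = points['y']             # one field across all records
first_record = points[0]           # one record with all of its fields
```

Indexing by field name gives a column-like view, while indexing by integer gives one record, which is why the data can look two dimensional in a viewer even though the array is 1D.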

Issues

The LCLS data makes use of several HDF5 features that can be awkward to handle in h5py:

The compound type
- When the number of fields gets large, the data does not print well when interactively exploring it in h5py.
- Field names are far from the data.
- dtypes, especially those with enums in them, can be very complicated.

Enums
- These are well defined objects in HDF5. In the data they are stored as integers, and a dictionary that maps symbolic names to integers is stored in one spot in the dataset.
- In h5py they are displayed as ints. Extra steps are required to obtain the enum dictionary and translate ints to strings, or to use strings to test the values of the enum.

Vlen data
- h5py (as of version 2.2) only supports variable length strings.
- Our EvrData uses general vlen data to represent the variable number of event codes that occur with each event.
- We have patched h5py to be able to read general vlen data, such as what is in our EvrData.

The current version of h5py works fine with vlen data, but older versions did not support it.

vlen data

Here is an example of how one might work with vlen data, using the EvrData. For each event, the EvrData includes fifoEvents, a variable length list. Each element in the list has three parts: timestampHigh, timestampLow and eventCode. Many users will need to examine the eventCodes. Starting with EvrData::DataV4, one can access event codes without working with vlen data by using the dataset 'present' that has been translated into the hdf5. However, we want to demonstrate how to work with vlen data, so let's write an example that takes the EvrData and flattens it out into a table, where each entry in the table is 0 or 1 depending on whether or not that eventCode fired (this is exactly the content of the 'present' dataset in the hdf5).

import h5py
import numpy as np

f=h5py.File('/reg/d/psdm/xpp/xpptut13/hdf5/xpptut13-r0179.h5','r')
evrData=f['/Configure:0000/Run:0000/CalibCycle:0000/EvrData::DataV3/NoDetector.0:Evr.0/data']
numberOfEvents = len(evrData)    # this gives 483
largestEventCode = max([max([fifoEvent['eventCode'] for fifoEvent in fifoEvents]) for fifoEvents in  evrData['fifoEvents']])
# this gives 162, this is the largest eventCode that occurs in this calib cycle.
eventCodes = np.zeros((numberOfEvents, largestEventCode+1), np.int8)
for eventIndex,fifoEvents in enumerate(evrData['fifoEvents']):
    for fifoEvent in fifoEvents:
        eventCodes[eventIndex, fifoEvent['eventCode']]=1

At this point, eventCodes is a 483 x 163 table of 0/1 values: the rows are the events, and the columns are the event codes 0 through 162. To find which event codes were present and how often each fired, one could do

eventCodesInData = np.where(np.sum(eventCodes,0)!=0)[0]
numberOfTimesEachEventCodeFired = dict(zip(eventCodesInData, np.sum(eventCodes,0)[eventCodesInData]))
# this dict will be
# {41: 242, 
#  42: 121, 
#  67: 98, 
# 140: 483, 
# 162: 69}
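
For readers without access to the file above, the same flattening and counting can be tried on synthetic data. This is only a sketch: the list fifo_per_event is a made-up stand-in for evrData['fifoEvents'], and the variable names are not part of any LCLS package. It is written to run under Python 3 as well as Python 2.

```python
import numpy as np

fifo_dtype = np.dtype([('timestampHigh', '<u4'),
                       ('timestampLow', '<u4'),
                       ('eventCode', '<u4')])

# Synthetic stand-in for evrData['fifoEvents']: one variable-length
# array of records per event (the values are made up).
fifo_per_event = [
    np.array([(1, 10, 41), (1, 11, 140)], dtype=fifo_dtype),
    np.array([(2, 11, 140)], dtype=fifo_dtype),
    np.array([(3, 10, 41), (3, 11, 140), (3, 12, 42)], dtype=fifo_dtype),
]

largest_code = max(int(ev['eventCode'].max()) for ev in fifo_per_event)
table = np.zeros((len(fifo_per_event), largest_code + 1), np.int8)
for event_index, fifo_events in enumerate(fifo_per_event):
    # Integer-array indexing marks every code seen in this event at once.
    table[event_index, fifo_events['eventCode']] = 1

# Count how many events each code fired in, keeping only codes that occur.
counts = table.sum(axis=0)
codes_present = np.nonzero(counts)[0]
times_fired = dict((int(c), int(counts[c])) for c in codes_present)
# times_fired is {41: 2, 42: 1, 140: 3}
```

Note that the table has largest_code + 1 columns, so that column c corresponds directly to event code c.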

One could then construct a logical index array to quickly average the cspad over the 121 events where event code 42 fired:

eventsWith42 = eventCodes[:,42]==1
cspad=f['/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/XppGon.0:Cspad.0/data']
assert len(eventsWith42)==len(cspad), "There are cases when datasets for different types " + \
        "are not aligned due to damage, it is best to do more than this and check that the times datasets are the same between cspad and the evrdata"
cspadAt42 = cspad[eventsWith42]
cspadAt42.shape    # this returns  (121, 32, 185, 388)
avgAt42 = np.average(cspadAt42,0)
avgAt42.shape      # this returns  (32, 185, 388)
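
The assert message above recommends comparing the time datasets rather than only the lengths. A minimal sketch of such a check follows. It assumes each time dataset is a compound type with 'seconds' and 'nanoseconds' fields; that layout, the sample values, and the helper to_key are assumptions made for illustration.

```python
import numpy as np

# Assumed layout of a time dataset (an assumption, not read from a file).
time_dtype = np.dtype([('seconds', '<u4'), ('nanoseconds', '<u4')])

# Made-up stand-ins for the time datasets of two detectors; the second
# detector is missing the middle event.
evr_time = np.array([(100, 5), (100, 9), (101, 2)], dtype=time_dtype)
cspad_time = np.array([(100, 5), (101, 2)], dtype=time_dtype)

def to_key(times):
    # Combine seconds and nanoseconds into one sortable 64-bit integer.
    return (times['seconds'].astype(np.int64) * 1000000000
            + times['nanoseconds'].astype(np.int64))

# Aligned means same length and identical times, row for row.
aligned = (len(evr_time) == len(cspad_time)
           and np.array_equal(to_key(evr_time), to_key(cspad_time)))

# When the datasets are not aligned, keep only the events present in both.
common = np.intersect1d(to_key(evr_time), to_key(cspad_time))
evr_rows = np.nonzero(np.isin(to_key(evr_time), common))[0]
cspad_rows = np.nonzero(np.isin(to_key(cspad_time), common))[0]
```

Here evr_rows and cspad_rows index matching events in the two datasets, so a mask built from the EvrData can be restricted to evr_rows before it is applied to the cspad rows.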

Code to Print Datasets

Below is code that provides a function, printds, to display h5py datasets. The output has the following features:

The header lines give the field names, their types, and identify any vlen data
The enum fields are translated to integers
Multiple lines are used to print vlen data.
Optional arguments let you control which rows and columns are printed.
An optional argument lets you control the formatting of different fields.

If you find that the h5py interface (which is basically a numpy interface) for displaying datasets is producing output that is hard to manage, try printds. At LCLS, this code is in the package h5tools that is part of analysis release 0.10.6 and later. One does

from h5tools import printds

to use the function. One can also obtain the code by downloading

printds.py

putting it in your python path, and importing the function. Note, the file printds.py is a module that contains a function named printds. When downloading the code and adding it to your python path, one would do

importing printds directly

from printds import printds
...
printds(dataset)

to use the function. See the example below, as well as further examples in the code. Note, this function is only for printing datasets. It produces an error when passed any other h5py object (such as a group directory along the way to the dataset).

Tips for Working with H5py

Below are some tips for working with enums, vlen strings, and vlen data in h5py. We'll assume that the following datasets have been loaded into h5py dataset objects named output_maps, data and config:

From an EvrData::ConfigV7 group, the dataset: output_maps
From an EvrData::DataV3 group, the dataset: data
From an Ipimb::ConfigV2 group, the dataset: config

Working with Vlen Data in EvrData::Data

The EvrData::Data dataset has a very complicated dtype that is not easy to navigate:

In [293]: data.dtype
Out[293]: dtype([('fifoEvents', [(({'type': dtype([('timestampHigh', '<u4'), ('timestampLow', '<u4'), ('eventCode', '<u4')])}, 'vlen'), '|O8')])])

Using the printds function (link to download is above) as well as our patched version of h5py, you will get output like

In [10]: printds(data)
                                fifoEvents (vlen)
rowIdx     timestampHigh, timestampLow, eventCode
                  uint32,       uint32,    uint32
     0 [(       118401,        12966,        41),
        (       118401,        11852,       140)]
     1 [(       118404,        11852,       140)]
     2 [(       118407,        12966,        41),
        (       118407,        11852,       140)]
     3 [(       118410,        11852,       140)]
     4 [(       118413,        12966,        41),
        (       118413,        11852,       140)]
     5 [(       118416,        11852,       140)]
     6 [(       118419,        12966,        41),
        (       118419,        11852,       140)]
     7 [(       118422,        11852,       140)]
     8 [(       118425,        12966,        41),
        (       118425,        11852,       140)]
     9 [(       118428,        11852,       140)]
    10 [(       118431,        12966,        41),
        (       118431,        11852,       140)]
    11 [(       118434,        11852,       140)]
    12 [(       118437,        12966,        41),
        (       118437,        11852,       140)]

From this one sees that there are two fifoEvents in the first event, one in the second, two in the third, etc. To work with the data in h5py, one would do

In [297]: data[0,'fifoEvents']
Out[297]:
array([(118401L, 12966L, 41L), (118401L, 11852L, 140L)],
      dtype=[('timestampHigh', '<u4'), ('timestampLow', '<u4'), ('eventCode', '<u4')])

In [298]: data[0,'fifoEvents']['eventCode']
Out[298]: array([ 41, 140], dtype=uint32)
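
The same per-event indexing extends to simple summaries over all events. The sketch below uses the first four events from the printds output above, rebuilt by hand as numpy arrays; the list fifo_per_event is a stand-in for iterating over data['fifoEvents'].

```python
import numpy as np

fifo_dtype = np.dtype([('timestampHigh', '<u4'),
                       ('timestampLow', '<u4'),
                       ('eventCode', '<u4')])

# First four events from the printds output, rebuilt by hand.
fifo_per_event = [
    np.array([(118401, 12966, 41), (118401, 11852, 140)], dtype=fifo_dtype),
    np.array([(118404, 11852, 140)], dtype=fifo_dtype),
    np.array([(118407, 12966, 41), (118407, 11852, 140)], dtype=fifo_dtype),
    np.array([(118410, 11852, 140)], dtype=fifo_dtype),
]

# Number of fifoEvents in each event.
n_fifo = np.array([len(ev) for ev in fifo_per_event])

# Boolean mask of the events in which event code 41 occurred.
has_41 = np.array([np.any(ev['eventCode'] == 41) for ev in fifo_per_event])
```

A mask like has_41 can be built without first constructing the full 0/1 table when only one event code is of interest.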

Comparing Strings to Enums

output_maps is a 1D array of a compound type with several rows. Each row has two different enums: source and conn. Suppose you want to test the source enum in row 0 against the symbolic strings. These are the steps:

  # First obtain the Python dict that translates strings to ints with this
  # special h5py function for handling enums:

  output_maps_source_enum = h5py.check_dtype(enum=output_maps.dtype['source'])

  # Next, read the value once and test it as follows:

  rowToTest = 0
  source = output_maps['source'][rowToTest]
  if source == output_maps_source_enum['Pulse']:
    pass  # deal with Pulse case
  elif source == output_maps_source_enum['Prescaler']:
    pass  # deal with Prescaler case
  elif source == output_maps_source_enum['Force_High']:
    pass  # deal with Force_High case
  elif source == output_maps_source_enum['Force_Low']:
    pass  # deal with Force_Low case
  elif source == output_maps_source_enum['DBus']:
    pass  # deal with DBus case

Printing Enums as Strings

If you want to print the source enum as a string, you need to do something like invert the enum dict:

  output_maps_source_int2str = dict([(val,sym) for sym,val in output_maps_source_enum.iteritems()])
  print "%s" % output_maps_source_int2str[output_maps['source'][rowToTest]]
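
Inverting the dict is only safe when no two symbolic names share a value. The helper below is a hypothetical sketch (invert_enum is not part of h5py or h5tools) that makes that check explicit. The dict source_enum is a hand-written stand-in for what h5py.check_dtype returns, with values taken from the output_maps dtype printed in the Displaying Datasets section. It uses items() so that it runs under Python 3 as well as Python 2.

```python
def invert_enum(enum_dict):
    """Return a value -> name dict, refusing enums with duplicate values."""
    inverted = dict((val, sym) for sym, val in enum_dict.items())
    if len(inverted) != len(enum_dict):
        raise ValueError("enum values are not unique; cannot invert")
    return inverted

# Hand-written stand-in for the enum dict of the 'source' field.
source_enum = {'Pulse': 0, 'DBus': 1, 'Prescaler': 2,
               'Force_High': 3, 'Force_Low': 4}

source_int2str = invert_enum(source_enum)
source_name = source_int2str[0]   # 'Pulse'
```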

The Array of Enum in Ipimb::Config

The capacitorValue field of Ipimb::Config is an array of enum as opposed to an enum. There are four columns, one for each diode. To work with this enum, one would do (again, assume the h5py dataset is loaded into config):

In [20]: h5py.check_dtype(enum=config.dtype['capacitorValue'].base)    # note that we must
Out[20]:                                                               # specify .base as this
{'c_10nF': 6L,                                                         # is an array of enum
 'c_120pF': 3L,
 'c_1pF': 0L,
 'c_24pF': 2L,
 'c_3p3nF': 5L,
 'c_4p7pF': 1L,
 'c_620pF': 4L,
 'expert': 7L}

# next we save the enum dict, invert it, and use it to print the symbolic names
# for diode 0 for all events in this dataset

In [21]: capValEnum=h5py.check_dtype(enum=config.dtype['capacitorValue'].base)

In [22]: capValEnumInv = dict([(val,sym) for sym,val in capValEnum.iteritems()])

In [23]: capValEnumInv
Out[23]:
{0L: 'c_1pF',
 1L: 'c_4p7pF',
 2L: 'c_24pF',
 3L: 'c_120pF',
 4L: 'c_620pF',
 5L: 'c_3p3nF',
 6L: 'c_10nF',
 7L: 'expert'}

# this dataset happens to cover six events:
In [24]: config['capacitorValue']
Out[24]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=uint8)

# here is diode 0 for the six events
In [26]: config['capacitorValue'][:,0]
Out[26]: array([0, 0, 0, 0, 0, 0], dtype=uint8)

In [27]: [capValEnumInv[x] for x in config['capacitorValue'][:,0]]
Out[27]: ['c_1pF', 'c_1pF', 'c_1pF', 'c_1pF', 'c_1pF', 'c_1pF']
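
For larger datasets, the list comprehension can be replaced by one indexing operation. The sketch below assumes the enum values are small non-negative integers with no gaps, as they are for capacitorValue. The capacitor_value array is made up, and names_by_value is a name introduced here for illustration.

```python
import numpy as np

# Enum dict as shown above, written out by hand.
cap_val_enum = {'c_1pF': 0, 'c_4p7pF': 1, 'c_24pF': 2, 'c_120pF': 3,
                'c_620pF': 4, 'c_3p3nF': 5, 'c_10nF': 6, 'expert': 7}

# Lookup array: position i holds the name whose enum value is i.
names_by_value = np.empty(max(cap_val_enum.values()) + 1, dtype=object)
for sym, val in cap_val_enum.items():
    names_by_value[val] = sym

# Made-up capacitorValue block: 3 events x 4 diodes.
capacitor_value = np.array([[0, 0, 0, 0],
                            [2, 2, 5, 5],
                            [7, 0, 0, 1]], dtype=np.uint8)

# Integer-array indexing translates the whole block at once and keeps
# the (events, diodes) shape.
capacitor_names = names_by_value[capacitor_value]
```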

Displaying Datasets

The compound types we store can print as a great deal of data when being explored interactively, which is why we provide the function printds above. However, here are a few techniques for working with the datasets directly.

Use print to hide the dtype. Evaluating the dataset's value directly shows the full dtype:

In [5]: output_maps.value
Out[5]:
array([(0, 0, 1, 0, 0), (0, 1, 1, 3, 0), (0, 2, 1, 4, 0), (0, 3, 1, 6, 0),
       (0, 3, 1, 7, 0), (0, 3, 1, 8, 0), (0, 4, 1, 9, 0), (0, 5, 1, 5, 0),
       (0, 6, 1, 1, 0), (0, 6, 1, 2, 0), (0, 6, 1, 10, 0), (0, 6, 1, 11, 0)],
      dtype=[('source', [(({'vals': {'Force_Low': 4L, 'DBus': 1L, 'Force_High': 3L, 'Prescaler': 2L, 'Pulse': 0L}}, 'enum'), '<i2')]), ('source_id', '<i2'), ('conn', [(({'vals': {'UnivIO': 1L, 'FrontPanel': 0L}}, 'enum'), '<i2')]), ('conn_id', '<i2'), ('module', '<i2')])

This can be a lot to look at. If you use print, though, you are printing the string representation as opposed to the object representation:

 In [8]: print output_maps.value
 [(0, 0, 1, 0, 0) (0, 1, 1, 3, 0) (0, 2, 1, 4, 0) (0, 3, 1, 6, 0)
  (0, 3, 1, 7, 0) (0, 3, 1, 8, 0) (0, 4, 1, 9, 0) (0, 5, 1, 5, 0)
  (0, 6, 1, 1, 0) (0, 6, 1, 2, 0) (0, 6, 1, 10, 0) (0, 6, 1, 11, 0)]

Convert from a numpy array to a python list:

 In [9]: output_maps.value.tolist()
 Out[9]:
 [(0, 0, 1, 0, 0),
  (0, 1, 1, 3, 0),
  (0, 2, 1, 4, 0),
  (0, 3, 1, 6, 0),
  (0, 3, 1, 7, 0),
  (0, 3, 1, 8, 0),
  (0, 4, 1, 9, 0),
  (0, 5, 1, 5, 0),
  (0, 6, 1, 1, 0),
  (0, 6, 1, 2, 0),
  (0, 6, 1, 10, 0),
  (0, 6, 1, 11, 0)]

However, be careful with Python lists: they are slow for scientific computations. The numpy arrays that h5py returns are more efficient for calculations.

Display the field names without the type information by using the dtype.names attribute:

 In [15]: output_maps.dtype.names
 Out[15]: ('source', 'source_id', 'conn', 'conn_id', 'module')
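
The field names can also serve as a header when printing a few rows by hand. The function below is a much simpler illustration than printds and is not taken from it; format_rows is a hypothetical name, and the sample array reuses the first three rows of output_maps with plain int16 fields in place of the enums.

```python
import numpy as np

def format_rows(arr, width=10):
    """Return text lines: a header of field names, then one line per record."""
    names = arr.dtype.names
    lines = [''.join(name.rjust(width) for name in names)]
    for record in arr:
        lines.append(''.join(str(record[name]).rjust(width) for name in names))
    return lines

# Stand-in for output_maps with the enum fields replaced by plain ints.
maps_dtype = np.dtype([('source', '<i2'), ('source_id', '<i2'),
                       ('conn', '<i2'), ('conn_id', '<i2'), ('module', '<i2')])
rows = np.array([(0, 0, 1, 0, 0), (0, 1, 1, 3, 0), (0, 2, 1, 4, 0)],
                dtype=maps_dtype)

lines = format_rows(rows)
print('\n'.join(lines))
```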

Use Dtype to Slice Array

The names in the dtype are very useful for slicing the array. For example, one can do

In [5]: output_maps['source_id']
Out[5]: array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 6], dtype=int16)

or

In [6]: output_maps['source']
Out[6]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)

Above though, the enum is returned to us as an array of ints. With the output_maps_source_int2str dict defined above, we can create a list of strings to represent it as follows:

In [15]: [ output_maps_source_int2str[x] for x in output_maps['source']]
Out[15]:
['Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse']
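
Field slices combine naturally with boolean masks to select rows. The sketch below rebuilds the twelve output_maps rows shown earlier as an in-memory numpy array, with plain int16 fields in place of the enums; the variable names are chosen for illustration.

```python
import numpy as np

maps_dtype = np.dtype([('source', '<i2'), ('source_id', '<i2'),
                       ('conn', '<i2'), ('conn_id', '<i2'), ('module', '<i2')])
output_maps_rows = np.array(
    [(0, 0, 1, 0, 0), (0, 1, 1, 3, 0), (0, 2, 1, 4, 0), (0, 3, 1, 6, 0),
     (0, 3, 1, 7, 0), (0, 3, 1, 8, 0), (0, 4, 1, 9, 0), (0, 5, 1, 5, 0),
     (0, 6, 1, 1, 0), (0, 6, 1, 2, 0), (0, 6, 1, 10, 0), (0, 6, 1, 11, 0)],
    dtype=maps_dtype)

# Rows whose source_id is 3, and the conn_id values of those rows.
mask = output_maps_rows['source_id'] == 3
conn_ids_for_source_3 = output_maps_rows['conn_id'][mask]

# Conditions on several fields combine with & (note the parentheses).
mask_both = ((output_maps_rows['source_id'] == 6)
             & (output_maps_rows['conn_id'] >= 10))
selected = output_maps_rows[mask_both]
```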
