
Interfaces

Python interfaces to HDF5 include h5py and pytables. While we actively use and support h5py, we will not discuss pytables as it is not a tool that we presently work with.

...

The LCLS data makes use of several HDF5 features that can be awkward to handle in h5py:

  • The compound type
    • when the number of fields gets large, this does not print well when interactively exploring the data in h5py
    • Field names are far from the data
    • dtypes, especially with enums in them, can be very complicated
  • enums - these are well-defined objects in HDF5. In the data they are stored as integers, and a dictionary that maps symbolic names to integers is stored in one spot in the dataset.
    • In h5py they are displayed as ints. Extra steps are required to obtain the enum dictionary and translate ints to strings, or to use strings to test the values of the enum.
  • vlen data
    • h5py (as of version 2.2) presently only supports variable length strings
    • Our EvrData uses general vlen data to represent the variable number of event codes that occur with each event
    • We have patched h5py to be able to read general vlen data, such as what is in our EvrData

...

At LCLS, this code is in the package h5tools that is part of analysis release 0.9.16 and later. One does

from h5tools import printds

to use the function. However, there is a bug in tag V00-00-02 of the h5tools package that prevents this; it affects analysis release 0.10.5 and earlier.

One can also obtain the code by downloading

printds.py

putting it in your python path, and importing the function.  Note, printds is a module that contains a function named printds. When downloading the code and adding it to your python path, one would do

Code Block
# importing printds directly (the module and the function share the same name)
from printds import printds
...
printds(dataset)

That is, after loading a dataset using h5py, pass it to the function. See the example below, as well as further examples in the code. Note that this function is only for printing datasets. It produces an error when passed any other h5py object (such as a group along the path to the dataset).

Tips for Working with h5py

Below are some tips for working with enums, vlen strings, and vlen data in h5py. We'll assume that the following datasets have been loaded into h5py dataset objects named output_maps, data, and config:

Code Block

From an EvrData::ConfigV7 group, the dataset: output_maps
From an EvrData::DataV3 group, the dataset: data
From an Ipimb::ConfigV2 group, the dataset: config

...

The EvrData::Data dataset has a very complicated dtype that is not easy to navigate:

Code Block

In [293]: data.dtype
Out[293]: dtype([('fifoEvents', [(({'type': dtype([('timestampHigh', '<u4'), ('timestampLow', '<u4'), ('eventCode', '<u4')])}, 'vlen'), '|O8')])])

...

From which one sees there are two fifoEvents in the first event, one in the second, two in the third, etc.  To work with the data in h5py, one would do

Code Block

In [297]: data[0,'fifoEvents']
Out[297]:
array([(118401L, 12966L, 41L), (118401L, 11852L, 140L)],
      dtype=[('timestampHigh', '<u4'), ('timestampLow', '<u4'), ('eventCode', '<u4')])

In [298]: data[0,'fifoEvents']['eventCode']
Out[298]: array([ 41, 140], dtype=uint32)
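
The per-event access above can be wrapped in small helpers. The following is a minimal sketch that uses only numpy, with a hand-built list standing in for the arrays that data[i, 'fifoEvents'] returns. The function names event_codes and events_with_code are hypothetical, and the values of the second event are made up for illustration. With a real file, one would presumably build the list by reading data[i, 'fifoEvents'] for each event i.

```python
import numpy as np

# Element type of one fifoEvents entry, copied from the dtype shown above.
fifo_dtype = np.dtype([('timestampHigh', '<u4'),
                       ('timestampLow', '<u4'),
                       ('eventCode', '<u4')])

# Stand-in for what data[i, 'fifoEvents'] returns for two events.
# The first event uses the values shown above; the second is made up.
fifo_events = [
    np.array([(118401, 12966, 41), (118401, 11852, 140)], dtype=fifo_dtype),
    np.array([(118402, 500, 41)], dtype=fifo_dtype),
]

def event_codes(fifo_events):
    """Return one list of event codes per event."""
    return [ev['eventCode'].tolist() for ev in fifo_events]

def events_with_code(fifo_events, code):
    """Return the indices of the events in which the given code occurs."""
    return [i for i, ev in enumerate(fifo_events) if code in ev['eventCode']]

codes = event_codes(fifo_events)
print(codes)
print(events_with_code(fifo_events, 140))
```

Because each event holds a different number of entries, the result is kept as a list of lists rather than forced into a rectangular array.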

...

output_maps is a 1d array of a compound type with several rows. Each row has two different enums: source and conn. Suppose you want to test the source enum in row 0 against the symbolic strings. These are the steps:

Code Block

  import h5py

  # First obtain the Python dict that translates strings to ints with this
  # special h5py function for handling enums:

  output_maps_source_enum = h5py.check_dtype(enum=output_maps.dtype['source'])

  # Next, read the value once and test it against the symbolic names:

  rowToTest = 0
  source = output_maps['source'][rowToTest]
  if source == output_maps_source_enum['Pulse']:
    pass  # deal with Pulse case
  elif source == output_maps_source_enum['Prescaler']:
    pass  # deal with Prescaler case
  elif source == output_maps_source_enum['Force_High']:
    pass  # deal with Force_High case
  elif source == output_maps_source_enum['Force_Low']:
    pass  # deal with Force_Low case
  elif source == output_maps_source_enum['DBus']:
    pass  # deal with DBus case

...

If you want to print the source enum as a string, you need to do something like invert the enum dict:

Code Block

  # build the reverse mapping (int -> symbolic name) from the enum dict
  output_maps_source_int2str = dict((val, sym) for sym, val in output_maps_source_enum.items())
  print("%s" % output_maps_source_int2str[output_maps['source'][rowToTest]])
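
Inverting the enum dict can also be packaged as a pair of reusable helpers. The sketch below is plain Python and does not need h5py. The names invert_enum and enum_name are hypothetical, the 'UNKNOWN' fallback is an assumption about how one might want to handle integers that are not in the enum, and the dict values are copied from the source enum in the output_maps dtype printed on this page.

```python
def invert_enum(enum_dict):
    """Build an int -> name dict from the name -> int dict of an enum."""
    inverted = {}
    for name, value in enum_dict.items():
        # int() normalizes Python 2 longs and numpy integers to plain ints.
        inverted[int(value)] = name
    return inverted

def enum_name(int2str, value, default='UNKNOWN'):
    """Translate one stored integer to its symbolic name."""
    return int2str.get(int(value), default)

# Stand-in for the dict returned by h5py.check_dtype(enum=...) for 'source'.
source_enum = {'Force_Low': 4, 'DBus': 1, 'Force_High': 3,
               'Prescaler': 2, 'Pulse': 0}
source_int2str = invert_enum(source_enum)

print(enum_name(source_int2str, 0))
print(enum_name(source_int2str, 99))
```

Returning a default instead of raising KeyError is a design choice that keeps interactive exploration going when a file contains a value the enum does not define.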

...

The compound types we store can print as a great deal of data when being explored interactively, which is why we provide the function printds above. However, here are a few techniques for working with the datasets directly.

  • Use print to hide the dtype. Evaluating the dataset value directly shows the full dtype:

    Code Block
    
    In [5]: output_maps.value
    Out[5]:
    array([(0, 0, 1, 0, 0), (0, 1, 1, 3, 0), (0, 2, 1, 4, 0), (0, 3, 1, 6, 0),
           (0, 3, 1, 7, 0), (0, 3, 1, 8, 0), (0, 4, 1, 9, 0), (0, 5, 1, 5, 0),
           (0, 6, 1, 1, 0), (0, 6, 1, 2, 0), (0, 6, 1, 10, 0), (0, 6, 1, 11, 0)],
          dtype=[('source', [(({'vals': {'Force_Low': 4L, 'DBus': 1L, 'Force_High': 3L, 'Prescaler': 2L, 'Pulse': 0L}}, 'enum'), '<i2')]), ('source_id', '<i2'), ('conn', [(({'vals': {'UnivIO': 1L, 'FrontPanel': 0L}}, 'enum'), '<i2')]), ('conn_id', '<i2'), ('module', '<i2')])
     

    This can be a lot to look at. If you use print, though, you are printing the string representation rather than the object representation:

    Code Block
    
     In [8]: print output_maps.value
     [(0, 0, 1, 0, 0) (0, 1, 1, 3, 0) (0, 2, 1, 4, 0) (0, 3, 1, 6, 0)
      (0, 3, 1, 7, 0) (0, 3, 1, 8, 0) (0, 4, 1, 9, 0) (0, 5, 1, 5, 0)
      (0, 6, 1, 1, 0) (0, 6, 1, 2, 0) (0, 6, 1, 10, 0) (0, 6, 1, 11, 0)]
    
    
  • Convert from a numpy array to a python list:
Code Block

 In [9]: output_maps.value.tolist()
 Out[9]:
 [(0, 0, 1, 0, 0),
  (0, 1, 1, 3, 0),
  (0, 2, 1, 4, 0),
  (0, 3, 1, 6, 0),
  (0, 3, 1, 7, 0),
  (0, 3, 1, 8, 0),
  (0, 4, 1, 9, 0),
  (0, 5, 1, 5, 0),
  (0, 6, 1, 1, 0),
  (0, 6, 1, 2, 0),
  (0, 6, 1, 10, 0),
  (0, 6, 1, 11, 0)]

...

  • Display the field names without the type information by using the dtype.names attribute:
Code Block

 In [15]: output_maps.dtype.names
 Out[15]: ('source', 'source_id', 'conn', 'conn_id', 'module')

...

The names in the dtype are very useful for slicing the array. For example, one can do

Code Block

In [5]: output_maps['source_id']
Out[5]: array([0, 1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 6], dtype=int16)

or

Code Block

In [6]: output_maps['source']
Out[6]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)

Above though, the enum is returned to us as an array of ints. With the output_maps_source_int2str dict defined above, we can create a list of strings to represent it as follows:

Code Block

In [15]: [ output_maps_source_int2str[x] for x in output_maps['source']]
Out[15]:
['Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse',
 'Pulse']
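
The field names and the inverted enum dicts can be combined to label every value in a row. The following is a minimal sketch using only numpy. The array rows stands in for the data read from output_maps, with the same field names and values as printed above but with the two enum fields stored as plain int16. The function name labeled_rows and the int2str layout (one int -> name dict per enum field) are assumptions made for illustration.

```python
import numpy as np

# Stand-in for the output_maps data: values and field names as printed above.
rows = np.array(
    [(0, 0, 1, 0, 0), (0, 1, 1, 3, 0), (0, 2, 1, 4, 0), (0, 3, 1, 6, 0),
     (0, 3, 1, 7, 0), (0, 3, 1, 8, 0), (0, 4, 1, 9, 0), (0, 5, 1, 5, 0),
     (0, 6, 1, 1, 0), (0, 6, 1, 2, 0), (0, 6, 1, 10, 0), (0, 6, 1, 11, 0)],
    dtype=[('source', '<i2'), ('source_id', '<i2'), ('conn', '<i2'),
           ('conn_id', '<i2'), ('module', '<i2')])

# One int -> name dict per enum field, inverted from the dtype shown above.
int2str = {
    'source': {0: 'Pulse', 1: 'DBus', 2: 'Prescaler',
               3: 'Force_High', 4: 'Force_Low'},
    'conn': {0: 'FrontPanel', 1: 'UnivIO'},
}

def labeled_rows(rows, int2str):
    """Return one dict per row, with enum fields translated to their names."""
    out = []
    for row in rows:
        record = {}
        for name in rows.dtype.names:
            value = int(row[name])
            record[name] = int2str[name][value] if name in int2str else value
        out.append(record)
    return out

records = labeled_rows(rows, int2str)
for record in records[:2]:
    # Print each value next to its field name, in dtype order.
    print(", ".join("%s=%s" % (n, record[n]) for n in rows.dtype.names))
```

This keeps the field names next to the data, which addresses one of the awkward points of compound types noted earlier on this page.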