How to access HDF5 data from Python

Objective

LCLS does not currently offer a uniform approach to the analysis of accumulated experimental data. Users rely on myana, pyana, MATLAB, IDL, CASS, and possibly other tools. Work on the long-awaited psana project is in progress; psana is intended to be quite generic and will probably not be simple to use. This page discusses a simple but flexible approach to analyzing data stored in HDF5 files. It is based on Python code that makes extensive use of standard libraries. A few examples of how to access and process data are presented at the end of this page.

This approach has several obvious advantages:

It is very flexible: an HDF5 file has an indexed structure, which gives your code direct access to any event data in any file.
Python is a high-level scripting language that lets you write transparent and compact code built on mature standard libraries.
In general, Python code runs slowly compared to C++, but libraries such as NumPy, which are implemented in compiled code (C), solve this problem for operations on large arrays.
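
To illustrate the last point, the minimal sketch below sums all pixels above a threshold twice: once with a pure-Python loop and once with a single NumPy expression. The image is synthetic, and the names image, threshold, sum_above_loop and sum_above_numpy are hypothetical, introduced only for this illustration. Timings depend on the machine, so none are quoted here.

```python
import numpy as np

def sum_above_loop(image, threshold):
    """Pure-Python loop over all pixels (slow for large arrays)."""
    total = 0
    for row in image:
        for pixel in row:
            if pixel > threshold:
                total += int(pixel)
    return total

def sum_above_numpy(image, threshold):
    """Same result with a boolean mask; the loop runs in compiled code."""
    return int(image[image > threshold].sum())

# Synthetic 480x640 "camera frame" with a fixed seed (illustration only)
rng = np.random.RandomState(0)
image = rng.randint(0, 1024, size=(480, 640))
threshold = 500

# Both functions must agree; only their speed differs
print(sum_above_loop(image, threshold) == sum_above_numpy(image, threshold))
```

Wrapping each call with time.time() is a simple way to compare the two on your own machine.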

There are a couple of drawbacks:

You have to know or learn Python.
The current version of the h5py library is quite slow with large HDF5 files.

The first issue is not really a drawback. The basic concepts of this high-level language can be learned from scratch in a couple of days. Within a week you will feel like an expert and will enjoy programming in this powerful language. The second issue, the slow h5py library, is really annoying, but we hope that the authors will take our comments into account and that its performance will improve soon.

Below we assume that everything is set up to work on the LCLS analysis farm; otherwise see Computing and Account Setup.

Libraries

The examples below use the following libraries:

h5py (Python interface to HDF5 files)
NumPy (N-dimensional arrays)
SciPy (scientific algorithms, including the ndimage image-processing module)
matplotlib (plotting)

These libraries can be imported near the top of the Python file, for example:

#!/usr/bin/env python
import h5py
import numpy as np
import scipy as sp
import scipy.ndimage as spi
import matplotlib.pyplot as plt

Basic operations

Basic operations give access to the dataset records in an HDF5 file.

Open file, get dataset, get array for current event, and close file:

    file    = h5py.File(hdf5_file_name, 'r')   # Open hdf5 file in read-only mode
    dataset = file[dataset_name]
    arr1ev  = dataset[event_number]
    file.close()

where we assume that all necessary parameters were defined earlier, for example

    hdf5_file_name = '/reg/d/psdm/XPP/xppcom10/hdf5/xppcom10-r0546.h5'
    dataset_name   = '/Configure:0000/Run:0000/CalibCycle:0000/Camera::FrameV1/XppSb4Pim.1:Tm6740.1/image'
    event_number   = 5
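
As a side note, h5py.File can also be used as a context manager, which closes the file automatically even if an error occurs while reading. The sketch below is a minimal illustration under stated assumptions: it first writes a small temporary file so that it is runnable anywhere, and the file name, the dataset name '/Run:0000/image' and the array contents are hypothetical, not taken from a real LCLS file.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, created here only to make the sketch runnable
hdf5_file_name = os.path.join(tempfile.mkdtemp(), 'example.h5')
dataset_name   = '/Run:0000/image'
event_number   = 5

with h5py.File(hdf5_file_name, 'w') as f:
    # 10 synthetic "events", each a 4x6 image
    f.create_dataset(dataset_name, data=np.arange(10 * 4 * 6).reshape(10, 4, 6))

# The file is closed automatically when the with-block ends, even on errors
with h5py.File(hdf5_file_name, 'r') as f:
    arr1ev = f[dataset_name][event_number]   # reads one event into memory

print(arr1ev.shape)
```

Because the event is copied into a NumPy array while the file is open, arr1ev remains usable after the file has been closed.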

arr1ev is returned as a NumPy array, which offers many methods for manipulating the data. For example, one can

print array shape and content:

    print('arr1ev.shape =', arr1ev.shape)
    print('arr1ev =\n', arr1ev)
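
A few more of the NumPy array methods are sketched below: basic statistics, the position of the brightest pixel, and a rectangular region of interest. The array here is a small synthetic stand-in for arr1ev, and the names row, col and roi are hypothetical; with real data the array would come from the dataset as shown above.

```python
import numpy as np

# Synthetic stand-in for arr1ev: a 4x6 image with one bright pixel
arr1ev = np.zeros((4, 6), dtype=np.uint16)
arr1ev[2, 3] = 100

print('min, max, mean =', arr1ev.min(), arr1ev.max(), arr1ev.mean())

# Position (row, column) of the brightest pixel
row, col = np.unravel_index(arr1ev.argmax(), arr1ev.shape)
print('brightest pixel at', (int(row), int(col)))

# Region of interest: rows 1..2 and columns 2..4 (a view, no data is copied)
roi = arr1ev[1:3, 2:5]
print('roi.shape =', roi.shape, ' roi.sum() =', roi.sum())
```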

Advanced operations

As in the previous examples, we assume that

    file = h5py.File(hdf5_file_name, 'r')
    ds   = file[dataset_name]

Check whether the HDF5 item is a "File", "Group", or "Dataset"

isFile    = isinstance(ds,h5py.File)
isGroup   = isinstance(ds,h5py.Group)
isDataset = isinstance(ds,h5py.Dataset)
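
One detail matters when combining these checks: in h5py the File class is a subclass of Group, so a File object passes both of the first two tests. The sketch below shows one way to classify an item by testing for File first. The helper item_kind and the names in the temporary file are hypothetical, chosen only for this illustration.

```python
import os
import tempfile

import h5py

def item_kind(item):
    """Returns 'File', 'Group', 'Dataset', or 'Other' for an h5py object."""
    if isinstance(item, h5py.File):      # must be tested before Group
        return 'File'
    if isinstance(item, h5py.Group):
        return 'Group'
    if isinstance(item, h5py.Dataset):
        return 'Dataset'
    return 'Other'

# Small temporary file with hypothetical names, for illustration only
path = os.path.join(tempfile.mkdtemp(), 'kinds.h5')
with h5py.File(path, 'w') as f:
    grp = f.create_group('Configure:0000')
    dset = grp.create_dataset('data', data=[1, 2, 3])
    kinds = [item_kind(f), item_kind(grp), item_kind(dset)]
    file_is_group = isinstance(f, h5py.Group)   # a File is also a Group

print(kinds, file_is_group)
```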

Get information about an HDF5 item

The following attributes are available for all HDF5 items:
```
ds.id
ds.ref
ds.parent
ds.file
ds.name
```

Dataset

dataset_type  = ds.dtype
dataset_shape = ds.shape
dataset_value = ds[()]    # entire dataset content; written as ds.value in h5py versions before 3.0
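
Reading the full content of a dataset, as in the last line above, loads every event into memory. For large files it is usually enough to read a slice. The sketch below illustrates this on a small temporary file; the dataset name 'image' and its dimensions are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Temporary file with a hypothetical dataset of 100 "events" of 8x8 pixels
path = os.path.join(tempfile.mkdtemp(), 'slices.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('image', data=np.ones((100, 8, 8), dtype=np.uint16))

with h5py.File(path, 'r') as f:
    ds = f['image']
    print(ds.shape, ds.dtype)              # metadata only, no pixel data is read
    one_event   = ds[5]                    # a single event
    some_events = ds[10:20]                # events 10..19
    corner      = ds[0, :4, :4]            # 4x4 corner of event 0
    mean_image  = ds[10:20].mean(axis=0)   # average over the selected events

print(one_event.shape, some_events.shape, corner.shape, mean_image.shape)
```

Each slice is returned as an ordinary NumPy array, so it stays valid after the file is closed.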

Get item attributes for a File or Group (if attributes are available)

# 'group' below can be a Group or a File object
dict_attributes    = group.attrs
attrs_keys         = group.attrs.keys()
attrs_values       = group.attrs.values()
number_of_attrs    = len(group.attrs)

# For example, one of the file attributes is a run number:
run_number         = file.attrs['runNumber']
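
Attributes behave like a dictionary, so they can also be looped over by name or read with a default value when an attribute may be absent. The sketch below is a minimal illustration on a temporary file. The attribute runNumber is the one mentioned above; the value 546, the 'comment' attribute and the name 'noSuchAttribute' are hypothetical.

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), 'attrs.h5')
with h5py.File(path, 'w') as f:
    f.attrs['runNumber'] = 546            # attribute name from the text, value made up
    f.attrs['comment'] = 'illustration'   # hypothetical attribute

with h5py.File(path, 'r') as f:
    # Loop over all attributes as (name, value) pairs
    for name, value in f.attrs.items():
        print(name, '=', value)
    run_number = f.attrs.get('runNumber', -1)        # -1 would mean "missing"
    missing    = f.attrs.get('noSuchAttribute', -1)  # falls back to the default
    has_run    = 'runNumber' in f.attrs
```

Reading attributes by name avoids relying on the order in which they happen to be stored.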

Get group name and the list of daughters

print(group.name)                    # full path of the group
list_of_items = list(group.items())  # (name, object) pairs of the daughters
print(list_of_items)

or

for key, val in group.items():
    print(key, val)

Extract time

Time is stored in HDF5 as a pair of integers: the number of seconds since 1970-01-01 and the nanosecond fraction of that second. It can be stored both in the group attributes and in a data record named "time", and can be extracted as shown below

from the group attributes

group = file["/Configure:0000"]
attr_values = list(group.attrs.values())  # assumes seconds and nanoseconds are the first two attributes
time_sec  = attr_values[0]
time_nsec = attr_values[1]

from the time data record

time_dataset = file['/Configure:0000/Run:0000/CalibCycle:0002/Acqiris::DataDescV1/XppLas.0:Acqiris.0/time']

index = 0                   # this is an index in the dataset
time = time_dataset[index]  # get the time tuple consisting of seconds and nanoseconds
time_sec  = time[0]
time_nsec = time[1]
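
Once the two numbers are extracted, they can be combined into a single timestamp with the standard datetime module. The sketch below is one possible way to do this; the function name to_datetime_utc and the input values are hypothetical and not taken from a real file.

```python
import datetime

def to_datetime_utc(t_sec, t_nsec):
    """Combines (seconds, nanoseconds) since 1970-01-01 into a UTC datetime.

    datetime resolves microseconds, so the last three digits of t_nsec are dropped.
    """
    epoch = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)
    return epoch + datetime.timedelta(seconds=int(t_sec),
                                      microseconds=int(t_nsec) // 1000)

# Arbitrary example values, not taken from a real file
stamp = to_datetime_utc(1300000000, 123456789)
print(stamp.isoformat())
```

The int() conversions let the function accept NumPy integers as returned by h5py as well as plain Python integers.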

Code examples

Example 1, basic operations

#!/usr/bin/env python

import h5py
import numpy as np

eventNumber = 5

file    = h5py.File('/reg/d/psdm/XPP/xppcom10/hdf5/xppcom10-r0546.h5', 'r')
dataset = file['/Configure:0000/Run:0000/CalibCycle:0000/Camera::FrameV1/XppSb4Pim.1:Tm6740.1/image']
arr1ev  = dataset[eventNumber]
file.close()

print('arr1ev.shape =', arr1ev.shape)
print('arr1ev =\n', arr1ev)

Example 2, advanced operations

Example, extract and print the time variables:

#!/usr/bin/env python

import h5py
import time

#-----------------------------------------------------

def print_time(t_sec, t_nsec):
    """Converts seconds to human-readable local time and prints the formatted time"""

    tloc = time.localtime(t_sec) # converts sec to the struct_time tuple in local time
    print('Input time :', t_sec, 'sec,', t_nsec, 'nsec')
    print('Local time :', time.strftime('%Y-%m-%d %H:%M:%S', tloc))

#-----------------------------------------------------

file_name = '/reg/d/psdm/xpp/xpp22510/hdf5/xpp22510-r0100.h5'
file = h5py.File(file_name, 'r') # open read-only

print("EXAMPLE: Get time from the group attributes:")

group = file["/Configure:0000"]
attr_values = list(group.attrs.values()) # assumes seconds and nanoseconds are the first two attributes
t_sec  = attr_values[0]
t_nsec = attr_values[1]
print_time(t_sec, t_nsec)


print("EXAMPLE: Get time from the data record 'time':")

dataset = file['/Configure:0000/Run:0000/CalibCycle:0002/Acqiris::DataDescV1/XppLas.0:Acqiris.0/time']
index = 0
time_rec = dataset[index] # named time_rec so that the time module is not shadowed
t_sec  = time_rec[0]
t_nsec = time_rec[1]
print_time(t_sec, t_nsec)

file.close()

#----------------------------------------------------

Example, print the entire file/group structure using a recursive method

#!/usr/bin/env python
import h5py

def print_group(g, offset='    '):
    """Prints the name of the input file/group/dataset (g) and starts the recursion over its content"""
    print("Structure of the", end=' ')
    if   isinstance(g, h5py.File):    print("'File'", end=' ')
    elif isinstance(g, h5py.Group):   print("'Group' from file", end=' ')
    elif isinstance(g, h5py.Dataset): print("'Dataset' from file", end=' ')
    print(g.file)
    print(g.name)
    if   isinstance(g, h5py.Dataset): print(offset, "(Dataset)   shape =", g.shape) #, g.dtype
    else:                             print_group_content(g, offset)

def print_group_content(g, offset='    '):
    """Prints the content of the file/group recursively, starting from the sub-groups of g"""
    for key, subg in g.items():
        print(offset, key, end=' ')
        if   isinstance(subg, h5py.Dataset):
            print(" (Dataset)   shape =", subg.shape) #, subg.dtype
        elif isinstance(subg, h5py.Group):
            print(" (Group)   len =", len(subg))
            print_group_content(subg, offset + '    ')
        else:
            print() # finish the line for any other item type

def print_hdf5_file_structure(file_name):
    """Prints the HDF5 file structure"""
    file = h5py.File(file_name, 'r') # open read-only
    print_group(file)
    file.close()
    print('=== EOF ===')

if __name__ == "__main__":
    print_hdf5_file_structure('/reg/d/psdm/xpp/xpp22510/hdf5/xpp22510-r0100.h5')
