Objective
The LCLS Data Management group is working on the PSANA project, a generic framework for the analysis of any experimental data. Although this framework is intended to be universal, it will most likely not be simple. On this page we discuss a simple but flexible approach to the analysis of data stored in HDF5 files. It is based on Python code that makes extensive use of standard libraries. A few code examples showing how to access and process data are presented at the end of this page.
This approach has obvious advantages:
- Flexibility: an HDF5 file has an indexed structure, which gives your code direct access to any data in any file.
- Python is a high-level scripting language that allows you to write transparent and compact code based on well-developed standard libraries.
- In general, Python code runs slowly compared to C++, but libraries such as NumPy, written in C, solve this problem for operations on large arrays.
This approach also has a couple of drawbacks:
- you have to know or learn Python;
- the current version of the h5py library works quite slowly with long HDF5 files.
The first issue, Python itself, is not really a drawback. The basic concepts of this high-level language can be learned from scratch in a couple of days. Within a week you will feel like an expert and will enjoy programming in this powerful language. The second issue, the slow h5py library, is really annoying, but we hope that its authors will take our comments into account and that its performance will improve soon.
Below we assume that everything is set up to work on the LCLS analysis farm; otherwise see Computing and Account Setup.
Libraries
Here is a list of Python libraries which we use in examples below:
These libraries can be imported at the top of the Python code file, for example:
    #!/usr/bin/env python
    import h5py
    import numpy as np
HDF5 file structure
A detailed description of the HDF5 file structure can be found on the HDF5 or h5py web sites. Briefly, the structure resembles a file system directory tree. The top level of the HDF5 tree is a file; a file may contain groups and datasets; each group may contain other groups and datasets; each dataset contains the data objects, which in most cases can be associated with NumPy types. Groups and files may also carry additional parameters, which are called attributes. So there are three basic types of items in an HDF5 file: File, Group, and Dataset. Their names are used as access keys.
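As a minimal sketch of this hierarchy, the following snippet builds a small HDF5 file in memory and reads it back by name. The file name, group path, dataset name, and all values are invented for illustration, and the in-memory `core` driver is used only so that nothing is written to disk; real LCLS files have a much deeper tree.

```python
import h5py
import numpy as np

# Hypothetical demo file kept in memory (driver='core', backing_store=False),
# so no file is created on disk.
f = h5py.File('structure_demo.h5', 'w', driver='core', backing_store=False)

# Groups behave like directories; intermediate groups are created as needed.
grp = f.create_group('/Configure:0000/Run:0000')

# A dataset holds the data; here 3 "events", each a 2x4 image.
images = np.arange(24, dtype=np.uint16).reshape(3, 2, 4)
grp.create_dataset('image', data=images)

# Attributes are small named values attached to a file, group, or dataset.
grp.attrs['runNumber'] = 546

# Names are used as access keys, much like paths in a file system.
dset = f['/Configure:0000/Run:0000/image']
dataset_path = dset.name
dataset_shape = dset.shape
event1 = dset[1]  # NumPy array holding the second event
run_number = int(f['/Configure:0000/Run:0000'].attrs['runNumber'])

print(dataset_path, dataset_shape, run_number)
f.close()
```

Indexing a dataset returns an ordinary NumPy array, so `event1` remains usable after the file is closed.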
Basic operations
Basic operations give access to the dataset records in an HDF5 file. Here we assume that the user explicitly knows the names of the file and dataset and the event number, for example:
    hdf5_file_name = '/reg/d/psdm/XPP/xppcom10/hdf5/xppcom10-r0546.h5'
    dataset_name = '/Configure:0000/Run:0000/CalibCycle:0000/Camera::FrameV1/XppSb4Pim.1:Tm6740.1/image'
    event_number = 5
- Open file, get dataset, get array for current event, and close file:
    file = h5py.File(hdf5_file_name, 'r')  # 'r' means that the HDF5 file is opened in read-only mode
    dataset = file[dataset_name]
    arr1ev = dataset[event_number]
    file.close()
Here arr1ev is a NumPy array. NumPy provides many methods for manipulating such an object. For example, one can
- print the array shape and content:
    print('arr1ev.shape =', arr1ev.shape)
    print('arr1ev =\n', arr1ev)
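Beyond printing, a few typical NumPy manipulations of a single-event image can be sketched as follows. A small synthetic array stands in for `arr1ev` here; its shape, its values, and the region-of-interest limits are assumptions made only for this illustration.

```python
import numpy as np

# Synthetic stand-in for arr1ev (a real camera frame would be much larger).
arr1ev = np.arange(12, dtype=np.uint16).reshape(3, 4)

# Basic statistics over all pixels.
pixel_min = arr1ev.min()
pixel_max = arr1ev.max()
pixel_mean = arr1ev.mean()

# Region of interest: rows 0-1 and columns 1-2 (slices exclude the end index).
roi = arr1ev[0:2, 1:3]
roi_sum = roi.sum()

# Projection of the image onto the column axis (sum over rows).
column_profile = arr1ev.sum(axis=0)

print('min, max, mean =', pixel_min, pixel_max, pixel_mean)
print('ROI sum =', roi_sum)
print('column profile =', column_profile)
```

Slicing returns a view of the original array, so selecting a region of interest does not copy the image data.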
Advanced operations
As in the previous case, we assume that all necessary parameters are defined:
    file = h5py.File(hdf5_file_name, 'r')
    item = file[item_name]
where item stands for a file, group, or dataset.
Check if the HDF5 item is "File", "Group", or "Dataset"
    isFile = isinstance(item, h5py.File)
    isGroup = isinstance(item, h5py.Group)
    isDataset = isinstance(item, h5py.Dataset)
In this example the standard Python function isinstance(...) returns True or False in each case. Note that h5py.File is a subclass of h5py.Group, so a file object passes the Group check as well.
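The three checks can be wrapped in one small helper that tests the most specific class first. This is only a sketch: `item_kind` is a hypothetical name, and the demo file with its group and dataset names is created in memory for illustration.

```python
import h5py

def item_kind(item):
    """Return 'File', 'Group', or 'Dataset' for an h5py object (hypothetical helper)."""
    # h5py.File derives from h5py.Group, so the File test has to come first.
    if isinstance(item, h5py.File):
        return 'File'
    if isinstance(item, h5py.Group):
        return 'Group'
    if isinstance(item, h5py.Dataset):
        return 'Dataset'
    return 'Unknown'

# In-memory demo file with invented names; nothing is written to disk.
f = h5py.File('kind_demo.h5', 'w', driver='core', backing_store=False)
f.create_group('Configure:0000')
f.create_dataset('Configure:0000/time', data=[1, 2, 3])

kinds = [item_kind(f),
         item_kind(f['Configure:0000']),
         item_kind(f['Configure:0000/time'])]
file_is_also_group = isinstance(f, h5py.Group)

print(kinds, file_is_also_group)
f.close()
```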
Get information about HDF5 item
- For all HDF5 items, these parameters are available:
    item.id      # for example: <GroupID [1] (U) 33554473>
    item.ref     # for example: <HDF5 object reference>
    item.parent  # for example: <HDF5 group "/Configure:0000/Run:0000/CalibCycle:0000" (5 members)>
    item.file    # for example: <HDF5 file "cxi80410-r0587.h5" (mode r, 3.5G)>
    item.name    # for example: /Configure:0000/Run:0000/CalibCycle:0000/Camera::FrameV1
- For a Dataset:
    ds.dtype  # for example: [('seconds', '<u4'), ('nanoseconds', '<u4')]
    ds.shape  # for example: (1186,)
    ds[0]     # first record, for example: (1297610252, 482193420)
    ds[()]    # entire dataset as a NumPy array (ds.value in old h5py versions)
- Get the item attributes for a File or Group (if attributes are available). In this example the item might be a group or a file:
    item.attrs                 # for example: <Attributes of HDF5 object at 230141696>
    list(item.attrs.keys())    # for example: ['start.seconds', 'start.nanoseconds']
    list(item.attrs.values())  # for example: [1297608424, 627075857]
    len(item.attrs)            # number of attributes
For example, one of the file attributes is the run number:
    run_number = file.attrs['runNumber']
- Get the list of daughters in the group, or convert the group to a dictionary and iterate over its keys and values:
    list_of_items = list(group.items())  # list of (name, object) pairs
    print(list_of_items)
    for key, val in dict(group).items():
        print(key, val)
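The attribute and daughter listings above can be tried without access to the analysis farm. The sketch below builds a small in-memory file; the attribute names follow the examples on this page, while the values, the group layout, and the file name are invented for illustration.

```python
import h5py

# In-memory demo file; nothing is written to disk.
f = h5py.File('attrs_demo.h5', 'w', driver='core', backing_store=False)
group = f.create_group('Configure:0000')
group.attrs['start.seconds'] = 1297608424
group.attrs['start.nanoseconds'] = 627075857
group.create_group('Run:0000')
group.create_dataset('time', data=[0, 1])

# Attributes collected by name, independent of their storage order.
attributes = {name: int(value) for name, value in group.attrs.items()}

# Names of the daughters, sorted for a reproducible listing.
daughter_names = sorted(group.keys())

print(attributes)
print(daughter_names)
f.close()
```

Looking attributes up by name avoids relying on the order in which `attrs.values()` happens to return them.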
Extract time
The time variable is stored in HDF5 as a pair of integers: the number of seconds since 01/01/1970 and the number of nanoseconds as the fraction of the second. Time is usually stored in the group attributes and/or in the data record named "time", and it can be extracted as shown below:
- from the group attributes:
    group = file["/Configure:0000"]
    time_sec = list(group.attrs.values())[0]
    time_nsec = list(group.attrs.values())[1]
- from the time data record:
    time_dataset = file['/Configure:0000/Run:0000/CalibCycle:0002/Acqiris::DataDescV1/XppLas.0:Acqiris.0/time']
    index = 0                          # this is an index in the dataset
    time_record = time_dataset[index]  # get the time tuple consisting of seconds and nanoseconds
    time_sec = time_record[0]
    time_nsec = time_record[1]
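Once the two integers are extracted, they can be combined into a single time stamp. The following is a minimal sketch using only the standard library; `hdf5_time_to_datetime` is a hypothetical helper name, and the conversion to UTC (rather than local time) is a choice made for this illustration.

```python
from datetime import datetime, timedelta, timezone

def hdf5_time_to_datetime(t_sec, t_nsec):
    """Convert (seconds, nanoseconds) since 1970-01-01 to a UTC datetime.

    Hypothetical helper; datetime keeps microseconds only, so the last
    three digits of the nanosecond field are dropped.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(seconds=int(t_sec), microseconds=int(t_nsec) // 1000)

# Sample values taken from the dataset example earlier on this page.
t = hdf5_time_to_datetime(1297610252, 482193420)
print(t.isoformat())
```

The `int(...)` calls let the helper accept NumPy integer scalars, which is what indexing a time record returns.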
Code examples
Example 1: basic operations
    #!/usr/bin/env python
    import h5py
    import numpy as np

    eventNumber = 5

    # Open the file read-only, get the dataset and the array for one event
    file = h5py.File('/reg/d/psdm/XPP/xppcom10/hdf5/xppcom10-r0546.h5', 'r')
    dataset = file['/Configure:0000/Run:0000/CalibCycle:0000/Camera::FrameV1/XppSb4Pim.1:Tm6740.1/image']
    arr1ev = dataset[eventNumber]
    file.close()

    print('arr1ev.shape =', arr1ev.shape)
    print('arr1ev =\n', arr1ev)
Example 2: advanced operations
Extract and print the time variables:
    #!/usr/bin/env python
    import h5py
    import time

    #-----------------------------------------------------
    def print_time(t_sec, t_nsec):
        """Converts seconds to human-readable time and prints the formatted time"""
        tloc = time.localtime(int(t_sec))  # converts sec to the struct_time tuple in local time
        print('Input time :', t_sec, 'sec,', t_nsec, 'nsec')
        print('Local time :', time.strftime('%Y-%m-%d %H:%M:%S', tloc))

    #-----------------------------------------------------
    file_name = '/reg/d/psdm/xpp/xpp22510/hdf5/xpp22510-r0100.h5'
    file = h5py.File(file_name, 'r')  # open read-only

    print("EXAMPLE: Get time from the group attributes:")
    group = file["/Configure:0000"]
    t_sec = list(group.attrs.values())[0]
    t_nsec = list(group.attrs.values())[1]
    print_time(t_sec, t_nsec)

    print("EXAMPLE: Get time from the data record 'time':")
    dataset = file['/Configure:0000/Run:0000/CalibCycle:0002/Acqiris::DataDescV1/XppLas.0:Acqiris.0/time']
    index = 0
    time_record = dataset[index]  # named time_record so that the time module is not shadowed
    t_sec = time_record[0]
    t_nsec = time_record[1]
    print_time(t_sec, t_nsec)

    file.close()
    #-----------------------------------------------------
Print entire file/group structure using recursive method
    #!/usr/bin/env python
    import h5py

    def print_hdf5_file_structure(file_name):
        """Prints the HDF5 file structure"""
        file = h5py.File(file_name, 'r')  # open read-only
        item = file  # or a sub-item such as file["/Configure:0000/EvrData::ConfigV4"]
        print_hdf5_item_structure(item)
        file.close()
        print('=== EOF ===')

    def print_hdf5_item_structure(g, offset='    '):
        """Prints the name of the input file/group/dataset (g) and starts iterating over its content"""
        print("Structure of the", end=' ')
        if isinstance(g, h5py.File):
            print("'File'", end=' ')
        elif isinstance(g, h5py.Group):
            print("'Group' from file", end=' ')
        elif isinstance(g, h5py.Dataset):
            print("'Dataset' from file", end=' ')
        print(g.file)
        print(g.name)
        if isinstance(g, h5py.Dataset):
            print(offset, "(Dataset) shape =", g.shape)
        else:
            print_group_content(g, offset)

    def print_group_content(g, offset='    '):
        """Prints the content of the file/group recursively, starting from the sub-groups of g"""
        for key, val in g.items():
            subg = val
            print(offset, key, end=' ')
            if isinstance(subg, h5py.Dataset):
                print(" (Dataset) shape =", subg.shape)
            elif isinstance(subg, h5py.Group):
                print(" (Group) len =", len(subg))
                print_group_content(subg, offset + '    ')

    if __name__ == "__main__":
        print_hdf5_file_structure('/reg/d/psdm/XPP/xppcom10/hdf5/xppcom10-r0546.h5')
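As an alternative to hand-written recursion, h5py groups provide a `visititems` method that walks the whole tree and calls a user function for every group and dataset. The sketch below uses it to collect all datasets; `list_datasets` is a hypothetical helper, and the in-memory demo file with its names and shapes is invented for illustration.

```python
import h5py
import numpy as np

def list_datasets(group):
    """Return (name, shape) for every dataset below group (hypothetical helper)."""
    found = []

    def collect(name, obj):
        # visititems calls this for each group and dataset below `group`;
        # returning None tells h5py to continue the traversal.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape))

    group.visititems(collect)
    return found

# In-memory demo file with invented content; nothing is written to disk.
f = h5py.File('visit_demo.h5', 'w', driver='core', backing_store=False)
f.create_dataset('Configure:0000/Run:0000/image', data=np.zeros((3, 2, 4)))
f.create_dataset('Configure:0000/time', data=np.zeros(5))

datasets = sorted(list_datasets(f))
for name, shape in datasets:
    print(name, shape)
f.close()
```

The names passed to the callback are relative to the group on which `visititems` is called, so they carry no leading slash.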