Introduction

One of the responsibilities of the offline system is to translate the data coming from the online system into a "scientific format". The basis for the scientific data format is HDF5 (Hierarchical Data Format), developed and supported by the HDF Group.

This document describes the general structure and organization of the data in HDF5 files produced by the offline translator.

HDF5 Overview

An HDF5 data file has a hierarchical organization consisting of groups and datasets. Groups can contain other groups and datasets; datasets contain complex multi-dimensional data. The organization resembles a regular UNIX file system: groups are analogous to directories, and datasets contain arbitrary data like regular files. Like directories, groups name the objects contained in them. This allows easy navigation from the "top" of the file to any object in that file using a path such as /groupA/groupB/dataset1.

Datasets in an HDF5 file have a defined element type and layout. The layout determines the dimensionality of the data and the size of each dimension (the shape of the dataset). A special case is the scalar layout, which means that the dataset contains exactly one element of the given type. Multidimensional layouts are more common. The element type describes the structure of the basic element of these multidimensional arrays or scalars. A type can be atomic, compound, or array. Atomic types include the usual numeric types such as integer and floating point. A compound type is a collection of atomic types or other compound types, very much like structures in programming languages. An array element type means that each dataset element is itself an array. All elements of a dataset have the same type and structure; for example, if the element type is an array then all elements must have exactly the same dimensions.

Groups and datasets in HDF5 can have a number of attributes attached to them, which can be used to store meta-information about the group or dataset as a whole. Each attribute has a name and an associated value, which usually has some basic type such as integer, float, or string. Attributes could be used, for example, to store the begin/end time of the run or the run number.

Versions and variations of HDF5 structure

The structure and representation of data inside HDF5 will evolve over time. To keep track of these changes and of the available features, support for schema versions was added. The schema version is a single number which is stored inside a file when the file is created. The version number is incremented when important changes happen to the representation of the data inside files. The schema version number is stored in the HDF5 file as an integer attribute of the top-level group (root or "/" group) with the name ":schema:version". Older HDF5 files may not have this attribute; if the attribute is missing, the schema version should be assumed to be 1. See the "Schema versions and features" section below for the list of known version numbers and their corresponding features.
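
The fallback rule for a missing attribute can be sketched in a few lines of Python. This is a minimal illustration, not the translator's own code: schema_version is a hypothetical helper name, and root_attrs is assumed to be any mapping from attribute names to values (a plain dict here; the attribute collection that an HDF5 library exposes for the "/" group would be expected to behave similarly).

```python
SCHEMA_VERSION_ATTR = ":schema:version"


def schema_version(root_attrs):
    """Return the schema version stored on the "/" group.

    root_attrs is assumed to be a dict-like mapping of attribute names
    to values. Files written before the attribute was introduced do not
    have it, and are treated as schema version 1.
    """
    value = root_attrs.get(SCHEMA_VERSION_ATTR)
    return 1 if value is None else int(value)


print(schema_version({}))                      # older file, attribute missing
print(schema_version({":schema:version": 3}))  # newer file
```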

Some dynamic features of the schema can be turned on or off independently of the schema version number. To simplify the use of those features, the top-level group may contain a number of attributes which specify whether a feature is available. One example is the ":schema:timestamp-format" attribute, which stores a string, either "short" or "full", depending on the format of the "time" datasets.

Group structure of the HDF5 files

The nested data structure of the XTC files is modeled with groups and datasets in HDF5. The HDF5 group structure roughly corresponds to the nested DAQ transitions, but reflects only the subset of transitions that are relevant to the offline world: Configure, Run, CalibCycle, and L1Accept.

The following groups are defined for the HDF5 format:

  • One or more groups corresponding to Configure/Unconfigure transitions. Typically there will be only one Configure transition per HDF5 file, so there will be only one "Configure" group.
  • Zero or more child groups inside a "Configure" group corresponding to BeginRun/EndRun transitions.
  • Zero or more child groups inside a "Run" group corresponding to BeginCalibCycle/EndCalibCycle transitions.
  • Zero or more child groups inside "CalibCycle" group corresponding to the data objects appearing inside L1Accept transitions.

The names of the groups in HDF5 are derived from the names of the corresponding transitions and the object types. Because there may be multiple nested transitions of the same type inside one parent transition, these transitions need to be distinguished and given different names. Our current approach for groups derived from transitions is to extend the transition name, such as CalibCycle, with the sequential number of the transition of the same type, for example Run:0000 or CalibCycle:0004. In case there is only one transition of a specific type in the XTC file, there will be an option to enable "simplified" group names, in which case group names will look like Configure, Run, etc. Here is an example of the group structure for a fictional data file containing data from several runs, each run containing a few calibration cycles:

/Configure:0000
    |
    +- Run:0000
    |    |
    |    +- CalibCycle:0000
    |    +- CalibCycle:0001
    |
    +- Run:0001
         |
         +- CalibCycle:0000
         +- CalibCycle:0001
         +- CalibCycle:0002
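
The naming convention for transition groups can be expressed as a pair of small Python helpers. This is only a sketch: format_group_name and parse_group_name are hypothetical names, and the four-digit zero padding is inferred from the examples in this document rather than from a formal specification.

```python
import re

# Matches "Run:0000" as well as the "simplified" form "Run".
_GROUP_NAME = re.compile(r"^(Configure|Run|CalibCycle)(?::(\d+))?$")


def format_group_name(transition, index):
    """Build a group name such as "CalibCycle:0004" (padding is an assumption)."""
    return f"{transition}:{index:04d}"


def parse_group_name(name):
    """Split a transition group name into (transition, sequence number).

    The sequence number is None for simplified names without a suffix.
    """
    match = _GROUP_NAME.match(name)
    if match is None:
        raise ValueError(f"not a transition group name: {name!r}")
    transition, number = match.groups()
    return transition, (int(number) if number is not None else None)


print(format_group_name("CalibCycle", 4))
print(parse_group_name("Run:0001"))
```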

Configure and CalibCycle groups each contain a number of data groups as well. Data groups contain the actual data objects (as datasets) which come from the corresponding transitions. A Configure group usually contains data from configuration objects, while a CalibCycle group has data groups collecting data from all L1Accept transitions (event data) and also a small set of groups for configuration objects associated with the BeginCalibCycle transition itself (if any). Data groups are two levels deep: the first level corresponds to the type name of the data contained in the data group, and the second level identifies the origin of the data (a particular device, also known as the source). The type name of the data corresponds to its C++ class name and may look like Camera::FrameV1 or CsPad::ConfigV4. This type name determines the structure and types of the datasets stored inside the data group. The second-level group name is constructed from the DAQ device identifier, which has the format <Detector>.<DetectorID>:<Device>.<DeviceID> for regular devices or <BldType> for Beam Line Data sources. Examples are AmoIms.0:Acqiris.1 for one particular AMO Acqiris device or FEEGasDetEnergy for the BLD FEE gas detector.
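
As an illustration of the source naming format, the following Python sketch splits a second-level group name into its parts. The function name parse_source and the returned dictionary layout are hypothetical. Treating every name that does not match the device pattern as a BLD source is a simplifying assumption.

```python
import re

# <Detector>.<DetectorID>:<Device>.<DeviceID>
_DEVICE_SOURCE = re.compile(
    r"^(?P<detector>[^.:]+)\.(?P<detector_id>\d+)"
    r":(?P<device>[^.:]+)\.(?P<device_id>\d+)$"
)


def parse_source(name):
    """Classify a source group name as a regular device or a BLD source."""
    match = _DEVICE_SOURCE.match(name)
    if match is None:
        # Assumption: anything else is a <BldType> name.
        return {"kind": "bld", "bld_type": name}
    return {
        "kind": "device",
        "detector": match["detector"],
        "detector_id": int(match["detector_id"]),
        "device": match["device"],
        "device_id": int(match["device_id"]),
    }


print(parse_source("AmoIms.0:Acqiris.1"))
print(parse_source("FEEGasDetEnergy"))
```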

Here is an example which shows the group structure including data groups for a few devices (BLD EBeam and a couple of CsPad devices); it does not show any datasets yet:

/Configure:0000/
/Configure:0000/Bld::BldDataEBeamV3/
/Configure:0000/Bld::BldDataEBeamV3/EBeam/
/Configure:0000/CsPad::ConfigV4/
/Configure:0000/CsPad::ConfigV4/CxiDs1.0:Cspad.0/
/Configure:0000/CsPad::ConfigV4/CxiDsd.0:Cspad.0/
/Configure:0000/Run:0000/
/Configure:0000/Run:0000/CalibCycle:0000/
/Configure:0000/Run:0000/CalibCycle:0000/Bld::BldDataEBeamV3/
/Configure:0000/Run:0000/CalibCycle:0000/Bld::BldDataEBeamV3/EBeam/
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDsd.0:Cspad.0/

As seen in this example, BLD data groups can appear in both Configure and CalibCycle groups; one other data type which can do this is EPICS data. Configuration data types which come from BLD sources may also appear in both locations.
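
A data group path like the ones listed above can be taken apart into its transition groups, type name and source. The Python sketch below does this with plain string handling. The helper name split_data_group_path is hypothetical, and the code assumes that no type name starts with "Configure", "Run" or "CalibCycle".

```python
TRANSITIONS = ("Configure", "Run", "CalibCycle")


def split_data_group_path(path):
    """Split a data group path into transitions, type name and source.

    Any components below the source level (for example the per-PV groups
    used for EPICS data) are returned in "extra".
    """
    parts = [p for p in path.split("/") if p]
    transitions = []
    for part in parts:
        # "Run:0000" -> "Run"; a type name such as "CsPad::ConfigV4" -> "CsPad"
        if part.split(":", 1)[0] not in TRANSITIONS:
            break
        transitions.append(part)
    rest = parts[len(transitions):]
    return {
        "transitions": transitions,
        "type": rest[0] if len(rest) > 0 else None,
        "source": rest[1] if len(rest) > 1 else None,
        "extra": rest[2:],
    }


info = split_data_group_path(
    "/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/"
)
print(info["type"], info["source"])
```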

Special cases – EPICS

Standard XTC information (such as the data source) is not enough to identify individual PVs; instead, the identification information is stored inside the data itself. To simplify access to EPICS data, we store it in HDF5 by splitting it into multiple additional groups based on PV names. The group structure for EPICS data has one additional group level below the device group; the names of these groups are the names of the PVs. Here is an example of a few EPICS groups:

/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0/CXI:DG1:CLZ:01.RBV/
/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0/CXI:DG1:JAWS:XTRANS.C/
/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0/EVNT:SYS0:1:LCLSBEAMRATE/
/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0/GATT:FEE1:310:R_ACT/

Additionally, the DAQ defines a set of aliases for EPICS PV names which provide easy-to-remember, meaningful names for PVs. These aliases are represented in HDF5 by symbolic links: an alias name pointing to some PV name becomes a symbolic link inside an Epics group pointing to the corresponding PV group name:

"/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0/Gas detector 1 pressure" -> "VGBA:FEE1:240:P"
"/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0/KB MOTIONS" -> "CXI:KB1:MMS:05.RBV"

EPICS data appears both inside Configure and L1Accept transitions (in slightly different forms). As a result, EPICS data groups will appear inside both /Configure and /Configure/Run/CalibCycle groups. There will be very few data entries (maybe just one) inside /Configure data groups, and many more inside CalibCycle data groups, depending on the total number of events in the CalibCycle.
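
When working only with group names (for example from a file listing), the alias links for EPICS PVs can be modeled as a simple mapping from alias to PV name. The Python sketch below resolves either an alias or a PV name to the full path of the PV group. The function name resolve_pv_group and the aliases dictionary are hypothetical. When a file is opened through an HDF5 library, symbolic links are normally followed by the library itself, so a helper like this would not be needed.

```python
import posixpath


def resolve_pv_group(epics_source_group, name, aliases):
    """Return the path of the PV group for a PV name or one of its aliases.

    aliases maps alias names to PV names, mirroring the symbolic links
    stored inside the EPICS source group.
    """
    pv_name = aliases.get(name, name)
    return posixpath.join(epics_source_group, pv_name)


source_group = (
    "/Configure:0000/Run:0000/CalibCycle:0000/Epics::EpicsPv/EpicsArch.0:NoDevice.0"
)
aliases = {
    "Gas detector 1 pressure": "VGBA:FEE1:240:P",
    "KB MOTIONS": "CXI:KB1:MMS:05.RBV",
}
print(resolve_pv_group(source_group, "KB MOTIONS", aliases))
```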

Dataset structure

Actual data are stored inside data groups as a number of HDF5 datasets. For the purpose of describing organization of the datasets we distinguish two types of data objects – configuration-type and event-type:

  • configuration-type data appears only once per group; many types of this data appear during the Configure transition and have only a single instance, while some configuration types are updated on every BeginCalibCycle but still have just one instance per calibration cycle;
  • event-type data usually appears in L1Accept transitions; it is updated on every event or less frequently, but there may still be more than one instance of the data object per CalibCycle.

Many data types are stored as a single dataset (usually named "config" or "data" depending on the type of the data); some more complex data types can be split into multiple datasets. Splitting is done to improve read performance when only a subset (slice) of the data is needed for a particular analysis.

Non-split configuration-type data is usually stored as a scalar dataset with the name "config". If a configuration type is split into multiple datasets, then the types of the individual datasets may vary; if, for example, one piece of data is an array, it will be stored as an array dataset. As an illustration of why splitting helps, camera data can be split into two datasets, one containing a small piece of general information about the image and another containing the image itself; if an analysis only needs information from the first dataset, it can read it significantly faster than if the datasets were not split. Here is an example of dataset definitions (output from h5ls) for two configuration data types, one not split and the other split:

# non-split data type stored as scalar dataset "config"
/Configure:0000/CsPad::ConfigV5/CxiDs1.0:Cspad.0 Group
/Configure:0000/CsPad::ConfigV5/CxiDs1.0:Cspad.0/config Dataset {SCALAR}

# split data type stored as two scalar datasets and 3 array (rank=1) datasets
/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0 Group
/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0/config Dataset {SCALAR}
/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0/eventcodes Dataset {2}
/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0/output_maps Dataset {9}
/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0/pulses Dataset {3}
/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0/seq_config Dataset {SCALAR}

All datasets which store event-type data are 1-dimensional arrays. Like configuration types, event types can be split and stored as multiple datasets. In addition to the regular datasets produced from event data, a few special datasets are stored for event-type data:

  • "time" – required, stores corresponding event time
  • "_damage" – optional, stores corresponding DAQ damage value
  • "_mask" – optional, stores "data is usable" flag

All datasets in a group corresponding to event-type data, including the datasets produced from data and the special datasets above, are 1-dimensional datasets with an identical number of entries. All datasets in the same group are aligned, meaning that entries with the same index correspond to the same event. For example, the data stored in data[100] comes from the event whose time is given by time[100], its damage is given by _damage[100], and it can be used if _mask[100] is non-zero.
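
The alignment property means that the datasets of one group can be read "row by row". The following Python sketch shows the idea with plain lists standing in for the datasets. The helper name event_rows is hypothetical, and time entries are reduced to (seconds, nanoseconds) tuples for brevity.

```python
def event_rows(group):
    """Combine the aligned datasets of one data group into per-event records.

    group maps dataset names ("time", "data", "_mask", ...) to sequences.
    All sequences must have the same length; entry i of every dataset
    belongs to the same event.
    """
    lengths = {name: len(values) for name, values in group.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"datasets are not aligned: {lengths}")
    names = sorted(group)
    return [dict(zip(names, row)) for row in zip(*(group[n] for n in names))]


group = {
    "time": [(10, 0), (10, 8333333)],
    "data": [1.5, 0.0],
    "_mask": [1, 0],
}
for row in event_rows(group):
    print(row)
```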

One should be aware that datasets from different groups are not guaranteed to be aligned: they can have different sizes, and even when their sizes are equal there is no guarantee that they correspond to the same events. To match events from different groups one has to use the content of the "time" datasets.
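
Matching events between two groups can be sketched as a lookup keyed on the timestamp. In the Python example below each time entry is represented as a (seconds, nanoseconds) tuple. Using only these two fields as the key is an assumption made for the sketch, and match_by_time is a hypothetical helper name.

```python
def match_by_time(times_a, times_b):
    """Return index pairs (i, j) such that times_a[i] == times_b[j].

    Each time entry is a (seconds, nanoseconds) tuple. Entries that are
    present in only one of the two groups are skipped.
    """
    index_b = {t: j for j, t in enumerate(times_b)}
    return [(i, index_b[t]) for i, t in enumerate(times_a) if t in index_b]


camera_times = [(100, 0), (100, 8333333), (100, 16666666)]
evr_times = [(100, 8333333), (100, 16666666), (100, 25000000)]
print(match_by_time(camera_times, evr_times))
```

Building the dictionary once keeps the matching linear in the number of events, which matters for calibration cycles with many thousands of entries.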

Here is an example of dataset definitions (output from h5ls) for event data types, one is not split, another is split:

# non-split event type has only one "data" dataset (plus three special datasets)
/Configure:0000/Run:0000/CalibCycle:0000/EvrData::DataV3/NoDetector.0:Evr.0 Group
/Configure:0000/Run:0000/CalibCycle:0000/EvrData::DataV3/NoDetector.0:Evr.0/_damage Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/EvrData::DataV3/NoDetector.0:Evr.0/_mask Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/EvrData::DataV3/NoDetector.0:Evr.0/data Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/EvrData::DataV3/NoDetector.0:Evr.0/time Dataset {9310/Inf}

# split event type has "data" and "element" datasets (plus three special datasets)
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0 Group
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/_damage Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/_mask Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/data Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/element Dataset {9310/Inf}
/Configure:0000/Run:0000/CalibCycle:0000/CsPad::ElementV2/CxiDs1.0:Cspad.0/time Dataset {9310/Inf}
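
Listings such as the ones above can be post-processed with a small parser, for example to find all scalar datasets or to check that the datasets of a group have the same size. The Python sketch below handles only the line forms shown in this document (Group, Dataset {SCALAR}, Dataset {N} and Dataset {N/Max}). The function name parse_h5ls_line is hypothetical, and paths containing spaces are not supported.

```python
import re

_H5LS_LINE = re.compile(
    r"^(?P<path>\S+)\s+(?P<kind>Group|Dataset)(?:\s+\{(?P<dims>[^}]*)\})?\s*$"
)


def parse_h5ls_line(line):
    """Parse one listing line into a dict, or return None if it does not match.

    The shape is () for scalar datasets, a tuple of current sizes for
    array datasets, and None for groups.
    """
    match = _H5LS_LINE.match(line.strip())
    if match is None:
        return None
    entry = {"path": match["path"], "kind": match["kind"], "shape": None}
    if match["kind"] == "Dataset" and match["dims"] is not None:
        dims = match["dims"].strip()
        if dims == "SCALAR":
            entry["shape"] = ()
        else:
            # "9310/Inf" is the current size followed by the maximum size.
            entry["shape"] = tuple(int(d.split("/")[0]) for d in dims.split(","))
    return entry


print(parse_h5ls_line(
    "/Configure:0000/EvrData::ConfigV7/NoDetector.0:Evr.0/pulses Dataset {3}"
))
```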

Time dataset

The special dataset "time" contains the timestamp of the event as it comes from the DAQ. This information is sufficient to identify the event and to match it with the data from other groups. In most cases the data in HDF5 is stored in event time order, meaning that timestamps in the time dataset should be in increasing order (which makes it possible to do a binary search for a timestamp). The time dataset is a structure (a compound data type in HDF5 terms) with these fields:

  • seconds – whole seconds part of the timestamp
  • nanoseconds – nanoseconds (range 0-999999999) part of the timestamp
  • fiducials – fiducials counter for the event
  • ticks – 119MHz counter within the fiducial
  • vector – event counter since Configure (not necessarily in increasing order)
  • control – EVR event code

Older schemas had only the first two fields; the last four fields are defined only if the global attribute (attribute of the "/" group) with the name ":schema:timestamp-format" exists and its value is "full".
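
When the time dataset is in increasing order, an event can be located with a binary search. The Python sketch below uses the standard bisect module on a list of (seconds, nanoseconds) tuples. The function name find_event is hypothetical, and the code assumes the list is sorted, which is the usual case but not a guarantee, so unsorted data would need a linear scan instead.

```python
from bisect import bisect_left


def find_event(times, seconds, nanoseconds):
    """Return the index of the event with the given timestamp, or None.

    times must be a list of (seconds, nanoseconds) tuples sorted in
    increasing order.
    """
    key = (seconds, nanoseconds)
    i = bisect_left(times, key)
    if i < len(times) and times[i] == key:
        return i
    return None


times = [(100, 0), (100, 8333333), (100, 16666666), (101, 0)]
print(find_event(times, 100, 16666666))
print(find_event(times, 100, 1))
```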

Damage dataset

The optional dataset "_damage" contains information about damage recorded by the DAQ for this data source. Usually when damage is non-zero the data is not usable, with the exception of BLD EBeam data, which can be used if user damage flags are set. The "_mask" dataset provides exact information about when the data is usable. See also the description of the mask dataset for more details about schema versions. The damage dataset is a structure (a compound data type in HDF5 terms) with these fields:

  • bits – packed value of the damage bits, non-user damage only
  • userBits – packed value of the damage bits, user damage only
  • DroppedContribution – value of particular damage bit
  • OutOfOrder – value of particular damage bit
  • OutOfSynch – value of particular damage bit
  • UserDefined – value of particular damage bit
  • IncompleteContribution – value of particular damage bit

Mask dataset

The special dataset "_mask" contains a value calculated from the damage bits and the type of the data; it is non-zero when the corresponding data are usable. If the mask value for a specific event is zero, then one should not attempt to read or interpret the data from the regular datasets, as the data may not be consistent. Currently, when the mask value is zero the corresponding entries in the data datasets are zero-filled, but that can change in the future. The mask dataset element type is an integer, so the dataset is an integer array with one number per recorded event.

Old versions of the translator did not create mask or damage datasets; instead, all damaged data were simply dropped and never saved. That led to misaligned datasets for different types of data when one type could have damaged events. With the new scheme (which can be detected by the presence of "_mask" datasets) the translator saves damaged events as well. There is still no strong guarantee that datasets from different data types or sources are always aligned; offline cannot provide such a guarantee because of the nature of the XTC format and DAQ operation.
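
Code that has to read both old and new files can select usable entries with a helper along the following lines. This is a sketch in Python with a dict of lists standing in for a data group. The name usable_indices is hypothetical. For groups without a "_mask" dataset it assumes that every stored entry is usable, since old translators never saved damaged data.

```python
def usable_indices(group):
    """Return the indices of entries whose data may be used.

    group maps dataset names to sequences and must contain "time".
    If "_mask" is present, only entries with a non-zero mask are kept;
    otherwise all entries are returned.
    """
    count = len(group["time"])
    mask = group.get("_mask")
    if mask is None:
        return list(range(count))
    return [i for i in range(count) if mask[i] != 0]


new_style = {"time": [(1, 0), (1, 5), (2, 0)], "data": [7, 0, 9], "_mask": [1, 0, 1]}
old_style = {"time": [(1, 0), (2, 0)], "data": [7, 9]}
print(usable_indices(new_style))
print(usable_indices(old_style))
```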

Special cases – EPICS

For EPICS PV data we do not store mask or damage datasets; EPICS data groups contain only "data" and "time" datasets.

EPICS data that come in the Configure transition have a slightly different format than those from L1Accept transitions (control vs. time data structures in EPICS terms). As a result, the "data" dataset has a different structure (type) in Configure and CalibCycle groups.

One has to remember that EPICS PV data is not stored on every shot, so the number of entries in an EPICS data group should be much lower than the total number of events.

Special cases – BLD

Beam line data is special because BLD devices are not controlled by the DAQ and their life cycle is not synchronized with DAQ transitions. As a result, both event-type data and configuration data from BLD can appear in both Configure and L1Accept transitions. Effectively this makes configuration data that comes from BLD sources very similar to event-type data, because a configuration object may appear multiple times in L1Accept during the same run.

This complicates the dataset structure. First, because true event-type data from BLD can appear inside the Configure transition, there will be corresponding event-type datasets (array datasets) in the groups inside the /Configure:0000 group. Here is an example of BLD EBeam datasets inside /Configure:

/Configure:0000/Bld::BldDataEBeamV3 Group
/Configure:0000/Bld::BldDataEBeamV3/EBeam Group
/Configure:0000/Bld::BldDataEBeamV3/EBeam/_damage Dataset {3/Inf}
/Configure:0000/Bld::BldDataEBeamV3/EBeam/_mask Dataset {3/Inf}
/Configure:0000/Bld::BldDataEBeamV3/EBeam/data Dataset {3/Inf}
/Configure:0000/Bld::BldDataEBeamV3/EBeam/time Dataset {3/Inf}

In the example above the arrays have 3 entries; this is the result of multiple Configure transitions coming from multiple (3 in this case) streams. All these entries should be identical, and starting with schema version 3 these multiple entries should be compressed into just one entry. BLD event data that comes in L1Accept is stored inside CalibCycle:NNNN groups as usual.

The second complication affects BLD configuration data. As explained above, configuration data from BLD effectively becomes event-type data, and it is also stored as event-type data in HDF5. In particular, in schema versions before 3 the data that comes inside the Configure transition is stored in event-type (array) datasets in /Configure:0000 groups, just like the event-type data explained above. Starting with schema version 3, BLD configuration data from the Configure transition is stored as a scalar dataset, just like any other type of configuration data. BLD configuration data that appears in L1Accept transitions is always stored in event-type (array) datasets inside .../CalibCycle:NNNN/ groups.

Group attributes

Various groups in a file can define attributes that serve as metadata describing the content of a group or of the whole file. Below is a description of the existing attributes.

Schema attributes

A small set of attributes provides information about the structure of a file; they are usually defined at the global level on the topmost "/" group:

  • :schema:version – integer number, defines schema version of the whole HDF5 file
  • :schema:timestamp-format – string, either "full" or "short", defines the structure of "time" datasets in event-data groups
  • :schema:bld-shared-split – integer number, non-zero means that shared BLD data is split into individual components
  • :schema:bld-config-as-evt – integer number, if non-zero then BLD configuration data types are stored in event-type datasets (arrays)

Global attributes

  • UUID – string, defines unique file identifier (not used currently)
  • created – string, time when the file was created
  • experiment – string, name of the experiment
  • origin – string representing the user or process who created the file ("translator" is probably most common)
  • runNumber – integer, run number
  • runType – string, currently contains "DATA"

Start/end times

Groups that correspond to DAQ transitions (Configure, Run, and CalibCycle) may define attributes which store start/end times of their corresponding intervals:

  • start.seconds – integer, whole seconds of the starting timestamp
  • start.nanoseconds – integer, nanoseconds part of the starting timestamp
  • end.seconds – integer, whole seconds of the end timestamp
  • end.nanoseconds – integer, nanoseconds part of the end timestamp

In some cases (when the corresponding transition is missing from a file) timestamps may not be known; in this case the corresponding attributes will be missing.
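
Because the start/end attributes are optional, code that uses them should handle their absence. The Python sketch below computes the length of an interval in seconds and returns None when any of the four attributes is missing. The name interval_seconds is hypothetical, and attrs is assumed to be a dict-like mapping of attribute names to integer values.

```python
_INTERVAL_ATTRS = (
    "start.seconds",
    "start.nanoseconds",
    "end.seconds",
    "end.nanoseconds",
)


def interval_seconds(attrs):
    """Return the duration of a Configure/Run/CalibCycle interval in seconds.

    Returns None if any of the start/end attributes is missing.
    """
    if any(name not in attrs for name in _INTERVAL_ATTRS):
        return None
    # Work in integer nanoseconds to avoid rounding errors in the subtraction.
    start = int(attrs["start.seconds"]) * 10**9 + int(attrs["start.nanoseconds"])
    end = int(attrs["end.seconds"]) * 10**9 + int(attrs["end.nanoseconds"])
    return (end - start) / 1e9


print(interval_seconds({
    "start.seconds": 10, "start.nanoseconds": 500000000,
    "end.seconds": 12, "end.nanoseconds": 0,
}))
print(interval_seconds({"start.seconds": 10, "start.nanoseconds": 0}))
```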

Data group attributes

Individual data groups (groups that contain datasets) may have these attributes defined:

  • _schemaVersion – integer, schema version number used for defining datasets (their number and types)
  • _xtcSrc – 64-bit integer, representation of DAQ source (Pds::Src value) where data originated

Schema versions and features

This section lists all known schema versions and the features that existed or continue to exist. The schema version with which a particular file was generated can be determined from the global attribute :schema:version; if the attribute does not exist, the schema version should be assumed to be 1.

Schema version 1

This is the schema that was used since the beginning of experiments until November 2012. Some features of this schema version:

  • "time" datasets had only seconds and nanoseconds fields, fiducials, ticks and other fields are not defined
  • schema attributes are not defined

Schema version 2

This version was introduced around November 2011. Changes since previous schema version:

  • added schema attributes

Some features are dynamic and can be turned on or off independently of the schema version number. Here are a few such features:

  • time dataset format: if the global attribute :schema:timestamp-format is present and contains "full" then time datasets have all the fields defined above, otherwise only the seconds and nanoseconds fields are defined (EPICS time datasets always have only seconds/nanoseconds)
  • damage/mask datasets: if these datasets are present then their content can be used to filter out unusable data; if they are missing then damaged data was simply discarded
  • shared BLD data: if the global attribute :schema:bld-shared-split is present and is non-zero then shared BLD data is split into individual objects before being stored in HDF5. Recently the DAQ started splitting shared BLD before writing to XTC, so this should not matter anymore for recent data.
  • BLD configuration data: if global attribute :schema:bld-config-as-evt is present and is non-zero then configuration data from BLD sources is stored in the same way as regular event data (in rank-1 array datasets instead of scalar datasets)
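
The attribute-based features in this list can be collected into one small structure when a file is opened. The Python sketch below does this for a dict-like root_attrs mapping. The function name schema_features and the keys of the returned dictionary are hypothetical. String attributes are accepted both as text and as bytes, since HDF5 bindings differ in how they return strings.

```python
def _as_text(value):
    """Decode bytes to str; leave other values unchanged."""
    return value.decode() if isinstance(value, bytes) else value


def schema_features(root_attrs):
    """Summarize the dynamic schema features advertised on the "/" group.

    A missing attribute is treated as "feature not enabled".
    """
    return {
        "full_timestamps":
            _as_text(root_attrs.get(":schema:timestamp-format")) == "full",
        "bld_shared_split":
            int(root_attrs.get(":schema:bld-shared-split", 0)) != 0,
        "bld_config_as_evt":
            int(root_attrs.get(":schema:bld-config-as-evt", 0)) != 0,
    }


print(schema_features({":schema:timestamp-format": "full",
                       ":schema:bld-shared-split": 1}))
```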

Schema version 3

This version was introduced around September 2013. Changes since previous schema version:

  • BLD configuration data from Configure transition is stored in scalar datasets instead of event-type array datasets in /Configure:0000 group
  • translator now processes only one Configure transition instead of one-per-stream, as a result number of entries in event-type datasets in /Configure:0000 groups should be 1.

Schema version 4

This version was introduced around February 2014. It introduced some new features and changes discussed in the Outdated: The XTC to HDF5 Translator page. In particular,

  • CsPad calibration constants have moved
  • Some File attributes are not stored
  • A few integer types changed size
  • All EPICS PVs are stored in the source folder EpicsArch.0:NoDevice.0. Previously they could appear in several folders. As EPICS PV names uniquely identify the data, the source information should not be needed.
  • OutOfOrder damage is by default no longer translated.
  • Creation of a Filtered:000x group under run when users filter events using "do_not_translate" key.
  • For split calib cycle translation, schema for split calib files started at CalibCycle:000x rather than /Configure:0000.

Schema version 5

This version was introduced around February 2015 with Translator Tag V00-02-15 and ana-release 0.13.17.

  • Removed the Filtered:000x groups introduced in schema 4.
  • Changed schema for split files - it now looks the same as schema for main/master file
  • Translator will look for user data from Psana modules during begin/end job/run/calibcycle. Data found during the end* methods of a Psana module can trigger creation of an EndData subgroup in either CalibCycle:000x, Run:000x or Configure:000x groups.
  • We would like to reserve the creation of a new group from root for a simpler, source as opposed to type, based alternative schema. Something like
    • /Configure:0000
    • /usr
    This would be in addition to the existing schema, providing a simpler hierarchy to the data.

