Introduction
This document is for developing a new schema for LCLS hdf5 files. The schema defines how the data is laid out in the hdf5 files, and the user interface to that data. There are two main parts to the schema:
- group hierarchy - the names the user works through to navigate to the data
- data types in datasets - the actual datatypes used in the datasets. This includes the names of all the subfields in compound data types (which look like C structs in the data)
Presently we are only proposing changes to the group hierarchy. Issues we wish to address in the current schema:
- Presently our hierarchy requires users to go through a potentially long list of Type names before they get to the data. These typenames come from the C++ code and can be complicated.
- Users are generally more familiar with sources, and in particular the DAQ aliases for the sources. These are currently not in our hierarchy.
- CalibCycle is not intuitive, and is misleading for XPP because of when calibration data is created. XPP would like to call CalibCycles steps.
There are several other things that seem good to do as well. These are listed below. A few alternatives that were considered are also discussed below.
Initially, this schema could sit alongside the current schema and use soft links to the actual data. This would not break anybody's code. However, two schemas that do the same thing add confusion, so I am interested in developing something robust enough that we could use it to replace the current schema. The schema below should be readable by frameworks as well as by users browsing hdf5 files. We do not want to make schema changes that break people's code unless necessary. If we are going to change the schema, we would like to cover all the issues we can at once. If you have any comments or suggestions, please feel free to use confluence to add comments to the bottom of the document, add to the document, or email me at davidsch@slac.stanford.edu.
Current Schema
Here is an example of the current schema.
List of Schema Changes
Here are the changes to make:
- We don't need to number Configure:0000 and Run:0000, since we only translate one run per file.
- Let's make a root group called Data. Main groups:
/Data
/Data/Run
- Make a new group for Configure Data
Now, we have
/Configure:0000/TypeA
/Configure:0000/TypeB
/Configure:0000/Run:0000
Why not put all Config data (TypeA, TypeB) in one place?
/Data
/Data/Config # this in turn will have TypeA and TypeB as children
/Data/Run
Put Epics in its own group and remove Epics Source name with Arch from Schema
/Data/EpicsConfig
/Data/Config
/Data/Run
/Data/Run/Step:0000/Epics/pvName
Invert Type/Src relationship
/Data/Config/SrcA/TypeA
/Data/Config/SrcA/TypeB
/Data/Config/SrcB/TypeA
Use DAQ aliases when possible.
/Data/Config/alias
/Data/Config/alias/TypeA
/Data/Config/alias/TypeB
/Data/Config/SrcB/TypeA
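As a minimal sketch of the inverted Src/Type layout with DAQ aliases, the path for a config group could be assembled as below. The function name config_group_path and the daq_aliases dictionary are hypothetical and only illustrate the idea; they are not part of the Translator.

```python
def config_group_path(src, type_alias, daq_aliases=None):
    """Return /Data/Config/<alias or src>/<type alias>.

    daq_aliases is a hypothetical dict mapping full DAQ source names to
    their aliases. A source without an alias keeps its full name.
    """
    daq_aliases = daq_aliases or {}
    src_group = daq_aliases.get(src, src)
    return "/".join(["", "Data", "Config", src_group, type_alias])


# SrcA has an alias, SrcB does not.
aliases = {"SrcA": "alias"}
print(config_group_path("SrcA", "TypeA", aliases))  # /Data/Config/alias/TypeA
print(config_group_path("SrcB", "TypeA", aliases))  # /Data/Config/SrcB/TypeA
```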
Translator option for an alias
Sometimes inverting type/src makes it harder to find data. An example is timetool data. It will be attached to a source like opal_1, but the user may not know this. They will be looking for timetool data, and now they have to go through all the sources. The thought is to allow the user to specify a one-group alias name for a src/type combination. So, while the hdf5 file has
/Data/Run/Step:0000/
/Data/Run/Step:0000/opal_1/TimeToolData
we'll also create
/Data/Run/Step:0000/TimeToolData {Soft Link to} /Data/Run/Step:0000/opal_1/TimeToolData
Use type aliases in place of full C++ type names with version
Basically, we will eliminate the V* version suffix and the :: from the type names.
See the section below for all aliases.
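The rule of dropping the version and the :: can be sketched in a few lines. This is an illustration under assumptions, not the Translator's implementation: the function name type_alias and the overrides argument are hypothetical, and overrides stands in for the exceptions to the rule (for instance, elsewhere in this document CsPad::DataV2 ends up in a group called CsPadElement).

```python
import re


def type_alias(cpp_type_name, overrides=None):
    """Derive a type alias by removing a trailing V<digits> and every '::'.

    overrides is a hypothetical dict of exceptions that do not follow
    the general rule; it is consulted first.
    """
    if overrides and cpp_type_name in overrides:
        return overrides[cpp_type_name]
    without_version = re.sub(r"V\d+$", "", cpp_type_name)
    return without_version.replace("::", "")


print(type_alias("UsdUsb::ConfigV1"))  # UsdUsbConfig
print(type_alias("UsdUsb::DataV1"))    # UsdUsbData
print(type_alias("CsPad::DataV2", {"CsPad::DataV2": "CsPadElement"}))  # CsPadElement
```

Note that different versions of the same type map to the same alias, which is exactly what makes the group name collisions discussed later in this document possible.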
Use Step:000x rather than CalibCycle:000x
Separate config and epics in steps
/Data/Run/Step:0000/Config
/Data/Run/Step:0000/Epics
Psana Module Keystrings treated like types
If a module does
evt.put(myndarray,src,"mykey")
then we translate
/Data/Run/Step:0000/src/mykey
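A minimal sketch of how the key string could take the place of the type in the group path. The function name step_group_path is hypothetical. The noSrc fallback for data put without a source is taken from the collision discussion later in this document.

```python
def step_group_path(step, name, src=None):
    """Return /Data/Run/Step:<step>/<src>/<name> for event data in a step.

    name is either a type alias or, for psana module output, the key string.
    Data without a source goes under the group 'noSrc'.
    """
    src_group = src if src is not None else "noSrc"
    return "/Data/Run/Step:{:04d}/{}/{}".format(step, src_group, name)


print(step_group_path(0, "mykey", "src"))  # /Data/Run/Step:0000/src/mykey
print(step_group_path(0, "mykey"))         # /Data/Run/Step:0000/noSrc/mykey
```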
List of Type Aliases
There are three classes of types in use from the event store: DAQ, CalibStore, and user.
DAQ
Below is a list of type aliases for DAQ types. For the most part, the alias is formed by removing the version and the ::. A few exceptions are marked with a - character.
Note, the shared types should not show up in the translation. Psana pre-processes them and puts the sub types in the event.
Calib Store
We also need to introduce simpler type names for the calibStore types:
CsPad2x2Pedestals pdscalibdata::CsPad2x2PedestalsV1
CsPad2Pedestals pdscalibdata::CsPadPedestalsV1
This is not a complete list; calib store types are not in the DDL.
User Types
This refers to types the Translator finds in the Event that other Psana modules place there. Per the schema change "Psana Module Keystrings treated like types", we will not be using a type, just the key string, so no alias is required. If for some reason a user adds an ndarray or string to the event without a keystring, then we'll use the aliases below:
ndarray will be an alias for all of these: ndarray<T,R>, ndarray<const T,R> as well as the special vlen versions of these ndarrays that the Translator understands
string std::string
Alternatives
There are a few alternatives I was thinking about.
Remove the Types
Removing the types is complicated because there can be several types associated with one source. If one puts all the datasets associated with the different types into one group, dataset names can collide (for example standard dataset names like 'data', 'config' or 'image'). Moreover, the different types may have different _damage or _mask datasets. More important for users, the different types may have different time datasets, which affects alignment. However, another project is to align the 'DAQ readout groups', which means all types from each source will be aligned.
Smaller and fewer Type Aliases
Just use one alias for both config and data. For example:
UsdUsb UsdUsb::ConfigV1, UsdUsb::DataV1
The thinking is that since config vs data is in the hierarchy paths, it will be clear from context, i.e.
/Data/Run/Step:0000/srcAlias/Config/UsdUsb # the config data
/Data/Run/Step:0000/srcAlias/UsdUsb # the event data
vs
/Data/Run/Step:0000/Config/srcAlias/UsdUsbConfig # the config data
/Data/Run/Step:0000/srcAlias/UsdUsbData # the event data
The drawback is a higher risk of a name collision (see problems below). For instance if there is both config and regular event data occurring during the event, then the Translator will try to put them in the same group.
A new Group for EventData
Since there is a Config group, it seems like a good idea to have a group for EventData, i.e:
/Data/Run/Step:0000/Config/UsdUsbConfig # the config data
/Data/Run/Step:0000/EventData/UsdUsbData # the event data
However, some may find that this new group gets in the way.
Just starting with a Run group
Have the hierarchy start here
/Run
/Run/Config
/Run/EpicsConfig
/Run/Step:0000
/Run/Step:0000/Config/srcAlias/UsdUsbConfig # the config data
/Run/Step:0000/srcAlias/UsdUsbData # the event data
The drawback is that this is not how the xtc data is formed. In xtc files, the beginRun transition is preceded by the Configure transition. Collapsing the information from both transitions into a Run group is probably reasonable, but makes it more awkward to recover information that belonged to one xtc transition and not the other (users generally don't care about this, but a framework like psana does).
Problems/Issues/Surprises
Group Name Collisions
A group name collision occurs when the Translator has already made a group for one kind of data, when all of a sudden another kind of data comes along with the same name.
Presently, there should not be a collision. If one happens, it is treated as a fatal error.
They don't happen because currently there is a near 1-1 mapping between the psana event keys from which the Translator gets the data, to the group names. This mapping uses the distinct pairs of C++ type names and DAQ source names in the event keys.
That means one can always add a new Type without colliding with existing types, as long as the Translator uses a fully qualified C++ typename for the group name.
Likewise for DAQ sources.
An example of a collision would be
- Daq Alias called noSrc
- user does evt.put(myndarray,'mykey')
The Translator already uses the string noSrc for user data without a source - collision.
Another example would be
- user does evt.put(myarray, psana.Source('cspad_front'), "CsPadElement") # myarray is an ndarray
- user does evt.put(mystring, psana.Source('cspad_front'), "CsPadElement") # mystring is a Python string
- The xtc files include CsPad::DataV2 from source 'cspad_front'
All three of these want to go to /Data/Run/Step:0000/cspad_front/CsPadElement
I think, though, that collisions will be rare and most likely something a user can avoid by specifying different output keys for the psana modules they load. So the default will be to throw a fatal error, but I'll add an option to make this a non-fatal error and have the Translator rename the colliding group.
For example, there is some old data where occasionally a CsPad::DataV1 would be sent while almost all the data was DataV2. This was to debug the new compression being used. In the non-fatal mode, the Translator will start with the type alias above, CsPadElement, based on the first data it sees.
The next data will get called CsPadElement_01. The options considered were:
- Rename the first group, so that users get the original messy names CsPad::ElementV1 and CsPad::ElementV2. However, if reading while writing ever works, this seems very problematic: what if you started reading from a group that then got renamed?
- Make a new name for the second group, CsPadElement_01. This seems more reasonable.
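The fatal and non-fatal collision behavior described above can be sketched as follows. This is only an illustration under assumptions: the class GroupNamer, its method name_for, and the fatal flag are hypothetical names, not the Translator's actual option or API.

```python
class GroupNamer:
    """Hand out group names; on a collision either fail or rename the newcomer."""

    def __init__(self, fatal=True):
        self.fatal = fatal
        self._owner = {}     # group name -> full type name that owns it
        self._assigned = {}  # (alias, full type name) -> group name

    def name_for(self, alias, full_type):
        key = (alias, full_type)
        if key in self._assigned:
            # Same kind of data seen before: keep using its group.
            return self._assigned[key]
        name = alias
        if name in self._owner:
            if self.fatal:
                raise RuntimeError(
                    "group name collision: %s already used by %s"
                    % (alias, self._owner[alias])
                )
            # Non-fatal mode: the first group keeps its name, the
            # newcomer gets the next free _NN suffix.
            n = 1
            while "%s_%02d" % (alias, n) in self._owner:
                n += 1
            name = "%s_%02d" % (alias, n)
        self._owner[name] = full_type
        self._assigned[key] = name
        return name


namer = GroupNamer(fatal=False)
print(namer.name_for("CsPadElement", "CsPad::DataV2"))  # CsPadElement
print(namer.name_for("CsPadElement", "CsPad::DataV1"))  # CsPadElement_01
print(namer.name_for("CsPadElement", "CsPad::DataV2"))  # CsPadElement
```

Because the first group is never renamed, a reader that has already opened CsPadElement is not affected when the second kind of data shows up.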
Document Attributes
With corner cases like that, users and frameworks may find they need to know exactly what type they are dealing with. This will be stored in hdf5 attributes on the groups (exactly how to extract this information will be documented for users and framework writers).
Programmatic Interface
The programmatic interface to the new schema is more difficult. Without using the exact information in the attributes, that is, basing your code only on the group names, some issues are:
- When you read the group
/Data/Run/Step:0000/EvrData
you don't know if you are reading a V3 or a V4. If it is V4, there will be two datasets (data and present), but for V3 there will only be one.
- In general, the full type information must be discovered by looking at the types in the datasets, as well as the number of datasets.
- Another place where you might like to use full names is looping over sources by the id. Suppose the experiment has the four sources
XppGon.0:Cspad2x2.0, XppGon.0:Cspad2x2.1, XppGon.0:Cspad2x2.2 and XppGon.0:Cspad2x2.3
but they have been aliased to cspad2x2_front, cspad2x2_left, cspad2x2_right, cspad_2x2_back.
- You need to know the DAQ aliases to find the data. This may make things more difficult for a framework. Another idea is to write a separate group with the full DAQ source name, and have the DAQ alias be a soft link to this group.
So I think it is important to have the full source and typename available in the group attributes.
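As a rough sketch of why such attributes help a framework, the lookup below goes from the full DAQ source name back to the aliased group. The per-group attributes are modeled as plain dictionaries rather than read from an hdf5 file, and the attribute name "src" and the function groups_by_source are assumptions made for illustration; the actual attribute names are not yet defined.

```python
def groups_by_source(group_attrs, attr_name="src"):
    """Map full DAQ source name -> group name, using per-group attributes.

    group_attrs models the attributes of each source group as a dict:
    {group name: {attribute name: value}}.
    """
    return {
        attrs[attr_name]: group_name
        for group_name, attrs in group_attrs.items()
        if attr_name in attrs
    }


# The four aliased sources from the example above.
group_attrs = {
    "cspad2x2_front": {"src": "XppGon.0:Cspad2x2.0"},
    "cspad2x2_left": {"src": "XppGon.0:Cspad2x2.1"},
    "cspad2x2_right": {"src": "XppGon.0:Cspad2x2.2"},
    "cspad_2x2_back": {"src": "XppGon.0:Cspad2x2.3"},
}

by_src = groups_by_source(group_attrs)
# Loop over the sources by id without knowing the aliases in advance.
for i in range(4):
    print(i, by_src["XppGon.0:Cspad2x2.%d" % i])
```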
Could DAQ Aliases be Confusing?
For example, if there are mistakes in the aliases, or if just one of several similar sources is aliased, then a user browsing the hdf5 file would see
/Data/Config/evr0/Evr # DAQ alias
/Data/Config/NoDetector.0:Evr.1/Evr # no alias
Is Src/Type ever more confusing than Type/Src?
If DAQ aliases are not used for all sources, there can be a number of technical looking source names that show up. For instance
NH2-SB1-IPM-01
New Schema
This is what the new schema might look like.
Let's say the user has specified two shortcuts
Opal_1/TimeToolConfig -> MyTimeToolConfig
Opal_1/TimeToolData -> MyTimeToolData
and that calibrated data is translated. They are also translating ndarrays and strings from psana modules that output during BeginRun, BeginCalibCycle, EndCalibCycle and EndRun, as well as during regular events.
Feedback
Here is some feedback I have gotten.
Keep hierarchy close to EventKeys
Simplifying the hierarchy too much could be confusing. Keeping closer to what one sees with psana EventKeys is helpful.
Original Sources are Useful
Just having the DAQ aliases may not be good. One could put both the DAQ alias and the native source in the name, or have them side by side with one a link, or keep aliases separate from the original native source names, in different groups.
Compound Types vs. Basic Types
Use basic types in place of compound types.
Flattened Input
A tool to gather up and event-build the particular data a user is interested in. This may be a few fields from EBeam, a particular EPICS PV, or links to camera images.