Basic idea

The calibration suite scripts can now ingest .h5 data written by any script as well as official psana data.  We communicate info to the .h5 file using Python dicts, i.e. collections of key:value pairs.  This page describes how to organize and fill such a .h5 file.

File header

Each file (= run) should declare what kind of data it contains - low flux, intensity scan, dark, pedestal, ...  The .h5 file should have runDict['ancillaryData']['scriptType'] = 'delayScan' or similar.  A list of recognized types can be found in the calibration suite script foo.py: as of this writing, the list is 'foo', 'bar', 'fooBar', ...

The header should also have 'acquisitionScript', 'commandLine', perhaps 'version', 'date', 'location' (e.g. 'Bldg84' or 'ASC') and in particular 'run', which I figure should be seconds since the epoch to ensure uniqueness.
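As a minimal sketch of such a header, the snippet below fills the fields listed above. The helper name make_run_header is hypothetical, and placing the extra fields next to 'scriptType' under 'ancillaryData' is an assumption, since this page does not fix where they live.

```python
import sys
import time
from datetime import datetime, timezone


def make_run_header(script_type, location, version=None):
    """Build a runDict holding only the header fields (hypothetical helper)."""
    now = time.time()
    header = {
        'scriptType': script_type,            # e.g. 'delayScan'
        'acquisitionScript': sys.argv[0],     # script that took the data
        'commandLine': ' '.join(sys.argv),    # full command line as typed
        'date': datetime.fromtimestamp(now, tz=timezone.utc).isoformat(),
        'location': location,                 # e.g. 'Bldg84' or 'ASC'
        'run': int(now),                      # seconds since the epoch
    }
    if version is not None:
        header['version'] = version
    # Assumption: header fields sit alongside 'scriptType' in 'ancillaryData'.
    return {'ancillaryData': header}


runDict = make_run_header('delayScan', 'ASC')
```

Using whole seconds for 'run' is only unique if two acquisitions never start within the same second on the same setup.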

Detector data format

Psana uses the xtc or xtc2 format.  For us, what's important is that the data is organized into steps, each representing a state of the system such as a detector configuration (pedestal run), a flux (linearity scan) or a time delay.  Each step contains some number of events, and that number can vary from step to step.  Each event is just some data at a particular time, like a detector readout or BLD data.  This arrangement is a ragged array, so in numpy one will probably want to build an array with n steps and m events per step, where m is the largest number of events in any step; the code that reads the file will then return None for any event index >= nEvents[step].  Or we could organize the data in the input .h5 as a dict:

{'step0': data0, 'step1': data1, ...} or more simply {0: data0, 1: data1, ...}, so that runDict['detectorData'] = {0: data0, 1: data1, ...}

There must always be at least one step, even if the run is just a simple dark or a plain Fe55 collection.

The per-event detector data must be

[module, row, col]

even if there is just one module.  So a pure numpy representation of a run would be

[step, event, module, row, col].
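The padded pure-numpy layout could be built roughly as follows. This is a sketch only: the function names pad_steps and get_event and the zero fill value are assumptions, not part of the calibration suite.

```python
import numpy as np


def pad_steps(steps, fill=0):
    """Stack ragged per-step arrays shaped [event, module, row, col] into one
    [step, event, module, row, col] array, padding short steps with `fill`."""
    n_events = np.array([len(s) for s in steps])
    m = int(n_events.max())                  # largest event count of any step
    frame_shape = steps[0].shape[1:]         # (module, row, col)
    out = np.full((len(steps), m) + frame_shape, fill, dtype=steps[0].dtype)
    for i, s in enumerate(steps):
        out[i, :len(s)] = s                  # real events first, padding after
    return out, n_events


def get_event(data, n_events, step, event):
    """Return one [module, row, col] frame, or None for a padded slot."""
    if event >= n_events[step]:
        return None
    return data[step, event]


# Two steps with 3 and 1 events, one module of 2x2 pixels.
steps = [np.ones((3, 1, 2, 2), dtype=np.uint16),
         np.ones((1, 1, 2, 2), dtype=np.uint16)]
data, n_events = pad_steps(steps)
```

The per-step event counts have to be stored next to the padded array, otherwise a reader cannot tell padding from real frames.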

Probably the dict approach is most flexible, and the data per step would be 

[event, module, row, col].
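A sketch of the dict approach, with one [event, module, row, col] array per step stored under 'detectorData'. The helper add_step and its shape check are illustrative assumptions.

```python
import numpy as np


def add_step(run_dict, step, frames):
    """Store one step of detector data as [event, module, row, col]."""
    frames = np.asarray(frames)
    if frames.ndim != 4:
        # The module axis is required even for a single-module detector.
        raise ValueError('expected [event, module, row, col], got shape %s'
                         % (frames.shape,))
    run_dict.setdefault('detectorData', {})[step] = frames


runDict = {}
add_step(runDict, 0, np.zeros((5, 1, 4, 6)))   # 5 events in step 0
add_step(runDict, 1, np.zeros((3, 1, 4, 6)))   # 3 events in step 1, no padding
```

Because each step carries its own array, steps with different event counts need no padding and no separate event-count table.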

The detector array should be in a group named for the detector, e.g. 'epixM'; additional per-event data not tied to a stepwise scan should be in its own group, e.g. 'wave8'.  If there is more than one detector array, they must use parallel step/event structures.  There should additionally be a timestamp per event if the script wants to make that available; this is often useful, for example when looking at the effect of configuration changes between steps.  Alternatively, the various detectors could be placed under each step subgroup.
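The parallel-structure requirement can be checked before writing the file. The sketch below assumes each detector is held as a dict mapping step to a sequence of events; the function name check_parallel and the sample values are made up for illustration.

```python
def check_parallel(*step_dicts):
    """Return True if every detector has the same steps and the same
    number of events in each step."""
    def counts(d):
        return {step: len(events) for step, events in d.items()}

    reference = counts(step_dicts[0])
    return all(counts(d) == reference for d in step_dicts[1:])


# Fake data: two steps with 3 and 2 events in each detector.
epixM = {0: ['frame'] * 3, 1: ['frame'] * 2}
wave8 = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5]}
wave8_short = {0: [0.1, 0.2, 0.3], 1: [0.4]}

ok = check_parallel(epixM, wave8)
bad = check_parallel(epixM, wave8_short)
```

The same check applies to per-event timestamps if they are stored with the same step/event layout.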

Ancillary data format(s)

Scan scripts will also use additional data to describe what's going on for each step.  We could also have a secondary detector (e.g. a wave8 or some other detector reporting flux) for each event.  For scans, you can save one value per step or one value per event per step.  For example:

runDict['ancillaryData'] = {}
runDict['ancillaryData']['ancillaryDataType'] = 'foo'
runDict['ancillaryData']['foo'] = {}
Then in your step loop:
runDict['ancillaryData']['foo'][step] = delay (or voltage or flux). [Per step]
or inside the event loop per step (after making a per-step dict, runDict['ancillaryData']['foo'][step] = {}):
runDict['ancillaryData']['foo'][step][event] = delay (where delay or whatever should be the same for every event in the step).
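Put together as runnable code, the two variants might look like the sketch below. The key name 'delay' stands in for 'foo', and the delay values and event counts are invented for illustration.

```python
def fill_ancillary(values, n_events, per_event=False, data_type='delay'):
    """Build runDict['ancillaryData'] for a scan with one value per step,
    optionally repeated for every event in that step."""
    run_dict = {'ancillaryData': {'ancillaryDataType': data_type}}
    scan = run_dict['ancillaryData'][data_type] = {}
    for step, value in enumerate(values):
        if not per_event:
            scan[step] = value               # one value per step
            continue
        scan[step] = {}                      # per-step dict made first
        for event in range(n_events[step]):
            scan[step][event] = value        # same value for every event
    return run_dict


delays = [0.0, 0.5, 1.0]        # made-up scan values
n_events = [3, 2, 4]            # made-up number of events in each step

per_step = fill_ancillary(delays, n_events)
per_event = fill_ancillary(delays, n_events, per_event=True)
```

The per-event form stores redundant values but lets a reader index ancillary data exactly as it indexes the detector data.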

If we decide we're never going to have two types of ancillary data in a file we could skip nesting with dataType here.

Example script to write a .h5 file with test data

See writeFakeRogueH5.py in the calibration suite git repos here for an example.
