Cube data - Overview

'Cube' is the nickname we gave to the intermediate analysis result: a 2D detector image as a function of certain variables, meaning the first level of data reduction is achieved through averaging. The binning variable is most typically the timetool-corrected delay time, but can be other variables as well, such as laser power, temperature, etc. Technically, the code does not average, but rather sums the requested data for all events in a given bin that pass some filter. The number of events in each bin is also stored.
The code has been rewritten to use the SmallDataAna(_psana) interface, as that allows a more flexible definition of both binning variables and selection variables using derived variables, rather than being restricted to the values saved directly in the hdf5 file. As for the SmallData production, there is a driver script (makeCube) and a "production" python file called "MakeCube.py". The relevant lines of an example production file are:

ana.addCut('lightStatus/xray',0.5,1.5,'on')
ana.addCut('lightStatus/laser',0.5,1.5,'on')
ana.addCube('cube','delay',np.arange(13.,15.,0.05),'on')
cs140Dict = {'source':'cs140_rob','full':1}
ana.addToCube('cube',['ipm2/sum','ipm3/sum','diodeU/channels',cs140Dict])
anaps.makeCubeData('cube', onoff=2)

Cubing - Step-by-step

Now let us dissect what this is doing:
ana.addCut('lightStatus/xray',0.5,1.5,'on')
ana.addCut('lightStatus/laser',0.5,1.5,'on')
We are defining an event selection called "on". At this point we only require that both laser and X-rays are present. Typically one would add requirements on the incoming intensity and, if interested in the timetool, some quality requirement on the timetool signal.
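For instance, such additional cuts could look like this (variable names and thresholds are illustrative, not taken from the example above):

ana.addCut('ipm2/sum', 0.1, 10., 'on')   # incoming intensity window (example values)
ana.addCut('tt/AMPL', 0.02, 10., 'on')   # timetool signal quality (example variable and values)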
ana.addCube('cube','delay',np.arange(13.,15.,0.05),'on')
Here we are defining a cube called "cube": we give it a name (here "cube"), a variable we want to bin in (here "delay"), the bins we would like to use for the binning and lastly the name of the filter/event selection we defined previously (here "on").
cs140Dict = {'source':'cs140_rob','full':1}
ana.addToCube('cube',['ipm2/sum','ipm3/sum','diodeU/channels',cs140Dict])
Now we specify what data we would like to bin. You can either pass the names of variables in the littleData or the name of detectors in the "big" data. The latter is passed as a dictionary with the source name (the alias) and keys describing what information you would like to add to the cube. The main use case is the full data, requested as above with a 'full':1 key-value pair (the 1 is unimportant; the code only checks for the presence of the "full" key).
anaps.makeCubeData('cube', onoff=2)
At last, we now make the cube. Note that we are calling this on "anaps" (!). "ana" has the same function, but it will only bin the data present in the smallData file (or the derived fields attached to the xarray); it will quietly ignore variables that can only be obtained from the xtc files. Because the "anaps" version gets data from the xtc files, you will want to run it with MPI using the driver script; checking the cube definition (correct definition of bins, ...) can be done using "ana" interactively.
The cube name will be used to name the hdf5 file that gets written by the function. The "ana" version by default will NOT write a file and only returns a dictionary with the binned data; it has a parameter that will make it write an hdf5 file. The "anaps" version will always write the hdf5 file, as this is integral to how the events are distributed among cores and how the data is reassembled in the end.
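To check a cube definition interactively before launching a production job, one can thus call the "ana" version (a sketch; only smallData variables are binned and no file is written by default):

cubeDict = ana.makeCubeData('cube')   # returns a dictionary with the binned smallData
print(cubeDict.keys())                # inspect the binned variables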

Binning variables

The primary binning variable is defined in the cube definition. It needs to be either a variable originally in the smallData or an added variable. Using "delay" will create a derived variable for the X-ray/laser delay using the scan variable (if applicable), the timetool and the fast delay stage encoder value. If no bins are passed, the code will try to use np.unique(scanValue), which will only work for "step" scans.
In addition to the primary binning variables, you can now add more binning variables to make a higher dimensional "cube". This can be done like this:
ana.add_BinVar(addBinVars)
You can pass either a list like [varname, bin1, bin2, ....] or a dict with variable names as keys and bin boundaries as values.
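As a sketch, the two forms could look like this (variable names and bin boundaries are hypothetical):

ana.add_BinVar(['ipm3/sum', 0., 0.5, 1., 2.])           # list form: [varname, bin1, bin2, ....]
ana.add_BinVar({'scan/lxt': np.arange(-1., 1., 0.1)})   # dict form: {varname: bins}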

Add variables from the smalldata hdf5 (ana)

You can add lists of variables in the smallData, whether they were present originally or were added to the data. There are now also two ways to bin droplet data: you can either save an image based on the droplets in each bin or make square arrays in x/y/adu for each bin. This is specified like this:
ana.addToCube('cube',['droplet:epix/droplets:image','droplet:epix_2/droplet:array'])

Add variables from the xtc (images)

Data from the xtc files are added as dictionaries, as described above. Options for the dictionary include:
full: save the full detector data
image: if present, save the data as an image
thresAdu: require pixels to be above this threshold (in ADU) to be added to the image
thresRms: as above, but with the threshold in multiples of the pixel noise (rms)
common_mode: number identifying the common mode method
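A hypothetical dictionary using these options might look like this (the alias and all values are examples only):

epixDict = {'source': 'epix_2',   # detector alias (example)
            'image': 1,           # save the data as image
            'thresAdu': 10.,      # only add pixels above 10 ADU (example value)
            'common_mode': 80}    # common mode method number (example value)
ana.addToCube('cube', [epixDict])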

OnOff parameter in makeCubeData

makeCubeData takes an optional parameter onoff=[0/1/2]. It is 2 by default, which means that the filter will be applied as given. onoff=1 selects laser-on events: the requirement that the optical laser is on is added, nothing else is changed, and the output file will end in <…>_on.h5. onoff=0 selects laser-off events: if present, the laser-on requirement is flipped and criteria involving the timetool are dropped; all other criteria remain. The output files will end in <…>_off.h5.
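In code:

anaps.makeCubeData('cube', onoff=2)   # default: filter applied as given
anaps.makeCubeData('cube', onoff=1)   # laser-on events, output ends in _on.h5
anaps.makeCubeData('cube', onoff=0)   # laser-off events, output ends in _off.h5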

Production script

Analogous to SmallDataProducer.py and smallDataRun, there is a script called MakeCube.py in the examples folder with a partnering cubeRun script. cubeRun has a -h option to list all options. Most options are analogous to smallDataRun; minor differences are:
-n <#>: only the first # events are considered for each bin of the image data. The smallData variables will still contain all events, so the cube is not self-consistent. This option is for testing only!
-d <dirname>: the smallData file is read from <dirname>, and the cube files will also be written there. At this point, the cubes are always written to the directory the smallData is read from.
-j <number of MPI jobs>
If you specify a number of jobs significantly exceeding the number of bins, a fake binning variable will be added to spread the bin-based calculation over more cores/nodes. This intermediate variable is taken out again at the very end, as sketched below.
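A minimal numpy sketch of this fake-variable trick on stand-in data (not the actual implementation):

import numpy as np

nEvents, nBins, nJobs = 10000, 13, 64
vals = np.random.rand(nEvents)                     # stand-in for the per-event data
realBin = np.random.randint(nBins, size=nEvents)   # stand-in for the real bin index
nSplit = max(1, nJobs // nBins)                    # fake bins per real bin
fakeBin = np.random.randint(nSplit, size=nEvents)  # the intermediate fake variable

# accumulate into a (real, fake) 2D cube: each (real, fake) pair can go to its own rank
cube2d = np.zeros((nBins, nSplit))
np.add.at(cube2d, (realBin, fakeBin), vals)

# summing away the fake axis at the very end recovers the intended 1D cube
cube = cube2d.sum(axis=1)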

Cube data format

By default, the data gets saved in an hdf5 file in /reg/d/psdm/<instrument>/<expname>/hdf5/smalldata

and the files have names like Cube_<expname>_Run<runnumber>_<cubename><optional _on/off>.h5

h5ls -r on an example file looks like this:

/Cfg__epix10ka2m__calib_mask Dataset {16, 352, 384}
/Cfg__epix10ka2m__gain   Dataset {7, 16, 352, 384}
/Cfg__epix10ka2m__ix     Dataset {16, 352, 384}
/Cfg__epix10ka2m__iy     Dataset {16, 352, 384}
/Cfg__epix10ka2m__mask   Dataset {16, 352, 384}
/Cfg__epix10ka2m__ped    Dataset {7, 16, 352, 384}
/Cfg__epix10ka2m__rms    Dataset {7, 16, 352, 384}
/Cfg__epix10ka2m__x      Dataset {16, 352, 384}
/Cfg__epix10ka2m__y      Dataset {16, 352, 384}
/binVar                  Dataset {13}
/binVar_bins             Dataset {13}
/cubeSelection           Dataset {1}
/epix10ka2m              Dataset {13, 16, 352, 384}
/ipm5__sum               Dataset {13}
/nEntries                Dataset {13}
/std_binVar_bins         Dataset {13}
/std_delay               Dataset {13}
/std_ipm5__sum           Dataset {13}

/Cfg__<detnames>__<parametername>

are the calibration parameters for a detector. These are added whenever detector data from the big xtc file is added.

/nEntries is somewhat self-explanatory

/binVar_bins are the bin boundaries for the binning variable.

/std_<varname> are the standard deviations calculated from all data points in the bin.

/ipm5__sum and /epix10ka2m are the main data: ipm5__sum is from the smallData and in this case can be used for normalization by the incoming beam intensity. epix10ka2m is the full, binned data for the epix10ka2m detector.

cubeSelection is a dataset stored for its attribute, which is the text of the selection criteria used to create this cube.
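As an illustration, the binned data could be read back and normalized per bin like this (the file name is a placeholder following the pattern above):

import h5py

with h5py.File('Cube_xppx12345_Run042_cube.h5', 'r') as f:   # placeholder name
    img = f['epix10ka2m'][:]    # shape (nBins, 16, 352, 384)
    i0 = f['ipm5__sum'][:]      # shape (nBins,)
normed = img / i0[:, None, None, None]   # normalize each bin by its summed i0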

Advanced options (in progress)

Multi-dimensional binning

You can add more dimensions to bin the data in by using <cube>.add_BinVar:

def add_BinVar(self, addBinVars):
    """
    add extra dimensions to bin the data in
    parameters: addBinVars: dict or list
        list: only 1 extra variable [varname, bin1, bin2, ....]
        dict: {varname: bins}
    """

Indices for the events in all dimensions will be created and turned into one big flat index, after which xarray binning is used. Finally, the data is reshaped into the expected form.
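A rough numpy sketch of this flat-index idea on stand-in data (the real code uses xarray binning):

import numpy as np

bins1 = np.arange(0., 1.01, 0.1)   # bin boundaries, dimension 1 (example)
bins2 = np.arange(0., 10.1, 1.)    # bin boundaries, dimension 2 (example)
v1 = np.random.rand(1000)          # stand-in event data
v2 = 10. * np.random.rand(1000)

i1 = np.digitize(v1, bins1) - 1    # per-event bin index in each dimension
i2 = np.digitize(v2, bins2) - 1
n1, n2 = len(bins1) - 1, len(bins2) - 1
flat = i1 * n2 + i2                # one big flat index

counts = np.bincount(flat, minlength=n1 * n2)
nEntries = counts.reshape(n1, n2)  # reshape into the expected 2D form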

Returning event-by-event data (from smallData ONLY when using SmallDataAna.makeCubeData)

ana.addToCube(<cubename>, targetVar, isIdxVar=True)

When variables are added to the cube with isIdxVar=True (the default is False), then in addition to the cube dictionary, a second dictionary with a list of values for each bin of the cube is returned. This only works for the SmallDataAna method, NOT the SmallDataAna_psana one. You can then treat the returned data further, e.g. by picking one event for each bin out of the list and writing out the data for those events. This is demonstrated in PickedEventsCube.py.
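A rough sketch of picking one event per bin; the exact return structure is an assumption here, so consult PickedEventsCube.py for the authoritative usage:

# assume makeCubeData returns the cube dictionary plus a second dictionary with
# per-bin value lists; 'targetVar' stands for the variable added with isIdxVar=True
cubeDict, perBinDict = ana.makeCubeData('cube')
picked = [vals[0] for vals in perBinDict['targetVar'] if len(vals) > 0]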

Getting data from laser-off images closest in time for each cube slice

For each 'on' event, we select the closest <n> 'off' events. At this moment, these are also summed together; each event is NOT normalized by its own off-events.

Normalizing the image event-by-event

It is not yet possible to do this.
