

Introduction

This page collects information about the Python-based analysis framework for LCLS called pyana. The framework design borrows heavily from various sources such as Online Analysis (a.k.a. myana), the BaBar framework, etc. Its main principles are summarized here:

  • oriented towards XTC processing, but can be extended to work with HDF5 data
  • should be easy to use and extend for end users
  • support re-use of existing analysis code
  • allow parallel processing on multi-core systems
  • common simple configuration of user analysis code

Framework Composition

The centerpiece of the framework is a regular Python application (pyana) which can load one or more user analysis modules, also written in Python. The core application is responsible for the following tasks:

  • loading and initializing all user modules
  • reading XTC data from a list of input files
  • calling appropriate user methods based on the data being processed
  • providing data access to user modules
  • providing other services such as histogramming to user modules

The core application has a number of configuration options which can be set or changed from a configuration file or from the command line. If the same option appears both in the configuration file and on the command line then the command-line value overrides the value in the configuration file.
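For example, assuming the num-events option described in the Core Options table below, a job started as pyana -n 100 would process only 100 events even if the configuration file says otherwise:

[pyana]
; overridden by "-n 100" given on the command line
num-events = 1000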

User Modules

A user analysis module is a regular Python module (a Python file) which satisfies additional requirements:

  • it contains a class with the same name as the module name
  • the class defines a constructor method with optional arguments and five regular methods: beginjob(), beginrun(), event(), endrun(), and endjob() (a minimal skeleton is sketched below)
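Here is a minimal sketch of such a module. The module name, the parameter name, and the exact arguments of the five methods are illustrative assumptions; each method is simply assumed to receive the event and environment objects described in the Data access section, and the real signatures should be checked against the reference guide.

mypackage/src/mymodule.py
# minimal user analysis module sketch; "mymodule", "threshold" and the
# method arguments (evt, env) are assumptions, not the definitive API
class mymodule(object):

    def __init__(self, threshold="0.5"):
        # all parameters arrive as strings (see Initialization below)
        self.threshold = float(threshold)

    def beginjob(self, evt, env):
        pass    # one-time initialization at the start of the job

    def beginrun(self, evt, env):
        pass    # initialization at the start of every run

    def event(self, evt, env):
        pass    # called once per event; analysis code goes here

    def endrun(self, evt, env):
        pass    # summary at the end of every run

    def endjob(self, evt, env):
        pass    # final summary at the end of the job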

The application loads one or more user modules; the names of the modules to load are specified either in the job configuration file or on the command line. After loading the modules the application creates one or more instance objects of the class defined in each module. More than one instance may be useful if one wants to run the same analysis with different sets of parameters in the same job. The number of instances and their parameters are determined by the job configuration file (see below).

Initialization

A user analysis class can define zero or more parameters in its constructor (__init__() method). Parameters are initialized from the values defined in the job configuration file (see below). All parameters are passed to the Python code as strings; if the code expects a number or some other type then it is the code's responsibility to convert the strings to the appropriate type. If no default value is defined for a parameter in the constructor declaration then that parameter must be present in the configuration file.

For a quick example, suppose that we have this class defined in a user module:

mypackage/src/myana.py
# user analysis class
class myana(object):
    def __init__(self, name, lower, upper, bins=100):
        self.lower = float(lower)
        self.upper = float(upper)
        self.bins = int(bins)
    ...

and this job configuration file:

pyana.cfg
[pyana]
modules = mypackage.myana mypackage.myana:wide

[mypackage.myana]
lower = 0
upper = 100
name = default

[mypackage.myana:wide]
lower = 0
upper = 1000
bins = 1000
name = wide

With this configuration the analysis job will instantiate two analysis objects with different parameters, equivalent to this pseudo-code:

# import class myana
from mypackage.myana import myana

# create instances
instances = [ myana(lower = "0", upper = "100", name = "default"),
    myana(lower = "0", upper = "1000", bins = "1000", name = "wide") ]

(The order of parameters in the constructor and in the configuration file does not matter as all parameters are passed as keyword arguments.)

Data access

There are two types of data that the framework passes to the user analysis modules: event data and environment data. Event data contains the data corresponding to the current event that triggered the call to the user methods. In the case of XTC input the event data contains the complete datagram as read from the DAQ. Event data in a user module is represented by a special object of type pyana.event.Event which has an extended interface for extracting individual objects from the datagram. This interface is described in the reference guide.

Environment data includes all kinds of data which are not part of the event data. Usually environment data either stays the same for the whole job or changes at a slower rate than event data. Examples of environment data are configuration data read from XTC at the beginning of the job, EPICS data which is not updated on every event, and a few other things. Environment data is exposed to user code through an object of type pyana.event.Env. Its interface is described in the reference guide.
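A rough sketch of how these two objects might be used inside the event() method is shown below. The accessor names getAcqValue() and jobName() are hypothetical placeholders, not confirmed interface; the actual method names of pyana.event.Event and pyana.event.Env must be taken from the reference guide.

class myana(object):

    def event(self, evt, env):
        # evt carries the per-event data; getAcqValue() is a hypothetical
        # placeholder name for one of the pyana.event.Event accessor methods
        waveform = evt.getAcqValue("AmoETof-0|Acqiris-0")

        # env carries slowly changing data such as configuration or EPICS
        # values; jobName() is likewise a hypothetical placeholder name
        jobname = env.jobName()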

Data Source Address

For some pieces of data one needs to specify a data "address" which identifies (perhaps partially) the particular DAQ device which produced the data. This is needed because the instrument setup may include multiple devices producing the same data type. The DAQ defines a type which serves as the most specific device identification; the type is xtc.DetInfo in the package pypdsdata. One can pass a DetInfo instance to a method which accepts a device address to select that specific device. A DetInfo object contains four essential pieces of information:

  • detector – one of the DetInfo.Detector.* values
  • detId – ID number selecting one of multiple detectors
  • device – one of the DetInfo.Device.* values
  • devId – ID number selecting one of multiple devices

One can build a DetInfo out of these four values, but the DetInfo constructor (which mimics the C++ behavior) also needs a nuisance processId parameter. In most cases using the address string described below is preferred to manually building DetInfo objects.
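As a hedged illustration (the argument order below is an assumption borrowed from the corresponding C++ constructor and should be verified against the pypdsdata reference), building a DetInfo by hand might look like this, compared with the equivalent address string:

from pypdsdata import xtc

# explicit identification of detector AmoETof-0, device Acqiris-0;
# the leading 0 stands for the nuisance processId parameter, and the
# argument order (processId, detector, detId, device, devId) is assumed
# to mirror the C++ constructor
det = xtc.DetInfo(0,
                  xtc.DetInfo.Detector.AmoETof, 0,
                  xtc.DetInfo.Device.Acqiris, 0)

# the equivalent address string, described below
address = "AmoETof-0|Acqiris-0"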

In many cases the DetInfo object is not known to the user, or the code uses only partial information to identify one or many data sources. In such cases it is easier to use an address string as an argument to methods accepting an address parameter. The most generic format of the address string is:

[detector][-[detId]]["|"[device][-[devId]]]

In words, the address string is an optional detector name followed by an optional dash and detector ID, separated by a vertical bar from an optional device name followed by an optional dash and device ID. Detector names and device names are the names defined in the DetInfo.Detector and DetInfo.Device enums respectively. Examples of detector names are AmoGasdet, AmoETof, Camp, etc. Examples of device names are Acqiris, pnCCD, Evr, etc. For the complete list check the Reference Manual. If any piece of the address string is missing or is replaced with the special '*' character then it means "match any value".

Here are a few example address strings:

  • "AmoETof-0|Acqiris-0" – selects data produced by detector AmoETof, detId 0, device Acqiris, devId 0
  • "AmoETof|Acqiris" – selects data produced by detector AmoETof, any detId, device Acqiris, any devId
  • "AmoETof-*|Acqiris-*" – same as above
  • "AmoETof-0" – selects data produced by detector AmoETof, detId 0, any device, any devId
  • "|Acqiris-0" – selects data produced by any detector, any detId, device Acqiris, devId 0
  • "*-*|Acqiris-0" – same as above

Configuration

An analysis job can read its configuration options from the command line and/or the configuration file. The command line can be used to set options only for the pyana application itself, not for user analysis modules. Options for user modules can be set in the configuration file only.

Configuration File

The default name for the configuration file is "pyana.cfg" in the current directory. The name can be changed with the --config or -c command line option. The format of the configuration file follows the widely accepted syntax for INI files. The file consists of a series of sections; the beginning of each section is marked by the section name in square brackets. Each section contains an arbitrary number of options in the form option = value. Both the section header and the option name must start in the first column; whitespace at the beginning of a line means that the line is a continuation of the previous option line. Empty lines and lines beginning with # (hash sign) or ; (semicolon) are considered comments. A semicolon anywhere on a line is considered the beginning of a comment if it follows a space character. The word REM (case insensitive) at the beginning of a line starts a comment if it is followed by a space, TAB, or newline.

Here is an example of the configuration file syntax:

# Comment line
; Another comment
[section1]
name = Default    ; In-line comment

# next line shows line continuation
modules = module.A module.B
          module.C

REM REM-style comment
[section2]
limit = 1000

Core Options

By default the core application options are read from the [pyana] section of the configuration file. If the option -C name or --config-name=name is given on the command line then the additional section [pyana.name] is read, and values in that section override values from the [pyana] section.
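For example (the qualifier name "debug" here is an arbitrary illustration), running the job with -C debug on top of the configuration below would give verbose the value 2, while num-events keeps the value from the [pyana] section:

[pyana]
verbose = 0
num-events = 1000

[pyana.debug]
; read in addition to [pyana] when the job is started with "-C debug"
verbose = 2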

Here is the list of all command line and configuration file options currently available:

Short | Long | Config File | Option type | Default | Description
-v | --verbose | verbose | integer | 0 | The command line option does not need a value but can be repeated multiple times; the configuration file option accepts a single integer number.
-c file | --config=file |  | path | pyana.cfg | Name of the configuration file.
-C name | --config-name=name |  | string |  | If a non-empty string is given then the configuration will be read from section [pyana.name] in addition to [pyana].
-l file | --file-list=file | file-list | path |  | The list of files to process will be read from the given file, which must contain one file name per line.
-n number | --num-events=number | num-events | integer | 0 | Maximum number of events to process; this counter includes damaged events too.
-j name | --job-name=name | job-name | string |  | Sets the job name which is accessible to user code via an environment method. The default name is based on the input file names.
-m name | --module=name | modules | string |  | User analysis module(s). The command line option can be repeated several times; the configuration file option accepts a space-separated list of names.
-p number | --num-cpu=number | num-cpu | integer | 1 | Number of processes to run; if greater than 1 then multi-processing mode is used.

User Module Options

For every user module the configuration file may contain one or more configuration sections. The section header for a user module has the format [module] or [module:name]. When defining the user modules, either with the --module command line option or with the modules configuration file option, one can optionally qualify the module name with a colon followed by an arbitrary single-word string. Without this optional qualification the framework will load the user module and will use the options from the [module] section to initialize the instance of the analysis class (as explained in the Initialization section). If, on the other hand, the qualified name is used then the framework will initialize the instance with the options combined from the sections [module] and [module:name], with the latter section overriding the values from the former. One can use several qualified forms of the same module name to produce several instances of the analysis class in the same job with different options.

Here is an example, almost identical to the one in the Initialization section above, which illustrates the inheritance and overriding of the user options:

pyana.cfg
[pyana]
modules = mypackage.myana mypackage.myana:wide

[mypackage.myana]
lower = 0
upper = 100
name = default

[mypackage.myana:wide]
; 'lower' option will be reused from [mypackage.myana] section
bins = 1000   ; this overrides the default value from the constructor
; two options below will override [mypackage.myana] values
upper = 1000
name = wide

Multi-processing

The framework can run in single-process or multi-process mode; by default everything runs in single-process mode. In multi-process mode the analysis job spawns a number of processes, all running on the same host. In that case the framework is also responsible for distributing individual events among the processes and for collecting and merging the results of the processing at the end of the job.
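For example, using the num-cpu option from the Core Options table above, a job could be switched to multi-processing mode with a configuration like this (or, equivalently, with -p 4 on the command line):

[pyana]
; spawn four worker processes on the same host
num-cpu = 4
modules = mypackage.myana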
