
Prerequisites

The skimmer is only usable on Linux. Regarding external tools, skimmer v5r0 depends on :

  1. Perl 5, which should be found with "/usr/bin/env perl".
  2. ROOT 5.10.00 to 5.18.00 : the user can point $ROOTSYS to any ROOT release, and it will be used as is by the skimmer, but the only validated releases are 5.10.00, 5.14.00g, 5.16.00-gl1 and 5.18.00. If $ROOTSYS is not defined, the skimmer will look for $GLAST_EXT/ROOT/v5.10.00/root ; if GLAST_EXT is not defined, it defaults to /afs/slac/g/glast/ground/GLAST_EXT/$CMTCONFIG ; if $CMTCONFIG is not defined, it defaults to rh9_gcc32.
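The fallback chain of item 2 can be sketched in shell. The paths are the documented defaults; the `unset` line is only there to demonstrate the defaulting from a clean slate :

```shell
# Demonstrate the documented fallback chain for locating ROOT.
unset ROOTSYS GLAST_EXT CMTCONFIG        # clean slate, for demonstration only
: "${CMTCONFIG:=rh9_gcc32}"
: "${GLAST_EXT:=/afs/slac/g/glast/ground/GLAST_EXT/${CMTCONFIG}}"
: "${ROOTSYS:=${GLAST_EXT}/ROOT/v5.10.00/root}"
echo "ROOT taken from: ${ROOTSYS}"
```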

What the skimmer basically does

The basic task of the skimmer is to take GLAST ROOT files, containing ROOT trees, and produce similar output files with a subset of branches and events. The search for the ROOT data files to be skimmed is called here mining. Omitting some branches from the copy is called pruning. Copying only a subset of events is called filtering.

A skimmer execution is organized as a sequence of steps. All but the last one are preparation steps. Each preparation step is expected to prepare a specific parameter file. If the parameter file already exists, whether from a previous execution or because it has been written from scratch by the user, then the corresponding preparation step is not applied (this default behavior can be changed). Currently, the steps are :

  1. GET_FILE_LIST : establish the list of the input ROOT data files to be skimmed.
  2. GET_LIBRARY_LIST : if needed, find out the release of the corresponding C++ code, and the associated libraries.
  3. GET_BRANCH_LIST : establish the list of branches to be duplicated.
  4. GET_EVENT_LIST : establish the list of events to be duplicated.
  5. SKIM : the actual skimming.
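The skip-if-the-parameter-file-exists behavior of the preparation steps can be illustrated with a small shell sketch (illustrative logic only, not the skimmer's actual code) :

```shell
# Illustrative version of the preparation-step logic: a step is applied
# only if its parameter file does not already exist.
run_step() {
  step_name=$1; param_file=$2
  if [ -f "$param_file" ]; then
    echo "$step_name: reusing existing $param_file"
  else
    echo "$step_name: generating $param_file"
    : > "$param_file"    # a real step would fill the parameter file here
  fi
}
rm -f /tmp/demo_FileList.txt
run_step GET_FILE_LIST /tmp/demo_FileList.txt   # first run: generates
run_step GET_FILE_LIST /tmp/demo_FileList.txt   # second run: reuses
```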

The skimmer is also known to GLAST people as the Data Server Back End. It has a command-line interface which can be used directly, or it can be driven through a web interface, also known as the Data Server Front End. This page documents only the command-line interface, but it should also help in understanding the corresponding web interface.

The behavior of the skimmer is entirely tuned by a set of predefined shell variables. For a complete list of those variables, type "skimmer help", but the explanations will hardly make sense if you have not read this guide first.

Overview of parameters

As one can see in the steps given above, before the skimmer can proceed, it collects a lot of information about the files to be skimmed, what they contain and what to extract. This is all tuned by shell variables, and the information is mostly stored in a few parameter files. The use of those text files brings several benefits :

  • They ease the debugging of a job.
  • They can be modified, and reused as input for later jobs.
  • An advanced user who knows what such a file should contain can even write it from scratch.

All the parameter files can contain any number of empty lines and comments starting with "#". The lines starting with "#!" are called special comments. The first special comment in any parameter file should be about the global file format release, currently "CEL TXT 0.1". The second special comment should be of the form "SECTION <name>", where <name> depends on the kind of information in the rest of the file. For example, if the file contains the list of input files, <name> will be "Files". Several examples are given below.

One will find below the description of the parameter files and shell variables which tune a skimmer job.
Worth noting : for each of the official skimming steps given previously, there are a few systematic variables :

  • SK_SKIP_<STEP> : do not execute the step, whatever the values of the other parameters and the existence of parameter files.
  • SK_FORCE_<STEP> : do execute the step, whatever the values of the other parameters and the existence of parameter files.
  • SK_DEBUG_<STEP> : display additional information about that specific step.
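For instance, to force the event-list step to run again with extra diagnostics, one might set (variable names as documented above) :

```shell
# Force the GET_EVENT_LIST step and turn on its debug output.
export SK_FORCE_GET_EVENT_LIST="true"
export SK_DEBUG_GET_EVENT_LIST="true"
# Sanity check: SK_FORCE and SK_SKIP must not both be true for one step.
if [ "${SK_FORCE_GET_EVENT_LIST}" = "true" ] && [ "${SK_SKIP_GET_EVENT_LIST:-false}" = "true" ]; then
  echo "error: both SK_FORCE_GET_EVENT_LIST and SK_SKIP_GET_EVENT_LIST are true" >&2
fi
```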

Obviously enough, for a given <STEP>, SK_FORCE_<STEP> and SK_SKIP_<STEP> should never be true at the same time. Let's now see the details of each step.

Data files mining parameters

The skimmer can get the list of the data files to be processed from the Pipeline I Oracle Database, if you provide the data types, the task name and the run range through the shell variables SK_DATA_TYPES, SK_TASK, SK_RUN_MIN and SK_RUN_MAX. For Pipeline II data files, read the note at the end of this section.

SK_DATA_TYPES should be a ":"-separated list of data types. The currently recognized types are "merit", "mc", "digi" and "recon". If SK_DATA_TYPES is empty, the default value "merit:mc:digi:recon" is used. SK_TASK should be any of the tasks recognized by the Pipeline I Oracle Database. If SK_RUN_MAX is set to 0, all runs are taken into consideration.

The resulting list of ROOT data file names is stored in a text file whose complete path is defined by SK_FILE_LIST_FILE. The skimmer will query Oracle for the list only if this file does not already exist, or if SK_FORCE_GET_FILE_LIST is set to true. Conversely, if this file does not exist and you want to prevent its creation, set SK_SKIP_GET_FILE_LIST to true (rarely useful).

After a skimmer execution, the file defined by SK_FILE_LIST_FILE can be freely edited and reused. Also, if you have a set of data files you want to process, you can write such a file from scratch. Each line of the file is simply expected to be the full path of a given ROOT file, optionally prefixed by its data type. For example :

#
#! CEL TXT 0.1
#

#! SECTION Files
(recon)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_recon_RECON.root
(recon)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_recon_RECON.root
(mc)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_mc_MC.root
(mc)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_mc_MC.root
(merit)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_merit_merit.root
(merit)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_merit_merit.root
(digi)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_digi_DIGI.root
(digi)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_digi_DIGI.root

When a SK_FILE_LIST_FILE is already available, there is no obligation to define SK_TASK, yet it is recommended because it is often needed for the default value of some other shell variables.
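Such a file-list file can also be produced from files on disk with a few lines of shell. This is a hedged sketch : the dummy directory and naming pattern below are invented for the demonstration only :

```shell
# Build a minimal SK_FILE_LIST_FILE from ROOT files found on disk.
demo_dir=$(mktemp -d)                                   # stand-in data directory
touch "$demo_dir/run1_recon_RECON.root" "$demo_dir/run2_recon_RECON.root"
SK_FILE_LIST_FILE="$demo_dir/FileList.txt"
{
  echo "#! CEL TXT 0.1"
  echo "#! SECTION Files"
  for f in "$demo_dir"/*_recon_RECON.root; do
    echo "(recon)$f"                                    # data type prefix + full path
  done
} > "$SK_FILE_LIST_FILE"
cat "$SK_FILE_LIST_FILE"
```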

Here are the default values of the shell variables for this section :

SK_TASK=""
SK_RUN_MIN=0
SK_RUN_MAX=0
SK_DATA_TYPES="merit:mc:digi:recon"
SK_FILE_LIST_FILE="${PWD}/${SK_TASK}_FileList.txt"
SK_SKIP_GET_FILE_LIST="false"
SK_FORCE_GET_FILE_LIST="false"
SK_DEBUG_GET_FILE_LIST="false"

Since the arrival of Pipeline II data files, the automatic generation of the list of input data files is no longer the responsibility of the skimmer ; it is expected to be prepared by the web front end. If you use this tool directly with such PII input data files, you will have to write the list of input files yourself, as described above. Yet, it is still recommended to define SK_TASK, which is used in several default values, and it remains mandatory to define SK_DATA_TYPES unless you want to skim all possible data types.

Release libraries determination parameters

When managing data such as recon, mc and/or digi, the skimmer needs to load the corresponding C++ libraries, from the correct release, and it will take them from a dedicated file whose name is defined by the variable SK_LIBRARY_LIST_FILE. In this file, each line is the full path of a shared library, optionally prefixed by its data type. Example of such a file :

#
#! CEL TXT 0.1
#

#! SECTION Libraries
/nfs/farm/g/glast/u09/builds/rh9_gcc32/BeamtestRelease/BeamtestRelease-v3r0907p0/lib/libcommonRootData.so
(mc)/nfs/farm/g/glast/u09/builds/rh9_gcc32/BeamtestRelease/BeamtestRelease-v3r0907p0/lib/libmcRootData.so
(digi)/nfs/farm/g/glast/u09/builds/rh9_gcc32/BeamtestRelease/BeamtestRelease-v3r0907p0/lib/libdigiRootData.so
(recon)/nfs/farm/g/glast/u09/builds/rh9_gcc32/BeamtestRelease/BeamtestRelease-v3r0907p0/lib/libreconRootData.so

This list corresponds to a single code release, the one relevant for the current SK_TASK. A user writing the file defined by SK_FILE_LIST_FILE by hand must make sure that all the listed ROOT files have been generated with the same code release.

If the file defined by SK_LIBRARY_LIST_FILE does not already exist, the skimmer will try to guess the release and generate the missing file. This guess is based on the FileHeader instance, if any, stored in the first file of each data kind. The guessing can be modified by several tricky variables, but this is outside the scope of this guide (see the FAQ).

Here are the default values of the shell variables for this section :

SK_LIBRARY_LIST_FILE="${PWD}/${SK_TASK}_LibraryList.txt"
SK_SKIP_GET_LIBRARY_LIST="false"
SK_FORCE_GET_LIBRARY_LIST="false"
SK_DEBUG_GET_LIBRARY_LIST="false"
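Writing a SK_LIBRARY_LIST_FILE by hand follows the same file format. A hedged sketch, reusing the release path from the example above (adjust for your own release ; the file is written to a temporary path here for demonstration) :

```shell
# Write a minimal SK_LIBRARY_LIST_FILE for a single known release.
REL=/nfs/farm/g/glast/u09/builds/rh9_gcc32/BeamtestRelease/BeamtestRelease-v3r0907p0
SK_LIBRARY_LIST_FILE=$(mktemp)
cat > "$SK_LIBRARY_LIST_FILE" <<EOF
#! CEL TXT 0.1
#! SECTION Libraries
${REL}/lib/libcommonRootData.so
(recon)${REL}/lib/libreconRootData.so
EOF
```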

Events filtering parameters

The actual skimming will only select the events given in a parameter file, whose name is defined by the shell variable SK_EVENT_LIST_FILE. This file may already exist, either generated by a previous skimmer execution (and possibly modified by you), or written from scratch. If not, the skimmer will generate one, based on the values of SK_TCUT_DATA_TYPE and SK_TCUT. SK_TCUT should follow ROOT's TCut syntax. Currently, the only valid value for SK_TCUT_DATA_TYPE is merit.
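As an example, one might define a merit cut like the following. The cut variables ("CalEnergyRaw", "TkrNumTracks") are illustrative merit branch names, not guaranteed to exist in your dataset :

```shell
# Filter events with a ROOT TCut evaluated on the merit tree.
export SK_TCUT_DATA_TYPE="merit"
export SK_TCUT="CalEnergyRaw>1000 && TkrNumTracks>0"
```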

The expected content of the file defined by SK_EVENT_LIST_FILE is a simple list of pairs, one per line, where each pair is a run id followed by an event id. The "Events" section header can be followed by two special comments recalling the number of events before and after the cut (this information is currently not used by the skimmer). For example :

#
#! CEL TXT 0.1
#

#! SECTION Events
#! 2000 entries in original dataset.
#! 7 events after cut:
1 8
1 183
1 344
1 553
2 117
2 517
2 980

Defining a TCut which spans several data types is not possible today, but it will be studied once several values are allowed for SK_TCUT_DATA_TYPE. Here are the default values of the shell variables for this section :

SK_TCUT_DATA_TYPE="merit"
SK_TCUT=""
SK_EVENT_LIST_FILE="${PWD}/${SK_TASK}_EventList.txt"
SK_SKIP_GET_EVENT_LIST="false"
SK_FORCE_GET_EVENT_LIST="false"
SK_DEBUG_GET_EVENT_LIST="false"

If you do want to keep all the events, which is not the default expectation, you need to act on several variables : SK_TCUT should be "", the file defined by SK_EVENT_LIST_FILE must not exist, and SK_SKIP_GET_EVENT_LIST should be "false". This way, the skimmer has no information about how to select the events, and will merge them all.
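In shell, a keep-everything setup might look like this (the task name is illustrative ; the removed path is the documented default of SK_EVENT_LIST_FILE) :

```shell
# Keep all events: no cut, no pre-existing event list, and the step not skipped.
export SK_TASK="MyTask"                        # illustrative task name
export SK_TCUT=""
export SK_SKIP_GET_EVENT_LIST="false"
rm -f "${PWD}/${SK_TASK}_EventList.txt"        # the SK_EVENT_LIST_FILE default
```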

Branches pruning parameters

The skimmer can also take into account a list of branches to be activated or deactivated. This list is given through a file whose full path is given by the variable SK_BRANCH_LIST_FILE. If this file does not exist, the skimmer will generate it, activating all the branches.

As usual, you can edit this generated file, or write one from scratch. Each line should contain a data type prefix, the name of the tree, a + or a - (to activate or deactivate, respectively), and the specification of one or several branches (with the ROOT syntax). The lines are applied one after the other : you can deactivate all the branches of a given type with *, then activate only the ones of interest. For example :

#
#! CEL TXT 0.1
#

#! SECTION Branches
(recon)Recon+ReconEvent
(recon)Recon+m_eventId
(recon)Recon+m_runId
(recon)Recon+m_acd
(recon)Recon-m_cal
(merit)MeritTuple-*
(merit)MeritTuple+PtT*
(merit)MeritTuple+Cal*

Here are the default values of the shell variables for this section :

SK_BRANCH_LIST_FILE="${PWD}/${SK_TASK}_BranchList.txt"
SK_SKIP_GET_BRANCH_LIST=false
SK_FORCE_GET_BRANCH_LIST=false
SK_DEBUG_GET_BRANCH_LIST=false
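A branch-list file written from scratch can exploit the order-dependent semantics. A hedged sketch keeping only the PtT* merit branches (written to a temporary path for demonstration) :

```shell
# Write a SK_BRANCH_LIST_FILE that prunes merit down to the PtT* branches.
# Lines are applied in order: "-*" first deactivates everything,
# then "+PtT*" re-activates only the branches of interest.
SK_BRANCH_LIST_FILE=$(mktemp)
cat > "$SK_BRANCH_LIST_FILE" <<'EOF'
#! CEL TXT 0.1
#! SECTION Branches
(merit)MeritTuple-*
(merit)MeritTuple+PtT*
EOF
export SK_BRANCH_LIST_FILE
```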

The actual final skimming

We now come to the point of saying which types of data we want to skim. This is done via the shell variable SK_DATA_TYPES, the same one described in the "data files mining" section above.

The skimmed files will be stored in the directory defined by the shell variable SK_OUT_DIR, in files named ${SK_OUT_FILE_BODY}_<datatype>.root. If they turn out to be very big, ROOT may automatically close the first file and open new ones, appending a rank number to the file name. The maximum size of each output ROOT file can be changed with the shell variable SK_MAX_FILE_SIZE. If this variable is set to 0 (the default), ROOT will use its own default value. Also, if the value is 0 and the job is merging all the events, the ROOT fast merging method will be used.

Here are the default values of the shell variables for this section :

SK_DATA_TYPES="merit:mc:digi:recon"
SK_OUT_DIR=${PWD}
SK_OUT_FILE_BODY=${SK_TASK}
SK_MAX_FILE_SIZE=0
SK_SKIP_SKIM=false
SK_DEBUG_SKIM=false

If the skimmer has already been executed, and the generated parameter files are still available, most of the skimmer shell variables will not affect a new execution, because the skimmer will directly reuse the parameter files. Only the few variables above, corresponding to the actual skimming step, will be taken into account.
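Putting the whole chain together, a minimal job might be set up as below. This is a hedged sketch : the task name is a placeholder, the cut is left empty to keep all events, and the skimmer executable is only run if found on the PATH :

```shell
# Minimal end-to-end setup for a skimmer job (illustrative values).
export SK_TASK="BeamTest-0100"
export SK_DATA_TYPES="merit:recon"
export SK_TCUT=""                       # empty cut: keep all events
export SK_OUT_DIR="${PWD}/skim_out"
mkdir -p "$SK_OUT_DIR"
if command -v skimmer >/dev/null 2>&1; then
  skimmer                               # runs GET_FILE_LIST ... SKIM in order
fi
```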
