Warning

The skimming tool has been generalized and externalized. The information below is obsolete and related to an old release. If possible, use a more recent release and get the corresponding documentation from the new TRAC server.

Preface

The Skimmer is also known to GLAST people as the Data Server Back End or Skimmer Back End. It has a command-line interface which can be used directly from a Linux shell. If you skim your data through a web interface, you are going through an additional layer known as the Data Server Front End, Skimmer Web Application, or Skimmer Front End. This page only documents the command-line back-end tool, which we will simply call the skimmer, but it may also help you understand the front-end layer and its web interface.

The skimmer is only usable on Linux. Regarding external tools, skimmer v6r1 depends on :

  1. Perl 5, which should be found with "/usr/bin/env perl".
  2. ROOT 5.10.00 to 5.18.00b : the user can set $ROOTSYS to any ROOT release, and it will be used as is by the skimmer, but the only validated releases are 5.10.00, 5.14.00g, 5.16.00-gl1, 5.18/00b and 5.18/00c-gl1. If $ROOTSYS is not defined, the skimmer will search for $GLAST_EXT/ROOT/v5.10.00/root ; if $GLAST_EXT is not defined, it will be set to /afs/slac/g/glast/ground/GLAST_EXT/$CMTCONFIG ; if $CMTCONFIG is not defined, it will be set to rh9_gcc32.
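The environment fallback chain above can be sketched as follows. This is a minimal illustration of the documented defaults, not the skimmer's actual code:

```shell
# Sketch of the documented $ROOTSYS fallback chain (not the skimmer's own code).
# The variables are cleared first so the demo always exercises every fallback.
unset ROOTSYS GLAST_EXT CMTCONFIG
: "${CMTCONFIG:=rh9_gcc32}"
: "${GLAST_EXT:=/afs/slac/g/glast/ground/GLAST_EXT/$CMTCONFIG}"
: "${ROOTSYS:=$GLAST_EXT/ROOT/v5.10.00/root}"
echo "ROOTSYS=$ROOTSYS"
```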

What the skimmer basically does

The basic task of the skimmer is to take GLAST ROOT files, containing ROOT trees, and produce similar output files with a subset of branches and events. The search for the ROOT data files to be skimmed is called here mining. Dropping some branches from the copy is called pruning. Copying only a subset of events is called cutting.

...

  1. MAKE_FILE_LIST : establish the list of the input ROOT data files to be skimmed.
  2. MAKE_LIBRARY_LIST : if necessary, find out the release of the corresponding C++ code, and search for the associated shared libraries.
  3. MAKE_BRANCH_LIST : establish the list of branches to be duplicated.
  4. MAKE_EVENT_LIST : establish the list of events to be duplicated.
  5. SKIM : the actual skimming.
  6. CHECK : optional check of output data, which could take a long time to perform.
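The sequence above can be pictured as a simple pipeline. The loop below only echoes the step names in order, as a sketch; it does not reflect the skimmer's internals:

```shell
# The official skimming steps, in order (names from the list above).
for step in MAKE_FILE_LIST MAKE_LIBRARY_LIST MAKE_BRANCH_LIST MAKE_EVENT_LIST SKIM CHECK; do
  echo "step: $step"
done
```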

How to control the skimming job

As one can see in the steps given above, before the skimmer can proceed, it is collecting much information about the files to be skimmed, what they contain and what to extract. This is all tuned by some shell variables, and some of the information can come from an input ROOT CEL file (documented elsewhere) or from some textual parameter files, meant to be the textual flavor of the different subparts of a ROOT CEL.

...

One will find below the description of the parameter files and shell variables which are meaningful for a skimmer job.
Note that, for each of the official skimming steps given previously, there is a SK_DEBUG_* variable which can trigger the display of additional information about that specific step. Let's now see the details of each step.
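For example, assuming a Bourne-like shell, one could turn on the extra output for two of the steps like this (the values shown are just examples):

```shell
# Hypothetical debugging setup: ask for extra output from the
# MAKE_FILE_LIST and MAKE_EVENT_LIST steps.
export SK_DEBUG_FILE_LIST="true"
export SK_DEBUG_EVENT_LIST="true"
```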

Data files mining parameters

The list of input data files can be obtained from different sources :

  1. From a CompositeEventList : if a CEL file is given as input to the skimmer, its path being defined with variable SK_INPUT_CEL.
  2. From a textual parameter file made by the user : the format is given below, and the path of the parameter file is the value of the variable SK_INPUT_FILE_LIST.
  3. From the Pipeline I Oracle Database : if the data to be skimmed has been generated with the Pipeline I, one can define SK_INPUT_TASK.

Only one of those three sources will be used for a given job. The variables SK_INPUT_FILE_LIST and SK_INPUT_TASK should not be defined at the same time. If defined, the variable SK_INPUT_CEL will only be used if both SK_INPUT_FILE_LIST and SK_INPUT_TASK are empty. In future releases, we could try to intersect the sets defined by SK_INPUT_FILE_LIST and SK_INPUT_CEL, but we are not yet sure it is useful and usable.

In the case of Pipeline I products, SK_INPUT_TASK is enough and should be one of the tasks recognized by the Pipeline I Oracle Database. On top of that, one can select a subset of the task runs through the shell variables SK_RUN_MIN and SK_RUN_MAX. If the value of SK_RUN_MAX is 0, all the runs will be taken into consideration.

In the case of a textual parameter file written by the user, it must conform to the usual rules for skimmer parameter files : the header is made of a first special comment which expresses the global file format release (#! CEL TXT 0.1), followed by a special comment which expresses the section name (#! SECTION Files). Then, each line of the file is the full path of an input ROOT file, optionally prefixed by the data types of the trees within the file. For example :

Panel

#
#! CEL TXT 0.1
#

#! SECTION Files
(recon)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_recon_RECON.root
(recon)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_recon_RECON.root
(mc)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_mc_MC.root
(mc)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_mc_MC.root
(merit:pointing:jobinfo)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_merit_merit.root
(merit:pointing:jobinfo)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_merit_merit.root
(digi)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000001/BeamTest-0100_000001_digi_DIGI.root
(digi)/nfs/farm/g/glast/u35/MC-tasks/BeamTest-0100/output/000002/BeamTest-0100_000002_digi_DIGI.root

...

No Format
SK_INPUT_CEL=""
SK_INPUT_FILE_LIST=""
SK_INPUT_TASK=""
SK_RUN_MIN=0
SK_RUN_MAX=0
SK_OUTPUT_FILE_LIST=""
SK_DEBUG_FILE_LIST="false"
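As an illustration of the variables above, here is a hypothetical mining setup for Pipeline I products; the run range is an example only, and the task name is borrowed from the file list example:

```shell
# Hypothetical mining setup: take runs 1 to 20 of a Pipeline I task.
export SK_INPUT_CEL=""           # unused source must stay empty
export SK_INPUT_FILE_LIST=""     # unused source must stay empty
export SK_INPUT_TASK="BeamTest-0100"
export SK_RUN_MIN=1
export SK_RUN_MAX=20             # 0 would mean: all runs
```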

Shared libraries determination parameters

When managing data such as recon, mc and/or digi, the skimmer sometimes needs to load the corresponding C++ shared libraries. It needs the ones which were used when generating the data, compiled with the correct release. The list of those shared libraries can be provided by the user in a dedicated file, whose name is defined by variable SK_INPUT_LIBRARY_LIST. In this file, each line is the full path of a shared library, optionally prefixed by the data types associated with the library. If there is no such prefix, the library is to be loaded for any data type. Example of such a file :

...

Instead of the file above, if the user knows it, he can provide the data code release with the variable SK_EXPECTED_RELEASE, and a set of directories where to search for the shared libraries, defined by SK_LIBRARY_DIRS (which has a default value relevant for the SLAC site). The latter is a ':'-separated list of directory paths. SK_EXPECTED_RELEASE should have the form <main_package>/<main_package>-<release>, as one can see in the example above. The exact names of the libraries for a given data type are currently hardcoded, and described in the guide /Skimmer at SLAC/. For example, for each <dir> element in SK_LIBRARY_DIRS, a given <main_package> and a given <release>, the skimmer will look for <dir>/<main_package>/<main_package>-<release>/lib/libcommonRootData.so.
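The lookup rule can be sketched as follows, assuming only libcommonRootData.so is searched for; the directories and the release name below are fabricated for the demo:

```shell
# Sketch of the documented lookup: for each <dir> of SK_LIBRARY_DIRS, test
# <dir>/<main_package>/<main_package>-<release>/lib/libcommonRootData.so.
SK_LIBRARY_DIRS="/tmp/sk_demo/u09:/tmp/sk_demo/u30"        # fabricated paths
SK_EXPECTED_RELEASE="GlastRelease/GlastRelease-v13r9"      # fabricated release

# Build a fake install tree so the search below finds something.
mkdir -p "/tmp/sk_demo/u30/$SK_EXPECTED_RELEASE/lib"
touch "/tmp/sk_demo/u30/$SK_EXPECTED_RELEASE/lib/libcommonRootData.so"

found=""
IFS=:
for dir in $SK_LIBRARY_DIRS; do
  candidate="$dir/$SK_EXPECTED_RELEASE/lib/libcommonRootData.so"
  if [ -f "$candidate" ]; then found="$candidate"; break; fi
done
unset IFS
echo "found: $found"
```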

Finally, if there is no usable SK_INPUT_LIBRARY_LIST and no SK_EXPECTED_RELEASE, the skimmer will inspect the data files, searching for a GLAST file header where it can find the code release used for this data. It should also find the original directory where the libraries were staying when the data was generated, and will inspect this directory first, before trying the ones in SK_LIBRARY_DIRS.

As for the list of input data files, one can obtain a file containing the final list of shared libraries by giving a value to SK_OUTPUT_LIBRARY_LIST. This list is restricted to the data kinds which were used in the skimming job. The output format is the same as the input format above.

Here are the default values of the shell variables for this section :

No Format

SK_INPUT_LIBRARY_LIST=""
SK_EXPECTED_RELEASE=""
SK_LIBRARY_DIRS="/nfs/farm/g/glast/u09/builds/rh9_gcc32:/nfs/farm/g/glast/u30/builds/rh9_gcc32:/afs/slac.stanford.edu/g/glast/ground/releases/rh9_gcc32opt"
SK_OUTPUT_LIBRARY_LIST=""
SK_DEBUG_LIBRARY_LIST="false"

Events cutting parameters

The list of selected events can be obtained from different sources :

  1. From a CompositeEventList : if a CEL file is given as input to the skimmer, and defined with variable SK_INPUT_CEL.
  2. From a textual file made by the user : the format is given below, and the file path is given by variable SK_INPUT_EVENT_LIST.
  3. Indirectly with a cut : the skimmer can generate an event list, based on the values of SK_TCUT and SK_TCUT_DATA_TYPE. The syntax of SK_TCUT should be the ROOT one. Currently, the only valid value for SK_TCUT_DATA_TYPE is merit.

Only one of those three sources will be used for a given job. The variables SK_INPUT_EVENT_LIST and SK_TCUT should not be defined at the same time. If defined, the variable SK_INPUT_CEL will only be used if both SK_INPUT_EVENT_LIST and SK_TCUT are empty. In future releases, we could try to intersect the sets defined by SK_INPUT_EVENT_LIST, SK_TCUT and SK_INPUT_CEL, but we are not yet sure it is useful and usable.

...


The expected content of the file defined by SK_INPUT_EVENT_LIST is a simple list of pairs, one per line, where each pair is made of a run id followed by an event id. The "events" section header can be followed by two special comments which recall the number of events before and after the cut (this information is currently not used by the skimmer). For example :

Panel

#
#! CEL TXT 0.1
#

#! SECTION Events
#! 2000 entries in original dataset.
#! 7 events after cut:
1 8
1 183
1 344
1 553
2 117
2 517
2 980

In case you want to keep all the entries, i.e. to merge the input data files, you should not give any kind of cut to the skimmer : SK_INPUT_CEL, SK_INPUT_EVENT_LIST and SK_TCUT should all be empty.

In theory, we should also be able to combine any number of the cuts above, but this is not yet implemented. Currently, you must define either an input CEL, or a textual file, or a cut. It is not yet possible to define a TCut which goes through several data types, but this will be studied as soon as we have several possible values for SK_TCUT_DATA_TYPE.

As usual, one can obtain a file containing the final list of events by giving a value to SK_OUTPUT_EVENT_LIST, whose output format is the same as the input format above.

...

No Format
SK_INPUT_CEL=""
SK_INPUT_EVENT_LIST=""
SK_TCUT_DATA_TYPE="merit"
SK_TCUT=""
SK_OUTPUT_EVENT_LIST=""
SK_DEBUG_EVENT_LIST="false"
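For instance, a hypothetical cut on the merit tree could be configured like this; the branch names in the TCut expression are examples and may not exist in your merit files:

```shell
# Hypothetical event cut through a ROOT TCut on the merit tree.
export SK_INPUT_CEL=""
export SK_INPUT_EVENT_LIST=""     # must stay empty when SK_TCUT is used
export SK_TCUT_DATA_TYPE="merit"
export SK_TCUT="EvtEnergyCorr>1000 && TkrNumTracks>0"
```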

Branches pruning parameters

The skimmer can also take into account a list of branches to be activated or deactivated. This list is given through a file, whose full path is given by variable SK_INPUT_BRANCH_LIST. Each line should contain a data type prefix, the name of the tree, a + or a - (to activate or deactivate respectively), and the specification of one or several branches (with the ROOT syntax). The lines are applied one after the other : you can deactivate all the branches of a given type with -*, then activate only the ones of interest. There is an implicit first +* for all the data types used in the skimming job (see SK_DATA_TYPES in the next section). So, all the data types which are not explicitly in the branch list will have all their branches activated. Here is an example of such a file :

...

No Format
SK_INPUT_BRANCH_LIST=""
SK_OUTPUT_BRANCH_LIST=""
SK_DEBUG_BRANCH_LIST="false"

The actual final skimming

It is now time to say which types of data we want to skim. This is done through shell variable SK_DATA_TYPES, which should be a ':'-separated list of data types. The currently recognized types can be found in the guide /Skimmer at SLAC/. If SK_DATA_TYPES is empty, a default value of "merit:mc:digi:recon" will be used.

The skimmed files will be stored in the directory defined by shell variable SK_OUTPUT_DIR, in files called SK_OUTPUT_FILE_BODY_<datatype>.root. However, if they turn out to be very big, ROOT may automatically close the first file and open new ones, appending a rank number to the file name. The maximum size in bytes of each output ROOT file can be changed with shell variable SK_MAX_FILE_SIZE. If the value 0 is given to this variable (this is the default), ROOT will use its own default value. Also, if the value is 0 and the job is merging all the events, the ROOT fast merging method will be used.
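Putting it together, a hypothetical output setup could look like this; the directory and file body are examples, and with these values the skimmed merit file would be named mySkim_merit.root:

```shell
# Hypothetical output setup for a skim of merit and recon data.
export SK_DATA_TYPES="merit:recon"
export SK_OUTPUT_DIR="$PWD/skim-out"
export SK_OUTPUT_FILE_BODY="mySkim"
export SK_MAX_FILE_SIZE=0     # 0: ROOT's default, and fast merging when no cut
```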

...

If the events have been cut (thanks to SK_INPUT_EVENT_LIST or SK_TCUT) and SK_OUTPUT_CEL is defined, no deep copy of data will take place. Instead, the skimmer will create a ROOT CEL file giving the list of valid files and entries for each relevant data type. In this case, the pruning of branches will not be applied. The relevant data types are the ones which have a branch giving the run id and a branch giving the event id. For other data types, the usual merging of files will take place, even if SK_OUTPUT_CEL is defined, because the CEL format does not yet support this kind of data.
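A hypothetical shallow skim, writing a CEL instead of copying events (the cut expression and the output path are examples):

```shell
# Hypothetical shallow skim: cut events, but record the selection in a
# ROOT CEL file instead of deep-copying the data.
export SK_TCUT_DATA_TYPE="merit"
export SK_TCUT="EvtEnergyCorr>1000"
export SK_OUTPUT_CEL="$PWD/skim-out/mySkim_cel.root"
```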

...

No Format
SK_DATA_TYPES="merit:mc:digi:recon"
SK_OUTPUT_DIR=${PWD}
SK_OUTPUT_FILE_BODY=${SK_INPUT_TASK}
SK_MAX_FILE_SIZE=0
SK_SKIP_SKIM="false"
SK_OUTPUT_CEL=""
SK_DEBUG_SKIM="false"

...