General

The GLAST Data Catalog is a virtual file system maintained in an Oracle database.  GLAST data may be stored at several locations at SLAC, Lyon and elsewhere.  The files themselves may be stored on disk in AFS, NFS, or XROOTD managed servers, in one of several tape archive systems, or any combination of these.  The Data Catalog simplifies access to data by providing a uniform view of files irrespective of their physical location.  The Data Catalog provides features that are not available in standard file systems.  These include:

  • Files may be tagged with meta-data attributes (typed name/value pairs) that provide additional information about the data they contain
  • Files may be tagged with several physical locations allowing a file to exist in multiple places for more convenient access
  • The Data Catalog maintains a conventional folder structure but also provides a group structure which allows files of different pedigree to be separated while coexisting within the same folder
  • The Data Catalog provides access to files in two ways:
    • by requesting a file or set of files at a specific location (folder / group)
    • by requesting a set of files via a meta data query

Access to the Data Catalog is provided via a Java API.  This API is under continued development and features are regularly added.  Any Java program running within the SLAC firewall may use this API to take advantage of the full Data Catalog feature set.  The Java API is provided in a child page for reference (See link at bottom of this page.)  Currently there are two methods of accessing the API:

  • A line-mode client available from SLAC UNIX machines
  • Jython scriptlet processes in the GLAST Pipeline

Line-mode Client

The Line-mode client is available from the UNIX command line at SLAC.  It represents a subset of the full Data Catalog API.

The Data Catalog Line-mode executable is available at:

    /afs/slac.stanford.edu/g/glast/ground/bin/datacat

Invoking the executable with no parameters will display the help screen.  One may obtain command-specific help explicitly by executing:

    /afs/slac.stanford.edu/g/glast/ground/bin/datacat -h <command>

Usage is similar to CVS: the command name follows the executable, then any options and arguments.  The following commands are currently available:

find  (for Dataset searches)

Linemode-client find command 

registerDataset

Adds a new dataset to the catalog

 Usage:

    datacat registerDataset [-options] <data type> <logical folder> <file path>

Required Parameters:

  <data type>        Type of data in the file (merit, MC, DIGI, RECON, etc.)  See Java API child page for a full list.
  <logical folder>   Dataset Folder Path under which to create the new dataset.
  <file path>          Physical location of file to add to Data Catalog.

Optional Parameters:

  Long Form   Short Form   Parameter       Default Value    Description
  --name      -n           dataset name    file name        Name to give new dataset in the catalog
  --group     -G           group name      none             Group under which to store the dataset
  --format    -F           file format     file extension   Format of the file (root, fits, etc.)
  --site      -S           site name       SLAC             Site where dataset physically exists (SLAC, SLAC_XROOT, etc.)
  --define    -D           "name=value"    none             Define a meta data name/value pair for the new dataset.  This option may be used more than once.  For naming rules, see the Java API child page.

 Example:

    datacat registerDataset -n 000002 -G merit -D nEvt=2500 -S SLAC -F root merit /ServiceChallenge/Interleave3h-GR-v11r17/runs  /nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root
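When registration is driven from a script, it can help to assemble the command as an argument list rather than a single string. The following is a minimal Python sketch under stated assumptions: the helper name `register_dataset_args` and its keyword arguments are hypothetical and not part of the Data Catalog; only the executable path, the command name, and the option letters come from the text above.

```python
# Path of the line-mode executable, as given earlier on this page.
DATACAT = "/afs/slac.stanford.edu/g/glast/ground/bin/datacat"


def register_dataset_args(data_type, folder, file_path, name=None,
                          group=None, file_format=None, site=None,
                          metadata=None):
    """Build the argument list for 'datacat registerDataset'.

    Hypothetical helper: it only assembles arguments, it does not run
    anything. 'metadata' is a list of (name, value) pairs; each pair
    becomes one -D option.
    """
    args = [DATACAT, "registerDataset"]
    if name is not None:
        args += ["-n", name]
    if group is not None:
        args += ["-G", group]
    if file_format is not None:
        args += ["-F", file_format]
    if site is not None:
        args += ["-S", site]
    for key, value in (metadata or []):
        args += ["-D", "%s=%s" % (key, value)]
    # The three required parameters come last, in the documented order.
    args += [data_type, folder, file_path]
    return args


example = register_dataset_args(
    "merit",
    "/ServiceChallenge/Interleave3h-GR-v11r17/runs",
    "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/"
    "Interleave3h-GR-v11r17-000002-merit.root",
    name="000002", group="merit", file_format="root", site="SLAC",
    metadata=[("nEvt", 2500)])
```

The resulting list could then be handed to something like `subprocess.call(example)` on a SLAC UNIX machine. Passing a list avoids shell quoting problems with the "name=value" pairs.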

addLocation

Adds an additional physical location to an existing dataset.  Use this command to record that a dataset exists in more than one physical location (e.g., on SLAC NFS and in SLAC XROOT).  Except for <file path>, all of the parameters and options identify the existing dataset entry to which the additional physical location is added.

 Usage:

    datacat addLocation [-options] <dataset name> <logical folder> <file path>

 Required Parameters:

  <dataset name>     Name of existing dataset
  <logical folder>      Data Catalog Folder Path under which the dataset lives.
  <file path>             Additional physical location of file to add to the dataset entry.

 Optional Parameters:

  Long Form   Short Form   Parameter     Default Value   Description
  --group     -G           group name    none            Dataset Group in the Data Catalog under which the dataset lives.
  --site      -S           site name     SLAC            Site at which the additional physical location exists.

 Example: 

    datacat addLocation -G merit -S SLAC_XROOT  000002  /ServiceChallenge/Interleave3h-GR-v11r17/runs  root://glast-rdr//glast/mc/ServiceChallenge/Interleave3h-GR-v11r17/merit/Interleave3h-GR-v11r17-000002-merit.root

addMetaData

Adds one or more meta data entries to an existing dataset, group, or folder.

 Usage:

    datacat addMetaData [-options] <logical folder>

 Required Parameters:
  <logical folder>   Logical Folder Path where the group or dataset lives, or the folder to tag with meta data if no dataset or group is specified.

Optional Parameters:

  Long Form   Short Form   Parameter       Default Value   Description
  --dataset   -n           dataset name    file name       Name of existing dataset
  --group     -G           group name      none            Dataset Group in the Data Catalog under which the dataset lives.
  --define    -D           "name=value"    none            Define a new meta data name/value pair for the dataset.  This option may be used more than once (and must be used at least once).  For naming rules, see the Java API child page.

 Example:

    datacat addMetaData -n 000002 -G merit -D nEvt=2500 /ServiceChallenge/Interleave3h-GR-v11r17/runs
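The options above select what gets tagged, and at least one -D option is mandatory. The following Python sketch illustrates both points. The helper `add_metadata_args` is hypothetical. The reading that the folder itself is tagged when neither -n nor -G is given is taken from the parameter description above, not from the tool's source.

```python
def add_metadata_args(folder, definitions, dataset=None, group=None):
    """Build the argument list for 'datacat addMetaData'.

    Hypothetical helper. 'definitions' is a list of (name, value)
    pairs. With neither 'dataset' nor 'group', the command is
    understood to tag the folder itself (assumption based on the
    parameter description).
    """
    if not definitions:
        # The -D option must be used at least once.
        raise ValueError("addMetaData needs at least one name=value pair")
    args = ["datacat", "addMetaData"]
    if dataset is not None:
        args += ["-n", dataset]
    if group is not None:
        args += ["-G", group]
    for name, value in definitions:
        args += ["-D", "%s=%s" % (name, value)]
    args.append(folder)
    return args


folder = "/ServiceChallenge/Interleave3h-GR-v11r17/runs"
tag_dataset = add_metadata_args(folder, [("nEvt", 2500)],
                                dataset="000002", group="merit")
tag_folder = add_metadata_args(folder, [("RunStatus", "ok")])
```

Here `tag_dataset` reproduces the example command above, while `tag_folder` carries no -n or -G option and would therefore apply to the folder.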

Pipeline Jython Scriptlets

Jython scriptlet processes within the pipeline have access to the full Java API.  Access to the Data Catalog is provided via an object named "datacatalog".

As an example, dataset registration is performed by calling:

  datacatalog.registerDataset(DATA_TYPE, DATA_CATALOG_LOCATION, DISK_LOCATION [, META_DATA])

where:

  • DATA_TYPE is the type of data within the file.
    • Typical values are MERIT, MC, RECON, ...
    • (See the Java API link below for a full list.)
  • DATA_CATALOG_LOCATION has the following form:  <logical folder path>[<dataset group name>:]<dataset name>
    • <logical folder path> is required and has the form: /folder1/sub-folder/.../
      • It denotes the location within the Data Catalog folder-tree where the dataset will be registered.
      • The folder need not exist; it will be created if necessary.
    • <dataset group name> is optional.
      • If present, it must be followed by a ":" (colon) character.
      • The name is a simple alphanumeric string (spaces are not permitted.)
      • A dataset group is used to bundle together datasets which are fragments of a larger dataset.
      • For example, all merit files of a large Monte Carlo task are generally cataloged together using a dataset group.
    • <dataset name> is required.
      • It is simply the name of the dataset.
      • It is an alphanumeric string (spaces are not permitted.)
      • It must be unique within the folder or group where it will be placed.
  • DISK_LOCATION has the following form: <disk file path>[@<site name>]
    • <disk file path> is required.
      • It is the full path on disk (or in XRootd, etc.) to the file that is being registered.
    • <site name> is optional.
      • If specified, it must be preceded by a "@" (at sign) character.
      • The site name tells the data catalog where to find the physical file.
      • Currently it may be one of:
        • SLAC, SLAC_XROOT, IN2P3, IN2P3_HPSS, UW
      • If no site name is specified, a default of "SLAC" is assumed.
  • META_DATA is optional.  If specified, the supplied meta-data will be attached to the dataset upon registration.  Meta-data provide a basis for searching the Data Catalog for datasets.  A META_DATA expression has the following form: <name>=<value>[:<name2>=<value2>[...]]
    • <name> is required.
      • It is simply the name of the meta-data object, but its form is significant because it denotes the object type of the <value> parameter. The Data Catalog will perform a type conversion and store the <value> parameter internally based on the type specified by the name:
        • n[A-Z]+.* (ex: nEvents, nSecondsMET) indicates a numeric value
        • t[A-Z]+.* (ex: tStartDate, tEndDate) indicates a timestamp value
        • Anything else (ex: RunStatus, myDogsName) indicates a string value
    • <value> is required and must be separated from <name> by a single '=' (equals) character.
      • The value must reflect the type specified by <name> or an error will be thrown, and the registration will fail.
        • Numeric values have 38 decimal digits of precision for integers and 18 for floats. Leading and trailing zeros will be removed during conversion.
        • Timestamp values must be supplied in the following format: yyyy-mm-dd hh:mm:ss.[fff...]
          (fff... is an optional, fractional seconds component with nanosecond precision.)
        • String values are simply ASCII strings. Put whatever you want in there, even numbers.
    • Multiple <name>=<value> pairs may be supplied if separated by ":" (colon) characters
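The two location strings described above are easy to get wrong by hand, so a scriptlet may prefer to compose them from their parts. The following is a minimal sketch, not part of the Data Catalog API: the function names are hypothetical, the site list is copied from the text above and may be out of date, and the name check only rejects whitespace and the delimiter characters (the catalog's own validation may be stricter).

```python
# Site names listed on this page; the live catalog may accept others.
KNOWN_SITES = ("SLAC", "SLAC_XROOT", "IN2P3", "IN2P3_HPSS", "UW")


def _check_name(label, value):
    """Reject empty names and names containing whitespace, '/' or ':'."""
    if not value or any(ch.isspace() or ch in "/:" for ch in value):
        raise ValueError("invalid %s: %r" % (label, value))


def make_catalog_location(folder, name, group=None):
    """Compose <logical folder path>[<dataset group name>:]<dataset name>."""
    if not folder.startswith("/"):
        raise ValueError("folder path must start with '/': %r" % folder)
    if not folder.endswith("/"):
        folder += "/"
    _check_name("dataset name", name)
    prefix = ""
    if group is not None:
        _check_name("group name", group)
        prefix = group + ":"
    return folder + prefix + name


def make_disk_location(path, site=None):
    """Compose <disk file path>[@<site name>]; no site means SLAC."""
    if site is None:
        return path
    if site not in KNOWN_SITES:
        raise ValueError("unknown site: %r" % site)
    return path + "@" + site


location = make_catalog_location(
    "/ServiceChallenge/Interleave3h-GR-v11r17/runs", "000002", group="merit")
```

The values returned by these helpers are intended as the second and third arguments of datacatalog.registerDataset.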

Below is an example. The parameters are interpreted as follows:

  • It registers a "merit" type dataset. 
  • The dataset is placed under the Data Catalog folder "/ServiceChallenge/Interleave3h-GR-v11r17/runs/", in the group "merit", with a name of  "000002". 
  • The file is found on disk at "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root", and is assumed to be located at SLAC (because no site name was specified.)

  datacatalog.registerDataset("merit","/ServiceChallenge/Interleave3h-GR-v11r17/runs/merit:000002","/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root")
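The example above omits the optional META_DATA argument. The following sketch shows how the naming convention maps names to types and how a META_DATA expression might be assembled. The helper names are hypothetical. Because ":" separates pairs, and the text does not say how a value containing ":" (such as a timestamp) is escaped, this sketch rejects such values rather than guessing.

```python
import re

# Naming patterns quoted from the META_DATA description above.
_NUMERIC = re.compile(r"n[A-Z]+.*")
_TIMESTAMP = re.compile(r"t[A-Z]+.*")


def metadata_type(name):
    """Return the value type implied by a meta-data name."""
    if _NUMERIC.match(name):
        return "numeric"
    if _TIMESTAMP.match(name):
        return "timestamp"
    return "string"


def make_metadata(pairs):
    """Join (name, value) pairs into <name>=<value>[:<name2>=<value2>...]."""
    parts = []
    for name, value in pairs:
        text = str(value)
        if not name or "=" in name or ":" in name:
            raise ValueError("invalid meta-data name: %r" % name)
        if ":" in text:
            # Escaping rules are not documented here, so refuse to guess.
            raise ValueError("value contains ':' separator: %r" % text)
        parts.append("%s=%s" % (name, text))
    return ":".join(parts)


meta = make_metadata([("nEvt", 2500), ("RunStatus", "ok")])
```

The resulting string, here "nEvt=2500:RunStatus=ok", would be passed as the fourth argument of datacatalog.registerDataset. Note that a name such as "name" is treated as a string, because the leading "n" is not followed by an upper-case letter.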
