Data Catalog Users Guide

Current version by Daniel Flath on Jul 03, 2008 13:32.


 

 h1. General
  
 The GLAST Data Catalog is a virtual file system maintained in an Oracle database.  GLAST data may be stored at several locations at SLAC, Lyon and elsewhere.  The files themselves may be stored on disk in AFS, NFS, or XROOTD managed servers, in one of several tape archive systems, or any combination of these.  The Data Catalog simplifies access to data by providing a uniform view of files irrespective of their physical location.  The Data Catalog provides features that are not available in standard file systems.  These include:
 * Files may be tagged with meta-data attributes (typed name/value pairs) that provide additional information about the data they contain
 * Files may be tagged with several physical locations allowing a file to exist in multiple places for more convenient access
 * The Data Catalog maintains a conventional _folder_ structure but also provides a _group_ structure which allows files of different pedigree to be separated while coexisting within the same folder
 * The Data Catalog provides access to files in two ways:
 ** by requesting a file or set of files at a specific location (folder / group)
 ** by requesting a set of files via a meta data query
  
 Access to the Data Catalog is provided via a Java API.  This API is under continued development and features are regularly added.  Any Java program running within the SLAC firewall may use this API to take advantage of the full Data Catalog feature set.  The Java API is provided in a child page for reference (_See link at bottom of this page._)  Currently there are two methods of accessing the API:
 * A line-mode client available from SLAC UNIX machines
 * Jython scriptlet processes in the GLAST Pipeline
  
 h2. Line-mode Client
  
 The Line-mode client is available from the UNIX command line at SLAC.  It represents a subset of the full Data Catalog API.
  
 The Data Catalog Line-mode executable is available at:
  
     /afs/slac.stanford.edu/g/glast/ground/bin/datacat
  
 Invoking the executable with no parameters will display the help screen.  One may obtain command-specific help explicitly by executing:
  
 &nbsp;&nbsp;&nbsp; /afs/slac.stanford.edu/g/glast/ground/bin/datacat \-h <command>
  
 &nbsp;(The usage is similar to CVS)&nbsp; The following commands are currently available:
  
 h3. find (for Dataset searches)
  
  
 [Linemode-client find command]&nbsp;
  
 h3. registerDataset
  
 Adds a new dataset to the catalog
  
 h5. &nbsp;Usage:
  
 &nbsp;&nbsp;&nbsp; datacat registerDataset \[-options\] <data type> <logical folder> <file path>
 \\
  
 h5. Required Parameters:
  
 &nbsp; <data type>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Type of data in the file (merit, MC, DIGI, RECON, etc.)&nbsp; See Java API child page for a full list.
 &nbsp; <logical folder>&nbsp;&nbsp; Dataset Folder Path under which to create the new dataset.
 &nbsp; <file path>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp; Physical location of file to add to Data Catalog.
  
 h5. Optional Parameters:
  
 || Long Form \\ || Short Form || Parameter \\ || Default Value || Description \\ ||
 | \--name | \-n | dataset name \\ | file name | Name to give new dataset in the catalog \\ |
 | \--group | \-G | group name \\ | none | Group under which to store the dataset \\ |
 | \--format | \-F | file format \\ | file extension | Format of the file (root, fits, etc.) \\ |
 | \--site | \-S | site name \\ | SLAC | Site where dataset physically exists (SLAC, SLAC_XROOT, etc.) \\ |
 | \--define | \-D | "name=value" \\ | none | Define a meta data name/value pair for the new dataset.&nbsp; This option may be used more than once.&nbsp; For naming rules, see the Java API child page \\ |
  
 h5. &nbsp;Example:
  
 &nbsp;&nbsp;&nbsp; datacat registerDataset \-n 000002 \-G merit \-D nEvt=2500 \-S SLAC \-F root merit /ServiceChallenge/Interleave3h-GR-v11r17/runs&nbsp; /nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root
 \\
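When registration is driven from a script, the command line above can be assembled programmatically. The following is a minimal Python sketch and not part of the Data Catalog distribution: the helper name build_register_command and its keyword arguments are hypothetical, and only the options listed in the table above are used. It builds the argument list without running anything, so the result can be inspected first.

```python
DATACAT = "/afs/slac.stanford.edu/g/glast/ground/bin/datacat"


def build_register_command(data_type, logical_folder, file_path,
                           name=None, group=None, file_format=None,
                           site=None, metadata=None):
    """Return the argument list for 'datacat registerDataset'.

    Hypothetical helper: it only mirrors the options documented above.
    """
    argv = [DATACAT, "registerDataset"]
    if name is not None:
        argv += ["-n", name]
    if group is not None:
        argv += ["-G", group]
    if file_format is not None:
        argv += ["-F", file_format]
    if site is not None:
        argv += ["-S", site]
    # -D may be given more than once, one name=value pair per occurrence.
    for key, value in sorted((metadata or {}).items()):
        argv += ["-D", "%s=%s" % (key, value)]
    # Required positional parameters follow the options, in documented order.
    argv += [data_type, logical_folder, file_path]
    return argv
```

The returned list can be passed to subprocess.call on a SLAC UNIX machine where the executable is available. Passing a list rather than a single string avoids shell quoting problems with meta data values.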
  
 h3. addLocation
  
 Adds an additional physical location to an existing dataset.&nbsp; Use this command to record that a dataset exists in more than one physical location (for example, on SLAC NFS and in SLAC XROOT).&nbsp; Except for <file path>, all of the parameters and options identify the existing dataset entry to which the additional physical location is added.
 \\
 \\
  
 h5. &nbsp;Usage:
  
 &nbsp;&nbsp;&nbsp; datacat addLocation \[-options\] <dataset name> <logical folder> <file path>
  
 h5. &nbsp;Required Parameters:
  
 &nbsp; <dataset name>&nbsp;&nbsp;&nbsp;&nbsp; Name of existing dataset
 &nbsp; <logical folder>&nbsp;&nbsp; &nbsp;&nbsp; Data Catalog Folder Path under which the dataset lives.
 &nbsp; <file path>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp; Additional physical location of file to add to the dataset entry.
  
 h5. &nbsp;Optional Parameters:
  
 || Long Form \\ || Short Form || Parameter \\ || Default Value || Description \\ ||
 | \--group | \-G | group name \\ | none | Dataset Group in the Data Catalog under which the dataset lives. |
 | \--site | \-S | site name \\ | SLAC | Site at which the additional physical location exists. |
  
 h5. &nbsp;Example:&nbsp;
  
 &nbsp;&nbsp;&nbsp; datacat addLocation \-G merit \-S SLAC_XROOT&nbsp; 000002&nbsp; /ServiceChallenge/Interleave3h-GR-v11r17/runs&nbsp; root://glast-rdr//glast/mc/ServiceChallenge/Interleave3h-GR-v11r17/merit/Interleave3h-GR-v11r17-000002-merit.root
 \\
  
 h3. addMetaData
  
 Adds one or more meta data entries to an existing dataset.
 \\
  
 h5. &nbsp;Usage:
  
 &nbsp;&nbsp;&nbsp; datacat addMetaData \[-options\] <logical folder>
  
 h5. Required Parameters:
  
 &nbsp; <logical folder>&nbsp;&nbsp; Logical folder path where the group or dataset lives.&nbsp; If no dataset or group is specified, the folder itself is tagged with the meta data.
  
 h5. Optional Parameters:
  
 || Long Form \\ || Short Form || Parameter \\ || Default Value || Description \\ ||
 | \--dataset | \-n | dataset name \\ | file name | Name of existing dataset |
 | \--group | \-G | group name \\ | none | Dataset Group in the Data Catalog under which the dataset lives. |
 | \--define | \-D | "name=value" \\ | none | Define a new meta data name/value pair for the dataset.&nbsp; This option may be used more than once.&nbsp; (And must be used at least once\!)&nbsp; For naming rules, see the Java API child page \\ |
  
 h5. &nbsp;Example:
  
 &nbsp;&nbsp;&nbsp; datacat addMetaData \-n 000002 \-G merit \-D nEvt=2500 /ServiceChallenge/Interleave3h-GR-v11r17/runs
 \\
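The requirement that \--define be used at least once is easy to miss when the command is generated by a script. Below is a small Python sketch that enforces it before the command is built. The function name build_add_metadata_command is hypothetical; only the options from the table above are assumed.

```python
DATACAT = "/afs/slac.stanford.edu/g/glast/ground/bin/datacat"


def build_add_metadata_command(logical_folder, metadata, dataset=None, group=None):
    """Return the argument list for 'datacat addMetaData' (hypothetical helper)."""
    if not metadata:
        # The documentation requires at least one name=value pair.
        raise ValueError("addMetaData needs at least one name=value pair")
    argv = [DATACAT, "addMetaData"]
    if dataset is not None:
        argv += ["-n", dataset]
    if group is not None:
        argv += ["-G", group]
    # One -D option per meta data pair.
    for key, value in sorted(metadata.items()):
        argv += ["-D", "%s=%s" % (key, value)]
    # The logical folder is the only positional parameter.
    argv.append(logical_folder)
    return argv
```

Leaving out both dataset and group produces a command that targets the folder itself, matching the description of the <logical folder> parameter.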
  
 h2. Pipeline Jython Scriptlets
  
 Jython scriptlet processes within the pipeline enjoy access to the full Java API.&nbsp; Access to the Data Catalog is provided via an object named "datacatalog".
  
 As an example, dataset registration is performed by calling:
  
 &nbsp; datacatalog.registerDataset(DATA_TYPE, DATA_CATALOG_LOCATION, DISK_LOCATION \[, META_DATA\])
  
 where:
 * DATA_TYPE is the type of data within the file.
 ** Typical values are MERIT, MC, RECON, ...
 ** (See the Java API link below for a full list.)
 * DATA_CATALOG_LOCATION has the following form:&nbsp; <logical folder path>\[<dataset group name>:\]<dataset name>
 ** <logical folder path> is required and has the form: /folder1/sub-folder/.../
 *** It denotes the location within the Data Catalog folder-tree where the dataset will be registered.
 *** The folder need not exist; it will be created if necessary.
 ** <dataset group name> is optional.
 *** If present, it must be followed by a ":" (colon) character.
 *** The name is a simple alphanumeric string (spaces are not permitted.)
 *** A dataset group is used to bundle together datasets which are fragments of a larger dataset.
 *** For example, all merit files of a large monte carlo task are generally cataloged together using a dataset group.
 ** <dataset name> is required.
 *** It is simply the name of the dataset.
 *** It is an alphanumeric string (spaces are not permitted.)
 *** It must be unique within the folder or group where it will be placed.
 * DISK_LOCATION has the following form: <disk file path>\[@<site name>\]
 ** <disk file path> is required.
 *** It is the full path on disk (or in XRootd, etc.) to the file that is being registered.
 ** <site name> is optional.
 *** If specified, it must be preceded by a "@" (at sign) character.
 *** The site name tells the data catalog where to find the physical file.
 *** Currently it may be one of:
 **** SLAC, SLAC_XROOT, IN2P3, IN2P3_HPSS, UW
 *** If no site name is specified, a default of "SLAC" is assumed.
 * META_DATA is optional.&nbsp; If specified, the supplied meta-data will be attached to the dataset upon registration.&nbsp; Meta-data provide a basis for searching the Data Catalog for datasets.&nbsp; A META_DATA expression has the following form: <name>=<value>\[:<name2>=<value2>\[...\]\]
 ** <name> is required.
 *** It is simply the name of the meta-data object, but its form is significant because it denotes the object type of the <value> parameter. The Data Catalog will perform a type conversion and store the <value> parameter internally based on the type specified by the name:
 **** n\[A-Z\]+.\* (ex: nEvents, nSecondsMET) indicates a numeric value
 **** t\[A-Z\]+.\* (ex: tStartDate, tEndDate) indicates a timestamp value
 **** Anything else (ex: RunStatus, myDogsName) indicates a string value
 ** <value> is required and must be separated from <name> by a single '=' (equals) character.
 *** The value must reflect the type specified by <name> or an error will be thrown, and the registration will fail.
 **** Numeric values have 38 decimal digits of precision for integers and 18 for floats. Leading and trailing zeros will be removed during conversion.
 **** Timestamp values must be supplied in the following format: yyyy-mm-dd hh:mm:ss.\[fff...\] (fff... is an optional, fractional seconds component with nanosecond precision.)
 **** String values are simply ASCII strings. Put whatever you want in there, even numbers.
 ** Multiple <name>=<value> pairs may be supplied if separated by ":" (colon) characters
 \\
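The string formats described above can be assembled with a few small helpers. This is a minimal Python sketch rather than part of the Java API: the function names are hypothetical, the name patterns are taken directly from the naming rules above, and the check that rejects "=" and ":" inside names is a precaution of this sketch. How the catalog handles the colons of a timestamp value inside a colon-separated META_DATA expression is not stated here, so the sketch does not attempt to format timestamps.

```python
import re

# Name patterns copied from the naming rules above.
_NUMERIC = re.compile(r"n[A-Z]+.*")
_TIMESTAMP = re.compile(r"t[A-Z]+.*")


def metadata_type(name):
    """Return the value type implied by a meta-data name."""
    if _NUMERIC.match(name):
        return "number"
    if _TIMESTAMP.match(name):
        return "timestamp"
    return "string"


def catalog_location(folder, name, group=None):
    """Build <logical folder path>[<dataset group name>:]<dataset name>."""
    if not folder.endswith("/"):
        folder += "/"
    if group:
        return "%s%s:%s" % (folder, group, name)
    return folder + name


def disk_location(path, site=None):
    """Build <disk file path>[@<site name>]; no site means the default."""
    if site:
        return "%s@%s" % (path, site)
    return path


def build_metadata(pairs):
    """Join (name, value) pairs into <name>=<value>[:<name2>=<value2>...]."""
    parts = []
    for name, value in pairs:
        if "=" in name or ":" in name:
            raise ValueError("separator character in name: %r" % (name,))
        parts.append("%s=%s" % (name, value))
    return ":".join(parts)


print(metadata_type("nEvents"))     # number
print(metadata_type("tStartDate"))  # timestamp
print(metadata_type("myDogsName"))  # string
print(catalog_location("/ServiceChallenge/Interleave3h-GR-v11r17/runs",
                       "000002", group="merit"))
print(build_metadata([("nEvt", 2500), ("RunStatus", "OK")]))
```

Note that a name such as "name" is classified as a string, because the leading "n" must be followed by an upper-case letter to indicate a numeric value.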
  
 Below is an example. The parameters are interpreted as follows:
 * It registers a "merit" type dataset.&nbsp;
 * The dataset is placed under the Data Catalog folder "/ServiceChallenge/Interleave3h-GR-v11r17/runs/", in the group "merit", with a name of&nbsp; "000002".&nbsp;
 * The file is found on disk at "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root", and is assumed to be located at SLAC (because no site name was specified.)
  
 datacatalog.registerDataset("merit","/ServiceChallenge/Interleave3h-GR-v11r17/runs/merit:000002","/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root")
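
The same registration with the optional parts filled in might look like the sketch below. It assumes the four-argument form shown earlier, with META_DATA passed as a single string. The site name is given explicitly, the value nEvt=2500 is borrowed from the line-mode example, and RunStatus=OK is a purely illustrative pair. The wrapper function is hypothetical; inside a pipeline scriptlet the "datacatalog" object is already available and the call can be made directly.

```python
def register_with_metadata(datacatalog):
    """Register the example merit file with an explicit site and meta-data.

    'datacatalog' is the object the pipeline provides to Jython scriptlets.
    """
    folder = "/ServiceChallenge/Interleave3h-GR-v11r17/runs/"
    path = ("/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/"
            "data/merit/Interleave3h-GR-v11r17-000002-merit.root")
    datacatalog.registerDataset(
        "merit",                   # DATA_TYPE
        folder + "merit:000002",   # DATA_CATALOG_LOCATION: folder, group, name
        path + "@SLAC",            # DISK_LOCATION with explicit site name
        "nEvt=2500:RunStatus=OK")  # META_DATA: one numeric and one string pair
```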