Development of a RESTful API for the Datacatalog

Development of the datacatalog's RESTful API started with an implementation for EXO that only retrieved information about datasets. There have been some changes since then, but the core features are the same. As we have moved away from an EXO-specific implementation, we have added several core features that allow the datacatalog to be used by many experiments, and we have adopted a plug-in architecture for experiment-specific additions.

Core Features

Resources

The core resources start with a base REST URI. Here are some examples and definitions of the URI structure.

Base REST application URI
http://srs.slac.stanford.edu/rest/datacat
REST Application + REST resource base URI
http://srs.slac.stanford.edu/rest/datacat/path
REST Application + REST resource base + resource path (for folder /LSST/SensorAnalysis)
http://srs.slac.stanford.edu/rest/datacat/path/LSST/SensorAnalysis
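
The URI structure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the datacatalog itself: the helper name `resource_uri` is hypothetical, and percent-encoding each path segment is an assumption about how unusual characters in folder names should be handled.

```python
from urllib.parse import quote

# Base REST application URI, as given above.
BASE_URI = "http://srs.slac.stanford.edu/rest/datacat"

def resource_uri(resource, path=""):
    """Join the application base, a resource base (e.g. "path") and a datacatalog path."""
    # Assumption: percent-encode the path but keep "/" as the separator.
    encoded = quote(path.strip("/"), safe="/")
    parts = [BASE_URI, resource.strip("/")]
    if encoded:
        parts.append(encoded)
    return "/".join(parts)

print(resource_uri("path"))
print(resource_uri("path", "/LSST/SensorAnalysis"))
```

The two calls reproduce the second and third URIs listed above.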

Core resources with HTTP methods and the operations allowed

Each resource below is listed with its REST resource base URI, followed by a description of its resource path and then its handling of GET, POST, PATCH (PUT), and DELETE requests, in that order.

/path
Resource path: Directly access any datacatalog element by its full path.
GET: Returns a PATH item (folder, group, or dataset) and information about it. Does not return any derived statistics; it is meant to be quick.
POST: Unsupported.
PATCH (PUT): Unsupported.
DELETE: Unsupported.

/datasets
Resource path: If the resource path is a group or a folder, this resource supports only GET requests intended to query for datasets. If the resource path is a dataset, all methods are supported.
GET: Directly address a dataset and get full information about its version, the sites for a version, etc. Optionally, address a Group or Folder and search, with the option to recurse if it is a Folder. A search can contain metadata or plugin-specific data. See "Searching" below.
POST: [Not Implemented Yet] Register a new dataset. Initial registration should contain a complete dataset, that is, information about the dataset, its version, and a location. If the dataset already exists and you are adding a new version, an error will be returned because PATCH/PUT should be used. The full path specified should contain the dataset's name. The POST payload should be XML/JSON and include all necessary information.
PATCH (PUT): [Not Implemented Yet] Add a new dataset version, add a new location for a dataset version, or modify metadata for a dataset version.
DELETE: [Not Implemented Yet] Remove a dataset, a dataset version, or a dataset version's location. (Will likely be disabled by default, but enabled on a per-user basis.)

/children
Resource path: The children of a given container object (Folder or Group).
GET: When supplied with a folder, returns all groups and datasets directly in that folder. When supplied with a group, returns all datasets in that group.
POST: Not Supported.
PATCH (PUT): Not Supported.
DELETE: Not Supported.

/groups
[Not implemented yet]
Resource path: If the resource path is a folder, this resource supports only GET requests intended to query for groups starting at that folder path. If the resource path is a group, all methods are supported.
GET: Directly access a group and get full information about its metadata and the number of datasets in the group. When the resource is a Folder, search that Folder for groups, with the option to recurse. A search can contain metadata or plugin-specific data.
POST: Create a new Group with the given location. Returns an error if the group already exists. The POST payload should be XML/JSON and include all necessary information, including metadata.
PATCH (PUT): Update the group name or parent folder, or modify metadata for a group.
DELETE: (Not likely to be supported for a while) Remove a group.

/folders
[Not Implemented Yet]
Resource path: Address a Folder only.
GET: Get the folder and information about the objects in it. Retrieves all information about all datasets and groups in the folder and computes things like total file size.
POST: Create a new Folder with the given location. Returns an error if the Folder already exists. The POST payload should be XML/JSON and include all necessary information, including metadata.
PATCH (PUT): Move, rename, or modify metadata for a folder.
DELETE: (Not likely to be supported for a while) Remove a folder.
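
The resource descriptions above can be condensed into a small lookup that a client might consult before sending a request. This is only a sketch: the names `SUPPORT` and `method_status` are hypothetical, "planned" stands for anything marked as not implemented yet, and treating PUT as an alias for PATCH is an assumption based on the "PATCH(PUT)" label.

```python
# Method support per resource, transcribed from the descriptions above.
# "planned" marks operations described as not implemented yet.
ALL_PLANNED = {"GET": "planned", "POST": "planned", "PATCH": "planned", "DELETE": "planned"}

SUPPORT = {
    "path": {"GET": "supported"},
    "children": {"GET": "supported"},
    "datasets": {"GET": "supported", "POST": "planned", "PATCH": "planned", "DELETE": "planned"},
    "groups": dict(ALL_PLANNED),
    "folders": dict(ALL_PLANNED),
}

def method_status(resource, method):
    """Return "supported", "planned" or "unsupported" for a resource/method pair."""
    verb = method.upper()
    if verb == "PUT":
        # Assumption: PUT is handled the same way as PATCH.
        verb = "PATCH"
    return SUPPORT.get(resource.strip("/"), {}).get(verb, "unsupported")

print(method_status("/path", "GET"))
print(method_status("/datasets", "PUT"))
print(method_status("/children", "DELETE"))
```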

Plugin resources

Resources can also be added through a plugin. For more information on plugins and how they work, see below. Plugin resources are available under the application base URI.

REST Application + REST resource base URI
Experiment:
http://srs.slac.stanford.edu/rest/datacat/exo
Experiment Resource "runs":
http://srs.slac.stanford.edu/rest/datacat/exo/runs

Experiment:
http://srs.slac.stanford.edu/rest/datacat/lsst
Experiment Resource "sensors":
http://srs.slac.stanford.edu/rest/datacat/lsst/sensors

Security and Authentication

Authentication is handled in two ways. First, an end user can authenticate with CAS, either in the browser or through an HTTP library in the language they are using, and use CAS management. A user in a browser therefore sees this application like any other application we make, and we can use the REST resources from web applications.

Second, authentication can be handled via HMAC. A user is given a UUID and a shared secret key, which they use to sign parts of their request. The request includes an Authorization header containing the UUID and the signed hash.
The path used for signing is the path after the REST application URI. Every request must include a date, and the server must receive the request within a fixed amount of time (still to be decided: 5 seconds? a minute?). The string to sign is built as follows:

"GET" + "\n" + // (HTTP Method)
"/path/LSST/SensorAnalysis" + "\n" + // (REST resource + path up to query string)
"" + "\n" + // (Content-MD5)
"" + "\n" + // (Content-Type)
"Thu, 29 Aug 2013 20:16:48 GMT" + "\n" // (Date)

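A minimal Python sketch of the HMAC signing described in this section. Several details are assumptions because they are not specified here: the hash function (HMAC-SHA1), the base64 encoding of the digest, and the layout and scheme name of the Authorization header. The function names, UUID, and secret are hypothetical.

```python
import base64
import hashlib
import hmac

def string_to_sign(method, resource_path, date, content_md5="", content_type=""):
    """Join the request parts in the order shown above, each followed by a newline."""
    return "\n".join([method, resource_path, content_md5, content_type, date]) + "\n"

def sign(secret_key, message):
    # Assumption: HMAC-SHA1 with a base64-encoded digest; the hash is not specified.
    digest = hmac.new(secret_key.encode("utf-8"), message.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")

def authorization_header(user_uuid, signature):
    # Assumption: hypothetical "DATACAT <uuid>:<signature>" layout.
    return "DATACAT {}:{}".format(user_uuid, signature)

message = string_to_sign("GET", "/path/LSST/SensorAnalysis", "Thu, 29 Aug 2013 20:16:48 GMT")
signature = sign("example-shared-secret", message)
print(authorization_header("123e4567-e89b-12d3-a456-426614174000", signature))
```

On the server side, the same string would be rebuilt from the incoming request and the two signatures compared with `hmac.compare_digest`, which avoids leaking timing information.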
Searching

Searching can be done in a variety of ways.

In general, searching will be performed starting at a base resource URI. Most often it will be used from the datasets resource, and a user will supply a filter string.
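
Because the filter string travels inside a URL, it has to be percent-encoded. The sketch below builds a search URL against the datasets resource. The query parameter name `filter` and the helper `search_url` are assumptions made for illustration; the actual parameter name is not given here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URI = "http://srs.slac.stanford.edu/rest/datacat"

def search_url(container_path, filter_expr):
    """Build a datasets search URL for a folder or group path."""
    # Assumption: the filter is passed as a query parameter named "filter".
    query = urlencode({"filter": filter_expr})
    return "{}/datasets/{}?{}".format(BASE_URI, container_path.strip("/"), query)

url = search_url("/LSST/SensorAnalysis", "runMin gt 3248")
print(url)

# Decoding the query string recovers the original filter text.
print(parse_qs(urlsplit(url).query)["filter"][0])
```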

Expressions

A search query/filter is backed by a domain-specific language for composing expressions that search the datacatalog.
Every kind of datacatalog item (datasets, groups, and folders) can be searched with this language.
The language has type recognition and supports three basic types: Strings, Numbers, and Timestamps.

A single expression is composed of an identifier, an operator, and a value.

[IDENTIFIER] [OPERATOR] [VALUE]
datasetName eq 'myDataset'
datasetName == 'myDataset'
runMin gt 3248
createDate > d'2012-01-01T13:00:01'

To make filters easier to use in URLs, operators can be written in several ways. For example, "&&" is the same as "and", and "eq" is the same as "==".
Also, inspired by Python, a date is written as a string prefixed with the letter 'd' so the parser knows it is a date. A date string should conform to ISO 8601, but it is a little more flexible than ISO 8601 because it can be given at varying lengths: d'2012-03' is extended to mean the first day of March, and d'2013-02-03T15' means 3 PM, at the first minute and second of that hour. Time zone information is also allowed. Both single quotes and double quotes may be used to delimit strings and date strings. Numbers must never be quoted.
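
The way a shortened date literal is extended can be illustrated with a small Python sketch. It is not the datacatalog's parser: the function name `parse_date_literal` and the list of accepted lengths are assumptions, and time zone suffixes are left out for brevity.

```python
from datetime import datetime

# Assumption: these are the accepted lengths, tried from most to least specific.
_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%dT%H:%M",
    "%Y-%m-%dT%H",
    "%Y-%m-%d",
    "%Y-%m",
]

def parse_date_literal(literal):
    """Parse d'...' or d"..." and fill missing fields with their earliest value."""
    if (len(literal) < 3 or literal[0] != "d"
            or literal[1] not in ("'", '"') or literal[-1] != literal[1]):
        raise ValueError("not a date literal: {!r}".format(literal))
    text = literal[2:-1]
    for fmt in _FORMATS:
        try:
            # strptime defaults missing fields to day 1, 00:00:00.
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unsupported date string: {!r}".format(text))

print(parse_date_literal("d'2012-03'"))
print(parse_date_literal('d"2013-02-03T15"'))
```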

Expressions can be nested and combined with AND and OR:

( (datasetName EQ 'myDataset' AND runMin IN (1234, 1235, 1239, 1329)) OR (datasetName EQ 'myOldDataset' AND runMin GT 1200) )

Identifiers

It's recommended to limit your metakey names in the database to the form:

[a-zA-Z][a-zA-Z0-9_\.\-]*

However, the language lexer will recognize identifiers in these forms:

[a-zA-Z_][a-zA-Z0-9_\.\-\:]*
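
The two identifier forms can be checked with Python's `re` module. In this sketch both patterns are written with an explicit `*` repetition so that names of any length match, which is an assumption about the intended meaning; the function name `classify` and its return labels are hypothetical.

```python
import re

# Recommended form for metakey names stored in the database.
RECOMMENDED = re.compile(r"[a-zA-Z][a-zA-Z0-9_.\-]*")
# Broader form that the lexer recognizes (leading underscore and ":" allowed).
LEXER = re.compile(r"[a-zA-Z_][a-zA-Z0-9_.\-:]*")

def classify(name):
    """Return which of the two identifier forms a name satisfies."""
    if RECOMMENDED.fullmatch(name):
        return "recommended"
    if LEXER.fullmatch(name):
        return "lexer-only"
    return "invalid"

for candidate in ["runMin", "exo.runQuality", "_private", "ns:key", "9lives"]:
    print(candidate, classify(candidate))
```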

Scope

Each identifier is resolved against a hierarchy of three scopes. They are checked in order.

Table

The first scope is effectively the set of columns for the type of object (dataset, group, or folder) you are searching on. Expressions in this scope will generally be of the form:

name == "datasetName" and runMin > 3000 and site == "SLAC_XROOT"

Plugin

The second scope is the plugin scope. A plugin will usually correspond to an external table that is not in the datacatalog's core set of tables. This will usually be an experiment-specific feature. As such, plugin identifiers will generally be of the form:

[experiment].[identifier]
exo.runQuality in ('GOLDEN', 'GOOD')
exo.runQuality not in ('BAD', 'UNSET')
lsst.sensorId == 3458

The plugins themselves know how to join their data to the core tables of the datacatalog. When a plugin identifier is found, the parser asks the plugin to join the tables necessary for the current selection.

Metadata

The final scope is metadata. In the datacatalog, there are three metadata tables: Numbers, Strings, and Timestamps. The metadata key you search on must exist in one of those tables. The datacatalog will do a lookup to see which metadata tables may be applicable, and check those tables for the search.

 

Valid - searching the Metanumber table
nRun eq 4000
Valid - searching the Metanumber table
nRun eq '4000'

Priority

When an identifier is encountered, it is looked up first in the table scope, then in the plugin scope, and finally in the metadata scope. So a metadata key named exo.runQuality would conflict with the exo plugin's runQuality selector, because the plugin scope is checked first.
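
The lookup order can be sketched as follows. The contents of the three scopes are made up for illustration (only `name`, `runMin`, `site`, `exo.runQuality`, `lsst.sensorId`, and `nRun` appear in the examples above), and `resolve_scope` is a hypothetical name, not the datacatalog's actual resolver.

```python
# Hypothetical contents for each scope, based on the examples in this section.
TABLE_COLUMNS = {"name", "runMin", "site"}
PLUGIN_FIELDS = {"exo": {"runQuality"}, "lsst": {"sensorId"}}
METADATA_KEYS = {"nRun", "exo.runQuality"}

def resolve_scope(identifier):
    """Return the first scope that knows the identifier: table, plugin, then metadata."""
    if identifier in TABLE_COLUMNS:
        return "table"
    experiment, dot, field = identifier.partition(".")
    if dot and field in PLUGIN_FIELDS.get(experiment, set()):
        return "plugin"
    if identifier in METADATA_KEYS:
        return "metadata"
    raise KeyError("unknown identifier: {}".format(identifier))

print(resolve_scope("site"))
print(resolve_scope("nRun"))
# The metadata key with the same name is never reached.
print(resolve_scope("exo.runQuality"))
```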
