WARNING: THE CONCEPTS IN THIS DOCUMENT DESCRIBE THE THEORY OF THE DATACATALOG AS IT IS LIKELY TO BE, NOT NECESSARILY AS IT CURRENTLY IS.

Many of the concepts apply to the current Datacatalog, especially that in use by Fermi, but they have been generalized.

 

Why the Datacatalog?

A digression on a UNIX filesystem

The UNIX filesystem has been around a long time. The term itself, filesystem, can be a little ambiguous because it may refer to the underlying format and software used to store and read files, like ext4 or XFS, or it can refer to a specific instance with all of your data on it.

The UNIX filesystem and its hierarchy

First, some information about how a UNIX filesystem operates may help with context. In a UNIX filesystem, the index node, or inode, is an on-disk data structure which contains a filesystem object's metadata: object type (a file, pipe, or directory, for example), disk block pointers, gid, uid, ACLs, creation time, modification time, access time, and file size, among other attributes. When an inode is a directory, a list of all children of that directory, along with a hash table over those children, is stored on the inode's disk blocks. Each entry in the hash table is a directory entry, or dentry. At runtime, the operating system caches these entries for fast lookup.
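For context, most of the inode attributes described above can be inspected from user space. A minimal Python sketch using `os.stat`:

```python
import os
import stat
import tempfile

# Create a throwaway file, then read back the inode metadata the kernel keeps for it.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

st = os.stat(path)
print("inode number:", st.st_ino)
print("is regular file:", stat.S_ISREG(st.st_mode))
print("is directory:", stat.S_ISDIR(st.st_mode))
print("uid/gid:", st.st_uid, st.st_gid)
print("size (bytes):", st.st_size)
print("mtime:", st.st_mtime)

os.unlink(path)
```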

Metadata support

Some UNIX filesystems also support extended attributes similar to those implemented by the XFS filesystem. These are key:value attributes which can be stored on a file (again, everything is a file, so directories and pipes too), with names up to 256 bytes long. This functionality is often exposed to the user through the attr or xattr commands. Simple equality testing of attributes is supported by the OS, but it would be up to a user to implement meaningful comparisons on this extended metadata, such as numeric relations. Other interesting metadata which could potentially be stored as an xattr includes a MIME type, for example.
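Since the OS only offers simple equality lookups on xattrs, richer comparisons are a user-level concern. A minimal sketch of that user layer, using an in-memory dict to stand in for real xattrs (on Linux the values would be read with `os.getxattr`; the `user.`-prefixed attribute names and values here are illustrative):

```python
# Sketch: the kernel stores xattrs as opaque key:value bytes, so any numeric
# relation (e.g. "nEvents > 1000") must be implemented by the user on top.
files = {
    "run0001.dat": {"user.mime_type": "application/x-root", "user.nEvents": "1500"},
    "run0002.dat": {"user.mime_type": "application/x-root", "user.nEvents": "200"},
}

def xattr_numeric_filter(files, key, predicate):
    """Decode the stored string value and apply a numeric predicate."""
    matches = []
    for name, attrs in files.items():
        if key in attrs:
            try:
                value = float(attrs[key])
            except ValueError:
                continue  # non-numeric value: skip, a real tool might warn
            if predicate(value):
                matches.append(name)
    return sorted(matches)

big_runs = xattr_numeric_filter(files, "user.nEvents", lambda v: v > 1000)
print(big_runs)
```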

Locality

The normal use case for a UNIX filesystem is that the filesystem instance appears somewhere in your virtual filesystem, typically under a mount point like /nfs, /afs, /a, or /mnt. To mount a typical networked filesystem instance, you need the software to mount the instance and talk to it, the configuration for that instance, potentially additional software for authentication/authorization, and network access to the instance. For scientific applications, you typically also need low-latency, high-bandwidth network links.

These requirements make it nearly impossible to have a truly global, performant filesystem. In fact, we believe it is impossible without global adoption of a tightly-coupled storage solution. What kinds of tightly-coupled storage solutions exist?

There exist distributed filesystems, such as AFS, which allow a global view of a filesystem for discovering files. However, AFS is not designed for high-performance I/O, especially over very large distances. There are also filesystem-like services and proxies which can provide global access to files with reasonable performance and caching, such as xrootd or dCache. However, none of these solutions has truly global adoption, and they severely restrict the protocols with which you can access a file. Furthermore, they abstract away native access protocols which could probably just be exposed to the user in the first place.

The Datacatalog as Service

The Datacatalog is a centralized service for maintaining and discovering Datasets and their metadata, providing versioning capabilities, and registering loosely-coupled file replicas.

Tell us the Datasets you want (or how we can find them by metadata) and the way you'd like to access them, and we'll give you file locations you can read.

We believe that users should be provided a uniform way of locating and discovering files, like a filesystem, but that our users should not be restricted to a given filesystem, filesystem instance, or protocol. We intend to provide strong metadata tagging and search capabilities on the datasets that are registered. We've also added Versioning as a first class citizen.

What the Datacatalog provides for User Interaction:

  • A RESTful API: Our interface for interacting directly with the Datacatalog, the hierarchy, and metadata.
  • Command line clients and libraries: These abstract some of the basic actions which interface to the RESTful API.
    • Including Pipeline-II integration
  • A Browser User Interface: A web application for users to browse the hierarchy and modify metadata.
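As a rough illustration of the "tell us the Datasets you want" interaction, the sketch below constructs a query URL for the RESTful API. The base URL, endpoint path, and parameter names here are hypothetical assumptions, not the actual API:

```python
from urllib.parse import urlencode

# Hypothetical illustration only: the host, endpoint, and parameter names
# below are assumptions, not the Datacatalog's actual RESTful API.
base = "https://datacatalog.example.org/r/datasets"
params = {
    "path": "/EXO/raw",        # hierarchy path to search under
    "filter": "nRunId > 100",  # metadata filter expression
    "site": "SLAC",            # preferred site for returned Locations
}
url = base + "?" + urlencode(params)
print(url)
```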

The Datacatalog Hierarchy

All Datacatalog objects exist in a hierarchy.

The core object of the Datacatalog is a node in that hierarchy. For this document, every node in our Datacatalog will be called a catalog node, or cnode for short.

We have three types of cnodes: Folders, Groups, and Datasets. At their core, all three should implement functionality similar to inodes and dentries; that is, our cnodes should contain creation and modification times, ACLs, and extended attributes. Groups and Folders are similar to directories: they are designated as nodes which may contain other nodes. Datasets are equivalent to files, with one important caveat: a Dataset has one or more Version entries. The canonical version is always the most recent, or Latest Version, and all Datasets must have at least one Version entry. Furthermore, Versions have one or more Location entries where identical copies of the underlying files may be found. When the first Location entry is created, it usually serves as the canonical representation for all Location entries, which should be identical; the canonical location is called the Master Location. Metadata about the files pointed to by registered Locations, such as file size and checksum, is populated asynchronously via the Crawler application.
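The Dataset/Version/Location relationship above could be sketched with the following data model; the class and field names are illustrative, not the actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Location:
    site: str
    path: str
    is_master: bool = False
    file_size: Optional[int] = None   # populated later by the Crawler
    checksum: Optional[str] = None    # populated later by the Crawler

@dataclass
class Version:
    version_id: int
    locations: List[Location] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

@dataclass
class Dataset:
    name: str
    data_type: str
    file_format: str
    versions: List[Version] = field(default_factory=list)

    @property
    def latest(self) -> Version:
        # The canonical version is always the most recent one.
        return max(self.versions, key=lambda v: v.version_id)

ds = Dataset("events0001", "raw", "root",
             versions=[Version(1, [Location("SLAC", "/nfs/farm/events0001.root",
                                            is_master=True)])])
print(ds.latest.version_id, ds.latest.locations[0].is_master)
```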

The Crawler

The Crawler is an application which asynchronously scans registered Dataset Locations. Typically it will load file metadata (such as file size and checksum), file content metadata (such as event counts), and user-defined metadata into a Dataset.

As mentioned before, the Crawler looks for Dataset Locations which have been registered but not scanned. Its primary purpose is to retrieve file metadata from those Locations, such as file size and checksum.

Secondly, through either predefined plugins or user plugins, it may record a Dataset's primary file content metadata for run-like datasets, like event counts, runmin, and runmax. This is handled in a MIME-type sort of way, where the Crawler examines the logical Dataset's data type and file format, as defined in the parent Dataset entry. This primary data is stored contiguously with the Location entry.

And finally, those same plugins which loaded the file content metadata may also load user-defined metadata into the Dataset's Version entry's metadata.
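The Crawler's first duty, scanning a registered-but-unscanned Location for file size and checksum, could be sketched as follows; the record layout and the choice of checksum algorithm are illustrative assumptions:

```python
import hashlib
import os
import tempfile

def scan_location(location):
    """Populate file size and checksum for a Location that has not been scanned."""
    if location.get("scanned"):
        return location
    path = location["path"]
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    location.update(
        file_size=os.path.getsize(path),
        checksum=h.hexdigest(),
        scanned=True,
    )
    return location

# Register a throwaway file as a Location, then scan it.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"event data")
    path = f.name

loc = scan_location({"path": path, "scanned": False})
print(loc["file_size"], loc["checksum"][:8])
os.unlink(path)
```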

For many experiments using the Datacatalog, the way primary run-like metadata is stored imposes restrictions which do not always make sense for a given experiment or Dataset. As such, it is recommended that the user actually store run-like metadata in the Version entry.

 

The Dataset View

A Dataset View is a representation of a Dataset node, a Version, and Locations tied to that Version.

Because there are typically many filesystems, filesystem instances, protocols and replicas for accessing any given file, we need to abstract out some information about the file itself and create the logical Dataset.

A logical Dataset, at its most basic, is a unit or collection of data output from some process. The file is the physical representation of that output. Of course, every Dataset should have an underlying file, just as every file on an operating system should have some data (although there are cases where it is useful not to). If a Dataset has no Version or Location defined, we call it the null Dataset (it exists, but it's not completely useful).

We define the Dataset View to be the information conveyed to a user about a Dataset. This information, at its most basic, would include just the logical Dataset: merely the name, the declared data type and file format, and its parent. We will call this the empty Dataset view, because this view contains no information about a version or underlying files. The next view would include some information about a version. By default, most queries will return the latest version of any dataset. Similar to the empty Dataset view, we could have the empty versioned view, which would include the empty Dataset view, a version ID, and whether or not the version was the master. It may also include metadata for the version, but that's not required. Once a location is added to a version, we finally have a complete versioned dataset view. We can have a few types of complete views.

When only one location is present in our complete view, we call this the Flat view. This includes all the information needed to read a given file. Typically, the first location registered will be the master location for our given version. The Flat view which contains the master location is the Master view. Currently, when you browse the Datacatalog, you are actually presented the Master view. Often, especially for data processing at the host institution, this is the most widely used Dataset view, which is why it's the default. Typically, clients reading from the Datacatalog are only interested in knowing where the master file is located, or where a file at their preferred site is located, so they are intrinsically interested in a Flat view.

When multiple locations are desired from the Datacatalog, we can return the Full view. This view includes the Dataset node information, the version information, and all locations registered for that Dataset. This is useful in many scenarios as well.

Again, the default versioned view for any Dataset will be that of the latest version. The views mentioned, Flat, Master, and Full, may be applicable to any versioned view.
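The views described above can be sketched as follows; the dict layout, field names, and sample data are illustrative, not the actual wire format:

```python
dataset = {
    "name": "events0001", "dataType": "raw", "fileFormat": "root", "parent": "/EXO/raw",
    "versions": [{
        "versionId": 1, "metadata": {"nRunId": 101},
        "locations": [
            {"site": "SLAC", "path": "/nfs/events0001.root", "isMaster": True},
            {"site": "IN2P3", "path": "/dcache/events0001.root", "isMaster": False},
        ],
    }],
}

def empty_view(ds):
    # No version or file information: just the logical Dataset.
    return {k: ds[k] for k in ("name", "dataType", "fileFormat", "parent")}

def flat_view(ds, site=None):
    """One location only; defaults to the master location (the Master view)."""
    v = max(ds["versions"], key=lambda x: x["versionId"])  # latest by default
    locs = v["locations"]
    loc = next((l for l in locs if l["site"] == site), None) if site else None
    if loc is None:
        loc = next(l for l in locs if l["isMaster"])
    return {**empty_view(ds), "versionId": v["versionId"], "location": loc}

def full_view(ds):
    # All locations registered for the latest version.
    v = max(ds["versions"], key=lambda x: x["versionId"])
    return {**empty_view(ds), "versionId": v["versionId"], "locations": v["locations"]}

print(flat_view(dataset)["location"]["site"])
print(flat_view(dataset, "IN2P3")["location"]["site"])
print(len(full_view(dataset)["locations"]))
```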

Foreign metadata

Foreign metadata is external information (and typically externally managed) which, through a well-defined relationship, we can use to augment a Datacatalog object.

Most physics Datasets, in the context of a running experiment, are typically the result of a run. A typical physics run may produce any number of Datasets, such as environmental data, detector data, calibration data, and event data. A run will typically be defined by some first-class metadata, such as start and stop times and quality. A run may also have user-defined metadata of its own which can help with Dataset discovery. All of this metadata may be created and updated before and after a Dataset is produced.

Because of this, it wouldn't always be convenient to store all metadata about a Dataset inside the Version entry alone. If the metadata is replicated over a large number of datasets and may change, for example, it would require a significant amount of work for the user to keep all metadata entries for all Datasets up to date. In addition, the metadata entries are limited to key:value pairs without any sort of binding, and the entries themselves have limits on the amount of data you may store. The physics run is a good example of how this could be cumbersome; data becomes denormalized, and software must be written to keep the denormalized data synchronized.

The Run data itself is relational as well, and we typically have some way of relating a Dataset to a Run. This could be done via the run fields in the Location entry, for example, or it could (* should) be done via a metadata entry in our Dataset's Version. We call this relation the Foreign Metadata View.

In short: Foreign metadata support means we can temporarily augment the representation (view) of a Datacatalog object with foreign metadata, through the relationship of an attribute or metadata local to that Datacatalog object.

Augmenting a Datacatalog object with foreign metadata

Typically, we are augmenting the Dataset view with Foreign metadata.

In this example, we have Datasets in two different groups. Each Dataset has an entry in its Version metadata, nRunId. The value of that entry is the primary key of a Foreign metadata table called PhysicsRuns. For the sake of simplicity, the Datasets are presented in a table with minimal information; this is not representative of how Datasets are actually stored in the database.

Datasets (partial entries, with group name attached for this example)

datasetName  groupName  versionId  metaKey  metaValue
events0001   raw        1          nRunId   101
events0002   raw        1          nRunId   102
events0003   raw        1          nRunId   103
events0004   raw        1          nRunId   104
calib0001    calib      1          nRunId   101
calib0002    calib      1          nRunId   102
calib0003    calib      1          nRunId   103
calib0004    calib      1          nRunId   104

PhysicsRuns

runId  startDay    stopDay     quality  message   coreDump
101    2014-01-01  2014-01-05  GOOD
102    2014-01-08  2014-01-12  GOOD
103    2014-01-15  2014-01-16  BAD      ADC died  <2MB blob>
104    2014-01-23  2014-01-26  GOOD

 

Without the capability of relating via a Foreign key, we would need to replicate the data from the PhysicsRuns table into each Dataset's Version entry. So events0003 and calib0003, for example, would contain entries for quality, startDay, and stopDay. We may or may not store PhysicsRuns.message, but it would be impossible to store PhysicsRuns.coreDump, for example. In addition, those fields don't add to the Datacatalog's primary goal: Dataset discovery.

Instead, by informing the Datacatalog about Foreign metadata relations, we've taught the Datacatalog to understand a user query that may contain a metadata search like PhysicsRuns.quality = 'BAD'.

Internally, a plugin to the Datacatalog would likely perform a SQL JOIN on the table and expose the PhysicsRuns fields to the Dataset view for filtering.
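Using the example tables above, such a plugin's JOIN can be sketched with an in-memory SQLite database. The table and column names mirror the example, not the actual schema, and the coreDump blob column is omitted for brevity:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE DatasetVersionMeta (datasetName TEXT, groupName TEXT,
                                 versionId INTEGER, metaKey TEXT, metaValue INTEGER);
CREATE TABLE PhysicsRuns (runId INTEGER PRIMARY KEY, startDay TEXT, stopDay TEXT,
                          quality TEXT, message TEXT);
""")
db.executemany("INSERT INTO DatasetVersionMeta VALUES (?,?,?,?,?)", [
    ("events0001", "raw", 1, "nRunId", 101), ("events0002", "raw", 1, "nRunId", 102),
    ("events0003", "raw", 1, "nRunId", 103), ("events0004", "raw", 1, "nRunId", 104),
    ("calib0001", "calib", 1, "nRunId", 101), ("calib0002", "calib", 1, "nRunId", 102),
    ("calib0003", "calib", 1, "nRunId", 103), ("calib0004", "calib", 1, "nRunId", 104),
])
db.executemany("INSERT INTO PhysicsRuns VALUES (?,?,?,?,?)", [
    (101, "2014-01-01", "2014-01-05", "GOOD", None),
    (102, "2014-01-08", "2014-01-12", "GOOD", None),
    (103, "2014-01-15", "2014-01-16", "BAD", "ADC died"),
    (104, "2014-01-23", "2014-01-26", "GOOD", None),
])

# The user query "PhysicsRuns.quality = 'BAD'" becomes a JOIN on nRunId:
rows = db.execute("""
    SELECT d.datasetName FROM DatasetVersionMeta d
    JOIN PhysicsRuns r ON d.metaValue = r.runId
    WHERE d.metaKey = 'nRunId' AND r.quality = 'BAD'
    ORDER BY d.datasetName
""").fetchall()
names = [name for (name,) in rows]
print(names)
```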

Plugins

Historically, the idea of foreign metadata has been integrated into the Datacatalog directly, like the RUN table for Fermi. The fields available for searching, like RunQuality and RunStatus, were dumped directly into a generic environment for dataset searching. This means a user historically may have been able to query the Datacatalog with a filter like `RunMin < 240000000 && RunQuality != 'GOOD'`. Unfortunately, this poses problems in a multi-tenant Datacatalog, where EXO and LSST each have their own RunQuality fields. It would mean there need to be two experiment-specific Datacatalog search implementations, with the choice of implementation decided by some external means. In the context of a RESTful API, this would mean parallel deployments of our RESTful web applications. Instead of this, going forward, I feel it's best to go with the plugin model to emulate some sort of namespacing. The Datacatalog would be aware of all experiment-specific plugins, for example, so you could have EXO.RunQuality and LSST.RunQuality.

Caveats

Performance. Since Foreign Metadata Views are inherently dynamic, they are also slow. Denormalizing the data and duplicating common fields that are searched, especially if those fields aren't updated frequently, is recommended. This eliminates additional joins. When searching, the user should do their best to filter as much as possible using first-class metadata and Version metadata in datasets. Recursive searching using Foreign Metadata is also not advised.

Restrictive. Currently, basic Foreign Metadata Views are best implemented on the Datacatalog's host database. Through plugins, it is possible, but not trivial, to get around this limitation.

 

Ways the Datacatalog is un/like other things

  1. Like a filesystem, the Datacatalog facilitates file discovery.
  2. Unlike a filesystem, however, the Datacatalog does not facilitate I/O.
    1. The Datacatalog does not define file access protocols. 
    2. External modules interfacing with the Datacatalog may, however, facilitate I/O.
  3. Like Subversion, the Datacatalog facilitates file versioning.
  4. Like a Distributed WAN filesystem, the Datacatalog can provide a user a unified gateway to discovering files.
  5. Unlike a Distributed WAN filesystem, the Datacatalog has no facilities for automated replication.



 

 

 

Beyond User scope

Metadata searching and foreign databases

It's possible to implement foreign-metadata searches in other ways, but it involves writing custom plugins to do so. Some database systems support database links, which appear to the database like a local table; this may be useful, but would likely be extremely slow when searching a large number of datasets.

See also: 
https://wiki.postgresql.org/wiki/Foreign_data_wrappers 
http://docs.oracle.com/cd/B28359_01/server.111/b28310/ds_admin002.htm 
http://www.pythian.com/blog/how-to-access-mysql-from-oracle-with-odbc-and-sql/

Potential future solutions for improving foreign metadata

Use Presto to present a uniform view of external databases, similar to Postgres Foreign data wrappers.

http://prestodb.io/

Statistics

For any given container, maintain some sort of statistics/histogram which can be used to determine whether the datasets it contains may hold any given value. This could quickly allow elimination of a given folder/group and prevent costly joins.

One for native metadata - take a hash of all possible keynames, and bin all possible value ranges, make a 2D histogram.

One for foreign metadata - take a hash of all possible keynames, and bin all possible value ranges, make a 2D histogram.
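A minimal sketch of such a per-container 2D histogram, acting as a Bloom-filter-like pruning test; the bucket counts, value range, and binning scheme are arbitrary choices for illustration:

```python
# Hash key names into buckets, bin values into ranges, and count occurrences.
# A zero cell proves no dataset in the container can match, so it can be skipped.
KEY_BUCKETS, VALUE_BINS = 64, 32
VALUE_MIN, VALUE_MAX = 0.0, 1000.0

def _cell(key, value):
    kb = hash(key) % KEY_BUCKETS
    clamped = min(max(value, VALUE_MIN), VALUE_MAX - 1e-9)
    vb = int((clamped - VALUE_MIN) / (VALUE_MAX - VALUE_MIN) * VALUE_BINS)
    return kb, vb

class ContainerStats:
    def __init__(self):
        self.hist = [[0] * VALUE_BINS for _ in range(KEY_BUCKETS)]

    def add(self, key, value):
        kb, vb = _cell(key, value)
        self.hist[kb][vb] += 1

    def may_contain(self, key, value):
        # False means the container (and its joins) can be eliminated outright;
        # True only means a match is possible, like a Bloom filter.
        kb, vb = _cell(key, value)
        return self.hist[kb][vb] > 0

stats = ContainerStats()
for run_id in (101, 102, 103, 104):
    stats.add("nRunId", run_id)

print(stats.may_contain("nRunId", 103))  # worth descending into this container
print(stats.may_contain("nRunId", 900))  # container can be eliminated
```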

Similarities of the Datacatalog with the HDFS NameNode in Hadoop

HDFS is a distributed file system. In HDFS, data is distributed at the block level. The NameNode facilitates lookups and manages metadata, similar to how the Datacatalog might. More info:

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html

High-Availability for a NameNode is managed by using a ZooKeeper service, utilizing it for locking and leader election, along with an HDFS journaling cluster (called the Quorum Journal Manager), which only the leader may write to, but all failover NameNode servers may read from. It's more than acceptable for a 3- or 5-node Quorum Journal cluster to also serve as the single-purpose ZooKeeper cluster for the NameNode.

More Reading:

  1.  Filesystem path walking as implemented in the Linux VFS:

    1. http://lwn.net/Articles/419811/
    2. https://www.kernel.org/doc/Documentation/filesystems/path-lookup.txt

 

Optimizations available if a RESTful interface has exclusive access to the Datacatalog tables

We could potentially use commons-vfs and some aggressive caching in the web applications to reduce database load.

 http://commons.apache.org/proper/commons-vfs/
