Data Universe

"Data Universe" concept

Preliminary ideas, draft 0.1

We define the data universe as a distributed, indexed cache of pre-extracted data with object wrappers. A similar concept may already exist within GLAST, but I am not aware of one.

Examples can be found elsewhere (Google and the other references below). One simple example is creating files of data extracted from a database, then distributing and indexing these files on several servers in different geographic locations. Another example would be raw data files or event set files which are not in a database, but need to be tracked in a distributed file system.

The object wrappers load file content into applications, and applications interact with files only through these wrappers. Object wrappers are shared by all layers and applications, while the files themselves remain under control of the operating system(s) (csv files, comma separated values, are a simple example of such objects). One idea would be to use an asynchronous messaging system to make these wrapper objects more aware of the other objects in the system, although this could be expensive.
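As a rough illustration of the wrapper idea, here is a minimal Python sketch for the csv case. The class name CsvWrapper and its methods are hypothetical, chosen only for this example; nothing here is an existing GLAST interface. The point is that the application sees records and a guid, never the file handling.

```python
import csv
import uuid
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CsvWrapper:
    """Hypothetical object wrapper: applications use this, not the raw file."""
    path: Path
    # Assumption: a random UUID stands in for the enterprise-wide unique name.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def rows(self):
        """Yield each record as a dict keyed by the csv header line."""
        with self.path.open(newline="") as handle:
            yield from csv.DictReader(handle)

    def column(self, name):
        """Return all values of one column as a list of strings."""
        return [row[name] for row in self.rows()]
```

An application would call something like CsvWrapper(Path("run42.csv")).column("energy") without caring how or where the file is stored; the file name here is made up.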

A couple of characteristics of such a 'data universe':

It is not a database, but a universal persistence/marshalling mechanism (files are persistent; marshalling here means translating the universal nomenclature into a specific file name and location). It could provide input to loaders, hold results of extracts, or be used for inter-process communication.
It contains discrete pieces of raw or extracted data (files). These pieces are atomic units with object wrappers (components) and enterprise-wide unique names, or guids (guid - globally unique identifier). This allows huge files to be partitioned if necessary.
It maintains a central index for determining the set of unique names based on keyed attributes (this could be a single index or a distributed one). We would hold information about file guids and their locations on the system in a database.
The caching distribution layer maps guids to file locations.
It permits client side joins (possibly feeding back to cache).
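The index, guid-to-location mapping, and client-side join described above could be sketched roughly as follows. This is an in-memory toy under stated assumptions: the names UniverseIndex, register, find, locate and client_side_join are hypothetical, and a real version would keep the index in a database as noted above.

```python
import uuid
from collections import defaultdict


class UniverseIndex:
    """Hypothetical central index: keyed attributes -> guids -> locations."""

    def __init__(self):
        self._attributes = {}                # guid -> attribute dict
        self._locations = defaultdict(list)  # guid -> list of replica locations

    def register(self, attributes, location, guid=None):
        """Record a file (or one more replica of it) and return its guid."""
        guid = guid or str(uuid.uuid4())
        self._attributes.setdefault(guid, dict(attributes))
        self._locations[guid].append(location)
        return guid

    def find(self, **criteria):
        """Return the guids whose attributes match every given key/value."""
        return [
            guid for guid, attrs in self._attributes.items()
            if all(attrs.get(k) == v for k, v in criteria.items())
        ]

    def locate(self, guid):
        """Return all known locations for a guid (empty list if unknown)."""
        return list(self._locations.get(guid, []))


def client_side_join(left_rows, right_rows, key):
    """Hash join of two iterables of dicts on a shared key, done by the client."""
    lookup = defaultdict(list)
    for row in right_rows:
        lookup[row[key]].append(row)
    return [
        {**left, **right}
        for left in left_rows
        for right in lookup.get(left[key], [])
    ]
```

A client would first call find() with keyed attributes to get guids, then locate() to pick a replica, load the rows through a wrapper, and join two such row sets locally. Whether the joined result is fed back into the cache is left open here, as in the text.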

SWOTT (Strengths, Weaknesses, Opportunities, Threats and Trends):

Strengths:

Abstraction. Achieved through object wrappers so that multiple applications can use data in the same manner.
Performance. Achieved through caching and indexing.
Reusability. A dataset can be used by more than one application.
Multi-tier architecture. We have layered roles (acquisition, backbone, client).
Redundancy. Distributed caching can be redundant, which is ideal for server farms like our Linux farms. Perhaps we can use AFS to achieve distribution and location transparency.
Data locality. The user does not need to know where the files are; the data appear to be local.
Integrated view. All data can be treated uniformly (whether they are in a file or in a database).

Weaknesses:

Sizeable project
No applications currently available for this model

Opportunities: (data location transparency vs data source transparency?)

A uniform access mechanism to databases and files can be published, and others can develop applications to use the data.
Provides a consistent schema for data location.

Threats:

Dilutes other efforts.

Trends (similar, related, and possibly useful technology):

Google (caching) http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf
Enterprise search (indexing) : various
Semantic Web (typing) http://www.w3.org/RDF
Distributed Index: http://fxagents.stanford.edu/~daishi/papersgisp_ieeep2p2002.pdf
