Dashboard > Extremely Large Databases > ... > Ideas > Common Requirements
Common Requirements
Added by Jacek Becla, last edited by Jacek Becla on Feb 19, 2008  (view change) show comment
Labels: 
(None)


This page attempts to define a potential list of common requirements coming from big scientific projects. Please keep in mind this is a working draft, and the list has not been endorsed by any scientific community or a scientific project.

We are planning to work in the near future with representatives from different scientific communities to help us refine this list. Those working on large-scale scientific data management issues are encouraged to edit this page or send us comments and suggestions.

Aggregate Statistics on Trillions of Rows

This is a fundamental requirement of analyzing very large data sets. Performing such aggregations rapidly almost certainly requires parallel query execution and typically requires physical partitioning of the data. Often multidimensional aggregations will need to be performed, further adding to the complexity.

"Needle in a Haystack" Searches for Anomalies in Huge Tables

This is another fundamental requirement, in which varying combinations of attributes are searched for unusual values. This may involve successively narrower ranges of attribute values or successive additions of attribute restrictions. This type of query is executed frequently by many applications, both scientific and industrial.

Efficient Joins on Huge Tables

Joining two (or more) billion-row or trillion-row tables on a primary key is typically an expensive operation in an RDBMS. It can be even more expensive if the two tables are partitioned differently. Still worse are joins not involving the primary key. Executing such joins efficiently, or finding ways of eliminating them, will be necessary.

Flexible Schemas

In the course of an analysis, derived data and annotations are often generated. New transformations of existing data may also be introduced. All of these data elements are most naturally thought of as additional columns appended to existing tables. Row-oriented RDBMSes often have an extremely high cost for adding columns to tables with large numbers of rows. The alternative of adding new tables joined to the original one is not much better, as it not only emphasizes the previous requirement but also increases the complexity of queries.

Efficient Support of Complex Types

Relational databases should support or efficiently simulate complex types such as vectors, sets and multidimensional arrays. Current mechanisms using additional tables require often-slow joins on huge data sets, and queries on such tables are not always expressible in a natural way.

Lineage and Provenance

As compute cycles become cheaper, it is often the case that reconstructing intermediate data is preferred over storing it. In order to perform this reconstruction, accurate recording and management of the lineage and provenance of each data item is required. With very large data sets and frequently-changing data processing pipelines, this lineage information may itself be very large. Automated determination and use of lineage information is a current research project (TRIO).

Support for Uncertainty

Scientists frequently deal with uncertain data. One example would be an observed astronomical object, which, due to poor visibility, can not be classified with 100% probability. The database should allow the classification probability to be stored with the data, and it should automatically calculate the accuracy of an aggregate result based on the probabilities of the input data. Again, this issue is already on the drawing boards of some researchers (TRIO).

Spatial/Spherical Geometry Support

Some scientific communities and some industrial users require indexes, functions and algebra that support spatial and spherical geometry. Some vendors already support this; an example would be the Oracle Spatial Toolkit.

Procedural Language

Many complex queries are difficult to express in pure SQL, if it is possible at all. User-defined functions and stored procedures may not be sufficient to overcome this barrier. The extreme version of this requirement is the HEP community's use of custom C++ code to query the data. While this gives ultimate flexibility, it also maximizes the complexity of the user interface and minimizes the opportunities for automatic optimization. A middle ground between pure SQL and pure C++ is needed. Some industrial users have already extended their systems by introducing procedural languages (Sawzall, Pig Latin).

Support for Long Running Queries

Many complex queries run for long time, measured in days or even tens of days, this is true both in scientific and industrial applications (eg BaBar, Google). A DBMS should provide ability to pause and restart queries without losing work.

Fault Tolerance

Large setups consist of hundreds or thousand of database servers. A single node failure should not take down the entire system. The DBMS as well as running queries must be able to survive node failure without having to start over.

Partial Results

For long running queries it would be useful for a user to be able to observe results as they are computed, in order to make decisions whether to continue query execution. This might be anything from estimating query progress, through sampling to intermediate computations, to actual final results.

System Efficiency

With more and more servers/nodes getting combined into ever growing extra large clusters (100s-1000s of boxes), efficiency is tremendously important. Ideally it would be good to achieve linear scalability.

Powered by Atlassian Confluence, the Enterprise Wiki. (Version: 2.5.8 Build:#814 Oct 02, 2007) - Bug/feature request - Contact Administrators