The LSST project, which will need to query upwards of 10 petabytes of astronomical data, is considering an architecture that uses the popular Map/Reduce distributed processing paradigm to control a shared-nothing cluster of individual relational database nodes. These nodes may in turn use an underlying distributed filesystem to provide data migration and replication capabilities.
Goals:
- Attain scalability and fault tolerance by combining a shared-nothing architecture with Map/Reduce.
- Reduce the amount of required coding by using the database as a scan-and-combine execution engine.
- Maximize familiarity for end users by providing SQL access. A top-level SQL parser would still be necessary to convert the input query into the queries to be executed by each database node.
We are considering using Hadoop as the Map/Reduce implementation with MySQL as the underlying database engine.
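The scan-and-combine idea can be illustrated with a minimal sketch. It is not the LSST design: in-memory SQLite databases stand in for the MySQL nodes, a plain Python loop stands in for Hadoop, and the `objects` table, its `id` and `mag` columns, and all function names are hypothetical. The per-node query is written by hand here, which is the job the top-level SQL parser would do. The user query is assumed to be `SELECT AVG(mag) FROM objects WHERE mag < 20`; because an average of averages is not correct across unevenly sized partitions, each node returns a sum and a count instead.

```python
import sqlite3


def make_node(rows):
    """Create one in-memory database standing in for a shared-nothing node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (id INTEGER, mag REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?)", rows)
    return conn


# Per-node query, rewritten by hand from the (assumed) user query
#   SELECT AVG(mag) FROM objects WHERE mag < 20
# Each node returns a partial SUM and COUNT so the results can be combined.
NODE_QUERY = "SELECT SUM(mag), COUNT(*) FROM objects WHERE mag < ?"


def map_phase(nodes, limit):
    """Map step: every node scans only its own partition."""
    return [node.execute(NODE_QUERY, (limit,)).fetchone() for node in nodes]


def reduce_phase(partials):
    """Reduce step: combine the partial sums and counts into one average."""
    # SUM over zero matching rows is NULL (None in Python), so skip those.
    total = sum(s for s, _ in partials if s is not None)
    count = sum(n for _, n in partials)
    return total / count if count else None


# Three hypothetical partitions of the catalog.
nodes = [
    make_node([(1, 18.0), (2, 22.0)]),
    make_node([(3, 19.0), (4, 17.0)]),
    make_node([(5, 25.0)]),
]

partials = map_phase(nodes, 20.0)
result = reduce_phase(partials)
print(partials, result)
```

In this sketch the database does the scanning and filtering on each node, and the only custom code is the small combine step, which is the division of labor the goals above aim for.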