Luiza has set up a small project in the UFRJ Reference center to provide big data analysis/mining of PingER multidimensional data
Luiza has proposed three approaches:
- Conventional. Utilization of Pentaho environment to handle big multidimensional data, which enables utilization of enhanced user interfaces.
- Linked Data. Benchmarking of more sophisticated Triple Stores than the one we use today at PingER LOD (Sesame). Preferably, we should analyze parallel and distributed solutions. CumulusRDF is an example.
- Renan is investigating an alternative to Hadoop, which utilizes a Scientific Workflow Management System and makes use of Map/Reduce paradigm to help both querying and provenance of the Linked Data (RDF) data.
- Ibrahim is investigating an approach that utilizes Hadoop Map/Reduce in a Key/Value store with PingER data in RDF.
- Utilization of Greenplum (http://en.wikipedia.org/wiki/Greenplum). this is an intensive high performance database from EMC with many features such as caching. It is partly from the EMC acquisition of Pivotal. There is also a DBMS called Grindplan that explores lots of features using Pivotal.
Les will make available via FTP examples of PingER data. There are two types:
- Raw data as gathered daily from all the monitoring hosts. This data is ie measured at 30 minute intervals and is quite dirty.
- Analyzed data by metric. This has been cleaned up. Les recomends UFRJ uses the cleaned up data.
The instructions for the data will also be sent to Luiza