1. Introduction

PingER has a huge amount of data and, until the conclusion of this project, the easiest way to retrieve it is through Pingtable [ref]. Pingtable provides a friendly web interface that retrieves PingER raw data (millions of text files) and loads it into a human-readable HTML page. However, this is not a web standard, and crossing PingER data to generate very specific information can be impossible, or extremely difficult, with the existing retrieval method. This project aims to provide a standard semantic web format for the data retrievable through Pingtable.

The Semantic Web [ref] is a W3C recommendation [ref] and is very useful for publishing data on the web. This project does so by storing PingER data in a Linked Open Data standard format for data interchange on the Web. More specifically, the project converts PingER data into a graph database in the Resource Description Framework (RDF) format [ref]. This graph database is very flexible; its schema can evolve efficiently over time; and all of the data can be retrieved using SPARQL queries [ref].

The advantage of publishing data in this format is that, since it is a W3C standard recommendation, there is a large community familiar with it and doing research on it, so it should be very comfortable for those people to retrieve PingER data. If someone interested in the data has never heard of RDF, it is very simple to learn what is needed to use the project. In addition, we are publishing the data in a totally open format, so anyone can use it. More importantly, since it is open, someone may come up with a never-imagined, interesting use of the data.

Moreover, our data is being linked to many other existing datasets, such as DBpedia (a graph database for Wikipedia) [ref], GeoNames (geographic information), and Freebase [ref]. The data can also easily be linked to any other existing database published in this format. The number of initiatives publishing data in this format has been growing over time, so an even better and totally different use for this project may appear in the near future [ref]. The graph database is, of course, internally linked with its own PingER data, so we can create very specific, rich, interesting queries that retrieve data that could never be retrieved (or only with extreme difficulty) without the project.

Finally, the existing APIs for handling RDF provide well-known publishing formats such as JSON [ref], CSV [ref], and XML [ref]. We can conveniently take the results of a query and feed them into visualization libraries to produce very interesting visualizations of the data.

2. Ontology

a. Goal

To define the vocabulary used by PingER, as well as its terms, concepts, taxonomy, and their relations to each other.

b. Progress

We are basing our ontology on the one proposed by the MOMENT project [1], described in the article _Measurement Ontology for IP traffic (MOI); IP traffic measurement ontologies architecture_ by the European Telecommunications Standards Institute (ETSI) [2].

More terms and relations were added to and edited in the base ontology. More specifically, metrics were added (Directivity, Conditional Loss Probability, Duplicate Packets, Inter-Packet Delay Variation, Out-of-Order Packets, TCP Throughput, Ping Unpredictability, Ping Unreachability, and Zero Packet Loss Frequency), as were statistical measurements (Inter-quartile Range, Median, and the 90th, 95th, and 75th percentiles). Information about the nodes was also added, based on the Confluence specifications [3] (Site Name, Nick Name, Full Name, Contact for the host, Comments, Trace Server, App User, Ping Size, Ping Server, URL, Project Type, and GMT).

Other existing ontologies are also being linked to our ontology:

  • Freebase [ref], a large open graph database.
  • DBpedia [6], both the ontology and its resources, will be used to provide more information about any geographic location, or anything else in Wikipedia that can be connected to PingER, so it should be possible to make very specific queries.

c. To do

After building very complex _mashups_ [ref] with the ontology, we need to verify whether the ontology needs any adjustments.

3. RDF Repository

a. Goal

To establish a good environment for the RDF repository.

b. Progress

We analyzed the existing technologies that make this possible. There are well-known triple stores such as Jena, Sesame, and Virtuoso [7]. According to [7], Virtuoso struggles to load large datasets (>1M triples). Hence, we decided to try Jena SDB (with MySQL) first and then Sesame Native.

i. Jena SDB results

After setting up Jena SDB 1.3.6, with its native Java API for storing and querying over a MySQL relational database [ref], [ref], we were able to load a reasonably large amount of data for testing.

More specifically, the data for all nodes and all yearly measurement data were loaded. However, querying the dataset was very disappointing: some very simple queries, such as listing the value of a given metric in a given year for a given pair of sites, took 7 minutes to run, even though we had fewer than 100K triples.
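
For reference, a minimal sketch of the setup we tested is shown below, together with the kind of "metric value for a site pair in a given year" query described above. The JDBC URL, the credentials, and the pinger vocabulary are hypothetical placeholders, not the project's actual names.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.query.ResultSetFormatter;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.sdb.SDBFactory;
    import com.hp.hpl.jena.sdb.Store;
    import com.hp.hpl.jena.sdb.StoreDesc;
    import com.hp.hpl.jena.sdb.sql.JDBC;
    import com.hp.hpl.jena.sdb.sql.SDBConnection;
    import com.hp.hpl.jena.sdb.store.DatabaseType;
    import com.hp.hpl.jena.sdb.store.LayoutType;

    public class JenaSdbTest {
        public static void main(String[] args) {
            // Connect an SDB store backed by MySQL (hypothetical URL and credentials).
            JDBC.loadDriverMySQL();
            StoreDesc desc = new StoreDesc(LayoutType.LayoutTripleNodesHash, DatabaseType.MySQL);
            SDBConnection conn = new SDBConnection("jdbc:mysql://localhost/pinger_rdf", "user", "password");
            Store store = SDBFactory.connectStore(conn, desc);
            Model model = SDBFactory.connectDefaultModel(store);

            // The kind of simple query that took ~7 minutes: the value of one metric,
            // for one pair of sites, in one year (the pinger: namespace is a placeholder).
            String query =
                "PREFIX pinger: <http://example.org/pinger/ontology#> " +
                "SELECT ?value WHERE { " +
                "  ?m a pinger:TCPThroughput ; " +
                "     pinger:sourceSite pinger:siteA ; " +
                "     pinger:destinationSite pinger:siteB ; " +
                "     pinger:year \"2012\" ; " +
                "     pinger:value ?value . }";
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            try {
                ResultSet results = qe.execSelect();
                ResultSetFormatter.out(System.out, results);
            } finally {
                qe.close();
            }
            store.close();
        }
    }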

Since we will have far more than 100K triples, and 7 minutes is far too long for such a simple query, we decided to change the RDF repository. In addition, the results of articles [7] and [8] show that there are better alternatives. Jena SDB may work very well for smaller datasets, but that is not PingER's case.

ii. Sesame Native results

Based on the results of [7] and [8], we decided to try OpenRDF Sesame Native 2.7.2. This solution requires a Java web server as an interface between the loading program and the RDF repository itself; we are using the latest Tomcat version, Tomcat 7 [ref]. The physical-layer implementation of the repository allows the indexes to be customized, which is said to significantly enhance query-execution performance [ref].

We are using the indexes _spoc, sopc, psoc, posc, opsc, ospc_. The order of the letters [s]ubject, [p]redicate, and [o]bject in an index defines the search pattern it optimizes. For example, the index _spoc_ optimizes a search in which the subject is the first known field of the statement (a statement is composed of a subject, a predicate, and an object). More concretely, a query that takes advantage of the _posc_ index is: list all instances of metrics (the subject being searched for) that are of "type" "TCP Throughput"; here "type" (the predicate) and "TCP Throughput" (the object) are given, and we are looking for the metric (the subject). The [c]ontext is a fourth field in "triple" statements and is used for named graphs in Sesame; our solution does not use contexts, though.
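
A minimal sketch of this configuration and of the index-friendly query just described, assuming a hypothetical data directory and pinger namespace:

    import java.io.File;
    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;
    import org.openrdf.repository.sail.SailRepository;
    import org.openrdf.sail.nativerdf.NativeStore;

    public class SesameNativeTest {
        public static void main(String[] args) throws Exception {
            // Create a Sesame Native store with the six custom indexes described above.
            File dataDir = new File("/var/pinger/sesame");  // hypothetical data directory
            String indexes = "spoc,sopc,psoc,posc,opsc,ospc";
            Repository repo = new SailRepository(new NativeStore(dataDir, indexes));
            repo.initialize();

            // A query served by the posc index: the predicate (rdf:type) and the object
            // (pinger:TCPThroughput) are bound, and the subject is searched for.
            // The pinger: namespace is a hypothetical placeholder.
            String query =
                "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> " +
                "PREFIX pinger: <http://example.org/pinger/ontology#> " +
                "SELECT ?metric WHERE { ?metric rdf:type pinger:TCPThroughput . }";

            RepositoryConnection con = repo.getConnection();
            try {
                TupleQueryResult result =
                    con.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate();
                while (result.hasNext()) {
                    System.out.println(result.next().getValue("metric"));
                }
                result.close();
            } finally {
                con.close();
            }
            repo.shutDown();
        }
    }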

However, multiple indexes increase the disk space used (to store the indexes), and loading the data also takes longer because every index must be maintained.

Note: [7] points out that the largest dataset known to have been loaded into Sesame Native has only 50M triples. Beyond that scale, there is no guarantee that the performance will remain satisfactory.

Finally, exactly the same query that took 7 minutes to run in solution (i), Jena SDB, took less than 5 seconds in solution (ii), OpenRDF Sesame Native. The test circumstances were identical in both environments, i.e., the same computer, the same dataset, and the same query.

Therefore, we decided to migrate the project to OpenRDF Sesame Native as its RDF repository.

c. To do

Run very complex queries to test the performance.

4. Accessing the RDF Repository

a. Goal

To establish an easy way to access the RDF data.

b. Progress

We are using the Tomcat Java web server to host a SPARQL endpoint [ref]. The HTML page has a text area in which the user can write SPARQL 1.1 queries [ref] against the RDF repository. By default, the results are shown as HTML tables on a JSP page; however, a combo box will let the user choose the format in which the results are shown. The available formats will be triples in CSV, JSON, and RDF/XML.
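
A minimal sketch of how the endpoint can stream results in the chosen format, using Sesame's result-writer classes (the format parameter values and the helper's name are hypothetical, not the project's actual code):

    import java.io.OutputStream;
    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQuery;
    import org.openrdf.query.resultio.TupleQueryResultWriter;
    import org.openrdf.query.resultio.sparqljson.SPARQLResultsJSONWriter;
    import org.openrdf.query.resultio.sparqlxml.SPARQLResultsXMLWriter;
    import org.openrdf.query.resultio.text.csv.SPARQLResultsCSVWriter;
    import org.openrdf.repository.RepositoryConnection;

    public class EndpointHelper {
        /** Evaluates a SPARQL query and streams the results in the requested format. */
        public static void runQuery(RepositoryConnection con, String sparql,
                                    String format, OutputStream out) throws Exception {
            TupleQueryResultWriter writer;
            if ("json".equals(format)) {
                writer = new SPARQLResultsJSONWriter(out);
            } else if ("csv".equals(format)) {
                writer = new SPARQLResultsCSVWriter(out);
            } else {
                writer = new SPARQLResultsXMLWriter(out);  // default: XML results
            }
            TupleQuery query = con.prepareTupleQuery(QueryLanguage.SPARQL, sparql);
            query.evaluate(writer);  // the writer serializes each result row as it arrives
        }
    }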

c. To do

Use CSS and JavaScript to make it prettier.

5. Loading the RDF Repository

a. Goal

Generate RDF data using external datasets and PingER data.

b. Progress

The process of generating RDF data and populating the RDF repository is divided into the following subsections:

i. Set up the prefixes

Following the RDF standard, every resource is uniquely identified by a URI [ref]. In order to write less and to organize the statements better, it is common to use namespaces (prefixes) instead of writing absolute URIs. For example, it is common to use the namespace *rdfs* for the W3C RDF Schema (http://www.w3.org/TR/rdf-schema/). A minimal sketch of this step appears after the list below.

  • Frequency: If the repository is totally empty, i.e., it has just been created, the program should first insert the prefixes into the database. In other words, this should be done only once.
  • Time to load: Less than 1 second.
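
The sketch below inserts the prefixes through Sesame's RepositoryConnection; the pinger namespace URI is a hypothetical placeholder, not the project's actual URI.

    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;

    public class PrefixLoader {
        /** Inserts the namespace prefixes once, right after the repository is created. */
        public static void loadPrefixes(Repository repo) throws Exception {
            RepositoryConnection con = repo.getConnection();
            try {
                con.setNamespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
                con.setNamespace("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
                con.setNamespace("pinger", "http://example.org/pinger/ontology#"); // placeholder
            } finally {
                con.close();
            }
        }
    }
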
ii. Instantiate continents

The continents are instantiated internally, in the program's memory, so no HTTP GET is needed [ref].

  • Frequency: Only once.
  • Time to load: Less than 1 second.
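
A minimal sketch of this step, again with a hypothetical namespace; the hard-coded continent list below is illustrative and may differ from the one the program actually uses.

    import org.openrdf.model.URI;
    import org.openrdf.model.ValueFactory;
    import org.openrdf.model.vocabulary.RDF;
    import org.openrdf.model.vocabulary.RDFS;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;

    public class ContinentLoader {
        private static final String PINGER = "http://example.org/pinger/resource/"; // placeholder
        private static final String[] CONTINENTS = {
            "Africa", "Antarctica", "Asia", "Europe",
            "North America", "Oceania", "South America"
        };

        /** Instantiates the continents from the in-memory list; no HTTP GET involved. */
        public static void loadContinents(Repository repo) throws Exception {
            RepositoryConnection con = repo.getConnection();
            try {
                ValueFactory vf = con.getValueFactory();
                URI continentClass = vf.createURI(PINGER + "Continent");
                for (String name : CONTINENTS) {
                    URI continent = vf.createURI(PINGER + name.replace(" ", "_"));
                    con.add(continent, RDF.TYPE, continentClass);
                    con.add(continent, RDFS.LABEL, vf.createLiteral(name));
                }
            } finally {
                con.close();
            }
        }
    }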

iii. Instantiate countries

The program performs an HTTP GET against the GeoNames API [ref] to retrieve a JSON document with data for all countries. Then, for each entry in the JSON, the program instantiates a country in RDF and loads it into the repository.
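
A minimal sketch of this step, assuming GeoNames' countryInfoJSON service and its documented field names; the username and the pinger namespace are placeholders, not the project's actual values.

    import java.io.InputStream;
    import java.net.URL;
    import java.util.Scanner;
    import org.json.JSONArray;
    import org.json.JSONObject;
    import org.openrdf.model.URI;
    import org.openrdf.model.ValueFactory;
    import org.openrdf.model.vocabulary.RDF;
    import org.openrdf.model.vocabulary.RDFS;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;

    public class CountryLoader {
        private static final String PINGER = "http://example.org/pinger/resource/"; // placeholder

        /** Fetches all countries from the GeoNames API and instantiates them in RDF. */
        public static void loadCountries(Repository repo) throws Exception {
            // One HTTP GET for the whole country list ("demo" is a placeholder username).
            URL url = new URL("http://api.geonames.org/countryInfoJSON?username=demo");
            InputStream in = url.openStream();
            String body = new Scanner(in, "UTF-8").useDelimiter("\\A").next();
            in.close();

            JSONArray countries = new JSONObject(body).getJSONArray("geonames");
            RepositoryConnection con = repo.getConnection();
            try {
                ValueFactory vf = con.getValueFactory();
                URI countryClass = vf.createURI(PINGER + "Country");
                for (int i = 0; i < countries.length(); i++) {
                    JSONObject c = countries.getJSONObject(i);
                    URI country = vf.createURI(PINGER + c.getString("countryCode"));
                    con.add(country, RDF.TYPE, countryClass);
                    con.add(country, RDFS.LABEL, vf.createLiteral(c.getString("countryName")));
                }
            } finally {
                con.close();
            }
        }
    }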

...

[3] PingER Node Details. Retrieved from https://confluence.slac.stanford.edu/display/IEPM/PingER+NODEDETAILS on June 5, 2013.

[4] Food and Agriculture Organization of the United Nations Ontology. Retrieved from http://www.fao.org/countryprofiles/geoinfo/geopolitical/ on June 5, 2013.

...