
1. Introduction

PingER has a huge amount of data and, until the conclusion of this project, the easiest way to retrieve the data is through Pingtable [ref]. Pingtable provides a friendly web interface that retrieves PingER raw data (millions of text files) and loads it into a human-readable HTML page. However, this is not a web standard, and crossing PingER data to generate very specific information may be impossible, or extremely difficult, with the existing retrieval method. This project aims to provide a standard semantic web format for the data retrievable in Pingtable.

The Semantic Web [ref] is a W3C recommendation [ref] and is very useful for publishing data on the web. This project does so by storing PingER data in a Linked Open Data standard format for data interchange on the Web. More specifically, the project converts PingER data into a graph database in the Resource Description Framework (RDF) format [ref]. This graph database is very flexible; the schema can be efficiently evolved over time; and the whole dataset can be retrieved using SPARQL queries [ref].

The advantage of putting the data in this format is that, since it is a W3C standard recommendation, there is a large community familiar with it and doing research on it, so those people should find it very comfortable to retrieve PingER data this way. If someone interested in the data has never heard of RDF, it is very simple to learn what is needed to use the project. In addition, we are publishing the data in a totally open format, so anyone can use it. More importantly, since it is open, someone may come up with a never-imagined, interesting use of the data.

Moreover, our data is being linked to many other existing datasets, such as DBPedia (a graph database for Wikipedia) [ref], Geonames (geographic information) [ref], and Freebase [ref]. The data can also easily be linked to any other existing database published in this format. The number of initiatives publishing data in this format has been growing over time, so we may find an even better and totally different use for this project in the near future [ref]. The graph database is, of course, internally linked with PingER's own data, so we can create very specific, rich, and interesting queries to retrieve data that would be impossible (or extremely difficult) to retrieve without the project.

Finally, the existing APIs for handling RDF provide well-known publishing formats such as JSON [ref], CSV [ref], and XML [ref]. We can conveniently take the results of a query and feed them into visualization libraries to produce very interesting visualizations of the data.

2. Ontology

a. Goal

To define the vocabulary used by PingER as well as its terms, concepts, taxonomy, and the relations among them.

b. Progress

We are basing our ontology on the one proposed by the MOMENT project [1], described in the article _Measurement Ontology for IP traffic (MOI); IP traffic measurement ontologies architecture_, by the European Telecommunications Standards Institute (ETSI) [2].

More terms and relations were added to and edited in the base ontology. More specifically, metrics were added (Directivity, Conditional Loss Probability, Duplicate Packets, Inter-Packet Delay Variation, Out-of-Order Packets, TCP Throughput, Ping Unpredictability, Ping Unreachability, and Zero Packet Loss Frequency), as were statistical measurements (Inter-quartile Range, Median, and the 90%, 95%, and 75% percentiles). Information about the nodes was also added, based on the Confluence specifications [3] (Site Name, Nick Name, Full Name, Contact for the host, Comments, Trace Server, App User, Ping Size, Ping Server, URL, Project Type, and GMT). A hypothetical sketch of how such a term could be declared is shown below.
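
To make this concrete, here is a minimal, hypothetical sketch of declaring one of these metric terms as a subclass of the base ontology's metric concept, using the Sesame model API. Both namespace URIs and both local names are placeholders, not the actual project URIs.

// Hypothetical sketch of declaring a PingER metric term in the ontology.
// Both namespaces below are placeholders, not the actual project URIs.
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.ValueFactoryImpl;
import org.openrdf.model.vocabulary.RDFS;

public class OntologyTerms {
    public static void main(String[] args) {
        ValueFactory vf = ValueFactoryImpl.getInstance();
        URI tcpThroughput = vf.createURI("http://example.org/pinger#", "TCPThroughput");
        URI baseMetric = vf.createURI("http://example.org/moi#", "Metric");
        // pinger:TCPThroughput rdfs:subClassOf moi:Metric
        System.out.println(vf.createStatement(tcpThroughput, RDFS.SUBCLASSOF, baseMetric));
    }
}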

Other existing ontologies are also being linked to our ontology:

...

  • Freebase [ref], a large open graph database.
  • DBPedia [6], both the ontology and the resources, will be used to provide more information about any geographic location, or anything else Wikipedia has, that can be connected to PingER, making very specific queries possible.
c. To do

After building very complex _mashups_ [ref] using the ontology, we need to verify whether the ontology needs any adjustments.

3. RDF Repository

a. Goal

To establish a good environment for the RDF repository.

b. Progress

We analyzed the existing technologies that could make this possible. There are well-known triple stores such as Jena, Sesame, and Virtuoso [7]. According to [7], Virtuoso struggles to load large datasets (>1M triples). Hence, we first decided to try Jena SDB (with MySQL) and then Sesame Native.

i. Jena SDB results

After setting up Jena SDB 1.3.6, with its native Java API for storing and querying a MySQL relational database [ref], [ref], we were able to load a reasonably large amount of data to test.
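
For reference, here is a minimal sketch of how such a Jena SDB store can be opened; the JDBC URL and credentials are placeholders, and the MySQL JDBC driver is assumed to be on the classpath.

// Minimal sketch of opening a Jena SDB store backed by MySQL.
// The JDBC URL and credentials are placeholders.
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.sdb.SDBFactory;
import com.hp.hpl.jena.sdb.Store;
import com.hp.hpl.jena.sdb.StoreDesc;
import com.hp.hpl.jena.sdb.sql.SDBConnection;
import com.hp.hpl.jena.sdb.store.DatabaseType;
import com.hp.hpl.jena.sdb.store.LayoutType;

public class JenaSdbExample {
    public static void main(String[] args) {
        StoreDesc desc = new StoreDesc(LayoutType.LayoutTripleNodesHash, DatabaseType.MySQL);
        SDBConnection conn = new SDBConnection("jdbc:mysql://localhost/pinger_sdb", "user", "password");
        Store store = SDBFactory.connectStore(conn, desc);
        Model model = SDBFactory.connectDefaultModel(store); // RDF data is read and written through this Model
        System.out.println("Triples in store: " + model.size());
        store.close();
    }
}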

More specifically, data for all nodes and all yearly measurements were loaded. However, querying the dataset was very disappointing. A very simple query, such as listing the value of a given metric in a given year for a given pair of sites, took 7 minutes to run, even though we had fewer than 100K triples.

Since we will have far more than 100K triples and 7 minutes is way too long for such a simple query, we decided to change the RDF repository. In addition, the results in articles [7] and [8] show that there are better alternatives. Jena SDB may work very well for smaller datasets, but that is not PingER's case.

ii. Sesame Native results

Using the results of [7] and [8], we decided to try OpenRDF Sesame Native 2.7.2. This solution requires a Java web server as an interface between the loading program and the RDF repository itself. We are using the latest Tomcat version, Tomcat 7 [ref]. The physical-layer implementation of the repository allows customization of its indexes, which is said to significantly enhance query execution performance [ref].

We are using the indexes _spoc, sopc, psoc, posc, opsc, ospc_. The order of the [s]ubject, [p]redicate, and [o]bject fields in an index name defines the search pattern it optimizes. For example, the index _spoc_ optimizes searches in which the subject is given as the first field of the statement (a statement is composed of subject, predicate, and object). Conversely, a query such as "list all instances of metrics that are of 'type' 'TCP Throughput'", where "type" (the predicate) and "TCP Throughput" (the object) are given and the metric (the subject) is being searched for, takes advantage of a predicate-first index such as _posc_. The [c]ontext is a fourth field of the "triple" statements and is used for named graphs in Sesame; our solution does not use contexts, though. A sketch of this setup and of the example query follows.
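
As an illustration, here is a minimal sketch of creating an embedded Sesame Native repository with these indexes and running the example query. The deployed setup goes through the Tomcat web application instead, and the pinger namespace and class name below are placeholders.

// Minimal sketch: Sesame Native repository with custom indexes, plus the
// example query described above. The pinger: namespace is a placeholder.
import java.io.File;
import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.nativerdf.NativeStore;

public class SesameNativeExample {
    public static void main(String[] args) throws Exception {
        File dataDir = new File("/var/rdf/pinger"); // placeholder data directory
        Repository repo = new SailRepository(new NativeStore(dataDir, "spoc,sopc,psoc,posc,opsc,ospc"));
        repo.initialize();
        RepositoryConnection con = repo.getConnection();
        try {
            String query =
                "PREFIX pinger: <http://example.org/pinger#>\n" + // hypothetical namespace
                "SELECT ?metric WHERE { ?metric a pinger:TCPThroughput }";
            TupleQuery tq = con.prepareTupleQuery(QueryLanguage.SPARQL, query);
            TupleQueryResult result = tq.evaluate();
            while (result.hasNext()) {
                BindingSet bs = result.next();
                System.out.println(bs.getValue("metric"));
            }
            result.close();
        } finally {
            con.close();
            repo.shutDown();
        }
    }
}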

However, multiple indexes increase the disk space used (to store the indexes), and loading the data also takes longer because the indexes must be maintained.

Note: [7] points out that the largest dataset known to have been loaded into Sesame Native has only 50M triples. If we extrapolate beyond this number, there is no guarantee that the performance will remain satisfactory. In fact, there is already a warning sign in our scenario: we already have 12.5M triples.

Finally, the exact same query that took 7 minutes to run in solution (i), Jena SDB, took less than 5 seconds using solution (ii), OpenRDF Sesame Native. The circumstances were the same in both environments, i.e., same computer, same dataset, and same query.

...

Establish an easy way to access the RDF data.

b. Progress

We are using the Java Tomcat web server to host a SPARQL endpoint [ref]. The HTML page has a text area in which the user can write SPARQL 1.1 queries [ref] against the RDF repository. By default, the results are shown as HTML tables in a JSP page. However, there will be a combo box for choosing the format in which the results are shown. The available formats will be triples in CSV, JSON, and XML-RDF. A sketch of how such format selection might look is shown below.
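
Here is a minimal sketch of writing query results in the selected format using Sesame's result writers; the servlet wiring and the format parameter values are simplified placeholders.

// Minimal sketch of writing query results in the selected format,
// assuming a Sesame TupleQuery is already prepared. Servlet wiring,
// parameter names, and error handling are simplified placeholders.
import java.io.OutputStream;
import org.openrdf.query.TupleQuery;
import org.openrdf.query.resultio.TupleQueryResultWriter;
import org.openrdf.query.resultio.sparqljson.SPARQLResultsJSONWriter;
import org.openrdf.query.resultio.sparqlxml.SPARQLResultsXMLWriter;
import org.openrdf.query.resultio.text.csv.SPARQLResultsCSVWriter;

public class ResultFormatter {
    // format comes from the combo box: "csv", "json", or "xml"
    public static void writeResults(TupleQuery query, String format, OutputStream out) throws Exception {
        TupleQueryResultWriter writer;
        if ("csv".equals(format)) {
            writer = new SPARQLResultsCSVWriter(out);
        } else if ("json".equals(format)) {
            writer = new SPARQLResultsJSONWriter(out);
        } else {
            writer = new SPARQLResultsXMLWriter(out);
        }
        query.evaluate(writer); // streams the bindings directly to the writer
    }
}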

c. To do

Use CSS and JavaScript to make it prettier.

...

The process of generating RDF data and populating the RDF repository is divided into subsections:

i. Set up the prefixes

Following the RDF standard, all resources are uniquely identified by a URI [ref]. To write less and to better organize the statements, it is common to use namespaces (prefixes) instead of writing absolute URIs. For example, it is common to use the namespace *rdfs* for the W3C RDF Schema (http://www.w3.org/TR/rdf-schema/). A sketch of registering such prefixes follows the list below.

  • Frequency: If the repository is totally empty, i.e., it has just been created, the program should first insert the prefixes into the database. In other words, this should be done only once.
  • Time to load: Less than 1 second.
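
A minimal sketch of this one-time prefix registration, assuming a Sesame connection; the pinger namespace URI is a placeholder.

// Minimal sketch of registering namespace prefixes once, when the
// repository is empty. The pinger: namespace URI is a placeholder.
import org.openrdf.repository.RepositoryConnection;

public class PrefixSetup {
    public static void registerPrefixes(RepositoryConnection con) throws Exception {
        if (con.isEmpty()) { // only on a freshly created repository
            con.setNamespace("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
            con.setNamespace("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
            con.setNamespace("pinger", "http://example.org/pinger#"); // hypothetical
        }
    }
}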

ii. Instantiate Continents

The continents are instantiated from the program's own memory, so no HTTP GET is needed [ref]. A minimal sketch follows the list below.

  • Frequency: Only once.
  • Time to load: Less than 1 second.
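
A minimal sketch of this in-memory instantiation; the pinger namespace and the Continent class name are placeholders.

// Minimal sketch of instantiating the continents from an in-memory list.
// The pinger: namespace and Continent class name are placeholders.
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.vocabulary.RDF;
import org.openrdf.repository.RepositoryConnection;

public class ContinentLoader {
    private static final String NS = "http://example.org/pinger#"; // hypothetical
    private static final String[] CONTINENTS =
        { "Africa", "Antarctica", "Asia", "Europe", "NorthAmerica", "Oceania", "SouthAmerica" };

    public static void load(RepositoryConnection con) throws Exception {
        ValueFactory vf = con.getValueFactory();
        URI continentClass = vf.createURI(NS, "Continent");
        for (String name : CONTINENTS) {
            con.add(vf.createURI(NS, name), RDF.TYPE, continentClass);
        }
    }
}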

iii. Instantiate Countries

The program uses an HTTP GET to access the Geonames API [ref] and retrieve a JSON document with data on all countries. Then, for each entry of the JSON, the program instantiates a country in RDF and loads it into the repository. A sketch of the request appears after the list below.

  • Frequency: Only once.
  • Time to load: Less than 3 minutes.
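
A minimal sketch of the country fetch, using Geonames' countryInfoJSON service; the username is a placeholder, since a registered Geonames account is required.

// Minimal sketch of fetching all countries from the Geonames API.
// The username is a placeholder; a registered Geonames account is required.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class GeonamesCountries {
    public static String fetchCountriesJson() throws Exception {
        URL url = new URL("http://api.geonames.org/countryInfoJSON?username=demo");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        return json.toString(); // one entry per country, to be instantiated in RDF
    }
}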

...

For each entry in the Node Details JSON, the program runs HTTP GETs against the Geonames API to find the nearest town (with at least 1,000 inhabitants) and city (with at least 15,000 inhabitants) based on the latitude and longitude of the site. The state in which the town is located (if applicable) is also instantiated and linked to the town. The program finally inserts the instantiated data into the repository (see the sketch after the list below).

Note: The program also tries to link each found town with other known RDF datasets (DBPedia [ref] and Freebase [ref]).

  • Frequency: Probably the same as (iv).
  • Time to load: ~30 minutes.
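
A minimal sketch of the nearest-place lookups, assuming Geonames' findNearbyPlaceName service with its cities filter (cities1000 keeps places of at least 1,000 inhabitants, cities15000 of at least 15,000); the username is a placeholder.

// Minimal sketch of building the nearest-town and nearest-city requests.
import java.net.URL;

public class NearestPlace {
    public static URL nearbyTownRequest(double lat, double lng) throws Exception {
        return new URL("http://api.geonames.org/findNearbyPlaceNameJSON"
            + "?lat=" + lat + "&lng=" + lng
            + "&cities=cities1000&username=demo"); // placeholder username
    }

    public static URL nearbyCityRequest(double lat, double lng) throws Exception {
        return new URL("http://api.geonames.org/findNearbyPlaceNameJSON"
            + "?lat=" + lat + "&lng=" + lng
            + "&cities=cities15000&username=demo"); // placeholder username
    }
}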

...

After loading the repository with the data specified in the previous step (5), the program needs to load PingER measurement data.

The first step in this process is to generate the Monitoring-Monitored [ref] JSON. The program executes an HTTP GET on http://www-wanmon.slac.stanford.edu/cgi-wrap/dbprac.pl?monalias=all to retrieve all monitoring nodes. Then, for each monitoring node, another HTTP GET is executed on http://www-wanmon.slac.stanford.edu/cgi-wrap/dbprac.pl?monalias=EDU.SLAC.STANFORD.N3&find=1, where the value of _monalias_ is the given monitoring node, to retrieve the nodes monitored by that monitoring node. The Monitoring-Monitored JSON is then generated and written to a file. A sketch of this process follows the list below.

  • Frequency of generating this JSON: Probably the same as (5.iv)
  • Time to generate: Less than 2 minutes.
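
A minimal sketch of this two-step crawl; the fetch() and parseNodeNames() helpers are placeholders standing in for the HTTP GET and response-parsing logic described above.

// Minimal sketch of building the Monitoring-Monitored JSON.
import java.util.HashMap;
import java.util.Map;

public class MonitoringMonitored {
    private static final String BASE =
        "http://www-wanmon.slac.stanford.edu/cgi-wrap/dbprac.pl?monalias=";

    public static Map<String, String> build() throws Exception {
        // 1) Retrieve all monitoring nodes.
        String monitoringNodes = fetch(BASE + "all");
        Map<String, String> result = new HashMap<>();
        // 2) For each monitoring node, retrieve the nodes it monitors.
        for (String node : parseNodeNames(monitoringNodes)) {
            result.put(node, fetch(BASE + node + "&find=1"));
        }
        return result; // serialized to the Monitoring-Monitored JSON file
    }

    private static String fetch(String url) throws Exception { /* HTTP GET, as sketched earlier */ return ""; }
    private static String[] parseNodeNames(String response) { /* response parsing */ return new String[0]; }
}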

Having the JSON, the instantiation process works as follows: for each monitoring node (entry of the JSON), for each metric, for each packet size, and for each time parameter, the program executes an HTTP GET on the Pingtable [ref] Tab-Separated Values (TSV) file specified by crossing all these parameters (a sketch of the URL construction follows the parameter list below). A TSV URL example is of the form

http://www-wanmon.slac.stanford.edu/cgi-wrap/pingtable.pl?format=tsv&file=average_rtt&by=by-node&size=100&tick=allyearly&from=EDU.SLAC.STANFORD.N3&to=WORLD&ex=none&only=all&dataset=hep&percentage=any

where the parameters are:

  • from – is the monitoring node that pings other monitored nodes.
  • tick – represents the time aggregation. PingER has data from 1998 to 2013. At this moment, the project is considering only the following tick parameters:
    • allyearly
    • allmonthly
    • last365days
  • size – is the packet size. At this moment, the project is considering only packet sizes of 100 bytes.
  • file – is the network measurement metric. At this moment, the project is considering only the following metrics:
    • Mean Opinion Scores
    • Directivity
    • Average Round Trip Time
    • Conditional Loss Probability
    • Duplicate Packets
    • Inter Packet Delay Variation
    • Minimum Round Trip Delay
    • Packet Loss
    • TCP Throughput
    • Unreachability
    • Zero Packet Loss Frequency 
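
A minimal sketch of building one such TSV URL from the loop parameters; the values mirror the example URL above.

// Minimal sketch of building one Pingtable TSV URL from the loop parameters.
public class PingtableUrl {
    public static String tsvUrl(String metric, String monitoringNode, int size, String tick) {
        return "http://www-wanmon.slac.stanford.edu/cgi-wrap/pingtable.pl"
            + "?format=tsv"
            + "&file=" + metric          // e.g. average_rtt
            + "&by=by-node"
            + "&size=" + size            // currently always 100
            + "&tick=" + tick            // allyearly, allmonthly, or last365days
            + "&from=" + monitoringNode  // e.g. EDU.SLAC.STANFORD.N3
            + "&to=WORLD&ex=none&only=all&dataset=hep&percentage=any";
    }
}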

Note: This process is totally independent of the previous step (5). Hence it can be independently parallelized. However, if this step is executed before the previous, the measurement information regarding the nodes will point to broken links, which is not a big problem and does not prevent loading measurement data. The broken links will be automatically repaired when the nodes are successfully instantiated (section 5.vii).

...

c. Performance Evaluation

...

Last 365 days: For each monitoring node and each metric, it is taking around 1 hour to load the data into the repository. Hence, for 80 monitoring nodes and 11 metrics, it will take approximately 880 hours (over 36 days), an impracticable amount of time.

...

d. To do

...

Loading the entire dataset is taking a huge amount of time, and we must optimize this process. The project is being restructured so it can be more easily parallelized into independent processes, and each process is being profiled to try to save some minutes. A sketch of one possible parallelization is shown below.
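
A minimal sketch of one possible parallelization, with one task per (monitoring node, metric) pair; the pool size and the loadTsv() helper are placeholders for the fetch-and-instantiate step described above.

// Minimal sketch of parallelizing the load, one task per (node, metric) pair.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {
    public static void loadAll(String[] monitoringNodes, String[] metrics) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // tune to the machine
        for (final String node : monitoringNodes) {
            for (final String metric : metrics) {
                pool.submit(new Runnable() {
                    public void run() { loadTsv(node, metric); }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(7, TimeUnit.DAYS);
    }

    private static void loadTsv(String node, String metric) { /* fetch TSV and load RDF */ }
}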

Measure time and space taken to load the repository.

Complex tests.

7. Rich Visualization of the Data

...

a. Goal

...

Provide smart and useful visualization of PingER data in RDF format.

...

b. Progress

...

We studied the possibility of using 3 APIs:

i. Google Maps JavaScript API v3 [ref]
ii. Google Geo Charts [ref]
iii. Google Public Data Explorer [ref]

All of them seem to be very useful and can provide rich visualizations. (i) and (iii) seem to be the most powerful of them.

...

c. To do

...

We need to think about good and useful mashups to use within this entire project and show them with these visualization APIs. One type of mashup being investigated is retrieving data from DBPedia to cross PingER data with information related to universities (such as endowment, number of students, whether the university is public or private, etc.).

8. Documentation

...

a. Goal

...

Document the entire project.

...

b. Progress

...

This Project Progress report is being written. It is kept in both MS Word format and HTML (to be used in Confluence).

The Confluence To do-Doing-Done page is updated more frequently.

...

c. To do

...

The Java project should be documented. Javadocs should be generated for each class and method.

An interactive JavaScript document will be generated to graphically represent the ontology, in order to help users work with the RDF data.

An installation guide should be written. This should include how to configure the environment and everything needed to compile and run the project. Both the RDF Repository (with Tomcat settings) and the Sparql Endpoint projects should have an installation guide.

9. References

The reference section needs to be fully reviewed. Proper pointers need to be inserted in the text above, especially wherever [ref] appears.

[1] Project MOMENT Ontologies. Retrieved from https://svn.fp7-moment.eu/svn/moment/public/Ontology/ on June 5, 2013.


[2] ETSI. Measurement Ontology for IP traffic (MOI); IP traffic measurement ontologies architecture. Retrieved from http://www.etsi.org/deliver/etsi_gs/MOI/001_099/003/01.01.01_60/gs_moi003v010101p.pdf on June 5, 2013.

[3] PingER Node Details. Retrieved from https://confluence.slac.stanford.edu/display/IEPM/PingER+NODEDETAILS on June 5, 2013.

[4] Food And Agriculture Organization Of The United Nations Ontology. Retrieved from http://www.fao.org/countryprofiles/geoinfo/geopolitical/ on June 5, 2013.

[5] Friend of a Friend Ontology. Retrieved from http://www.foaf-project.org/ on June 5, 2013.

[6] DBPedia. Retrieved from http://dbpedia.org/About on June 5, 2013.

[7] Comparison of Triple Stores. Retrieved from http://www.bioontology.org/wiki/images/6/6a/Triple_Stores.pdf on June 5, 2013.

Mashup (web application hybrid). Retrieved from http://en.wikipedia.org/wiki/Mashup_(web_application_hybrid).

...