...

With the amount of data gathered worldwide and the constant flux in Internet hosts, particularly hosts in developing regions (hosts move or are replaced, renamed or removed; links change in performance and routing; pings get blocked or rate limited), it is critical to validate the gathered data and the configuration data on a regular basis. The daily workflow is described elsewhere. This document describes the daily validity checking.

...

  • the latitudes and longitudes are not missing and are in a valid range;
  • the IP address for a node is in the valid range;
  • two or more hosts do not share the same IP address;
  • all nodes being pinged appear in the database; this includes hosts chosen by individual monitoring sites.
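
The checks listed above can be sketched in a few lines of Python. This is only an illustration, not the PingER implementation: the record layout (dicts with `name`, `ip`, `lat`, `lon` keys) and the function name `validate_nodes` are assumptions made for the example.

```python
import ipaddress
from collections import defaultdict


def validate_nodes(nodes, pinged_names):
    """Return a list of (host name, problem) tuples.

    `nodes` is a list of dicts with hypothetical keys name, ip, lat, lon.
    `pinged_names` is the set of host names that monitoring sites ping.
    """
    problems = []
    names_by_ip = defaultdict(list)
    for node in nodes:
        name = node["name"]
        lat, lon = node.get("lat"), node.get("lon")
        if lat is None or lon is None:
            problems.append((name, "missing latitude/longitude"))
        elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append((name, "latitude/longitude out of range"))
        try:
            # ip_address() raises ValueError for anything unparsable.
            ip = ipaddress.ip_address(str(node.get("ip") or ""))
            names_by_ip[str(ip)].append(name)
        except ValueError:
            problems.append((name, "invalid IP address"))
    # Two or more hosts must not share the same IP address.
    for ip, names in names_by_ip.items():
        if len(names) > 1:
            for name in names:
                problems.append((name, f"shares IP {ip}"))
    # Every pinged host must appear in the database.
    known = {node["name"] for node in nodes}
    for name in sorted(pinged_names - known):
        problems.append((name, "pinged but not in database"))
    return problems
```
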

...

  • if the host is not resolvable by the name servers;
  • if the IP address differs between NODEDETAILS and offsite.nodes. This is reported in https://www-iepm.slac.stanford.edu/pinger/sites-per-country.html towards the bottom.
  • if the IP address extracted from the ping result does not match that in the NODEDETAILS database then we track the changes and check:
    • if the IP address change is small (e.g. in the last field), and there is a small change in the min-RTT this usually indicates a host name being assigned to multiple hosts. Typical examples are clusters of hosts and we identify such hosts. Currently, about 4% of the remote hosts fall in this category;
    • if the change in min-RTT is large then we investigate further to see whether the host has moved. This is typically associated with a large address change. For example, the address of the host with the name 24-7online.co.za (in South Africa) recently changed from 196.3.165.25 to 78.31.108.62. Further investigation of the PingER archived data showed that on November 23 2008 the min-RTT from SLAC abruptly changed from 330 ms to 144 ms. Geo IP Tools now shows it is in Reading, England. On average we see about 1 remote host per week making such changes. Typically it is a web server seeking better response time by using a proxy with a good Ethernet connection.
    • This is reported in https://www-iepm.slac.stanford.edu/pinger/sites-per-country.html towards the bottom
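
The triage of address changes described above could be sketched as follows. The 20 ms threshold, the use of a shared /24 as the meaning of "small change in the last field", the IPv4-only handling and the function name are all assumptions for illustration; the text does not give numeric limits.

```python
import ipaddress


def classify_ip_change(old_ip, new_ip, old_min_rtt_ms, new_min_rtt_ms,
                       rtt_threshold_ms=20.0):
    """Classify an IPv4 address change seen in the ping results.

    rtt_threshold_ms is an assumed cut-off between a "small" and a
    "large" change in min-RTT.
    """
    if old_ip == new_ip:
        return "unchanged"
    # "Small" address change: only the last field differs (same /24).
    old_net = ipaddress.ip_network(f"{old_ip}/24", strict=False)
    same_24 = ipaddress.ip_address(new_ip) in old_net
    rtt_delta = abs(new_min_rtt_ms - old_min_rtt_ms)
    if rtt_delta >= rtt_threshold_ms:
        return "investigate-move"   # host may have moved
    if same_24:
        return "probable-cluster"   # one name assigned to several hosts
    return "review"                 # large address change, small RTT change
```

With the figures quoted above for 24-7online.co.za (196.3.165.25 to 78.31.108.62, min-RTT 330 ms to 144 ms), this sketch returns `investigate-move`.
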

Discovering and Replacing Faulty Hosts

Besides looking for and following up on address changes, we make daily measurements via the downsites.pl script to create a sortable, color-coded table of Monitoring/Remote host pairs for any pairs for which pings fail to respond. We categorize the reasons for no response as follows:

  • Remote hosts for which the DNS lookup fails;
  • Remote hosts that do not respond to pings but do respond to queries to one of a set of well-known ports (80, 7, 53, 23, 25, 21, 37, 79). Such hosts are probably blocking pings.
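
A minimal sketch of this categorization is shown below. It is not the downsites.pl code: probing with a TCP connect is an assumption (the text only lists port numbers), and the third label, `no-response`, is added here for hosts that fall into neither category. The resolver and the probe are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Port list taken from the text above.
WELL_KNOWN_PORTS = (80, 7, 53, 23, 25, 21, 37, 79)


def tcp_port_open(ip, port, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_unresponsive(host, resolve=socket.gethostbyname,
                          probe=tcp_port_open):
    """Categorize a remote host that did not answer pings."""
    try:
        ip = resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "dns-failure"
    if any(probe(ip, port) for port in WELL_KNOWN_PORTS):
        return "probably-blocking-pings"
    return "no-response"
```
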

In addition, we note the following in the sortable table:

  • hosts that respond to an IP address but not to the name.
  • whether a host was down for at least a day in the last week:
    • if so, the number of days it was down is reported, as well as the latest date it was found up and whether it is responding now;
    • if the host was down for the entire week then we search back to see when it was last up, and report how long it has been down;
    • how many consecutive days a host has not responded to a ping.
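
The downtime figures above can be derived from a per-day up/down history, as in the following sketch. The input format (a dict mapping dates to booleans) and the function name are assumptions for illustration.

```python
from datetime import date, timedelta


def downtime_summary(daily_up, today):
    """Summarize downtime for one host.

    daily_up maps a date to True (host answered pings that day) or
    False (it did not); days with no measurement are simply absent.
    """
    last_week = [today - timedelta(days=i) for i in range(7)]
    days_down = sum(1 for d in last_week if daily_up.get(d) is False)

    # Count consecutive down days, walking back from today.
    consecutive = 0
    day = today
    while daily_up.get(day) is False:
        consecutive += 1
        day -= timedelta(days=1)

    # Search the whole history for the latest day the host was up.
    up_days = [d for d, up in daily_up.items() if up and d <= today]
    return {
        "days_down_last_week": days_down,
        "consecutive_days_down": consecutive,
        "last_up": max(up_days) if up_days else None,
        "up_now": daily_up.get(today) is True,
    }
```
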

...

Remote hosts that are now deemed invalid (e.g. they have moved and no longer represent a region, they do not respond, or multiple hosts in different locations respond to the same name, as with root name servers and distributed servers such as email servers) are Disabled in NODEDETAILS. However, the data is not removed from the archive. To accommodate invalid data from such hosts, the analysis keeps a list of filters to remove invalid data between host pairs for selected periods. The newly developed PingER metrics motion chart tool also enables us to quickly spot hosts that have anomalous PingER metrics (e.g. min-RTT) and lie outside the chart area occupied by similar hosts for some period of time.
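
One way to picture such a filter list is as a set of (monitoring host, remote host, start date, end date) entries applied when the archive is read. The sketch below assumes that layout and uses placeholder host names; the actual filter format used by the analysis is not described here.

```python
from datetime import date

# Hypothetical filter entries: (monitor, remote, first bad day, last bad day).
FILTERS = [
    ("monitor.example.org", "remote.example.net",
     date(2008, 6, 1), date(2008, 6, 30)),
]


def is_filtered(monitor, remote, day, filters=FILTERS):
    """Return True if data for this host pair on this day is marked invalid."""
    return any(
        monitor == f_mon and remote == f_rem and start <= day <= end
        for f_mon, f_rem, start, end in filters
    )


def apply_filters(measurements, filters=FILTERS):
    """Drop invalid measurements; the archive itself is left untouched."""
    return [
        m for m in measurements
        if not is_filtered(m["monitor"], m["remote"], m["day"], filters)
    ]
```
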

...

In cases where the removed remote host is important (e.g. is one of two hosts representing an entire country) and needs replacing with another host, we have developed a HostSearcher tool. It first interrogates Google for up to 1000 unique hosts in a selected country (using the Top Level Domain (TLD) feature). The tool then pings each host 10 times (by default) to ensure it responds and saves the min-RTT. Finally, it checks the filtered hosts with Geo IP Tool to obtain the TLD, latitude/longitude, country, and city. The final list of hosts obtained can be further filtered by whether they are really in the country, their loss rate, their min-RTT, diverse location within the country etc. Though not always accurate, this has been a great aid on numerous occasions. Unfortunately, for a few developing countries (e.g. Chad, Libya) even this method fails to produce suitable monitorable hosts.
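
The final filtering step could look roughly like the sketch below. It is not the HostSearcher code: the `Candidate` record, the 10% loss cut-off and the "best host per city" rule for location diversity are assumptions chosen to illustrate the criteria named above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    host: str
    country: str       # country code from the geolocation lookup
    city: str
    loss_pct: float    # ping loss in percent
    min_rtt_ms: float


def shortlist(candidates, country, max_loss_pct=10.0):
    """Keep responsive hosts really located in `country`, one per city."""
    ok = [c for c in candidates
          if c.country == country and c.loss_pct <= max_loss_pct]
    # Prefer low loss, then low min-RTT.
    ok.sort(key=lambda c: (c.loss_pct, c.min_rtt_ms))
    seen_cities = set()
    result = []
    for c in ok:
        if c.city not in seen_cities:   # favour diverse locations
            seen_cities.add(c.city)
            result.append(c)
    return result
```
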

...

After the script getdata.pl is run from a trscrontab on pinger@pinger.slac.stanford.edu to gather data from the monitoring hosts, the data is inspected by checkdata_gif.pl for non-responding monitoring hosts, unusual responses from monitoring hosts, and invalid data such as missing tokens, inability to send 10 packets etc. In addition, a table is constructed showing the state (no response from the monitor, no data from the monitor, partial data from the monitor, success) of gathering the data for each monitor node. Besides showing the gathering status going back many months, the table also provides easy links to dynamically test the monitoring host for its ping reachability and its response to the web gather request. Emails are sent daily to the central administrators indicating which monitoring hosts were not successful. The typical follow-up after a few days is to email the contact(s) at the monitoring node to request help in fixing the problem. At any given time we are unable to gather data from about 10% of the monitoring nodes.

...

Rate limiting is much harder to discover than simple blocking. In some cases, it will result in a sudden increase in losses. For example, ICTP turned on rate limiting at their border in May-Jun 2008. This affected the losses measured from ICTP to almost all its remote hosts. On the left in the table below is an example plot of losses and RTT from ICTP to a host in Brazil. It shows the increase in hourly ping losses starting early in June 2008. On the right is the median loss from ICTP to 93 sites worldwide in June 2008, showing a steep increase on June 6th and 7th as rate limiting was applied.

Losses from ICTP to a Brazilian site

Median losses from ICTP to 93 sites worldwide
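
The median-loss view described above lends itself to a simple automated check: compute the median loss across all remote sites for each day and flag days where it rises sharply. The sketch below assumes a dict of per-site daily loss percentages and a 5 percentage point jump threshold; both are illustrative choices, not values used by PingER.

```python
from statistics import median


def median_loss_by_day(loss_by_site):
    """loss_by_site maps a site name to a list of daily loss percentages
    (same days, same order, for every site). Returns the per-day median."""
    return [median(day_values) for day_values in zip(*loss_by_site.values())]


def find_jumps(daily_median, min_increase=5.0):
    """Return the indices of days whose median loss rose by at least
    min_increase percentage points over the previous day."""
    return [
        i for i in range(1, len(daily_median))
        if daily_median[i] - daily_median[i - 1] >= min_increase
    ]
```

A jump that shows up in the median across many remote sites at once points to a cause near the monitoring site, such as rate limiting at its border, rather than a problem at one remote host.
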

Another useful tool to spot anomalous values is the motion bubble plot. In particular, looking at min-RTT for countries seen from a monitoring site can quickly show impossibly small minimum RTTs and spot when they start and stop occurring.
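
An "impossibly small" minimum RTT can also be tested numerically: a round trip cannot be faster than light covering twice the great-circle distance between the two hosts. The sketch below uses the speed of light in vacuum as a hard lower bound (real paths through fibre are slower, so a practical check could be stricter) and a spherical Earth of radius 6371 km. The function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0
LIGHT_SPEED_KM_PER_MS = 299.792458  # speed of light in vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def min_possible_rtt_ms(lat1, lon1, lat2, lon2):
    """Lower bound on the round-trip time between two locations."""
    return 2 * great_circle_km(lat1, lon1, lat2, lon2) / LIGHT_SPEED_KM_PER_MS


def is_impossible(lat1, lon1, lat2, lon2, min_rtt_ms):
    """True if the measured min-RTT is below the physical lower bound."""
    return min_rtt_ms < min_possible_rtt_ms(lat1, lon1, lat2, lon2)
```

For example, with approximate coordinates for SLAC (37.42, -122.20) and Reading, England (51.45, -0.97), a measured min-RTT of 5 ms would be flagged as impossible, while 144 ms would not.
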

A missing TLD for a country is reported in https://www-iepm.slac.stanford.edu/pinger/sites-per-country.html towards the bottom. To add a missing TLD, edit /afs/slac/g/www/www-iepm/pinger/countries.csv.

Errors observed in getting data

See the included page: Errors Observed in Getting Data.