PingER Daily Management

Given the amount of data gathered worldwide and the constant flux in Internet hosts, particularly hosts in developing regions (hosts move, are replaced, renamed or removed; links change in performance and routing; pings get blocked or rate limited), it is critical to validate both the gathered data and the configuration data on a regular basis. The daily workflow is described elsewhere. This document describes the daily validity checking.

Validating Meta Database

Many automatic validity checks and reports are made on PingER's meta database NODEDETAILS that describes the various hosts.
These checks are made on a daily basis and include validating that:

  • the latitudes and longitudes are not missing and are in a valid range;
  • the IP address for a node is in the valid range;
  • two or more hosts do not share the same IP address;
  • all nodes being pinged appear in the database, including hosts chosen by individual monitoring sites.
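
The checks listed above can be sketched in a few lines. This is a minimal illustration only, not the code PingER runs: it assumes each NODEDETAILS row has been loaded into a dict with the hypothetical keys name, ip, lat and lon, and the function names are made up for this example.

```python
import ipaddress
from collections import defaultdict


def validate_nodes(nodes):
    """Return (node_name, problem) tuples for a list of node dicts.

    Each dict uses hypothetical keys: name, ip, lat, lon.
    """
    problems = []
    names_by_ip = defaultdict(list)
    for node in nodes:
        name = node["name"]
        lat, lon = node.get("lat"), node.get("lon")
        if lat is None or lon is None:
            problems.append((name, "missing latitude/longitude"))
        elif not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            problems.append((name, "latitude/longitude out of range"))
        try:
            ip = ipaddress.ip_address(str(node.get("ip", "")))
        except ValueError:
            problems.append((name, "invalid IP address"))
            continue
        names_by_ip[ip].append(name)
    # Report every host that shares its address with another host.
    for ip, names in names_by_ip.items():
        if len(names) > 1:
            for name in names:
                problems.append((name, f"shares IP address {ip}"))
    return problems


def missing_from_database(pinged_names, nodes):
    """Return the pinged host names that have no entry in the node list."""
    return set(pinged_names) - {node["name"] for node in nodes}
```

Returning a flat list of problems keeps the report easy to sort by host or by problem type.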

Validating Hosts

Daily checks are made on the validity of the remote hosts by pinging each in turn. This is performed by the ping-beacons.pl script that is run from a trscronjob on pinger@pinger.slac.stanford.edu. We check and report:

  • if the host is not resolvable by the name servers;
  • if the IP address differs between NODEDETAILS and offsite.nodes. This is reported towards the bottom of https://www-iepm.slac.stanford.edu/pinger/sites-per-country.html;
  • if the IP address extracted from the ping result does not match that in the NODEDETAILS database. In that case we track the changes and check:
    • if the IP address change is small (e.g. in the last field) and there is only a small change in the min-RTT, this usually indicates a host name being assigned to multiple hosts. Typical examples are clusters of hosts, and we identify such hosts. Currently, about 4% of the remote hosts fall in this category;
    • if the change in min-RTT is large, we investigate further to see whether the host has moved. This is typically associated with a large address change. For example, the address of the host named 24-7online.co.za (in South Africa) recently changed from 196.3.165.25 to 78.31.108.62. Further investigation of the PingER archived data showed that on November 23, 2008 the min-RTT from SLAC abruptly changed from 330 ms to 144 ms. Geo IP Tools now shows it is in Reading, England. On average we see about 1 remote host per week making such changes. Typically it is a web server seeking better response time by using a proxy with a good Ethernet connection;
    • these changes are reported towards the bottom of https://www-iepm.slac.stanford.edu/pinger/sites-per-country.html.
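
The triage of an address change can be sketched as follows. This is an illustration only: the function name, the returned labels, the 20 ms tolerance and the reading of "small change" as "same first three fields of an IPv4 address" are all assumptions made for this example, not values taken from ping-beacons.pl.

```python
import ipaddress


def classify_ip_change(old_ip, new_ip, old_min_rtt_ms, new_min_rtt_ms,
                       rtt_tolerance_ms=20.0):
    """Classify an address change using the min-RTT before and after it.

    rtt_tolerance_ms is a hypothetical threshold for a "small" RTT change.
    """
    old = ipaddress.ip_address(old_ip)
    new = ipaddress.ip_address(new_ip)
    if old == new:
        return "unchanged"
    # "Small" address change: IPv4 addresses that differ only in the last field.
    small_change = (old.version == 4 and new.version == 4
                    and old.packed[:3] == new.packed[:3])
    rtt_shift = abs(new_min_rtt_ms - old_min_rtt_ms)
    if small_change and rtt_shift <= rtt_tolerance_ms:
        return "probable cluster"
    if rtt_shift > rtt_tolerance_ms:
        return "possible move"
    return "needs review"
```

With the 24-7online.co.za figures quoted above (196.3.165.25 to 78.31.108.62, min-RTT 330 ms to 144 ms) the sketch returns "possible move", which would still need the manual follow-up described.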

Discovering and Replacing Faulty Hosts

Besides looking for and following up on address changes, we run the downsites.pl script daily to create a sortable, color-coded table of the Monitoring/Remote host pairs for which pings get no response. We categorize the reasons for no response as follows:

  • Remote hosts for which the DNS lookup fails;
  • Remote hosts that do respond to queries to one of a set of well-known ports (80,7,53,23,25,21,37,79). Such hosts are probably blocking pings.
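
A port probe of this kind might look like the sketch below. It assumes TCP connections only, which the text above does not specify, and the function name and the connect parameter (there so the logic can be exercised without touching the network) are inventions for this example rather than part of downsites.pl.

```python
import socket

# Port list taken from the text above.
WELL_KNOWN_PORTS = (80, 7, 53, 23, 25, 21, 37, 79)


def first_open_port(host, ports=WELL_KNOWN_PORTS, timeout=3.0,
                    connect=socket.create_connection):
    """Return the first port that accepts a TCP connection, or None."""
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except OSError:
            # Covers refused connections, timeouts and DNS failures.
            continue
        conn.close()
        return port
    return None
```

A host that does not answer pings but for which this returns a port number is a candidate for the "probably blocking pings" category.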

In addition, we note the following in the sortable table:

  • hosts that respond to an IP address but not to the name.
  • whether a host was down for at least a day in the last week:
    • if so the number of days it was down is reported as well as the latest date it was found up, and whether it is responding now
    • if the host was down for the entire week then we search back to see when it was last up, and report how long it has been down.
    • how many consecutive days a host has not responded to a ping.
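
The per-host downtime figures in the table can be derived from a simple day-by-day record. The sketch below assumes a hypothetical dict that maps each date to True (responded) or False (no response). Days with no entry are treated as unknown and are not counted as down.

```python
from datetime import date, timedelta


def summarize_downtime(status_by_day, today):
    """Summarize recent downtime from a {date: responded?} mapping."""
    last_week = [today - timedelta(days=n) for n in range(7)]
    days_down = sum(1 for d in last_week if status_by_day.get(d) is False)

    # Count back from today until the first day that is up or unknown.
    consecutive = 0
    day = today
    while status_by_day.get(day) is False:
        consecutive += 1
        day -= timedelta(days=1)

    up_days = [d for d, up in status_by_day.items() if up]
    return {
        "days_down_last_week": days_down,
        "consecutive_days_down": consecutive,
        "last_seen_up": max(up_days) if up_days else None,
        "responding_now": status_by_day.get(today) is True,
    }
```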

We also create tables of min-RTT from monitoring hosts to remote hosts sorted by monitor host followed by remote host's region. This enables us to quickly discover hosts in a region with anomalous min-RTTs. Typically these are hosts with a TLD of a developing country in the region but where the host is actually in a developed country.

Remote hosts that are now deemed invalid (e.g. they have moved and no longer represent a region, they do not respond, or multiple hosts in different locations respond to the same name, as with root name servers and distributed servers such as email servers) are Disabled in NODEDETAILS. However, the data is not removed from the archive. To accommodate invalid data from such hosts, the analysis keeps a list of filters to remove invalid data between host pairs for selected periods. The newly developed PingER metrics motion chart tool also enables us to quickly spot hosts that have anomalous PingER metrics (e.g. min-RTT) and lie outside the chart area occupied by similar hosts for some period of time.
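
One way to picture such a filter list is sketched below. The record layout, field names and matching rule are assumptions made for illustration; the actual format used by the PingER analysis is not described here.

```python
from datetime import date
from typing import NamedTuple, Optional


class PairFilter(NamedTuple):
    """Hypothetical filter entry for one remote host over a date range."""
    monitor: Optional[str]  # None means "any monitoring host"
    remote: str
    start: date
    end: date
    reason: str


def is_filtered(filters, monitor, remote, day):
    """Return True if data for this host pair on this day should be dropped."""
    for f in filters:
        if f.remote != remote:
            continue
        if f.monitor is not None and f.monitor != monitor:
            continue
        if f.start <= day <= f.end:
            return True
    return False
```

Keeping the filters separate from the archive means the raw data stays intact and a filter can be corrected or withdrawn later.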

Finding New Hosts to Monitor

In cases where the removed remote host is important (e.g. is one of two hosts or is the only host representing an entire country) and needs replacing with another host, we have developed a HostSearcher tool.

To find new targets in a country (TLD) we created the HostSearcher.pl script. It uses the Google search parameter q=site:<TLD>, e.g. q=site:my for Malaysia (<TLD>=my).

It first interrogates Google for up to 1000 unique hosts in a selected country (using the Top Level Domain (TLD) feature). The tool then pings each host 10 times (by default) to ensure it responds and saves the min-RTT. We manually filter out hosts that do not have reasonable ping RTTs for the specified country, and also those identified as not being in the right location (country) according to https://api.ipdata.co. We then manually browse the host's web site, looking for possible location information (e.g. street address, map etc.). We use this information together with Google Maps (see https://support.google.com/maps/answer/18539?co=GENIE.Platform%3DDesktop&hl=en) to get the latitude and longitude. The final list of hosts can be further filtered by whether they are really in the country, their loss rate, their min-RTT, the diversity of their locations within the country etc. Though not always accurate, this has been a great aid on numerous occasions. Unfortunately, for a few developing countries (e.g. Chad, Libya) even this method fails to produce suitable hosts to monitor.
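
The "ping 10 times and save the min-RTT" step can be sketched as below. This is not HostSearcher.pl itself. It assumes a Unix-like ping that accepts -c and prints a "min/avg/max" summary line, and the function names are hypothetical.

```python
import re
import subprocess

# Matches summary lines such as
#   rtt min/avg/max/mdev = 215.785/229.858/262.263/12.100 ms
#   round-trip min/avg/max/stddev = 14.1/15.2/17.0/1.0 ms
_SUMMARY = re.compile(r"min/avg/max(?:/[\w-]+)?\s*=\s*([\d.]+)/([\d.]+)/([\d.]+)")


def parse_min_rtt(ping_output):
    """Return the min-RTT in ms from ping's summary line, or None."""
    match = _SUMMARY.search(ping_output)
    return float(match.group(1)) if match else None


def measure_min_rtt(host, count=10, timeout_s=60):
    """Ping host `count` times; return min-RTT in ms, or None if no reply."""
    try:
        result = subprocess.run(["ping", "-c", str(count), host],
                                capture_output=True, text=True,
                                timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return parse_min_rtt(result.stdout)
```

Hosts for which measure_min_rtt returns None would be dropped before the manual filtering described above.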

For more information on using HostSearcher.pl see Finding PingER Hosts.

 

Validating Gathering of Data

The script getdata.pl is run from a trscrontab on pinger@pinger.slac.stanford.edu to gather data from the monitoring hosts. The data is then inspected by checkdata_gif.pl for non-responding monitoring hosts, unusual responses from monitoring hosts, and invalid data such as missing tokens or an inability to send 10 packets. In addition, a table is constructed showing the state of the data gathering for each monitor node (no response from the monitor, no data from the monitor, partial data from the monitor, success). Besides showing the gathering status going back many months, the table also provides easy links to dynamically test the monitoring host for its ping reachability and its response to the web gather request. Emails are sent daily to the central administrators indicating which monitoring hosts were not successful. The typical follow-up after a few days is to email the contact(s) at the monitoring node to request help in fixing the problem. At any given time we are unable to gather data from about 10% of the monitoring nodes.
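
A per-record sanity check along these lines is sketched below. It is not the logic of checkdata_gif.pl. The field layout (monitor name and address, remote name and address, packet size, timestamp, packets sent, packets received, min/avg/max RTT, then one sequence number and one RTT per received packet) is inferred from the sample record quoted later on this page and should be treated as an assumption.

```python
def check_record(line, expected_sent=10):
    """Return a list of problems found in one gathered data line."""
    tokens = line.split()
    if len(tokens) < 8:
        return ["missing tokens"]
    try:
        sent, received = int(tokens[6]), int(tokens[7])
    except ValueError:
        return ["packet counts are not integers"]
    problems = []
    if sent != expected_sent:
        problems.append(f"sent {sent} packets instead of {expected_sent}")
    if received > sent:
        problems.append("received more packets than sent")
    elif received > 0 and len(tokens) != 11 + 2 * received:
        # 8 header fields + min/avg/max + sequence numbers + RTTs
        problems.append("missing tokens")
    return problems
```

An empty list means the record passed these checks. It does not confirm that the host names in the record exist in NODEDETAILS.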

Other Validations

Rate limiting is much harder to discover than simple blocking. In some cases, it will result in a sudden increase in losses. For example, ICTP turned on rate-limiting at their border in May-Jun 2008. This affected the losses measured from ICTP to almost all its remote hosts. On the left in the table below is an example plot of losses and RTT from ICTP to a host in Brazil. It shows the increase in hourly ping losses starting early in June 2008. On the right is the median loss from ICTP to 93 sites worldwide in June 2008, showing a steep increase on June 6th and 7th as rate-limiting was applied.

Losses from ICTP to a Brazilian site

Median losses from ICTP to 93 sites worldwide
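
A step change like the one in the median-loss plot can also be flagged automatically. The sketch below is one possible approach, not an existing PingER tool: it compares each day's median loss across all remote sites with the median of the preceding days. The window length, multiplication factor and minimum loss level are arbitrary assumptions.

```python
from statistics import median


def loss_jumps(daily_losses, baseline_days=5, factor=3.0, floor_pct=1.0):
    """Return the days whose median loss jumps well above the recent baseline.

    daily_losses: chronological list of (day_label, [loss % per remote site]).
    """
    medians = [(day, median(losses)) for day, losses in daily_losses]
    flagged = []
    for i in range(baseline_days, len(medians)):
        baseline = median(m for _, m in medians[i - baseline_days:i])
        day, today = medians[i]
        if today > max(floor_pct, factor * baseline):
            flagged.append(day)
    return flagged
```

Using the median over many remote sites helps separate a problem at the monitoring site's border from a problem on a single path.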

Another useful tool to spot anomalous values is the motion bubble plot. In particular, looking at min-RTT for countries seen from a monitoring site can quickly show impossibly small minimum RTTs and show when they start and stop occurring.
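
Whether a minimum RTT is "impossibly small" can be checked against a physical lower bound: the round trip cannot be faster than light covering twice the great-circle distance between the two hosts. The sketch below uses the haversine formula and assumes signals travel at about two thirds of the speed of light, a common rule of thumb for optical fibre. The function names and that factor are assumptions, not part of the PingER tools.

```python
import math

EARTH_RADIUS_KM = 6371.0
LIGHT_KM_PER_MS = 299.792458  # speed of light in vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_possible_rtt_ms(distance_km, velocity_factor=2 / 3):
    """Lower bound on the RTT for a given one-way distance."""
    return 2 * distance_km / (LIGHT_KM_PER_MS * velocity_factor)


def is_impossibly_small(min_rtt_ms, distance_km, velocity_factor=2 / 3):
    """True if the measured min-RTT is below the physical lower bound."""
    return min_rtt_ms < min_possible_rtt_ms(distance_km, velocity_factor)
```

A flagged pair usually means that the recorded coordinates of one of the hosts are wrong or that the host is not where its name suggests.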

A missing TLD for a country is reported towards the bottom of https://www-iepm.slac.stanford.edu/pinger/sites-per-country.html. To add a missing TLD, edit /afs/slac/g/www/www-iepm/pinger/countries.csv.

Errors observed in getting data

The following error was reported on Tuesday, May 03, 2011 2:37 AM:

Your "cron" job

/afs/slac/package/pinger/analysis/analyze-all.pl --date 1days

produced the following output:

 Skipping line: mon_site(pinger.kohat.edpinger.kohat.edu.pk) not found in NODEDETAILS, for line(1411):pinger.kohat.edpinger.kohat.edu.pk 121.52.145.59 frcu.eun.eg 193.227.1.1 100 1304324727 10 10 215.785 229.858 262.263 1 2 3 4 5 6 7 8 9 10 218 249 253 217 223 217 224 216 215 262

However, looking at http://pinger.kohat.edu.pk/cgi-bin/ping_data.pl?in_form=1&begin_hour=00&begin_min=00&begin_sec=00&begin_day=2&begin_month=05&begin_year=2011&begin_offset=&begin_point=y&end_hour=23&end_min=59&end_sec=00&end_day=05&end_month=05&end_year=2011&end_offset=&end_point=y

the output shows:

pinger.kohat.edu.pk 121.52.145.59 frcu.eun.eg 193.227.1.1 100 1304324727 10 10 215.785 229.858 262.263 1 2 3 4 5 6 7 8 9 10 218 249 253 217 223 217 224 216 215 262

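This means that the record held at the monitoring host is intact: the monitor name there is pinger.kohat.edu.pk, whereas the skipped line shows the corrupted name pinger.kohat.edpinger.kohat.edu.pk.
As far as we can tell, the error message is therefore due to a fault in fetching the data rather than in measuring it. However, we have not been able to determine why the problem occurred during the fetch.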

 
