
After many tests we concluded that PlanetLab landmarks, and to a lesser extent the PingER landmarks, are unpredictable in terms of availability. If a landmark is not available, the TULIP thread has to time out, and this can dramatically extend the measurement duration. Ideally, therefore, ping requests should not be sent to such landmarks, i.e. they need to be removed from the list of active landmarks. The purpose of this page is to describe how we accomplish this.

For every access by TULIP the response from each landmark is logged. We analyze this logging information and generate a list of nodes with their success or failure percentages and the reasons for their failures. These percentages are generated by /afs/slac/package/pinger/tulip/tulip-log-analyze.pl and the results can be seen here. The idea is that landmarks with a low success rate should be removed from the active list of landmarks; this is done by disabling the entry in the tulip database.

There are two ways in which we can disable landmarks which are not responding:

1) Manual Process

2) Automated Process (run by a trscrontab job as pinger@pinger.slac.stanford.edu)

 Manual disabling of hosts

This is accomplished via the tulip database. The landmarks table in the tulip database has a field "enabled" which determines whether a landmark is added to sites.xml (the list of active landmark sites). This XML file is used by the reflector to query the active landmarks for results. sites.xml is regenerated by a trscrontab job, so if we set enabled to '0' the landmark will no longer appear in sites.xml after the job next runs. We can also update sites.xml manually. The process is described below.

  • Log in to the tulip database (username and password are available via escrow -c iepm iepmacct)
  • Select the tulip database:
 mysql> use tulip;
  • Update the value of enabled with the following SQL; in this instance we use ipv4Addr = 141.22.213.35. ipv4Addr is generally the primary key, but hostName can also be used as the identifier to disable a landmark:
 update landmarks set enabled = '0' where ipv4Addr = '141.22.213.35';
  • Regenerate sites.xml so that it uses the updated list of landmarks:
create_sites-xml.pl > /afs/slac/www/comp/net/wan-mon/tulip/sites.xml
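
The same change can also be made from a script via Perl DBI. Below is a minimal sketch, assuming the credentials from escrow; the database host, user and password shown are placeholders, not the real values.

#!/usr/bin/perl
# disable-landmark.pl (hypothetical name) -- disable one landmark by IPv4 address.
use strict;
use warnings;
use DBI;

my $ip = shift @ARGV or die "Usage: $0 <ipv4Addr>\n";

# Host, user and password are placeholders; the real values are in escrow (escrow -c iepm iepmacct).
my $dbh = DBI->connect('DBI:mysql:database=tulip;host=localhost',
                       'tulip_user', 'secret', { RaiseError => 1 });

# Same statement as the manual step above, with the address as a bind parameter.
my $rows = $dbh->do("UPDATE landmarks SET enabled = '0' WHERE ipv4Addr = ?", undef, $ip);
print "Disabled $rows landmark(s) matching $ip\n";
$dbh->disconnect;

# Afterwards regenerate sites.xml (create_sites-xml.pl) or wait for the trscrontab job to do it.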

Automated Process (run from the trscrontab)

  tulip-tuning.pl

Using the results from the analysis script we can identify the hosts and their success percentages. We opted to disable all hosts with a success rate below 20%. The above-mentioned script is in /afs/slac.stanford.edu/package/pinger/tulip and performs this task: it uses the LWP package to fetch the analysis web page, download the results and parse the output to identify the faulty landmarks. It is run nightly (see the trscrontab) before sites.xml is created.
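
The script itself is not reproduced here; the sketch below only illustrates the approach. The results URL, the line format of the analysis output, and the database credentials are all assumptions rather than the real values.

#!/usr/bin/perl
# Sketch of the tuning step: fetch the analysis results, find landmarks with a
# success rate below 20%, and disable them in the tulip database.
use strict;
use warnings;
use LWP::Simple qw(get);
use DBI;

my $THRESHOLD = 20;   # disable landmarks with success below this percentage
# Placeholder URL for the analysis results page.
my $url = 'http://www.slac.stanford.edu/comp/net/wan-mon/tulip/landmark-success.txt';

my $page = get($url) or die "Could not fetch $url\n";

# Placeholder credentials; the real values are in escrow.
my $dbh = DBI->connect('DBI:mysql:database=tulip;host=localhost',
                       'tulip_user', 'secret', { RaiseError => 1 });
my $disable = $dbh->prepare("UPDATE landmarks SET enabled = '0' WHERE ipv4Addr = ?");

for my $line (split /\n/, $page) {
    # Assumed line format: "<ipv4Addr> <success%> <reason>", e.g. "141.22.213.35 12.5 timeout"
    next unless $line =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s+([\d.]+)/;
    my ($ip, $success) = ($1, $2);
    if ($success < $THRESHOLD) {
        $disable->execute($ip);
        print "Disabled $ip (success $success%)\n";
    }
}
$dbh->disconnect;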

  landmark-laundering.pl

Having dealt with disabling unresponsive landmarks, we need to address another question: will any of those hosts come back, and if so how will we know? Should we disable them forever, or should we build a mechanism to bring them back? To solve this problem we devised a notification process that helps us identify disabled landmarks which are up again. The above-mentioned script is in /afs/slac.stanford.edu/package/pinger/tulip

This script performs the following actions to determine whether a disabled landmark is up or down (a sketch of the logic follows the list).
  1. It pings the host name of each landmark which is disabled.
    1. If the ping is successful it marks the landmark as up.
  2. Otherwise it checks whether the IP address corresponding to that host is pingable.
    1. If the IP replies it marks the landmark as up and notes that the IP (rather than the name) responded.
    2. If the ping fails it tries to connect to a TCP port; we added this step because many hosts block ping requests.
  3. If all the above steps fail, the host is marked as not responding.
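
The following is a rough sketch of that decision logic in Perl using Net::Ping. The TCP port probed (80 here), the timeout, and the example host/IP are assumptions, since they are not specified above.

#!/usr/bin/perl
# Sketch of the up/down test for one disabled landmark: ping by name, then by IP,
# then fall back to a TCP connect for hosts that block ICMP.
use strict;
use warnings;
use Net::Ping;

sub check_landmark {
    my ($host, $ip) = @_;
    my $timeout = 5;   # seconds (assumed)

    # ICMP ping requires root; the production script runs from the pinger account's crontab.
    my $icmp = Net::Ping->new('icmp');
    return 'OK (name)' if $icmp->ping($host, $timeout);   # step 1: ping the host name
    return 'OK (IP)'   if $icmp->ping($ip,   $timeout);   # step 2: ping the IP address
    $icmp->close;

    # Step 3: many hosts block ICMP, so try a TCP connect instead (port 80 is an assumption).
    my $tcp = Net::Ping->new('tcp');
    $tcp->port_number(80);
    my $up = $tcp->ping($ip, $timeout);
    $tcp->close;
    return $up ? 'OK (tcp)' : 'No Response';
}

# Example call (hostname and IP are placeholders):
print check_landmark('landmark.example.org', '192.0.2.1'), "\n";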

To support these steps we have modified the tulip database by adding a new table named maintenance. The table is described as follows.

+----------+-------------+------+-----+---------+-------+
| Field    | Type        | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| ipv4Addr | varchar(15) |      | PRI |         |       |
| downDays | int(11)     | YES  |     | NULL    |       |
| upDays   | int(11)     | YES  |     | NULL    |       |
| comments | varchar(11) | YES  |     | NULL    |       |
+----------+-------------+------+-----+---------+-------+

If the disabled landmark replies and passes the above checks, we increment upDays by 1, reset downDays to 0 and record in the comments field whether the host name or only the IP responded. In the same manner, if the host fails the checks we increment downDays by 1 and reset upDays to 0. With this counter technique we have, at any instant, an at-a-glance view and can re-enable a host once it has been up for a sufficiently long period.
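
A minimal sketch of this bookkeeping, assuming DBI, the maintenance table above, and the status strings from the check sketched earlier (credentials and the example address are placeholders):

#!/usr/bin/perl
# Sketch of the counter update for one landmark, given the result of the up/down check.
use strict;
use warnings;
use DBI;

sub record_result {
    my ($dbh, $ip, $status) = @_;   # $status e.g. 'OK (name)', 'OK (IP)' or 'No Response'
    if ($status =~ /^OK/) {
        # Host replied: bump upDays, reset downDays, note how it replied.
        $dbh->do("UPDATE maintenance SET upDays = upDays + 1, downDays = 0, comments = ? WHERE ipv4Addr = ?",
                 undef, $status, $ip);
    } else {
        # Host still down: bump downDays, reset upDays.
        $dbh->do("UPDATE maintenance SET downDays = downDays + 1, upDays = 0, comments = ? WHERE ipv4Addr = ?",
                 undef, $status, $ip);
    }
}

# Placeholder credentials and address.
my $dbh = DBI->connect('DBI:mysql:database=tulip;host=localhost',
                       'tulip_user', 'secret', { RaiseError => 1 });
record_result($dbh, '192.0.2.1', 'No Response');
$dbh->disconnect;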

An example of the populated table is as follows

Hosts which are up

+-----------------+----------+--------+-----------+
| ipv4Addr        | downDays | upDays | comments  |
+-----------------+----------+--------+-----------+
| 88.255.65.220   |        0 |      1 | OK (name) |
| 203.128.4.48    |        0 |      1 | OK (name) |
| 203.178.133.11  |        0 |      1 | OK (name) |
| 203.178.133.3   |        0 |      1 | OK (name) |
| 198.7.242.41    |        0 |      1 | OK (name) |
| 198.7.242.42    |        0 |      1 | OK (name) |
| 200.133.215.142 |        0 |      1 | OK (name) |
+-----------------+----------+--------+-----------+

Hosts which are down
+-----------------+----------+-------------+
| ipv4Addr        | downDays | comments    |
+-----------------+----------+-------------+
| 69.110.237.116  |        1 | No Response |
| 61.252.60.30    |        1 | No Response |
| 216.165.109.82  |        1 | No Response |
| 213.131.1.102   |        1 | No Response |
| 213.19.160.194  |        1 | No Response |
| 216.165.109.79  |        1 | No Response |
| 216.165.109.81  |        1 | No Response |
| 210.125.84.16   |        1 | No Response |
| 212.201.44.81   |        1 | No Response |
| 212.201.44.82   |        1 | No Response |
| 210.123.39.103  |        1 | No Response |
| 128.10.19.53    |        1 | No Response |
| 137.138.24.234  |        1 | No Response |
+-----------------+----------+-------------+
 

The script is run from the trscrontab after tulip-tuning.pl. It produces output of the form:

 
OK(name):Disable host 61.252.60.30 responded its been up for past 18 days
OK(name):Disable host planet2.scs.cs.nyu.edu responded its been up for past 18 days
OK(name):Disable host plab-2.sinp.msu.ru responded its been up for past 18 days
OK(name):Disable host planetlab-1.amst.nodes.planet-lab.org responded its been up for past 18 days

This is sent by email to Qasim, who manually re-enables hosts that have been back up for 20 days. When we have experimented with this and understand it better, we may automate the re-enabling.
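
If and when the re-enabling is automated, the core of the job might look something like the sketch below (20-day threshold as above; credentials are placeholders and this is not production code).

#!/usr/bin/perl
# Sketch of a possible automated re-enable: find landmarks that have been back up
# for at least 20 days and flip their enabled flag. Not currently run in production.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=tulip;host=localhost',
                       'tulip_user', 'secret', { RaiseError => 1 });

my $sth = $dbh->prepare("SELECT ipv4Addr, upDays FROM maintenance WHERE upDays >= 20");
$sth->execute;
while (my ($ip, $up) = $sth->fetchrow_array) {
    print "$ip has been up for $up days; re-enabling\n";
    $dbh->do("UPDATE landmarks SET enabled = '1' WHERE ipv4Addr = ?", undef, $ip);
}
$dbh->disconnect;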

There is a strong possibility that a host will respond to ping but not send pings when requested, i.e. not act as a landmark (e.g. 212.201.44.82).
