
Using the results from the analysis script we can identify the hosts and their success percentages. We opted to disable all hosts with a success rate below 20% and to enable those with a success rate above 20%. The script lives in /afs/slac.stanford.edu/package/pinger/tulip and performs the functions listed above: it uses the LWP package to fetch the web page, downloads the analyzed tulip log file for the enabled landmarks, saves it to a file and parses the output; landmarks below the 20% success threshold (the default) are disabled in the database, and landmarks above 20% are re-enabled. It runs nightly (see the trscrontab), before sites.xml and sites-disabled.xml are created, and uses the tulip log files for the last 3 days.
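The selection rule above can be sketched as follows. The real script is Perl and uses LWP; this Python fragment, with made-up host names and a hypothetical `partition_landmarks` helper, only illustrates the 20% threshold logic, not the log download or the database update:

```python
SUCCESS_THRESHOLD = 20.0  # percent; the default cutoff described above

def partition_landmarks(success_by_host, threshold=SUCCESS_THRESHOLD):
    """Split hosts into those to disable (below threshold) and
    those to (re-)enable (above threshold)."""
    to_disable = sorted(h for h, pct in success_by_host.items() if pct < threshold)
    to_enable = sorted(h for h, pct in success_by_host.items() if pct > threshold)
    return to_disable, to_enable

# Illustrative success percentages over the last 3 days of tulip logs:
rates = {"lm1.example.net": 95.0, "lm2.example.net": 12.5, "lm3.example.net": 40.0}
disable, enable = partition_landmarks(rates)
print(disable)  # ['lm2.example.net']
print(enable)   # ['lm1.example.net', 'lm3.example.net']
```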

  landmark-laundering.pl

Having addressed the clean-up process, we need to handle another situation: will any of those hosts come back, and if so, how will we know? Should we disable them forever, or should we build some mechanism to bring them back? To solve this problem we devised a notification process that helps us identify which disabled landmarks are up again. The script, landmark-laundering.pl, is in /afs/slac.stanford.edu/package/pinger/tulip.

This script performs the following actions to identify whether a landmark is up or down:
  1. It pings the host name of each disabled landmark.
    1. If the ping succeeds, it marks the landmark as up.
  2. Otherwise it checks whether the IP address corresponding to that host is pingable.
    1. If the IP replies, it marks the landmark as up and notes that the IP responded.
    2. If the ping fails, it tries to connect to a port; we added this step because many hosts block ping requests.
  3. If all of the above steps fail, we mark the host as not responding.
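The probe sequence above can be sketched in Python (landmark-laundering.pl itself is Perl; the port number, timeout, and comment strings here are illustrative assumptions, not the script's actual values):

```python
import socket
import subprocess

def classify(name_ping_ok, ip_ping_ok, port_ok):
    """Decision rule for the three steps above."""
    if name_ping_ok:
        return "OK (name)"   # step 1: host name answered ping
    if ip_ping_ok:
        return "OK (ip)"     # step 2.1: only the raw IP answered
    if port_ok:
        return "OK (port)"   # step 2.2: ICMP blocked, but a TCP port is open
    return "No Response"     # step 3: every probe failed

def ping(target, timeout=3):
    # One ICMP echo request; '-c'/'-W' are the Linux ping flags.
    return subprocess.call(
        ["ping", "-c", "1", "-W", str(timeout), target],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def tcp_reachable(target, port=80, timeout=3):
    try:
        with socket.create_connection((target, port), timeout=timeout):
            return True
    except OSError:
        return False

def probe_landmark(hostname, ip=None, port=80):
    """Run the probes in order, stopping at the first success."""
    if ping(hostname):
        return classify(True, False, False)
    if ip is not None and ping(ip):
        return classify(False, True, False)
    return classify(False, False, tcp_reachable(hostname, port))
```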

To support the steps above we modified the tulip database, creating a new table named maintenance. The table is described as follows.

Code Block

+----------+-------------+------+-----+---------+-------+
| Field    | Type        | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| ipv4Addr | varchar(15) |      | PRI |         |       |
| downDays | int(11)     | YES  |     | NULL    |       |
| upDays   | int(11)     | YES  |     | NULL    |       |
| comments | varchar(11) | YES  |     | NULL    |       |
+----------+-------------+------+-----+---------+-------+

If the disabled landmark replies and passes the tests above, we increment upDays by 1, reset downDays to 0, and record in the comments whether the host name passed the test or only the IP responded. Likewise, if the host fails the tests we increment downDays by 1 and reset upDays to 0. With this counter technique we have, at any instant, a bird's-eye view and can re-enable a host once it has been up for a long enough period.
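The counter update can be sketched against a local copy of the maintenance table. The real table lives in the tulip MySQL database; sqlite3 is used here only so the example is self-contained, and the comment strings are illustrative:

```python
import sqlite3

# Local stand-in for the tulip maintenance table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE maintenance (
    ipv4Addr varchar(15) PRIMARY KEY,
    downDays int, upDays int, comments varchar(11))""")
conn.execute("INSERT INTO maintenance VALUES ('88.255.65.220', 3, 0, '')")

def record_probe(conn, ip, comment):
    """Counter rule: on success, upDays += 1 and downDays = 0;
    on failure ('No Response'), downDays += 1 and upDays = 0."""
    if comment == "No Response":
        conn.execute("UPDATE maintenance SET downDays = downDays + 1, "
                     "upDays = 0, comments = ? WHERE ipv4Addr = ?", (comment, ip))
    else:
        conn.execute("UPDATE maintenance SET upDays = upDays + 1, "
                     "downDays = 0, comments = ? WHERE ipv4Addr = ?", (comment, ip))

record_probe(conn, "88.255.65.220", "OK (name)")
row = conn.execute("SELECT downDays, upDays, comments FROM maintenance "
                   "WHERE ipv4Addr = '88.255.65.220'").fetchone()
print(row)  # (0, 1, 'OK (name)')
```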

An example of the populated table is as follows.

Code Block

Hosts which are up

+-----------------+----------+--------+-----------+
| ipv4Addr        | downDays | upDays | comments  |
+-----------------+----------+--------+-----------+
| 88.255.65.220   |        0 |      1 | OK (name) |
| 203.128.4.48    |        0 |      1 | OK (name) |
| 203.178.133.11  |        0 |      1 | OK (name) |
| 203.178.133.3   |        0 |      1 | OK (name) |
| 198.7.242.41    |        0 |      1 | OK (name) |
| 198.7.242.42    |        0 |      1 | OK (name) |
| 200.133.215.142 |        0 |      1 | OK (name) |
+-----------------+----------+--------+-----------+

Hosts which are down
+-----------------+----------+-------------+
| ipv4Addr        | downDays | comments    |
+-----------------+----------+-------------+
| 69.110.237.116  |        1 | No Response |
| 61.252.60.30    |        1 | No Response |
| 216.165.109.82  |        1 | No Response |
| 213.131.1.102   |        1 | No Response |
| 213.19.160.194  |        1 | No Response |
| 216.165.109.79  |        1 | No Response |
| 216.165.109.81  |        1 | No Response |
| 210.125.84.16   |        1 | No Response |
| 212.201.44.81   |        1 | No Response |
| 212.201.44.82   |        1 | No Response |
| 210.123.39.103  |        1 | No Response |
| 128.10.19.53    |        1 | No Response |
| 137.138.24.234  |        1 | No Response |
+-----------------+----------+-------------+
 

The script is run from the trscrontab after tulip-tuning.pl. It produces output of the form:

Code Block

 
OK(name):Disable host 61.252.60.30 responded its been up for past 18 days
OK(name):Disable host planet2.scs.cs.nyu.edu responded its been up for past 18 days
OK(name):Disable host plab-2.sinp.msu.ru responded its been up for past 18 days
OK(name):Disable host planetlab-1.amst.nodes.planet-lab.org responded its been up for past 18 days

This is sent by email to Qasim, who manually re-enables hosts that have been back up for 20 days. Once we have experimented with and better understood this process, we may automate the re-enabling.

There is a strong possibility that a host will respond to ping but not send pings when requested, i.e. not act as a landmark (e.g. 212.201.44.82). See cgi-wrap/scriptdoc.pl?name=landmark-laundering.pl for the script's documentation.