2) Automated Process (updated by trscrontab job running in pinger@pinger.slac.stanford.edu)

...

Using the results from the analysis script, we can identify the hosts and their success percentages. We opted to disable every host whose success rate was below 20%. The script mentioned above is in /afs/slac.stanford.edu/package/pinger/tulip and performs the listed functions. It uses the LWP package to access the web page, download the file, and parse the output to find the faulty landmarks. It runs nightly (see the trscrontab) before sites.xml is created.

  landmark-laundering.pl
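
The filtering step can be sketched as follows. This is a minimal illustration in Python, not the Perl code of landmark-laundering.pl. It assumes, purely for illustration, that the downloaded analysis output has one "host success-percentage" pair per line; the real file format, the function name find_faulty_landmarks and the example.org hosts are all hypothetical. Only the 20% threshold comes from the text above.

```python
def find_faulty_landmarks(report_text, threshold=20.0):
    """Return hosts whose success percentage is below the threshold.

    Assumed (hypothetical) input format: one "<host> <success%>" pair per line.
    """
    faulty = []
    for line in report_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        host, pct = parts
        try:
            success = float(pct.rstrip("%"))
        except ValueError:
            continue  # skip lines whose second field is not a number
        if success < threshold:
            faulty.append(host)
    return faulty


# Hypothetical sample data, not taken from a real analysis run.
sample_report = """\
landmark-a.example.org 95.0
landmark-b.example.org 12.5%
landmark-c.example.org 20.0
"""
faulty_hosts = find_faulty_landmarks(sample_report)
```

A host at exactly 20% is kept, since the rule stated above is "below 20%".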

With the clean-up process in place, we need to address another question: will any of those hosts come back, and if so, how will we know? Should we disable them forever, or should we build a mechanism to bring them back? To solve this problem we devised a notification process that helps us identify the landmarks which are up.

...

Hosts which are up

+-----------------+----------+--------+-----------+
| ipv4Addr        | downDays | upDays | comments  |
+-----------------+----------+--------+-----------+
| 88.255.65.220   |        0 |      1 | OK (name) |
| 203.128.4.48    |        0 |      1 | OK (name) |
| 203.178.133.11  |        0 |      1 | OK (name) |
| 203.178.133.3   |        0 |      1 | OK (name) |
| 198.7.242.41    |        0 |      1 | OK (name) |
| 198.7.242.42    |        0 |      1 | OK (name) |
| 200.133.215.142 |        0 |      1 | OK (name) |
+-----------------+----------+--------+-----------+

Hosts which are down
+-----------------+----------+-------------+
| ipv4Addr        | downDays | comments    |
+-----------------+----------+-------------+
| 69.110.237.116  |        1 | No Response |
| 61.252.60.30    |        1 | No Response |
| 216.165.109.82  |        1 | No Response |
| 213.131.1.102   |        1 | No Response |
| 213.19.160.194  |        1 | No Response |
| 216.165.109.79  |        1 | No Response |
| 216.165.109.81  |        1 | No Response |
| 210.125.84.16   |        1 | No Response |
| 212.201.44.81   |        1 | No Response |
| 212.201.44.82   |        1 | No Response |
| 210.123.39.103  |        1 | No Response |
| 128.10.19.53    |        1 | No Response |
| 137.138.24.234  |        1 | No Response |
+-----------------+----------+-------------+
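
One way the downDays and upDays columns in the tables above could be maintained is sketched below in Python. It assumes the counters track consecutive days and reset when the host changes state. That assumption matches the sample rows but is not confirmed by this page, and the function name update_counters and the 192.0.2.1 address are hypothetical.

```python
def update_counters(record, responded):
    """Return a copy of a host record updated with today's probe result.

    Assumption: upDays and downDays count consecutive days, so a change
    of state resets the opposite counter to zero.
    """
    updated = dict(record)
    if responded:
        updated["upDays"] = record.get("upDays", 0) + 1
        updated["downDays"] = 0
        updated["comments"] = "OK (name)"
    else:
        updated["downDays"] = record.get("downDays", 0) + 1
        updated["upDays"] = 0
        updated["comments"] = "No Response"
    return updated


# Hypothetical record for a disabled host that answers again today.
host = {"ipv4Addr": "192.0.2.1", "downDays": 3, "upDays": 0}
host_after = update_counters(host, responded=True)
```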
 

The script is run from the trscrontab after tulip-tuning.pl. It produces output of the form:

OK(name):Disable host 61.252.60.30 responded its been up for past 18 days
OK(name):Disable host planet2.scs.cs.nyu.edu responded its been up for past 18 days
OK(name):Disable host plab-2.sinp.msu.ru responded its been up for past 18 days
OK(name):Disable host planetlab-1.amst.nodes.planet-lab.org responded its been up for past 18 days

This output is sent by email to Qasim, who manually re-enables hosts that have been back up for 20 days. Once we have experimented and understand this better, we may automate the re-enabling.
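
If the re-enabling is automated later, the selection step might look like the Python sketch below. It is an illustration only: the message wording and the 20-day rule are taken from the text above, while the function name hosts_ready_to_reenable is hypothetical and the real script may format its output differently. Whitespace is normalized first so that messages wrapped across lines in the email are still matched.

```python
import re

# Matches one notification message; wording copied from the sample output.
MESSAGE = re.compile(
    r"Disable host (\S+) responded its been up for past (\d+) days"
)


def hosts_ready_to_reenable(report_text, min_up_days=20):
    """Return disabled hosts that have been up for at least min_up_days."""
    # Collapse line breaks and runs of spaces so wrapped messages still match.
    normalized = " ".join(report_text.split())
    return [
        host
        for host, days in MESSAGE.findall(normalized)
        if int(days) >= min_up_days
    ]


sample_output = """\
OK(name):Disable host 61.252.60.30 responded its been up
for past 18 days OK(name):Disable host planet2.scs.cs.nyu.edu responded its
been up for past 18 days
"""
ready = hosts_ready_to_reenable(sample_output)
```

With the sample above, no host qualifies yet, because both have been up for 18 days and the threshold is 20.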

Note that responding to ping does not mean a host works as a landmark: there is a strong possibility that a host will respond to ping but not send pings when requested (e.g. 212.201.44.82).