After extensive testing we concluded that PlanetLab nodes are highly unpredictable in terms of availability. For every access by TULIP we log the response for each landmark. From this log we generate a list of nodes with their success and failure percentages and the reasons for their failures. These percentages are generated by /afs/slac/package/netmon/tulip/tulip-log-analyze.pl and the results can be seen here
Using the results from the analysis script, we can identify the hosts and their success percentages. We opted to disable all hosts with a success rate below 20%. The above-mentioned script is in $tulipdir and performs the listed functions. It uses the LWP package to access the web page, download the file, and parse the output to find the faulty landmarks.
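The 20% cut-off can be sketched as follows. This is a minimal Python illustration, not the Perl script itself. The input layout (a mapping from host name to success and failure counts) and the function names are assumptions made for the example, as is the choice to treat a host with no logged accesses as 0% successful.

```python
from typing import Dict, List, Tuple

# Cut-off stated in the text: hosts below this success rate are disabled.
THRESHOLD_PERCENT = 20.0


def success_percent(successes: int, failures: int) -> float:
    """Return the success rate as a percentage of all logged accesses."""
    total = successes + failures
    if total == 0:
        # Assumption: a host with no logged accesses counts as 0% successful.
        return 0.0
    return 100.0 * successes / total


def hosts_to_disable(
    counts: Dict[str, Tuple[int, int]],
    threshold: float = THRESHOLD_PERCENT,
) -> List[str]:
    """Return the sorted host names whose success rate is below the threshold.

    `counts` maps a host name to (successes, failures); this layout is
    hypothetical and stands in for the parsed analysis output.
    """
    return sorted(
        host
        for host, (ok, bad) in counts.items()
        if success_percent(ok, bad) < threshold
    )


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    sample = {
        "node-a.example.org": (1, 9),
        "node-b.example.org": (5, 5),
    }
    print(hosts_to_disable(sample))
```

The comparison is strict, so a host at exactly 20% stays enabled, which matches "less than 20%" in the text.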
Having solved the clean-up problem, we faced another question: would any of those hosts come back, and if so, how would we know? Should we disable them forever, or should we build a mechanism to bring them back? To solve this problem we devised a straightforward mechanism: a notification process that helps us identify the landmarks that are up again.
This script performs the following actions to identify whether landmarks are up or down.
To achieve the above-mentioned steps, we modified the tulip database by creating a new table named maintenance. The description of the table is as follows.
+----------+-------------+------+-----+---------+-------+
| Field    | Type        | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| ipv4Addr | varchar(15) |      | PRI |         |       |
| downDays | int(11)     | YES  |     | NULL    |       |
| upDays   | int(11)     | YES  |     | NULL    |       |
| comments | varchar(11) | YES  |     | NULL    |       |
+----------+-------------+------+-----+---------+-------+
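As a rough illustration of how such a table might be used, the sketch below recreates the same columns in an in-memory SQLite database (standing in for the MySQL tulip database) and updates the counters after a daily check. The meaning of downDays and upDays is not spelled out here. The sketch assumes they count consecutive days a landmark was found down or up, and the helper names open_db and record_daily_check are hypothetical.

```python
import sqlite3

# Same columns as the maintenance table described above; SQLite is used here
# only so the sketch is self-contained (the real database is MySQL).
SCHEMA = """
CREATE TABLE IF NOT EXISTS maintenance (
    ipv4Addr VARCHAR(15) NOT NULL PRIMARY KEY,
    downDays INTEGER,
    upDays   INTEGER,
    comments VARCHAR(11)
)
"""


def open_db(path=":memory:"):
    """Open a database and make sure the maintenance table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_daily_check(conn, ipv4_addr, is_up):
    """Update the counters for one landmark and return (downDays, upDays).

    Assumption: the counters track consecutive days, so a day in one state
    resets the counter for the other state.
    """
    row = conn.execute(
        "SELECT downDays, upDays FROM maintenance WHERE ipv4Addr = ?",
        (ipv4_addr,),
    ).fetchone()
    down, up = (row[0] or 0, row[1] or 0) if row else (0, 0)

    if is_up:
        down, up = 0, up + 1
    else:
        down, up = down + 1, 0

    if row is None:
        conn.execute(
            "INSERT INTO maintenance (ipv4Addr, downDays, upDays) "
            "VALUES (?, ?, ?)",
            (ipv4_addr, down, up),
        )
    else:
        conn.execute(
            "UPDATE maintenance SET downDays = ?, upDays = ? "
            "WHERE ipv4Addr = ?",
            (down, up, ipv4_addr),
        )
    conn.commit()
    return down, up
```

Under this assumption, a landmark with a growing upDays value would be a candidate for re-enabling, though the threshold for doing so is not stated in the text.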