...

  • Should the reflector fetch sites.txt or keep a local copy of sites.txt, and what changes should be made to sites.txt?
    • A new parameter should be added to sites.txt to indicate tier0 or tier1. The region of tier1 sites also needs to be specified in sites.txt.
  • A separate thread should be used for each landmark, and semaphores should be used for locking so that data from different threads does not intermix.
  • There should be a limit on the number of threads that can be launched at a time (say 10).
  • Should there be extra logging on the reflector, or can we rely on the standard web logs, which record each query including the time stamp and the client name? What else they log depends on whether the request is a GET or a POST.
  • Where are the results parsed? This could be in the reflector or in the Java client. Parsing in the client distributes the parsing load, reduces the load on the reflector, and simplifies the CGI script.
  • What should happen if a landmark responds with bad data? Should the reflector process the error or send the raw data back? Since there will be some anomalies, I suspect the reflector will need to return the full response and in any case inform the user, so initially the client will process the response and spot errors. Also, if the client parses the result it will probably be able to spot problems easily.
  • Security needs special consideration, since the script ultimately has to be deployed at SLAC (Perl taint option, warning option, the 3-argument open method, etc.).
  • We need to agree on a common format for the exchange of data.
  • A blacklisting mechanism is needed for malicious hosts.

After discussing with Yee and Booker it was seen that forks may be too complicated. The version of Perl at SLAC did not support threading, and the security people will not allow forks running inside a CGI script, so I had to come up with an alternative. The solution was to use asynchronous IO: a bunch of requests can be sent to the landmarks without waiting for the responses. The LWP::Parallel library provides this functionality, since it supports asynchronous IO. Currently it is not installed, so I am using a local version in my home directory. Ultimately this module has to be installed on the production server.
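A minimal sketch of the asynchronous-IO approach with LWP::Parallel might look like the following. The landmark URLs and the limits shown are placeholders, not the actual values used by reflector.cgi:

```perl
#!/usr/bin/perl -Tw
# Sketch: send requests to all landmarks concurrently without forking,
# using LWP::Parallel::UserAgent's asynchronous IO.
use strict;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Placeholder landmark URLs; the real ones come from sites.txt/sites.xml.
my @landmarks = ('http://landmark1.example/ping', 'http://landmark2.example/ping');

my $pua = LWP::Parallel::UserAgent->new();
$pua->max_req(10);   # cap on simultaneous connections
$pua->timeout(30);   # per-request deadline in seconds

# Register every request up front; registration does not block.
foreach my $url (@landmarks) {
    $pua->register(HTTP::Request->new(GET => $url));
}

# wait() drives all the connections concurrently and returns once every
# request has completed or timed out.
my $entries = $pua->wait();
foreach my $entry (values %$entries) {
    my $res = $entry->response;
    print $res->request->url, " => ", $res->code, "\n";
}
```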

I have implemented most of the functionality, and the script is running fine. I will have to take measures to make the script more secure so that it cannot be used as a platform to launch DDoS attacks, by limiting the number of concurrent reflector.cgi processes to 10. The script also produces customized messages (such as "request timed out" or "connection failed") so that the TULIP client can differentiate between the various kinds of error conditions. There is also a crude blacklisting mechanism so that particular IP addresses can be blocked.
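The crude blacklisting mechanism could be as simple as the following sketch. The file name and its one-IP-per-line format are assumptions for illustration; the actual mechanism may differ:

```perl
# Sketch of a crude IP blacklist check for a CGI script. The blacklist
# file path and format (one IP address per line) are assumptions.
use strict;

sub is_blacklisted {
    my ($ip, $blacklist_file) = @_;
    open(my $fh, '<', $blacklist_file) or return 0;  # no list => allow all
    while (my $line = <$fh>) {
        chomp $line;
        return 1 if $line eq $ip;   # exact match blocks the client
    }
    close($fh) or die "close failed: $!";
    return 0;
}

# REMOTE_ADDR is set by the web server for every CGI request.
if (is_blacklisted($ENV{REMOTE_ADDR} || '', '/path/to/blacklist.txt')) {
    print "Status: 403 Forbidden\r\n\r\n";
    exit;
}
```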

...

There are two scripts: reflector.cgi and EventHandler.pm. Both use tainting (-T), warnings (-w) and use strict, use the 3-argument version of open, and every open and close has a die or its equivalent. EventHandler.pm is called by reflector.cgi. The CGI scripts are deployed in the path /afs/slac.stanford.edu/g/www/cgi-wrap-bin/net/shahryar/smokeping/.
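The hardening conventions listed above can be illustrated in a few lines. The input pattern and log path here are placeholders, not the actual code from reflector.cgi:

```perl
#!/usr/bin/perl -Tw
# Illustration of the conventions described above: taint mode (-T),
# warnings (-w), use strict, 3-argument open, and a die on every
# open/close. The whitelist pattern and file path are placeholders.
use strict;

# Under taint mode, CGI input must be untainted via an explicit
# whitelist capture before it can reach the filesystem or a shell.
my ($target) = ($ENV{QUERY_STRING} || '') =~ /target=([\w.\-]+)/;
die "bad or missing target\n" unless defined $target;

# 3-argument open: the mode is separate, so a hostile filename can
# never smuggle in a pipe or redirection.
open(my $log, '>>', '/scratch/tulip_log') or die "open failed: $!";
print $log scalar(localtime), " target=$target\n";
close($log) or die "close failed: $!";
```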

...

The function parameter enables one to ping (function=ping, the default), print out usage information (function=help), print out the log file (function=log), print out an HTML table of the landmarks from the TULIP database (function=landmarks), and analyze the log file (function=analyze). The analyze function has a sub-option: ability=0 analyzes the logs for disabled landmarks, while ability=1 (the default) analyzes log entries excluding those for disabled landmarks. The landmarks function displays a table of landmarks from the TULIP database and can be invoked using this link (function=landmarks). It has an additional parameter, out, which toggles the output between HTML and CSV. Using this link, one can easily look at the landmarks present in the database, along with their most important properties, such as tracerouteURL, pingURL, longitude, latitude, city, country and hostname.
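One way to dispatch on the function parameter is a simple table with the standard CGI module, as sketched below. The handler subroutine names are illustrative, not the actual subroutines in reflector.cgi:

```perl
# Sketch of dispatching the function parameter with CGI.pm.
# Handler names (do_ping etc.) are hypothetical placeholders.
use strict;
use CGI;

my $q        = CGI->new;
my $function = $q->param('function') || 'ping';   # ping is the default

my %dispatch = (
    ping      => \&do_ping,
    help      => \&print_usage,
    log       => \&print_log,
    landmarks => \&print_landmarks,   # would honour out=html|csv
    analyze   => \&analyze_log,       # would honour ability=0|1
);

my $handler = $dispatch{$function}
    or die "unknown function: $function\n";
$handler->($q);
```

Rejecting unknown values up front also helps under taint mode, since only known-good function names ever reach the rest of the script.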

...

  • Country names are defined by the Mapland database since this is used to produce our maps and we cannot modify it.  Usually (but not always) it is in agreement with UN standards.
  • The country names (and regions) in the PingER database can be found here.
  • Regions are defined as given in the PingER NodeDetails Oracle database.
  • A region-to-country mapping can be found here. This mapping is created using the script createCountryRegion.pl, which is placed in /afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/TULIP/. This script uses the NODE_DETAILS hash to determine the regions of various countries.

...

There are about 60 SLAC/Looking Glass landmarks and about 156 PlanetLab landmarks. We filter the landmarks nightly using tulip-tuning.pl to disable non-responding or poorly responding landmarks, and re-enable them when they are working well. The PlanetLab landmarks send 10 pings very quickly, whereas the SLAC/Looking Glass landmarks send five 56-byte pings with one second between them; they also wait for a deadline of 30 seconds for the pings to be replied to.

...

To reduce the network impact and the initial rough-estimate time, we also break the landmarks into two tiers. Tier0 landmarks are used to identify the region of the target; tier1 hosts for that region can then be used to locate the target more exactly. Tier0 hosts are chosen as being at the edges of the region, well connected, highly reliable and quick to respond. We currently only define tier0 hosts for North America and Europe; in other regions all the landmarks are regarded as tier0. There are about 8 tier0 hosts for North America and 4 for Europe. This reduces the number of landmarks to make measurements from with a tier0 request, since there are over 100 landmarks in both North America and Europe. When sending a query to the reflector we can characterize it into three tiers: 0, 1 and 2. These tiers are matched against sites.xml nodes, and the relevant tiering information is extracted from the tag named "tier".
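Extracting the tier information from sites.xml could be sketched as below. The node layout (a list of node elements with tier, hostname and region fields) is an assumption based only on the "tier" tag mentioned above; the real schema may differ:

```perl
# Sketch: pull tier information out of sites.xml with XML::Simple.
# The assumed structure (node elements carrying tier, hostname, region)
# is hypothetical.
use strict;
use XML::Simple;

my $sites = XMLin('sites.xml', ForceArray => ['node']);
foreach my $node (@{ $sites->{node} }) {
    my $tier = $node->{tier};           # 0, 1 or 2
    next unless defined $tier && $tier == 0;
    print "tier0 landmark: $node->{hostname} ($node->{region})\n";
}
```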

...

Unfortunately the SLAC landmark server is really designed to be executed from a browser, which will render the output; see for example http://www.slac.stanford.edu/cgi-wrap/traceroute.pl?function=ping&target=www.fnal.gov and view the source. Thus not only is it verbose (~3.3 KBytes per successful request), but EventHandler.pm also has to carefully parse this human-readable output to find the relevant ping output lines with the RTTs and losses. In addition, the server uses the standard system ping command, whose output varies in its details between OSes and releases, so this also has to be accommodated in the parsing.
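The kind of parsing involved can be sketched with a pair of tolerant regular expressions. These patterns are illustrative of the approach, not the actual code in EventHandler.pm:

```perl
# Sketch of parsing ping summary lines while tolerating the small
# format differences between OS ping implementations. Illustrative only.
use strict;

sub parse_ping_summary {
    my ($text) = @_;
    my %r;
    # Linux:   "10 packets transmitted, 9 received, 10% packet loss"
    # Others:  "10 packets transmitted, 9 packets received, 10% packet loss"
    if ($text =~ /(\d+)\s+packets\s+transmitted,\s+(\d+)(?:\s+packets)?\s+received,\s+([\d.]+)%\s+packet\s+loss/) {
        @r{qw(sent received loss)} = ($1, $2, $3);
    }
    # "rtt min/avg/max = 1.2/3.4/5.6 ms"; some pings append /mdev or /stddev.
    if ($text =~ m{min/avg/max[^=]*=\s*([\d.]+)/([\d.]+)/([\d.]+)}) {
        @r{qw(min avg max)} = ($1, $2, $3);
    }
    return \%r;   # empty hash => nothing recognizable, treat as bad data
}
```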

Logging 

In addition to the normal web server (Apache) logging, we use Log4perl for logging. The configuration file is very simple. The following types of error messages can be found in the log file (at /scratch/tulip_log on wanmon.slac.stanford.edu, also available here), together with time-stamped records of all requests, the requesting host, and the target.
2007/09/03 20:02:25 ERROR> EventHandler.pm:70 EventHandler::on_failure - Landmark=http://128.6.192.158,\
Client=134.79.117.29, failed to connect response code 500<BR>
2007/09/03 20:02:34 ERROR> EventHandler.pm:142 EventHandler::parseData - Landmark=http://129.22.150.90,\
Client=134.79.117.29, 10 packets transmitted, 0 received, 100% packet loss, rtt min/avg/max = 0/0/0:
2007/09/03 20:09:09 ERROR> EventHandler.pm:115 EventHandler::parseData - Landmark=http://128.143.137.250,\
Client=134.79.117.29, request timed out: To 134.79.16.9 timed out
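A minimal Log4perl setup producing records in roughly the shape shown above might look like this; the exact layout string and appender settings are assumptions, not the actual configuration file:

```perl
# Sketch of a minimal Log::Log4perl configuration consistent with the
# log records above. The layout and file path shown are assumptions.
use strict;
use Log::Log4perl;

my $conf = q(
    log4perl.logger                    = INFO, LOGFILE
    log4perl.appender.LOGFILE          = Log::Log4perl::Appender::File
    log4perl.appender.LOGFILE.filename = /scratch/tulip_log
    log4perl.appender.LOGFILE.layout   = PatternLayout
    log4perl.appender.LOGFILE.layout.ConversionPattern = %d{yyyy/MM/dd HH:mm:ss} %p> %F{1}:%L %M - %m%n
);
Log::Log4perl::init(\$conf);

my $log = Log::Log4perl->get_logger('EventHandler');
$log->error('Landmark=http://128.6.192.158, Client=134.79.117.29, '
          . 'failed to connect response code 500');
```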

...