
Purpose

The main purpose of the TULIP Central Reflector is to proxy TULIP queries to PlanetLab's Scriptroute service. It may also be extended to issue all queries; this decision will depend on speed of execution and security, among other things. The PlanetLab Scriptroute service provides a cookie that works for a single IP address only, so all the requests will be issued from the Central Reflector and the responses will be sent back to the TULIP JNLP client. Here is a map of PlanetLab servers.

Implementation 

The TULIP Central Reflector will be a CGI script (reflector.cgi) deployed at SLAC. The TULIP client will issue a single request and the Reflector will probe all the landmarks in that region*[1] and return the results to the TULIP client. Probing the target site from more vantage points may give us a better estimate of its location.

Requirements

  • Should the reflector fetch sites.txt or have a local copy of sites.txt? What changes should be made to sites.txt?
    • A new parameter should be added to sites.txt to indicate tier0 or tier1. Also, the region of tier1 sites needs to be specified in sites.txt.
  • A separate thread should be used for each landmark, with semaphores used for locking so that data from different threads does not inter-mix.
  • There should be a limit on the number of threads that can be launched at a time (say 10).
  • Should there be extra logging on the reflector, or can we rely on the standard web logs, which record each query including the time stamp and the client name? What else is logged depends on whether the request is a GET or a POST.
  • Where are the results parsed: in the reflector or in the Java client? Parsing in the client distributes the parsing load, reduces the load on the reflector, and simplifies the CGI script.
  • What should happen if a landmark responds with bad data? (Should the reflector process the error or send the raw data back?) Since there will be some anomalies, the reflector will probably need to return the full response and in any case inform the user, so initially the client will process the response and spot errors. If the client parses the result it will probably also be able to spot problems easily.
  • Special consideration for security, as the script ultimately has to be deployed at SLAC (Perl taint option, warning option, special open method, etc.).
  • Need to agree on a common format for the exchange of data.
  • Needs a blacklisting mechanism for malicious hosts.

After discussing with Yee and Booker it was seen that forks may be too complicated: the version of Perl at SLAC did not support threading, and the security people will not allow forks running inside a CGI script. So an alternative was needed. The solution was to use asynchronous I/O, in which a batch of requests is sent to the landmarks without waiting for the responses. The LWP::Parallel library provides exactly this functionality. It is not currently installed, so I am using a local copy in my home directory; ultimately this module has to be installed on the production server.
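As an illustration, a minimal sketch of this asynchronous approach with LWP::Parallel::UserAgent follows; the landmark URLs, limits and timeouts are placeholders for this sketch, not the production settings.

use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Placeholder landmark URLs; in reflector.cgi these are built from
# sites.txt / newsites.txt as described below.
my @urls = (
    'http://www.slac.stanford.edu/cgi-wrap/traceroute.pl?function=ping&target=134.79.16.9',
    'http://pinger.fnal.gov/cgi-pub/traceroute.pl?function=ping&target=134.79.16.9',
);

my $pua = LWP::Parallel::UserAgent->new();
$pua->max_hosts(20);   # landmarks contacted simultaneously
$pua->max_req(5);      # parallel requests per landmark
$pua->timeout(2);      # per-request timeout (seconds)
$pua->redirect(1);     # follow redirects

# Register all requests up front; nothing blocks here.
foreach my $url (@urls) {
    if (my $err = $pua->register(HTTP::Request->new(GET => $url))) {
        warn $err->error_as_HTML;    # could not even queue the request
    }
}

# Wait for the responses, with an overall timeout on the whole batch.
my $entries = $pua->wait(100);
foreach my $key (keys %$entries) {
    my $res = $entries->{$key}->response;
    printf "Landmark=%s, %s\n", $res->request->url,
        $res->is_success ? 'ok' : 'failed to connect response code ' . $res->code;
}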

I have implemented most of the functionality and the script is running fine. I still have to take measures to make the script more secure, so that it cannot be used as a platform to launch DDoS attacks, by limiting the number of concurrent reflector.cgi processes to 10. The script also produces customized messages (such as "request timed out" or "connection failed") so that the TULIP client can differentiate between the various kinds of error conditions. There is also a blacklisting mechanism so that particular IP addresses can be blocked.

Implementation

There are two scripts: reflector.cgi and EventHandler.pm. Both use -T (tainting), warnings (-w), use strict, and the 3-parameter version of open, and all opens and closes have a die or its equivalent. EventHandler.pm is called by reflector.cgi. The scripts are deployed in /afs/slac.stanford.edu/g/www/cgi-wrap-bin/net/shahryar/smokeping/.
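A short fragment in that style, illustrating the three-argument open with die and a simple IP blacklist check; the blacklist file location and format are assumptions for this sketch.

#!/usr/bin/perl -T -w
use strict;

# Assumed blacklist file: one IP address per line, '#' starts a comment.
my $blacklist_file = '/tmp/reflector-blacklist.txt';

my %blocked;
open(my $fh, '<', $blacklist_file) or die "Cannot open $blacklist_file: $!";
while (my $line = <$fh>) {
    $line =~ s/#.*//;                                        # strip comments
    next unless $line =~ /^\s*(\d{1,3}(?:\.\d{1,3}){3})\s*$/;
    $blocked{$1} = 1;                                        # the capture is untainted
}
close($fh) or die "Cannot close $blacklist_file: $!";

# Refuse service to blacklisted clients (REMOTE_ADDR is set by the web server).
my $client = $ENV{REMOTE_ADDR} || '';
if ($blocked{$client}) {
    print "Content-type: text/plain\n\n", "Access denied\n";
    exit;
}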

Invocation 

The reflector script is called by a URL of the form:

http://www-wanmon.slac.stanford.edu/cgi-wrap/reflector.cgi?region=northamerica&target=134.79.16.9&tier=0

If the region is left out, all regions are assumed; if the tier is left out, all tiers are assumed.
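For illustration, the defaulting and untainting of these parameters might look like the following sketch; the use of CGI.pm and the 'all' defaults are assumptions about the implementation, not taken from the production script.

use strict;
use warnings;
use CGI;

my $q = CGI->new;

# Missing region => all regions; missing tier => all tiers.
my $region = $q->param('region') || 'all';
my $tier   = defined $q->param('tier') ? $q->param('tier') : 'all';

# Untaint the target: accept only a dotted-quad IP address in this sketch
# (the real script may also accept host names such as www.cern.ch).
my $target = $q->param('target') || '';
if ($target =~ /^(\d{1,3}(?:\.\d{1,3}){3})$/) {
    $target = $1;
} else {
    print $q->header('text/plain'), "Bad or missing target\n";
    exit;
}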

The script uses asynchronous I/O to talk simultaneously with up to 20 landmarks. Up to 5 copies of reflector.cgi can be running simultaneously.

For the PlanetLab landmarks an interpretive script is supplied with the target ($target) and the number of pings ($ping) to make. For the SLAC ping servers and the Looking Glass sites the landmark is accessed by the URL provided in the PingSite token of the sites.txt file (see below), e.g. http://www.slac.stanford.edu/cgi-wrap/nph-traceroute.pl?choice=yes&function=trace&target=$target.
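A hedged sketch of how the URL for a SLAC or Looking Glass landmark could be assembled from the PingSite token; the PlanetLab case, where an interpretive script is built from $target and $ping, is not reproduced here.

use strict;
use warnings;

# PingSite token as it appears in sites.txt; it ends in "target=",
# so in this sketch the target is simply appended.
my $ping_site = 'http://www.slac.stanford.edu/cgi-wrap/traceroute.pl?function=ping&target=';
my $target    = '134.79.16.9';
my $url       = $ping_site . $target;
print "$url\n";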

Files

The list of SLAC and Looking Glass landmarks is read from /afs/slac/www/comp/net/wan-mon/tulip/sites.txt. The format is space-separated tokens:

SNo Site_name PingSite TraceSite Lat Long Reference_Site Alpha, for example (where \ means line broken for viewing)

1  SLAC,Stanford_US http://www.slac.stanford.edu/cgi-wrap/traceroute.pl?function=ping&target=
\ http://www.slac.stanford.edu/cgi-wrap/nph-traceroute.pl?choice=yes&function=trace&target=
\  39.32  -122.04 www.slac.stanford.edu   73

 The list of PlanetLab landmarks is read from: /afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/TULIP/newsites.txt. It appears as:

Piscataway_UnitedStates_PL 128.6.192.158 40.5516 -74.4637 orbpl1.rutgers.edu northamerica
Aachen_Germany_PL 137.226.138.154 50.7708 6.1053 freedom.informatik.RWTH-Aachen.DE europe
Winnipeg_Canada_PL 198.163.152.230 49.8833 -97.1668 planetlab2.win.trlabs.ca northamerica 0

Anything following a # sign is ignored (it is a comment).

The tokens are space delimited; they are (a short parsing sketch follows the list):

Location: in the form City_Country_Type (the only Type currently is PL=PlanetLab)
IP Address
Latitude
Longitude
IP Name
Region (possible regions with PlanetLab hosts are northamerica, eastasia, europe)
Tier: currently may be 0 or 1, if not provided tier 1 is implied
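A sketch of how newsites.txt could be read into a list of landmark records according to the token descriptions above; the hash layout is an assumption for illustration.

use strict;
use warnings;

my $newsites = '/afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/TULIP/newsites.txt';
open(my $fh, '<', $newsites) or die "Cannot open $newsites: $!";

my @landmarks;
while (my $line = <$fh>) {
    $line =~ s/#.*//;                 # anything after '#' is a comment
    next if $line =~ /^\s*$/;         # skip blank lines
    my ($location, $ip, $lat, $long, $name, $region, $tier) = split ' ', $line;
    $tier = 1 unless defined $tier;   # missing tier implies tier 1
    push @landmarks, {
        location => $location, ip   => $ip,
        lat      => $lat,      long => $long,
        name     => $name,     region => $region, tier => $tier,
    };
}
close($fh) or die "Cannot close $newsites: $!";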

Tiering

To reduce the network impact and the time for an initial rough estimate, the landmarks are also broken into two tiers. Tier0 landmarks are used to identify the region of the target; tier1 hosts for that region can then be used to locate the target more exactly. Tier0 hosts are chosen as being at the edges of the region, well connected, highly reliable and quick to respond. We currently only define tier0 hosts for North America and Europe; in other regions all the landmarks are regarded as tier0. There are about 8 tier0 hosts for North America and 4 for Europe. Since there are over 100 landmarks in each of North America and Europe, this greatly reduces the number of landmarks that make measurements for a tier0 request.
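As an illustration of the two-stage selection, a minimal sketch follows; the landmark records reuse the layout from the parsing sketch above (here hard-coded from the newsites.txt sample), and the region chosen by stage 1 is a placeholder.

use strict;
use warnings;

my @landmarks = (
    { name => 'orbpl1.rutgers.edu',                region => 'northamerica', tier => 1 },
    { name => 'planetlab2.win.trlabs.ca',          region => 'northamerica', tier => 0 },
    { name => 'freedom.informatik.RWTH-Aachen.DE', region => 'europe',       tier => 1 },
);

# Stage 1: query only tier 0 landmarks to identify the target's region.
my @tier0 = grep { $_->{tier} == 0 } @landmarks;

# Stage 2: once a region is chosen (placeholder here), query the tier 1
# landmarks in that region for a finer estimate.
my $region = 'northamerica';
my @stage2 = grep { $_->{region} eq $region && $_->{tier} == 1 } @landmarks;

print "stage-1 probes: ", scalar(@tier0), "; stage-2 probes: ", scalar(@stage2), "\n";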

Responses

The responses  appear as:

Landmark=http://128.6.192.158, Client=134.79.117.29, failed to connect response code 500
Landmark=http://141.149.218.208, Client=134.79.117.29, 10 packets transmitted, 0 received, 100% packet loss, rtt min/avg/max = 0/0/0
Landmark=http://128.193.33.7, Client=134.79.117.29, 10 packets transmitted, 10 received, 0% packet loss, rtt min/avg/max = 29.178/29.2495/29.316
Landmark=http://pinger.fnal.gov/cgi-pub/traceroute.pl?function=ping&target=134.79.16.9, Client=134.79.117.57, 5 packets transmitted,\
 5 received, 0% packet loss, rtt min/avg/max = 52/52/53

The first three responses are from PlanetLab landmarks and the last is from a SLAC-type landmark.
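If the client ends up doing the parsing (see Requirements above), a response line of this form could be decomposed with a regular expression along the following lines; this is only one possible sketch, not the production code.

use strict;
use warnings;

my $line = 'Landmark=http://128.193.33.7, Client=134.79.117.29, '
         . '10 packets transmitted, 10 received, 0% packet loss, '
         . 'rtt min/avg/max = 29.178/29.2495/29.316';

if ($line =~ m{Landmark=([^,]+),\s*Client=([^,]+),\s*
               (\d+)\s+packets\s+transmitted,\s*(\d+)\s+received,\s*
               ([\d.]+)%\s+packet\s+loss,\s*
               rtt\s+min/avg/max\s+=\s+([\d.]+)/([\d.]+)/([\d.]+)}x) {
    my ($landmark, $client, $sent, $recv, $loss, $min, $avg, $max)
        = ($1, $2, $3, $4, $5, $6, $7, $8);
    print "avg RTT from $landmark: $avg ms (loss $loss%)\n";
} else {
    print "unparsed response: $line\n";   # e.g. a connection failure message
}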

Errors Reported by PlanetLab

Failed to connect to http://129.22.150.90 response code 500
ERROR: you're (134.79.18.134) already running a measurement on socket 14. http://128.83.122.179
10 packets transmitted, 0 received, 100% packet loss, time 0 ms rtt min/avg/max = 0/0/0 http://141.149.218.208
Can't resolve DNS: submitted:6:in `ip_dst=': unable to resolve $target: running in a chroot without dns support (RuntimeError)
submitted:9: warning: didn't see packet 5 leave: pcap overloaded or server bound to incorrect interface?
To 134.79.16.9 timed out
Error connecting: Connection refused
ERROR: you need a valid scriptroute authentication cookie to use this server, or the cookie you used does not match your client IP 134.79.18.163; go to http://www.scriptroute.org/cookies.html to get one.
ERROR: you're (134.79.18.134) already running a measurement on socket 10.
PlanetLab Server Error: Received: IP (tos 0xc0, ttl 253, id 51592, offset 0, flags [none], length: 56)
192.70.187.218 > 198.82.160.220: icmp 36: time exceeded in-transit
Error connecting: No buffer space available
submitted:9:in `send_train': scriptrouted error:  unable to send to 137.138.137.177: No buffer space available (ScriptrouteError)

Logging 

In addition to the normal web server (Apache) logging, we  use Log4perl for logging. The configuration file is very simple. The following types of error messages can be found in the log file.

2007/09/03 20:02:25 ERROR> EventHandler.pm:70 EventHandler::on_failure - Landmark=http://128.6.192.158, Client=134.79.117.29, failed to connect response code 500<BR>
2007/09/03 20:02:34 ERROR> EventHandler.pm:142 EventHandler::parseData - Landmark=http://129.22.150.90, Client=134.79.117.29, 10 packets transmitted, 0 received, 100% packet loss, rtt min/avg/max = 0/0/0:
2007/09/03 20:09:09 ERROR> EventHandler.pm:115 EventHandler::parseData - Landmark=http://128.143.137.250, Client=134.79.117.29, request timed out: To 134.79.16.9 timed out

Plus Unusual PlanetLab errors of the form:

2007/09/03 20:02:58 ERROR> EventHandler.pm:125 EventHandler::parseData - Landmark=http://128.4.36.11, Client=134.79.117.29, <planetLab error message, see section "Errors Reported by PlanetLab">
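A minimal Log::Log4perl setup that would produce lines in the format shown above might look like the following; the configuration text and log file path are assumptions, not the production configuration.

use strict;
use warnings;
use Log::Log4perl;

# Assumed configuration: one file appender with a timestamped layout
# similar to the log lines shown above.
my $conf = q(
    log4perl.logger                    = ERROR, LOGFILE
    log4perl.appender.LOGFILE          = Log::Log4perl::Appender::File
    log4perl.appender.LOGFILE.filename = /tmp/reflector.log
    log4perl.appender.LOGFILE.layout   = Log::Log4perl::Layout::PatternLayout
    log4perl.appender.LOGFILE.layout.ConversionPattern = %d %p> %F{1}:%L %M - %m%n
);
Log::Log4perl::init(\$conf);

my $logger = Log::Log4perl->get_logger();
$logger->error('Landmark=http://128.6.192.158, Client=134.79.117.29, '
             . 'failed to connect response code 500');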

There is a script at ~cottrell/bin/tulip-analyze-log.pl to aggregate the errors. Typical output appears as:

28cottrell@wanmon:~>bin/tulip-log-analyze.pl
tulip-log-analyze.pl: Mon Sep  3 22:16:47 2007=1188883007
100%_loss,128.111.52.61=1
100%_loss,128.227.56.81=4
100%_loss,129.22.150.90=4
100%_loss,129.24.211.25=5
100%_loss,131.247.2.242=1
100%_loss,141.149.218.208=3
100%_loss,152.14.92.58=1
100%_loss,155.225.2.72=4
100%_loss,205.189.33.178=1
100%_loss,206.207.248.34=2
100%_loss,208.117.131.115=2
100%_loss,63.64.153.84=1
100%_loss,65.241.38.58=3
100%_loss,75.130.96.12=4
failed_to_connect,128.192.101.217=5
failed_to_connect,128.238.88.64=5
failed_to_connect,128.6.192.158=5
failed_to_connect,129.105.44.252=5
failed_to_connect,141.149.218.208=2
failed_to_connect,143.215.129.115=5
failed_to_connect,169.229.50.16=5
failed_to_connect,192.197.121.3=5
failed_to_connect,208.216.119.19=2
failed_to_connect,216.165.109.81=1
not_sent,128.4.36.11=1
timeout,128.143.137.250=1
timeout,128.4.36.11=1
timeout,129.130.252.138=1
timeout,155.225.2.72=1
timeout,64.151.112.20=1
timeout,65.241.38.58=1
Mon Sep  3 22:16:47 2007 tulip-log-analyze.pl: took 0 seconds

As we review the logs we will determine whether probing from some landmarks is reliable enough to warrant their use.

Landmark Failures

The typical failure mechanisms for the target www.cern.ch, with timeouts of 2 and 10 seconds, measured in the evening (PDT) of September 8th 2007, are shown in the table below. The multiple numbers in each cell are for different requests. Increasing the timeout from 2 to 10 seconds does not provide much, if any, help, so we use a timeout of 2 seconds.

Outcome            2 sec timeout    10 sec timeout
100% loss          7, 10, 9         10, 11, 8
Success            22, 16, 17       20, 14, 21
Fail to connect    10, 8, 9         8, 6, 5
Timeout            45, 50, 49       46, 51, 46

Performance

Some spot measurements of performance indicate that, for 10 pings per target and 86 PlanetLab landmarks with region=northamerica, the duration varies as follows as we change the number of landmarks accessed simultaneously, the number of parallel requests per landmark, and the timeout for each request (n.b. there is a timeout of 100 seconds on the complete process, and the default values are marked with * in the table below):

Simultaneous landmarks   Parallel requests / landmark   Request timeout (secs)   Duration (secs)
20*                      5*                             2*                       50
20                       5                              10                       60
10                       5                              2                        88
40                       5                              2                        34
20                       10                             2                        50

Testing

 It can be tested from a web browser by entering the URL, e.g.

http://www-wanmon.slac.stanford.edu/cgi-wrap/reflector.cgi?region=northamerica&target=134.79.16.9

Sample Scripts

traceroute.pl: This script was written with special security considerations, so it will help in implementing reflector.cgi.

topology.pm: This is a multi-threaded script written by Yee, so it will help in understanding the threading issues in Perl, which are a bit complex.
