
Purpose

The main purpose of the TULIP Central Reflector is to proxy TULIP queries to PlanetLab's Scriptroute service. It may also be extended to issue all queries; this decision will depend on speed of execution and security, among other things. The PlanetLab Scriptroute service provides a cookie that is valid for a single IP address only, so all requests will be issued from the Central Reflector and the responses sent back to the TULIP JNLP client. Here is a map of PlanetLab servers.

Implementation 

The TULIP Central Reflector will be a CGI script (reflector.cgi) deployed at SLAC. The TULIP client will issue a single request, and the Reflector will probe all the landmarks in that region*[1] and return the results to the TULIP client. Probing the target site from more vantage points may give us a better estimate of its location.

Requirements

  • Should it fetch sites.txt or keep a local copy of sites.txt? What changes should be made to sites.txt?
    • A new parameter should be added to sites.txt to indicate tier0 or tier1. The region of tier1 sites also needs to be specified in sites.txt.
  • A separate thread should be used for each landmark, with semaphores used for locking so that data from different threads does not intermix.
  • There should be a limit on the number of threads that can be launched at a time (say 10).
  • Should there be extra logging on the reflector, or can we rely on the standard web logs, which record each query including the time stamp and the client name? What else they log depends on whether the request is a GET or a POST.
  • Where should the results be parsed: in the reflector or in the Java client? Parsing in the client distributes the parsing load, reduces the load on the reflector, and simplifies the CGI script.
  • What should happen if a landmark responds with bad data? (Should the reflector process the error or send the raw data back?) Since there will be some anomalies, I suspect the reflector will need to return the full response and in any case inform the user, so initially the client will process the response and spot errors. Also, if the client parses the result it will probably be able to spot problems easily.
  • Special consideration for security, as the script ultimately has to be deployed at SLAC (Perl taint option, warnings option, three-argument open, etc.).
  • Need to agree on a common format for the exchange of data.
  • Needs a blacklisting mechanism for malicious hosts.

After discussing with Yee and Booker it became clear that forks may be too complicated: the version of Perl at SLAC did not support threading, and the security people will not allow forks running inside a CGI script. So I had to come up with an alternative. The solution was to use asynchronous I/O: a batch of requests can be sent to the landmarks without waiting for the responses. The LWP::Parallel library provides this functionality. It is currently not installed, so I am using a local copy in my home directory; ultimately this module has to be installed on the production server.
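The asynchronous fan-out pattern can be sketched as follows. The real implementation is Perl using LWP::Parallel; this Python sketch uses a thread pool as a stand-in for asynchronous I/O, and the `probe` callback and error strings are illustrative, not the script's actual names:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def probe_landmarks(landmarks, probe, max_parallel=20):
    """Send one probe per landmark without waiting for each response
    in turn, collecting results as they complete.

    `probe` is a caller-supplied function taking a landmark and
    returning its response text; a failure is reported as an error
    string, mirroring the reflector's customized error messages."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {pool.submit(probe, lm): lm for lm in landmarks}
        for fut in as_completed(futures):
            lm = futures[fut]
            try:
                results[lm] = fut.result()
            except Exception as exc:
                # Customized error message, as described above
                results[lm] = "Connection failed: %s" % exc
    return results
```

A landmark that raises (e.g. connection refused) thus yields a distinguishable error string rather than aborting the whole batch.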

I have implemented most of the functionality and the script is running fine. I will have to take measures to make the script more secure, so that it cannot be used as a platform to launch DDoS attacks, by limiting the number of concurrent reflector.cgi processes to 10. The script also produces customized messages (such as "request timed out" or "connection failed") so that the TULIP client can differentiate between the various error conditions, and there is a blacklisting mechanism so that particular IP addresses can be blocked.

Implementation

There are two scripts: reflector.cgi and EventHandler.pm. Both have -T (tainting), warnings (-w), and use strict; they use the three-parameter version of open, and all opens and closes have a die or its equivalent. EventHandler.pm is called by reflector.cgi. The scripts are deployed in the path /afs/slac.stanford.edu/g/www/cgi-wrap-bin/net/shahryar/smokeping/.

Invocation 

The reflector script is called by a URL of the form:

http://www-wanmon.slac.stanford.edu/cgi-wrap/reflector.cgi?region=northamerica&target=134.79.16.9&tier=0&type=planetlabs

Leaving out the region will assume all regions, leaving out the tier will assume all tiers, and leaving out the type will assume both PlanetLab and SLAC type landmarks. If the region is included then only landmarks in that region will be used; if the tier is specified then only that tier's landmarks will be used; if the type is specified then only landmarks of that type will be used. Any or all of tier, region, and type may be specified as "all".
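The defaulting rules above can be sketched as a filter. This is a Python illustration of the logic, not the CGI script itself; the landmark record field names (`region`, `tier`, `type`) are assumptions:

```python
def select_landmarks(landmarks, region=None, tier=None, type_=None):
    """Apply the reflector's defaulting rules: a missing (None) or
    "all" parameter matches every landmark; otherwise filter on that
    field. `landmarks` is a list of dicts with region/tier/type keys
    (field names are illustrative)."""
    def wanted(value, field):
        return value in (None, "all") or str(field) == str(value)
    return [lm for lm in landmarks
            if wanted(region, lm["region"])
            and wanted(tier, lm["tier"])
            and wanted(type_, lm["type"])]
```

For example, `region=northamerica&tier=0` keeps only tier-0 North American landmarks, while omitting every parameter keeps them all.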

The script uses asynchronous I/O to talk simultaneously with up to 20 landmarks. Up to 5 copies of reflector.cgi can be running simultaneously.
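The page does not specify how the limit on simultaneous copies of reflector.cgi is enforced; one common mechanism is a set of filesystem lock slots, sketched here purely as an illustration (the slot-directory scheme is an assumption, not the script's actual method):

```python
import errno
import os

def acquire_slot(lockdir, limit=5):
    """Try to claim one of `limit` slots by atomically creating a
    numbered lock directory; return the slot path, or None if all
    slots are busy (i.e. too many copies are already running).
    Purely a sketch of limiting concurrent CGI copies."""
    for i in range(limit):
        slot = os.path.join(lockdir, "slot%d" % i)
        try:
            os.mkdir(slot)  # atomic: fails with EEXIST if taken
            return slot
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise
    return None

def release_slot(slot):
    """Free a slot claimed by acquire_slot."""
    os.rmdir(slot)
```

A copy that gets None back would exit immediately with an error response instead of running.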

For the PlanetLab landmarks an interpretive script (also see here for the original) is supplied with the target ($target) and the number of pings ($ping) to make. For the SLAC ping servers and the Looking Glass sites the landmark is accessed by a URL provided in the sites.txt file (see below) in the PingSites token (e.g. http://www.slac.stanford.edu/cgi-wrap/nph-traceroute.pl?choice=yes&function=trace&target=$target).
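Building the request URL for a SLAC/Looking Glass landmark is just appending the target to the PingSites URL from sites.txt. A minimal sketch (the quoting policy is an assumption; the real script's handling is not documented here):

```python
from urllib.parse import quote

def landmark_url(ping_site, target):
    """SLAC/Looking Glass landmarks are probed by fetching the
    PingSites URL from sites.txt with the target appended.
    Dots are left unescaped so IP addresses pass through unchanged."""
    return ping_site + quote(target, safe=".")
```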

Files

The list of SLAC and Looking Glass landmarks is read from /afs/slac/www/comp/net/wan-mon/tulip/sites.txt. The format is space-separated tokens:

SNo Site_name PingSite TraceSite Lat Long Reference_Site Alpha, for example (where \ means line broken for viewing)

1  SLAC,Stanford_US http://www.slac.stanford.edu/cgi-wrap/traceroute.pl?function=ping&target=
\ http://www.slac.stanford.edu/cgi-wrap/nph-traceroute.pl?choice=yes&function=trace&target=
\  39.32  -122.04 www.slac.stanford.edu   73

 The list of PlanetLab landmarks is read from: /afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/TULIP/newsites.txt. It appears as:

Piscataway_UnitedStates_PL 128.6.192.158 40.5516 -74.4637 orbpl1.rutgers.edu northamerica
Aachen_Germany_PL 137.226.138.154 50.7708 6.1053 freedom.informatik.RWTH-Aachen.DE europe
Winnipeg_Canada_PL 198.163.152.230 49.8833 -97.1668 planetlab2.win.trlabs.ca northamerica 0

Anything following a # sign is ignored (it is a comment).

 The tokens are space delimited, they are:

Location: in the form City_Country_Type (the only Type currently is PL=PlanetLab)
IP Address
Latitude
Longitude
IP Name
Region (possible regions with PlanetLab hosts are northamerica, eastasia, europe)
Tier: currently may be 0 or 1, if not provided tier 1 is implied
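The token list above can be parsed straightforwardly. This is a Python sketch of the format (the actual reflector is Perl; the dict key names are illustrative):

```python
def parse_newsites(text):
    """Parse the PlanetLab landmark list: space-delimited tokens,
    '#' starts a comment, and the trailing tier token defaults to 1
    when absent. Returns a list of dicts, one per landmark."""
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        tokens = line.split()
        records.append({
            "location": tokens[0],                # City_Country_Type
            "ip": tokens[1],
            "lat": float(tokens[2]),
            "lon": float(tokens[3]),
            "name": tokens[4],
            "region": tokens[5],
            "tier": int(tokens[6]) if len(tokens) > 6 else 1,
        })
    return records
```

Applied to the sample lines above, the Aachen entry gets the implied tier 1 while the Winnipeg entry keeps its explicit tier 0.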

Next version

This is still in design/discussion. We want to remove, where possible, the overloading of tokens, make it easier to add new tokens, and tie in better to the PingER NodeDetails database. Our first stab at the template for the new XML file can be found here. An initial version will be created from the original PlanetLab and SLAC sites.txt files at:

my @files=("/afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/newsites.txt",#PlanetLabs
           "/afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/sites.txt",   #SLAC
          );

We will also take this opportunity to clean up the data, for example making sure the City and Country appear, that elements are in the right order, etc.

The intent is that the tier and alpha values will later be added to the NodeDetails database, which will then become the authoritative source from which the XML is created.

It is unclear whether we need to add alpha or whether it should be determined by the program itself, so we need a value to indicate whether to leave it up to the program.

When we have the new XML file we can modify reflector.cgi, using the Perl XML::LibXML module, to read in the XML file. The second step will be to upgrade the NodeDetails database to provide the new elements.

Conventions

We use the following conventions:

  • Country names are defined by the Mapland database since this is used to produce our maps and we cannot modify it.  Usually (but not always) it is in agreement with UN standards.
  • The country names (and regions) in the PingER database can be found here.
  • Regions are defined as given in the PingER NodeDetails Oracle database.

Deployment of Landmarks

There are about 60 SLAC/Looking Glass landmarks and about 156 PlanetLab landmarks. We are working on filtering the latter to remove non-responding or poorly responding landmarks. The PlanetLab landmarks send 10 pings very quickly, whereas the SLAC/Looking Glass landmarks send five 56-byte pings with one second between them; they will also wait up to a deadline of 30 seconds for pings to be replied to.

You can see a Google map of the PlanetLab and SLAC/Looking Glass landmarks by clicking below or by going to: http://www.slac.stanford.edu/comp/net/wan-mon/viper/tulip_googlemap.htm
 


 

Tiering

To reduce the network impact and the time for an initial rough estimate, we also break the landmarks into two tiers. Tier0 landmarks are used to identify the target's region; tier1 hosts for that region can then be used to locate the target more exactly. Tier0 hosts are chosen as being at the edges of the region, well connected, highly reliable, and quick to respond. We currently only define tier0 hosts for North America and Europe; in other regions all the landmarks are regarded as tier0. There are about 8 tier0 hosts for North America and 4 for Europe. This greatly reduces the number of landmarks making measurements for a tier0 request, since there are over 100 landmarks in North America and Europe.
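The two-phase selection described above can be sketched as follows (a Python illustration; the record field names are assumptions, and the fallback-to-tier0 behavior for regions without tier1 hosts follows the rule that such regions treat all landmarks as tier0):

```python
def tiered_probe_sets(landmarks, region=None):
    """Return the landmarks to probe at the current phase.

    Phase 1 (region unknown): probe all tier-0 landmarks.
    Phase 2 (region identified): probe that region's tier-1
    landmarks, falling back to all its landmarks when the region
    defines no tier-1 hosts."""
    if region is None:
        return [lm for lm in landmarks if lm["tier"] == 0]
    in_region = [lm for lm in landmarks if lm["region"] == region]
    tier1 = [lm for lm in in_region if lm["tier"] == 1]
    return tier1 or in_region
```

The client would call this once with no region to get a rough fix, then again with the identified region for the fine estimate.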

Responses

The responses  appear as:

Landmark=http://128.6.192.158, Client=134.79.117.29, failed to connect response code 500
Landmark=http://141.149.218.208, Client=134.79.117.29, 10 packets transmitted, 0 received, 100% packet loss, rtt min/avg/max = 0/0/0
Landmark=http://128.193.33.7, Client=134.79.117.29, 10 packets transmitted, 10 received, 0% packet loss, rtt min/avg/max = 29.178/29.2495/29.316
Landmark=http://pinger.fnal.gov/cgi-pub/traceroute.pl?function=ping&target=134.79.16.9, Client=134.79.117.57, 5 packets transmitted,\
 5 received, 0% packet loss, rtt min/avg/max = 52/52/53

The first 3 responses are from PlanetLab landmarks and the last is from a SLAC type landmark.
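Since the client is expected to parse these lines, here is a sketch of a parser for the response format shown above (Python illustration; the returned field names are my own choices):

```python
import re

def parse_response(line):
    """Parse one reflector response line into a dict; returns None
    for lines that carry no ping statistics (e.g. connect failures,
    which the client should treat as errors)."""
    m = re.search(
        r"Landmark=(\S+), Client=(\S+), (\d+) packets transmitted,"
        r"\s*(\d+) received, (\d+)% packet loss, "
        r"rtt min/avg/max = ([\d.]+)/([\d.]+)/([\d.]+)", line)
    if not m:
        return None
    return {
        "landmark": m.group(1),
        "client": m.group(2),
        "sent": int(m.group(3)),
        "received": int(m.group(4)),
        "loss_pct": int(m.group(5)),
        "rtt": tuple(float(m.group(i)) for i in (6, 7, 8)),
    }
```

This cleanly separates statistic-bearing lines from the error responses listed in the next section.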

Errors Reported by PlanetLab

Failed to connect to http://129.22.150.90 response code 500
ERROR: you're (134.79.18.134) already running a measurement on socket 14. http://128.83.122.179
10 packets transmitted, 0 received, 100% packet loss, time 0 ms rtt min/avg/max = 0/0/0 http://141.149.218.208
Can't resolve DNS: submitted:6:in `ip_dst=': unable to resolve $target: running in a chroot without dns support (RuntimeError)
submitted:9: warning: didn't see packet 5 leave: pcap overloaded or server bound to incorrect interface?
To 134.79.16.9 timed out
Error connecting: Connection refused
ERROR: you need a valid scriptroute authentication cookie to use this server, or the cookie you used does not match your client IP 134.79.18.163; go to http://www.scriptroute.org/cookies.html to get one.
ERROR: you're (134.79.18.134) already running a measurement on socket 10.
PlanetLab Server Error: Received: IP (tos 0xc0, ttl 253, id 51592, offset 0, flags [none], length: 56)
192.70.187.218 > 198.82.160.220: icmp 36: time exceeded in-transit
Error connecting: No buffer space available
submitted:9:in `send_train': scriptrouted error:  unable to send to 137.138.137.177: No buffer space available (ScriptrouteError)

Logging 

In addition to the normal web server (Apache) logging, we  use Log4perl for logging. The configuration file is very simple. The following types of error messages can be found in the log file.
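As a sketch, a minimal Log4perl configuration producing messages in the format shown below might look like this (the file name, appender, and pattern here are assumptions chosen to match the sample messages, not the actual configuration file):

```
log4perl.rootLogger                = ERROR, LOGFILE
log4perl.appender.LOGFILE          = Log::Log4perl::Appender::File
log4perl.appender.LOGFILE.filename = reflector.log
log4perl.appender.LOGFILE.mode     = append
log4perl.appender.LOGFILE.layout   = Log::Log4perl::Layout::PatternLayout
log4perl.appender.LOGFILE.layout.ConversionPattern = %d{yyyy/MM/dd HH:mm:ss} %p> %F{1}:%L %M - %m%n
```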

2007/09/03 20:02:25 ERROR> EventHandler.pm:70 EventHandler::on_failure - Landmark=http://128.6.192.158, Client=134.79.117.29, failed to connect response code 500<BR>
2007/09/03 20:02:34 ERROR> EventHandler.pm:142 EventHandler::parseData - Landmark=http://129.22.150.90, Client=134.79.117.29, 10 packets transmitted, 0 received, 100% packet loss, rtt min/avg/max = 0/0/0:
2007/09/03 20:09:09 ERROR> EventHandler.pm:115 EventHandler::parseData - Landmark=http://128.143.137.250, Client=134.79.117.29, request timed out: To 134.79.16.9 timed out

Plus Unusual PlanetLab errors of the form:

2007/09/03 20:02:58 ERROR> EventHandler.pm:125 EventHandler::parseData - Landmark=http://128.4.36.11, Client=134.79.117.29, <planetLab error message, see section "Errors Reported by PlanetLab">

There is a script at ~cottrell/bin/tulip-log-analyze.pl to analyze the logs. Typical output appears as:

28cottrell@wanmon:~>bin/tulip-log-analyze.pl
===============Failure types by landmark =======================
              Landmark,    Success,   100%_loss,connect_fail,  not_sent,  timeout,      refused,      in_use,     no_name, transit_exc, Totals,
    143.225.229.236,    100.0%,        0.0%,        0.0%,        0.0%,        0.0%,         0.0%,        0.0%,        0.0%,        0.0%           1,
       149.48.230.20,      40.0%,       37.8%,        2.2%,        0.0%,       11.1%,        0.0%,        0.0%,        8.9%,        0.0%          45,...
itchy.cs.uga.edu_PL,        0.0%,       15.4%,       84.6%,        0.0%,        0.0%,        0.0%,        0.0%,        0.0%,        0.0%          26,
             Landmark,     Success,   100%_loss,connect_fail,   not_sent,  timeout,     refused,      in_use,    no_name, transit_exc,   Totals,
                   Totals,        2258,         422,         457,          11,         401,           0,          52,         287,           0,        3888
Wed Oct  3 14:58:38 2007 tulip-log-analyze.pl: took 38 seconds to analyze 4378 records for 323 requests.
Successful hosts=111, Failing hosts=108, PlanetLabs=128(100% success=26), SLACs=16(100% success=10)

As we review the logs we will determine whether probing from some landmarks is reliable enough to warrant their use.

Landmark Failures

The typical failure mechanisms for the target www.cern.ch, with timeouts of 2 and 10 seconds, measured in the evening (PDT) of September 8th 2007, are seen in the table below. The multiple numbers in each cell are for different requests. It is seen that increasing the timeout from 2 to 10 seconds does not provide much, if any, help, so we use a timeout of 2 seconds.

Timeout            2 secs        10 secs
100% loss          7, 10, 9      10, 11, 8
Success            22, 16, 17    20, 14, 21
Fail to connect    10, 8, 9      8, 6, 5
Timeout            45, 50, 49    46, 51, 46

Performance

Some spot measurements of performance indicate that, for 10 pings per target and 86 PlanetLab landmarks with region=northamerica, as we vary the number of landmarks accessed simultaneously, the number of parallel requests per landmark, and the timeout for each request, the duration is as follows (n.b. there is a timeout of 100 seconds on the complete process; the defaults are 20 simultaneous landmarks, 5 parallel requests per landmark, and a 2-second request timeout, i.e. the first row of the table below):

Simultaneous landmarks   Parallel requests / landmark   Request timeout   Duration (secs)
20                       5                              2                 50
20                       5                              10                60
10                       5                              2                 88
40                       5                              2                 34
20                       10                             2                 50

Testing

It can be tested by entering the URL from a web browser or with wget, e.g.

http://www-wanmon.slac.stanford.edu/cgi-wrap/reflector.cgi?region=northamerica&target=134.79.16.9

It can also be called from the command line, e.g.

setenv REMOTE_ADDR 134.79.18.134; perl -d -T bin/reflector.cgi "region=northamerica&target=134.79.16.9&tier=0&type=slac"

However, unless you have a PlanetLab cookie for your host it will not fully work.

Some hosts mis-identified by GeoIP tools and VisualRoute include: www.cst.edu.ve

Security 

There are several issues related to security.

Landmark Server 

The SLAC traceroute/landmark server that is frequently used by the landmark servers:

  • rejects attempts to traceroute to a broadcast address;
  • does not allow a remote host name greater than 255 characters, to prevent buffer-overflow attempts;
  • does not allow a remote host in a different domain to traceroute to a host within the same domain as the web server;
  • limits the maximum number of traceroute processes running on the server, to reduce the chance of a denial of service;
  • starts the traceroute after 3 hops if the client/browser and server are in different domains, in order to hide internal routing information from outsiders;
  • has a blacklist of sites that are blocked.
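The checks above amount to request validation before any traceroute is launched. A Python sketch of that logic (the function name, signature, and the `.255` broadcast heuristic are my own illustrations, not the server's actual code):

```python
def validate_trace_request(target, client_domain, server_domain,
                           blacklist, max_name_len=255):
    """Sketch of the landmark server's checks: reject over-long host
    names (buffer-overflow guard), broadcast addresses, blacklisted
    hosts, and cross-domain traceroutes into the server's own
    domain. Returns (ok, reason)."""
    if len(target) > max_name_len:
        return False, "host name too long"
    if target.endswith(".255") or target == "255.255.255.255":
        return False, "broadcast address"  # crude heuristic
    if target in blacklist:
        return False, "blacklisted"
    if client_domain != server_domain and target.endswith(server_domain):
        return False, "cross-domain trace into server domain"
    return True, "ok"
```

Process-count limiting and the 3-hop truncation would be applied separately, after validation.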

Tulip Client 

TULIP only allows one copy of the client to run on a client host. TULIP also hides the URLs used for the landmarks to reduce the possibility of people gleaning the URLs for a denial of service attack. Editing the landmark URLs requires a password known only to the developers.

Log 

There is a centralized log with time stamped records of all requests, the requesting host, and the target. This is analyzed for abusers.

Scanning and Denial of Service 

A major concern is that the target is pinged simultaneously from multiple landmarks. This can look like a scan of multiple hosts when the target host responds to the ping requests. It can also look like a denial of service attack, especially for hosts with limited available bandwidth, such as are found in developing countries. We thus limit the number of pings from a landmark to a target to 5.

I doubt the early version triggered the alert. It had fewer than 60 landmarks, of which I am guessing (TULIP is down at the moment) about 10-20 did not work (i.e. did not respond to the request to ping). However, recently we added 149 PlanetLab hosts. The net result is that with the current version 39 PlanetLab landmarks answer 100% of the time, 39 answer sometimes, and the rest are either not requested or never answer (as far as I can tell this means they are not pinging, i.e. not responding to the request to make the pings). The typical number of PlanetLab hosts trying to ping is about 60 (of these about 10 fail with 0 pings responding).

We are working on two things to reduce the number of landmarks pinging at a time.

  1. Remove landmarks which are not 100% reliable and whose function is replicated by another landmark (e.g. a nearby working one).
  2. We are also looking at tiering the landmarks (see Tiering above) for the N. American and European hosts. The top tier will enable us to locate the region of the world, and the second tier can then be used to find the location within that region. This reduces the number of landmarks used and divides them in time into two or more sets. Most landmarks are in N. America or Europe (136 out of 149 for PlanetLabs and 26 out of 63 for the SLAC type landmarks). So for tier0 landmarks we use 5 sites in North America, 3 in Europe, and all 32 sites outside N. America and Europe. The tier0 sites are requested first to provide the area the host is in and a rough estimate of position. Thus there are currently 5+32+3 = 40 tier0 landmark requests (this will be reduced when we remove unreliable landmarks, see above). The client can then request more detailed information if the host is in N. America or Europe.

Other Concerns

We have also considered whether the knowledge that a machine and possibly the usual owner can be accurately located may violate some privacy issue. This may require us to add some fuzz to results. So far this has not been done.

Sample Scripts

traceroute.pl: this script was written with special security considerations, so it will help in implementing reflector.cgi.

topology.pm: this is a multi-threaded script written by Yee, so it will help in understanding the threading issues in Perl, which are a bit complex.
