Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migration of unmigrated content due to installation of a new plugin

...

The main purpose of TULIP Central Reflector is to proxy the TULIP queries to landmarks in PlanetLab's Scriptroute Service. It may also be extended to issue all queries.  This decision will be related to speed of execution and security among other things service, and the perfSONAR and  PingER reverse ping servers. The PlanetLab Scriptroute service provides a cookie which works for a single IP address only. So in this way all the requests will be issued from the Central reflector and the responses will be sent back to the TULIP JNLP Client. 

Implementation 

...

The TULIP Central reflector will be a CGI Script (reflector.cgi) deployed at SLAC. The TULIP client will issue a single request and the Reflector will go ahead and probe all the landmarks in that region*\[1\] and return the results to the TULIP Client. Probing the target site from more vantage points may give us a better estimate of its location.

Requirements

  • Should fetch sites.txt or have a local copy of sites.txt, what changes should be made to sites.txt ?
    • A new parameter should be added to sites.txt to include teir0 or teir1tier1. Also the region of tier1 sites needs to be specified in sites.txt
  • A separate thread should be used for each landmark and Semaphores should be used for locking, so that data from different threads should not inter-mix.
  • There should be a limit on the number of threads that can be launched at a time (say 10).
  • Should there be extra logging on the reflector or can we rely on the standard web logs which will log each query including time stamp, the client name.  What else it logs depends on whether the rewuest request is Get or a Post.
  • Where are the results parsed, could be in the reflector or in the Java client. In the client distributes the parsing load, reduces the load on the reflector, simplifies the CGI script.
  • What should happen if a landmark responds with a bad data. ( Should it process the error or send the raw data back?). Since there will be some anomalies I suspect the reflector will need to return the full response and anyhow needs to inform the user, so I suspect initially the clinet client will process the response and spot errors etc.) Also if the client parses the result it will probably be easily able to spot problems.
  • Special consideration for security as the script ultimately has to be deployed at SLAC (Perl taint option, warning option, special open method etc)
  • Need to agree on a common format for the exchange of data.
  • Needs a blacklisting mechanism for malicious hosts.

After discussing with Yee and Booker it was seen that forks may be too complicated. The version of Perl at SLAC did not support threading. Also the security people will not allow forks running inside a CGI-script. So I had to come up with an alternative. The solution to this problem was to use Asynchronous IO.  A bunch of requests could be send to the landmarks without waiting for he the response.  The  LWP::Parallel library provides all this functionality.  It supports asynchronous IO. Currently it is not installed  so I am using a local version  in my home directory.  Ultimately this module has to be installed on the production server.

I have implemented most of the functionality. The script is running fine.  I will have to taken measures to make the script more secure so that it could not be used as a platform to launch DDOS attacks, by limiting the number of concurrent process of reflector.cgi to 10. Also the script produces customized messages such as ( request time out or Connection failed so that the TULIP client can differentiate between the various kind of error conditions). Also there is a crude blacklisting mechanism so that particular IP addresses can be blocked.

Implementation

There are two scripts: reflector.cgi and  EventHandler.pm. Both have -T (Tainting), warning (-w), use strict, use the 3 parameter version of open, all opens and closes have a die or its equivalent. EventHandler.pm is called by reflector.cgi.  The Scripts CGI scripts are  deployed in the path /afs/slac.stanford.edu/g/www/cgi-wrap-bin/net/shahryar/smokeping/.

Invocation

The reflector script is called by a URL of the form:

Code Block
http://www-wanmon.slac.stanford.edu/cgi-wrap/reflector.cgi?region=North%20America&target=134.79.16104.980&tier=all&type=PlanetLab&ability=1

Leaving out the region will assume all regions, leaving out the tier will assume all tiers, leaving out the type will assume both PlanetLabs PingER and SLAC perfSONAR type landmarks. If the region is included then only landmarks in that region will be used. If the tier is specified then only that tier's landmarks will be used. If the type is specified then only that type landmark(s) (e.g. type=PingER,perfSONAR will use only landmark types PingER and perfSONAR) will be used.  Any or all of the  tier, region and type region may be specified as "all".

The ability parameter enables the user to specify whether to use enabled landmarks (ability=1, default) or disabled landmarks (by specifying ability=0).  The enabled landmarks are found in;

Code Block
http://www.slac.stanford.edu/comp/net/wan-mon/tulip/sites.xml

The disabled landmarks are found in:

Code Block
http://www.slac.stanford.edu/comp/net/wan-mon/tulip/sites-disabled.xml

The sites xml files are created a by a trscrontanb running in pinger@pinger.slac.stanford.edu. The script that writes sites.xml is: /afs/slac/package/pinger/tulip/create_sites-xml.plThere pl. There is also a debug parameter. Use debug=1 or greater if you want debugging output.

The function parameter, enables one to specify to ping (function=ping, default), to print out usage information (function=help), to print out the log file(function=log), to print out an HTML table of the landmarks (function=landmarks) from the TULIP database, and to analyze the log file (funjctionfunction=analyze).

The script uses asynchronous I/O to talk simultaneously with up to 20 landmarks. Up to 5 copies of reflector.cgi can be running simultaneously.

For the Planetlab landmarks an interpretive script (also see here for the original is supplied with the target ($target) and the number of pings ($ping) to make. For the SLAC ping servers and the Looking glass sites the landmark is accessed by a URL provided in the sites.txt file (see below) in the PingSites token (e.g. http://www.slac.stanford.edu/cgi-wrap/nph-traceroute.pl?choice=yes&function=trace&target=$target).

New XML landmarks file

The function analyze has a sub option ability=0 to analyze the logs for disabled landmarks and ability =1 (default) to analyze log entries excluding those for disabled landmarks. The function landmarks has an additional parameter out (out=html (default) or out=csv), which can toggle output between html and csv. Using this link, one can easily have a look at landmarks present in the database, along with their most important properties, like tracerouteURL, pingURL, longitude, latitude, city, country and hostname.

The script uses asynchronous I/O to talk simultaneously with up to 20 landmarks. Up to 5 copies of reflector.cgi can be running simultaneously.

For the Planetlab landmarks an interpretive script (also see here for the original is supplied with the target ($target) and the number of pings ($ping) to make. For the SLAC ping servers and the Looking glass sites the landmark is accessed by a URL provided in the sites.txt file (see below) in the PingSites token (e.g. http://www.slac.stanford.edu/cgi-wrap/nph-traceroute.pl?choice=yes&function=trace&target=$target).

New XML landmarks file

TULIP has been refactored to use an XML file for its site/landmark information. This is generated from a MySQL database in the 'tulip' schema on pinger. The file can be generated with a perl script in the subversion repository, pinger/trunk/bin/create_sites-xml.pl and its associated template file, pinger/trunk/bin/sites-xml.tt2. Add/modify nodes in the sites TULIP has been refactored to use an XML file for its site/landmark information. This is generated from a MySQL database in the 'tulip' schema on pinger. The file can be generated with a perl script in the subversion repository, pinger/trunk/bin/create_sites-xml.pl and its associated template file, pinger/trunk/bin/sites-xml.tt2. Add/modify nodes in the sites list by connecting to the database.

...

  • Country names are defined by the Mapland database since this is used to produce our maps and we cannot modify it.  Usually (but not always) it is in agreement with UN standards.
  • The country names (and regions) in the PingER database can be found here.
  • Regions are defined as given in the PingER NodeDetails Oracle database.
  • A region to country mapping can be found at here. This mapping is created using a the script createCountryRegion.pl which is placed at in /afs/slac.stanford.edu/www/comp/net/wan-mon/tulip/TULIP/. This script uses the NODE_DETAILS hash to determine the regions of various countries.

...

There are about 60 SLAC/Looking Glass landmarks, and about 156 PlanetLabs landmarks.  We are working on filtering the latter to remove filter the landmarks nightly using tulip-tuning.pl to disable non- or poorly respodning landmarksresponding landmarks, and re-enable them when they are working well. The PlanetLab landmarks send 10 pings very quickly, whereas the SLAC/Looking Glass landmarks send five 56 byte pings with one second between them, they will also wait for a deadline time of 30 seconds for pings to be replied to.

...

To reduce the network impact and reduce the initial rough estimate time, we also break the landmarks into two tiers. Tier0 landmarks are used to identify the region for the target. Then tier1 hosts for that region can be used to more exactly locate the target.  Tier0 hosts are chosen as being at the edges of the region, being well connected, highly reliable and quick to respond. We currently only define tier0 hosts for North America and Europe.  In other regions all the landmarks are regarded as tier0. There are about 8 tier0 hosts for North America and 4 for Europe. This reduces the number of landmarks to make measurements from measurements  with a tier0 request since there are over 100 landmarks in either North America or Europe.When sending the query to the reflector we can characterize it in to into three tiers 0,1 and 2. These tiers would be matched with sites.xml nodes and relevant tiering information would be extracted from tag name "tier".  

...

The first 3 response are from PlanetLab landmarks and the latter is from a SLAC type landmark. 

Errors Reported by PlanetLab

Nb. a negative loss is reported if the target provides duplicate responses (emplification), such as

Code Block

3 packets transmitted, 5 packets received, 1.67 times amplification

PlanetLab Landmarks

To access the PlanetLab landmarks one needs a cookie that is associated with a subnet (in our case 134.79/16). In addition one needs a ruby script that is sent to the PlanetLab landmark to execute. These are put together by reflector.cgi to create a URL in hex form.

Errors Reported by PlanetLab

Failed to connect to http://129.22.150.90 response code 500
ERROR: you're (134.79.18.134) already running a measurement on socket 14. Wiki MarkupFailed to connect to [http://129.22.150.90] response code 500 ERROR: you're (134.79.18.134) already running a measurement on socket 14. [http://128.83.122.179]
10 packets transmitted, 0 received, 100% packet loss, time 0 ms rtt min/avg/max = 0/0/0 [http://141.149.218.208]
Can't resolve DNS: submitted:6:in `ip_dst=': unable to resolve $target: running in a chroot without dns support (RuntimeError)
submitted:9: warning: didn't see packet 5 leave: pcap overloaded or server bound to incorrect interface?
To 134.79.16.9 timed out
Error connecting: Connection refused
ERROR: you need a valid scriptroute authentication cookie to use this server, or the cookie you used does not match\
your client IP 134.79.18.163; go to [http://www.scriptroute.org/cookies.html] to get one.
ERROR: you're (134.79.18.134) already running a measurement on socket 10.
PlanetLab Server Error: Received: IP (tos 0xc0, ttl 253, id 51592, offset 0, flags \ [none\], length: 56)
192.70.187.218 > 198.82.160.220: icmp 36: time exceeded in-transit
Error connecting: No buffer space available
submitted:9:in `send_train': scriptrouted error:  unable to send to   unable to send to 137.138.137.177: No buffer space available (ScriptrouteError)

Parsing the SLAC Landmarks

Unfortunately the SLAC landmark server is really designed to be executed from a browser which will render the output, see for example http://www.slac.stanford.edu/cgi-wrap/traceroute.pl?function=ping&target=www.fnal.gov and view the source.  Thus not only is it verbose (~3.3KBytes per successful request) but also EventHandler.pm has to carefully parse this human readable output to find the relevant ping output lines with the RTTs and losses. In addition the server uses the standard system ping command and its output which varies in details between OS's and releases so this also has to be accomadated accomodated in the parsing.

Logging 

In addition to the normal web server (Apache) logging, we  use Log4perl for logging. The configuration file is very simple. The following types of error messages can be found in the log file (this is at /scratch/tulip_log on wanmon.slac.stanford.edu and is also available here) together with with time stamped records of all requests, the requesting host, and the target.
2007/09/03 20:02:25 ERROR> EventHandler.pm:70 EventHandler::on_failure - Landmark=http://128.6.192.158,\
Client=134.79.117.29, failed to connect response code 500<BR>
2007/09/03 20:02:34 ERROR> EventHandler.pm:142 EventHandler::parseData - Landmark=http://129.22.150.90,\
Client=134.79.117.29, 10 packets transmitted, 0 received, 100% packet loss, rtt min/avg/max = 0/0/0:
2007/09/03 20:09:09 ERROR> EventHandler.pm:115 EventHandler::parseData - Landmark=http://128.143.137.250,\
Client=134.79.117.29, request timed out: To 134.79.16.9 timed out

...

Code Block
28cottrell@wanmon:~>bin/tulip-log-analyze.pl
===============Failure types by landmark =======================
           Landmark, Success,  100%_loss,connect_fail, not_sent, timeout, refused,  in_use,  no_name, transit_exc, Totals,
    143.225.229.236, 100.0%,  0.0%, 0.0%, 0.0%, 0.0%, 0.0%, 0.0%, 0.0%,        0.0%           1,
      149.48.230.20, 40.0%,  37.8%, 2.2%, 0.0%, 11.1%, 0.0%, 0.0%,        8.9%,        0.0%          45,...
itchy.cs.uga.edu_PL, 0.0%,  15.4%, 84.6%, 0.0%, 0.0%, 0.0%, 0.0%,        0.0%,        0.0%          26,
           Landmark, Success,  100%_loss,connect_fail, not_sent, timeout, refused, in_use,  no_name, transit_exc,  Totals,
             Totals,  2258, 422, 457, 11, 401, 0, 52, 287, 0, 3888
Wed Oct  3 14:58:38 2007 tulip-log-analyze.pl: took 38 seconds to analyze 4378 records for 323 requests.
Successful hosts=111, Failing hosts=108, PlanetLabs=128(100% success=26), SLACs=16(100% success=10)

 The log is analyzed to understand usage, look  for abusers etc.

Landmark Failures

The typical failure mechanisms for the target www.cern.ch with a timeout of 2  and 10 seconds made in the evening (PDT) of September 8th 2007 is seen in the table below. The multiple numbers in each cell are for different requests. It is seen that increasing the timeout from 2 to 10 seconds does not provide much, if any help.  So we utilize a timeout of 2 seconds.

Timeout

2 secs

10 secs

100% loss

7, 10, 9

10, 11, 8

Success

22, 16, 17

20, 14, 21

Fail to connect

10, 8, 9

8, 6, 5

Timeout

45, 50, 49

46, 51, 46

Performance

Some spot measures of performance indicate that for 10 pings per target and 86 PlanetLab landmarks for region=northamerica as we vary the number of landmarks accessed simultaneously,  the number of parallel requests per landmark, and the timeout for each request the duration is as follows (n.b. there is a timeout of 100 seconds on the complete process, and the default values are in boldface in the table below:

SLACs=16(100% success=10)

 The log is analyzed to understand usage, look  for abusers etc.

Landmark Failures

The typical failure mechanisms for the target www.cern.ch with a timeout of 2  and 10 seconds made in the evening (PDT) of September 8th 2007 is seen in the table below. The multiple numbers in each cell are for different requests. It is seen that increasing the timeout from 2 to 10 seconds does not provide much, if any help.  So we utilize a timeout of 2 seconds.

Timeout

2 secs

10 secs

100% loss

7, 10, 9

10, 11, 8

Success

22, 16, 17

20, 14, 21

Fail to connect

10, 8, 9

8, 6, 5

Timeout

45, 50, 49

46, 51, 46

Performance

Some spot measures of performance indicate that for 10 pings per target and 86 PlanetLab landmarks for region=northamerica as we vary the number of landmarks accessed simultaneously,  the number of parallel requests per landmark, and the timeout for each request the duration is as follows (n.b. there is a timeout of 100 seconds on the complete process, and the default values are in boldface in the table below:

Simultaneous landmarks

Parallel requests / landmark

Request timeout

Duration (secs)

 20

5

2

50

20

5

10

60

10

5

2

88

40

5

2

34

20

10

2

50

Performance Version=3.0

There was a problem found with reflector that the request timeout was fixed at 5sec and the parallel requests was fixed at 10. Modifying the parameters in the code didn't change anything. This was causing two problems, 1) 5 seconds was too less for most of the landmarks causing large number of timeouts and 2) The speed of the script was not fast enough. The timeout was changed to 10sec and the number of parallel requests was increased to 80. During the time of testing there were a total of 200 active landmarks of which 91 were in North America, 15 in South Asia and 49 in Europe.

Region

Tier

Time (seconds)

All

0

10

North America

1

23

Europe

1

35

South Asia

1

17

All

All

65

Simultaneous landmarks

Parallel requests / landmark

Request timeout

Duration (secs)

 20

5

2

50

20

5

10

60

10

5

2

88

40

5

2

34

20

10

2

50

Testing

 It can be tested from a web browser by entering the URL (e.g. from a browser or from wget), e.g.

...

We have also considered whether the knowledge that a machine and possibly the usual owner can be accurately located may violate some privacy issue. This may require us to add some fuzz to results. So far this has not been done.

Sample Scripts

traceroute.pl: This script has been written with special security considerations so it will help in implementing reflector.cgi

...