Problem

Kars Ohrenberg of DESY observed data transfer problems from SLAC to DESY. He sent email describing the problem and requesting assistance.

Apparently, transfers to the Max Planck Institute in Garching (host camp.rzg.mpg.de) were working fine (100MB/sec using bbcp), but transfers to the DESY host mms1.desy.de were only getting 15MB/sec.

The traceroutes from SLAC to both hosts were identical up to and including the 11th and 12th hops:

11  esnet-wash.rt1.fra.de.geant2.net (62.40.125.77)  170.452 ms  170.420 ms  170.327 ms
12  dfn-gw.rt1.fra.de.geant2.net (62.40.124.34)  171.134 ms  171.012

The traceroute from SLAC to DESY from the 13th hop onwards is shown below:

13  zr-pot1-te0-7-0-2.x-win.dfn.de (188.1.145.138)  184.130 ms  184.523 ms 184.580 ms
14  xr-tub1-vlan500.x-win.dfn.de (188.1.144.222)  188.352 ms  185.096ms 185.093 ms
15  xr-des1-te1-1.x-win.dfn.de (188.1.144.226)  189.444 ms  189.517 ms 189.619 ms
16  kr-desy.x-win.dfn.de (188.1.47.10)  190.352 ms  190.245 ms 189.988 ms
17  * * *

The traceroute from SLAC to Garching from the 13th hop onwards is shown below:

13  xr-gar1-te2-2.x-win.dfn.de (188.1.145.54)  179.273 ms  179.393 ms 179.512 ms
14  kr-ipp-garching.x-win.dfn.de (188.1.234.34)  179.063 ms  179.118 ms 179.173 ms
15  bi-d2a.rzg.mpg.de (130.183.254.5)  178.872 ms  179.016 ms  179.070 ms
16  camp.rzg.mpg.de (130.183.19.11)  179.802 ms  179.853 ms  179.845 ms
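
The comparison of the two paths can also be done mechanically. The following is a minimal sketch, assuming the traceroute output has been saved as text in the format shown above; the function names parse_hops and first_divergence are hypothetical helpers, not part of any tool used here. Hop 12 is copied as it appears above, and hops that return only * * * are skipped.

```python
import re

# Matches lines such as "13  host.example (1.2.3.4)  10.0 ms ..."
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)")

def parse_hops(text):
    """Return {hop_number: (hostname, ip)} for lines that name a router."""
    hops = {}
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:  # lines like "17  * * *" do not match and are ignored
            hops[int(m.group(1))] = (m.group(2), m.group(3))
    return hops

def first_divergence(hops_a, hops_b):
    """Return the lowest hop number seen on both paths whose IPs differ, else None."""
    for n in sorted(set(hops_a) & set(hops_b)):
        if hops_a[n][1] != hops_b[n][1]:
            return n
    return None

COMMON = """\
11  esnet-wash.rt1.fra.de.geant2.net (62.40.125.77)  170.452 ms  170.420 ms  170.327 ms
12  dfn-gw.rt1.fra.de.geant2.net (62.40.124.34)  171.134 ms  171.012
"""

DESY_TRACE = COMMON + """\
13  zr-pot1-te0-7-0-2.x-win.dfn.de (188.1.145.138)  184.130 ms  184.523 ms 184.580 ms
14  xr-tub1-vlan500.x-win.dfn.de (188.1.144.222)  188.352 ms  185.096ms 185.093 ms
15  xr-des1-te1-1.x-win.dfn.de (188.1.144.226)  189.444 ms  189.517 ms 189.619 ms
16  kr-desy.x-win.dfn.de (188.1.47.10)  190.352 ms  190.245 ms 189.988 ms
17  * * *
"""

GARCHING_TRACE = COMMON + """\
13  xr-gar1-te2-2.x-win.dfn.de (188.1.145.54)  179.273 ms  179.393 ms 179.512 ms
14  kr-ipp-garching.x-win.dfn.de (188.1.234.34)  179.063 ms  179.118 ms 179.173 ms
15  bi-d2a.rzg.mpg.de (130.183.254.5)  178.872 ms  179.016 ms  179.070 ms
16  camp.rzg.mpg.de (130.183.19.11)  179.802 ms  179.853 ms  179.845 ms
"""

divergent_hop = first_divergence(parse_hops(DESY_TRACE), parse_hops(GARCHING_TRACE))
print("paths diverge at hop", divergent_hop)
```

With the hops listed above, the two paths first differ at hop 13, where the DESY path goes via Potsdam (zr-pot1) and the Garching path goes directly to xr-gar1.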

Investigation

I am not worried about the routers not responding to traceroute and showing * * *; this is normal. The average difference of ~45ms in getting to the end node (when the RTTs as far as the gateway are the same) is explained below. I am running mtr to both end hosts; see below. Do not worry about the difference of one packet between sent and received; this is because the display is constantly updating. Comparing the average RTT with the best (no queueing) RTT, the difference is about 40ms for DESY and about 1ms for Garching. This indicates a lot of queueing/jitter on the DESY link, which can affect TCP performance. So I am guessing the problem (if indeed it is a network problem) may lie in congestion on the Frankfurt-Potsdam link.

Having given up on GeoIPTools to locate the routers, we looked at DFN and found http://www.win-labor.dfn.de/cgi-bin/hades/map.pl?config=win giving one-way delays. This shows that from Frankfurt the packets go to Potsdam, then Tubingen, and then to DESY. Looking at the map (below), the Frankfurt to Max Planck Garching link is about 300 miles, whereas the Frankfurt-Potsdam-Tubingen-Hamburg/DESY route is more roundabout and about 1100 miles. If one takes the speed of light in fibre as being 0.6c, this gives distance = 0.4*100*RTT(ms) km, so the difference in length between the two routes corresponds to about 30ms, which is in rough agreement with what we are seeing from the traceroutes. The actual one-way delay measurements themselves are of little use for what we want: the RTT matters more than the one-way delay, and the jitter (queueing) matters most of all.
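
The arithmetic behind the 30ms figure can be checked with a short calculation. This is only a sketch: it uses the rule of thumb distance = 0.4*100*RTT(ms) km exactly as quoted above, the approximate route lengths of 300 and 1100 miles read off the map, and a hypothetical helper name rtt_ms_for_distance.

```python
KM_PER_MILE = 1.609344

def rtt_ms_for_distance(distance_km, alpha=0.4):
    """Invert the rule of thumb distance_km = alpha * 100 * rtt_ms."""
    return distance_km / (alpha * 100.0)

# Approximate route lengths from Frankfurt, as estimated from the map.
garching_miles = 300
desy_miles = 1100

extra_km = (desy_miles - garching_miles) * KM_PER_MILE
extra_rtt_ms = rtt_ms_for_distance(extra_km)

print(f"extra path: {extra_km:.0f} km, extra RTT: {extra_rtt_ms:.0f} ms")
# prints: extra path: 1287 km, extra RTT: 32 ms
```

The extra 800 miles (about 1287 km) corresponds to roughly 32ms of additional RTT under this rule of thumb, consistent with the "about 30ms" estimate.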

There could also be longer-term effects, such as the link being more congested at some times of the day. I went back to look at the archived PingER data for a host at DESY and for one in Munich, near Garching.

Looking back at the PingER data (see http://www-wanmon.slac.stanford.edu/cgi-wrap/pingtable.pl?file=average_rtt&by=by-node&size=100&tick=last120days&from=EDU.SLAC.STANFORD.N3&to=Germany&ex=none&only=all&dataset=hep&percentage=any) and the plots below:

The DESY RTTs look pretty stable, though there was a step increase in RTT on 2010-05-20. The pings from SLAC to Munich Max Planck (as opposed to Garching nearby) have a very similar appearance, with a bit more stability. The step change is again visible and the overall losses are similar.

The pings I am currently measuring to Garching (camp.rzg.mpg.de) confirm the PingER values of about 179ms. However, the pings to mms1.desy.de are much larger (min/avg/max/stdev/count=189/237/398/62/1589) than the PingER RTTs to ping.desy.de. They are also much larger than the pings measured simultaneously to ping.desy.de (min/avg/max/stdev/count=189/189.5/194/0.58/1388).
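
To make the contrast between the two DESY hosts explicit, the ping summaries can be reduced to a rough queueing estimate (average minus minimum RTT). This is a sketch only: the PingSummary class is a hypothetical container, and it assumes the fifth figure in each summary is the packet count.

```python
from dataclasses import dataclass

@dataclass
class PingSummary:
    host: str
    min_ms: float
    avg_ms: float
    max_ms: float
    stdev_ms: float
    count: int  # assumption: the fifth figure in each summary is the packet count

    @property
    def queueing_ms(self):
        """Average RTT minus best RTT: a rough estimate of time spent queueing."""
        return self.avg_ms - self.min_ms

# Figures taken from the ping measurements quoted above.
mms1 = PingSummary("mms1.desy.de", 189, 237, 398, 62, 1589)
ping = PingSummary("ping.desy.de", 189, 189.5, 194, 0.58, 1388)

for s in (mms1, ping):
    print(f"{s.host}: min {s.min_ms} ms, queueing ~{s.queueing_ms:.1f} ms, "
          f"stdev {s.stdev_ms} ms over {s.count} pings")
```

Both hosts share the same minimum RTT of 189ms, so the propagation path looks the same; the extra ~48ms average delay and the large spread appear only for mms1.desy.de.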

Could there be something wrong with the host mms1.desy.de, e.g. it is overloaded, is paging, its network connection is congested, or the NIC is not properly configured?

Also see https://confluence.slac.stanford.edu/display/IEPM/Incorrect+geolocation+results+by+database+based+approaches for the impact of this routing on geolocation.
