We observed that measurements from the TULIP landmark monitor.niit.edu.pk to nasa.nexlinx.net.pk (alias lg.nexlinx.net.pk) were sometimes seeing very large RTTs (500-1000 ms) for extended periods, compared to those (~50 ms) from other landmarks at COMSATS and NCP in Pakistan. We ran top on monitor.niit.edu.pk, and the extended RTTs did not seem to be related to what was running on the host. We then ran traceroute in the two states (~50 ms RTT and 500-1000 ms RTT), see below. The routes do not change, but the RTTs to the second hop and beyond are badly elongated. The fact that the first hop is not elongated means, I think, that the delay is not caused by monitor itself. Thus the congestion appears to be occurring in the network. In fact, from the traceroutes it appears to be most noticeable at the second hop, which is labeled as DSL.
jerrod@www:~$ traceroute lg.nexlinx.net.pk
traceroute to nasa.nexlinx.net.pk (202.59.80.52), 30 hops max, 38 byte packets
 1 203.99.50.201 (203.99.50.201) 0.894 ms 0.940 ms 0.985 ms
 2 lo-0-bras1.dsl.net.pk (203.82.63.253) 51.178 ms 71.427 ms 97.958 ms
 3 g9-10-iba.nayatel.pk (203.82.48.165) 47.867 ms 56.102 ms 47.368 ms
 4 58-65-175-217.nayatel.pk (58.65.175.217) 88.079 ms 49.389 ms 48.004 ms
 5 f0-gw3.nayatel.pk (58.65.166.250) 48.094 ms f1-gw3.nayatel.pk (58.65.166.254) 48.716 ms 49.531 ms
 6 rwp44.pie.net.pk (202.125.155.93) 50.012 ms 53.399 ms 52.011 ms
 7 rwp44.pie.net.pk (202.125.148.163) 50.035 ms 49.537 ms 50.021 ms
 8 pos1-1.lhe63gsrc1.pie.net.pk (202.125.159.25) 59.919 ms 63.638 ms 63.938 ms
 9 lhr63.pie.net.pk (202.125.138.169) 53.932 ms 55.435 ms 56.104 ms
10 lhr63.pie.net.pk (202.125.147.6) 56.008 ms 81.429 ms 55.995 ms
11 * * *
12 nasa.nexlinx.net.pk (202.59.80.52) 58.410 ms 69.452 ms 57.957 ms

jerrod@www:~$ traceroute lg.nexlinx.net.pk
traceroute to nasa.nexlinx.net.pk (202.59.80.52), 30 hops max, 38 byte packets
 1 203.99.50.201 (203.99.50.201) 0.775 ms 0.720 ms 0.837 ms
 2 lo-0-bras1.dsl.net.pk (203.82.63.253) 470.776 ms 538.449 ms 528.146 ms
 3 g9-10-iba.nayatel.pk (203.82.48.165) 556.081 ms 504.211 ms 601.765 ms
 4 58-65-175-217.nayatel.pk (58.65.175.217) 618.138 ms 58-65-175-221.nayatel.pk (58.65.175.221) 729.738 ms 693.730 ms
 5 f1-gw3.nayatel.pk (58.65.166.254) 766.243 ms 826.162 ms 755.736 ms
 6 rwp44.pie.net.pk (202.125.155.93) 746.828 ms 761.153 ms 800.322 ms
 7 rwp44.pie.net.pk (202.125.148.163) 826.194 ms 899.759 ms 854.293 ms
 8 pos1-1.lhe63gsrc1.pie.net.pk (202.125.159.25) 1080.374 ms 983.870 ms 1010.345 ms
 9 lhr63.pie.net.pk (202.125.138.169) 896.189 ms 905.808 ms 845.805 ms
10 lhr63.pie.net.pk (202.125.147.6) 874.788 ms 987.723 ms 940.307 ms
11 * * *
12 nasa.nexlinx.net.pk (202.59.80.52) 863.136 ms 956.115 ms 893.514 ms
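The hop-by-hop comparison of the two traceroutes above can be made explicit with a small script. This is only a sketch: the RTT samples are copied by hand from the two traces (hop 11 is left out because it never replied, and one sample at hop 4 of the slow trace is read as 729.738 ms), and the function name first_slow_hop and the factor of 3 are illustrative choices, not part of the original analysis.

```python
from statistics import median

# RTT samples in ms, copied from the two traceroutes above.
# Hop 11 is omitted because it did not answer (* * *).
normal = {
    1: [0.894, 0.940, 0.985],
    2: [51.178, 71.427, 97.958],
    3: [47.867, 56.102, 47.368],
    4: [88.079, 49.389, 48.004],
    5: [48.094, 48.716, 49.531],
    6: [50.012, 53.399, 52.011],
    7: [50.035, 49.537, 50.021],
    8: [59.919, 63.638, 63.938],
    9: [53.932, 55.435, 56.104],
    10: [56.008, 81.429, 55.995],
    12: [58.410, 69.452, 57.957],
}
slow = {
    1: [0.775, 0.720, 0.837],
    2: [470.776, 538.449, 528.146],
    3: [556.081, 504.211, 601.765],
    4: [618.138, 729.738, 693.730],  # middle value assumed to be 729.738
    5: [766.243, 826.162, 755.736],
    6: [746.828, 761.153, 800.322],
    7: [826.194, 899.759, 854.293],
    8: [1080.374, 983.870, 1010.345],
    9: [896.189, 905.808, 845.805],
    10: [874.788, 987.723, 940.307],
    12: [863.136, 956.115, 893.514],
}


def first_slow_hop(normal, slow, factor=3.0):
    """Return the first hop whose median RTT grew by more than `factor`.

    `factor` is an arbitrary illustrative threshold. Returns None if no
    hop exceeds it.
    """
    for hop in sorted(normal):
        if hop in slow and median(slow[hop]) > factor * median(normal[hop]):
            return hop
    return None


if __name__ == "__main__":
    for hop in sorted(normal):
        n, s = median(normal[hop]), median(slow[hop])
        print(f"hop {hop:2d}: normal {n:8.3f} ms  slow {s:8.3f} ms  added {s - n:8.3f} ms")
    print("first elongated hop:", first_slow_hop(normal, slow))
```

With these samples the first hop stays below 1 ms in both states, and the extra delay first shows up at hop 2, which is consistent with the reading given in the text.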
Next we ran 60,000 100-byte pings, one second apart, from monitor.niit.edu.pk to nasa.nexlinx.net.pk, starting at about 10:30 pm on February 16, Pakistan time. The summary appears as:
--- nasa.nexlinx.net.pk ping statistics ---
60000 packets transmitted, 59741 received, +33 errors, 0% packet loss, time 60077854ms
rtt min/avg/max/mdev = 55.763/92.539/2099.976/136.408 ms, pipe 3
Within the roughly 16.7 hours of pings there were two instances of "Destination Host Unreachable". Each of these lasted about 30 seconds. I suspect that these were caused by DSL resyncs. The events are shown as negative losses in the time series below.
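As an alternative to inspecting the raw log by eye, episodes such as the "Destination Host Unreachable" bursts and the extended high-RTT periods can be pulled out of the per-packet ping output with a short script. The sketch below assumes the usual Linux ping line layout (icmp_seq=N ... time=X ms for replies, icmp_seq=N Destination Host Unreachable for errors). The sample lines, the ttl value, and the 200 ms threshold are made up for illustration and are not taken from the actual log.

```python
import re

# Assumed line formats of Linux ping output (not verified against the real log).
REPLY = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")
UNREACH = re.compile(r"icmp_seq=(\d+) Destination Host Unreachable")


def parse_ping(lines):
    """Map icmp_seq to RTT in ms for replies, or to None for unreachable errors."""
    samples = {}
    for line in lines:
        m = REPLY.search(line)
        if m:
            samples[int(m.group(1))] = float(m.group(2))
            continue
        m = UNREACH.search(line)
        if m:
            samples[int(m.group(1))] = None
    return samples


def episodes(samples, predicate):
    """Return (first_seq, last_seq) runs of consecutive probes matching predicate."""
    runs = []
    start = prev = None
    for seq in sorted(samples):
        if predicate(samples[seq]):
            if start is None or seq != prev + 1:
                if start is not None:
                    runs.append((start, prev))
                start = seq
            prev = seq
        elif start is not None:
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))
    return runs


# Illustrative lines only; the real log has 60,000 probes.
demo = [
    "108 bytes from nasa.nexlinx.net.pk (202.59.80.52): icmp_seq=1 ttl=243 time=58.4 ms",
    "108 bytes from nasa.nexlinx.net.pk (202.59.80.52): icmp_seq=2 ttl=243 time=612 ms",
    "108 bytes from nasa.nexlinx.net.pk (202.59.80.52): icmp_seq=3 ttl=243 time=701 ms",
    "From 203.99.50.201 icmp_seq=4 Destination Host Unreachable",
    "From 203.99.50.201 icmp_seq=5 Destination Host Unreachable",
    "108 bytes from nasa.nexlinx.net.pk (202.59.80.52): icmp_seq=6 ttl=243 time=57.9 ms",
]

samples = parse_ping(demo)
slow_runs = episodes(samples, lambda rtt: rtt is not None and rtt > 200.0)
unreachable_runs = episodes(samples, lambda rtt: rtt is None)

if __name__ == "__main__":
    print("high-RTT runs:", slow_runs)
    print("unreachable runs:", unreachable_runs)
```

Since the probes were sent one second apart, the length of a run in sequence numbers is also its approximate duration in seconds.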
We imported the ping data into Excel and looked at the time series:
There are many extended periods with large RTTs which are NOT associated with packet loss. A frequency histogram of the RTTs is seen below:
Note the log-log scale. Due to the long-range behavior of RTTs, the distribution is expected to be heavy-tailed, following a straight line to the right of the peak. It clearly does not show this type of distribution, being fairly flat from 200 to 1000 ms. This may be partly caused by the DSL router buffers, which are typically configured to be very large (presumably to reduce packet loss). We can estimate the buffer size by taking the queuing time as maximum RTT - minimum RTT and the link speed from NIIT to off-site as 2 Mbits/s, which gives about 256 KBytes of buffering.
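The buffer estimate is a delay times bandwidth product: bytes queued = queuing delay x link rate / 8. The sketch below works through the arithmetic with the 2 Mbits/s link speed assumed in the text and the min/max RTTs from the ping summary. The function names are illustrative. Which queuing delay the ~256 KBytes figure was derived from is not stated; the numbers show that it corresponds to roughly one second of queuing, while the full max - min spread of the summary would give about twice that.

```python
LINK_BPS = 2_000_000          # link speed assumed in the text: 2 Mbits/s
RTT_MIN_MS = 55.763           # from the ping summary
RTT_MAX_MS = 2099.976         # from the ping summary


def buffer_bytes(queue_delay_s, link_bps):
    """Bytes that drain from a queue in queue_delay_s at link_bps."""
    return queue_delay_s * link_bps / 8


def queue_delay_s(buffer_size_bytes, link_bps):
    """Time needed to drain buffer_size_bytes at link_bps."""
    return buffer_size_bytes * 8 / link_bps


# Queue size implied by the full spread between slowest and fastest ping.
full_spread = buffer_bytes((RTT_MAX_MS - RTT_MIN_MS) / 1000, LINK_BPS)

# Queue size implied by one second of queuing delay.
one_second = buffer_bytes(1.0, LINK_BPS)

# Queuing delay that a 256 KByte (262144 byte) buffer would add.
delay_256k = queue_delay_s(256 * 1024, LINK_BPS)

if __name__ == "__main__":
    print(f"max - min spread : {full_spread / 1000:.0f} kB")
    print(f"1 s of queuing   : {one_second / 1000:.0f} kB")
    print(f"256 KBytes drains in {delay_256k:.2f} s")
```

Either way the estimate is in the hundreds of kilobytes, which supports the point that the buffers are large enough to hold a second or more of traffic at this link speed.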
Follow Up
We were fortunate to receive assistance from an operations manager at Nayatel and Micronet. Hop 2 in the traceroute is a Micronet Broadband Remote Access Server (BRAS) on which the DSL customer's [NUST/NIIT] PPPoE session is terminated. It is a Cisco 7206VXR.
Pings and a traceroute to nasa.nexlinx.net.pk from a Nayatel core router confirm that the long delays are not in the core network.
NYT-IBA-RTR7609#trace 202.59.80.52
Type escape sequence to abort.
Tracing the route to nasa.nexlinx.net.pk (202.59.80.52)
  1 58-65-175-221.nayatel.pk (58.65.175.221) 0 msec
    58-65-175-217.nayatel.pk (58.65.175.217) 4 msec
    58-65-175-221.nayatel.pk (58.65.175.221) 0 msec
  2 f0-gw3.nayatel.pk (58.65.166.250) 0 msec 0 msec
    f1-gw3.nayatel.pk (58.65.166.254) 0 msec
  3 rwp44.pie.net.pk (202.125.155.93) 0 msec 0 msec 0 msec
  4 rwp44.pie.net.pk (202.125.148.164) 4 msec 0 msec 4 msec
  5 pos1-3.lhe63gsrc2.pie.net.pk (202.125.159.39) [MPLS: Label 453 Exp 0] 4 msec 8 msec 4 msec
  6 lhr63.pie.net.pk (202.125.138.148) 8 msec 4 msec 8 msec
  7 lhr63.pie.net.pk (202.125.147.6) 8 msec 4 msec 8 msec
  8 nasa.nexlinx.net.pk (202.59.80.52) 8 msec 8 msec 8 msec
NYT-IBA-RTR7609#
NYT-IBA-RTR7609#ping
Protocol [ip]:
Target IP address: 202.59.80.52
Repeat count [5]: 100
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 202.59.80.52, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (100/100), round-trip min/avg/max = 4/9/32 ms
NYT-IBA-RTR7609#