
...

First, a little clarification. The issue is between a researcher on the Stanford net and a service at ANL. I talked with Les to see about a contact on ESnet, since the traceroutes I can run show a lot of ESnet hops.

This is the traceroute I get from a Stanford PS (outside our firewall
infrastructure) to ANL's public PS:

[]$ traceroute ndt.anl.gov
traceroute to ndt.anl.gov (146.137.222.101), 30 hops max, 40 byte packets
1 frcfa-rtr-vlan817.Stanford.EDU (171.67.92.18) 0.215 ms 0.175 ms 0.157 ms
2 boundarya-rtr-vlan8.SUNet (171.64.255.210) 0.326 ms 0.394 ms 0.457 ms
3 hpra-rtr.Stanford.EDU (171.66.0.33) 0.234 ms 0.236 ms 0.240 ms
4 hpr-svl-hpr2--stanford-10ge.cenic.net (137.164.27.61) 1.105 ms 1.172 ms 1.233 ms
5 hpr-esnet--svl-hpr2-100ge.cenic.net (137.164.26.10) 1.571 ms 1.647 ms 22.232 ms
6 sacrcr5-ip-a-sunncr5.es.net (134.55.40.5) 4.096 ms 3.827 ms 3.833 ms
7 denvcr5-ip-a-sacrcr5.es.net (134.55.50.202) 25.048 ms 25.113 ms 25.369 ms
8 kanscr5-ip-a-denvcr5.es.net (134.55.49.58) 35.620 ms 35.711 ms 35.961 ms
9 chiccr5-ip-a-kanscr5.es.net (134.55.43.81) 46.647 ms 46.729 ms 46.985 ms
10 starcr5-ip-a-chiccr5.es.net (134.55.42.42) 46.977 ms 46.987 ms 47.242 ms
11 anl-ip-a-anlcr5.es.net (198.124.218.6) 59.069 ms 59.078 ms 59.088 ms
12 * * *
13 * * *

I understand your point about CENIC handing off to ESnet, so transfers from Stanford to ANL are good. If ANL hands traffic straight to I2 back in IL, then I can see that the path would be different. Thus the asymmetry. (Do you have a way to get a traceroute from ANL to my Stanford PS server, just to document the different routing?)
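Since documenting the reverse path keeps coming up, here is a small sketch of how two traceroute outputs could be compared hop-by-hop to see where the paths diverge. The parsing helper is my own illustration, not anything from this thread; the sample hops are abbreviated from the forward traceroute above, and a reverse traceroute from ANL would be parsed the same way.

```python
import re

def hops(traceroute_output):
    """Extract the first responding hostname/IP from each traceroute hop line."""
    found = []
    for line in traceroute_output.splitlines():
        m = re.match(r"\s*\d+\s+(\S+)", line)
        if m and m.group(1) != "*":   # skip non-responding hops ("* * *")
            found.append(m.group(1))
    return found

# Abbreviated hop list from the forward (Stanford -> ANL) traceroute above.
forward = hops(" 1 frcfa-rtr-vlan817.Stanford.EDU (171.67.92.18) 0.215 ms\n"
               " 6 sacrcr5-ip-a-sunncr5.es.net (134.55.40.5) 4.096 ms\n")
print(forward)
```

Hops that appear in one direction's list but not the other mark where the two routes split, which is exactly the asymmetry we want to document.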

In talking with contacts at ANL, they report that anlborder-ps.it.anl.gov is actually within a firewall. Here are some details they suggest:

ndt.anl.gov – outside the ANL firewall
anlborder-ps.it.anl.gov – behind the ANL firewall
prfsnr.aps.anl.gov – behind the APS firewall within ANL
perfsonar.gmca.aps.anl.gov – on the GMCA (i.e. our group) subnet

Again, your comment about traffic to anlborder-ps.it.anl.gov being slower makes sense.

What still has me stumped is this throughput graph from anlborder-ps.it.anl.gov to a SLAC PS.
http://anlborder-ps.it.anl.gov/serviceTest/bandwidthGraph.cgi?url=http://localhost:8085/perfSONAR_PS/services/pSB&key=f900976535c9051c3d9251e0301335c2&sTime=1367294236&eTime=1369886236&keyR=f6122888ee44db6d8691acc7bc37a2dd&src=slac-pt1.es.net&dst=anlborder-ps.it.anl.gov&srcIP=198.129.254.134&dstIP=130.202.222.58

This appears to be all ESnet, yet it is asymmetric, and by more than a firewall would impose, I think.

So my summary is: I can understand the asymmetry between Stanford and ANL due to the routing. Is there any hope of getting the two rates (or routes) closer to one another, perhaps by having the ANL traffic take the I2 route in both directions (or ESnet in both directions)? It seems like politics to make either of those things happen. I ask because the tool the researcher uses is one of those simple NX clients, which doesn't work well when the speeds in the two directions are that far apart. Related, but now more informational: why the ANL-SLAC traffic, which stays on ESnet, also seems to show asymmetry, for unknown reasons.

...

Here are the prefixes ESnet is receiving from Stanford (AS32), via a direct 100G peering with CENIC (AS2153):

128.12.0.0/16   *[] 7w1d 16:06:45, MED 10, localpref 4000,
                   from 134.55.200.146
                   AS path: 2153 32 I
                 > to 134.55.49.1 via xe-1/3/0.5
171.64.0.0/14   *[] 7w1d 16:06:45, MED 10, localpref 4000,
                   from 134.55.200.146
                   AS path: 2153 32 I
                 > to 134.55.49.1 via xe-1/3/0.5

2607:f6d0::/32  *[] 7w1d 16:07:51, MED 10, localpref 4000,
                   from 134.55.200.146
                   AS path: 2153 32 I
                 > to fe80::ea4:2ff:fe5b:3401 via xe-1/3/0.5

If you start preferring these routes from ESnet, we should end up with a
symmetric path from ANL to Stanford and back. That may not solve the
throughput issues, but it will give us a place to start troubleshooting.
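For concreteness, the kind of change being asked for might look roughly like the following on a Junos router. This is only a sketch: the policy and term names are hypothetical, and only the prefixes and the localpref value come from the route output above; the actual ESnet policy would differ.

```
policy-options {
    policy-statement PREFER-CENIC-STANFORD {   /* hypothetical policy name */
        term stanford-prefixes {
            from {
                route-filter 128.12.0.0/16 exact;
                route-filter 171.64.0.0/14 exact;
            }
            then {
                local-preference 4000;   /* matches the localpref seen above */
                accept;
            }
        }
    }
}
```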

Thanks also for the clarification on anlborder-ps. I'll run some more
tests today and send the results.

...

Hi Linda:

Thanks for the offer! I can send you other parts of the email thread to catch you up, but we're trying to troubleshoot an issue between the Stanford campus and ANL. I am not yet sure whether the issue is within ANL or not. We seem to have a clean path between the ANL ESnet router and the SLAC and Sunnyvale ESnet routers, but there are routing asymmetries between ANL proper and Stanford that are making it harder to isolate the issue. I brought in Corey because I wanted to understand how the various perfSONAR nodes were set up, and I was having some problems using ndt.anl.gov.

What we're seeing is degraded throughput, mainly in the ANL to Stanford direction. Phil has confirmed that this is Stanford campus and not SLAC, and the campus is connected via CENIC. Currently, ANL gets to Stanford via MREN to Internet2 to CENIC to Stanford. (Note that this is based on an inference I am making by doing a reverse traceroute from anlborder-ps.it.anl.gov to UC Berkeley, since I still have access to machines there. If anyone at ANL wants to provide some traceroutes to Stanford, e.g. rcf-perfsonar.stanford.edu, that would be great.) Traffic in the Stanford to ANL direction goes via the 100G CENIC-ESnet peering in Sunnyvale and then stays on ESnet until arriving at ANL.

Corey, how is anlborder-ps connected? Does it connect directly to your border router? What kind of router is it?

Here's what I am seeing between the anlborder-ps and ESnet's pS box near
SLAC:

bwctl -L 1500 -i 2 -t 30 -f m -c anlborder-ps.it.anl.gov
bwctl: Using tool: iperf
bwctl: 47 seconds until test results available

RECEIVER START
bwctl: exec_line: iperf -B 130.202.222.58 -s -f m -m -p 5001 -t 30 -i 2
bwctl: start_tool: 3579008575.131517
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address 130.202.222.58
TCP window size: 0.08 MByte (default)
------------------------------------------------------------
[ 14] local 130.202.222.58 port 5001 connected with 198.129.254.134 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 14]  0.0- 2.0 sec  11.1 MBytes  46.4 Mbits/sec
[ 14]  2.0- 4.0 sec   190 MBytes   796 Mbits/sec
[ 14]  4.0- 6.0 sec   504 MBytes  2115 Mbits/sec
[ 14]  6.0- 8.0 sec   513 MBytes  2153 Mbits/sec
[ 14]  8.0-10.0 sec   504 MBytes  2115 Mbits/sec
[ 14] 10.0-12.0 sec   512 MBytes  2146 Mbits/sec
[ 14] 12.0-14.0 sec   507 MBytes  2127 Mbits/sec
[ 14] 14.0-16.0 sec   509 MBytes  2135 Mbits/sec
[ 14] 16.0-18.0 sec   510 MBytes  2141 Mbits/sec
[ 14] 18.0-20.0 sec   507 MBytes  2127 Mbits/sec
[ 14] 20.0-22.0 sec   511 MBytes  2143 Mbits/sec
[ 14] 22.0-24.0 sec   507 MBytes  2125 Mbits/sec
[ 14] 24.0-26.0 sec   510 MBytes  2140 Mbits/sec
[ 14] 26.0-28.0 sec   508 MBytes  2132 Mbits/sec
[ 14] 28.0-30.0 sec   509 MBytes  2136 Mbits/sec
[ 14]  0.0-30.1 sec  6833 MBytes  1906 Mbits/sec
[ 14] MSS size 1460 bytes (MTU 1500 bytes, ethernet)
bwctl: stop_exec: 3579008620.204830

RECEIVER END

[]$ bwctl -L 1500 -i 2 -t 30 -f m -s anlborder-ps.it.anl.gov
bwctl: Using tool: iperf
bwctl: 158 seconds until test results available

RECEIVER START
bwctl: exec_line: /usr/bin/iperf -B 198.129.254.134 -s -f m -m -p 5093 -t 30 -i 2
bwctl: start_tool: 3579008749.906801
------------------------------------------------------------
Server listening on TCP port 5093
Binding to local address 198.129.254.134
TCP window size: 0.08 MByte (default)
------------------------------------------------------------
[ 15] local 198.129.254.134 port 5093 connected with 130.202.222.58 port 5093
[ ID] Interval       Transfer     Bandwidth
[ 15]  0.0- 2.0 sec  2.69 MBytes  11.3 Mbits/sec
[ 15]  2.0- 4.0 sec  4.57 MBytes  19.2 Mbits/sec
[ 15]  4.0- 6.0 sec  25.9 MBytes   109 Mbits/sec
[ 15]  6.0- 8.0 sec  26.7 MBytes   112 Mbits/sec
[ 15]  8.0-10.0 sec  17.8 MBytes  74.5 Mbits/sec
[ 15] 10.0-12.0 sec  60.2 MBytes   253 Mbits/sec
[ 15] 12.0-14.0 sec   197 MBytes   828 Mbits/sec
[ 15] 14.0-16.0 sec  81.4 MBytes   342 Mbits/sec
[ 15] 16.0-18.0 sec   219 MBytes   917 Mbits/sec
[ 15] 18.0-20.0 sec   467 MBytes  1957 Mbits/sec
[ 15] 20.0-22.0 sec   495 MBytes  2078 Mbits/sec
[ 15] 22.0-24.0 sec   495 MBytes  2076 Mbits/sec
[ 15] 24.0-26.0 sec   500 MBytes  2098 Mbits/sec
[ 15] 26.0-28.0 sec   494 MBytes  2072 Mbits/sec
[ 15] 28.0-30.0 sec   494 MBytes  2072 Mbits/sec
[ 15]  0.0-30.0 sec  3583 MBytes  1002 Mbits/sec
[ 15] MSS size 1460 bytes (MTU 1500 bytes, ethernet)
bwctl: stop_exec: 3579008795.945836

RECEIVER END

Traffic in the slac-pt1 --> anl direction looks reasonable, but traffic in the opposite direction shows evidence of very slow slow-start and maybe even some packet loss. I'd like to see how the pS box is connected so we can get to the root of the throughput asymmetry.
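To put a rough number on the slow-start impression, here is a quick sketch that takes the 2-second interval samples from the two runs above and computes how long each direction takes to first reach 90% of its peak rate. The helper function and the 90% threshold are my own illustration, not part of the thread; only the Mbit/s values are copied from the iperf output.

```python
# Interval throughput samples (Mbits/sec), copied from the two bwctl runs above.
slac_to_anl = [46.4, 796, 2115, 2153, 2115, 2146, 2127, 2135, 2141, 2127,
               2143, 2125, 2140, 2132, 2136]
anl_to_slac = [11.3, 19.2, 109, 112, 74.5, 253, 828, 342, 917, 1957,
               2078, 2076, 2098, 2072, 2072]

def seconds_to_steady(samples, interval=2, fraction=0.9):
    """Return the elapsed time until throughput first reaches `fraction` of the peak."""
    target = fraction * max(samples)
    for i, mbps in enumerate(samples):
        if mbps >= target:
            return (i + 1) * interval
    return None

print(seconds_to_steady(slac_to_anl))  # 6  -> steady state within ~6 s
print(seconds_to_steady(anl_to_slac))  # 20 -> ~20 s of ramp-up, consistent
                                       #       with a lossy slow-start
```

The tenfold difference in ramp-up time is what drags the anl-->slac 30-second average down to about half the other direction's, even though both eventually reach a similar plateau.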

thanks,
michael

...