I have been looking at various perfsonar nodes in an effort to track down the issues that SSRL is experiencing with throughput to ANL.

You're correct to note that the routing between Stanford Campus and ANL is asymmetric. CENIC prefers to hand traffic bound for ANL off to ESnet at the 100G peering at Sunnyvale, while ANL prefers the path through MREN directly to Internet2. In other words, the ANL-->Stanford Campus path never touches ESnet. I can also see from the pS toolkit web interface at ANL that there are similar issues between anlborder-ps and the CENIC pS machines. This suggests to me that there is a more general issue between anlborder-ps and the rest of the world (lack of queue depth on the immediate upstream switch or router is one possibility). That same issue could be affecting the ANL-->SSRL throughput.

It's a bit harder to see things from the SLAC end. Throughput tests between ESnet's pS boxes at the SLAC ESnet router (slac-pt1) and at the ANL ESnet router (anl-pt1) look really good. You can see the overall picture here:

https://my.es.net/network/performance/bwctl

Things degrade somewhat when we take one step inside the ANL border. I see significantly worse performance (but not horrible on an absolute scale) between slac-pt1.es.net and anlborder-ps.it.anl.gov than between slac-pt1.es.net and anl-pt1.es.net. I also see better performance in the direction toward anlborder-ps than in the opposite direction, but I see really good performance in BOTH directions between slac-pt1 and anl-pt1.

It's harder to see things on the SLAC side because the only perfsonar host is on the Stanford campus and the outbound routing is different between campus and SLAC. Also, I am noticing the same outbound issues between the Stanford pS host and CENIC pS hosts that we are seeing between Stanford and ESnet hosts. It looks like there may be an outbound issue with the Stanford pS host as well.

So I think we need to take two steps here: One is to try to figure out why there seems to be some outbound throughput issue at ANL (at least with their perfsonar box); the other is to get a perfsonar box (even a temporary toolkit box that we can test with) deployed within SLAC, as close to SSRL as possible. That will give us a chance to test different parts of the (almost) end-to-end path.

Yee responded with the SLAC perfSONAR hosts:

Machines at our cores are

psnr-serv01.slac.stanford.edu
psnr-farm04.slac.stanford.edu
psnr-farm10.slac.stanford.edu

We are in the midst of updating some security policies, so I'm not sure if all tests are allowed at the moment.

Response from Stanford.

First, a little clarification. The issue is between a researcher on the Stanford net and a service at ANL. I talked with Les to see about a contact at ESnet, since the traceroute I can run shows a lot of ESnet hops.

This is the traceroute I get from a Stanford PS (outside our firewall
infrastructure) to ANL's public PS:

[]$ traceroute ndt.anl.gov
traceroute to ndt.anl.gov (146.137.222.101), 30 hops max, 40 byte packets
1  frcfa-rtr-vlan817.Stanford.EDU (171.67.92.18)  0.215 ms  0.175 ms 0.157 ms
2  boundarya-rtr-vlan8.SUNet (171.64.255.210)  0.326 ms  0.394 ms 0.457 ms
3  hpra-rtr.Stanford.EDU (171.66.0.33)  0.234 ms  0.236 ms  0.240 ms
4  hpr-svl-hpr2--stanford-10ge.cenic.net (137.164.27.61)  1.105 ms 1.172 ms  1.233 ms
5  hpr-esnet--svl-hpr2-100ge.cenic.net (137.164.26.10)  1.571 ms 1.647 ms  2.232 ms
6  sacrcr5-ip-a-sunncr5.es.net (134.55.40.5)  4.096 ms  3.827 ms 3.833 ms
7  denvcr5-ip-a-sacrcr5.es.net (134.55.50.202)  25.048 ms  25.113 ms 25.369 ms
8  kanscr5-ip-a-denvcr5.es.net (134.55.49.58)  35.620 ms  35.711 ms 35.961 ms
9  chiccr5-ip-a-kanscr5.es.net (134.55.43.81)  46.647 ms  46.729 ms 46.985 ms
10  starcr5-ip-a-chiccr5.es.net (134.55.42.42)  46.977 ms  46.987 ms 47.242 ms
11  anl-ip-a-anlcr5.es.net (198.124.218.6)  59.069 ms  59.078 ms 59.088 ms
12  * * *
13  * * *
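
To make the hand-offs along this forward path easier to see, the hop list above can be grouped by the network each router name appears to belong to. The following is only a rough sketch: the suffix-to-network mapping (including treating the ".SUNet" name as Stanford) and the helper names are assumptions made for illustration, and the hop lines are copied from the traceroute above.

```python
import re

# Hop lines copied from the traceroute above; hops 12-13 ("* * *") are omitted.
TRACE = """\
1  frcfa-rtr-vlan817.Stanford.EDU (171.67.92.18)  0.215 ms  0.175 ms 0.157 ms
2  boundarya-rtr-vlan8.SUNet (171.64.255.210)  0.326 ms  0.394 ms 0.457 ms
3  hpra-rtr.Stanford.EDU (171.66.0.33)  0.234 ms  0.236 ms  0.240 ms
4  hpr-svl-hpr2--stanford-10ge.cenic.net (137.164.27.61)  1.105 ms 1.172 ms  1.233 ms
5  hpr-esnet--svl-hpr2-100ge.cenic.net (137.164.26.10)  1.571 ms 1.647 ms  2.232 ms
6  sacrcr5-ip-a-sunncr5.es.net (134.55.40.5)  4.096 ms  3.827 ms 3.833 ms
7  denvcr5-ip-a-sacrcr5.es.net (134.55.50.202)  25.048 ms  25.113 ms 25.369 ms
8  kanscr5-ip-a-denvcr5.es.net (134.55.49.58)  35.620 ms  35.711 ms 35.961 ms
9  chiccr5-ip-a-kanscr5.es.net (134.55.43.81)  46.647 ms  46.729 ms 46.985 ms
10  starcr5-ip-a-chiccr5.es.net (134.55.42.42)  46.977 ms  46.987 ms 47.242 ms
11  anl-ip-a-anlcr5.es.net (198.124.218.6)  59.069 ms  59.078 ms 59.088 ms
"""

# Matches: hop number, router name, IPv4 address, first RTT in ms.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+\(([\d.]+)\)\s+([\d.]+) ms")

# Assumed mapping from hostname suffix to network name (illustrative only).
SUFFIXES = [
    (".es.net", "ESnet"),
    (".cenic.net", "CENIC"),
    (".stanford.edu", "Stanford"),
    (".sunet", "Stanford"),
]


def network_of(hostname):
    """Guess the operating network from the router's hostname suffix."""
    name = hostname.lower()
    for suffix, network in SUFFIXES:
        if name.endswith(suffix):
            return network
    return "unknown"


def parse_hops(trace):
    """Return (hop number, network, first RTT in ms) for each responding hop."""
    hops = []
    for line in trace.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append((int(m.group(1)), network_of(m.group(2)), float(m.group(4))))
    return hops


def first_hop_per_network(hops):
    """Map each network to the first hop number at which it appears."""
    first = {}
    for number, network, _rtt in hops:
        first.setdefault(network, number)
    return first


if __name__ == "__main__":
    for network, hop in first_hop_per_network(parse_hops(TRACE)).items():
        print(f"{network}: first seen at hop {hop}")
```

Running the same grouping on a traceroute taken in the ANL-->Stanford direction, once one is available, would be one way to document where the two paths diverge.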

I understand your point about CENIC handing off to ESnet, so transfers from Stanford to ANL are good. If ANL hands traffic right to I2 back in IL, then I can see that path would be different. Thus asymmetry. (Do you have a way to get a traceroute from ANL to my Stanford PS server, just to document the different routing?)

In talking with contacts at ANL, they report that anlborder-ps.it.anl.gov is actually within a firewall. Here are some details they suggest:

ndt.anl.gov – outside the ANL firewall
anlborder-ps.it.anl.gov – behind the ANL firewall
prfsnr.aps.anl.gov – behind the APS firewall within ANL
perfsonar.gmca.aps.anl.gov – on the GMCA (i.e. our group) subnet

Again your comment about traffic to anlborder-ps.it.anl.gov being slower makes sense.

What still has me stumped is this throughput graph from anlborder-ps.it.anl.gov to a SLAC PS.
http://anlborder-ps.it.anl.gov/serviceTest/bandwidthGraph.cgi?url=http://localhost:8085/perfSONAR_PS/services/pSB&key=f900976535c9051c3d9251e0301335c2&sTime=1367294236&eTime=1369886236&keyR=f6122888ee44db6d8691acc7bc37a2dd&src=slac-pt1.es.net&dst=anlborder-ps.it.anl.gov&srcIP=198.129.254.134&dstIP=130.202.222.58

This appears to be all ESnet, yet it is asymmetrical, and by more than a firewall would impose, I think.
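
As a small aid to reading the graph link above, its query string can be decoded to see which endpoints and which time window the graph covers. This is just a sketch using Python's standard library; the URL is copied from above, and the interpretation of sTime and eTime as Unix timestamps in seconds is an assumption based on their magnitude.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

# Graph URL copied verbatim from the message above.
GRAPH_URL = (
    "http://anlborder-ps.it.anl.gov/serviceTest/bandwidthGraph.cgi"
    "?url=http://localhost:8085/perfSONAR_PS/services/pSB"
    "&key=f900976535c9051c3d9251e0301335c2"
    "&sTime=1367294236&eTime=1369886236"
    "&keyR=f6122888ee44db6d8691acc7bc37a2dd"
    "&src=slac-pt1.es.net&dst=anlborder-ps.it.anl.gov"
    "&srcIP=198.129.254.134&dstIP=130.202.222.58"
)


def describe_graph(url):
    """Extract endpoints and the time window from the graph URL's query string."""
    params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    # Assumption: sTime/eTime are Unix timestamps in seconds.
    start = datetime.fromtimestamp(int(params["sTime"]), tz=timezone.utc)
    end = datetime.fromtimestamp(int(params["eTime"]), tz=timezone.utc)
    return {
        "src": params["src"],
        "dst": params["dst"],
        "start": start,
        "end": end,
        "days": (end - start).total_seconds() / 86400,
    }


if __name__ == "__main__":
    info = describe_graph(GRAPH_URL)
    print(f"{info['src']} -> {info['dst']}")
    print(f"{info['start'].isoformat()} to {info['end'].isoformat()} ({info['days']:g} days)")
```

Under that assumption, the link describes tests between slac-pt1.es.net and anlborder-ps.it.anl.gov over a 30-day window.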

So my summary is: I can understand the asymmetry between Stanford and ANL due to the routing. Is there any hope of getting the two rates (or routes) closer to one another? Somehow have the ANL traffic take the I2 route in both directions, or take ESnet in both directions (it seems like it would take politics to make either of those things happen). I ask because the tool the researcher uses is one of those simple NX clients, which doesn't work well when the speeds in the two directions are that far apart. Related, but now more informational, is why the ANL-SLAC traffic, which is all over ESnet, seems to be asymmetric for unknown reasons.

...

Here are the prefixes ESnet is getting from Stanford (AS32), via a direct 100G peering with CENIC (AS2152):

128.12.0.0/16      *[BGP/170] 7w1d 16:06:45, MED 10, localpref 4000, from 134.55.200.146
                      AS path: 2153 32 I
                    > to 134.55.49.1 via xe-1/3/0.5
171.64.0.0/14      *[BGP/170] 7w1d 16:06:45, MED 10, localpref 4000, from 134.55.200.146
                      AS path: 2153 32 I
                    > to 134.55.49.1 via xe-1/3/0.5

2607:f6d0::/32     *[BGP/170] 7w1d 16:07:51, MED 10, localpref 4000, from 134.55.200.146
                      AS path: 2153 32 I
                    > to fe80::ea4:2ff:fe5b:3401 via xe-1/3/0.5

If you start preferring these routes from ESnet, we should end up with a
symmetric path from ANL to Stanford and back. That may not solve the
throughput issues, but it will give us a place to start troubleshooting.
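
As a quick sanity check on the prefixes above, one can confirm which of them would cover a given Stanford address, for example the Stanford-side router addresses from the earlier traceroute. The sketch below uses Python's standard ipaddress module; it only tests address containment and says nothing about the BGP policy actually applied at ANL. The helper name covering_prefix is made up for illustration.

```python
import ipaddress

# Prefixes listed above as received from Stanford via CENIC.
ADVERTISED = [
    ipaddress.ip_network(p)
    for p in ("128.12.0.0/16", "171.64.0.0/14", "2607:f6d0::/32")
]

# Stanford-side hop addresses taken from the traceroute earlier in the thread.
STANFORD_HOPS = ["171.67.92.18", "171.64.255.210", "171.66.0.33"]


def covering_prefix(addr):
    """Return the longest listed prefix containing addr, or None if there is none."""
    ip = ipaddress.ip_address(addr)
    matches = [n for n in ADVERTISED if n.version == ip.version and ip in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)


if __name__ == "__main__":
    for addr in STANFORD_HOPS:
        print(f"{addr} -> {covering_prefix(addr)}")
```

An address that falls outside all three prefixes (the helper returns None) would not be affected by preferring these routes.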

Thanks also for the clarification on anlborder-ps. I'll run some more
tests today and send the results.