See here for the report from the ESnet NOC. On the morning of August 16th, the EQX-CHI-CR5 router began to clock ingress FCS errors on port 1/1/1 (facing CHIC-CR5). 

  • Incident Start Time:  08/16/2019 06:27:51 PDT AKA 13:27:51 UTC
  • Incident Resolution Time:  08/20/2019 15:31:00 AKA  21:31:00 UTC

If we look at the PingER losses from, say, SLAC to Europe (see https://www.slac.stanford.edu/cgi-bin/pingtable.pl?file=packet_loss&by=by-node&size=100&tick=hourly&year=2019&month=08&day=16&from=EDU.SLAC.STANFORD.PINGER&to=WORLD&ex=none&only=all&ipv=all&dataset=hep&percentage=any), one can see that for the UTC hour 14:00-15:00 the number of hosts in Europe responding to pings from SLAC dropped to 37, from 97 or 98 for the rest of the day, i.e. the number of responding hosts dropped by a dramatic 62% or so. A typical traceroute from SLAC to Europe normally goes via the EQX-CHI-CR5 router; for example, see the traceroute below for DESY, in particular hop 10.

[cottrell@pinger ~]$ traceroute -m 30 -q 1 -w 1 -A ping.desy.de
traceroute to 131.169.5.47 (131.169.5.47), 30 hops max, 140 byte packets
 1  rtr-servcore1-serv01-webserv.slac.stanford.edu (134.79.197.130) [AS3671]  0.585 ms
 2  rtr-core2-p2p-serv01-02.slac.stanford.edu (134.79.253.253) [AS3671]  0.588 ms
 3  rtr-fwcore1-trust-p2p-core1.slac.stanford.edu (134.79.254.134) [AS3671]  1.127 ms
 4  rtr-core1-p2p-fwcore1-untrust.slac.stanford.edu (134.79.254.137) [AS3671]  1.440 ms
 5  rtr-border2-7k-core2.slac.stanford.edu (134.79.252.185) [AS3671]  1.308 ms
 6  sunncr5-ip-c-slac.slac.stanford.edu (192.68.191.233) [AS3671]  1.928 ms
 7  134.55.40.145 (134.55.40.145) [AS293]  4.250 ms
 8  134.55.40.150 (134.55.40.150) [AS293]  46.466 ms
 9  washcr5-ip-a-chiccr5.es.net (134.55.36.46) [AS293]  63.502 ms
10  eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53) [AS293]  63.013 ms
11  londcr5-ip-c-aofacr5.es.net (134.55.37.22) [AS293]  138.596 ms
12  amstcr5-ip-a-londcr5.es.net (134.55.222.13) [AS293]  144.920 ms
13  newycr5-ip-a-aofacr5.es.net (134.55.37.78) [AS293]  68.714 ms
14  londcr5-sdn-a-newycr5.es.net (134.55.220.21) [AS293]  137.348 ms
15  amstcr5-ip-a-londcr5.es.net (134.55.222.13) [AS293]  143.881 ms
16  esnet-eex-gw.ams.nl.geant.net (62.40.124.105) [AS21320/AS20965]  145.115 ms
17  cr-tub2-be10.x-win.dfn.de (188.1.146.210) [AS680]  165.508 ms
18  kr-des91.x-win.dfn.de (188.1.231.114) [AS680]  169.408 ms

Looking at the time series of the individual ping measurements made every 30 minutes to this target from SLAC, the last lossless data point for 100 Byte pings was at 13:19:53 and the next lossless one was measured at 15:52:52. The data for the 1000 Byte pings is shown below.

Unix epoch, UTC,           min-RTT,avg-RTT,max-RTT,Loss
1565961608,8/16/2019 13:20,169.507,169.684,169.799,0
1565963682,8/16/2019 13:54,0,0,0,100
1565965667,8/16/2019 14:27,0,0,0,100
1565967504,8/16/2019 14:58,0,0,0,100
1565969272,8/16/2019 15:27,0,0,0,100
1565970782,8/16/2019 15:53,169.469,169.609,169.713,0
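
The bracketing of the outage by the last and next lossless samples can be sketched in a few lines of Python. This is only an illustration using the rows of the table above; the function name `outage_windows` and the 100% loss threshold are assumptions, not part of PingER.

```python
import csv
import io

# Rows copied from the 1000 Byte ping table above:
# epoch, UTC time, min-RTT, avg-RTT, max-RTT, loss (%)
DATA = """\
1565961608,8/16/2019 13:20,169.507,169.684,169.799,0
1565963682,8/16/2019 13:54,0,0,0,100
1565965667,8/16/2019 14:27,0,0,0,100
1565967504,8/16/2019 14:58,0,0,0,100
1565969272,8/16/2019 15:27,0,0,0,100
1565970782,8/16/2019 15:53,169.469,169.609,169.713,0
"""


def outage_windows(text, loss_threshold=100.0):
    """Return (last_good_epoch, first_good_epoch) pairs bracketing each run of lost samples."""
    rows = [(int(r[0]), float(r[5])) for r in csv.reader(io.StringIO(text)) if r]
    windows = []
    last_good = None
    in_outage = False
    for epoch, loss in rows:
        if loss >= loss_threshold:
            in_outage = True
        else:
            if in_outage:
                windows.append((last_good, epoch))
                in_outage = False
            last_good = epoch
    if in_outage:
        windows.append((last_good, None))  # outage still open at the end of the data
    return windows


windows = outage_windows(DATA)
for start, end in windows:
    print(f"outage lies between epoch {start} and {end}: at most {end - start} s")
```

With samples roughly every 30 minutes, this only bounds the outage as seen from SLAC; the true start and end lie somewhere inside the bracket.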

If we take the hourly packet loss data measured from SLAC to the world for 2019/8/16, cut and paste it into Excel, reverse sort on hour 14 and remove all hosts that had a measured loss (i.e. not recorded as a .), then sort on hour 13 and remove all hosts that did not have a measured loss (i.e. recorded as a .), we are left with the hosts that responded during hour 13 but not during hour 14. The spreadsheet then shows that the main impact is on Europe including the Balkans (37 countries), Africa (23 countries), the Middle East (9 countries), Central Asia (8 countries), S.E. Asia (7 countries), Latin America (7 countries), S. Asia (2 countries, BD & MV) and E. Asia (1 country, CN).

Further sorting (in order) TLD, Region and (reverse sort) 13:00 helps identify the impacted targets.
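
The spreadsheet filtering described above can also be expressed as a short Python sketch. The row layout (a dict with `host`, `region` and a 24-entry `hourly` list) and the example hostnames are hypothetical; only the use of "." for an hour with no measured loss comes from the PingER table.

```python
from collections import Counter

MISSING = "."  # PingER table marker for an hour with no measured loss (no response)


def newly_unresponsive(rows, before_hour=13, during_hour=14):
    """Keep hosts with a measured loss at before_hour but none at during_hour."""
    return [r for r in rows
            if r["hourly"][before_hour] != MISSING
            and r["hourly"][during_hour] == MISSING]


def make_row(host, region, h13, h14):
    """Build an illustrative row; all other hours are set to 0% loss."""
    hourly = ["0"] * 24
    hourly[13], hourly[14] = h13, h14
    return {"host": host, "region": region, "hourly": hourly}


# Hypothetical sample rows, not real PingER data.
rows = [
    make_row("a.example.de", "Europe", "0", "."),     # responded at 13, silent at 14: kept
    make_row("b.example.za", "Africa", ".", "."),     # already silent at 13: dropped
    make_row("c.example.jp", "East Asia", "0", "2"),  # still responding at 14: dropped
]

impacted = newly_unresponsive(rows)
by_region = Counter(r["region"] for r in impacted)
print(by_region)
```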

It makes sense that traffic from SLAC to Europe, the Middle East and Central Asia would go through Chicago. This is not so clear for Latin America and China. I looked at the traceroutes to see if the route goes through a router with the name eqxchicr5 embedded in it.

  • There are PingER target hosts in China. I would expect those routes to go West from SLAC across the Pacific and hence not see the eqxchicr5 router. I looked at several that show up in the spreadsheet and indeed their traceroute* does not indicate a router with the name eqxchicr5. It is also noticeable that all of these Chinese targets show heavy losses during the rest of the day, and such heavy losses are not the general case for Chinese target hosts.
  • Africa is a mixed bag: of the few hosts I tracerouted, some show a hop with eqxchicr5 embedded in its name, others do not.
  • Strangely there is a host (melspeed.intervolve.com.au) that appears to be in Australia, so I would have expected the route to go West from California across the Pacific, yet it goes from SLAC via a router with eqxchicr5 in its name.
  • Middle East and Central Asia routes mainly go via eqxchicr5.
  • I would expect S.E. Asia to go via Japan and Singapore, and this tends to be the case for some traceroute spot checks I made.
  • Latin America appears to go mainly through eqxchicr5 according to the traceroutes.

To try and resolve some of these issues I wrote a script (traceroute-list.pl) to traceroute to each hostname in the list of hosts. The results are shown in traceroutes-list.txt.  It is seen:

  • Of the 38 hosts in Africa that appear to have an outage at the time of the incident, 11 match eqxchicr5 in their traceroute and 27 do not match as seen from SLAC.
  • Of the 16 hosts in the Balkans that appear to have an outage at the time of the incident, 7 match eqxchicr5 in their traceroute and 9 do not match as seen from SLAC.
  • Of the 13 hosts in Central Asia that appear to have an outage at the time of the incident, 7 match eqxchicr5 in their traceroute and 6 do not match as seen from SLAC.
  • Of the 61 hosts in Europe that appear to have an outage at the time of the incident, 33 match eqxchicr5 in their traceroute and 28 do not match as seen from SLAC.
  • Of the 7 hosts in Latin America that appear to have an outage at the time of the incident, 3 match eqxchicr5 in their traceroute and 4 do not match as seen from SLAC.
  • Of the 14 hosts in the Middle East that appear to have an outage at the time of the incident, 3 match eqxchicr5 in their traceroute and 11 do not match as seen from SLAC.
  • Of the 5 hosts in North America that appear to have an outage at the time of the incident, 5 match eqxchicr5 in their traceroute and 0 do not match as seen from SLAC.
  • Of the 6 hosts in Russia that appear to have an outage at the time of the incident, 6 match eqxchicr5 in their traceroute and 0 do not match as seen from SLAC.
  • Of the 12 hosts in S.E. Asia that appear to have an outage at the time of the incident, 0 match eqxchicr5 in their traceroute and 12 do not match as seen from SLAC.
  • Of the 3 hosts in S. Asia that appear to have an outage at the time of the incident, 0 match eqxchicr5 in their traceroute and 3 do not match as seen from SLAC.
  • Summary:

    Region         Match  Do not  Total impacted  Total Targets  Total responsive
                          Match   at 14:00        monitored      targets at 13:00
    Africa         11     27      38              131            104
    Balkans        7      9       16              31             30
    C. Asia        7      6       13              29             27
    East Asia      0      14      14              211            140
    Europe         33     28      61              132            97
    Latin America  3      4       7               89             81
    Middle East    3      11      14              58             46
    North America  5      1       6               93             71
    Oceania        0      1       1               20             16
    Russia         6      0       6               12             11
    S.E. Asia      0      11      11              114            83
    South Asia     0      3       3               48             42
    Total          75     115     190             968            748
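
The per-region tally above could be produced along the following lines. This is a minimal Python sketch, not the actual traceroute-list.pl; the function names, the canned traceroute text and the `host-a.example` / `host-b.example` targets are hypothetical. The traceroute options are the ones used elsewhere on this page.

```python
import subprocess
from collections import defaultdict


def run_traceroute(host):
    """Run traceroute with the options used on this page and return its text output."""
    cmd = ["traceroute", "-m", "30", "-q", "1", "-w", "1", host]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=120).stdout


def matching_hops(traceroute_text, needles):
    """Return the hop lines whose text contains any of the given substrings."""
    hops = []
    for line in traceroute_text.splitlines():
        if line.lstrip().startswith("traceroute to"):
            continue  # skip the header line, which names the target, not a hop
        if any(n in line for n in needles):
            hops.append(line.strip())
    return hops


def tally(hosts_by_region, needles, trace=run_traceroute):
    """Count, per region, the hosts whose traceroute does or does not match."""
    counts = defaultdict(lambda: {"match": 0, "no_match": 0})
    for region, hosts in hosts_by_region.items():
        for host in hosts:
            key = "match" if matching_hops(trace(host), needles) else "no_match"
            counts[region][key] += 1
    return {region: dict(c) for region, c in counts.items()}


# Canned output (hypothetical targets) so the example runs without network access.
CANNED = {
    "host-a.example": "traceroute to host-a.example (192.0.2.1), 30 hops max\n"
                      " 9  eqxchicr5-ip-b-chiccr5.es.net (134.55.218.62)  46.664 ms\n",
    "host-b.example": "traceroute to host-b.example (192.0.2.2), 30 hops max\n"
                      " 9  washcr5-ip-a-chiccr5.es.net (134.55.36.46)  63.694 ms\n",
}
hosts = {"Balkans": ["host-a.example", "host-b.example"]}

narrow = tally(hosts, ("eqxchicr5",), trace=CANNED.get)
print(narrow)
```

Passing `trace=run_traceroute` (the default) would run real traceroutes instead of using the canned text.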

Route Changing

One question of interest is whether PingER can fairly easily and reliably (no false positives etc.) detect such an incident. As we dig deeper, a surprise is how much the traceroutes change from one hour to another. For example, looking at the traceroutes from pinger.slac.stanford.edu to 191 targets on 9/19/2019 (as far as I know a stable day with no problems reported), the number of routes with hops matching 'eqxchicr5' changes quite a lot between measurement times. Below is a table of the results for 2019/09/19 11:45 PDT and 2019/09/19 11:59 PDT.

 

                 2019/09/19 11:45 PDT     2019/09/19 11:59 PDT
Region           Match  No match  Sum     Match  No match  Sum
Africa           16     22        38      16     22        38
Balkans          7      9         16      9      7         16
Central_Asia     8      5         13      10     3         13
East_Asia        0      14        14      0      14        14
Europe           30     31        61      37     24        61
Latin_America    4      3         7       4      3         7
Middle_East      1      13        14      2      12        14
North_America    5      1         6       4      2         6
Oceania          0      1         1       0      1         1
Russia           4      2         6       5      1         6
S.E._Asia        0      11        11      0      11        11
SE.ASIA          0      1         1       0      1         1
South_Asia       0      3         3       0      3         3
Sum              75     116       191     87     104       191

Below are two traceroutes to www.uoi.gr measured a few minutes apart; one has hops with the string 'eqxchicr5', the other does not.

Measured after 13:18 9/19/2019
cmd=traceroute  -m 30 -q 1 -w 1 www.uoi.gr
Found eqxchicr5 in traceroute  -m 30 -q 1 -w 1 www.uoi.gr for tld=GR(ntld/Balkans)=3 in hop(s)
  9  eqxchicr5-ip-b-chiccr5.es.net (134.55.218.62)  46.664 ms
 10  eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53)  63.456 ms
traceroute to www.uoi.gr (195.130.120.109), 30 hops max, 60 byte packets
  1  rtr-servcore1-serv01-iepm.slac.stanford.edu (134.79.104.66)  0.780 ms
  2  rtr-core2-p2p-serv01-01.slac.stanford.edu (134.79.254.65)  0.670 ms
  3  rtr-fwcore1-trust-p2p-core1.slac.stanford.edu (134.79.254.134)  1.059 ms
  4  rtr-core1-p2p-fwcore1-untrust.slac.stanford.edu (134.79.254.137)  1.452 ms
  5  rtr-border2-7k-core1.slac.stanford.edu (134.79.252.181)  1.305 ms
  6  sunncr5-ip-c-slac.slac.stanford.edu (192.68.191.233)  1.781 ms
  7  134.55.40.145 (134.55.40.145)  4.322 ms
  8  134.55.40.150 (134.55.40.150)  46.533 ms
  9  eqxchicr5-ip-b-chiccr5.es.net (134.55.218.62)  46.664 ms
 10  eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53)  63.456 ms
 11  esnet-eex-gw.gen.ch.geant.net (62.40.124.77)  149.565 ms
 12  cern513cr5-ip-a-washcr5.es.net (134.55.37.61)  149.278 ms
 13  esnet-eex-gw.gen.ch.geant.net (62.40.124.77)  149.358 ms
 14  grnet-gw.mx2.ath.gr.geant.net (62.40.124.90)  180.904 ms
 15  uoi-1.eier.access-link.grnet.gr (62.217.96.193)  192.686 ms
 16  boufos.bbr.uoi.gr (195.130.120.244)  195.589 ms
 17  uoi-1.eier.access-link.grnet.gr (62.217.96.193)  192.233 ms

Measured after 13:26 9/19/2019
cmd=traceroute  -m 30 -q 1 -w 1 www.uoi.gr
Cannot find eqxchicr5 for tld=GR(ntld/Balkans)=2 in traceroute  -m 30 -q 1 -w 1 www.uoi.gr
traceroute to www.uoi.gr (195.130.120.109), 30 hops max, 60 byte packets
  1  134.79.104.67 (134.79.104.67)  0.675 ms
  2  rtr-core1-p2p-serv01-01.slac.stanford.edu (134.79.253.249)  0.577 ms
  3  rtr-fwcore1-trust-p2p-core1.slac.stanford.edu (134.79.254.134)  1.038 ms
  4  rtr-core2-p2p-fwcore2-untrust.slac.stanford.edu (134.79.254.149)  1.285 ms
  5  rtr-border2-7k-core2.slac.stanford.edu (134.79.252.185)  1.213 ms
  6  sunncr5-ip-c-slac.slac.stanford.edu (192.68.191.233)  1.807 ms
  7  sacrcr5-ip-a-sunncr5.es.net (134.55.40.5)  4.354 ms
  8  134.55.40.150 (134.55.40.150)  46.635 ms
  9  washcr5-ip-a-chiccr5.es.net (134.55.36.46)  63.694 ms
 10  cern513cr5-ip-a-washcr5.es.net (134.55.37.61)  149.270 ms
 11  *
 12  ae4.mx1.mil2.it.geant.net (62.40.98.89)  156.263 ms
 13  esnet-eex-gw.gen.ch.geant.net (62.40.124.77)  149.343 ms
 14  ae4.mx1.mil2.it.geant.net (62.40.98.89)  156.230 ms
 15  uoi-1.eier.access-link.grnet.gr (62.217.96.193)  193.755 ms
 16  grnet-gw.mx2.ath.gr.geant.net (62.40.124.90)  181.088 ms
 17  *

It is seen that in the second traceroute the Chicago router shows up under the name 'chiccr5' (hop 9), so we reran the measurements using 'chiccr5' instead of 'eqxchicr5' and also added 'chicr5'.

The result is shown below where we inclusively matched either 'chiccr5' or 'chicr5'.

  • Region         Match      Do not Match  Total impacted  Total Targets  Total responsive
                   22:53      22:53         at 14:00        monitored      targets at 13:00
                   2019/9/19  2019/9/19     2019/08/16                     2019/08/16
    Africa         17         21            38              131            104
    Balkans        10         6             16              31             30
    C. Asia        10         3             13              29             27
    East Asia      0          14            14              211            140
    Europe         44         17            61              132            97
    Latin America  5          2             7               89             81
    Middle East    3          11            14              58             46
    North America  5          1             6               93             71
    Oceania        0          1             1               20             16
    Russia         6          0             6               12             11
    S.E. Asia      0          11            11              114            83
    South Asia     0          3             3               48             42
    Total          100        90            190             968            748

    This appears to be much more stable, i.e. the numbers of matching versus non-matching hosts are pretty constant from measurement time to measurement time, and when there is a difference it is accounted for by a router not responding.

  • An example of three measurements made matching either 'chiccr5' or 'chicr5' is shown below:

                   Thu Sep 19 22:53:46 2019    Thu Sep 19 23:30:45 2019    Fri Sep 20 11:17:05 2019
    Region         Match  No Match  Sum        Match  No Match  Sum        Match  No Match  Sum
    Africa         17     21        38         17     21        38         17     21        38
    Balkans        10     6         16         10     6         16         10     6         16
    Central_Asia   10     3         13         10     3         13         8      5         13
    East_Asia      0      14        14         0      14        14         0      14        14
    Europe         44     17        61         43     18        61         44     17        61
    Latin_America  5      2         7          5      2         7          5      2         7
    Middle_East    3      11        14         3      11        14         3      11        14
    North_America  5      1         6          5      1         6          5      1         6
    Oceania        0      1         1          0      1         1          0      1         1
    Russia         6      0         6          6      0         6          6      0         6
    S.E._Asia      0      11        11         0      11        11         0      11        11
    SE.ASIA        0      1         1          0      1         1          0      1         1
    South_Asia     0      3         3          0      3         3          0      3         3
  • Thus the apparent route changes away from the ESnet Chicago CR5 were a false alarm.
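
As a rough check on the stability claim, the three measurements can be compared region by region. The sketch below is Python with the (Match, No Match) counts copied from the three measurements above; the helper name `changed_regions` is an assumption made for illustration.

```python
# (Match, No Match) per region for Thu Sep 19 22:53:46 2019, copied from the table above.
T_2253 = {
    "Africa": (17, 21), "Balkans": (10, 6), "Central_Asia": (10, 3),
    "East_Asia": (0, 14), "Europe": (44, 17), "Latin_America": (5, 2),
    "Middle_East": (3, 11), "North_America": (5, 1), "Oceania": (0, 1),
    "Russia": (6, 0), "S.E._Asia": (0, 11), "SE.ASIA": (0, 1),
    "South_Asia": (0, 3),
}
# The later measurements differ from the first in a single region each.
T_2330 = dict(T_2253, Europe=(43, 18))        # Thu Sep 19 23:30:45 2019
T_1117 = dict(T_2253, Central_Asia=(8, 5))    # Fri Sep 20 11:17:05 2019


def changed_regions(a, b):
    """Return {region: (counts_in_a, counts_in_b)} for regions whose counts differ."""
    return {r: (a[r], b[r]) for r in a if r in b and a[r] != b[r]}


print(changed_regions(T_2253, T_2330))
print(changed_regions(T_2253, T_1117))
```

Only one region changes between any two of these measurements, compared with the many regions that changed between 11:45 and 11:59 when matching on 'eqxchicr5' alone.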

Temporal changes

Patrick Dorn of ESnet informed us: "We have two ECMP paths between Chicago and Washington. One direct and one through EQX-CHI-CR5. So you will see temporal variation in traceroutes."
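
A toy Python sketch of why ECMP makes traceroutes vary: routers commonly choose among equal-cost next hops by hashing packet header fields, and classic UDP traceroute changes the destination port on each probe, so successive probes can hash onto different paths. The hash below (SHA-256 of the 5-tuple), the addresses and the source port are purely illustrative assumptions; real routers use vendor-specific hashing. The two path lists follow the ESnet description of the Chicago to Washington paths.

```python
import hashlib

# The two equal-cost paths between Chicago and Washington described by ESnet.
PATHS = [
    ["chic-cr5", "wash-cr5"],
    ["chic-cr5", "eqx-chi-cr5", "eqx-ash-cr5", "wash-cr5"],
]


def pick_path(src, dst, sport, dport, proto="udp"):
    """Pick a path from a hash of the flow 5-tuple (illustrative, not ESnet's algorithm)."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return PATHS[digest[0] % len(PATHS)]


# Classic traceroute starts at UDP destination port 33434 and increments per probe.
# 192.0.2.1 and 198.51.100.1 are documentation addresses, not real hosts.
for probe in range(6):
    dport = 33434 + probe
    path = pick_path("192.0.2.1", "198.51.100.1", 40000, dport)
    print(dport, " -- ".join(path))
```

The same 5-tuple always maps to the same path, so a single flow is stable, while a tool that varies ports from probe to probe can see either path.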

Routing changes within a Region

We are still left with understanding why, for targets which experienced the incident on 8/16/2019, the routes from SLAC to some countries in a region use the Chicago CR5 while the routes to other countries in the same region do not appear to (as determined by not finding the strings 'chiccr5' or 'chicr5' in the traceroutes).

Looking at this in more detail, we see that the following countries have some routes from SLAC that appear to go via the Chicago CR5 and other routes that do not:

  • Cannot find chicr5 or chiccr5 for tld=ZA(Africa) in response to traceroute  -m 30 -q 1 -w 1 www.dut.ac.za

    Cannot find chicr5 or chiccr5 for tld=ZA(Africa) in response to traceroute  -m 30 -q 1 -w 1 www.museumsnc.co.za

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.ru.ac.za for tld=ZA(Africa)

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 brunsvigia.tenet.ac.za for tld=ZA(Africa)

    Cannot find chicr5 or chiccr5 for tld=AL(Balkans) in response to traceroute  -m 30 -q 1 -w 1 www.geo.edu.al

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.upt.al for tld=AL(Balkans)

    Cannot find chicr5 or chiccr5 for tld=RO(Balkans) in response to traceroute  -m 30 -q 1 -w 1 speed.alienstation.ro

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 ns1.credis.ro for tld=RO(Balkans)

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.fabboya.az for tld=AZ(Central_Asia)

    Cannot find chicr5 or chiccr5 for tld=AZ(Central Asia) in response to traceroute  -m 30 -q 1 -w 1 speedtest.ivory.azstarnet.az

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.democrats.ge for tld=GE(Central_Asia)

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.tdasu.edu.ge for tld=GE(Central_Asia)

    Cannot find chicr5 or chiccr5 for tld=GE(Central Asia) in response to traceroute  -m 30 -q 1 -w 1 www.gdi.ge

    Cannot find chicr5 or chiccr5 for tld=GE(ntld/Central_Asia)=3 in response to traceroute  -m 30 -q 1 -w 1 www.koda.ge

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 tsu.ge for tld=GE(Central_Asia)

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.gsi.de for tld=DE(Europe)

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.physik.rwth-aachen.de for tld=DE(Europe)

    Cannot find chicr5 or chiccr5 for tld=DE(Europe) in response to traceroute  -m 30 -q 1 -w 1 www.kph.uni-mainz.de

    Cannot find chicr5 or chiccr5 for tld=GI(Europe) in response to traceroute  -m 30 -q 1 -w 1 www.attiaslevy.gi

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.fsc.gi for tld=GI(Europe)=21

    Cannot find chicr5 or chiccr5 for tld=GI(Europe) in response to traceroute  -m 30 -q 1 -w 1 www.gibmuseum.gi

    Cannot find chicr5 or chiccr5 for tld=LU(Europe) in response to traceroute  -m 30 -q 1 -w 1 www.cssf.lu

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 ns1.restena.lu for tld=LU(Europe)

    Cannot find chicr5 or chiccr5 for tld=LU(Europe) in response to traceroute  -m 30 -q 1 -w 1 www.rtl.lu

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.nikhef.nl for tld=NL(Europe)

    Cannot find chicr5 or chiccr5 for tld=NL(Europe) in response to traceroute  -m 30 -q 1 -w 1 www.routenet.nl

    Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 edepot.wur.nl for tld=NL(Europe)
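
Countries with mixed results can be picked out of such output mechanically. The Python sketch below assumes only the line format shown in the list above ("Found ..." or "Cannot find ..." plus a `tld=XX(` field); the function name `mixed_tlds` is hypothetical, and the sample is a small excerpt of those lines.

```python
import re
from collections import defaultdict

TLD_RE = re.compile(r"tld=([A-Z]+)\(")


def mixed_tlds(lines):
    """Return the TLDs that have both matching and non-matching traceroutes."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        m = TLD_RE.search(line)
        if not m:
            continue
        if line.startswith("Found"):
            seen[m.group(1)].add("match")
        elif line.startswith("Cannot find"):
            seen[m.group(1)].add("no_match")
    return sorted(t for t, kinds in seen.items() if kinds == {"match", "no_match"})


# Excerpt of the output lines listed above.
SAMPLE = [
    "Cannot find chicr5 or chiccr5 for tld=ZA(Africa) in response to traceroute  -m 30 -q 1 -w 1 www.dut.ac.za",
    "Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.ru.ac.za for tld=ZA(Africa)",
    "Found chicr5 or chiccr5 in traceroute  -m 30 -q 1 -w 1 www.gsi.de for tld=DE(Europe)",
    "Cannot find chicr5 or chiccr5 for tld=GE(ntld/Central_Asia)=3 in response to traceroute  -m 30 -q 1 -w 1 www.koda.ge",
]

result = mixed_tlds(SAMPLE)
print(result)
```

In this excerpt only ZA has both kinds of line; DE and GE each appear with one kind only.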

Partial explanation from Dale W. Carder <dwcarder@es.net>

1) You are definitely seeing the effect of ECMP paths through ESnet
particularly from Chicago to Washington.  In ESnet's IGP, the following
paths are equal cost and through traffic is spread over them:
- chic-cr5 -- wash-cr5
- chic-cr5 -- eqx-chi-cr5 -- eqx-ash-cr5 -- wash-cr5

2) I think the timing of this outage also coincided with transit
upgrades underway at Sunnyvale.  Traffic was migrated off a congested
path to AS3356 until a new 10G was put in place. 
3) Additionally, I know we recently updated route filters for Geant, and
there was also a long duration outage at CSTNET.  That could have had
some aberrations on international connectivity.

So, Hopefully this helps fill in some blanks in your notes.  I would be
interested in any other international paths where it looks like there
may be sub-optimal route selection.  Often times these are to some
degree manually curated to prefer faster links and at other times the L3
topology doesn't tell the whole story of what is happening at L2.

Signature

The easiest signature to detect the effect of such an incident is to simply look at the number of targets responding each hour and look for a sudden drop. In this case we saw that at UTC hour 14:00-15:00 the number of hosts in Europe responding to pings from SLAC dropped to 37, from 97 or 98 for the rest of the day, i.e. the number of responding hosts dropped by a dramatic 62% or so.
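
A minimal Python sketch of such a detector compares each hour's count of responding hosts with the day's median and flags large fractional drops. The 30% threshold, the function name `sudden_drops` and the hourly series are illustrative assumptions; only the values 97 and 37 come from the text above. A fall from 97 to 37 is a fractional drop of 60/97, about 0.62.

```python
from statistics import median


def sudden_drops(counts, min_fraction=0.3):
    """Return (hour, count, baseline, fractional_drop) for hours far below the daily median."""
    baseline = median(counts)
    if baseline <= 0:
        return []
    return [(hour, count, baseline, (baseline - count) / baseline)
            for hour, count in enumerate(counts)
            if (baseline - count) / baseline >= min_fraction]


# Illustrative series for Europe on the day of the incident: 97 responding hosts
# in every hour except UTC hour 14, when only 37 responded.
counts = [97] * 14 + [37] + [97] * 9

drops = sudden_drops(counts)
for hour, count, baseline, frac in drops:
    print(f"hour {hour:02d}: {count} responding vs median {baseline:g} ({frac:.0%} drop)")
```

Using the median as the baseline keeps a single bad hour from pulling the reference level down; a multi-hour outage would need a baseline taken from previous days instead.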


* The relevant lines of the traceroutes from SLAC made during the evening of 2019/09/17 are shown in column AD of the spreadsheet.

 
