See here for the report from the ESnet NOC. On the morning of August 16th, the EQX-CHI-CR5 router began to clock ingress FCS errors on port 1/1/1 (facing CHIC-CR5).
- Incident Start Time: 08/16/2019 06:27:51 PDT AKA 13:27:51 UTC
- Incident Resolution Time: 08/20/2019 15:31:00 AKA 21:31:00 UTC
If we look at the PingER losses from, say, SLAC to Europe (see https://www.slac.stanford.edu/cgi-bin/pingtable.pl?file=packet_loss&by=by-node&size=100&tick=hourly&year=2019&month=08&day=16&from=EDU.SLAC.STANFORD.PINGER&to=WORLD&ex=none&only=all&ipv=all&dataset=hep&percentage=any), one can see that for the UTC hour 14:00-15:00 the number of hosts in Europe responding to pings from SLAC dropped from 97 or 98 for the rest of the day to 37, i.e. the number of responding hosts dropped by a dramatic 62% or so. A typical traceroute from SLAC normally goes via the EQX-CHI-CR5 router; for example, see the traceroute below for DESY, in particular hop 10.
[cottrell@pinger ~]$ traceroute -m 30 -q 1 -w 1 -A ping.desy.de
traceroute to 131.169.5.47 (131.169.5.47), 30 hops max, 140 byte packets
1 rtr-servcore1-serv01-webserv.slac.stanford.edu (134.79.197.130) [AS3671] 0.585 ms
2 rtr-core2-p2p-serv01-02.slac.stanford.edu (134.79.253.253) [AS3671] 0.588 ms
3 rtr-fwcore1-trust-p2p-core1.slac.stanford.edu (134.79.254.134) [AS3671] 1.127 ms
4 rtr-core1-p2p-fwcore1-untrust.slac.stanford.edu (134.79.254.137) [AS3671] 1.440 ms
5 rtr-border2-7k-core2.slac.stanford.edu (134.79.252.185) [AS3671] 1.308 ms
6 sunncr5-ip-c-slac.slac.stanford.edu (192.68.191.233) [AS3671] 1.928 ms
7 134.55.40.145 (134.55.40.145) [AS293] 4.250 ms
8 134.55.40.150 (134.55.40.150) [AS293] 46.466 ms
9 washcr5-ip-a-chiccr5.es.net (134.55.36.46) [AS293] 63.502 ms
10 eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53) [AS293] 63.013 ms
11 londcr5-ip-c-aofacr5.es.net (134.55.37.22) [AS293] 138.596 ms
12 amstcr5-ip-a-londcr5.es.net (134.55.222.13) [AS293] 144.920 ms
13 newycr5-ip-a-aofacr5.es.net (134.55.37.78) [AS293] 68.714 ms
14 londcr5-sdn-a-newycr5.es.net (134.55.220.21) [AS293] 137.348 ms
15 amstcr5-ip-a-londcr5.es.net (134.55.222.13) [AS293] 143.881 ms
16 esnet-eex-gw.ams.nl.geant.net (62.40.124.105) [AS21320/AS20965] 145.115 ms
17 cr-tub2-be10.x-win.dfn.de (188.1.146.210) [AS680] 165.508 ms
18 kr-des91.x-win.dfn.de (188.1.231.114) [AS680] 169.408 ms
If we look at the time series of the individual ping measurements made every 30 minutes to this target as seen from SLAC, we see:
The last lossless data point for the 100 Byte pings was at 13:19:53 and the next lossless measurement was at 15:52:52. The data for the 1000 Byte pings is shown below.
Unix epoch,UTC,min-RTT,avg-RTT,max-RTT,Loss
1565961608,8/16/2019 13:20,169.507,169.684,169.799,0
1565963682,8/16/2019 13:54,0,0,0,100
1565965667,8/16/2019 14:27,0,0,0,100
1565967504,8/16/2019 14:58,0,0,0,100
1565969272,8/16/2019 15:27,0,0,0,100
1565970782,8/16/2019 15:53,169.469,169.609,169.713,0
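The bracket around the outage can also be read off the samples programmatically. The following is a minimal Python sketch, not part of PingER; the function names `outage_bounds` and `utc` are made up for illustration, and the input is assumed to be the comma-separated rows of the 1000 Byte table above.

```python
import csv
import io
from datetime import datetime, timezone

# Rows copied from the 1000 Byte ping table: epoch, UTC label, RTTs, loss (%).
SAMPLES = """\
1565961608,8/16/2019 13:20,169.507,169.684,169.799,0
1565963682,8/16/2019 13:54,0,0,0,100
1565965667,8/16/2019 14:27,0,0,0,100
1565967504,8/16/2019 14:58,0,0,0,100
1565969272,8/16/2019 15:27,0,0,0,100
1565970782,8/16/2019 15:53,169.469,169.609,169.713,0
"""

def outage_bounds(text):
    """Return (last good epoch before, first good epoch after) the first run of 100% loss."""
    rows = [(int(r[0]), float(r[5])) for r in csv.reader(io.StringIO(text)) if r]
    last_good = first_good_after = None
    in_outage = False
    for epoch, loss in rows:
        if loss >= 100:
            in_outage = True
        elif not in_outage:
            last_good = epoch          # still before the outage
        else:
            first_good_after = epoch   # first sample after the outage
            break
    return last_good, first_good_after

def utc(epoch):
    """Format a Unix epoch as a UTC timestamp string."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

last_good, first_good = outage_bounds(SAMPLES)
print(utc(last_good), "->", utc(first_good))
# prints: 2019-08-16 13:20:08 -> 2019-08-16 15:53:02
```

With samples roughly every 30 minutes, this only brackets the loss of connectivity to the target; it cannot place the start or end more precisely than the sampling interval.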
If we take the hourly packet loss data measured from SLAC to the world for 2019/8/16, cut and paste it into Excel, reverse sort on hour 14 and remove all hosts that had a measured loss (not a .), and then sort on hour 13 and remove all hosts that did not have a measured loss (i.e. recorded as a .), we are left with the hosts that responded in hour 13 but not in hour 14. It is then seen in the spreadsheet that the main impact is on Europe including the Balkans (37 countries), Africa (23 countries), Middle East (9 countries), Central Asia (8 countries), S.E. Asia (7 countries), Latin America (7 countries), S. Asia (2 countries, BD & MV), E. Asia (1 country, CN).
Further sorting (in order) TLD, Region and (reverse sort) 13:00 helps identify the impacted targets.
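The spreadsheet filtering described above amounts to keeping the hosts that have a measurement in hour 13 and a "." in hour 14. A minimal Python sketch of that selection follows; the host names and the dictionary layout (one cell per UTC hour) are hypothetical stand-ins for the pasted PingER table, not its actual export format.

```python
def newly_silent(hourly, before=13, during=14, missing="."):
    """Hosts with a measurement in hour `before` but none in hour `during`."""
    return sorted(
        host for host, cells in hourly.items()
        if cells[before] != missing and cells[during] == missing
    )

# Hypothetical rows: 24 cells per host (UTC hours 0..23), "." meaning no measurement.
hourly = {
    "host-a.example.org": ["0"] * 14 + ["."] + ["0"] * 9,  # silent only in hour 14
    "host-b.example.org": ["0"] * 24,                      # responsive all day
    "host-c.example.org": ["."] * 24,                      # never responsive
}

print(newly_silent(hourly))
# prints: ['host-a.example.org']
```

Hosts that were already silent in hour 13 are excluded, which is the purpose of the second sort-and-remove step.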
It makes sense that traffic from SLAC to Europe, the Middle East and Central Asia would go through Chicago. This is not so clear for Latin America and China. I looked at the traceroutes to see if the route goes through a router with the name eqxchicr5 embedded in it.
- There are PingER target hosts in China. I would expect those routes to go West from SLAC across the Pacific and hence not see the eqxchicr5 router. I looked at several that show up in the spreadsheet and indeed their traceroute* does not indicate a router with the name eqxchicr5. It is also noticeable that all of these Chinese targets show heavy losses during the rest of the day, and such heavy losses are not the general case for Chinese target hosts.
- Africa is a mixed bag: of the few hosts I tracerouted, some show a hop with eqxchicr5 embedded in it, others do not.
- Strangely, there is a host (melspeed.intervolve.com.au) that appears to be in Australia, for which I would have expected the route to go West from California across the Pacific, yet it goes from SLAC to a router with eqxchicr5 in its name.
- Middle East and Central Asia routes mainly go via eqxchicr5.
- I would expect S.E. Asia to go via Japan and Singapore, and this tends to be the case for the traceroute spot checks I made.
- Latin America appears to go mainly through eqxchicr5 according to the traceroutes.
To try and resolve some of these issues I wrote a script (traceroute-list.pl) to traceroute to each hostname in the list of hosts. The results are shown in traceroutes-list.txt. It is seen:
- Of the 38 hosts in Africa that appear to have an outage at the time of the incident, 11 match eqxchicr5 in their traceroute and 27 do not match as seen from SLAC.
- Of the 16 hosts in the Balkans that appear to have an outage at the time of the incident, 7 match eqxchicr5 in their traceroute and 9 do not match as seen from SLAC.
- Of the 13 hosts in Central Asia that appear to have an outage at the time of the incident, 7 match eqxchicr5 in their traceroute and 6 do not match as seen from SLAC.
- Of the 61 hosts in Europe that appear to have an outage at the time of the incident, 33 match eqxchicr5 in their traceroute and 28 do not match as seen from SLAC.
- Of the 7 hosts in Latin America that appear to have an outage at the time of the incident, 3 match eqxchicr5 in their traceroute and 4 do not match as seen from SLAC.
- Of the 14 hosts in the Middle East that appear to have an outage at the time of the incident, 3 match eqxchicr5 in their traceroute and 11 do not match as seen from SLAC.
- Of the 5 hosts in North America that appear to have an outage at the time of the incident, 5 match eqxchicr5 in their traceroute and 0 do not match as seen from SLAC.
- Of the 6 hosts in Russia that appear to have an outage at the time of the incident, 6 match eqxchicr5 in their traceroute and 0 do not match as seen from SLAC.
- Of the 12 hosts in S.E. Asia that appear to have an outage at the time of the incident, 0 match eqxchicr5 in their traceroute and 12 do not match as seen from SLAC.
- Of the 3 hosts in S. Asia that appear to have an outage at the time of the incident, 0 match eqxchicr5 in their traceroute and 3 do not match as seen from SLAC.
Summary:
Region | Match | Do not Match | Total impacted at 14:00 | Total Targets monitored | Total responsive targets at 13:00 |
Africa | 11 | 27 | 38 | 131 | 104 |
Balkans | 7 | 9 | 16 | 31 | 30 |
C. Asia | 7 | 6 | 13 | 29 | 27 |
East Asia | 0 | 14 | 14 | 211 | 140 |
Europe | 33 | 28 | 61 | 132 | 97 |
Latin America | 3 | 4 | 7 | 89 | 81 |
Middle East | 3 | 11 | 14 | 58 | 46 |
North America | 5 | 1 | 6 | 93 | 71 |
Oceania | 0 | 1 | 1 | 20 | 16 |
Russia | 6 | 0 | 6 | 12 | 11 |
S.E. Asia | 0 | 11 | 11 | 114 | 83 |
South Asia | 0 | 3 | 3 | 48 | 42 |
Total | 75 | 115 | 190 | 968 | 748 |
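The per-region tally above can be reproduced from saved traceroute output. The following Python sketch is not the traceroute-list.pl script; the function names are invented for illustration, and the two inputs are short excerpts of traceroutes to www.uoi.gr. Actually running traceroute for each host is left out.

```python
from collections import Counter, defaultdict

def matching_hops(traceroute_output, needle="eqxchicr5"):
    """Return the hop lines (lines starting with a hop number) containing `needle`."""
    hops = []
    for line in traceroute_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit() and needle in line:
            hops.append(line.strip())
    return hops

def tally_by_region(results, needle="eqxchicr5"):
    """results: iterable of (region, traceroute_output). Returns {region: counts}."""
    tally = defaultdict(Counter)
    for region, output in results:
        key = "match" if matching_hops(output, needle) else "no match"
        tally[region][key] += 1
    return {region: dict(counts) for region, counts in tally.items()}

# Excerpts (hops 8-10 only) of two traceroutes to the same Balkans target.
WITH_HOP = """traceroute to www.uoi.gr (195.130.120.109), 30 hops max, 60 byte packets
 8 134.55.40.150 (134.55.40.150) 46.533 ms
 9 eqxchicr5-ip-b-chiccr5.es.net (134.55.218.62) 46.664 ms
10 eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53) 63.456 ms
"""
WITHOUT_HOP = """traceroute to www.uoi.gr (195.130.120.109), 30 hops max, 60 byte packets
 8 134.55.40.150 (134.55.40.150) 46.635 ms
 9 washcr5-ip-a-chiccr5.es.net (134.55.36.46) 63.694 ms
10 cern513cr5-ip-a-washcr5.es.net (134.55.37.61) 149.270 ms
"""

print(tally_by_region([("Balkans", WITH_HOP), ("Balkans", WITHOUT_HOP)]))
# prints: {'Balkans': {'match': 1, 'no match': 1}}
```

Only numbered hop lines are tested, so the target name in the traceroute header line cannot produce a match by itself.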
Route Changing
One interest is to see whether PingER can fairly easily and reliably (no false positives etc.) detect such an incident. As we dig deeper, a surprise is how much the traceroutes change from one hour to another. For example, looking at the traceroutes from pinger.slac.stanford.edu to 191 targets on 9/19/2019 (as far as I know a stable day with no reported problems), the number of routes with hops matching 'eqxchicr5' changes quite a lot between measurement times. Below is a table of the results for 2019/09/19 11:45 PDT and 2019/09/19 11:59 PDT.
Region 11:45 | Match | No match | Sum | Region 11:59 | Match | No match | Sum |
Africa | 16 | 22 | 38 | Africa | 16 | 22 | 38 |
Balkans | 7 | 9 | 16 | Balkans | 9 | 7 | 16 |
Central_Asia | 8 | 5 | 13 | Central_Asia | 10 | 3 | 13 |
East_Asia | 0 | 14 | 14 | East_Asia | 0 | 14 | 14 |
Europe | 30 | 31 | 61 | Europe | 37 | 24 | 61 |
Latin_America | 4 | 3 | 7 | Latin_America | 4 | 3 | 7 |
Middle_East | 1 | 13 | 14 | Middle_East | 2 | 12 | 14 |
North_America | 5 | 1 | 6 | North_America | 4 | 2 | 6 |
Oceania | 0 | 1 | 1 | Oceania | 0 | 1 | 1 |
Russia | 4 | 2 | 6 | Russia | 5 | 1 | 6 |
S.E._Asia | 0 | 11 | 11 | S.E._Asia | 0 | 11 | 11 |
SE.ASIA | 0 | 1 | 1 | SE.ASIA | 0 | 1 | 1 |
South_Asia | 0 | 3 | 3 | South_Asia | 0 | 3 | 3 |
Sum | 75 | 116 | 191 | Sum | 87 | 104 | 191 |
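To see which regions account for the change between the two runs, the Match columns can be compared directly. This is a small Python sketch using the numbers from the table above; the dictionary and function names are made up for illustration.

```python
REGIONS = ["Africa", "Balkans", "Central_Asia", "East_Asia", "Europe",
           "Latin_America", "Middle_East", "North_America", "Oceania",
           "Russia", "S.E._Asia", "SE.ASIA", "South_Asia"]

# Match counts per region, copied from the 11:45 and 11:59 columns.
MATCH_1145 = dict(zip(REGIONS, [16, 7, 8, 0, 30, 4, 1, 5, 0, 4, 0, 0, 0]))
MATCH_1159 = dict(zip(REGIONS, [16, 9, 10, 0, 37, 4, 2, 4, 0, 5, 0, 0, 0]))

def changed_regions(first, second):
    """Return {region: change in match count} for regions whose count differs."""
    return {region: second[region] - first[region]
            for region in first if second[region] != first[region]}

print(changed_regions(MATCH_1145, MATCH_1159))
# prints: {'Balkans': 2, 'Central_Asia': 2, 'Europe': 7, 'Middle_East': 1, 'North_America': -1, 'Russia': 1}
```

The changes sum to 12, which is the difference between the two Match totals (75 and 87), with Europe contributing most of it.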
Below are two traceroutes to www.uoi.gr measured a few minutes apart; one has hops with the string 'eqxchicr5', the other does not.
Measured after 13:18 9/19/2019 | Measured after 13:26 9/19/2019 |
cmd=traceroute -m 30 -q 1 -w 1 www.uoi.gr | cmd=traceroute -m 30 -q 1 -w 1 www.uoi.gr |
Found eqxchicr5 in traceroute -m 30 -q 1 -w 1 www.uoi.gr for tld=GR(ntld/Balkans)=3 in hop(s) | Cannot find eqxchicr5 for tld=GR(ntld/Balkans)=2 in traceroute -m 30 -q 1 -w 1 www.uoi.gr |
9 eqxchicr5-ip-b-chiccr5.es.net (134.55.218.62) 46.664 ms | |
10 eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53) 63.456 ms | |
traceroute to www.uoi.gr (195.130.120.109), 30 hops max, 60 byte packets | traceroute to www.uoi.gr (195.130.120.109), 30 hops max, 60 byte packets |
1 rtr-servcore1-serv01-iepm.slac.stanford.edu (134.79.104.66) 0.780 ms | 1 134.79.104.67 (134.79.104.67) 0.675 ms |
2 rtr-core2-p2p-serv01-01.slac.stanford.edu (134.79.254.65) 0.670 ms | 2 rtr-core1-p2p-serv01-01.slac.stanford.edu (134.79.253.249) 0.577 ms |
3 rtr-fwcore1-trust-p2p-core1.slac.stanford.edu (134.79.254.134) 1.059 ms | 3 rtr-fwcore1-trust-p2p-core1.slac.stanford.edu (134.79.254.134) 1.038 ms |
4 rtr-core1-p2p-fwcore1-untrust.slac.stanford.edu (134.79.254.137) 1.452 ms | 4 rtr-core2-p2p-fwcore2-untrust.slac.stanford.edu (134.79.254.149) 1.285 ms |
5 rtr-border2-7k-core1.slac.stanford.edu (134.79.252.181) 1.305 ms | 5 rtr-border2-7k-core2.slac.stanford.edu (134.79.252.185) 1.213 ms |
6 sunncr5-ip-c-slac.slac.stanford.edu (192.68.191.233) 1.781 ms | 6 sunncr5-ip-c-slac.slac.stanford.edu (192.68.191.233) 1.807 ms |
7 134.55.40.145 (134.55.40.145) 4.322 ms | 7 sacrcr5-ip-a-sunncr5.es.net (134.55.40.5) 4.354 ms |
8 134.55.40.150 (134.55.40.150) 46.533 ms | 8 134.55.40.150 (134.55.40.150) 46.635 ms |
9 eqxchicr5-ip-b-chiccr5.es.net (134.55.218.62) 46.664 ms | 9 washcr5-ip-a-chiccr5.es.net (134.55.36.46) 63.694 ms |
10 eqxashcr5-ip-a-eqxchicr5.es.net (134.55.207.53) 63.456 ms | 10 cern513cr5-ip-a-washcr5.es.net (134.55.37.61) 149.270 ms |
11 esnet-eex-gw.gen.ch.geant.net (62.40.124.77) 149.565 ms | 11 * |
12 cern513cr5-ip-a-washcr5.es.net (134.55.37.61) 149.278 ms | 12 ae4.mx1.mil2.it.geant.net (62.40.98.89) 156.263 ms |
13 esnet-eex-gw.gen.ch.geant.net (62.40.124.77) 149.358 ms | 13 esnet-eex-gw.gen.ch.geant.net (62.40.124.77) 149.343 ms |
14 grnet-gw.mx2.ath.gr.geant.net (62.40.124.90) 180.904 ms | 14 ae4.mx1.mil2.it.geant.net (62.40.98.89) 156.230 ms |
15 uoi-1.eier.access-link.grnet.gr (62.217.96.193) 192.686 ms | 15 uoi-1.eier.access-link.grnet.gr (62.217.96.193) 193.755 ms |
16 boufos.bbr.uoi.gr (195.130.120.244) 195.589 ms | 16 grnet-gw.mx2.ath.gr.geant.net (62.40.124.90) 181.088 ms |
17 uoi-1.eier.access-link.grnet.gr (62.217.96.193) 192.233 ms | 17 * |
It is seen that the name has changed to 'chiccr5', so we reran the measurements using 'chiccr5' instead of 'eqxchicr5' and also added 'chicr5'.
The result is shown below where we inclusively matched either 'chiccr5' or 'chicr5'.
Region | Match 22:53 2019/9/19 | Do not Match 22:53 2019/9/19 | Total impacted at 14:00 2019/08/16 | Total Targets monitored | Total responsive targets at 13:00 2019/08/16 |
Africa | 17 | 21 | 38 | 131 | 104 |
Balkans | 10 | 6 | 16 | 31 | 30 |
C. Asia | 10 | 3 | 13 | 29 | 27 |
East Asia | 0 | 14 | 14 | 211 | 140 |
Europe | 44 | 17 | 61 | 132 | 97 |
Latin America | 5 | 2 | 7 | 89 | 81 |
Middle East | 3 | 11 | 14 | 58 | 46 |
North America | 5 | 1 | 6 | 93 | 71 |
Oceania | 0 | 1 | 1 | 20 | 16 |
Russia | 6 | 0 | 6 | 12 | 11 |
S.E. Asia | 0 | 11 | 11 | 114 | 83 |
South Asia | 0 | 3 | 3 | 48 | 42 |
Total | 100 | 90 | 190 | 968 | 748 |
This appears to be much more stable, i.e. the number of matching versus non-matching hosts is pretty constant from measurement time to measurement time, and when there is a difference it is accounted for by a router not responding.
An example of three measurements made matching either 'chiccr5' or 'chicr5' is shown below:
Region (Thu Sep 19 22:53:46 2019) | Match | No Match | Sum | Region (Thu Sep 19 23:30:45 2019) | Match | No Match | Sum | Region (Fri Sep 20 11:17:05 2019) | Match | No Match | Sum |
Africa | 17 | 21 | 38 | Africa | 17 | 21 | 38 | Africa | 17 | 21 | 38 |
Balkans | 10 | 6 | 16 | Balkans | 10 | 6 | 16 | Balkans | 10 | 6 | 16 |
Central_Asia | 10 | 3 | 13 | Central_Asia | 10 | 3 | 13 | Central_Asia | 8 | 5 | 13 |
East_Asia | 0 | 14 | 14 | East_Asia | 0 | 14 | 14 | East_Asia | 0 | 14 | 14 |
Europe | 44 | 17 | 61 | Europe | 43 | 18 | 61 | Europe | 44 | 17 | 61 |
Latin_America | 5 | 2 | 7 | Latin_America | 5 | 2 | 7 | Latin_America | 5 | 2 | 7 |
Middle_East | 3 | 11 | 14 | Middle_East | 3 | 11 | 14 | Middle_East | 3 | 11 | 14 |
North_America | 5 | 1 | 6 | North_America | 5 | 1 | 6 | North_America | 5 | 1 | 6 |
Oceania | 0 | 1 | 1 | Oceania | 0 | 1 | 1 | Oceania | 0 | 1 | 1 |
Russia | 6 | 0 | 6 | Russia | 6 | 0 | 6 | Russia | 6 | 0 | 6 |
S.E._Asia | 0 | 11 | 11 | S.E._Asia | 0 | 11 | 11 | S.E._Asia | 0 | 11 | 11 |
SE.ASIA | 0 | 1 | 1 | SE.ASIA | 0 | 1 | 1 | SE.ASIA | 0 | 1 | 1 |
South_Asia | 0 | 3 | 3 | South_Asia | 0 | 3 | 3 | South_Asia | 0 | 3 | 3 |
Thus the apparent route changes away from the ESnet Chicago CR5 were a false alarm.
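The difference between the two matching rules can be illustrated on hop names taken from the www.uoi.gr traceroutes. This is a minimal Python sketch; the regular expression `chicc?r5` is one assumed way to match either 'chiccr5' or 'chicr5' inclusively, not necessarily how the original script did it.

```python
import re

# Matches 'chicr5' (as in eqxchicr5) as well as 'chiccr5'.
CHICAGO_CR5 = re.compile(r"chicc?r5")

HOPS = [
    "eqxchicr5-ip-b-chiccr5.es.net",
    "washcr5-ip-a-chiccr5.es.net",
    "eqxashcr5-ip-a-eqxchicr5.es.net",
    "cern513cr5-ip-a-washcr5.es.net",
]

for hop in HOPS:
    old_rule = "eqxchicr5" in hop              # original substring test
    new_rule = bool(CHICAGO_CR5.search(hop))   # either Chicago CR5 spelling
    print(f"{hop}: eqxchicr5={old_rule} chiccr5|chicr5={new_rule}")
```

The hop washcr5-ip-a-chiccr5.es.net is the interesting case: it fails the 'eqxchicr5' test but matches the broader pattern, which is consistent with the counts becoming stable once both names are accepted.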
Routing changes within a Region
We still need to understand why, for targets that experienced the incident on 8/16/2019, the routes from SLAC to some countries in a region use the Chicago CR5 while the routes to other countries in the same region do not.
Looking at this in more detail we see:
Cannot find chicr5 for tld=ZA(ntld/Africa)=17 in response to traceroute -m 30 -q 1 -w 1 www.dut.ac.za
Cannot find chicr5 for tld=ZA(ntld/Africa)=18 in response to traceroute -m 30 -q 1 -w 1 www.museumsnc.co.za
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.ru.ac.za for tld=ZA(ntld/Africa)=15 in hop(s)
Found chicr5 in traceroute -m 30 -q 1 -w 1 brunsvigia.tenet.ac.za for tld=ZA(ntld/Africa)=16 in hop(s)
Cannot find chicr5 for tld=AL(ntld/Balkans)=1 in response to traceroute -m 30 -q 1 -w 1 www.geo.edu.al
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.upt.al for tld=AL(ntld/Balkans)=1 in hop(s)
Cannot find chicr5 for tld=RO(ntld/Balkans)=5 in response to traceroute -m 30 -q 1 -w 1 speed.alienstation.ro
Found chicr5 in traceroute -m 30 -q 1 -w 1 ns1.credis.ro for tld=RO(ntld/Balkans)=6 in hop(s)
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.fabboya.az for tld=AZ(ntld/Central_Asia)=1 in hop(s)
Cannot find chicr5 for tld=AZ(ntld/Central_Asia)=1 in response to traceroute -m 30 -q 1 -w 1 speedtest.ivory.azstarnet.az
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.democrats.ge for tld=GE(ntld/Central_Asia)=2 in hop(s)
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.tdasu.edu.ge for tld=GE(ntld/Central_Asia)=3 in hop(s)
Cannot find chicr5 for tld=GE(ntld/Central_Asia)=2 in response to traceroute -m 30 -q 1 -w 1 www.gdi.ge
Cannot find chicr5 for tld=GE(ntld/Central_Asia)=3 in response to traceroute -m 30 -q 1 -w 1 www.koda.ge
Found chicr5 in traceroute -m 30 -q 1 -w 1 tsu.ge for tld=GE(ntld/Central_Asia)=4 in hop(s)
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.gsi.de for tld=DE(ntld/Europe)=11 in hop(s)
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.physik.rwth-aachen.de for tld=DE(ntld/Europe)=12 in hop(s)
Cannot find chicr5 for tld=DE(ntld/Europe)=3 in response to traceroute -m 30 -q 1 -w 1 www.kph.uni-mainz.de
Cannot find chicr5 for tld=GI(ntld/Europe)=8 in response to traceroute -m 30 -q 1 -w 1 www.attiaslevy.gi
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.fsc.gi for tld=GI(ntld/Europe)=21 in hop(s)
Cannot find chicr5 for tld=GI(ntld/Europe)=9 in response to traceroute -m 30 -q 1 -w 1 www.gibmuseum.gi
Cannot find chicr5 for tld=LU(ntld/Europe)=10 in response to traceroute -m 30 -q 1 -w 1 www.cssf.lu
Found chicr5 in traceroute -m 30 -q 1 -w 1 ns1.restena.lu for tld=LU(ntld/Europe)=32 in hop(s)
Cannot find chicr5 for tld=LU(ntld/Europe)=11 in response to traceroute -m 30 -q 1 -w 1 www.rtl.lu
Found chicr5 in traceroute -m 30 -q 1 -w 1 www.nikhef.nl for tld=NL(ntld/Europe)=33 in hop(s)
Cannot find chicr5 for tld=NL(ntld/Europe)=15 in response to traceroute -m 30 -q 1 -w 1 www.routenet.nl
Found chicr5 in traceroute -m 30 -q 1 -w 1 edepot.wur.nl for tld=NL(ntld/Europe)=34 in hop(s)
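Output lines of this form can be grouped by TLD to list the countries whose targets give mixed results. The following Python sketch assumes the two line formats shown above ("Found ..." and "Cannot find ..."); the function name `mixed_tlds` is hypothetical.

```python
import re
from collections import defaultdict

# Captures the outcome, the TLD and the region from a "Found"/"Cannot find" line.
LINE = re.compile(r"^(Found|Cannot find) \S+ .*tld=(\w+)\(ntld/([^)]+)\)")

def mixed_tlds(lines):
    """Return sorted (region, tld) pairs that have both matching and non-matching targets."""
    outcomes = defaultdict(set)
    for line in lines:
        m = LINE.match(line)
        if m:
            status, tld, region = m.groups()
            outcomes[(region, tld)].add(status == "Found")
    return sorted(key for key, seen in outcomes.items() if len(seen) == 2)

# A few of the lines listed above.
LOG = [
    "Cannot find chicr5 for tld=ZA(ntld/Africa)=17 in response to traceroute -m 30 -q 1 -w 1 www.dut.ac.za",
    "Found chicr5 in traceroute -m 30 -q 1 -w 1 www.ru.ac.za for tld=ZA(ntld/Africa)=15 in hop(s)",
    "Cannot find chicr5 for tld=AL(ntld/Balkans)=1 in response to traceroute -m 30 -q 1 -w 1 www.geo.edu.al",
    "Found chicr5 in traceroute -m 30 -q 1 -w 1 www.upt.al for tld=AL(ntld/Balkans)=1 in hop(s)",
    "Found chicr5 in traceroute -m 30 -q 1 -w 1 www.gsi.de for tld=DE(ntld/Europe)=11 in hop(s)",
]

print(mixed_tlds(LOG))
# prints: [('Africa', 'ZA'), ('Balkans', 'AL')]
```

DE is not reported here only because a single DE line was included in this sample; with the full listing above it would also show up as mixed.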
Signature
The easiest signature to detect the effect of such an incident is simply to look at the number of targets responding each hour and look for a sudden drop. In this case we saw that at UTC hour 14:00-15:00 the number of hosts in Europe responding to pings from SLAC dropped from 97 or 98 for the rest of the day to 37, i.e. the number of responding hosts dropped by a dramatic 62% or so.
* The relevant lines of the traceroutes from SLAC made during the evening of 2019/09/17 are shown in column AD of the spreadsheet.
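A check of this kind can be sketched in a few lines of Python. The function name, the use of the daily median as the baseline and the 50% threshold are all assumptions made for illustration, and the hourly series is an idealized version of the Europe counts (97 responding hosts in every hour except 37 in hour 14).

```python
from statistics import median

def sudden_drops(counts, threshold=0.5):
    """Return (hour, count, fractional drop) for hours well below the daily median."""
    baseline = median(counts)
    flagged = []
    for hour, n in enumerate(counts):
        drop = (baseline - n) / baseline if baseline else 0.0
        if drop >= threshold:
            flagged.append((hour, n, round(drop, 2)))
    return flagged

# Idealized hourly counts of responding European hosts (UTC hours 0..23).
counts = [97] * 14 + [37] + [97] * 9

print(sudden_drops(counts))
# prints: [(14, 37, 0.62)]
```

Using the median rather than the previous hour as the baseline keeps a single bad hour from shifting the reference, though a long outage covering most of the day would need a baseline taken from earlier days.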