The root cause of the downtime on November 6th was a bug in the router software of HE.net, one of our transit providers. When the bug occurs, network connectivity is lost in one direction between their routers, resulting in packet loss to some portions of the internet. Technically this was a partial outage, as many west coast routes (including my home ISP) were still working.

HE.net tells us that a software update to fix this issue will arrive sometime after November 30th. Our services should not be affected during that update, except for any users still on IP addresses not owned by prgmr.com. A couple of weeks ago, all customers using these IP addresses were notified of the issue and given instructions on how to switch to prgmr.com IP addresses.

The downtime on November 6th lasted much longer than we would have liked. Here is a breakdown of the timeline and some mitigating steps we’ve taken:

  • At 00:52 -0700 I was first paged. For some reason I did not hear any of the four texts or phone calls. My landline was called but there is no phone in the room I was in. I am planning to add a landline to that room.
  • At 01:06 -0700 the pages were passed to the secondary person on pager. This person did not have sufficient training to isolate this particular network issue. We have discussed the problem and updated our pager documentation to cover this particular issue. There was also no hard time limit on how long to spend diagnosing an issue before escalating to another staff member; we have now added one.
  • At 01:33 -0700 I happened to wake up and notice I had been paged. Upon investigating, I figured out that the problem was likely a recurrence of the issue we had previously with HE.net and decided that the best course of action was to shut down our connection with HE.net. However, I first had to look up how to do that with our software. The procedure has been added to the pager notes. We’ve also signed up for an additional monitoring service that automatically generates traceroutes from a number of test sites, which should cut down on debugging time.
  • 1 hour and 17 minutes after the first page, the problem was fully resolved.
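
For readers who want to check the arithmetic, the timeline above can be tabulated with a short Python sketch. The timestamps and the 1 hour 17 minute total come from the list; the helper names, and especially the 15-minute escalation limit, are hypothetical placeholders for illustration and are not the values from our pager documentation.

```python
from datetime import timedelta


def hhmm(text):
    """Parse an 'HH:MM' clock time (all times are -0700) as an offset from midnight."""
    hours, minutes = map(int, text.split(":"))
    return timedelta(hours=hours, minutes=minutes)


FIRST_PAGE = hhmm("00:52")

# Events taken from the timeline; the resolution time is derived from
# the stated total of 1 hour and 17 minutes after the first page.
TIMELINE = [
    ("first page", FIRST_PAGE),
    ("pages passed to secondary", hhmm("01:06")),
    ("primary noticed the page", hhmm("01:33")),
    ("fully resolved", FIRST_PAGE + timedelta(hours=1, minutes=17)),
]

# HYPOTHETICAL value for illustration only; the real limit is not stated here.
ESCALATION_LIMIT = timedelta(minutes=15)


def elapsed_minutes(when):
    """Whole minutes between the first page and the given event."""
    return int((when - FIRST_PAGE).total_seconds() // 60)


def format_clock(when):
    """Render an offset from midnight back into 'HH:MM'."""
    total = int(when.total_seconds() // 60)
    return f"{total // 60:02d}:{total % 60:02d}"


def should_escalate(started, now, limit=ESCALATION_LIMIT):
    """True once diagnosis has run for at least the hard time limit."""
    return now - started >= limit


for label, when in TIMELINE:
    print(f"{format_clock(when)} -0700  +{elapsed_minutes(when):>2} min  {label}")
```

Under the placeholder 15-minute limit, should_escalate(hhmm("01:06"), hhmm("01:33")) is true, since 27 minutes passed between the hand-off to the secondary and the primary noticing the page.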