Update 16:34 -0700: Our monitoring missed this because whoever set up the Nagios checks used default values from an example configuration. It turns out we were not paged until packet loss reached 60%. Warning at 2% loss and paging at 6% loss is far more reasonable, and Luke has lowered the Nagios thresholds to those values.
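To illustrate the threshold change, here is a minimal Python sketch of how a loss percentage maps to an alert state. It is not our actual Nagios configuration. The function names are hypothetical, the exit codes follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL), and treating a loss exactly at the threshold as an alert is an assumption.

```python
# Illustrative sketch only; not the real Nagios check.
# Exit codes follow the standard Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2

# New thresholds from this update. The old paging threshold was 60%.
WARN_LOSS_PCT = 2.0
PAGE_LOSS_PCT = 6.0


def packet_loss_pct(sent: int, received: int) -> float:
    """Return packet loss as a percentage of packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def classify(loss_pct: float,
             warn: float = WARN_LOSS_PCT,
             page: float = PAGE_LOSS_PCT) -> int:
    """Map a loss percentage to a Nagios-style state.

    Assumption: a value equal to a threshold triggers that state.
    """
    if loss_pct >= page:
        return CRITICAL  # this is the state that pages someone
    if loss_pct >= warn:
        return WARNING
    return OK


if __name__ == "__main__":
    loss = packet_loss_pct(sent=100, received=90)  # 10% loss
    print(classify(loss))             # new thresholds
    print(classify(loss, page=60.0))  # old paging threshold
```

With the old 60% paging threshold, a link dropping 10% of its packets never reaches CRITICAL, which is why nobody was paged.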
Update 13:25 -0700: IPv6 connectivity was disrupted when we switched ports. It should now be restored.
Update: The immediate problem is fixed. Luke changed the physical ports on both sides of the connection, and the problem appears to be with the port originally in use on lefanu. It might be a hardware fault, but that’s not 100% clear. We haven’t decided what to do about it long term. We’ll put some effort today into adjusting our monitoring so that we get paged for a problem like this.
- Lefanu is experiencing high inbound packet loss. We are investigating possible physical causes, since this machine should have the same software and hardware configuration as another, identical machine. Affected customers will receive a credit for the downtime.
The larger problem is that our monitoring tools did not alert us; we will be looking into how to add or adjust the thresholds.