At Wed Mar 25 16:04:57 UTC 2020, the majority of our recursive resolvers stopped working. The overall recursive resolver downtime was one hour and twenty minutes. We are in the process of issuing prorated credits for VPS customers.

As part of this DNS outage, we identified a weakness in our monitoring infrastructure: we have an external monitor of our primary monitoring system, but it only confirms that the primary monitoring system is up and has an active network connection; it does not confirm that DNS resolution is actually working there. Therefore, when both of our local resolvers stopped working simultaneously, we were not paged as we should have been. We have added an external resolver where needed and, in the long term, will have the primary monitoring system initiate connections to the external monitor so that a silent resolver failure cannot hide from the monitor again.
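The lesson in the paragraph above is that a monitor has to ask the resolver a question and judge the answer, not merely confirm the host is reachable. The following check is an added example written for this post, not the provider's monitoring code: it builds a recursive A query in wire format, sends it to the resolver, and reports healthy only when the reply carries RCODE NOERROR and a non-empty answer section. A SERVFAIL (what BIND returns on a broken trust chain), an empty answer, a mismatched transaction id, or a timeout all count as failures. The sample replies at the bottom are constructed inline so the logic can be exercised offline; the server address and query name in the usage stanza are assumptions.

```python
"""Resolver health check that requires a real answer, not just a reachable host.

Added example, not code from the original post. The two sample replies below
are constructed for illustration; they were not captured during the incident.
"""
import random
import socket
import struct
from typing import Dict, Optional, Tuple

NOERROR, SERVFAIL = 0, 2
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1, txid: Optional[int] = None) -> bytes:
    """Encode a standard recursive query (RD=1) for `name`; qtype 1 is A."""
    if txid is None:
        txid = random.randrange(0, 0x10000)
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)  # RD set, one question
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in name.rstrip(".").split(".")) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN


def parse_response(data: bytes) -> Dict[str, int]:
    """Pull out the header fields a health check cares about."""
    if len(data) < 12:
        raise ValueError("short DNS response")
    txid, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", data[:12])
    return {
        "txid": txid,
        "is_response": int(bool(flags & 0x8000)),  # QR bit
        "truncated": int(bool(flags & 0x0200)),    # TC bit
        "rcode": flags & 0x000F,
        "answers": an,
    }


def evaluate(query: bytes, response: bytes) -> Tuple[bool, str]:
    """Return (healthy, reason). Healthy means the resolver produced a usable answer."""
    try:
        r = parse_response(response)
    except ValueError as exc:
        return False, str(exc)
    if not r["is_response"] or r["txid"] != struct.unpack("!H", query[:2])[0]:
        return False, "reply does not match our query"
    if r["rcode"] != NOERROR:
        return False, "rcode " + str(RCODE_NAMES.get(r["rcode"], r["rcode"]))
    if r["answers"] == 0:
        return False, "NOERROR but empty answer section"
    return True, "%d answer record(s)" % r["answers"]


def check_resolver(server: str, name: str, timeout: float = 2.0) -> Tuple[bool, str]:
    """Query `server` over UDP port 53 and evaluate the reply; a timeout is a failure too."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return False, "timeout"
    return evaluate(query, data)


# Constructed sample traffic (transaction id 0x1234) so the checker runs offline.
QUERY = build_query("example.com", txid=0x1234)

# SERVFAIL: flags 0x8182 = QR, RD, RA set, RCODE 2; question echoed, no answers.
SERVFAIL_REPLY = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0) + QUERY[12:]

# NOERROR with one A record: name pointer 0xC00C, type A, class IN, TTL 300, 4-byte rdata.
NOERROR_REPLY = (struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0) + QUERY[12:]
                 + b"\xc0\x0c" + struct.pack("!HHIH", 1, 1, 300, 4) + bytes([93, 184, 216, 34]))

if __name__ == "__main__":
    import sys
    server = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"  # assumed local resolver
    ok, why = check_resolver(server, "example.com")
    print("%s: %s (%s)" % (server, "OK" if ok else "FAIL", why))
    sys.exit(0 if ok else 1)
```

Run it from the monitoring host against each resolver and page on a non-zero exit. Querying a name that is known to be unsigned is a useful complement to a signed one: during this incident a DLV-configured resolver would have failed exactly those lookups with SERVFAIL, which the check above treats as a failure rather than as "the host answered".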

The root cause was a stale DNSSEC setting that had not been revisited. Our BIND configuration contained the following line:

dnssec-lookaside auto;

DNSSEC Lookaside Validation (DLV), specified in RFC 5074, is a now-historical mechanism for publishing DNSSEC trust anchors outside of the DNS delegation chain. It relied on a public registry operated by the Internet Systems Consortium (ISC) at dlv.isc.org.

In September 2017, ISC retired dlv.isc.org, replacing its contents with an empty zone so as not to break existing resolver configurations. Earlier today the signatures for dlv.isc.org expired by accident, causing queries from our resolvers to fail with broken trust chain errors. The expiration produced this error message:

named:   validating : dlv.isc.org NSEC: verify failed due to bad signature (keyid=64263): RRSIG has expired

The fix was to replace dnssec-lookaside auto with dnssec-lookaside no in our resolver configuration and restart the resolvers. Since no is also the default, removing the line entirely has the same effect; running named-checkconf before the restart confirms the edited configuration still parses.
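Two things a practitioner would want after reading the error message and the fix above: a way to find every resolver config that still carries a DLV directive, and a log signature to alert on. The script below is an added example written for this post, not the provider's tooling. It audits named.conf text for active dnssec-lookaside directives (both the auto form and the explicit trust-anchor form, skipping lines that are already commented out), rewrites them to no while preserving indentation and trailing comments, and flags log lines matching the dlv.isc.org RRSIG expiry. The sample configuration and the second log line are constructed here; the first log line is the one quoted in the post.

```python
"""Audit and correct stale dnssec-lookaside settings; detect the DLV expiry in named logs.

Added example, not the provider's code. SAMPLE_CONF and the "priming query" log line
are constructed for illustration; the first log line is the one quoted in the post.
"""
import re
from typing import List, Tuple

# An active directive: optional indentation, the keyword, its value, a terminating ';'.
# A line starting with '//' or '#' does not match, so commented-out lines are ignored.
# Caveat: directives inside a multi-line /* ... */ comment are not recognised as comments.
DIRECTIVE_RE = re.compile(r"^(?P<indent>[ \t]*)dnssec-lookaside[ \t]+(?P<value>[^;\n]+);",
                          re.MULTILINE)

# The validator failure seen during the incident: an expired RRSIG on dlv.isc.org data.
DLV_EXPIRED_RE = re.compile(r"validating\b.*\bdlv\.isc\.org\b.*RRSIG has expired")


def audit_lookaside(conf: str) -> List[Tuple[int, str]]:
    """Return (line_number, value) for every active dnssec-lookaside setting other than `no`."""
    findings = []
    for lineno, line in enumerate(conf.splitlines(), 1):
        m = DIRECTIVE_RE.match(line)
        if m and m.group("value").strip().lower() != "no":
            findings.append((lineno, m.group("value").strip()))
    return findings


def disable_lookaside(conf: str) -> str:
    """Rewrite every active dnssec-lookaside directive to `no`, keeping indentation and
    anything after the ';' (such as a trailing comment) untouched."""
    return DIRECTIVE_RE.sub(lambda m: m.group("indent") + "dnssec-lookaside no;", conf)


def dlv_expiry_errors(log_lines: List[str]) -> List[str]:
    """Return the named log lines that carry the DLV RRSIG-expired signature."""
    return [line for line in log_lines if DLV_EXPIRED_RE.search(line)]


SAMPLE_CONF = """\
options {
    directory "/var/cache/bind";
    recursion yes;
    dnssec-validation auto;
    dnssec-lookaside auto;   // stale since dlv.isc.org was emptied in 2017
    // dnssec-lookaside auto;   (already commented out, ignored by the audit)
};
view "legacy" {
    dnssec-lookaside "." trust-anchor "dlv.isc.org";
};
"""

SAMPLE_LOG = [
    "named:   validating : dlv.isc.org NSEC: verify failed due to bad signature "
    "(keyid=64263): RRSIG has expired",
    "named: resolver priming query complete",
]

if __name__ == "__main__":
    for lineno, value in audit_lookaside(SAMPLE_CONF):
        print("line %d: dnssec-lookaside %s;" % (lineno, value))
    print(disable_lookaside(SAMPLE_CONF))
    for line in dlv_expiry_errors(SAMPLE_LOG):
        print("ALERT:", line)
```

Point audit_lookaside at each resolver's named.conf (and any files it includes) to build the list of hosts to fix, write the output of disable_lookaside back only after named-checkconf accepts it, and feed the DLV_EXPIRED_RE pattern to whatever log pipeline the resolvers already ship into so the next expiry pages someone before customers notice.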