We had a user report this morning that pkg.freebsd.org would not resolve with our recursive resolvers. freebsd.org showed the same problem.

To summarize the exhibited behavior:

  • Queries from different geographic regions got different results, even against the same nameserver IP address.
  • The domain resolved via TCP but not UDP.
  • The authoritative nameserver failed to resolve the domain via UDP but resolved other domains via UDP.

We aren’t sure how long the problem lasted, but it may have been at least a few hours.

My first question was why our resolvers weren’t falling back to TCP. I found this post discussing caching and fallback behavior in the recursive resolver BIND, but flushing the related caches had no effect, so I dug further.

The nameservers for freebsd.org are:

$ host -t NS freebsd.org
freebsd.org name server ns2.he.net.
freebsd.org name server ns4.he.net.
freebsd.org name server ns5.he.net.
freebsd.org name server ns3.he.net.

Testing all of these servers showed the same failure to resolve.

Querying “normally” using UDP with “host freebsd.org ns2.he.net” did not resolve, but with TCP using “host -T freebsd.org ns2.he.net” did.
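The same UDP versus TCP comparison can be scripted when there are many servers or vantage points to check. The following is a minimal sketch using only the Python standard library; the function names (build_query, query_udp, query_tcp) are made up for illustration, and it assumes IPv4 and a server reachable on port 53.

```python
import socket
import struct

def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (RFC 1035). qtype 1 is an A record."""
    # Header: ID, flags (0 = standard query, no recursion desired),
    # QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN

def query_udp(server: str, name: str, timeout: float = 3.0):
    """Send the query over UDP. Returns raw bytes, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(name), (server, 53))
        try:
            data, _ = s.recvfrom(4096)
            return data
        except socket.timeout:
            return None

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full reply")
        buf += chunk
    return buf

def query_tcp(server: str, name: str, timeout: float = 3.0) -> bytes:
    """Send the query over TCP, where messages carry a 2-byte length prefix."""
    msg = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect((server, 53))
        s.sendall(struct.pack("!H", len(msg)) + msg)
        (length,) = struct.unpack("!H", _recv_exact(s, 2))
        return _recv_exact(s, length)
```

Calling query_udp("216.218.131.2", "freebsd.org") and query_tcp("216.218.131.2", "freebsd.org") from the same host gives a direct comparison of the two transports. A None from the UDP call means no reply arrived within the timeout, which is only one of the ways a lookup can fail.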

What about other domains? “host he.net ns2.he.net” resolved HE.net using UDP. A randomly selected domain using ns2.he.net as a nameserver also resolved using UDP.

When UDP works for some domains but not all, I’m not sure when BIND would fall back to TCP.
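For context, the standard trigger for retrying over TCP is a UDP response with the TC (truncated) bit set, as described in RFC 1035 and RFC 7766. A query that gets no UDP reply at all carries no such signal. The sketch below decodes the header flags of a raw DNS response and models a resolver that falls back only on truncation. That policy is an assumption made for illustration, not a description of BIND's actual logic, and the function names are hypothetical.

```python
import struct

def parse_header(response: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1)."""
    if len(response) < 12:
        raise ValueError("DNS header is 12 bytes; response too short")
    query_id, flags, _qd, an, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return {
        "id": query_id,
        "is_response": bool(flags & 0x8000),    # QR bit
        "authoritative": bool(flags & 0x0400),  # AA bit
        "truncated": bool(flags & 0x0200),      # TC bit
        "rcode": flags & 0x000F,                # 0 = NOERROR, 2 = SERVFAIL, ...
        "answers": an,                          # ANCOUNT
    }

def should_retry_over_tcp(response) -> bool:
    """Assumed policy: retry over TCP only when the UDP reply is truncated.

    None stands for a UDP timeout, which gives no truncation signal.
    """
    if response is None:
        return False
    return parse_header(response)["truncated"]
```

Under this assumed policy, a server that silently drops UDP queries for one zone would never trigger a TCP retry, which would be consistent with what we saw. Whether BIND also retries over TCP after repeated UDP timeouts is a separate question this sketch does not answer.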

Someone else reported that resolving freebsd.org using ns2.he.net was working for them. Working on the theory that ns2.he.net is anycast, where the same IP address routes to different servers depending on the client's geographic location, we each ran a traceroute:

# San Jose
$ traceroute ns2.he.net
traceroute to ns2.he.net (216.218.131.2), 30 hops max, 60 byte packets
...
10  10ge5-10.core1.sjc2.he.net (64.62.153.169)  20.586 ms  20.281 ms 10ge10-19.core1.sjc2.he.net (216.218.213.101)  19.888 ms
11  10ge4-4.core1.sjc1.he.net (72.52.92.117)  25.198 ms  22.612 ms e0-36.core2.sjc2.he.net (184.104.192.214)  20.754 ms
12  ns2.he.net (216.218.131.2)  11.795 ms  15.573 ms  16.084 ms

# Georgia
$ traceroute ns2.he.net
traceroute to ns2.he.net (216.218.131.2), 30 hops max, 60 byte packets
 5  10gigabitethernet4-1.core1.chi1.he.net (208.115.136.37)  37.566 ms  37.667 ms  37.762 ms
 6  100ge15-1.core1.ash1.he.net (184.105.64.134)  45.503 ms  58.312 ms  40.511 ms
 7  ns2.he.net (216.218.131.2)  40.549 ms  40.573 ms  40.896 ms

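Both traces reach the same IP address, but through HE.net core routers whose names suggest different sites (sjc in one, ash1 in the other), and in each trace the final hop adds essentially no latency over the router before it. A single server could not sit right next to both, so these traceroutes appear to confirm ns2.he.net is an anycast address.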

This looked like a problem on HE.net’s side, so I reported it to them. The problem was resolved afterwards. I’m still not sure how we could have best handled it locally, nor do I know how common this type of problem is.