We had a user report this morning that pkg.freebsd.org would not resolve with our recursive resolvers. freebsd.org showed the same problem.
To summarize the exhibited behavior:
- Different geographic regions led to different results, even for the same IP address.
- The domain resolved via TCP but not UDP.
- The authoritative nameserver failed to resolve the domain via UDP but resolved other domains via UDP.
We aren’t sure how long the problem lasted, but it may have gone on for at least a few hours.
My first question was why our resolvers weren’t falling back to TCP. I found this post talking about caching and fallback behaviors in the recursive resolver BIND, but flushing the related caches had no effect, so I dug some more.
The nameservers for freebsd.org are:
$ host -t NS freebsd.org
freebsd.org name server ns2.he.net.
freebsd.org name server ns4.he.net.
freebsd.org name server ns5.he.net.
freebsd.org name server ns3.he.net.
Testing all of these servers showed the same failure to resolve.
Querying “normally” using UDP with “host freebsd.org ns2.he.net” did not resolve, but with TCP using “host -T freebsd.org ns2.he.net” did.
What about other domains? “host he.net ns2.he.net” resolved HE.net using UDP.
A randomly selected domain using ns2.he.net as a nameserver also resolved using UDP.
When UDP works for some domains but not others on the same server, I’m not sure when BIND would fall back to TCP.
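The manual UDP and TCP checks above can be repeated across every nameserver in one pass. This is a minimal sketch, not what we actually ran: `check_transports` is a hypothetical helper name, the 5-second wait is an arbitrary choice, and it assumes `host` exits non-zero when a lookup fails or times out.

```shell
#!/bin/sh
# Compare UDP and TCP lookups of one name against several nameservers.
# check_transports is a hypothetical helper, not part of any tool.
# Assumption: host exits non-zero when the lookup fails or times out.
check_transports() {
    name=$1
    shift
    for ns in "$@"; do
        # host queries over UDP by default; -W 5 waits at most 5 seconds.
        if host -W 5 "$name" "$ns" >/dev/null 2>&1; then
            udp=ok
        else
            udp=FAIL
        fi
        # -T forces the same query over TCP.
        if host -T -W 5 "$name" "$ns" >/dev/null 2>&1; then
            tcp=ok
        else
            tcp=FAIL
        fi
        printf '%s %s udp=%s tcp=%s\n' "$ns" "$name" "$udp" "$tcp"
    done
}
```

Running `check_transports freebsd.org ns2.he.net ns3.he.net ns4.he.net ns5.he.net` prints one line per server, and running it again with a different name (for example he.net) makes it easy to see whether the failure is specific to one domain and one transport.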
Someone else reported that resolving freebsd.org using ns2.he.net was working for them. To test the anycast theory, where the same IP address is routed to different servers depending on the client’s location, we each ran a traceroute:
# San Jose
$ traceroute ns2.he.net
traceroute to ns2.he.net (126.96.36.199), 30 hops max, 60 byte packets
...
10 10ge5-10.core1.sjc2.he.net (188.8.131.52) 20.586 ms 20.281 ms 10ge10-19.core1.sjc2.he.net (184.108.40.206) 19.888 ms
11 10ge4-4.core1.sjc1.he.net (220.127.116.11) 25.198 ms 22.612 ms e0-36.core2.sjc2.he.net (18.104.22.168) 20.754 ms
12 ns2.he.net (22.214.171.124) 11.795 ms 15.573 ms 16.084 ms

# Georgia
$ traceroute ns2.he.net
traceroute to ns2.he.net (126.96.36.199), 30 hops max, 60 byte packets
5 10gigabitethernet4-1.core1.chi1.he.net (188.8.131.52) 37.566 ms 37.667 ms 37.762 ms
6 100ge15-1.core1.ash1.he.net (184.108.40.206) 45.503 ms 58.312 ms 40.511 ms
7 ns2.he.net (220.127.116.11) 40.549 ms 40.573 ms 40.896 ms
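In each trace, the latency of the final hop is very close to that of the HE.net core router just before it (an sjc router in one trace, an ash1 router in the other), so the answering server appears to sit near a different router in each case. These traceroutes therefore appear to confirm that ns2.he.net is an anycast address.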
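A complementary check for anycast is to ask the server to identify itself. The sketch below queries the conventional CHAOS-class TXT names `hostname.bind` and `id.server` with `host`. We did not run this during the incident, `which_instance` is a hypothetical helper name, and many operators disable these queries, so a missing answer proves nothing either way.

```shell
#!/bin/sh
# Ask a nameserver which instance is answering, via the conventional
# CHAOS-class TXT names hostname.bind and id.server.
# which_instance is a hypothetical helper; servers may refuse these queries.
which_instance() {
    ns=$1
    for q in hostname.bind id.server; do
        # -c CH selects the CHAOS class, -t TXT the record type,
        # -W 5 waits at most 5 seconds for a reply.
        out=$(host -c CH -t TXT -W 5 "$q" "$ns" 2>/dev/null) || out='no answer'
        printf '%s %s:\n%s\n' "$ns" "$q" "$out"
    done
}
```

If two people in different regions run `which_instance ns2.he.net` and get different identifiers back, that points to different physical servers behind the same address.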
This looked like HE.net’s problem, so I reported it to HE.net. The problem resolved afterwards. I’m still not sure how we could have best handled it locally, nor do I know how common this type of problem is.