We had network-related kernel crashes on three separate hosts over several months. We believe we have identified the root cause and applied a mitigation. Here is the timeline:
- The first crash on the first host was October 27th. I investigated the kernel backtrace but did not identify anything obvious.
- The second crash on the first host was November 17th. After this we moved the host to a new physical server. The original physical server was of a build that is unusual in our fleet, so we thought the hardware might also have been a factor, hence the move. I did some fairly serious investigation of the kernel backtrace. The one recommendation we got back was to disable specific network optimizations. We assumed the crashes were related to customer traffic and disabled the optimizations on customer-facing interfaces, but not on all interfaces.
- The third crash on the first host was January 17th. It looked like some kind of memory corruption bug, but I could not tell where the corruption came from. At this point we still believed it was likely related to customer traffic.
- The fourth crash was on a second host on February 4th. We had recently started provisioning new customers onto this server, which until then had hosted only long-time customers, and we thought the crash was related to traffic from a new service.
- The fifth crash was on a third host on February 12th. We did not believe it was related to customer traffic at this point. We decided to disable the relevant network optimizations on all interfaces on February 13th as a precautionary measure, despite the internal interfaces having been in production for a long time.
- At the beginning of March I rebuilt and deployed a new kernel with all the supported debug options on. With the network optimizations on, this kernel crashed quickly, reliably and with a clear backtrace. With the network optimizations off it did not crash. The crash indicated an attempt to access an already-freed memory page, in code specific to a 10 Gbps network card we use. On a normal kernel such an access would not cause errors unless the page had already been reallocated, which is consistent with the crashes being so infrequent in production.
- On March 4th I submitted a patch to the Linux kernel for review. I determined at the same time that the bug had already incidentally been removed from mainline within the last year. It had been present since 2012.
- On April 25th I submitted a new version of the patch, which will hopefully be accepted.
More details can be found here.
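
To illustrate why an access to a freed page can stay silent for months and then fail immediately on a debug kernel, here is a minimal sketch in Python. It is a toy model only: `PagePool`, `stale_read_sequence` and the `poison_on_free` flag are hypothetical names invented for this example, and it does not reflect how the kernel allocator or the network driver actually works. It assumes a debug build that overwrites pages when they are freed, which is one way such bugs get surfaced. The timeline above does not say which debug option caught this one.

```python
POISON = 0xAA  # marker byte written into freed pages in "debug" mode (assumption)


class PagePool:
    """Toy page allocator, for illustration only."""

    def __init__(self, n_pages, page_size=8, poison_on_free=False):
        self.pages = [bytearray(page_size) for _ in range(n_pages)]
        self.free_list = list(range(n_pages))
        self.poison_on_free = poison_on_free

    def alloc(self, fill):
        # Hand out the most recently freed page and fill it with the owner's data.
        idx = self.free_list.pop()
        self.pages[idx][:] = bytes([fill]) * len(self.pages[idx])
        return idx

    def free(self, idx):
        # A debug build may overwrite freed memory so stale users are noticed.
        if self.poison_on_free:
            self.pages[idx][:] = bytes([POISON]) * len(self.pages[idx])
        self.free_list.append(idx)

    def read(self, idx):
        return bytes(self.pages[idx])


def stale_read_sequence(poison_on_free):
    """Return what a stale reference sees after free and after reallocation."""
    pool = PagePool(n_pages=2, poison_on_free=poison_on_free)
    page = pool.alloc(fill=0x11)         # the "driver" owns the page
    pool.free(page)                      # the page is released...
    stale = page                         # ...but a stale reference is kept
    after_free = pool.read(stale)[0]
    other = pool.alloc(fill=0x22)        # another user is handed the same page
    after_realloc = pool.read(stale)[0]
    return after_free, after_realloc, other == stale


if __name__ == "__main__":
    print("normal:", [hex(v) for v in stale_read_sequence(False)[:2]])
    print("debug: ", [hex(v) for v in stale_read_sequence(True)[:2]])
```

In the normal case the stale read still returns the old contents until another user is handed the page, so nothing looks wrong in the meantime. In the debug case the stale read sees the poison value as soon as the page is freed, which is the kind of behaviour that turns a rare crash into a quick and reliable one.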