As stated elsewhere, we had 5 kernel crashes over several months that were related to the use of a specific network card.
The actual bug is fairly interesting. When a device driver wants a piece of hardware to use Direct Memory Access (DMA) to write data into the driver's memory buffer, it must ask for that memory range to be mapped into the memory accessible to the device, and later ask for it to be unmapped. Generally, on x86, if a device is capable of directly addressing the memory, no translation is needed. This particular network device is 64-bit capable, so normally no special action is needed.
However, on Xen the device driver may have a memory allocation that is unsuitable for the device to write to directly. For this case the Linux kernel has a component called the software input-output translation lookaside buffer (SWIOTLB), which has an internal buffer that is guaranteed to be accessible to all devices. The SWIOTLB transparently copies data between this internal buffer and the device driver's buffer at the time the device driver asks.
This particular driver was using memory in an unusual way: it asked for the data to be synced from the hardware to the driver multiple times during the DMA transaction, and then did not care about the data by the time the DMA mapping for the device was finally unmapped. At the time of the DMA unmapping, the driver's memory buffer could already have been marked as free.
Technically, unmapping DMA after marking the relevant buffer as free is invalid, since the SWIOTLB will write to the original memory buffer when the unmapping occurs. But the SWIOTLB is not normally used with this device except under Xen, so the bug was not observed.
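To make the ordering problem concrete, here is a toy user-space model of a bounce buffer. It is only a sketch: `ToySwiotlb`, its method names and the `memory` bytearray are invented for illustration and are not the kernel's DMA API. It assumes, as described above, that both a sync and the final unmap copy the bounce buffer back into the driver's original buffer.

```python
class ToySwiotlb:
    """Toy model of a bounce buffer. Hypothetical names, not the kernel API."""

    def __init__(self):
        self.slots = {}        # handle -> (bounce bytearray, memory, offset)
        self.next_handle = 1

    def map_from_device(self, memory, offset, length):
        """Reserve a bounce slot for a device-to-driver transfer."""
        handle = self.next_handle
        self.next_handle += 1
        self.slots[handle] = (bytearray(length), memory, offset)
        return handle

    def device_write(self, handle, data):
        """The 'device' can only write into the bounce slot."""
        bounce, _, _ = self.slots[handle]
        bounce[:len(data)] = data

    def sync_for_cpu(self, handle):
        """Copy the bounce slot into the driver's original buffer."""
        bounce, memory, offset = self.slots[handle]
        memory[offset:offset + len(bounce)] = bounce

    def unmap(self, handle):
        """Unmapping performs one last copy, then releases the slot."""
        self.sync_for_cpu(handle)
        del self.slots[handle]


def demo():
    memory = bytearray(16)                 # stands in for system RAM
    tlb = ToySwiotlb()

    handle = tlb.map_from_device(memory, 0, 8)
    tlb.device_write(handle, b"PACKET01")
    tlb.sync_for_cpu(handle)               # driver reads its data early
    received = bytes(memory[0:8])

    # The driver frees the buffer and someone else reuses the memory...
    memory[0:8] = b"NEWOWNER"

    # ...and only then is the mapping torn down, overwriting the new owner.
    tlb.unmap(handle)
    return received, bytes(memory[0:8])
```

In this model the driver gets its packet from the early sync, but the late `unmap` copies the stale bounce data over whatever the new owner stored there. Without a bounce buffer the unmap would copy nothing, which matches the observation that the bug stays hidden when the SWIOTLB is not in use.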
It’s possible to force use of the SWIOTLB outside of Xen, but then accesses to disks are so slow that the system becomes unusable. This is probably why nobody does it for testing.
To summarize, these are the reasons this bug was not fixed before:
- While it is not a bug in Xen, under normal circumstances it only shows up under Xen.
- Forcing use of the SWIOTLB outside of Xen makes the system too slow for practical testing.
- There is no debug option that checks whether a page is currently valid at the time that a DMA unmapping is performed. This is a feature I would be interested in someone adding.
- Turning off certain network optimizations makes the problem go away even with Xen, so most people having a problem would stop looking after deploying the mitigation.
- The device is not integrated into a motherboard and so is not as commonly tested as other devices.
It took us so long to identify because I spent a long time going down the wrong path: we believed the problem was related to customer traffic. We had a minor DoS attack against us in November, exploiting a kernel bug that was so easy and obvious to test for that it shouldn’t have been present in the first place. So it was not unreasonable to think that this problem was also customer-caused, especially given that it was first observed on a host that tended to have newly provisioned systems.
A challenge I had when initially reading these specific backtraces is that often gdb won’t show terribly useful information about an address because it is buried within 5 layers of inline functions. The correct tool for getting this is addr2line. It has an “-i” option which recursively shows all the inlining done until there is nothing more to report.
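As a small sketch of how that lookup could be scripted, the helper below builds an addr2line invocation using the `-e` (file to read), `-f` (print function names) and `-i` (expand inlined frames) options. The function names, the `vmlinux` path and the address in the usage note are placeholders. It assumes the binary was built with debug info and that the addresses are the ones the binary was linked at.

```python
import subprocess


def addr2line_command(binary, addresses):
    """Build an addr2line command line that also expands inlined frames."""
    # -e: binary with debug info, -f: function names, -i: follow inlining
    return ["addr2line", "-e", binary, "-f", "-i", *addresses]


def resolve(binary, addresses):
    """Run addr2line and return its output lines (requires binutils)."""
    result = subprocess.run(
        addr2line_command(binary, addresses),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, `resolve("vmlinux", ["0xffffffff81000000"])` would print one function name and one file:line pair per inlining level for that (made-up) address.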
I also wasted a lot of time reading the kernel source. My advice is: don’t. If you can’t find the bug within an hour or two by starting from the offending line or address, turn on all the debug options and try to reproduce. For anyone debugging potential memory corruption, I highly recommend enabling the following options in their kernel config:
- CONFIG_KASAN (if available and with a new enough compiler)
- CONFIG_CRASH_DUMP (kdump)
Unfortunately, not all of these work well with Xen. KASAN has not been implemented for paravirtualized kernels, though it could be without too much effort. kdump is only feasible if you are able to dump all of RAM to disk afterwards, which is fine for test systems but not for production.
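Before booting a test kernel it can be handy to confirm the options actually made it into the config. The sketch below parses the text of a kernel `.config` file and reports which of the two options above are not enabled. The function names are invented for illustration, and it assumes the usual `CONFIG_FOO=y` and `# CONFIG_FOO is not set` line format.

```python
def enabled_options(config_text):
    """Return the set of options set to 'y' or 'm' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


def missing_debug_options(config_text,
                          wanted=("CONFIG_KASAN", "CONFIG_CRASH_DUMP")):
    """List the wanted debug options that are not enabled."""
    enabled = enabled_options(config_text)
    return [option for option in wanted if option not in enabled]
```

Reading the config of the kernel under test and passing its contents to `missing_debug_options` returns an empty list when both options are on.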
The next time we buy a new type of hardware, we will also run one of these kernels with debugging on before putting it into production. I would consider this generally good practice when buying new generations of devices or when running less common hypervisors or operating systems.