Python ate my log message
When testing what should have been a quick one-line fix to a Python service written many years ago, I noticed a strange thing: some of the messages that the program should have been sending to the syslog were missing.
Here’s a program, simple-daemon.py, with just enough code to highlight the issue:
#!/usr/bin/env python3
# Requires the following third-party Python packages:
# python-daemon: https://pypi.python.org/pypi/python-daemon
# pylockfile: https://github.com/openstack/pylockfile
import logging.handlers
import sys

import daemon
import daemon.pidfile as pidlockfile
import daemon.runner
from lockfile import AlreadyLocked
from lockfile import LockTimeout


def main():
    pidpath = "/tmp/simple-daemon.pid"

    logger = logging.getLogger("simple-daemon")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(
        "%(asctime)s %(module)s[%(process)d] %(levelname)s: %(message)s",
        "%Y-%m-%d %H:%M:%S")
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.info("Starting.")

    # Create pidfile
    pidf = pidlockfile.TimeoutPIDLockFile(pidpath, 10)
    # Remove any stale PID files, left behind by previous invocations
    if daemon.runner.is_pidfile_stale(pidf):
        logger.warning("Removing stale PID lock file %s", pidf.path)
        pidf.break_lock()

    daemon_context = daemon.DaemonContext(pidfile=pidf, umask=0o22)
    try:
        daemon_context.open()
    except (AlreadyLocked, LockTimeout):
        logger.critical("Failed to lock pidfile %s", pidpath)
        sys.exit(1)

    logger.info("Daemonized.")
    logger.info("Another message.")


if __name__ == "__main__":
    main()
To reproduce the bug, run the program and grep the system logs for the string “simple-daemon”. You’ll see “Starting.” and “Another message.” but no “Daemonized.” message.
While newly written services shouldn’t daemonize themselves, since a modern service manager can supervise the process, here is an even simpler program that illustrates the exact same bug without any daemonization:
#!/usr/bin/python3
import os
import socket
import time

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
# Close the socket's file descriptor behind the object's back.
os.close(s.fileno())
# The pipe's read end reuses the descriptor number the socket just gave up.
r, w = os.pipe()
os.write(w, b'hi\n')
# Dropping the last reference garbage-collects the socket, which closes
# its file descriptor -- now the pipe's read end.
s = None
time.sleep(4)
# Fails with EBADF: the descriptor was closed out from under the pipe.
print(os.read(r, 4096))
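The failure disappears if the socket object gives up ownership of the descriptor before we close it. Here is a minimal sketch of that fix using socket.detach(), available since Python 3.2:

```python
import os
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
fd = s.detach()          # s no longer owns the descriptor,
os.close(fd)             # so we must close it ourselves

r, w = os.pipe()         # the pipe may reuse the same descriptor number
os.write(w, b'hi\n')
s = None                 # garbage collection no longer closes anything
data = os.read(r, 4096)  # succeeds: the pipe's read end is intact
```

After detach(), the socket object considers itself closed, so its destructor leaves the descriptor alone and the read succeeds.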
I tried the service on a few distributions. I could not reproduce the problem on Debian Stretch, but did reproduce it on CentOS, Arch Linux, and Debian Buster. Where the bug appeared, the expected “Daemonized.” message was missing, and the only matching lines in the logs were:
Aug 28 16:30:16 computer 2020-08-28 16:30:16 simple-daemon[26333] INFO: Starting.
Aug 28 16:30:16 computer 2020-08-28 16:30:16 simple-daemon[26335] INFO: Another message.
Running strace on the program and looking for the syscalls that accessed the syslog, I found:
socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 10) = 0
# ... and sometime later:
sendto(3, "<14>2020-08-27 06:16:45 simple-daemon[109680] INFO: Daemonized.\0", 64, 0, NULL, 0) = -1 EBADF (Bad file descriptor)
close(3) = -1 EBADF (Bad file descriptor)
Why is this program trying to write to and then close an invalid file descriptor? The answer becomes a bit more obvious with an understanding of what the python-daemon package does and a read of logging.handlers.SysLogHandler.
By default, python-daemon closes all open file descriptors. That includes file descriptors referenced by Python objects. The file descriptor for the syslog is referenced by the socket member of SysLogHandler objects. The first write to the syslog wrote data to an already-closed file descriptor, which triggered a “close socket and reconnect” sequence.
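This sequence can be simulated without python-daemon at all. A rough sketch, with a plain socket standing in for the handler's socket member and a single os.close standing in for DaemonContext closing every descriptor at daemonize time:

```python
import errno
import os
import socket

# Stand-in for SysLogHandler's socket member.
handler_sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
# Stand-in for DaemonContext closing all open descriptors.
os.close(handler_sock.fileno())
try:
    handler_sock.send(b"<14>: Daemonized.")  # the first post-fork log write
    err = None
except OSError as e:
    err = e.errno  # EBADF, matching the strace output above
```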
I first saw this problem in Python 2.7, on CentOS 7. Here the strace output is even more curious:
sendto(3, "<14>2020-08-27 06:43:12 simple-daemon[3708] INFO: Daemonized.\0", 62, 0, NULL, 0) = -1 EBADF (Bad file descriptor)
socket(AF_UNIX, SOCK_DGRAM, 0) = 3
close(3) = 0
connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 10) = -1 EBADF (Bad file descriptor)
close(3) = -1 EBADF (Bad file descriptor)
write(2, "Traceback (most recent call last):\n", 35) = 35
Here, after a failed write, SysLogHandler immediately calls the socket system call without closing the old socket. It then closes the socket it just created and tries to connect to the closed socket! This is a confluence of bugs: the dangling file descriptor bug I’ve been discussing, and Python Issue 17981 (“SysLogHandler closes connection before using it”).
Here’s the inoffensive but problematic line from logging.handlers.SysLogHandler._connect_unixsocket:
self.socket = socket.socket(socket.AF_UNIX, use_socktype)
That line creates a socket and assigns it to self.socket. As a side effect, the old self.socket becomes unreferenced and available for garbage collection. When the old self.socket is garbage-collected, its .close method gets called, because Python doesn’t know that it is already closed. Unluckily, both the old and the new socket object used file descriptor 3, so the new object now references a closed file descriptor.
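This descriptor-number reuse is easy to demonstrate. A minimal sketch, assuming CPython (where dropping the last reference closes the socket immediately) on a POSIX system (where the kernel hands out the lowest free descriptor number):

```python
import errno
import os
import socket

a = socket.socket()
fd = a.fileno()
os.close(fd)                  # close the descriptor behind the object's back
b = socket.socket()           # the kernel reuses the lowest free number
same_fd = (b.fileno() == fd)  # True: b got a's old number
del a                         # a's destructor closes fd -- which is b's now
try:
    b.getsockname()
    err = None
except OSError as e:
    err = e.errno             # EBADF: b's descriptor vanished with a
```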
The resolution to Python Issue 17981 is to close the socket on error: the object’s .close method is called before creating the new socket, rather than after. The fix was committed upstream in May of 2013, but it has not yet made its way to CentOS 7. Issue 17981 is a red herring, though; the real problem is objects with dangling references to file descriptors.
The straightforward fix in this sample program is to tell DaemonContext not to close the syslog file descriptor:
daemon_context = daemon.DaemonContext(
    pidfile=pidf,
    umask=0o22,
    files_preserve=[handler.socket],
)
One should think carefully before using functions like os.close. This call works on the underlying file descriptor, which may be associated with a file object that expects to manage the descriptor’s lifetime.
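When descriptor-level code must coexist with a Python object, it helps to make ownership explicit. A sketch using the closefd=False parameter of the built-in open(), so the object borrows the descriptor instead of owning it:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
f = open(fd, "w", closefd=False)  # f borrows fd and will never close it
f.write("hello")
f.close()           # flushes, but leaves the descriptor open
os.write(fd, b"!")  # still valid: we own fd
os.close(fd)        # and we close it exactly once

data = open(path, "rb").read()  # b"hello!"
os.unlink(path)
```

With ownership pinned down this way, no object ever closes a descriptor out from under another.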