When testing what should have been a quick one-line fix to a Python service written many years ago, I noticed a strange thing: some of the messages that the program should have been sending to the syslog were missing.

Here’s a program, simple-daemon.py, with just enough code to highlight the issue:

#!/usr/bin/env python3
# Requires the following third-party Python packages:
# python-daemon: https://pypi.python.org/pypi/python-daemon
# pylockfile: https://github.com/openstack/pylockfile
import logging.handlers
import sys
import daemon
import daemon.pidfile as pidlockfile
import daemon.runner
from lockfile import AlreadyLocked
from lockfile import LockTimeout


def main():
    pidpath = "/tmp/simple-daemon.pid"
    logger = logging.getLogger("simple-daemon")
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter(
        "%(asctime)s %(module)s[%(process)d] %(levelname)s: %(message)s",
        "%Y-%m-%d %H:%M:%S")
    handler = logging.handlers.SysLogHandler(address="/dev/log")
    handler.setFormatter(formatter)
    logger.addHandler(handler)

    logger.info("Starting.")
    # Create pidfile
    pidf = pidlockfile.TimeoutPIDLockFile(pidpath, 10)

    # Remove any stale PID files, left behind by previous invocations
    if daemon.runner.is_pidfile_stale(pidf):
        logger.warning("Removing stale PID lock file %s", pidf.path)
        pidf.break_lock()

    daemon_context = daemon.DaemonContext(pidfile=pidf, umask=0o22)
    try:
        daemon_context.open()
    except (AlreadyLocked, LockTimeout):
        logger.critical("Failed to lock pidfile %s", pidpath)
        sys.exit(1)

    logger.info("Daemonized.")
    logger.info("Another message.")

if __name__ == "__main__":
    main()

To reproduce the bug, run the program and grep the system logs for the string “simple-daemon”. You’ll see “Starting.” and “Another message.” but no “Daemonized.” message.

While newly written services shouldn’t daemonize themselves, since a modern service manager can supervise the process, here is an even simpler program illustrating the exact same bug, without any daemonization:

#!/usr/bin/python3
import os
import socket
import time

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
os.close(s.fileno())     # close the descriptor behind the object's back
r, w = os.pipe()         # the read end reuses the socket's old descriptor number
os.write(w, b'hi\n')
s = None                 # garbage collection "closes" the socket -- actually the pipe's read end
time.sleep(4)
print(os.read(r, 4096))  # raises OSError: [Errno 9] Bad file descriptor
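For contrast, closing the descriptor through the socket object itself avoids the dangling reference: the object records that it is closed, so garbage collection has nothing left to close. A minimal sketch of the corrected pattern:

```python
#!/usr/bin/python3
import os
import socket
import time

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.close()                # the object closes its own descriptor and remembers it
r, w = os.pipe()         # descriptor numbers may still be reused; nothing dangles
os.write(w, b'hi\n')
s = None                 # garbage collection finds an already-closed object: no-op
time.sleep(4)
print(os.read(r, 4096))  # b'hi\n'
```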

I tried the service on a few distributions. I could not reproduce the problem on Debian Stretch, but I did reproduce it on CentOS, Arch Linux, and Debian Buster. On the affected systems, the “Daemonized.” message I expected never appeared; the only matching log entries were:

Aug 28 16:30:16 computer 2020-08-28 16:30:16 simple-daemon[26333] INFO: Starting.
Aug 28 16:30:16 computer 2020-08-28 16:30:16 simple-daemon[26335] INFO: Another message.

Running strace on the program and looking for the syscalls that accessed the syslog, I found:

socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 10) = 0
# ... and sometime later:
sendto(3, "<14>2020-08-27 06:16:45 simple-daemon[109680] INFO: Daemonized.\0", 64, 0, NULL, 0) = -1 EBADF (Bad file descriptor)
close(3)                                = -1 EBADF (Bad file descriptor)

Why is this program writing to, and then closing, an invalid file descriptor? The answer becomes clearer with an understanding of what the python-daemon package does and a read through logging.handlers.SysLogHandler.

By default, python-daemon closes all open file descriptors when it daemonizes, including descriptors still referenced by Python objects. The syslog descriptor is one of these: it is referenced by the socket attribute of the SysLogHandler object. The first write to the syslog after daemonizing therefore hit an already-closed file descriptor, which triggered the handler’s “close socket and reconnect” sequence.

I first saw this problem in Python 2.7, on CentOS 7. Here the strace output is even more curious:

sendto(3, "<14>2020-08-27 06:43:12 simple-daemon[3708] INFO: Daemonized.\0", 62, 0, NULL, 0) = -1 EBADF (Bad file descriptor)
socket(AF_UNIX, SOCK_DGRAM, 0)          = 3
close(3)                                = 0
connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 10) = -1 EBADF (Bad file descriptor)
close(3)                                = -1 EBADF (Bad file descriptor)
write(2, "Traceback (most recent call last):\n", 35) = 35

Here, after a failed write, SysLogHandler immediately calls the socket system call without closing the old socket. It then closes the socket it just created and tries to connect to the closed socket! This is a confluence of bugs: the dangling file descriptor bug I’ve been discussing, and Python Issue 17981 (“SysLogHandler closes connection before using it”).

Here’s the inoffensive but problematic line from logging.handlers.SysLogHandler._connect_unixsocket:

self.socket = socket.socket(socket.AF_UNIX, use_socktype)

That line creates a socket and assigns it to self.socket. As a side effect, the old self.socket becomes unreferenced and eligible for garbage collection. When the old socket object is collected, its .close method runs, because Python doesn’t know the underlying descriptor is already closed. Unluckily, the old and new socket objects both used file descriptor 3, so the close leaves the new object referencing a closed file descriptor.
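The effect of that rebinding is easy to demonstrate in isolation. In this sketch, dropping the last reference to a socket closes its descriptor behind our back (CPython’s reference counting closes it immediately; the gc.collect() call covers other runtimes):

```python
import errno
import gc
import os
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
fd = s.fileno()
os.fstat(fd)        # succeeds: the descriptor is open

s = None            # drop the last reference, as the reassignment above does
gc.collect()        # force finalization on runtimes without refcounting

try:
    os.fstat(fd)
except OSError as e:
    assert e.errno == errno.EBADF   # the finalizer closed the descriptor
```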

The resolution to Python Issue 17981 is to close the socket on error, so the object’s .close method is called before the new socket is created rather than after. The fix was committed upstream in May 2013, but it has not yet made its way into CentOS 7. Issue 17981 is a red herring, though: the real problem is objects holding dangling references to file descriptors.
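The fix boils down to closing the failed socket explicitly, inside the error path, before any replacement socket can be created on the same descriptor number. A hedged sketch of that pattern (not the exact CPython source; the function name is mine):

```python
import socket

def connect_unixsocket(address, socktype=socket.SOCK_DGRAM):
    """Connect to a Unix-domain log socket, closing eagerly on failure."""
    s = socket.socket(socket.AF_UNIX, socktype)
    try:
        s.connect(address)
        return s
    except OSError:
        s.close()   # explicit close: GC later finds nothing left to close
        raise

try:
    connect_unixsocket("/nonexistent/log-socket")
except OSError:
    pass            # the socket was already closed inside the error path
```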

The straightforward fix in this sample program is to tell DaemonContext not to close the syslog file descriptor:

daemon_context = daemon.DaemonContext(
    pidfile=pidf,
    umask=0o22,
    files_preserve=[handler.socket],
)

One should think carefully before using functions like os.close. It operates directly on the underlying file descriptor, which may belong to a Python object that expects to manage its lifetime.
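The difference matters because the socket object tracks its own state: closing through the object is idempotent, while os.close leaves the object convinced its descriptor is still open. A small illustration:

```python
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
assert s.fileno() >= 0   # an open socket has a real descriptor

s.close()
assert s.fileno() == -1  # the object knows it is closed
s.close()                # a second close is a harmless no-op
```

Inside a long-lived process, that bookkeeping is exactly what prevents a finalizer from closing a descriptor that some other object has since inherited.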