• Prgmr.com Service Affected by DOS

    Wed, 16 Sep 2020 04:25:00 -0700 - Sarah Newman

    Prgmr.com was impacted by a network denial-of-service attack targeting a specific service, starting at approximately 10:10 UTC and ending at 10:24 UTC. The immediate incident is resolved.

  • HE.net Switch Replacement

    Thu, 03 Sep 2020 14:30:00 -0700 - Chris Brannon

    In the next couple of hours, Hurricane Electric will be installing a new switch. This should fix the networking issues that have caused multiple outages.

    While the new switch is being installed, we will be using our secondary transit provider, CenturyLink. There will be no network redundancy. Also, VPSes that have HE.net IP addresses will not be reachable via the HE.net-owned IPs.

  • Loss of Network Redundancy

    Wed, 02 Sep 2020 05:20:00 -0700 - Sarah Newman

    UPDATE 19:55 UTC: Redundancy has been restored as of 19:30 UTC. More fiber optic related equipment was replaced.


    At approximately 11:40 UTC, we lost connectivity to our primary transit provider, Hurricane Electric. We automatically switched to our secondary transit provider, and the network is up, though without redundancy.

    This is likely the same problem as yesterday.

    We will update this post with any changes to this situation.

  • Loss of Network Redundancy

    Tue, 01 Sep 2020 14:00:00 -0700 - Chris Brannon

    UPDATE: On or around 00:00 UTC we automatically switched back to our primary transit provider, restoring network redundancy.

    At approximately 20:30 UTC, we lost connectivity to our primary transit provider, Hurricane Electric. We automatically switched to our secondary transit provider, and the network is up, though without redundancy.

    We will update this post with any changes to this situation.

  • Python ate my log message

    Sat, 29 Aug 2020 10:00:00 -0700 - Chris Brannon

    When testing what should have been a quick one-line fix to a Python service written many years ago, I noticed a strange thing: some of the messages that the program should have been sending to the syslog were missing.

    Here’s a program, simple-daemon.py, with just enough code to highlight the issue:

    #!/usr/bin/env python3
    # Requires the following third-party Python packages:
    # python-daemon: https://pypi.python.org/pypi/python-daemon
    # pylockfile: https://github.com/openstack/pylockfile
    import logging.handlers
    import sys
    import daemon
    import daemon.pidfile as pidlockfile
    import daemon.runner
    from lockfile import AlreadyLocked
    from lockfile import LockTimeout
    
    
    def main():
        pidpath = "/tmp/simple-daemon.pid"
        logger = logging.getLogger("simple-daemon")
        logger.setLevel(logging.INFO)
        formatter = logging.Formatter(
            "%(asctime)s %(module)s[%(process)d] %(levelname)s: %(message)s",
            "%Y-%m-%d %H:%M:%S")
        handler = logging.handlers.SysLogHandler(address="/dev/log")
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    
        logger.info("Starting.")
        # Create pidfile
        pidf = pidlockfile.TimeoutPIDLockFile(pidpath, 10)
    
        # Remove any stale PID files, left behind by previous invocations
        if daemon.runner.is_pidfile_stale(pidf):
            logger.warning("Removing stale PID lock file %s", pidf.path)
            pidf.break_lock()
    
        daemon_context = daemon.DaemonContext(pidfile=pidf, umask=0o22)
        try:
            daemon_context.open()
        except (AlreadyLocked, LockTimeout):
            logger.critical("Failed to lock pidfile %s", pidpath)
            sys.exit(1)
    
        logger.info("Daemonized.")
        logger.info("Another message.")
    
    if __name__ == "__main__":
        main()
    

    To reproduce the bug, run the program and grep the system logs for the string “simple-daemon”. You’ll see “Starting.” and “Another message.” but no “Daemonized.” message.

    While newly written services shouldn’t daemonize themselves, since a modern service manager can supervise the process, here is an even simpler program that illustrates the exact same bug without any daemonization:

    #!/usr/bin/python3
    import os
    import socket
    import time
    
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    os.close(s.fileno())
    r, w = os.pipe()
    os.write(w, b'hi\n')
    s = None
    time.sleep(4)
    print(os.read(r, 4096))
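
    One safe variant (a sketch, not from the original post) is to take ownership of the descriptor with socket.detach() before closing it; the socket object then no longer owns a descriptor, so garbage collection cannot close it a second time:

```python
#!/usr/bin/python3
import os
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
os.close(s.detach())     # detach() hands the descriptor to the caller
r, w = os.pipe()
os.write(w, b'hi\n')
s = None                 # the object owns no descriptor; nothing is closed
data = os.read(r, 4096)  # the pipe's descriptors survive
print(data)
```

    With os.close(s.fileno()) instead, this program raises OSError: the garbage-collected socket closes the reused descriptor number out from under the pipe.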
    

    I tried the service on a few distributions. I could not reproduce the problem on Debian Stretch, but I did reproduce it on CentOS, Arch Linux, and Debian Buster. I expected to find the “Daemonized.” message, but on the affected systems the only entries in the logs were:

    Aug 28 16:30:16 computer 2020-08-28 16:30:16 simple-daemon[26333] INFO: Starting.
    Aug 28 16:30:16 computer 2020-08-28 16:30:16 simple-daemon[26335] INFO: Another message.
    

    Running strace on the program and looking for the syscalls that accessed the syslog, I found:

    socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 10) = 0
    # ... and sometime later:
    sendto(3, "<14>2020-08-27 06:16:45 simple-daemon[109680] INFO: Daemonized.\0", 64, 0, NULL, 0) = -1 EBADF (Bad file descriptor)
    close(3)                                = -1 EBADF (Bad file descriptor)
    

    Why is this program trying to write to and then close an invalid file descriptor? The answer becomes a bit more obvious with an understanding of what the python-daemon package does and a read of logging.handlers.SysLogHandler.

    By default, python-daemon closes all open file descriptors, including file descriptors referenced by Python objects. The descriptor for the syslog is referenced by the socket member of the SysLogHandler object. The first write to the syslog after daemonization therefore went to an already-closed file descriptor, which triggered the handler’s “close socket and reconnect” sequence.
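
    The failure mode can be reproduced in a few lines, independent of python-daemon (a sketch): closing a socket’s descriptor behind the object’s back leaves the object holding a stale number, and the next send fails with the same EBADF seen in the strace output above:

```python
import os
import socket

s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
os.close(s.fileno())          # what DaemonContext's descriptor sweep effectively does
try:
    s.send(b"<14>test: hello")
    errno_seen = None
except OSError as e:
    errno_seen = e.errno
print("errno:", errno_seen)   # 9 == EBADF
```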

    I first saw this problem in Python 2.7, on CentOS 7. Here the strace output is even more curious:

    sendto(3, "<14>2020-08-27 06:43:12 simple-daemon[3708] INFO: Daemonized.\0", 62, 0, NULL, 0) = -1 EBADF (Bad file descriptor)
    socket(AF_UNIX, SOCK_DGRAM, 0)          = 3
    close(3)                                = 0
    connect(3, {sa_family=AF_UNIX, sun_path="/dev/log"}, 10) = -1 EBADF (Bad file descriptor)
    close(3)                                = -1 EBADF (Bad file descriptor)
    write(2, "Traceback (most recent call last):\n", 35) = 35
    

    Here, after a failed write, SysLogHandler immediately calls the socket system call without closing the old socket. It then closes the socket it just created and tries to connect to the closed socket! This is a confluence of bugs: the dangling file descriptor bug I’ve been discussing, and Python Issue 17981 (“SysLogHandler closes connection before using it”).

    Here’s the inoffensive but problematic line from logging.handlers.SysLogHandler._connect_unixsocket:

    self.socket = socket.socket(socket.AF_UNIX, use_socktype)
    

    That line creates a socket and assigns it to self.socket. As a side-effect, it causes the old self.socket to be unreferenced and available for garbage collection. When the old self.socket is garbage-collected, its .close method gets called, because Python doesn’t know that it is already closed. Unluckily, both the old and new socket object used file descriptor 3, so the new object now references a closed file descriptor.
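
    The descriptor-reuse hazard is easy to demonstrate outside of logging (a sketch): the kernel hands out the lowest free descriptor number, so a “dead” object’s deferred close can land on a descriptor that now belongs to someone else:

```python
import os
import socket

a = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
fd = a.fileno()
os.close(fd)                       # closed behind the object's back
b = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
reused = (b.fileno() == fd)        # the lowest free number is reused
a = None                           # refcount hits zero; a's close() lands on b's fd
try:
    os.fstat(b.fileno())
    clobbered = False
except OSError:                    # EBADF: b's descriptor is gone
    clobbered = True
print("reused:", reused, "clobbered:", clobbered)
```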

    The resolution to Python Issue 17981 is to close the socket on error, so that the object’s .close method is called before the new socket is created rather than after. The fix was committed upstream in May of 2013, but it has not yet made its way to CentOS 7. Issue 17981 is a red herring here, though: the real problem is objects holding dangling references to file descriptors.
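
    To make the ordering concrete, here is a toy reconnect helper (a sketch, not the real SysLogHandler code) following the post-fix sequence: release the old object’s descriptor first, then create the replacement, so no other live object can alias the freed number:

```python
import socket

class ToyHandler:
    """Sketch of the close-before-create reconnect ordering."""
    def __init__(self):
        self.socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)

    def reconnect(self):
        # Close first: the old descriptor goes back to the kernel while
        # no live object still refers to it.
        self.socket.close()
        self.socket = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)

h = ToyHandler()
old = h.socket.fileno()
h.reconnect()
reused = (h.socket.fileno() == old)  # same number, but now safely owned
print("reused:", reused)
```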

    The straightforward fix in this sample program is to tell DaemonContext not to close the syslog file descriptor:

    daemon_context = daemon.DaemonContext(
        pidfile=pidf,
        umask=0o22,
        files_preserve=[handler.socket],
    )
    

    One should think carefully before using functions like os.close. It operates directly on the underlying file descriptor, which may be owned by a file or socket object that expects to manage its lifetime.