After a recent upgrade of my Fedora 20 system to kernel 3.15.mumble, I started running into a problem (BZ 1121345) with my Docker containers. Operations such as su or runuser would fail with the singularly unhelpful System error message:

$ docker run -ti fedora /bin/bash
bash-4.2# su -c 'uptime'
su: System error

Hooking up something (like, say, socat unix-listen:/dev/log -) to /dev/log revealed that the system was logging:

Jul 19 14:31:18 su: PAM audit_log_acct_message() failed: Operation not permitted

Downgrading the kernel to 3.14 immediately resolved the problem, suggesting that this was at least partly a kernel issue. This seemed like a great opportunity to play with the git bisect command, which uses a binary search to find which commit introduced a particular problem.

Unfortunately, between the version I knew to work correctly (3.14) and the version I knew to have a problem (3.15) there were close to 15,000 commits, which seemed like a large space to search by hand.

Fortunately, git bisect can be easily automated via git bisect run subcommand, which after checking out a commit will run a script to determine if the current commit is "good" or "bad". So all I have to do is write a script...that's not so bad!

It actually ended up being somewhat tricky.

Testing kernels is hard

In order to test for this problem, I would need to use arbitrary kernels generated during the git bisect operation to boot a system functional enough to run docker, and then run docker and somehow communicate the result of that test back to the build environment.

I started with the Fedora 20 cloud image, which is nice and small but still the same platform as my laptop on which I was experiencing the problem. I would need to correct a few things before moving forward:

The Fedora cloud images (a) do not support password authentication and (b) expect a datasource to be available to cloud-init (without which you get errors on the console and potentially a delay waiting for the login: prompt), so prior to using the image in this test I made some changes by mounting it locally:

# modprobe nbd max_part=8
# qemu-nbd -c /dev/nbd0 Fedora-x86_64-20-20140407-sda.qcow2
# mount /dev/nbd0p1 /mnt
# systemd-nspawn -D /mnt

And then:

  • I set a password for the root account and
  • I removed the cloud-init package.

For this test I would be using the qemu-system-x86_64 command directly, rather than working through libvirt (qemu has options for convenient debugging with gdb, and is also able to access the filesystem as the calling uid whereas libvirt is typically running as another user).

I would need to perform an initial docker pull in the image, which meant I was going to need a functioning network, so first I had to set up a network environment for qemu.

Network configuration

I created a bridge interface named qemu0 to be used by qemu. I added to /etc/sysconfig/network-scripts/ifcfg-qemu0 the following:

DEVICE=qemu0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=none
STP=no
NAME="Bridge qemu0"
IPADDR=192.168.210.1
NETMASK=255.255.255.0

This is largely equivalent to the following, but persists after reboot:

brctl addbr qemu0
ip addr add 192.168.210.1/24 dev qemu0
ip link set qemu0 up

I created a tap interface named linux0:

ip tuntap add dev linux0 mode tap user lars

And added it to the bridge:

brctl addif qemu0 linux0

I also started up dnsmasq process listening on qemu0 to provide DNS lookup and DHCP service to qemu instances attached to this bridge. The dnsmasq configuration looked like this:

listen-address=192.168.210.1
bind-interfaces
dhcp-range=192.168.210.10,192.168.210.254

Running qemu

With the network environment set up, I needed to figure out an appropriate qemu command line. This is what I finally ended up with, in a script called boot-kernel:

#!/bin/sh

qemu-system-x86_64 -m 1024M \
  -drive file=fedora.img,if=virtio \
  -append "console=hvc0 root=/dev/vda1 selinux=0 $BOOT_ARGS" \
  -initrd initrd.img \
  -kernel arch/x86_64/boot/bzImage \
  -machine accel=kvm \
  -netdev tap,id=net0,ifname=linux0,script=no,downscript=no \
  -device virtio-net,netdev=net0,mac=52:54:00:c0:ff:ee \
  -chardev stdio,id=stdio,mux=on \
  -device virtio-serial-pci \
  -device virtconsole,chardev=stdio \
  -mon chardev=stdio \
  -fsdev local,id=fs0,path=$PWD,security_model=none \
  -device virtio-9p-pci,fsdev=fs0,mount_tag=kernel_src \
  -display none \
  $QEMU_ARGS

These lines set up the networking:

  -netdev tap,id=net0,ifname=linux0,script=no,downscript=no \
  -device virtio-net,netdev=net0,mac=52:54:00:c0:ff:ee \

These lines set up console on stdin/stdout and multiplex the console with the qemu monitor:

  -chardev stdio,id=stdio,mux=on \
  -device virtio-serial-pci \
  -device virtconsole,chardev=stdio \
  -mon chardev=stdio \

These lines set up access to the current working directory as a 9p filesystem:

  -fsdev local,id=fs0,path=$PWD,security_model=none \
  -device virtio-9p-pci,fsdev=fs0,mount_tag=kernel_src \

Within the qemu instance, this lets me access my working directory with:

mount -t 9p kernel_src /mnt

The $BOOT_ARGS and $QEMU_ARGS in the script allow me to modify the behavior of the script by setting environment variables when calling it, like this:

QEMU_ARGS="-s" sh boot-kernel

First boot

I tried to boot the image using my existing kernel and initrd from /boot, and ran into a problem:

[  184.060756] dracut-initqueue[218]: Warning: Could not boot.
[  184.062855] dracut-initqueue[218]: Warning: /dev/ssd/root does not exist
         Starting Dracut Emergency Shell...
Warning: /dev/ssd/root does not exist

Generating "/run/initramfs/rdsosreport.txt"

Entering emergency mode. Exit the shell to continue.

The what now? /dev/ssd/root is the root device for my host system, but wasn't anywhere in the kernel command line I used when booting qemu. It turns out that this was embedded in the initrd image in /etc/cmdline.d/90lvm.conf. After removing that file from the image...

# mkdir initrd
# cd initrd
# zcat /boot/initramfs-3.15.6-200.fc20.x86_64.img | cpio -id
# rm -rf etc/cmdline.d
# find . -print | cpio -o -Hcrc | gzip > ../initrd.img

...I was able to boot successfully and log in.

I bet you thought we were done!

Modern systems are heavily modular. Without access to a module tree matching the kernel, I would be unable to successfully boot the system, let alone use Docker. Looking at which modules were loaded when I ran docker with the above image, I set up a custom kernel configuration that would permit me to boot and run docker without requiring any loadable modules. This would allow me to use the same image for each kernel without needing to re-populate it with modules each time I built a kernel.

The kernel configuration I ended up with is available here.

Testing docker

The last step in this process is putting together something that tests docker and exposes the result of that test to the build environment. I added the following script to the image as /root/docker-test:

#!/bin/sh

grep NO_DOCKER_TEST /proc/cmdline && exit 0

if [ -d /mnt/test_result ]; then
docker run --rm -i fedora sh -c 'su -c true && echo OKAY || echo FAILED' \
  > /mnt/test_result/stdout \
  2> /mnt/test_result/stderr
poweroff
fi

This relies on the following entry in /etc/fstab:

kernel_src /mnt 9p defaults 0 0

That mounts the build directory as a 9p filesystem on /mnt. This allows us to write out test results to, e.g., /mnt/test_result/stdout and have that appear in the test_result directory inside the kernel source.

This script is run at the end of the boot process via an entry in /etc/rc.d/rc.local:

#!/bin/sh

sh /root/docker-test

Running the boot-kernel script without additional configuration will cause the image to boot up, run the docker test, and then exit.

Running git-bisect

At this point we have just about everything we need to start running git bisect. For the initial run, I'm going to use git tag v3.14 as the "known good" commit and v3.15 as the "known bad" commit, so we start git bisect like this:

$ git bisect start v3.15 v3.14

Then we run git bisect run sh bisect-test, where bisect-test is the following shell script:

#!/bin/sh

# Rebuild the kernel
make olddefconfig
make -j8

# Clear out old test results and run the test
rm -f test_result/{stdout,stderr}
sh boot-kernel

# Report results to git-bisect
if grep OKAY test_result/stdout; then
  exit 0
else
  exit 1
fi

...and then we go out for a cup of coffee or something, because that's going to take a while.

Keep digging, Watson

The initial run of git bisect narrowed the change down to the following commit:

commit b7d3622a39fde7658170b7f3cf6c6889bb8db30d
Merge: f3411cb d8ec26d
Author: Eric Paris <eparis@redhat.com>
Date:   Fri Mar 7 11:41:32 2014 -0500

    Merge tag 'v3.13' into for-3.15

    Linux 3.13

    Conflicts:
        include/net/xfrm.h

    Simple merge where v3.13 removed 'extern' from definitions and the audit
    tree did s/u32/unsigned int/ to the same definitions.

As you can see (from the Merge: header), this is a merge commit, in which an entire set of changes was joined into the master branch. So while this commit is technically the first commit in which this problem appears in the master branch...it is not actually the commit that introduced the problem.

I was in luck, though, because looking at the history for the left side of this branch (starting with f3411cb) showed a series of patches to the audit subsystem:

$ git log --oneline f3411cb
f3411cb audit: whitespace fix in kernel-parameters.txt
8626877 audit: fix location of __net_initdata for audit_net_ops
4f06632 audit: remove pr_info for every network namespace
262fd3a audit: Modify a set of system calls in audit class definitions
3e1d0bb audit: Convert int limit uses to u32
d957f7b audit: Use more current logging style
b8dbc32 audit: Use hex_byte_pack_upper
06bdadd audit: correct a type mismatch in audit_syscall_exit()
1ce319f audit: reorder AUDIT_TTY_SET arguments
0e23bac audit: rework AUDIT_TTY_SET to only grab spin_lock once
3f0c5fa audit: remove needless switch in AUDIT_SET
70249a9 audit: use define's for audit version

...etc.

I picked as a starting point the merge commit previous to f3411cb:

$ git log --merges -1
commit fc582aef7dcc27a7120cf232c1e76c569c7b6eab
Merge: 9175c9d 5e01dc7
Author: Eric Paris <eparis@redhat.com>
Date:   Fri Nov 22 18:57:08 2013 -0500

    Merge tag 'v3.12'

    Linux 3.12

    Conflicts:
        fs/exec.c

And ran git bisect again from that commit through to f3411cb:

$ git bisect start f3411cb fc582ae
$ git bisect run sh bisect-test

Which ultimately ended up with this commit:

33faba7fa7f2288d2f8aaea95958b2c97bf9ebfb is the first bad commit
commit 33faba7fa7f2288d2f8aaea95958b2c97bf9ebfb
Author: Richard Guy Briggs <rgb@redhat.com>
Date:   Tue Jul 16 13:18:45 2013 -0400

    audit: listen in all network namespaces

    Convert audit from only listening in init_net to use register_pernet_subsys()
    to dynamically manage the netlink socket list.

    Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
    Signed-off-by: Eric Paris <eparis@redhat.com>

Running git bisect log shows us what revisions were checked as part of this process:

# bad: [f3411cb2b2e396a41ed3a439863f028db7140a34] audit: whitespace fix in kernel-parameters.txt
# good: [fc582aef7dcc27a7120cf232c1e76c569c7b6eab] Merge tag 'v3.12'
git bisect start 'f3411cb' 'fc582ae'
# bad: [ff235f51a138fc61e1a22dcb8b072d9c78c2a8cc] audit: Added exe field to audit core dump signal log
git bisect bad ff235f51a138fc61e1a22dcb8b072d9c78c2a8cc
# bad: [51cc83f024ee51de9da70c17e01ec6de524f5906] audit: add audit_backlog_wait_time configuration option
git bisect bad 51cc83f024ee51de9da70c17e01ec6de524f5906
# bad: [ae887e0bdcddb9d7acd8f1eb7b7795b438aa4950] audit: make use of remaining sleep time from wait_for_auditd
git bisect bad ae887e0bdcddb9d7acd8f1eb7b7795b438aa4950
# good: [2f2ad1013322c8f6c40fc6dafdbd32442fa730ad] audit: restore order of tty and ses fields in log output
git bisect good 2f2ad1013322c8f6c40fc6dafdbd32442fa730ad
# bad: [e789e561a50de0aaa8c695662d97aaa5eac9d55f] audit: reset audit backlog wait time after error recovery
git bisect bad e789e561a50de0aaa8c695662d97aaa5eac9d55f
# bad: [33faba7fa7f2288d2f8aaea95958b2c97bf9ebfb] audit: listen in all network namespaces
git bisect bad 33faba7fa7f2288d2f8aaea95958b2c97bf9ebfb
# first bad commit: [33faba7fa7f2288d2f8aaea95958b2c97bf9ebfb] audit: listen in all network namespaces

The commit found by git bisect seems like a reasonable candidate; it's a patch against the audit subsystem and has something to do with namespaces, which are central to Docker's proper operation.

Debugging the problem

We can boot the kernel built from 33faba7 with the boot-kernel script, adding the -s argument to qemu to start a gdbserver on port 1234:

sh BOOT_ARGS=NO_DOCKER_TEST QEMU_ARGS="-s" boot-kernel

A caveat about attaching to qemu with gdb: qemu has a -S option that will cause the virtual machine to halt at startup, such that you can attach before it starts booting and -- in theory -- set breakpoints in the early boot process. In practice this doesn't work well at all (possibly because the vm switches from 32- to 64-bit operation during the boot process, which makes gdb unhappy). You're better off attaching after the kernel has booted.

In another window, we attach gdb to the running qemu process:

$ gdb vmlinux
Reading symbols from vmlinux...done.
(gdb) target remote :1234
Remote debugging using :1234
native_safe_halt () at /home/lars/src/linux/arch/x86/include/asm/irqflags.h:50
50  }
(gdb)

I know we're getting the EPERM in response to sending audit messages. Looking through the code in kernel/audit.c, the audit_receive_msg seems like a reasonable place to start poking about. At the beginning of audit_receive_msg, I see the following code:

err = audit_netlink_ok(skb, msg_type);
if (err)
  return err;

So let's set a breakpoint there if audit_netlink_ok() returns an error:

(gdb) br kernel/audit.c:752 if (err != 0)

And let our qemu process continue running:

(gdb) continue
Continuing.

Inside the qemu instance I start docker:

-bash-4.2# docker run -it fedora /bin/su -c uptime

And eventually gdb hits the breakpoint:

Breakpoint 1, audit_receive_msg (nlh=0xffff88003819a400, 
    skb=0xffff880038044300) at kernel/audit.c:752
752     if (err)

If I look at the value of err at this point:

(gdb) print err
$1 = -1

That it is, in fact, -EPERM, which suggests we're on the right track. Taking a closer look at audit_netlink_ok(), it's obvious that there are only three places where it can return -EPERM. I tried setting some breakpoint in this function but they weren't working correctly, probably due to to optimizations performed when compiling the kernel. So instead of gdb, in this step we just add a bunch of pr_err() statements to print out debugging information on the console:

if ((current_user_ns() != &init_user_ns) ||
    (task_active_pid_ns(current) != &init_pid_ns)) {
  pr_err("currnet_user_ns() check failed\n");
  return -EPERM;
}
.
.
.
case AUDIT_MAKE_EQUIV:
  if (!capable(CAP_AUDIT_CONTROL)) {
    pr_err("CAP_AUDIT_CONTROL check failed\n");
    err = -EPERM;
  }
  break;
case AUDIT_USER:
.
.
.
case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
  if (!capable(CAP_AUDIT_WRITE)) {
    pr_err("CAP_AUDIT_WRITE check failed\n");
    err = -EPERM;
  }
  break;

With these in place, if I run the docker command again I see:

[   12.239860] currnet_user_ns() check failed
su: System error

It looks like we've found out where it's failing! Of course, we're checking code right now that is several commits behind v3.15, so let's take a look the same function in the 3.15 release:

$ git checkout v3.15

Looking at audit_netlink_ok in kernel/audit.c, it looks as if that initial check has changed:

    /* Only support initial user namespace for now. */
    /*
     * We return ECONNREFUSED because it tricks userspace into thinking
     * that audit was not configured into the kernel.  Lots of users
     * configure their PAM stack (because that's what the distro does)
     * to reject login if unable to send messages to audit.  If we return
     * ECONNREFUSED the PAM stack thinks the kernel does not have audit
     * configured in and will let login proceed.  If we return EPERM
     * userspace will reject all logins.  This should be removed when we
     * support non init namespaces!!
     */
    if (current_user_ns() != &init_user_ns)
            return -ECONNREFUSED;

So let's insert our print statements into this version of the code and see if we get the same behavior:

if (current_user_ns() != &init_user_ns) {
  pr_err("current_user-ns() check failed\n");
  return -ECONNREFUSED;
}
.
.
.
case AUDIT_MAKE_EQUIV:
  /* Only support auditd and auditctl in initial pid namespace
   * for now. */
  if ((task_active_pid_ns(current) != &init_pid_ns)) {
    pr_err("init_pid_ns check failed\n");
    return -EPERM;
  }

  if (!netlink_capable(skb, CAP_AUDIT_CONTROL)) {
    pr_err("CAP_AUDIT_CONTROL check failed\n");
    err = -EPERM;
  }
  break;
.
.
.
case AUDIT_USER:
case AUDIT_FIRST_USER_MSG ... AUDIT_LAST_USER_MSG:
case AUDIT_FIRST_USER_MSG2 ... AUDIT_LAST_USER_MSG2:
  if (!netlink_capable(skb, CAP_AUDIT_WRITE)) {
    pr_err("CAP_AUDIT_WRITE check failed\n");
    err = -EPERM;
  }
  break;

Running the v3.15 kernel, I see:

[   26.273992] audit: CAP_AUDIT_WRITE check failed
su: System error

So it looks like the intial failure in audit_netlink_ok() was fixed, but we're stilling failing the CAP_AUDIT_WRITE check.

Summary

What's going on here?

Prior to 33faba7, audit messages were only accepted in the main network namespace. Inside other network namespaces, processes sending audit messages would simply receive ECONNREFUSED. For example, this is the result of using strace on that docker run command in a pre-33faba7 kernel:

539   sendto(3, "...authentication acct=\"root\" exe=\"/usr/bin/su\" hostname=? a"...,
      112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = -1 ECONNREFUSED (Connection refused)

With 33faba7, audit messages are now accepted inside network namespaces. This means that instead of simply getting ECONNREFUSED, messages must pass the kernel capability check. I spoke with some of the audit subsystem maintainers (including Richard Guy Briggs, the author of this patch series), and the general consensus is that "if you want to write audit messages you need CAP_AUDIT_WRITE".

So while this patch did change the behavior of the kernel from the perspective of container tools such as Docker, the fix needs to be in the tool creating the namespaces.

Results

This issue was reported against Fedora in BZ 1121345 and BZ 1119849. This issue was also reported against Docker in GHI 6345 and GHI 7123.

This problem has been corrected upstream in PR 7179.

Package docker-io-1.0.0-9.fc20, which includes the above fix, is now available for Fedora 20 (and Fedora 19).