Diagnosing problems with an OpenStack deployment
I recently had the chance to help a colleague debug some problems in his OpenStack installation. The environment was unique because it was booting virtualized aarch64 instances, which at the time did not have any PCI bus support…which in turn precluded things like graphic consoles (i.e., VNC or SPICE consoles) for the Nova instances.
This post began life as an email summarizing the various configuration changes we made on the systems to get things up and running. After writing it, I decided it presented an interesting summary of some common (and maybe not-so-common) issues, so I am posting it here in the hopes that other folks will find it interesting.
Serial console configuration⌗
The problem⌗
We needed console access to the Nova instances in order to diagnose some networking issues, but there was no VGA console support in the virtual machines. Recent versions of Nova provide serial console support, but do not provide any client-side tool for accessing the serial console.
We wanted to:
- Correctly configure Nova to provide serial console support, and
- Get the novaconsole tool installed in order to access the serial consoles.
Making novaconsole work⌗
In order to get novaconsole installed we needed the websocket-client library, which is listed in requirements.txt at the top level of the novaconsole source. Normally one would just pip install . from the source directory, but python-pip was not available on our platform. That wasn't a big issue because we did have python-setuptools available, so I was able to simply run (inside the novaconsole source directory):
python setup.py install
And now we had a /usr/bin/novaconsole script, and we were able to use it like this to connect to the console of a nova instance named "test0":
novaconsole test0
(For this to work you need appropriate Keystone credentials loaded in your environment. You can also provide a websocket URL in lieu of an instance name.)
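For example, one way to use a URL is to ask Nova for the console endpoint and pass it straight to novaconsole. This is only a sketch: the URL below is a placeholder, and keystonerc_admin is whatever credentials file packstack (or you) created:
# . keystonerc_admin
# nova get-serial-console test0
# novaconsole 'ws://127.0.0.1:6083/?token=...'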
Configuration changes on the controller⌗
The controller did not have the openstack-nova-serialproxy package installed, which provides the nova-serialproxy service. This service provides the websocket endpoint used by clients, so without it you won't be able to connect to serial consoles. Installing the service was a simple matter of:
yum -y install openstack-nova-serialproxy
systemctl enable openstack-nova-serialproxy
systemctl start openstack-nova-serialproxy
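To verify that the proxy is actually running and listening (the serial proxy listens on port 6083 by default, if I remember the packaging defaults correctly; check serialproxy_port in nova.conf if the command below comes up empty):
# ss -tlnp | grep 6083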
Configuration changes on the compute nodes⌗
We also need to enable serial console support on our compute nodes, which means changing the following configuration options in the serial_console section of nova.conf:
# Set this to 'true' to enable serial console support.
enabled=true
# Enabling serial console support means that spawning an instance
# causes qemu to open up a listening TCP socket for the serial
# console. This socket binds to the `listen` address. It
# defaults to 127.0.0.1, which will prevent a remote host --
# such as your controller -- from connecting to the port. Setting
# this to 0.0.0.0 means "listen on all available addresses", which
# is *usually* what you want.
listen=0.0.0.0
# `proxyclient_address` is the address to which the
# nova-serialproxy service will connect to access serial consoles
# of instances located on this physical host. That means it needs
# to be an address of a local interface (and so this value will be
# unique to each compute host).
proxyclient_address=10.16.184.118
In a production deployment, we would also need to modify the base_url option in this section, which is used to generate the URLs provided via the nova get-serial-console command. With the default configuration, the URLs will point to 127.0.0.1, which is fine as long as we are running novaconsole on the same host as nova-serialproxy.
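As a sketch, assuming the controller is reachable from clients at the hypothetical name controller.example.com and the serial proxy is listening on its default port (6083, I believe), base_url on the compute nodes would look something like this:
# The host portion must be reachable from wherever novaconsole runs;
# controller.example.com is a placeholder.
base_url=ws://controller.example.com:6083/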
After making these changes, we need to restart nova-compute on all the compute hosts:
# openstack-service restart nova
And we will need to re-deploy any running instances, because they will still have sockets listening on 127.0.0.1.
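One way to check whether a given instance has picked up the new listen address is to look at the serial device in its libvirt definition on the compute node (the instance name below is hypothetical; virsh list will show the real ones):
# virsh dumpxml instance-00000002 | grep -A 2 '<serial'
An instance spawned after the configuration change should show a tcp serial device bound to 0.0.0.0 rather than 127.0.0.1.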
The network ports opened for the serial console service are controlled by the port_range setting in the serial_console section. We must permit connections to these ports from our controller. I added the following rule with iptables:
# iptables -I INPUT 1 -p tcp --dport 10000:20000 -j ACCEPT
In practice, we would probably want to limit this specifically to our controller(s).
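For example, assuming the controller's address were 192.0.2.10 (a placeholder; substitute the address nova-serialproxy actually connects from), a more restrictive version of the rule above would be:
# iptables -I INPUT 1 -p tcp -s 192.0.2.10/32 --dport 10000:20000 -j ACCEPT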
Networking on the controller⌗
The problem⌗
Nova instances were not successfully obtaining ip addresses from the Nova-managed DHCP service.
Selinux and the case of the missing interfaces⌗
When I first looked at the system, it was obvious that something fundamental was broken because the Neutron routers were missing interfaces.
Each neutron router is realized as a network namespace on the network host. We can see these namespaces with the ip netns command:
# ip netns
qrouter-42389195-c8c1-4d68-a16c-3937453f149d
qdhcp-d2719d67-fd00-4620-be00-ea8525dc6524
We can use the ip netns exec command to run commands inside the router namespace. For instance, we can run the following to see a list of network interfaces inside the namespace:
# ip netns exec qrouter-42389195-c8c1-4d68-a16c-3937453f149d \
ip addr show
For a router we would expect to see something like this:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
18: qr-b3cd13d6-94: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether fa:16:3e:61:89:49 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.1/24 brd 10.0.0.255 scope global qr-b3cd13d6-94
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe61:8949/64 scope link
valid_lft forever preferred_lft forever
19: qg-89591203-47: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
link/ether fa:16:3e:30:b5:05 brd ff:ff:ff:ff:ff:ff
inet 172.24.4.231/28 brd 172.24.4.239 scope global qg-89591203-47
valid_lft forever preferred_lft forever
inet 172.24.4.232/32 brd 172.24.4.232 scope global qg-89591203-47
valid_lft forever preferred_lft forever
inet6 fe80::f816:3eff:fe30:b505/64 scope link
valid_lft forever preferred_lft forever
But all I found was the loopback interface:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
After a few attempts to restore the router to health through API commands (such as by clearing and re-setting the network gateway), I looked in the logs for the neutron-l3-agent service, which is the service responsible for configuring the routers. There I found:
2015-02-20 17:16:52.324 22758 TRACE neutron.agent.l3_agent Stderr:
'Error: argument "qrouter-42389195-c8c1-4d68-a16c-3937453f149d" is
wrong: Invalid "netns" value\n\n'
This is weird, because a network namespace with a matching name was clearly available. When inexplicable errors like this happen, we often look first to selinux, and indeed, running audit2allow -a showed us that neutron was apparently missing a privilege:
#============= neutron_t ==============
allow neutron_t unlabeled_t:file { read open };
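Putting selinux into permissive mode is a single command (this does not survive a reboot):
# setenforce 0
Alternatively, audit2allow can package those same AVC denials into a local policy module, which avoids relaxing enforcement across the board; the module name here is arbitrary:
# audit2allow -a -M neutron_local
# semodule -i neutron_local.pp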
After putting selinux in permissive mode[1] and restarting neutron services, things looked a lot better. To completely restart neutron, I usually do:
openstack-service stop neutron
neutron-ovs-cleanup
neutron-netns-cleanup
openstack-service start neutron
The openstack-service command is a wrapper over systemctl or chkconfig and service that operates on whatever openstack services you have enabled on your host. Providing additional arguments limits the action to services matching those names, so in addition to openstack-service stop neutron you can do something like openstack-service stop nova glance to stop all Nova and Glance services, etc.
Iptables and the case of the missing packets⌗
After diagnosing the selinux issue noted above, the virtual networking layer looked fine, but we still weren't able to get traffic between the test instance and the router/dhcp server on the controller.
Traffic was clearly traversing the VXLAN tunnels, as revealed by running tcpdump on both ends of the tunnel (where 4789 is the vxlan port):
tcpdump -i eth0 -n port 4789
But that traffic was never reaching, e.g., the dhcp namespace. Investigating the Open vSwitch (OVS) configuration on our host showed that everything looked correct; the commands I used to look at things were:
- ovs-vsctl show to look at the basic layout of switches and interfaces,
- ovs-ofctl dump-flows <bridge> to look at the OpenFlow rules associated with a particular OVS switch, and
- ovs-dpctl-top, which provides a top-like view of flow activity on the OVS bridges (see the example below).
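For example, in a stock packstack deployment the VXLAN tunnels terminate on the br-tun bridge, so dumping its flows (and keeping an eye on the per-rule packet counters) is a quick way to see whether tunnel traffic is being processed at all:
# ovs-ofctl dump-flows br-tun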
Ultimately, it turned out that there was an iptables rule missing from our configuration. Looking on the host for rules matching vxlan traffic, I found only a single rule:
# iptables -S | grep 4789
-A INPUT -s 10.16.184.117/32 -p udp -m multiport --dports 4789 ...
The compute node we were operating with was 10.16.184.118 (which is not the address listed in the above rule), so vxlan traffic from this host was being rejected by the kernel. I added a new rule to match vxlan traffic from the compute host:
# iptables -I INPUT 18 -s 10.16.184.118/32 -p udp -m multiport --dports 4789 ...
This seemed to take care of things, but it's a bit of a mystery why this wasn't configured for us in the first place by the installer. This may have been a bug in packstack; we would need to do a clean re-deploy to verify this behavior.
Access to floating ip addresses⌗
In order to access our instances using their floating ip addresses from our host, we need a route to the floating ip network. The easiest way to do this in a test environment, if you are happy with host-only networking, is to assign interface br-ex the address of the default gateway for your floating ip network. The default floating ip network configured by packstack is 172.24.4.224/28, and the gateway for that network is 172.24.4.225. We can assign this address to br-ex like this:
# ip addr add 172.24.4.225/28 dev br-ex
With this in place, connections to floating ips will route via br-ex, which in turn is bridged to the external interface of your neutron router.
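A quick sanity check is to ask the kernel which interface it would use to reach an address in the floating range (the address here is just an example from that network); the output should show the route going out dev br-ex:
# ip route get 172.24.4.233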
Setting the address by hand like this means it will be lost the next time we reboot. We can make this configuration persistent by modifying (or creating) /etc/sysconfig/network-scripts/ifcfg-br-ex so that it looks like this:
DEVICE=br-ex
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=172.24.4.225
NETMASK=255.255.255.240
ONBOOT=yes
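Assuming the traditional network-scripts tooling is in use (and the openvswitch package's ifup/ifdown hooks are installed), the new configuration can be applied without a reboot:
# ifup br-ex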
If you're not able to map CIDR prefixes to dotted-quad netmasks in your head, the ipcalc tool is useful:
$ ipcalc -m 172.24.4.224/28
NETMASK=255.255.255.240
The state of things⌗
With all the above changes in place, we had a functioning OpenStack environment.
We could spawn an instance as the “demo” user:
# . keystonerc_demo
# nova boot --image "rhelsa" --flavor m1.small example
Create a floating ip address:
# nova floating-ip-create public
+--------------+-----------+----------+--------+
| Ip | Server Id | Fixed Ip | Pool |
+--------------+-----------+----------+--------+
| 172.24.4.233 | - | - | public |
+--------------+-----------+----------+--------+
Assign that address to our instance:
# nova floating-ip-associate example 172.24.4.233
And finally, we were able to access services on that instance, provided that our security groups (and the local iptables configuration on the instance itself) permitted access to that service.
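As a sketch of what that last step looks like, here is how we might open ssh in the default security group and then connect to the floating address (the login user depends on the image; cloud-user is only a guess for a RHEL-based guest):
# nova secgroup-add-rule default tcp 22 22 0.0.0.0/0
# ssh cloud-user@172.24.4.233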
[1] …as a temporary measure, pending opening a bug report to get things corrected so that this step would no longer be necessary.