Safely restarting an OpenStack server with Ansible
The other day on #ansible, someone was looking for a way to safely
shut down a Nova server, wait for it to stop, and then start it up
again using the openstack
cli. The first part seemed easy:
- hosts: myserver
  tasks:
    - name: shut down the server
      command: poweroff
      become: true
…but that will actually fail with the following result:
TASK [shut down server] *************************************
fatal: [myserver]: UNREACHABLE! => {"changed": false, "msg":
"Failed to connect to the host via ssh: Shared connection to
10.0.0.103 closed.\r\n", "unreachable": true}
This happens because running poweroff
immediately closes Ansible’s
ssh connection. The workaround here is to use a “fire-and-forget”
asynchronous task:
- hosts: myserver
  tasks:
    - name: shut down the server
      command: poweroff
      become: true
      async: 30
      poll: 0
The poll: 0 means that Ansible won’t wait around to check the
result of this task: control returns to Ansible immediately, so our
playbook can continue without errors resulting from the closed
connection. (The async: 30 simply caps how long the background
command is allowed to run.)
Now that we’ve started the shutdown process on the host, how do we wait for it to complete? You’ll see people solve this with a pause task, but that’s fragile because the timing there can be tricky to get right.
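For illustration only, a pause-based version might look like this
sketch; the 300 seconds is a guess, which is exactly the problem
(too short and later tasks fail, too long and the playbook wastes
time):

- hosts: localhost
  tasks:
    - name: guess how long the shutdown takes
      pause:
        seconds: 300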
Another option is to use something like Ansible’s wait_for module.
For example, we could wait for sshd
to become unresponsive:
- name: wait for server to finish shutting down
  wait_for:
    port: 22
    state: stopped
But this is really just checking whether or not sshd is listening
for connections, and sshd may stop long before the system shutdown
process completes.
The best option is to ask Nova. We can query Nova for information
about a server using Ansible’s os_server_facts module. Like
the other OpenStack modules, this uses os-client-config to find
authentication credentials for your OpenStack environment. If you’re
not explicitly providing authentication information in your playbook,
the module will use the OS_*
environment variables that are commonly
used with the OpenStack command line tools.
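If you would rather provide credentials inline, the OpenStack
modules accept an auth parameter. A minimal sketch, assuming a
Keystone v3 endpoint; every value here is a placeholder:

- name: get server facts with explicit credentials
  os_server_facts:
    auth:
      auth_url: https://keystone.example.com:5000/v3  # placeholder
      username: demo                                  # placeholder
      password: secret                                # placeholder
      project_name: demo                              # placeholder
    server: myserver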
The os_server_facts module creates an openstack_servers fact, the
value of which is a list of dictionaries, each containing keys like
status, which is the one we’re interested in here. A running server
has status == "ACTIVE" and a server that has been powered off has
status == "SHUTOFF".
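For reference, each entry in the list looks something like this
abridged, made-up example:

openstack_servers:
  - name: myserver
    status: SHUTOFF     # "ACTIVE" while the server is running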
Given the above, the “wait for shutdown” task would look something like the following:
- hosts: localhost
  tasks:
    - name: wait for server to stop
      os_server_facts:
        server: myserver
      register: results
      until: openstack_servers.0.status == "SHUTOFF"
      retries: 120
      delay: 5
You’ll note that I’m targeting localhost
right now, because my local
system has access to my OpenStack environment and I have the necessary
credentials in my environment. If you need to run these tasks
elsewhere, you’ll want to read up on how to provide credentials in
your playbook.
This task will retry up to 120 times, waiting 5 seconds between
tries (so around ten minutes in total), until the server reaches the
desired state.
Next, we want to start the server back up. We can do this with the
os_server_action module and its start action. This task also runs
on localhost:
- name: start the server
  os_server_action:
    server: myserver
    action: start
Finally, we want to wait for the host to come back up before we run
any additional tasks on it. In this case, we can’t just look at the
status reported by Nova: the host will be ACTIVE
from Nova’s
perspective long before it is ready to accept ssh
connections. Nor
can we use the wait_for
module, since the ssh
port may be open
before we are actually able to log in. The solution to this is the
wait_for_connection module, which waits until Ansible is able to
successfully execute an action on the target host:
- hosts: myserver
  gather_facts: false
  tasks:
    - name: wait for server to start
      wait_for_connection:
We need to set gather_facts: false
here because fact gathering
requires a functioning connection to the target host.
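If your server takes a while to come back, note that
wait_for_connection accepts timing options as well; the values below
are illustrative:

- name: wait for server to start
  wait_for_connection:
    delay: 10      # seconds to wait before the first check
    timeout: 600   # give up after ten minutes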
And that’s it: a recipe for gently shutting down a remote host, waiting for the shutdown to complete, then later on spinning it back up and waiting until subsequent Ansible tasks will work correctly.
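Putting it all together, the whole play might look something like
this sketch, which simply stitches together the tasks shown above:

- hosts: myserver
  tasks:
    - name: shut down the server
      command: poweroff
      become: true
      async: 30
      poll: 0

- hosts: localhost
  tasks:
    - name: wait for server to stop
      os_server_facts:
        server: myserver
      register: results
      until: openstack_servers.0.status == "SHUTOFF"
      retries: 120
      delay: 5

    - name: start the server
      os_server_action:
        server: myserver
        action: start

- hosts: myserver
  gather_facts: false
  tasks:
    - name: wait for server to start
      wait_for_connection: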