Proxmox Cluster with Witness component to support HA
In this post we will expand on one of my previous posts regarding Proxmox clusters. The cluster post showed how to build a cluster consisting of two nodes with storage replication between them utilizing the built-in ZFS replication feature. However, the failover of a workload had to be done manually, since a two-node cluster loses quorum as soon as one node fails.
To mitigate this we will take a look at adding a third component to the cluster which can help us in case of node failures.
The clustering functionality used in PVE is based on Corosync, which supports an external vote arbitrator, a QDevice. The cluster nodes run a corosync-qdevice daemon which talks to a corosync-qnetd daemon running on a standalone device outside the cluster. In my example that standalone device will be a small Raspberry Pi running Debian 12.
Below is a high-level view of what the environment would look like
Our two nodes will still utilize a replication link directly between themselves, but we're adding a cluster network link on a separate NIC, to which our third component is also connected.
Existing setup
In our existing setup we have two nodes joined in a cluster. The existing cluster communication is happening over the management network, which is not a best practice. We'll change this so that the cluster communication happens over a dedicated network on a separate VLAN.
Host networking
On our server nodes we've prepared a NIC with the new cluster networking details, the 192.168.173.0/24 network in this example
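As a reference, a minimal sketch of what the cluster NIC could look like in /etc/network/interfaces on the first node, assuming an interface named ens19 and the address 192.168.173.11 (both are illustrative, adjust to your hardware and addressing):

auto ens19
iface ens19 inet static
        address 192.168.173.11/24
#Cluster network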
Change cluster network
Before adding our third component, the witness component, let's change the cluster network and verify that it works.
The Proxmox documentation describes how to separate the cluster network after cluster creation, and this is the process we'll use to change the cluster network
We'll start by taking a backup copy of the existing corosync.conf file, and we'll also copy it to a .new file to do our editing in, since changes to the live file might take effect immediately.
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.bak
cp /etc/pve/corosync.conf /etc/pve/corosync.conf.new
The current corosync configuration specifies the old network
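For reference, the nodelist section of corosync.conf would look something like the following, assuming the node names pve-1 and pve-2 and that the old management network is 192.168.1.0/24 (the addresses here are illustrative only):

nodelist {
  node {
    name: pve-1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 192.168.1.11
  }
  node {
    name: pve-2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 192.168.1.12
  }
}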
Next we'll edit the corosync.conf.new file to reflect the new network. Note that we'll also have to increment the config_version to reflect that there has been a change in the config
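Staying with the illustrative addresses, the lines that change are the two ring0_addr entries in the nodelist section and the config_version in the totem section (here assuming it was 2 before the edit):

ring0_addr: 192.168.173.11
ring0_addr: 192.168.173.12
config_version: 3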
Now, we'll overwrite the existing config with the new one and check the status of the cluster
mv /etc/pve/corosync.conf.new /etc/pve/corosync.conf
pvecm status
As we can see, the old config is still in use, so we'll restart the corosync service
systemctl restart corosync
pvecm status
As we can see, after restarting the corosync service the IP address changes, but the second node is gone. Although the new config has synchronized over to node 2, we'll also need to restart the corosync service on that node to get things up and running
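Assuming node 2 is reachable on its new cluster address, for example 192.168.173.12, this can be done from node 1 over SSH (or from a local shell on node 2):

ssh root@192.168.173.12 'systemctl restart corosync'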
After restarting the corosync service on node 2, let's run the pvecm status command again
Add witness component
To reiterate, at this point we have two nodes, each with its own ZFS volume on internal storage, which is replicated between them. This helps when migrating VMs between the two hosts since the storage is already present on the other node; the only things that need to be migrated are the changes since the last replication and the VM memory.
In case of a failure on one of the nodes we can relatively easily get the VMs up and running on the other node, albeit with some data loss. This failover is, however, a manual process.
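As a quick sanity check, the state of the replication jobs can be inspected on either node with the built-in storage replication CLI:

pvesr status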
Now, let's add a third corosync component and see if we can automate the failover by implementing Corosync external vote support
Our third component will be installed on a Debian machine running on a Raspberry Pi computer
A Raspberry Pi is obviously not something one would typically use in a real-world production scenario, but it shows that the witness could run on pretty much "anything"
The node is running on the same network as the cluster network, but this is not a strict requirement. The witness node can be on a separate VLAN, and it does not come with the same low-latency requirements as the server nodes.
Install packages
To set up qdevice we need to install some packages.
On the witness node we'll install the corosync-qnetd package
apt install corosync-qnetd
Next we'll install the corosync-qdevice package on both the witness node AND the two server nodes
apt install corosync-qdevice
After installing the qdevice package on all three nodes we'll add the witness node to the cluster
pvecm qdevice setup <IP-of-witness>
As we can see, the setup handles certificate distribution to all nodes and configures and starts the qdevice daemon(s). Note that SSH access as the root user is required on the witness node, so you'll need to enable this, at least temporarily
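On a default Debian install root logins over SSH are restricted, so as a sketch you could temporarily allow them on the witness node like this, and revert the change once the setup has completed:

sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
systemctl restart ssh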
With the qdevice set up we can check the cluster status
pvecm status
Our cluster now has 3 expected votes and is quorate. The quorum is set to 2, which means we should be able to test a node failure and see if HA will fail the workload over to the surviving node
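On the witness node itself we can also verify that the qnetd daemon sees both cluster nodes, using the corosync-qnetd-tool utility that ships with the corosync-qnetd package:

corosync-qnetd-tool -l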
Configure HA
Before testing the failover we'll need to configure High Availability (HA)
We'll start by creating an HA group with our two nodes
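As a sketch of the CLI equivalent, assuming the group name ha-group and the node names pve-1 and pve-2:

ha-manager groupadd ha-group --nodes "pve-1,pve-2"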
Next we'll add our VMs to the HA config and specify the HA group we just created
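Again as a CLI sketch, assuming the two VMs have the IDs 100 and 101 (example IDs only):

ha-manager add vm:100 --group ha-group
ha-manager add vm:101 --group ha-group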
With both VMs added we can verify the HA config
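The HA configuration and current state can be inspected from any node with:

ha-manager config
ha-manager status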
We can see that the quorum is OK, that the pve-2 host is the master, and that our two VMs are added as resources to the HA config and connected to the HA group we created.
Test failover
With this in place, let's introduce a failure on the pve-2 host and see if our VMs get failed over to the other node. Note that pve-2 is also the current HA master at this point
The pve-2 node will abruptly get its power cut, and our VMs will be offline.
After around 20 seconds the server node is marked down
Around 60 seconds after the failure the HA master has changed to the working node
After approximately two minutes the VMs are starting up on the working node
And finally we can verify that our VMs have started and are online
Note that this failover will result in some downtime and possibly some data loss on the VMs, both because the failover isn't instant, and because, in my environment, the VMs are not running on shared storage and are dependent on the replication interval.
The failover process takes some time because of a process called fencing. Fencing aims to ensure that the failing node has actually failed and won't come back up and cause race conditions or other unwanted behavior, such as the same VM ending up running on both nodes at the same time.
Summary
This post has expanded on a previous post by adding a third device as a witness component to support automatic failover in a Proxmox Cluster
For more information about the High Availability (HA) manager, please refer to the Proxmox documentation
If you have any questions or comments, please feel free to reach out