Deploying a Proxmox Cluster backed by Ceph

In a few previous posts we've seen how to deploy Proxmox and how to configure a Proxmox Cluster. In the previous cluster post we worked with hosts backed by ZFS storage, i.e. no shared storage. In this post we'll take a look at how to use Ceph as the storage for the cluster.

While the previous cluster posts also supported failover scenarios, those came with a risk of data loss depending on how data was replicated between the hosts. With Ceph as the storage backend the workloads run on storage shared between the hosts, and in case of a failure the workload should still be up and running with no data lost.

Note that there are many different failure scenarios, many of which might result in data loss even in my example architecture here in my lab environment. Please make sure that you design for the failure scenarios relevant to your environment and architecture

Let's get started

Cluster architecture

The architecture used in this design looks something like the following

Cluster environment

We have three hosts with internal storage (SSD disk) and each host has a few networks connected. In the design we have separated the management, cluster and storage networking onto different interfaces and networks. In this setup the management is shared with the VM traffic, but in a real-world scenario these would also be separated, and we would have multiple network interface cards for the different networks/use-cases.

The best practice is to separate the storage replication from the cluster network traffic as both are latency sensitive and critical for the performance of the cluster

Please refer to this section from the Proxmox Ceph deployment documentation: If unsure, we recommend using three (physical) separate networks for high-performance setups:

  • one very high bandwidth (25+ Gbps) network for Ceph (internal) cluster traffic.
  • one high bandwidth (10+ Gbps) network for Ceph (public) traffic between the ceph server and ceph client storage traffic.
  • one medium bandwidth (1 Gbps) exclusive for the latency sensitive corosync cluster communication.

The number of nodes and disks in those nodes used in this environment does not necessarily meet best practices. Please make sure that you design your architecture based on the use-cases your cluster is supposed to solve.

Host networking

Before creating our cluster we'll set up the networking on the hosts. We'll configure the NICs for cluster traffic and storage traffic.

Host networking
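
For reference, after configuring the NICs the relevant part of /etc/network/interfaces on one of the nodes could look something like the sketch below. The interface names and subnets are just placeholders for my lab and not taken from the screenshots.

  # Cluster (corosync) network
  auto ens19
  iface ens19 inet static
      address 10.10.20.11/24

  # Storage (Ceph) network
  auto ens20
  iface ens20 inet static
      address 10.10.30.11/24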

Create cluster

Now, let's go ahead and create the cluster.

In the cluster creation wizard we'll give the cluster a name, and we'll specify the network to use for cluster traffic. In this example we'll specify both the cluster traffic NIC and the storage NIC as this could give us some failover capabilities should the cluster NIC experience any issues.

Create cluster and specify cluster network
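
For those preferring the command line, the cluster can also be created with pvecm. A rough equivalent of the wizard above, using a placeholder cluster name and addresses from my lab (10.10.20.x for cluster traffic, 10.10.30.x for storage):

  # Create the cluster with two corosync links, link1 acting as a fallback
  pvecm create pve-cluster --link0 10.10.20.11 --link1 10.10.30.11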

After successful creation of the cluster our Cluster page (on the Datacenter object) should look something like the following

Cluster created

Add nodes to cluster

Now, we can go ahead and add the remaining two servers as cluster nodes.

From the cluster view we'll hit the "Join information" button to get the token needed for joining additional nodes.

Join information token

Now, on each of the two nodes we'll go to its Datacenter object, select Cluster and then the "Join cluster" button, and paste in the token. We'll also add the root password of the first node and select the two links for the Cluster network

Join node to cluster
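
As an aside, joining can also be scripted with pvecm from the new node itself. A sketch with placeholder addresses; the command prompts for the root password of the existing node:

  # Run on the joining node, pointing at the first cluster node
  # --link0/--link1 are this node's own addresses on the cluster and storage networks
  pvecm add 10.10.20.11 --link0 10.10.20.12 --link1 10.10.30.12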

After hitting the join button the join cluster process will start. After a few seconds you might notice that there's a communication error displayed

Communication error message

I suspect this comes from the fact that the Join cluster task restarts the pve-cluster service, which in my experience prevents the task from finalizing in the UI.

Looking at the Cluster view on our first node we might also get an error concerning a pve-ssl.pem file, but we can see that the node has joined.

Node joined cluster

After refreshing the page(s) everything should be fine, and we can go ahead and add our third node. Finally our cluster should consist of our three nodes, with a green checkmark on each of them

Cluster nodes added
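
To double-check quorum and membership outside the UI we can query pvecm on any of the nodes:

  # Show quorum information and the list of cluster members
  pvecm status
  pvecm nodes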

Install Ceph

Moving on we'll now go ahead and install Ceph on the nodes.

To help us with this we can use the wizard and functionality added in the Proxmox VE UI

Install Ceph from UI

The first step in the installation is to choose the Ceph version and the repository to use. In our case we'll go with the Reef (18.2) version, and the "No-subscription" (community) repository

Select Ceph version and repository

After selecting the version, the installer will start downloading and installing the packages needed

Installing packages
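
If you'd rather script the installation, pveceph can do the same from the shell. A sketch; the exact values accepted by --version and --repository depend on your Proxmox VE release:

  # Install Ceph Reef from the no-subscription repository on this node
  pveceph install --version reef --repository no-subscription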

Next we'll select the network

Set Ceph network
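
The CLI equivalent is pveceph init, which writes the initial Ceph configuration. The subnet below is a placeholder for my storage network; a separate replication network can be given with --cluster-network if you have one:

  # Initialize Ceph with the storage network as the public network
  pveceph init --network 10.10.30.0/24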

Again, please read up on the networking requirements when deploying this in your own environment

And finally we'll hit Finish to finalize the Ceph installer

Finalize installation

Now, as the wizard suggests, it's time to install Ceph on the two remaining nodes

Add monitors

When all nodes have Ceph installed we'll also add the last two nodes as Ceph monitors

This is done to ensure that a failure on any of the nodes will be picked up by the others. If we had omitted this step, a failure on the first node would result in the Ceph cluster being down.

Added Ceph monitors
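
Monitors can also be created from the shell by running the following on each node that should carry the monitor role:

  # Add a Ceph monitor on this node
  pveceph mon create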

In my environment I got a warning on clock skew between the three nodes. Ceph is very sensitive to this

Cluster warnings

After correcting the time synchronization on my three hosts the status warns that there are no OSDs and that the expected count is 3. We'll fix this next.

Time synchronization fixed
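
If you hit the same clock skew warning, these are the commands I'd use to verify time synchronization on each node (recent Proxmox VE releases use chrony by default) and to get the details from Ceph:

  # Check that NTP synchronization is active and which time sources are in use
  timedatectl
  chronyc sources -v
  # Ask Ceph for the details behind a health warning such as clock skew
  ceph health detail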

Add managers

In addition to monitors we can also choose to add the Manager role to the nodes. Our cluster would work with the manager role on just one node, but I added the role to all of them just in case. Only one manager can be active, hence the other two will be in Standby

Added managers
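
As with monitors, the manager role can be added from the shell on each node:

  # Add a (standby) Ceph manager on this node
  pveceph mgr create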

Create OSDs

An Object Storage Daemon (OSD) is a daemon that interacts with logical disks and makes them available to the Ceph cluster.

A minimum of three OSDs is recommended in a cluster, hence we'll create one OSD backed by an SSD disk on each of our hosts.

Create OSD
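
On the CLI an OSD is created by pointing pveceph at an unused disk. The device name below is just an example; the disk must be empty (no partitions or existing data):

  # Identify the empty SSD, then create an OSD on it
  lsblk
  pveceph osd create /dev/sdb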

After creating OSDs on all hosts we should have one OSD on each

OSDs created

And our Cluster status should be Green

Cluster status ok
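
The OSD layout and overall health can also be verified from the shell on any node:

  # Show the OSDs per host and the overall cluster health
  ceph osd tree
  ceph -s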

Create Pool

Now to make the storage available to our cluster we'll create a Ceph Pool. A Pool is a logical group for storing objects.

Create Pool

The size specified is the number of replicas to use for each object. The default is 3. The min_size, i.e. the minimum number of replicas required for the pool to serve I/O, is by default 2 and can be overridden in the Advanced settings. We'll leave the defaults, which ensures that even if one of our nodes (or disks) encounters a failure the objects will still be available and our workloads should keep working.
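
For reference, the equivalent pveceph command could look something like this, with a placeholder pool name; --add_storages also registers the pool as a storage entry on the cluster:

  # Create a replicated pool with 3 replicas (min 2) and add it as PVE storage
  pveceph pool create ceph-vm --size 3 --min_size 2 --add_storages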

And after the pool is created we should have the Ceph-backed storage pool connected to each of our hosts, and we can start utilizing it in our cluster

Ceph pool created and connected to nodes

Deploy workloads

Now, to test our storage, let's deploy some workloads. We won't go into any details on how to deploy virtual machines in this post. The process is very straightforward and there are plenty of resources out there that go through the steps. I've also shown an example of this in one of my previous posts.

Cluster workloads

At this point we have a few machines running: three Ubuntu VMs and one Windows VM. We also have one VM that is powered off.

Test failover

Now, without any further configuration let's see what happens if one of the three nodes suffers from an outage

One node down

One node is down, the cluster status is still Ok and we have a warning on Ceph

Ceph warning

So as we can see, the Ceph cluster is still operational, but with degraded data redundancy.

What might seem strange is that the virtual machine that was running on the failed node is not migrated/failed over to any of the other nodes. This is because we haven't configured any HA (High availability) yet.

Let's bring that failed node back online and look at how we can configure HA

Configure HA

To configure HA for hosts we'll go to the Datacenter object and the HA tab. We'll create a group and select our three nodes

Create HA group

Next we'll have to specify which VMs should be part of HA, and which group each VM belongs to. Coming from a VMware world this might seem like a cumbersome step, but it also gives some flexibility.

Add VM to HA group
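
The same HA configuration can also be scripted with ha-manager. The group name, VM ID and node names below are placeholders (only pve-13 appears later in this post):

  # Create an HA group spanning the three nodes
  ha-manager groupadd ha-group1 --nodes "pve-11,pve-12,pve-13"
  # Protect a VM by adding it as an HA resource in that group
  ha-manager add vm:100 --group ha-group1 --state started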

After specifying the resources I want to protect, my HA config looks like this

HA resources config

As we can see, all VMs apart from the powered-off one, 103, are protected.

Test failover with HA

Now let's test another failover and see what happens to our VMs. This time we'll simulate a failure on node pve-13 which has three VMs.

Node down

Although our host has encountered a failure, our VMs don't immediately fail over to the working nodes. In this example we can see that it took just over two minutes after the last timestamp from the failing node before our VMs were started up on the working hosts

HA failed over VMs
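
The recovery can also be followed from the shell on one of the surviving nodes:

  # Show the state of the HA resources and which node currently runs them
  ha-manager status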

We can also see that the VM with ID 103 is not failed over which is expected since this VM was not part of the HA resources config.

More information on High availability can be found in the Proxmox documentation

Summary

In this post we've seen how to deploy a Proxmox cluster with Ceph as the shared storage solution. We've also seen how to configure High availability so that VMs can automatically be failed over to working nodes in case of a failure.

There's obviously a lot more tweaking and configuration that can (and should?) be done in a more production-like environment, especially around Ceph, but hopefully this post can show some of the capabilities built in to the Proxmox Virtual Environment platform. For more information about deploying a Hyper-Converged Ceph cluster, please check out the Proxmox documentation

Thanks for reading, and please reach out if you have any comments or questions
