Infinitely Scalable Storage for Kubernetes

A destructive experiment to make sure our data can be recovered

Flavien Berwick
Towards Data Science
6 min read · May 7, 2023

Sometimes, you just need storage that works. You don't always have the luxury of a Cloud provider's storage class, and you then have to manage it all by yourself. This is the challenge I had to solve for my on-premise client in healthcare.

In this article, you will learn why and how to install Rook Ceph to provide your Kubernetes cluster with an easy-to-use replicated storage class.

We will then deploy a file-sharing app, destroy the node on which it is deployed, and see what happens. Will Ceph make our files accessible again?

Containers to the horizon. Photo by Kelly on Pexels.

Choosing a storage solution

Storage has always been a challenge in Kubernetes as it doesn’t natively provide redundant and distributed storage solutions. With native Kubernetes, you can only attach a hostPath volume for persistent storage.

My client has its own on-premise infrastructure and wanted to make sure none of its data would get lost if one of its servers went down. Most of its apps are monoliths and don’t natively include data replication mechanisms.

So I had to choose from a variety of storage solutions. My client didn't need ultra-high performance but wanted a stable solution, which is why I settled on Rook Ceph.

Prepare your cluster

We need a Kubernetes cluster with a minimum of 3 nodes, each with 1 empty attached disk.

I recommend using Scaleway Kapsule to easily instantiate a Kubernetes cluster and attach unformatted disks. Once the Kubernetes cluster has started, create an attached volume (disk) for each node:

  • Go to “Instances”
  • Select your node
  • Click the “Attached volumes” tab
  • Click “+” (Create volume) and create a new disk

Download your kubeconfig file and place it in ~/.kube/config. You should now have access to your cluster with the kubectl CLI.
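
Before going further, a quick sanity check: the three nodes should all report Ready.

kubectl get nodes -o wide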

Install Rook Ceph

1. This blog post has a companion repo on GitHub; let's clone it to get all the resources we need

git clone https://github.com/flavienbwk/ceph-kubernetes
cd ceph-kubernetes

2. Clone the Rook repo and deploy the Rook Ceph operator

git clone --single-branch --branch release-1.11 https://github.com/rook/rook.git
kubectl create -f ./rook/deploy/examples/crds.yaml
kubectl create -f ./rook/deploy/examples/common.yaml
kubectl create -f ./rook/deploy/examples/operator.yaml

3. Create the Ceph cluster

kubectl create -f ./rook/deploy/examples/cluster.yaml -n rook-ceph

Wait several minutes for Ceph to configure the disks. Health should be HEALTH_OK:

kubectl get cephcluster -n rook-ceph
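
If you want more detail than the CephCluster status, you can also deploy the toolbox pod shipped with the examples we cloned and run Ceph CLI commands from it; a minimal sketch, assuming the default rook-ceph-tools deployment name from the Rook examples:

# Deploy the Rook toolbox shipped with the examples
kubectl create -f ./rook/deploy/examples/toolbox.yaml

# Inspect cluster health, monitors and OSDs from inside the toolbox
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status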

4. Create the storage classes

Rook Ceph can provide you with two main storage classes. The first is RBD (RADOS Block Device), which gives you a replicated block storage class to provision volumes in your Kubernetes cluster; it only supports ReadWriteOnce (RWO) volumes. The second one we'll install is CephFS, which acts like a replicated NFS server and allows us to create volumes in ReadWriteMany (RWX) mode.

kubectl create -f ./rook/deploy/examples/csi/rbd/storageclass.yaml -n rook-ceph
kubectl create -f ./rook/deploy/examples/filesystem.yaml -n rook-ceph
kubectl create -f ./rook/deploy/examples/csi/cephfs/storageclass.yaml -n rook-ceph
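
To make the difference concrete, here is a minimal sketch of one claim per class; the claim names and sizes are illustrative, and the class names match the defaults created by the manifests above (rook-ceph-block and rook-cephfs).

# Two illustrative claims (names are hypothetical): an RWO block volume on RBD
# and an RWX shared volume on CephFS
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-rbd-pvc
spec:
  storageClassName: rook-ceph-block
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-cephfs-pvc
spec:
  storageClassName: rook-cephfs
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 1Gi
EOF

You can delete these test claims afterwards with kubectl delete pvc example-rbd-pvc example-cephfs-pvc.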

5. Deploy the Ceph dashboard

kubectl create -f ./rook/deploy/examples/dashboard-external-https.yaml -n rook-ceph

Forward the dashboard's HTTPS access:

kubectl port-forward service/rook-ceph-mgr-dashboard -n rook-ceph 8443:8443

Connect with the username admin and the following password:

kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode

Browsing https://localhost:8443, you should get the following page:

Image by author: Ceph dashboard

Deploying an app

We will deploy a self-hosted file-sharing app (psitransfer) to check if our volumes bind correctly.

1. Deploy the file-sharing app (NodePort 30080)

kubectl create -f ./psitransfer-deployment-rwx.yaml
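
Before moving on, you can verify that the RWX claim defined in this manifest bound correctly to the rook-cephfs storage class (the exact claim name comes from the repo's manifest, so read it from the first command's output):

# List the claims with their status, storage class and access modes
kubectl get pvc

# Inspect the binding details of the claim used by psitransfer (replace the name)
kubectl describe pvc <pvc-name>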

2. See on which node it is deployed

kubectl get pods -o wide -l app=psitransfer

Retrieve the IP of this node (through the Scaleway interface) and check that the app is running at http://nodeip:30080.
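
If your nodes expose a public address, you can also read it with kubectl instead of the console; the jsonpath below assumes an ExternalIP entry is populated on the node object:

# Quick overview of node addresses
kubectl get nodes -o wide

# Extract a specific node's external IP (replace the node name)
kubectl get node <node-name> -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}'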

3. Let's upload some files

Download the 5MB, 10MB and 20MB files from the xcal1.vodafone.co.uk website.

Upload them to our file-transfer app and click the link that appears on the screen.

You should now see the three files imported. Keep the link in a browser tab; we'll use it later.

After uploading around 400MB of files, we can see that data replication is consistent across the disks: the 3 disks are written to simultaneously while we upload. In the following screenshot, usage is 1% on each disk. Although I uploaded everything to the same host, replication works as expected, with data persisted equally across the 3 disks (OSDs). Disk 2 shows a lot of "read" activity as the 2 other disks synchronize data from it.
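
If you deployed the toolbox earlier, you can confirm the same distribution from the command line; the three OSDs should report roughly equal utilization:

kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd df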

Ceph's dashboard should look like this now:

Destroy and see

We're going to stop the node hosting the web app to make sure data was replicated on the other nodes.

1. See on which node the app is deployed

kubectl get pods -o wide -l app=psitransfer

2. Power off the node from the Scaleway console

This simulates a power failure on a node. It should become NotReady after several minutes:

$> kubectl get node
NAME                                             STATUS     ROLES    AGE    VERSION
scw-ceph-test-clustr-default-5f02f221c3814b47a   Ready      <none>   3d1h   v1.26.2
scw-ceph-test-clustr-default-8929ba466e404a00a   Ready      <none>   3d1h   v1.26.2
scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8   NotReady   <none>   3d1h   v1.26.2

Node 3 is now unavailable on our Ceph dashboard, which should look like this:

3. Reschedule our pod

The node on which the pod was scheduled is unavailable. However, the pod is still reported as Running:

$> kubectl get pods -o wide -l app=psitransfer
NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE
psitransfer-deployment-8448887c9d-mt6wm   1/1     Running   0          19h   100.64.1.19   scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8

Delete it to reschedule it on another node:

kubectl delete pod psitransfer-deployment-8448887c9d-mt6wm

Check the status of the newly started pod. Your app should be available again at the link you kept earlier.

To avoid having to manually delete the pod so it gets rescheduled when a node becomes NotReady, scale your app to at least 3 replicas by default.
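
Since the volume is mounted in RWX mode, several replicas can share it; a minimal sketch, assuming the Deployment is named psitransfer-deployment as suggested by the pod name above:

kubectl scale deployment/psitransfer-deployment --replicas=3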

You can now restart the previously powered-off node.

When to use rook-ceph-block or rook-cephfs?

If your applications need better performance and require block storage with RWO access mode, use the rook-ceph-block (RBD) storage class. On the other hand, if your applications need a shared file system with RWX access mode and POSIX compliance, use the rook-cephfs (CephFS) storage class.

If you choose RBD and try to reschedule a pod while its original node is offline, as we just did with CephFS, you will get an error from the PVC stating: "Volume is already exclusively attached to one node and can't be attached to another". In that case, you just need to wait for the volume to re-attach (it took ~6 minutes for my cluster to automatically re-attach the volume to my pod, allowing it to start).

You can try this behavior by following the associated chapter of the companion repo.
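
While waiting for the volume to re-attach, you can watch the process from standard Kubernetes resources (replace the claim name with the one defined in the repo's RWO manifest):

# Attachments between persistent volumes and nodes
kubectl get volumeattachments

# Events on the claim and the pod explain the error and the eventual re-attachment
kubectl describe pvc <pvc-name>
kubectl get events --sort-by=.metadata.creationTimestamp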

Final word

You have learned how to install Rook Ceph and deploy an app on top of it. You have even proven that it replicates data. Congrats ✨

All images, unless otherwise noted, are by the author.
