Persistent volume in Kubernetes cluster with multiple availability zones

August 25, 2020 - Last updated: April 26, 2021

After a while using Kubernetes on AWS with persistent volumes backed by EBS, I ran into a problem with evicted pods after they were re-scheduled to another node. The issue is that EBS volumes are tied to a single availability zone; this makes sense because the volumes are attached over the network and are kept within one datacenter to keep network latency low.

The limitations are described in the official Kubernetes documentation.

This post builds on the infrastructure described in a previous post.

Solution

The solution works with any cloud provider; in my case the cloud provider is AWS. The idea is to add a nodeSelector to the pods that use a persistent volume (EBS) and pin them to a fixed availability zone, so that if the pods are re-scheduled to other nodes they will land in the same availability zone as the volume.

1. Storage class

One important setting in the storage class is the parameter volumeBindingMode: WaitForFirstConsumer. This parameter delays the binding and provisioning of a persistent volume until a pod that uses it is created, so the volume is provisioned in the availability zone where that pod is scheduled.

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs
provisioner: kubernetes.io/aws-ebs
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: "ext4"
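
For reference, here is a sketch of the equivalent storage class for clusters that run the EBS CSI driver instead of the in-tree provisioner. The class name and the gp3 volume type are assumptions about your setup, so adjust them as needed; the WaitForFirstConsumer behavior is the same.

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-csi  # illustrative name, separate from the in-tree class above
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: ext4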

2. Persistent Volume Claim

The PVC references the storage class and will be used in the deployment. Because of WaitForFirstConsumer, the claim stays in Pending state until the first pod that uses it is scheduled.

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-nginx
spec:
  storageClassName: ebs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

3. Deployment

The deployment has a nodeSelector which defines the zone where the pod will be scheduled. The nodeSelector is only needed for deployments that use a persistent volume.

---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        zone: eu-west-1a
      containers:
        - image: nginx:latest
          name: nginx
          volumeMounts:
            - name: vol-nginx
              mountPath: /mnt/
      volumes:
        - name: vol-nginx
          persistentVolumeClaim:
            claimName: pvc-nginx
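
The manifest above assumes the nodes carry a custom zone label. If you have not labeled the nodes yourself, most cloud providers automatically set the well-known topology label, which can be used in the nodeSelector instead:

      nodeSelector:
        topology.kubernetes.io/zone: eu-west-1a

On older clusters (before Kubernetes 1.17) the equivalent label is failure-domain.beta.kubernetes.io/zone.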

Solution for multiple replicas

If you want to run multiple replicas in different availability zones and still use persistent volumes, you can use podAntiAffinity to tell the Kubernetes scheduler to place each replica on a node in a different availability zone.

The following deployment runs Nginx with a persistent volume, with 3 replicas spread across 3 different availability zones. There are two limitations in this example: with requiredDuringSchedulingIgnoredDuringExecution the number of replicas cannot exceed the number of availability zones, and since an EBS volume is ReadWriteOnce and bound to a single zone, the replicas cannot all mount the same claim; each replica needs its own volume (see the StatefulSet sketch after the manifest).

---
kind: Deployment
apiVersion: apps/v1
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: zone
      containers:
        - image: nginx:latest
          name: nginx
          volumeMounts:
            - name: vol-nginx
              mountPath: /mnt/
      volumes:
        - name: vol-nginx
          persistentVolumeClaim:
            claimName: pvc-nginx
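
To give each replica its own zone-local volume, the usual approach is a StatefulSet with volumeClaimTemplates instead of a Deployment sharing one PVC. The following is a minimal sketch under that assumption; the headless Service referenced by serviceName is not shown, and the names are illustrative.

---
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: nginx
spec:
  replicas: 3
  serviceName: nginx  # requires a matching headless Service (not shown)
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - nginx
              topologyKey: zone
      containers:
        - image: nginx:latest
          name: nginx
          volumeMounts:
            - name: vol-nginx
              mountPath: /mnt/
  volumeClaimTemplates:
    # one PVC per replica, provisioned in the zone where that replica lands
    - metadata:
        name: vol-nginx
      spec:
        storageClassName: ebs
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 20Gi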
