Kubernetes – StatefulSets

April 1, 2019 | By Eric Shanks

We love Deployments and ReplicaSets because they make sure our containers are always in our desired state. If a container fails for some reason, a new one is created to replace it. But what do we do when the deployment order of our containers matters? For that, we look to Kubernetes StatefulSets.

StatefulSets – The Theory

StatefulSets work much like a Deployment does. They contain identical container specs, but they also guarantee an ordering for deployment. Instead of all the pods being deployed at the same time, a StatefulSet deploys them sequentially: the first pod is deployed and ready before the next pod starts. (NOTE: it is possible to deploy pods in parallel if you need to, but that can muddy your understanding of StatefulSets, so ignore it for now.) Each of these pods has its own identity and is named with a unique, stable ordinal so that it can be referenced.
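For the curious, that parallel behavior is controlled by the StatefulSet spec's `podManagementPolicy` field. The fragment below is just a sketch of where it lives (the StatefulSet name is illustrative):

```yaml
# Sketch: opting out of ordered deployment. The default value,
# OrderedReady, gives the sequential behavior described above.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example
spec:
  podManagementPolicy: Parallel   # launch/terminate pods all at once
  # ...rest of the spec as usual...
```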

OK, now that we know how the deployment of a StatefulSet works, what about the failure of a pod that is part of one? Well, the order is preserved there too. If we lose pod-2 due to a host failure, the StatefulSet won't just deploy any new pod; it will deploy a new "pod-2", because identity matters in a StatefulSet.
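You can watch this identity-preserving behavior yourself once a StatefulSet is running; the pod name below is just an example:

```shell
# Delete one member of a StatefulSet...
kubectl delete pod mysql-1

# ...then watch the controller recreate a pod with the very same
# name and ordinal, rather than a randomly suffixed replacement.
kubectl get pods -w
```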

StatefulSets currently require a "headless" service to manage the identities. This is a service with the same selectors you're used to, but it won't receive a clusterIP address, meaning you can't route traffic to the containers through this service object.

StatefulSets – In Action

The example this blog post uses for a StatefulSet comes right from the Kubernetes website's guide for managing a mysql cluster. In this example, we'll deploy a pair of pods with mysql containers in them. This example also uses a sidecar container called xtrabackup, which aids in mysql replication between the mysql instances. So why a StatefulSet for mysql? Well, the order matters, of course. The first pod that gets deployed will contain our master mysql database, where reads and writes are completed. The additional pods will contain the replicated mysql data but can only be used for read operations. The diagram below shows the setup of the application we'll be creating.

The diagram below shows our application (which is itself made up of a set of containers and services, but that doesn't matter here) connecting directly to one of the pods. There is a headless service used to maintain the network identity of the pods, and another service that provides read access to them. Each pod runs a pair of containers (mysql as the main container, and xtrabackup as a sidecar for replication). We're also creating persistent storage based on the storage class we created in the Cloud Providers and Storage Classes post.

Before we get to deploying anything, we'll create a new ConfigMap. This ConfigMap holds configuration the containers use to take on an identity at boot time. The mysql configuration data below ensures that the master container enables binary logging while the other containers run read-only.

apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql
  labels:
    app: mysql
data:
  master.cnf: |
    # Apply this config only on the master.
    [mysqld]
    log-bin
  slave.cnf: |
    # Apply this config only on slaves.
    [mysqld]
    super-read-only

Deploy the manifest above by running:

kubectl apply -f [manifest file].yml

Next, we’ll create our mysql services. Here we’ll create the headless service for use with our StatefulSet to manage the identities, as well as a service that handles traffic for mysql reads.

# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  clusterIP: None
  selector:
    app: mysql
---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the master: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql

Deploy the above manifest file by running:

kubectl apply -f [manifest file].yml

Once the services are deployed, we can check them with:

kubectl get svc

Notice that the CLUSTER-IP for the mysql service is “None”. This is our headless service for our StatefulSet.
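Thanks to the headless service, each StatefulSet member also gets a stable DNS name of the form `<pod-name>.<service-name>`. One quick, illustrative way to check this from inside the cluster is a throwaway busybox pod (pod name and image tag are just examples):

```shell
# Resolve a StatefulSet member's stable DNS record from inside the cluster.
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup mysql-0.mysql
```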

Lastly, we deploy our StatefulSet. This manifest includes several configs we haven't talked about, including init containers. Init containers run and complete before the normal containers in the pod start, and can take some action first. In this case, they read the pod's ordinal index from its hostname and write a mysql config based on that value, so each pod knows whether it is the master pod or one of the read-only secondaries.
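The identity trick inside the init container is just a bash regex against the pod's hostname, which ends in the ordinal index. The logic can be exercised on its own, outside the cluster (the hostnames below are examples):

```shell
# Extract the ordinal from a StatefulSet-style hostname,
# the same way the init-mysql container does.
get_ordinal() {
  [[ $1 =~ -([0-9]+)$ ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

get_ordinal "mysql-0"   # prints 0 -> this pod copies master.cnf
get_ordinal "mysql-1"   # prints 1 -> this pod copies slave.cnf
```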

You'll also see that scripts are written into this manifest to set up replication using the xtrabackup containers.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 2
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: init-mysql
        image: mysql:5.7
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Generate mysql server-id from pod ordinal index.
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          echo [mysqld] > /mnt/conf.d/server-id.cnf
          # Add an offset to avoid reserved server-id=0 value.
          echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
          # Copy appropriate conf.d files from config-map to emptyDir.
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/master.cnf /mnt/conf.d/
          else
            cp /mnt/config-map/slave.cnf /mnt/conf.d/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf.d
        - name: config-map
          mountPath: /mnt/config-map
      - name: clone-mysql
        image: gcr.io/google-samples/xtrabackup:1.0
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Skip the clone if data already exists.
          [[ -d /var/lib/mysql/mysql ]] && exit 0
          # Skip the clone on master (ordinal index 0).
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          [[ $ordinal -eq 0 ]] && exit 0
          # Clone data from previous peer.
          ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
          # Prepare the backup.
          xtrabackup --prepare --target-dir=/var/lib/mysql
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
      containers:
      - name: mysql
        image: mysql:5.7
        env:
        - name: MYSQL_ALLOW_EMPTY_PASSWORD
          value: "1"
        ports:
        - name: mysql
          containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            # Check we can execute queries over TCP (skip-networking is off).
            command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
      - name: xtrabackup
        image: gcr.io/google-samples/xtrabackup:1.0
        ports:
        - name: xtrabackup
          containerPort: 3307
        command:
        - bash
        - "-c"
        - |
          set -ex
          cd /var/lib/mysql

          # Determine binlog position of cloned data, if any.
          if [[ -f xtrabackup_slave_info ]]; then
            # XtraBackup already generated a partial "CHANGE MASTER TO" query
            # because we're cloning from an existing slave.
            mv xtrabackup_slave_info change_master_to.sql.in
            # Ignore xtrabackup_binlog_info in this case (it's useless).
            rm -f xtrabackup_binlog_info
          elif [[ -f xtrabackup_binlog_info ]]; then
            # We're cloning directly from master. Parse binlog position.
            [[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
            rm xtrabackup_binlog_info
            echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
                  MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
          fi

          # Check if we need to complete a clone by starting replication.
          if [[ -f change_master_to.sql.in ]]; then
            echo "Waiting for mysqld to be ready (accepting connections)"
            until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done

            echo "Initializing replication from clone position"
            # In case of container restart, attempt this at-most-once.
            mv change_master_to.sql.in change_master_to.sql.orig
            mysql -h 127.0.0.1 <<EOF
          $(<change_master_to.sql.orig),
            MASTER_HOST='mysql-0.mysql',
            MASTER_USER='root',
            MASTER_PASSWORD='',
            MASTER_CONNECT_RETRY=10;
          START SLAVE;
          EOF
          fi

          # Start a server to send backups when requested by peers.
          exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
            "xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

Deploy the manifest file above by running:

kubectl apply -f [manifest file].yml

Once the containers have been deployed, we can check on them with the kubectl get pods command.

kubectl get pods

Here, you can see that we have two mysql pods, each containing two containers. Both pods are running, and they are numbered with the [podname]-X naming structure, giving us a mysql-0 and a mysql-1. Our application needs to be able to write data, so it will be configured to write to mysql-0. However, any read-only reports could be built from the data on the other pod to reduce load on the writable instance.

We can now deploy our application, pointed directly at the mysql-0 pod for writes, and it comes up as usual.
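As a quick smoke test, we can run throwaway mysql client pods to send a write to the master through its stable DNS name and a read through the mysql-read service (client pod names and the `demo` database are illustrative):

```shell
# Write to the master via its stable per-pod DNS name.
kubectl run mysql-client --image=mysql:5.7 -i --rm --restart=Never -- \
  mysql -h mysql-0.mysql -e "CREATE DATABASE IF NOT EXISTS demo;"

# Read through the mysql-read service, which load-balances
# across all ready pods.
kubectl run mysql-client-read --image=mysql:5.7 -i --rm --restart=Never -- \
  mysql -h mysql-read -e "SHOW DATABASES;"
```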

Summary

Sometimes you need to treat your containers more like pets than cattle. Not all containers are interchangeable, as in this post's example where mysql-0 is our master mysql instance. When the order of deployment matters and the containers aren't identical, a StatefulSet might be your best bet.