Kubernetes – StatefulSets
April 1, 2019

We love Deployments and ReplicaSets because they make sure that our containers are always in our desired state. If a container fails for some reason, a new one is created to replace it. But what do we do when the deployment order of our containers matters? For that, we look to Kubernetes StatefulSets.
StatefulSets – The Theory
A StatefulSet works much like a Deployment does: it contains identical container specs, but it also enforces an ordering for the deployment. Instead of all the pods being deployed at the same time, a StatefulSet deploys its pods sequentially, so the first pod is deployed and ready before the next pod starts. (NOTE: it is possible to deploy pods in parallel if you need to, but that might muddy your understanding of StatefulSets for now, so ignore it.) Each of these pods has its own identity and is named with a unique, stable ordinal index so that it can be referenced.
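As an aside on that parallel option: the ordering behavior is governed by the StatefulSet's podManagementPolicy field, which defaults to OrderedReady. The fragment below is only a sketch of where that field sits in the spec; the "web" name and nginx image are placeholders, not part of this post's example.

```yaml
# Sketch only: a StatefulSet that opts out of ordered rollout.
# "web" and the nginx image are placeholder values.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  podManagementPolicy: Parallel  # default is OrderedReady
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.17
```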
OK, now that we know how the deployment of a StatefulSet works, what about the failure of a pod that is part of a StatefulSet? Well, the identity is preserved. If we lose pod-2 due to a host failure, the StatefulSet won’t just deploy some random replacement pod; it will deploy a new “pod-2”, because identity matters in a StatefulSet.
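Those stable identities also translate into predictable DNS names (via the headless service covered next). As a sketch, assuming a StatefulSet and a headless service both named mysql in the default namespace, the per-pod DNS names follow this pattern:

```shell
#!/usr/bin/env bash
# Sketch: build the stable DNS names a StatefulSet gives its pods.
# Assumes the StatefulSet and its headless service are both named
# "mysql" and live in the "default" namespace.
set -euo pipefail

statefulset="mysql"
service="mysql"
namespace="default"
replicas=2

names=()
for ((ordinal = 0; ordinal < replicas; ordinal++)); do
  # Pod name is <statefulset-name>-<ordinal>; its DNS entry is
  # <pod-name>.<headless-service>.<namespace>.svc.cluster.local
  pod="${statefulset}-${ordinal}"
  names+=("${pod}.${service}.${namespace}.svc.cluster.local")
done

printf '%s\n' "${names[@]}"
```

With two replicas this prints mysql-0.mysql.default.svc.cluster.local and mysql-1.mysql.default.svc.cluster.local, which is why a client can always find "the first pod" by name.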
StatefulSets currently require a “headless” service to manage the pod identities. This is a service that has the same selectors you’re used to, but it won’t receive a clusterIP address, meaning that you can’t route traffic to the containers through this service object; instead, it exists to give each pod a stable DNS entry.
StatefulSets – In Action
The example this blog post uses for a StatefulSet comes straight from the Kubernetes website’s tutorial on running a replicated MySQL cluster. In this example, we’ll deploy a pair of pods, each running a mysql container. The example also uses a sidecar container called xtrabackup, which aids in replication between the mysql instances. So why a StatefulSet for mysql? Well, the order matters, of course. The first pod that gets deployed will contain our master mysql database, where both reads and writes are handled. The additional pods will contain the replicated mysql data but can only be used for read operations.
The diagram below shows the setup we’ll be creating. Our application (which is also made up of a set of containers and services, but that doesn’t matter here) connects directly to one of the mysql containers. There is a headless service used to maintain the network identity of the pods, and another service that provides read access to the pods. Each pod has a pair of containers (mysql as the main container, and xtrabackup as a sidecar for replication), and we also create persistent storage based on the storage class we created in the Cloud Providers and Storage Classes post.
Before we get to deploying anything, we’ll create a new ConfigMap. This ConfigMap holds configuration information the containers use to take on an identity at boot time. The mysql configuration data below ensures that the first mysql container becomes the master and the other containers are read-only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql
  labels:
    app: mysql
data:
  master.cnf: |
    # Apply this config only on the master.
    [mysqld]
    log-bin
  slave.cnf: |
    # Apply this config only on slaves.
    [mysqld]
    super-read-only
Deploy the manifest above by running:
kubectl apply -f [manifest file].yml
Next, we’ll create our mysql services. Here we’ll create the headless service for use with our StatefulSet to manage the identities, as well as a service that handles traffic for mysql reads.
# Headless service for stable DNS entries of StatefulSet members.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  clusterIP: None
  selector:
    app: mysql
---
# Client service for connecting to any MySQL instance for reads.
# For writes, you must instead connect to the master: mysql-0.mysql.
apiVersion: v1
kind: Service
metadata:
  name: mysql-read
  labels:
    app: mysql
spec:
  ports:
  - name: mysql
    port: 3306
  selector:
    app: mysql
Deploy the above manifest file by running:
kubectl apply -f [manifest file].yml
Once the services are deployed, we can check on them with the following command:
kubectl get svc
Notice that the CLUSTER-IP for the mysql service is “None”. This is our headless service for our StatefulSet.
Lastly, we deploy our StatefulSet. This manifest includes several configs that we haven’t talked about yet, including init containers. Init containers run before the normal containers in the pod and take some action. In this case, they read the pod’s ordinal index from its hostname and write a mysql config based on that value, ensuring each pod knows whether it is the master pod or one of the read-only secondary pods.
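The ordinal-parsing trick those init containers rely on can be run in isolation; the sketch below uses the literal string "mysql-1" to stand in for the value `hostname` would return inside the pod:

```shell
#!/usr/bin/env bash
# Sketch: extract a StatefulSet pod's ordinal index from its hostname,
# the same regex trick the init-mysql container uses. "mysql-1" stands
# in for the real hostname inside the pod.
set -euo pipefail

hostname="mysql-1"

# The pod name ends in "-<ordinal>"; capture the trailing digits.
[[ "$hostname" =~ -([0-9]+)$ ]] || exit 1
ordinal="${BASH_REMATCH[1]}"

# Offset by 100 to avoid the reserved server-id=0, as the manifest does.
server_id=$((100 + ordinal))

echo "ordinal=${ordinal} server-id=${server_id}"  # prints "ordinal=1 server-id=101"
```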
You’ll also see that there are scripts written into this manifest file that set up the replication using the xtrabackup containers.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  selector:
    matchLabels:
      app: mysql
  serviceName: mysql
  replicas: 2
  template:
    metadata:
      labels:
        app: mysql
    spec:
      initContainers:
      - name: init-mysql
        image: mysql:5.7
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Generate mysql server-id from pod ordinal index.
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          echo [mysqld] > /mnt/conf.d/server-id.cnf
          # Add an offset to avoid reserved server-id=0 value.
          echo server-id=$((100 + $ordinal)) >> /mnt/conf.d/server-id.cnf
          # Copy appropriate conf.d files from config-map to emptyDir.
          if [[ $ordinal -eq 0 ]]; then
            cp /mnt/config-map/master.cnf /mnt/conf.d/
          else
            cp /mnt/config-map/slave.cnf /mnt/conf.d/
          fi
        volumeMounts:
        - name: conf
          mountPath: /mnt/conf.d
        - name: config-map
          mountPath: /mnt/config-map
      - name: clone-mysql
        image: gcr.io/google-samples/xtrabackup:1.0
        command:
        - bash
        - "-c"
        - |
          set -ex
          # Skip the clone if data already exists.
          [[ -d /var/lib/mysql/mysql ]] && exit 0
          # Skip the clone on master (ordinal index 0).
          [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
          ordinal=${BASH_REMATCH[1]}
          [[ $ordinal -eq 0 ]] && exit 0
          # Clone data from previous peer.
          ncat --recv-only mysql-$(($ordinal-1)).mysql 3307 | xbstream -x -C /var/lib/mysql
          # Prepare the backup.
          xtrabackup --prepare --target-dir=/var/lib/mysql
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
      containers:
      - name: mysql
        image: mysql:5.7
        env:
        - name: MYSQL_ALLOW_EMPTY_PASSWORD
          value: "1"
        ports:
        - name: mysql
          containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
        livenessProbe:
          exec:
            command: ["mysqladmin", "ping"]
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            # Check we can execute queries over TCP (skip-networking is off).
            command: ["mysql", "-h", "127.0.0.1", "-e", "SELECT 1"]
          initialDelaySeconds: 5
          periodSeconds: 2
          timeoutSeconds: 1
      - name: xtrabackup
        image: gcr.io/google-samples/xtrabackup:1.0
        ports:
        - name: xtrabackup
          containerPort: 3307
        command:
        - bash
        - "-c"
        - |
          set -ex
          cd /var/lib/mysql
          # Determine binlog position of cloned data, if any.
          if [[ -f xtrabackup_slave_info ]]; then
            # XtraBackup already generated a partial "CHANGE MASTER TO" query
            # because we're cloning from an existing slave.
            mv xtrabackup_slave_info change_master_to.sql.in
            # Ignore xtrabackup_binlog_info in this case (it's useless).
            rm -f xtrabackup_binlog_info
          elif [[ -f xtrabackup_binlog_info ]]; then
            # We're cloning directly from master. Parse binlog position.
            [[ `cat xtrabackup_binlog_info` =~ ^(.*?)[[:space:]]+(.*?)$ ]] || exit 1
            rm xtrabackup_binlog_info
            echo "CHANGE MASTER TO MASTER_LOG_FILE='${BASH_REMATCH[1]}',\
                  MASTER_LOG_POS=${BASH_REMATCH[2]}" > change_master_to.sql.in
          fi
          # Check if we need to complete a clone by starting replication.
          if [[ -f change_master_to.sql.in ]]; then
            echo "Waiting for mysqld to be ready (accepting connections)"
            until mysql -h 127.0.0.1 -e "SELECT 1"; do sleep 1; done
            echo "Initializing replication from clone position"
            # In case of container restart, attempt this at-most-once.
            mv change_master_to.sql.in change_master_to.sql.orig
            mysql -h 127.0.0.1 <<EOF
          $(<change_master_to.sql.orig),
            MASTER_HOST='mysql-0.mysql',
            MASTER_USER='root',
            MASTER_PASSWORD='',
            MASTER_CONNECT_RETRY=10;
          START SLAVE;
          EOF
          fi
          # Start a server to send backups when requested by peers.
          exec ncat --listen --keep-open --send-only --max-conns=1 3307 -c \
            "xtrabackup --backup --slave-info --stream=xbstream --host=127.0.0.1 --user=root"
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
          subPath: mysql
        - name: conf
          mountPath: /etc/mysql/conf.d
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
      volumes:
      - name: conf
        emptyDir: {}
      - name: config-map
        configMap:
          name: mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
Deploy the manifest file above by running:
kubectl apply -f [manifest file].yml
Once the containers have been deployed, we can check on them with the kubectl get pods command.
kubectl get pods
Here, you can see that we have two mysql pods, each containing two containers. Each is running, and they are numbered with the [podname]-X naming structure, giving us mysql-0 and mysql-1. Our application needs to be able to write data, so it will be configured to write to mysql-0. However, if we have other reports that are being reviewed, those read-only reports could be built from the data in the other pods to reduce load on the writable pod.
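To make that split concrete, the routing decision an application might make can be sketched as a tiny shell helper. The mysql-0.mysql and mysql-read endpoints come from the manifests above; the mysql_host function itself is just an illustrative name, not part of the example.

```shell
#!/usr/bin/env bash
# Sketch: pick a MySQL endpoint based on the type of operation.
# "mysql-0.mysql" (the master, via the headless service) and
# "mysql-read" (the read service) come from the manifests in this
# post; the helper function is hypothetical.
set -euo pipefail

mysql_host() {
  if [[ "$1" == "write" ]]; then
    # Writes must go to the master, addressed by its stable identity.
    echo "mysql-0.mysql"
  else
    # Reads can be load-balanced across all replicas.
    echo "mysql-read"
  fi
}

echo "writes -> $(mysql_host write)"
echo "reads  -> $(mysql_host read)"
```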
We can then deploy our application writing directly to the mysql-0 pod, and it comes up as usual.
Summary
Sometimes, you need to treat your containers more like pets than cattle: not all containers are the same, as in this post’s example, where mysql-0 is our master mysql instance. When the order of deployment matters and the containers are not all identical, a StatefulSet might be your best bet.