Resizing Tanzu Kubernetes Grid Cluster Nodes

December 9, 2020 | By Eric Shanks

Have you ever missed the mark when trying to properly size a Kubernetes environment? Maybe the requirements changed, maybe there were wrong assumptions, or maybe the project took off and it just needs more resources. Under normal circumstances, I might suggest that you build a new Tanzu Kubernetes Grid (TKG) cluster and re-deploy your apps. Unfortunately, as much as I want to treat Kubernetes clusters as ephemeral, they can’t always be treated that way. If you need to resize your TKG nodes without re-deploying a new cluster, keep reading.

Tanzu Kubernetes Grid is built atop the ClusterAPI project, and as such, we can use our knowledge of how ClusterAPI provisions our clusters to update them.

My favorite ClusterAPI reference diagram for visually understanding the components of ClusterAPI can be found on Chip Zoller‘s site, Neon Mirrors. Mr. Zoller visualizes the cluster object dependencies, and we will use these to update a running cluster with no downtime.

Image: clusterctl workload cluster manifest reference (from neonmirrors.net)

From the diagram, we can see that there is a MachineDeployment object. The machine deployment defines how our nodes are configured. It references a KubeadmConfigTemplate, which defines how nodes will join a Kubernetes cluster, and a MachineTemplate, which defines how the nodes are deployed on a cloud. In the diagram from neonmirrors.net you see a “vSphereMachineTemplate” object, but for this example, we’ll use the AWSMachineTemplate object.

Modify Node Settings

Now that you have some background on the TKG objects, we can try modifying the configuration. First, let's take a look at the objects we're working with. To get access to them, you'll want to set your KUBECONFIG context to the management cluster responsible for your workload clusters.
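If you're using the management cluster's admin kubeconfig, switching contexts looks something like the command below. The context name here is only an example; list your actual contexts with kubectl config get-contexts.

kubectl config use-context tkg-mgmt-admin@tkg-mgmt

Once your context is set, you can run: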

kubectl get machines

You can see that in my lab I have a six-node Kubernetes cluster. I plan to update the resources of one of the workload nodes. Notice that some of the node names have an md in them. The md stands for "machine deployment," and these are our workload nodes. Let's pick md-0 as the node to update.

Let's look for the machine deployment object.

kubectl get machinedeployments

Notice that I have three different machine deployments. Each machine deployment could consist of multiple machines, but we have three in this case because each AWS Availability Zone gets its own machine deployment. Let's check out the md-0 machine deployment in further detail.

kubectl get machinedeployment tanzuworkloads-md-0 -o yaml

The snippet below shows how the machine deployment references both a KubeadmConfigTemplate and an AWSMachineTemplate.
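Trimmed down, the relevant portion of the machine deployment spec looks something like this (the names match the templates in my lab):

spec:
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: tanzuworkloads-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSMachineTemplate
        name: tanzuworkloads-md-0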

You can see that our machine deployment is referencing a tanzuworkloads-md-0 AWSMachineTemplate. Let's go take a look at those templates.

kubectl get awsmachinetemplates

Notice that we have a template for the control plane, as well as a template for each of our availability zones. NOTE: You can create more templates for special use cases such as high-memory nodes, high-compute nodes, GPU nodes, etc.

Let's dive deeper and look at the AWSMachineTemplate for md-0.

kubectl get awsmachinetemplates tanzuworkloads-md-0 -o yaml

Ah ha! We now see where the AWS instance size and root volume size are located. Under normal circumstances, I'd tell you to just edit this manifest and you're all set. However, these templates are meant to be immutable, so what we're going to do is copy this template to a file, make our changes, and apply it under a new name.

To copy the template, redirect the output of the same command from above to a file.

kubectl get awsmachinetemplates tanzuworkloads-md-0 -o yaml > myfile.yaml

After your file is written to your workstation, edit it to make the changes you want to the machines and give the template a new name. I've also removed the status fields and other server-managed metadata, and have posted my full file below. For my environment, I changed the instanceType to t3.xlarge and set the root volume to 100 GB.

apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: AWSMachineTemplate
metadata:
  name: tanzuworkloads-md-0-new
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha3
    kind: Cluster
    name: tanzuworkloads
    uid: 89f5020a-2cf6-4c0b-be15-1c9d9c9deddb
spec:
  template:
    spec:
      ami:
        id: ami-058759e4c2532dc14
      iamInstanceProfile: nodes.tkg.cloud.vmware.com
      instanceType: t3.xlarge
      rootVolume:
        size: 100
      sshKeyName: vmc-cna-admin

After your modifications are made, you can apply the configuration to the cluster. No updates will happen to your workload cluster at this point.

kubectl apply -f myfile.yaml

kubectl get awsmachinetemplates

The last step to update our nodes is to modify our existing MachineDeployment object to point at our new AWSMachineTemplate.

kubectl edit machinedeployment tanzuworkloads-md-0

Use vim (or whatever editor kubectl is configured to use) to edit the configuration and save it.

Note: in vim, press i to enter insert mode, press Esc when you're done editing, and type :wq to save and quit.
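The field to change is the infrastructureRef name under spec.template.spec, so that the machine deployment points at the new template. After the edit, that section should look something like this (using the new template name from the file above):

spec:
  template:
    spec:
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AWSMachineTemplate
        name: tanzuworkloads-md-0-new # previously tanzuworkloads-md-0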

Once you save the configuration, you should take a look at your AWS instances. With any luck, a new instance will be provisioning with your new settings.
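If you'd rather check from the CLI than the AWS console, something like the command below should show the replacement instance spinning up. This assumes the AWS CLI is configured for the same account and region, and that the instance Name tags start with the machine deployment name, which was the case in my environment.

aws ec2 describe-instances --filters "Name=tag:Name,Values=tanzuworkloads-md-0*" "Name=instance-state-name,Values=pending,running" --query "Reservations[].Instances[].[InstanceId,InstanceType,State.Name]" --output table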

You can also verify this by running a get on your machine deployments.

kubectl get machinedeployments
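If you want to follow the rollout as it happens, you can also watch the machine objects themselves; the -w flag keeps the listing open and streams changes as the new machine is created and the old one is removed.

kubectl get machines -w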

If you watch this operation in full, you'll see new nodes provisioned and joined to your cluster. Once they are healthy, TKG will remove the old nodes, effectively replacing them. This same process can be used for upgrades.

Summary

Sometimes you need to resize an existing Tanzu Kubernetes Grid cluster without disrupting the containers that are already running. You can do this by creating a new MachineTemplate from the existing one and modifying the MachineDeployment to use the new template. Tanzu Kubernetes Grid will then do a rolling update of the nodes in the cluster.