Overview:
- Sizing is performed using Configuration Plans in TKGI.
- Plan 1 must be activated and configured before activating Plans 2 through 10. Plans 11-13 support Windows-based K8s clusters on vSphere with NSX, with Antrea handling K8s network egress, and are a BETA feature only.
- The control plane node VM is sized according to the number of worker nodes it manages in the cluster.
- The default selections for Master/ETCD node instances are 1, 3, or 5 nodes.
WARNINGS:
- To change the number of control plane/etcd nodes for a plan, no clusters can currently be using the plan. TKGI does not support changing the number of control plane nodes for existing clusters.
- Confirm sufficient hardware and network bandwidth for the increased disk write load and network traffic.
- TKGI does not support changing availability zones (AZs) of existing control plane nodes after set-up.
Calculating the Number of Worker Nodes:
Gather the following statistics for each application being developed:
- Maximum number of pods you expect to run for each application [p]
- Memory requirements per pod [m]
- CPU requirements per pod [c]
- Add one extra worker node (+1) to handle pod/deployment rollovers and upgrades
- Add 10-25% more worker nodes for HA redundancy
Split the applications by cluster, and calculate the following for each cluster:
- pTotal = Add all the [p] entries together for all the application deployments running in "this" cluster
- mTotal = Add all the [m] entries together for all the application deployments running in "this" cluster
- cTotal = Add all the [c] entries together for all the application deployments running in "this" cluster
For each cluster, the VMware requirement calculations assume an estimate of 100 pods per worker node:
Minimum number of worker nodes w/+1 and a 25% HA allowance: W = (pTotal/100) + 1 + (pTotal/100/4)
Minimum RAM per worker = WR = mTotal * 100
Minimum CPU per worker = WC = cTotal * 100
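The same formulas, expressed as a minimal Python sketch (the constant and function names are illustrative assumptions based on the description above, not anything provided by TKGI tooling):

    # Sketch of the worker-sizing formulas above (names are illustrative only).
    PODS_PER_WORKER = 100          # VMware's estimation of pods per worker node

    def min_workers(p_total):
        # W = (pTotal/100) + 1 rollover worker + 25% HA allowance
        return p_total / PODS_PER_WORKER + 1 + p_total / PODS_PER_WORKER / 4

    def min_ram_per_worker(m_total_gb):
        # WR = mTotal * 100
        return m_total_gb * PODS_PER_WORKER

    def min_cpu_per_worker(c_total):
        # WC = cTotal * 100
        return c_total * PODS_PER_WORKER

    # Example: min_workers(200) == 2.0 + 1 + 0.5 == 3.5 workers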
One calculation method is to add up all of the apps that will be deployed in the cluster, like so (the cluster totals are worked out in the sketch after the list):
Distributive App1:
p: 100
m: 4 GB
c: 0.5 CPU
Distributive App2:
p: 20
m: 1 GB
c: 0.1 CPU
Hefty App 3:
p: 2
m: 16 GB
c: 2.5 CPU
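Applying the cluster-wide formulas to these three example apps gives the following totals (a worked sketch; plain arithmetic from the numbers above):

    # Totals for App1, App2, App3 as (max pods, GB RAM per pod, CPU per pod)
    apps = {"App1": (100, 4, 0.5), "App2": (20, 1, 0.1), "App3": (2, 16, 2.5)}

    p_total = sum(p for p, _, _ in apps.values())   # 122 pods
    m_total = sum(m for _, m, _ in apps.values())   # 21 GB (per-pod RAM, summed)
    c_total = sum(c for _, _, c in apps.values())   # 3.1 CPU (per-pod CPU, summed)

    W  = p_total / 100 + 1 + p_total / 100 / 4      # ~2.53 workers
    WR = m_total * 100                              # 2100 GB RAM per worker
    WC = c_total * 100                              # 310 cores per worker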
Watch for estimator results that are out of bounds for the maximum size of a VM on a single physical host. The estimator runs into limits when the containers have high CPU or memory requirements; it is very common that an on-prem ESXi host itself has only 256 logical cores and 16 TB of RAM. With the examples above we would not likely get to 100 pods/worker: the CPU calculation comes out at a bit over 4 times what the ESXi host can offer (and there is overhead consuming some CPU cores), so the CPU needs to be split over 4 or 5 workers.
Another option might be to calculate App1, App2, and App3 independently (a sketch reproducing these per-app numbers follows the App3 calculation):
App 1:
100 pods * 4 GB RAM = 400 GB RAM
100 pods * 0.5 CPU = 50 CPUs (cores)
W = (100/100) + 0 rollover worker allowance + (100/100/4) = 1.25 workers including HA allowance
WR = 4 GB x 100 = 400 GB, actual (100 pods spread over 1.25 workers, i.e. 80 pods/worker): 320 GB/worker node
WC = 0.5 CPU x 100 = 50 cores, actual: 40 cores/worker node
App2:
20 pods * 1 GB RAM = 20 GB RAM
20 pods * 0.1 CPU = 2 CPU (cores)
W = (20/100) + 0 + (20/100/4) = 0.25 workers including HA allowance (1 worker required)
WR = 1 GB * 100 = 100 GB, actual: 20 GB/worker node
WC = 0.1 * 100 = 10 cores, actual: 2 cores/worker node
App3:
2 pods * 16 GB = 32 GB RAM
2 pods * 2.5 CPU = 5 CPU (cores)
W = (2/100) + 0 + (2/100/4) = 0.025 workers including HA allowance (1 worker required)
WR = 16 GB * 100 = 1600 GB, actual: 32 GB/worker node
WC = 2.5 * 100 = 250 cores, actual: 5 cores/worker node
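The per-app figures above can be reproduced with a short loop (a sketch under the same assumptions; the "actual" figures spread each app's total demand over its fractional worker count, with a floor of one worker):

    # Reproduce the App1/App2/App3 estimates above.
    apps = {"App1": (100, 4, 0.5), "App2": (20, 1, 0.1), "App3": (2, 16, 2.5)}

    for name, (p, m, c) in apps.items():
        W  = p / 100 + 0 + p / 100 / 4    # no rollover worker, 25% HA allowance
        WR = m * 100                      # estimator RAM per worker
        WC = c * 100                      # estimator cores per worker
        workers      = max(W, 1)          # at least one worker is required
        actual_ram   = p * m / workers    # e.g. App1: 400 GB / 1.25 = 320 GB
        actual_cores = p * c / workers    # e.g. App1: 50 cores / 1.25 = 40
        print(f"{name}: W={W:.3f} WR={WR:g} GB WC={WC:g} cores, "
              f"actual {actual_ram:g} GB and {actual_cores:g} cores per worker")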
Reality checks:
- Ensure the actual memory and CPU per worker is less than the size of the K8s worker node VMs.
- The sizing is done by selecting plans in the TKGI edition tile.
For example, let's say we have a QA/user acceptance testing environment:
Plan: small (not for production workloads)
K8s control plane/etcd VM size: medium.disk (cpu: 2, ram: 4 GB, disk: 32 GB, with Master Persistent Disk Type: Automatic: 10 GB)
Cluster Workers: medium.disk (cpu: 2, ram: 4 GB, disk: 32 GB, with Worker Persistent Disk Type: Automatic: 50 GB), with a maximum of 50 worker nodes and a minimum of 3
Number of worker nodes: 50
App1 reality check:
- 320 GB of memory per worker is required, but the workers have roughly 1/10 of that capacity, so instead of needing 1.25 workers, the app may require at least 12.5 workers to fully scale on memory alone.
- 50 cores per worker are required, but the workers have 2, so the requirement is 25x what a worker has available.
- So instead of needing 1.25 workers, this app may require 25 workers (50 cores / 2 cores per worker) to fully scale.
App2 reality check:
- memory usage is smaller than one worker's capacity
- core usage exactly matches one worker's capacity at maximum scaling
- So instead of 0.25 workers, exactly 1 worker is required.
App3 reality check:
- Total memory (32 GB) is exactly the same as one worker's maximum availability, but each pod needs 16 GB, which is half of that total; with memory over-utilization only one worker is required, but most likely the app will be split across two workers.
- 5 cores per worker are required, but the workers have only 2, so the requirement is 2.5x what a worker node has available.
- So instead of needing 0.025 workers, 2.5 workers are required.
- - BUT PERFORMANCE WILL LIKELY SUFFER: each worker has only 2 cores, so it can only give 2 cores, not 2.5. The workers will either have to be configured to run overcommitted, or the pod will not start.
So for this testing environment to have all 3 apps fully scaled, we need 25 + 1 + 2.5 = 28.5, rounded up to 29 workers. The testing environment has 50.
The most any pod needs is 16 GB of RAM, and with 50 worker nodes this testing environment can run all three apps at full scale with capacity to spare.
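A quick sketch of the CPU-driven part of that reality check, using the 2-core workers from the plan above (the per-app core totals come from the earlier estimates):

    import math

    worker_cores = 2                                     # per the medium.disk worker plan
    cores_needed = {"App1": 50, "App2": 2, "App3": 5}    # total cores at full scale

    workers_per_app = {name: cores / worker_cores for name, cores in cores_needed.items()}
    total = sum(workers_per_app.values())                # 25 + 1 + 2.5 = 28.5
    print(workers_per_app, "->", math.ceil(total), "workers needed; the environment has 50")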
Worker Node Sizing Chart for the Control Plane:
Number of Worker Nodes | Control Plane CPU | Control Plane RAM (GB)
1-5     | 1  | 3.75
6-10    | 2  | 7.5
11-100  | 4  | 15
101-250 | 8  | 30
251-500 | 16 | 60
500+    | 32 | 120
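If you want to pick a control plane size programmatically, a hypothetical lookup over this chart might look like the following (the data structure and function name are illustrative, not part of TKGI):

    # (max worker nodes, control plane CPU, control plane RAM in GB), per the chart above
    CONTROL_PLANE_SIZES = [(5, 1, 3.75), (10, 2, 7.5), (100, 4, 15),
                           (250, 8, 30), (500, 16, 60)]

    def control_plane_size(worker_count):
        for max_workers, cpu, ram_gb in CONTROL_PLANE_SIZES:
            if worker_count <= max_workers:
                return cpu, ram_gb
        return 32, 120                           # 500+ worker nodes

    # Example: control_plane_size(50) -> (4, 15)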