The Ultimate Guide To Kubernetes Capacity Management
Overview
You might already be feeling the pains of managing Kubernetes-based infrastructure. While automation can help you fix a few issues with your infrastructure and applications, capacity management is an area that requires team-wide coordination and the proper processes in place to do it correctly.
Capacity Management Touches Different Critical Areas:
User Experience: You don’t want your users to suffer from performance or stability issues because you didn’t provision the needed resources or capacity inside your Kubernetes cluster.
Team Productivity: Engineers often assume that provisioning enough machines will fix the issue. However, Kubernetes requires more than that. Teams need to be aware of how many resources should be allocated to each pod, where pods should be scheduled, and which VM family best accommodates different workloads. Ignoring these concerns keeps teams in constant reaction mode and generates many late-night alerts!
Business Viability And Efficiency: Poor capacity management leads to paying too much to cloud providers to run your applications. This causes your business to suffer from weak ROI and continuous pressure from leadership to reduce costs. For some price-sensitive businesses, this could put them out of business.
This ebook provides a few steps that can help you and your team avoid a lot of the confusion around capacity management. You will be able to bring your team up to speed quickly and realize significant savings without jeopardizing your end users' experience.
What Is Capacity Management?
Capacity management is mostly science, but it requires a bit of human intelligence as well. Put simply, capacity management, in the context of cloud-based infrastructure, is finding the balance between application performance, resource planning, and cost savings. Your application's code is evolving, resulting in workloads that are constantly changing. Cloud infrastructure is also advancing at a rapid pace, making capacity management a continuous process. Finding that balance depends on each organization's velocity and sensitivity to these factors.
Application performance is impacted by your code, workloads, and your system configurations, assuming that your application can get all the CPU, memory, and I/O resources it needs. Setting performance goals, and being able to articulate them in a simple and quantitative way, will help you share the same goals across the organization. Set those goals with your expected workload changes in mind.
Resource planning is an area that many ignore. Resource planning is about finding the best types of underlying infrastructure to support your application's needs. For example, if your pod or microservice is memory intensive, it should run on memory-optimized instances. The process of resource planning can become complex very quickly due to the dynamics that you need to consider while allocating your resources.
Cost savings is about finding the best billing model for your short- and long-term plans. This is one of the most commonly used tactics. We see many companies jump into
buying reserved capacity to save somewhere between 30% and 50% on their compute costs. While this step saves significantly, you could still be throwing money away if you take it without considering your applications' needs and without proper resource planning. We will show you, step by step, how to keep all three areas optimized.
We come across many teams that have an imbalance in one or more of these areas, causing poor capacity management. We've seen customers that have great performance but very low utilization, and eventually spend too much - see figure 1 below. These customers usually get a surprising monthly bill for their cloud service. The next group of customers usually use the wrong tactics to optimize their infrastructure and suffer from bad application performance - see figure 2 below. These customers face continuous alerts and escalations due to customer complaints, typically because they changed available capacity without understanding the impact on the user experience. The last group is usually the least common. They typically have great utilization with good user experience, but they are running with the wrong billing model - see figure 3 below. This group can usually fix the cost-saving issue pretty quickly if they use some resource reservation.
Figure 1: Great performance, low utilization, overspending
Figure 2: Optimized spend, poor application performance
Figure 3: Good utilization and user experience, wrong billing model
Again, balance in these three areas is key. This is an ongoing process that you and your team should collaborate on regularly, especially if you feel that balance is jeopardized. We will discuss how you can achieve this in a more structured way.
Figure 4
Why Is Kubernetes Capacity Management A Challenging Task?
First, cloud infrastructure is supposed to give you dynamic capacity. Capacity management is a continuous process that must be done habitually to achieve your operational and business goals. You want to set reasonable goals and maintain them. Trying to be efficient by building highly sensitive measures and optimizing capacity too frequently causes a lot of churn and distracts your team. At the other extreme, managing capacity sporadically and with very large buffers causes you to spend too much money; it is usually driven by the panic that you or your team experience when you see the invoice from your cloud provider.
Second, when engineers try to isolate pod/container-level resource optimization from infrastructure-level capacity management, the result can be friction between team members. Each role has its own motivations to get the job done. Software engineers may prefer to be on the safe side by provisioning generous requests and limits, while DevOps engineers or SREs try to trim these values to make the infrastructure more efficient. Additionally, developers are motivated to ship features quickly, and have no time to analyze needed resources or to improve the efficiency of their code. Some mutual understanding of application workloads and infrastructure capabilities is needed in order to avoid this challenge. Let's dig deeper into the motivations of each role.
Capacity Management Framework
Step 1 - Understand How Everything Is Connected With Kubernetes Capacity
The experience of your users/customers should be your main focus throughout this process. The responsiveness and performance of your pods and containers should be at the forefront of your planning process. The key to your success is making sure that your users are not negatively impacted when you start tweaking your capacity, so you should have the right metrics and observability to tell you the correlation between different workloads and resource consumption.
• Identify your pod/container's SLOs (Service Level Objectives).
• Identify the impact of workload on pod/container KPI(s).
• Identify the impact of KPIs on resource usage.
  A. For example, if your workloads go up at certain times of the day, understand how these workloads impact each pod's use of resources: does it use more CPU, memory, or I/O? What's the ratio between the workload increase and the resource change? For example, does a 2X increase in workload mean a 2X increase in CPU utilization?
Step 2 - Set The Right Dashboards And Service Level Indicators (SLIs)
If you are a DevOps engineer, SRE, or technical lead working closely with code and infrastructure, your goal is to connect how workloads impact resource usage, and eventually the capacity that you will need for your infrastructure. Make sure you have the following dashboards:
• A per-pod dashboard showing workload metrics, e.g. API calls per minute and response times, alongside resource usage, i.e. CPU, memory, network, and disk.
• Aggregate resource usage across all your pods. This dashboard should show how much CPU, memory, and network are used, and compare them to available capacity.
• Infrastructure resource utilization, which shows the percentage of CPU and memory allocated per node. You want to learn whether you have over-allocated or under-allocated machines.
• If you have large clusters, you can use heatmaps to get a proper visual of all your clusters.
Step 3 - Adjust Your Resources
The previous step will show you whether your pods are over- or under-provisioned. In this step, you should start thinking about adjusting each pod independently, not about the available capacity. It might be tempting to start shutting down VMs or resizing what you have; for now, just focus on allocating the right resources to your pods. You should take the following steps:
1. Make sure all pods have their `requests` and `limits` adjusted to reasonable values (a manifest sketch follows this list).
  A. Consider using the right statistical measures.
  B. Consider also the buffer that you would like to include.
2. Monitor the performance of your pods and see whether KPIs are negatively impacted by those changes.
  A. Leave your pods running long enough to go through their different workload patterns, especially if those patterns repeat frequently, e.g. on a daily basis.
3. If KPIs are impacted, adjust your allocated resources until you get the right values.
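Below is a minimal sketch of what step 1 looks like in a Deployment manifest. The workload name, image, and resource values are placeholders, not recommendations; derive your own values from the dashboards you built in step 2 and add the buffer you decided on.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api                 # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
      - name: checkout-api
        image: registry.example.com/checkout-api:1.0   # placeholder image
        resources:
          requests:
            cpu: "250m"      # e.g. ~95th percentile of observed CPU usage
            memory: "256Mi"  # e.g. observed peak memory plus a small buffer
          limits:
            cpu: "500m"      # headroom for spikes; watch for CPU throttling
            memory: "512Mi"  # exceeding this gets the container OOM-killed
```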
Figure: the adjustment cycle - 1. Requests and limits adjusted, 2. Monitor the performance, 3. Adjust your allocated resources.
Warning: Jumping into VM optimization without finishing this step may disrupt your application's stability and cause you to take unnecessary risks with your capacity.
At the end of this step, you should be confident about the amount of resources each pod needs. You are now ready to select the right types of VMs.
Step 4 - Adjust Allocatable Capacity
Once you are done with the previous step, you should start thinking about the nature of the workloads you are running on your cluster to identify the node pools that you need:
1. Categorize Your Workloads - If you are running memory-intensive applications, you will need to start thinking of running a pool of nodes that are memory-optimized. The same goes for CPU and I/O.
  A. Make sure you understand whether your application is CPU, memory, or I/O intensive to select the right VM type.
2. Map Workloads To Proper Node Pools - While it is easier to have a single pool of instances and spread pods equally across it, you will get much greater value if you diversify your pools and allocate the appropriate workloads to each one.
3. Create Node Pools - Design the right pools and map your pods to these pools.
4. Instruct The Kubernetes Scheduler To Allocate Pods To The Right Pools - You are now ready to gradually create those pools and instruct the Kubernetes scheduler to place the right pods on the right node pool, as sketched below.
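As an illustration of step 4, the sketch below steers a memory-intensive workload onto a memory-optimized node pool with a nodeSelector. The label key/value, pool, and workload names are assumptions; most managed Kubernetes services let you apply labels to a node pool when you create it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker             # hypothetical memory-intensive workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: analytics-worker
  template:
    metadata:
      labels:
        app: analytics-worker
    spec:
      nodeSelector:
        workload-class: memory-optimized   # assumed label applied to the node pool
      containers:
      - name: analytics-worker
        image: registry.example.com/analytics-worker:1.0   # placeholder image
        resources:
          requests:
            cpu: "500m"
            memory: "4Gi"
```

A nodeSelector is the simplest mechanism; node affinity and taints/tolerations give finer control when a pool must be reserved for specific workloads.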
Step 5 - Optimize Your Bill
Finishing the previous step can potentially save you a lot of money. However, in many cases, you can still squeeze more value out of your infrastructure. At this point, you should start thinking about the best ways to reduce the amount you pay at the end of each month by considering the best billing model for your organization. If you are confident about the sustainability of the workloads you identified in previous steps, you should start thinking of capacity reservations in order to save between 35% and 50% on your compute bill. You can save even more if you consider creating a special pool to run pods on spot or preemptible VMs.
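For the spot/preemptible pool, one common pattern is to taint those nodes and let only interruption-tolerant pods tolerate the taint. The sketch below assumes a hypothetical label and taint on the spot pool; check your cloud provider's documentation for the labels and taints it actually applies to spot nodes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-batch-worker          # hypothetical interruption-tolerant workload
spec:
  replicas: 4
  selector:
    matchLabels:
      app: report-batch-worker
  template:
    metadata:
      labels:
        app: report-batch-worker
    spec:
      nodeSelector:
        node-lifecycle: spot             # assumed label on the spot/preemptible pool
      tolerations:
      - key: "node-lifecycle"            # assumed taint keeping regular pods off spot nodes
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      containers:
      - name: report-batch-worker
        image: registry.example.com/report-worker:1.0   # placeholder image
```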
Kubernetes Capacity Management Tips & Tricks
• Make sure you use the right statistical measure for each metric type.
  1. Use the 95th percentile for CPU.
  2. Use the maximum for memory.
  3. Use the 90th percentile for I/O.
• Understand how the Linux CPU scheduler can impact the allocation of your resources.
  1. Don't forget to measure CPU throttling (one way to alert on it is sketched below).
• Review your resource budgeting once or twice before optimizing your next bill.
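As an illustration of the throttling tip, here is a sketch of a Prometheus alerting rule built on the cAdvisor throttling counters, assuming you run Prometheus via the Prometheus Operator; the rule name, threshold, and labels are hypothetical and should be tuned to your environment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling               # hypothetical rule name
spec:
  groups:
  - name: capacity-management
    rules:
    - alert: HighCPUThrottling
      expr: |
        sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (namespace, pod)
          /
        sum(rate(container_cpu_cfs_periods_total[5m])) by (namespace, pod)
          > 0.25
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is CPU-throttled more than 25% of the time"
```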
Avoid These Common Capacity Management Mistakes:
Fire And Forget: Capacity management is a continuous effort impacted by many moving parts. Stay alert and stay ahead of the curve.
It Can Be Done By A Single Person Or By SREs Only: This is a team sport. Software engineers and infrastructure folks should work closely together.
Becoming Too Efficient: You need to find the balance between being efficient and the cognitive overhead it imposes on you and your team. Don't optimize to such a granular level that you have to adjust capacity too frequently. On the other hand, don't leave large idle capacity buffers; you will end up wasting a lot of money and capacity.
Optimize Your Bill Only: Many teams automatically jump into VM reservations to save money without really looking into allocating the right resources to their pods and containers. It may give you initial relief, but you will still end up paying too much for an extended period of time. Focus on pod-level optimization as described earlier in this document, and then work on capacity reservations to save money more confidently.
Ignore System Overheads: Many teams cause themselves incidents because they plan based on raw container capacity alone. Make sure you account for operating system and other system overheads when you plan your capacity.
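One concrete form this overhead takes is the node's allocatable capacity: Kubernetes subtracts resources reserved for the OS and for its own daemons from the node's raw capacity. A minimal kubelet configuration sketch is shown below; the values are placeholders, and managed Kubernetes services typically set sensible defaults for you.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "200m"          # reserved for OS daemons (sshd, journald, ...)
  memory: "500Mi"
kubeReserved:
  cpu: "200m"          # reserved for kubelet, container runtime, etc.
  memory: "500Mi"
evictionHard:
  memory.available: "200Mi"   # node-pressure eviction threshold
```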
Missing The Big Picture: Capacity management is a budgeting exercise. Everyone needs to know their pods' budget and stick to it whenever it makes sense. A simple Google sheet can help you plan best- and worst-case resource needs. Aggregate those needs, and plan your nodes accordingly. This can turn into a complex optimization problem, but you don't need to be super-efficient; start somewhere and ask your team to perform that simple budgeting exercise.
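If you want to make such a budget enforceable inside the cluster, one option is a ResourceQuota per team namespace. The sketch below is purely illustrative; the namespace name and numbers are hypothetical and should come from your own budgeting sheet.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-budget
  namespace: checkout               # hypothetical team namespace
spec:
  hard:
    requests.cpu: "8"               # total CPU the team's pods may request
    requests.memory: "16Gi"
    limits.cpu: "16"
    limits.memory: "32Gi"
    pods: "50"
```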
Using Generic Huge Virtual Machines: Make sure you use the right type of VMs so you get the best value for the CPU and memory your team pays for.
Magalix Can Help!
We can help you move faster and more confidently with your Kubernetes capacity management. At Magalix, we help teams organize and implement best practices in Kubernetes and cloud capacity management. Our product can give you basic capacity management with a single command line. In addition to our product's reports and automation, we also help our customers with customized onboarding and capacity management plans.
Register today and get a 14-day trial to manage your cluster’s capacity. Get Started
Contact us for a demo, or to discuss your specific capacity management needs. Request Demo
Thank You
This book is brought to you by the team at Magalix. Magalix accelerates your cloud-native journey. Go to production with cloud-native and Kubernetes in a snap, with Magalix.
team@magalix.com Magalix.com
MagalixCorp