- Oct 22, 2023
- 7 min read

Eliminating Cloud Waste: A Comprehensive Guide for FinOps and Cloud Professionals

Companies increasingly rely on cloud computing to run their IT operations. The advertised “infinite” capacity of the cloud, combined with pay-as-you-go pricing, makes it easy for developers and architects to over-provision resources. With greater access comes greater responsibility: minimizing the waste that results from using such resources without proper planning. Cloud waste refers specifically to inefficient utilization that leads to overspending on these resources and to environmental harm from excessive energy consumption.

This article will help you understand whether your company is losing money through suboptimal resource management, and guide you toward strategies that improve cost-effectiveness while reducing environmental impact.

Understanding Cloud Waste

What is Cloud Waste?

Cloud waste falls into three primary categories:

Unused resources
Overprovisioning
Inefficient architectures

Unused resources refer to any provisioned but dormant virtual machines or unattached storage volumes lying idle within the system. Overprovisioning describes a situation where too much capacity has been allocated, leading to excessive costs without adequate utilization. Finally, poorly designed cloud infrastructures can result in poor performance outcomes and increase resource consumption unnecessarily.

How big is the impact, and where is it felt?

Cloud waste can have serious implications across multiple areas of business operations.

Financial implications: Unnecessary expenses on unused or underutilized resources may strain budgets and divert funds from more critical needs. It’s commonly estimated that about 30% of cloud spend is wasted, representing about $147B every year.
Environmental impact: The cloud has a greater carbon footprint than the airline industry, and a single data center can consume as much electricity as 50,000 homes. Cloud waste adds to this energy consumption and increases the carbon footprint of data centers. According to Worlddata.info, a 30% decrease in data center power consumption is enough to power the entire U.S. for almost 40 years.
Productivity and UX impact: Inefficient cloud architectures can hurt employee productivity, with cloud projects coming in at an average of 13% over budget. User experience also suffers from slow response times and degraded application performance.
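To make the financial estimate concrete, here is a minimal Python sketch of the arithmetic behind the 30% figure. The function names and the monthly bill are hypothetical, and the 30% ratio is the commonly quoted estimate, not a measurement of any particular account.

```python
def estimate_wasted_spend(total_spend: float, waste_ratio: float = 0.30) -> float:
    """Return the share of cloud spend assumed to be wasted."""
    if total_spend < 0:
        raise ValueError("total_spend must be non-negative")
    if not 0.0 <= waste_ratio <= 1.0:
        raise ValueError("waste_ratio must be between 0 and 1")
    return total_spend * waste_ratio


def implied_total_spend(wasted_spend: float, waste_ratio: float = 0.30) -> float:
    """Invert the estimate: total spend implied by a given wasted amount."""
    if not 0.0 < waste_ratio <= 1.0:
        raise ValueError("waste_ratio must be in (0, 1]")
    return wasted_spend / waste_ratio


if __name__ == "__main__":
    monthly_bill = 100_000.0  # hypothetical monthly cloud bill in USD
    print(f"Estimated monthly waste: ${estimate_wasted_spend(monthly_bill):,.0f}")
    # Total yearly spend implied by $147B of waste at a 30% ratio.
    print(f"Implied total spend: ${implied_total_spend(147e9) / 1e9:,.0f}B")
```

Running the same calculation on your own bill gives a first order of magnitude for the savings at stake before any detailed analysis.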

Identifying Cloud Waste

To identify cloud waste, it's crucial to use the right monitoring tools, starting with those of the cloud service provider (CSP), which give a view of resource usage and issue recommendations. CSP tools can be complemented by third-party solutions in multi-cloud and hybrid scenarios, or when specific capabilities are needed, such as limiting the overprovisioning of Kubernetes clusters.

Cloud provider-native tools

Most cloud providers offer built-in monitoring and recommendation services to help manage resources and optimize costs:

AWS offers AWS Trusted Advisor, which analyzes resource utilization and recommends actions such as stopping or terminating instances with low utilization.

Azure offers Azure Advisor: in the same spirit as AWS Trusted Advisor, Azure Advisor screens resource utilization and recommends stopping, terminating, or right-sizing virtual machines.

Third-party tools

Several third-party solutions offer advanced features for managing cloud resources and identifying waste in multi-cloud or hybrid setups. A notable solution is provided by Cast.ai.

Cast.ai is an all-in-one platform for Kubernetes automation, optimization, security, and cost management. It monitors Kubernetes clusters and applies changes in real time, removing unused capacity via bin packing and downscaling, and, according to Cast.ai, with better performance than native solutions on AWS EKS.
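To illustrate what bin packing means here, the following Python sketch packs pod CPU requests onto as few nodes as possible using the generic first-fit decreasing heuristic. This is not Cast.ai's algorithm; the pod names, CPU requests, and node capacity are hypothetical, and a real scheduler also considers memory, affinity, and many other constraints.

```python
from typing import Dict, List


def first_fit_decreasing(
    pod_cpu_requests: Dict[str, float], node_capacity: float
) -> List[Dict[str, float]]:
    """Pack pods onto nodes, placing the largest requests first."""
    nodes: List[Dict[str, float]] = []
    ordered = sorted(pod_cpu_requests.items(), key=lambda kv: kv[1], reverse=True)
    for pod, cpu in ordered:
        if cpu > node_capacity:
            raise ValueError(f"{pod} does not fit on any node")
        for node in nodes:
            # Reuse the first node that still has room for this pod.
            if sum(node.values()) + cpu <= node_capacity:
                node[pod] = cpu
                break
        else:
            # No existing node has room: open a new one.
            nodes.append({pod: cpu})
    return nodes


if __name__ == "__main__":
    pods = {"api": 2.0, "worker": 1.5, "cache": 1.0,
            "cron": 0.5, "web": 2.5, "batch": 0.5}
    for i, node in enumerate(first_fit_decreasing(pods, node_capacity=4.0)):
        print(f"node-{i}: {node} (used {sum(node.values())} vCPU)")
```

Nodes that end up empty after repacking are the unused capacity that can then be removed by downscaling.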

Analyzing Usage Metrics: pick the right metrics and tooling

Tracking the right key usage metrics can help detect waste:

CPU utilization

Monitor the CPU usage of your instances to identify underutilized resources. You can use native CSP tooling:

AWS CloudWatch: Monitor CPU usage of EC2 instances and set custom alarms to notify you of underutilized resources.
Azure Monitor: Collect and analyze CPU performance data for Azure virtual machines and receive alerts when thresholds are breached.

or third-party tools like Datadog or New Relic, which are well suited to gaining insight into CPU usage across multiple cloud providers and setting custom alerts for underutilized instances.
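As a rough illustration of CPU-based detection, the Python sketch below flags instances whose average and peak CPU both stay under chosen thresholds. The thresholds, the 14-day window, and the function names are assumptions for illustration. `fetch_cpu_averages` relies on boto3 and valid AWS credentials and is not exercised in the example; the detection logic itself works on plain lists of percentages from any monitoring tool.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean
from typing import Dict, List


def fetch_cpu_averages(instance_id: str, days: int = 14) -> List[float]:
    """Fetch hourly average CPU (%) from CloudWatch. Requires boto3 and credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,  # one datapoint per hour
        Statistics=["Average"],
    )
    return [point["Average"] for point in response["Datapoints"]]


def find_underutilized(
    cpu_by_instance: Dict[str, List[float]],
    avg_threshold_pct: float = 10.0,   # assumed threshold, tune per workload
    peak_threshold_pct: float = 40.0,  # assumed threshold, tune per workload
) -> List[str]:
    """Return instance IDs whose average and peak CPU are both low."""
    flagged = []
    for instance_id, samples in cpu_by_instance.items():
        if not samples:
            continue  # no data: do not flag blindly
        if mean(samples) < avg_threshold_pct and max(samples) < peak_threshold_pct:
            flagged.append(instance_id)
    return flagged


if __name__ == "__main__":
    sample = {  # hypothetical instance IDs and CPU samples
        "i-idle": [2.0, 3.0, 5.0],
        "i-busy": [55.0, 60.0],
        "i-nodata": [],
    }
    print(find_underutilized(sample))
```

Checking the peak as well as the average avoids flagging instances that are mostly idle but need their full capacity during short bursts.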

Memory Usage

Low memory consumption can indicate a need for optimization or right-sizing. You can monitor memory usage with CSP native tools:

AWS CloudWatch: Collect memory usage data from EC2 instances by installing the CloudWatch agent and configuring custom metrics.

Pro Tip: By default, AWS CloudWatch does not capture the memory utilization of an EC2 instance, because that metric is not visible at the hypervisor level. To report memory utilization through CloudWatch, you need to install an agent (script) on the instance and publish a custom metric (let's name it EC2MemoryUtilization) to CloudWatch. The installation instructions for the monitoring agent depend on the operating system of the instance; refer to the AWS CloudWatch agent documentation for details.

Azure Monitor: Access memory performance data for Azure virtual machines using the built-in guest OS diagnostics.

Third-party solutions are useful in Multi-Cloud or Hybrid set-ups:

Datadog: Monitor memory usage across various cloud platforms and set custom alerts for overconsumption or leaks.
New Relic: Analyze memory usage patterns, identify potential bottlenecks, and receive alerts when memory consumption exceeds predefined thresholds.
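The custom-metric approach from the Pro Tip above can be sketched in Python as follows, assuming a Linux instance that exposes /proc/meminfo. The namespace `Custom/EC2` and the instance ID are hypothetical. The payload follows the shape expected by boto3's CloudWatch `put_metric_data` call, which is mentioned in a comment but not invoked. In practice the official CloudWatch agent can collect memory metrics for you; this only shows what such a script does.

```python
from typing import Dict


def memory_utilization_pct(meminfo_text: str) -> float:
    """Compute used memory (%) from the contents of /proc/meminfo."""
    fields: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def build_metric_payload(instance_id: str, utilization_pct: float) -> dict:
    """Build keyword arguments for CloudWatch put_metric_data."""
    return {
        "Namespace": "Custom/EC2",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "EC2MemoryUtilization",
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": utilization_pct,
                "Unit": "Percent",
            }
        ],
    }


if __name__ == "__main__":
    # On a real Linux instance: text = open("/proc/meminfo").read()
    text = "MemTotal: 16000000 kB\nMemFree: 1000000 kB\nMemAvailable: 4000000 kB\n"
    payload = build_metric_payload("i-0123456789abcdef0", memory_utilization_pct(text))
    print(payload)
    # With boto3 and credentials, the payload could be sent with:
    # boto3.client("cloudwatch").put_metric_data(**payload)
```

Scheduling such a script every minute (for example with cron) would produce a memory time series on which alarms can be set, just like the built-in CPU metric.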

Network throughput

Analyze network traffic patterns to identify bottlenecks and optimize bandwidth allocation, with Cloud provider-native tools:

AWS VPC Flow Logs: Capture information about network traffic in your VPC, allowing you to analyze traffic patterns and detect bottlenecks.
Azure Network Watcher: Monitor network performance metrics like throughput, packet drops, and latency for Azure resources.

or Third-party tools:

SolarWinds Network Performance Monitor: Gain visibility into network performance across multiple cloud providers and receive alerts for potential issues.
ThousandEyes: Monitor network performance, visualize end-to-end network paths, and detect bottlenecks or traffic anomalies.
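As an example of traffic analysis, the Python sketch below aggregates bytes per source/destination pair from VPC Flow Logs records, assuming the default (version 2) space-separated record format. The sample records are hypothetical, and real logs would typically be read from CloudWatch Logs or S3 rather than from an in-memory list.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Field order of the default VPC Flow Logs record format (version 2).
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def top_talkers(
    lines: Iterable[str], limit: int = 3
) -> List[Tuple[Tuple[str, str], int]]:
    """Return the (source, destination) pairs that transferred the most bytes."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip headers or custom-format lines
        record = dict(zip(FIELDS, parts))
        if record["log_status"] != "OK":
            continue  # NODATA and SKIPDATA records carry "-" instead of numbers
        totals[(record["srcaddr"], record["dstaddr"])] += int(record["bytes"])
    return totals.most_common(limit)


if __name__ == "__main__":
    sample = [  # hypothetical records
        "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
        "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 5 1000 1418530070 1418530130 ACCEPT OK",
        "2 123456789010 eni-1235b8ca123456789 10.0.0.5 172.31.16.21 49152 443 6 3 300 1418530010 1418530070 ACCEPT OK",
        "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
    ]
    for (src, dst), total in top_talkers(sample):
        print(f"{src} -> {dst}: {total} bytes")
```

The heaviest pairs are the first candidates to review for cross-zone or cross-region traffic that could be avoided.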

Storage capacity

Review storage usage to detect unused or underused volumes that can be deleted or resized.

AWS Storage Lens can help you get insights into your S3 storage usage, analyze trends, and identify potential cost savings by optimizing storage.
Azure Storage Metrics: Monitor storage account capacity and usage patterns, and receive alerts when capacity thresholds are exceeded.

Third-party tools like CloudHealth and NetApp Cloud Insights deliver the most value in multi-cloud or hybrid set-ups.
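Detecting unused volumes can be sketched in Python as below. The input mirrors the `Volumes` list returned by the EC2 `describe_volumes` API (fields `VolumeId`, `Size` in GiB, `State`, `Attachments`); the sample volumes are hypothetical, and the listing call itself is left out so the example runs without credentials.

```python
from typing import Dict, List


def find_unattached_volumes(volumes: List[Dict]) -> List[Dict]:
    """Return volumes that exist but are attached to no instance."""
    return [
        volume
        for volume in volumes
        # An EBS volume in the "available" state is provisioned but unattached.
        if volume.get("State") == "available" and not volume.get("Attachments")
    ]


def unattached_gib(volumes: List[Dict]) -> int:
    """Total provisioned size (GiB) of unattached volumes."""
    return sum(volume["Size"] for volume in find_unattached_volumes(volumes))


if __name__ == "__main__":
    sample = [  # hypothetical volumes
        {"VolumeId": "vol-1", "Size": 100, "State": "in-use",
         "Attachments": [{"InstanceId": "i-1"}]},
        {"VolumeId": "vol-2", "Size": 500, "State": "available", "Attachments": []},
        {"VolumeId": "vol-3", "Size": 50, "State": "available", "Attachments": []},
    ]
    idle = find_unattached_volumes(sample)
    print([v["VolumeId"] for v in idle], unattached_gib(sample), "GiB")
```

Before deleting anything flagged this way, a snapshot is a cheap safety net in case a volume turns out to hold data that is still needed.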

Strategies for Eliminating Cloud Waste

Effectively eliminating cloud waste requires a multi-faceted approach that combines several strategies, tools, and best practices to achieve optimal resource utilization and cost efficiency.

Those strategies include:

Right-sizing instances is a vital first step towards eliminating cloud waste.
Implementing auto-scaling allows for dynamic resource adjustment based on demand.
Optimizing cloud storage and improving network efficiency can bring substantial savings, notably on data transfer costs.
Modern architectural patterns such as serverless computing and containerization can minimize resource consumption.

Right-Sizing Instances

Ensure that your instances are appropriately sized for your workloads:

Matching workload requirements: Understand the performance requirements of your applications and choose instance types accordingly.

Pro Tip: A cheat sheet of AWS instance families is handy when choosing an instance family; the AWS EC2 instance types page listed in the Sources is a good starting point.

Evaluating performance metrics: Regularly review CPU, memory, and network usage to identify underutilized instances that can be downscaled.
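A simple right-sizing rule can be sketched in Python as follows. It assumes a size ladder within one instance family where each step doubles capacity, so moving one size down roughly doubles utilization for the same load. The ladder, the 70% target, and the function name are assumptions for illustration; the peak passed in should be the higher of the CPU and memory peaks.

```python
from typing import List

# Sizes within one instance family; each step up is assumed to double capacity.
SIZE_LADDER: List[str] = ["large", "xlarge", "2xlarge", "4xlarge"]


def recommend_size(
    current_size: str, peak_utilization_pct: float, target_pct: float = 70.0
) -> str:
    """Step down the ladder while projected peak utilization stays under target."""
    index = SIZE_LADDER.index(current_size)
    projected = peak_utilization_pct
    # Halving capacity doubles utilization for the same workload.
    while index > 0 and projected * 2 <= target_pct:
        index -= 1
        projected *= 2
    return SIZE_LADDER[index]


if __name__ == "__main__":
    # A 4xlarge peaking at 15% can drop two sizes and still peak at 60%.
    print(recommend_size("4xlarge", peak_utilization_pct=15.0))
    # A 2xlarge peaking at 50% would exceed the target if halved, so it stays.
    print(recommend_size("2xlarge", peak_utilization_pct=50.0))
```

Using the peak rather than the average keeps headroom for bursts, at the price of more conservative recommendations.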

Implementing Auto-Scaling

Auto-scaling can help you dynamically adjust your resources based on demand:

Configuring auto-scaling groups: Set up auto-scaling groups to manage instances that share similar scaling requirements, and define policies based on factors like CPU utilization, network throughput, or custom metrics to trigger scaling events.
Monitoring scaling events: Keep track of scaling activities to evaluate the effectiveness of your auto-scaling configurations.

Pro Tip: In AWS, create two CloudWatch alarms, one for scaling out (metric high) and one for scaling in (metric low). When the threshold of an alarm is breached, for example because traffic has increased, the alarm goes into the ALARM state. This state change invokes the scaling policy associated with the alarm. The policy then instructs Amazon EC2 Auto Scaling how to respond to the alarm breach, such as by adding or removing a specified number of instances.

Scheduling downscaling and upscaling: Schedule downscaling during periods of low demand and upscaling during periods of high demand to ensure efficient resource allocation.

Pro Tip: A third-party solution like MemVerge can improve scalability by scaling up or down during the runtime of a job. MemVerge Memory Engine Cloud Edition monitors the CPU and memory usage of your VMs in real time and scales up or down without interrupting the workload. By leveraging Spot instances, it is advertised as offering 30% better performance at a 90% lower cost, which is far superior to what native scaling solutions can offer, in particular for High-Performance Computing and High-Throughput Computing workloads. It can be tried for free for 60 days.
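The two-alarm pattern described in the AWS Pro Tip above (one alarm to scale out, one to scale in) can be simulated in a few lines of Python. This is a simplified model, not the behavior of Amazon EC2 Auto Scaling itself: real alarms evaluate a metric over several periods and policies apply cooldowns, and every threshold and size below is hypothetical.

```python
from typing import Iterable, List


def next_capacity(
    current: int,
    cpu_pct: float,
    high: float = 70.0,  # scale-out alarm threshold (assumed)
    low: float = 30.0,   # scale-in alarm threshold (assumed)
    step: int = 1,
    min_size: int = 1,
    max_size: int = 10,
) -> int:
    """Apply a simple step policy, bounded by the group's min and max size."""
    if cpu_pct > high:
        return min(current + step, max_size)  # "metric high" alarm fires
    if cpu_pct < low:
        return max(current - step, min_size)  # "metric low" alarm fires
    return current  # neither alarm is in the ALARM state


def simulate(cpu_series: Iterable[float], start: int = 2, **policy) -> List[int]:
    """Return the group capacity after each CPU observation."""
    capacities = []
    current = start
    for cpu_pct in cpu_series:
        current = next_capacity(current, cpu_pct, **policy)
        capacities.append(current)
    return capacities


if __name__ == "__main__":
    traffic = [80.0, 85.0, 50.0, 20.0, 10.0, 10.0]  # hypothetical CPU readings
    print(simulate(traffic, start=2, min_size=1, max_size=4))
```

Replaying historical metrics through a model like this is a quick way to sanity-check thresholds before applying them to a production group.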

Optimizing Storage

Improve your cloud storage efficiency by:

Identifying unused or underused storage: Detect and delete unattached volumes or resize overprovisioned volumes to save costs.
Implementing data lifecycle policies: Automate data management by defining rules for data retention, archival, and deletion based on age, access patterns, or other criteria.
Compressing and archiving data: Compress data to reduce storage requirements and use archival storage classes for infrequently accessed data to lower costs.

Pro Tip: Use the S3 Intelligent-Tiering storage class when managing AWS S3 buckets. It consistently moves your objects to the most cost-effective access tier, saving you money as well as time that you can spend designing and building new features and services instead of archiving objects.
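For the data lifecycle policies mentioned above, here is a hedged Python sketch that builds an S3 lifecycle configuration moving objects to cheaper storage classes as they age and then expiring them. The rule ID, prefix, and day counts are hypothetical. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, which is shown in a comment but not called.

```python
def build_lifecycle_configuration(
    prefix: str,
    rule_id: str = "tier-then-expire",  # hypothetical rule name
    ia_after_days: int = 30,
    archive_after_days: int = 90,
    expire_after_days: int = 365,
) -> dict:
    """Build an S3 lifecycle configuration: infrequent access, archive, delete."""
    if not 0 < ia_after_days < archive_after_days < expire_after_days:
        raise ValueError("day thresholds must be positive and strictly increasing")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},  # only objects under this prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": archive_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


if __name__ == "__main__":
    config = build_lifecycle_configuration(prefix="logs/")
    print(config)
    # With boto3 and credentials, the rule could be applied with:
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="my-example-bucket", LifecycleConfiguration=config)
```

Explicit rules like these suit data with a predictable life (logs, backups), whereas Intelligent-Tiering fits objects whose access pattern is unknown.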

Enhancing Network Efficiency

Optimize your network infrastructure to reduce waste and watch out for Data Transfer costs:

Load balancing: Distribute traffic across multiple instances to ensure optimal resource utilization and improve application performance.
Content delivery networks (CDNs): Leverage CDNs to cache content closer to users, reducing data transfer costs and improving latency. Amazon CloudFront is a managed CDN service with tiered data transfer pricing to optimize costs.
Traffic management: Implement traffic routing policies to direct users to the nearest or most optimal instances, reducing latency and network costs.

Pro Tip: Beyond the network efficiency techniques above, AWS offers specific ways to reduce data transfer costs, including leveraging VPC peering and minimizing NAT Gateway costs. OptimNow will elaborate on those techniques in a future post.

Adopting Serverless and Containerization

Utilize modern architectural patterns to minimize resource consumption:

Serverless computing: Adopt serverless platforms (e.g., AWS Lambda, Azure Functions) to execute code on-demand, only paying for the actual compute time used.
Container orchestration: Use container orchestration platforms (e.g., Kubernetes) and CSP managed services (ECS, EKS for AWS) to manage containerized applications, improving resource utilization and scalability.
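The "pay only for the compute time used" argument for serverless can be made concrete with a small break-even calculation in Python. All prices and volumes below are hypothetical placeholders, not actual AWS or Azure rates; plug in the figures from your provider's pricing page.

```python
def pay_per_use_cost(
    invocations: float,
    avg_duration_s: float,
    price_per_second: float,
    price_per_request: float = 0.0,
) -> float:
    """Monthly cost of a function billed per request and per second of execution."""
    return invocations * (avg_duration_s * price_per_second + price_per_request)


def break_even_invocations(
    always_on_monthly_cost: float,
    avg_duration_s: float,
    price_per_second: float,
    price_per_request: float = 0.0,
) -> float:
    """Monthly invocations at which pay-per-use costs as much as an always-on server."""
    per_invocation = avg_duration_s * price_per_second + price_per_request
    if per_invocation <= 0:
        raise ValueError("the cost per invocation must be positive")
    return always_on_monthly_cost / per_invocation


if __name__ == "__main__":
    # Hypothetical figures: a $60/month instance versus a function that runs
    # 0.5 s per call at $0.0001 per second of execution.
    threshold = break_even_invocations(60.0, 0.5, 0.0001)
    print(f"Break-even at about {threshold:,.0f} invocations per month")
    print(f"Cost at 100,000 calls: ${pay_per_use_cost(100_000, 0.5, 0.0001):.2f}")
```

Below the break-even volume the always-on server is mostly paying for idle time, which is exactly the kind of waste serverless removes; above it, a well-utilized instance or container may be cheaper.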

Pro Tip: Karpenter is an open-source autoscaling solution for Kubernetes that brings advancements in node management. It simplifies and automates node management by directly communicating with the AWS EC2 Fleet API, eliminating the need for node group abstractions. Compared to existing solutions like Cluster Autoscaler, Karpenter offers intelligent scaling, cluster awareness, customizable configurations, and seamless integration with existing workflows.

The value Karpenter brings includes optimal resource utilization, customizable scheduling configurations, cost savings, fine-grained control over downscaling, AWS integration, and built-in spot capabilities. These features allow Karpenter to efficiently scale workloads and reduce costs by optimizing resource allocation based on application requirements.

However, Karpenter has some limitations, such as the inability to optimize spend based on existing commitments (e.g., savings plans or reserved instances), failure to reconsider spot prices, complexity in configuration, and a short notice period for spot terminations. These challenges may require users to have significant technical knowledge and expertise to configure and manage Karpenter effectively.

Conclusion

Eliminating cloud waste is a crucial responsibility for cloud professionals to ensure cost-effectiveness, environmental sustainability, and optimal user experience. By understanding the primary categories of cloud waste and leveraging appropriate monitoring tools, companies can identify inefficiencies and take decisive action. Strategies such as right-sizing instances, implementing auto-scaling, optimizing storage and network efficiency, and adopting serverless computing and containerization help minimize resource consumption and improve overall cloud performance. With a proactive approach to reducing cloud waste, organizations can not only save billions of dollars annually but also contribute to a greener, more sustainable future.

Sources:

https://techmonitor.ai/technology/cloud/cloud-spending-wasted-oracle-computing-aws-azure

https://siliconangle.com/2022/10/31/gartner-spending-public-cloud-services-will-exceed-591b-2023/#:~:text=Gartner%3A%20Spending%20on%20public%20cloud%20services%20will%20exceed%20%24591B%20in%202023,-by%20Maria%20Deutscher

https://thereader.mitpress.mit.edu/the-staggering-ecological-impacts-of-computation-and-the-cloud/#:~:text=The%20Cloud%20now%20has%20a,as%20much%20as%2025%20percent.

https://aws.amazon.com/fr/ec2/instance-types/

https://docs.aws.amazon.com/awssupport/latest/user/get-started-with-aws-trusted-advisor.html

https://cast.ai/the-state-of-kubernetes-overprovisioning/

https://karpenter.sh/

and OptimNow GPT alpha ;-)