The good news is that with a little work, you can take control of your cloud costs and realize the full benefits of elasticity. The bad news is: you’re going to have to roll up your sleeves and put in some sweat equity.
Here is my five-step process for managing cloud costs:
Step 1 - Gain Visibility
The management of cloud costs starts with visibility into your infrastructure. Unfortunately, cloud consoles, such as the AWS Console, provide a very limited view of your infrastructure. If you want to manage costs effectively, you need to be able to answer application-specific questions such as:
- What is the current configuration of nodes in a functional cluster?
- What is the cost of a functional cluster?
- What are my current costs by type of infrastructure (e.g. EC2, EBS, Snapshot, RDS)?
- What are my current costs by functional cluster?
- What infrastructure did I launch and/or shutdown this week?
- What is the resource utilization (e.g. average, minimum and maximum disk utilization) of the nodes in a functional cluster?
Third-party management frameworks and console features such as tagging can help but, as of 2011, still provide a limited view at best. My recommendation: invest in an internal application that gives you the application-level view you need to manage costs.
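As a rough illustration of what such an internal view might compute, here is a minimal sketch that rolls an inventory up by functional cluster and by type of infrastructure. The `Resource` record, its field names, the resource IDs and the hourly rates are all assumptions made for the example; a real tool would populate the inventory from your provider's API and billing data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One billable item in the inventory (all fields are illustrative)."""
    resource_id: str
    cluster: str        # functional cluster, e.g. "web" or "search"
    kind: str           # type of infrastructure, e.g. "EC2", "EBS", "RDS"
    hourly_cost: float  # hypothetical rate in dollars per hour


def cost_by(resources, key):
    """Sum hourly cost grouped by the named Resource attribute."""
    totals = defaultdict(float)
    for r in resources:
        totals[getattr(r, key)] += r.hourly_cost
    return dict(totals)


# Hypothetical inventory; the IDs and rates are made up for the example.
inventory = [
    Resource("i-001", "web", "EC2", 0.25),
    Resource("vol-001", "web", "EBS", 0.125),
    Resource("i-002", "search", "EC2", 0.5),
    Resource("db-001", "search", "RDS", 0.25),
]

by_cluster = cost_by(inventory, "cluster")  # cost of each functional cluster
by_kind = cost_by(inventory, "kind")        # cost by type of infrastructure
```

The same grouping function answers both cost questions above; the launch history and utilization questions would need additional fields (launch time, utilization samples) on each record.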
Step 2 - Define Blueprint
Now that you have visibility into your infrastructure, your next objective is to standardize it. Most cloud applications are developed incrementally, resulting in a proliferation of non-standard configurations. This is often exacerbated by the lack of a consolidated view, which lets these variations go unnoticed. For example, you may have started using a c1.medium instance with 5GB EBS as the default node in your web server clusters, but modified this configuration over time based on business needs.
To manage costs, you will need to formalize the blueprint, or reference architecture, for how you deploy and configure your application on the cloud. A successful reference architecture should be precise, covering the specifics of the infrastructure you will use, its costs, and its projected capacity. The first version of a reference architecture is often based on known best practices and experience, but it will eventually need to be driven by tested metrics.
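One way to keep a reference architecture precise is to write it down as data. The sketch below records the infrastructure, cost and capacity of a standard node and projects the cost of serving a target load. The `NodeBlueprint` class, the hourly rate and the per-node capacity figure are hypothetical placeholders; in practice the capacity number would come from your own testing.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeBlueprint:
    """Standard node for one functional cluster (values are illustrative)."""
    role: str
    instance_type: str
    ebs_gb: int
    hourly_cost: float        # hypothetical rate in dollars per hour
    requests_per_second: int  # hypothetical tested capacity of one node


def nodes_needed(blueprint, target_rps):
    """Smallest whole number of nodes that covers the target load."""
    return max(1, math.ceil(target_rps / blueprint.requests_per_second))


def projected_monthly_cost(blueprint, target_rps, hours_per_month=730):
    """Projected monthly cost of a cluster sized for the target load."""
    return nodes_needed(blueprint, target_rps) * blueprint.hourly_cost * hours_per_month


# The c1.medium / 5GB EBS pairing comes from the example above;
# the cost and capacity figures are assumptions.
web_node = NodeBlueprint("web", "c1.medium", 5, 0.25, 200)

web_nodes = nodes_needed(web_node, 900)
web_cost = projected_monthly_cost(web_node, 900)
```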
Step 3 - Manage Capacity
If the reference architecture provides the blueprint for your infrastructure, a capacity management policy details how you will scale this blueprint based on business needs. A successful capacity management policy should define:
- Initial configuration for each functional area of an architecture
- The dimensions upon which each functional area will scale
- The criteria for scaling based on these dimensions (e.g. if disk usage exceeds 80%)
- The response to the criteria for scaling on each dimension (e.g. add an additional node to the cluster)
The key objective of a capacity management policy should be to operate your infrastructure with a target level of efficiency. The closer you can manage to 100% utilization of your cloud resources without impairing application availability or performance, the greater your cost efficiencies. Defining the policy will require gathering and analyzing data from your existing infrastructure, as well as performing targeted tests in non-production environments. Often the additional data will result in adjustments to the reference architecture.
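The dimension, criterion and response structure of such a policy can be captured quite directly in code. The following is a minimal sketch, not a full policy engine: `ScalingRule`, the dimension names and the CPU threshold are assumptions, and only the 80% disk usage criterion comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """One scaling criterion and the response it triggers (illustrative)."""
    dimension: str    # what is measured, e.g. "disk_usage"
    threshold: float  # fraction of capacity, 0.0 to 1.0
    response: str     # what to do when the threshold is exceeded


def evaluate(rules, observed):
    """Return the responses for every rule whose threshold is exceeded."""
    return [
        rule.response
        for rule in rules
        if observed.get(rule.dimension, 0.0) > rule.threshold
    ]


policy = [
    ScalingRule("disk_usage", 0.80, "add an additional node to the cluster"),
    ScalingRule("cpu_usage", 0.75, "move the cluster to a larger instance type"),  # assumed rule
]

# Hypothetical measurements for one functional cluster.
actions = evaluate(policy, {"disk_usage": 0.86, "cpu_usage": 0.40})
```

Keeping the policy as data makes it easy to review and to adjust as test results feed back into the reference architecture.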
Step 4 - Rightsize
With your reference architecture and capacity management policy defined, it is now time to rightsize. Using your consolidated view, you can now identify target infrastructure in one of these categories:
- Non-standard - infrastructure that deviates from the reference architecture
- Under-utilized - infrastructure that can be consolidated based on the capacity management policy
- Unused - infrastructure that is used infrequently or not at all
You then need to define a plan to rightsize the target infrastructure over time, minimizing impact to your customers. Depending on the scale of your infrastructure and its current efficiency, this may require days or weeks to execute.
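To make the three categories concrete, here is a small sketch that tags each node in a fleet. The `Node` record, the utilization cutoffs and the sample fleet are hypothetical; the c1.medium with 5GB EBS standard is borrowed from the earlier example, and real cutoffs would come from your capacity management policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Observed state of one node (fields are illustrative)."""
    node_id: str
    instance_type: str
    ebs_gb: int
    avg_utilization: float  # fraction of capacity, 0.0 to 1.0


STANDARD = ("c1.medium", 5)   # (instance type, EBS GB) from the reference architecture
UNDER_UTILIZED_BELOW = 0.30   # hypothetical cutoff
UNUSED_BELOW = 0.02           # hypothetical cutoff


def categories(node):
    """Return the rightsizing categories a node falls into, if any."""
    found = []
    if (node.instance_type, node.ebs_gb) != STANDARD:
        found.append("non-standard")
    if node.avg_utilization < UNUSED_BELOW:
        found.append("unused")
    elif node.avg_utilization < UNDER_UTILIZED_BELOW:
        found.append("under-utilized")
    return found


fleet = [
    Node("web-1", "c1.medium", 5, 0.65),
    Node("web-2", "m1.large", 20, 0.55),
    Node("web-3", "c1.medium", 5, 0.12),
    Node("web-4", "m1.large", 5, 0.0),
]

# Only nodes with at least one category become rightsizing targets.
targets = {n.node_id: categories(n) for n in fleet if categories(n)}
```

A node can land in more than one category, which is why the function returns a list rather than a single label.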
Step 5 - Optimize
This is the step that Sonian founder and CTO, Greg Arnett, likes to call “gaming the cloud.” Your ability to optimize costs will be limited by the options available from your cloud provider. Amazon is the most mature in this area, providing two options:
- Reserved instances - Reserved instances allow you to pay a reservation cost for your instances over a defined term (1 or 3 years), in exchange for guaranteed availability and a substantially reduced hourly cost. Reserved instances are useful for infrastructure expected to be always-on based on the current and projected infrastructure needs of your application. Typical savings can be 40% or more.
- Spot instances - Spot instances allow you to purchase temporary instances from a marketplace, with the price determined by demand. Spot instances are useful for on-demand infrastructure whose needs are short-term and that can tolerate unplanned termination (e.g. when the market price rises above your bid). Your ability to leverage spot instances will unfortunately be directly proportional to the elasticity of your architecture. Typical savings can be 20%, but can be substantially greater if your architecture enables pervasive use of spots.
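Whether a reservation pays off depends on how many hours the instance actually runs. The sketch below works through that arithmetic for a one-year term. All three prices are made-up numbers chosen to keep the math simple, not actual Amazon rates, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 8760


def reserved_vs_on_demand(on_demand_hourly, reserved_hourly, upfront_fee, hours_used):
    """Compare a reservation against on-demand pricing.

    Returns (fractional savings, hours of use at which the reservation
    starts to pay off).
    """
    on_demand_total = on_demand_hourly * hours_used
    reserved_total = upfront_fee + reserved_hourly * hours_used
    savings = 1 - reserved_total / on_demand_total
    break_even_hours = upfront_fee / (on_demand_hourly - reserved_hourly)
    return savings, break_even_hours


# Hypothetical prices: $0.25/hour on demand, or $219 up front plus
# $0.125/hour reserved, for an instance that runs the whole year.
savings, break_even_hours = reserved_vs_on_demand(0.25, 0.125, 219.0, HOURS_PER_YEAR)
```

With these assumed prices an always-on instance saves about 40% over the year, and the reservation breaks even after 1,752 hours, roughly 20% of the term. An instance that runs less than that is cheaper on demand.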
Unfortunately, cost management in the cloud requires continuous investment and vigilance. Each week, my organization is investing in new techniques, processes and tools to better manage our costs. Each month we are fine-tuning our reference architecture and capacity policies to enable us to run more efficiently. Just as with the application server market that followed the internet boom, a few years from now we will choose from a range of off-the-shelf solutions for cloud asset, capacity and cost management. In the meantime, roll up your sleeves and start making the investment in the tools and processes for managing your costs in the cloud.
This blog post is based on best practices derived from designing, deploying and operating infrastructure that utilizes over 4K cores of compute and over 1 petabyte of storage.