Over the last few weeks, I have found a new type of crumb to clean up: cloud crumbs. Like my kids’ crumbs, they are the result of an overly eager appetite for a very desirable product: cloud infrastructure. Often the crumbs are just pieces of infrastructure left behind by error or oversight - e.g. a stripe of a disk, a database backup, a volume snapshot. In some cases, though, I have found full, working infrastructure. And unlike my kids’ crumbs, these leftovers have a lasting consequence: lost money.
I have found the following types of crumbs in my cloud kitchen:
- Formerly used infrastructure - This is infrastructure that once was used, but now no longer has a purpose. For example, this could be a server you provisioned to handle a peak load but then forgot to spin down after the load had passed.
- Infrastructure remnants - In the process of spinning infrastructure up and down, pieces sometimes get left behind. I recently found a handful of volumes whose deletion had failed due to a defect with one of our cloud providers.
- Under-utilized infrastructure - This is infrastructure that has been over-provisioned, either by accident or deliberately. I recently found a cluster of servers running with more compute capacity than required to fulfill their current function.
There are many reasons for cloud cookie crumbs: operator error, a defect in an internal automation framework, or even a defect in a cloud provider’s service. At a small scale, these crumbs cost you pennies per hour - quite likely not worth your attention. But at large scale, these pennies become dollars, and can accumulate to represent tens of thousands of dollars of additional cloud costs per month or year.
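To make the scaling concrete, here is a back-of-the-envelope sketch. The $0.05 hourly rate and the crumb counts are illustrative assumptions, not figures from any provider's price list, and `monthly_cost` is a hypothetical helper name.

```python
# Assumption: an average month has about 730 hours (8,760 hours per year / 12).
HOURS_PER_MONTH = 730


def monthly_cost(hourly_rate_usd: float, resource_count: int) -> float:
    """Monthly cost of leaving `resource_count` forgotten resources running."""
    return hourly_rate_usd * HOURS_PER_MONTH * resource_count


# Assumption: every crumb costs an illustrative $0.05 per hour.
for count in (1, 50, 500):
    cost = monthly_cost(0.05, count)
    print(f"{count:>4} crumbs at $0.05/hour: ${cost:,.2f} per month")
```

At this assumed rate a single crumb is easy to ignore, while a few hundred of them land in the tens of thousands of dollars per month.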
The answer to managing cookie crumbs in the cloud and at home is the same: vigilance. Make sure you and your team have tools that give you full visibility into your infrastructure, with a view that lets you clearly identify the purpose of each piece of infrastructure within your application or service. Audit your infrastructure regularly to ensure you are running only what you need to be running. Identify and fix the defects in your process or automation that produce cloud crumbs.
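A regular audit along these lines can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a real provider integration: the `Resource` record, its `purpose` tag, the CPU thresholds, and the sample inventory are all hypothetical, and in practice the inventory would come from your cloud provider's API or your own tooling. It only flags candidates for a human to review; it deletes nothing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: thresholds are illustrative and should be tuned per workload.
IDLE_CPU_PERCENT = 1.0
UNDER_UTILIZED_CPU_PERCENT = 20.0


@dataclass
class Resource:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    kind: str                                # e.g. "server", "volume", "snapshot"
    purpose: Optional[str]                   # tag stating why the resource exists
    attached: bool                           # in use by a running service?
    avg_cpu_percent: Optional[float] = None  # None for non-compute resources


def classify(resource: Resource) -> Optional[str]:
    """Return the crumb type a resource looks like, or None if it looks fine."""
    # Storage that nothing is attached to is likely a leftover piece.
    if resource.kind != "server" and not resource.attached:
        return "remnant"
    # No stated purpose: treat it as a candidate for formerly used infrastructure.
    if resource.purpose is None:
        return "formerly used"
    cpu = resource.avg_cpu_percent
    if cpu is not None:
        if cpu < IDLE_CPU_PERCENT:
            return "formerly used"
        if cpu < UNDER_UTILIZED_CPU_PERCENT:
            return "under-utilized"
    return None


def audit(inventory: List[Resource]) -> Dict[str, List[str]]:
    """Group the names of suspicious resources by crumb type."""
    findings: Dict[str, List[str]] = {}
    for resource in inventory:
        label = classify(resource)
        if label is not None:
            findings.setdefault(label, []).append(resource.name)
    return findings


# Hypothetical sample inventory.
inventory = [
    Resource("web-1", "server", "frontend", True, 55.0),
    Resource("peak-7", "server", "peak load", True, 0.3),
    Resource("batch-3", "server", "nightly batch", True, 12.0),
    Resource("vol-old", "volume", None, False),
]
print(audit(inventory))
```

The check for a missing `purpose` tag is the part that depends most on team discipline: it only works if every resource is tagged with its purpose when it is provisioned.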
Having run engineering teams in a Fortune 100 company, I am intimately familiar with the long lead times and institutional bureaucracy required to provision physical infrastructure. The ability to spin up large and complex environments in minutes, with no advance planning, is a disruptive change in our industry. But with this great power comes the great responsibility to... well, clean up after your kids.