The now-common practices of virtualization and cloud computing grew out of setbacks encountered in traditional IT environments. In the not-so-distant past, hardware and software were tightly coupled: large machines housed sensitive, complex hardware and software configurations. Consequently, enterprises used their hardware sparingly, leaving machines untouched for years for fear of breaking the applications they hosted. I was once responsible for a product that required cross-platform compilation against multiple platforms, such as HP Unix, Sun Solaris, and Windows IDX, to name a few. We had to cross-compile on every single platform, every quarter, with no option to touch the environment between system upgrades. Moreover, finding an available server to run tests on was quite a challenge, yet purchasing enough of these massive servers would have inflated any capital expenditure budget for years. At the time, there was no solution to this problem. That is, until virtualization and the cloud came along.
Virtualization: A Great Step Forward?
Virtual machines posed a viable solution to this dilemma, decoupling hardware from software for the first time. Suddenly, users were freed from the chains that bound them to particular pieces of hardware and could reuse servers and compute capacity for various purposes. Virtualization and software testing go hand in hand: since QA requires the majority of the machines in a development environment, virtualization simplified environment replication and released hardware procurement bottlenecks. The result was user flexibility and a high level of efficiency.
However, with machines left running unattended and without follow-up, a new problem arose: virtual machine sprawl. This phenomenon turned out to be even more severe than the original dilemma. Spinning up virtual machine instances was such a simple task that they were often forgotten, keeping average resource utilization at a low 15%, similar to that of the traditional world.
The Cloud Sprawl!
So how could this issue be dealt with? The common perception was that if non-IT employees managed virtual machines themselves, it would be far simpler to control usage and drive better utilization. The cloud was intended to solve this problem; instead, many enterprises and organizations today mostly exploit its seemingly limitless flexibility, derived from its self-service approach and endless capacity. The cloud provided tools that automatically launched, reconfigured, and moved instances from one point to another. But because so much happened simultaneously behind the scenes, including inadequate server sizing and placement, a new phenomenon emerged: the cloud sprawl.
Beyond self-service features, the cloud and its APIs extended automated resource provisioning. Since many types of automation exist, let's first clarify that we are talking about the type that handles changes within a workload, such as autoscaling. In web-scale production environments, autoscaling works well, but it does tend to cause cloud sprawl at times: virtual machines are launched in large numbers, not because of workload divisions but because of spikes in demand, while others are left idle. Returning to the example above, QA environments run different tests in line with traditional Waterfall or modern Agile methodologies. There are sprints, or development cycles, and every cycle requires some sort of new test, functional or regression, in which many machines are started and far fewer are eventually shut down. As a result, cloud sprawl became even more severe, and the idea of reaching the perfect CPU hour, 100% CPU utilization per hour, seemed impossible.
In the end, the average CPU utilization rate needs to be assessed by how much it increases over time. Some even claim that 50% CPU utilization can be achieved with the cloud; generally, however, this does not seem to be the case (at least for now). Nonetheless, compute utilization in the cloud can most likely reach higher rates than on traditional hardware. In an automated cloud environment, compute utilization and metering need to be tracked, monitored, and enhanced in order to avoid proliferation.
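As a minimal sketch of what tracking that average might look like, the snippet below computes per-instance average CPU from metering samples and flags anything sitting at or below the 15% figure mentioned earlier. The instance names and sample values are invented for illustration:

```python
# Sketch: assessing average CPU utilization per instance against a
# threshold. The 15% default echoes the article; the fleet data is
# purely illustrative.

def average_utilization(samples):
    """Mean CPU utilization (%) over a series of metering samples."""
    return sum(samples) / len(samples)

def flag_underutilized(instances, threshold=15.0):
    """Return names of instances whose average CPU sits at or below
    the threshold -- candidates for consolidation or shutdown."""
    return [name for name, samples in instances.items()
            if average_utilization(samples) <= threshold]

# Hourly CPU samples (%) for three hypothetical instances.
fleet = {
    "web-1": [60, 70, 55, 65],   # busy production box
    "qa-3":  [5, 3, 2, 6],       # forgotten test machine
    "db-1":  [40, 45, 50, 42],
}

print(flag_underutilized(fleet))  # -> ['qa-3']
```

Tracking this number run over run is what tells you whether your consolidation efforts are actually moving the average upward.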
Monitor, Tag and Act
You cannot automate chaos. First, define your convention and follow it; once it is in place, you can define the rules. If you cannot understand a server's purpose, the simplest thing to do is kill it. A more careful alternative is to distinguish which instances are actually required and which can be safely killed.
Segmenting the environment is important and requires specific procedures, such as instance tagging. For example, every machine in Cloudyn's environment has at least two tags. The first is “Stack” (production, development, tests, and so on) and the second is “Purpose” (how the machine is used: database, back-end, front-end, and so forth). Segmenting your environment tells you which areas serve which tasks (production, development, or testing), allowing you to shut down the servers that are not in use.
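The two-tag convention above can be enforced mechanically. The sketch below, with an invented fleet and tag values (this is not Cloudyn's actual schema), finds machines that break the convention and segments the rest by stack:

```python
# Sketch of the "Stack"/"Purpose" tagging convention described above.
# Instance records and tag values are hypothetical.

REQUIRED_TAGS = ("Stack", "Purpose")

def untagged(instances):
    """Instances missing a required tag -- they break the convention
    and cannot be handled safely by automation."""
    return [i["id"] for i in instances
            if not all(t in i.get("tags", {}) for t in REQUIRED_TAGS)]

def segment(instances, stack):
    """All instances belonging to one stack (production, tests, ...)."""
    return [i["id"] for i in instances
            if i.get("tags", {}).get("Stack") == stack]

fleet = [
    {"id": "i-01", "tags": {"Stack": "production", "Purpose": "database"}},
    {"id": "i-02", "tags": {"Stack": "tests", "Purpose": "back-end"}},
    {"id": "i-03", "tags": {"Stack": "tests"}},   # missing "Purpose"
    {"id": "i-04", "tags": {}},                   # untagged entirely
]

print(untagged(fleet))          # -> ['i-03', 'i-04']
print(segment(fleet, "tests"))  # -> ['i-02', 'i-03']
```

Anything surfaced by `untagged` is exactly the "server whose purpose you can't understand" from the previous paragraph.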
Recently, I read an article on the Ravello Systems blog describing how their “Cleaner” works. It runs every six hours and shuts down unused instances (ones that weren't tagged to stay), which is a smart, simple way to keep cloud costs under control. As in the case of Ravello Systems' “Cleaner”, you can kill unused dev/test instances, but a better solution is to continuously monitor and track utilization. Then you can act according to a defined utilization threshold, getting rid of unused capacity and consolidating resources in order to reshape your average compute utilization.