I would like to present one of my favorite cloud discussions: cloud cost monitoring for IT departments working in the cloud. In the following post, I will cover the evolution of IT monitoring and the importance of cloud cost metrics.
IT Monitoring in the Past
If you look at the evolution of traditional monitoring systems, you will see that they started with applications like BMC Patrol, IBM Tivoli, Microsoft Operations Manager (MOM), and HP OpenView, which monitored particular infrastructure components, including hosts, applications, and specific infrastructure devices. The idea was that monitoring would detect malfunctions in the infrastructure components that mattered to IT operations. Traditional monitoring focuses on inventory and performance levels for specific infrastructure components, using custom-configured IT management applications. When a threshold is exceeded, the monitoring system sends an alert to the respective IT support staff, who determine the severity of the situation. In the past, this method worked well for traditional IT environments.
Later on, a new layer was introduced on top of this system that worked as an enterprise console. It enabled IT support staff to see a consolidated view of their systems rather than each and every server. Nonetheless, there was still a missing link connecting IT infrastructure events to the actual business.
Linking with the Actual Business
Consequently, BMC introduced BSM (Business Service Management), which made the link between IT and the actual business a reality. This approach ensured that business objectives would be used when defining an IT environment, in line with the ITIL methodology, which focuses on aligning IT services with primary business needs.
This new approach includes monitoring business services while linking them to their related infrastructure components. For example, a business transaction behind a single URL might span several infrastructure components, such as web servers, app servers, or database servers. If one of these components fails, the whole business service fails. The basic assumption here is that a single component failure has a direct impact on the service. This was a tremendous step forward for monitoring: from watching a collection of unrelated infrastructure components to being able to show the impact of specific errors or outages on particular business activities or application service levels.
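The BSM assumption above can be sketched in a few lines of Python. This is purely illustrative (the class and names are my own, not any vendor's API): a business service depends on a set of components, and one unhealthy component marks the whole service as down.

```python
class BusinessService:
    """Toy model of the BSM view: a service and its infrastructure components."""

    def __init__(self, name, components):
        self.name = name
        # component name -> healthy? (all start healthy)
        self.components = dict.fromkeys(components, True)

    def report_failure(self, component):
        self.components[component] = False

    def is_up(self):
        # The classic BSM assumption: every component must be healthy
        # for the business service to be up.
        return all(self.components.values())


checkout = BusinessService("checkout", ["web-server", "app-server", "database"])
print(checkout.is_up())          # service is up while all components are healthy
checkout.report_failure("database")
print(checkout.is_up())          # one failed component fails the whole service
```

The point of the sketch is the `all(...)` line: in the pre-cloud world, service health was simply the conjunction of component health.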
Nevertheless, if a defined metric (operational or business) exceeds its given threshold, the system sends an alert to the respective support staff, who can then take action according to the severity of the event.
Then Came the Cloud
A great change took place over the last few years with the introduction of the cloud. Cloud architecture provides a level of redundancy that pretty much guarantees that no single infrastructure component failure will take a service down. Environments such as Amazon's provide the building blocks and mechanisms needed to support high availability, redundancy, and scalability. With features like multiple regions, comprehensive load balancing, and auto-scaling, the element of failure in the cloud is nearly eliminated – everything can automatically scale to meet demand. A proper cloud architecture, with the right automation and DR mechanisms, will in most cases not be impacted by a single IT component failure. Hence, the BSM approach is more or less not applicable in the cloud.
Let's take the scenario of a buggy application update, for example. A software update that leads to high resource consumption in the cloud no longer runs the risk of causing a failure. Auto-scaling kicks in and provisions additional compute resources, making sure that business service metrics don't exceed their given thresholds. Another scenario could be a region going down. The cloud's DR mechanism moves applications and data to another region, making sure there is no impact on the business. In both cases, however, there is a clear impact on the cost of the environment, be it paying for additional servers or for data transfer (moving to another region, in the case of AWS).
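A toy calculation makes the first scenario concrete. Assume (all numbers here are invented for illustration) that a buggy update doubles the load, and that auto-scaling sizes the fleet to keep average utilization below a target. The failure never reaches the service metrics; it surfaces as extra instance-hours on the bill.

```python
import math

def instances_needed(load, capacity_per_instance, target_utilization=0.7):
    # Smallest fleet that keeps average utilization at or below the target.
    return math.ceil(load / (capacity_per_instance * target_utilization))

HOURLY_PRICE = 0.10  # hypothetical on-demand price per instance-hour

before = instances_needed(load=100, capacity_per_instance=50)
after = instances_needed(load=200, capacity_per_instance=50)   # buggy update doubles load

print(before, after)  # the fleet grows; service metrics stay within thresholds
print(f"extra cost/hour: ${(after - before) * HOURLY_PRICE:.2f}")
```

The service stays healthy in both columns of this calculation; the only signal that anything went wrong is the delta in cost.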
So, aside from its value on a financial level, the cost of your cloud operations is one of the most important operational metrics (are you surprised?). Cloud spending behavior can actually be used as an operational metric to gauge the maturity level of an environment, or to detect operational issues and abnormal behavior. As mentioned at the beginning of this post, one of the most interesting shifts that needs to be made when operating in the cloud is a shift in state of mind and perception. This is one of Cloudyn's main principles.
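Treating spend as an operational signal can be as simple as flagging a day whose cost deviates sharply from the recent baseline. The sketch below is one possible approach (a z-score check with made-up numbers); a real system would pull daily costs from the provider's billing data rather than a hard-coded list.

```python
import statistics

def is_spend_anomaly(history, today, z_threshold=3.0):
    """Flag today's spend if it deviates from the baseline by more than
    z_threshold standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Illustrative week of daily spend in dollars
daily_spend = [120.0, 118.5, 121.2, 119.8, 122.4, 120.9, 119.3]

print(is_spend_anomaly(daily_spend, 121.0))  # an ordinary day
print(is_spend_anomaly(daily_spend, 180.0))  # e.g. a runaway auto-scale event
```

A spike like the second case doesn't necessarily mean anything is broken, which is exactly the point: the alert asks "why did we suddenly pay more?" instead of "which server is down?".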
Having covered the cloud cost metric as an indicator of the health and efficiency of your deployment, it is crucial to understand that the optimal cloud operations equation also includes considerations such as deployment events and a number of other business metrics. In part 2 of this series, we will further explore the considerations and features needed for a healthy cloud environment.