Monitoring in the Cloud


At yesterday's Advanced AWS meetup on monitoring, someone asked if they need something like New Relic if they already have Stackdriver. My answer was "yes", but I wanted to dive deeper into why I think that.

I consider there to be 3 different buckets/types of monitoring:

  • Infrastructure
  • Application
  • External

Infrastructure monitoring is focused on the instances, load balancers, etc. This includes SaaS offerings like Stackdriver and Datadog. The metrics they gather mainly concern CPU, memory, network and disk. To get memory (and some disk) metrics, OS-level integration is needed, which usually means installing an agent of some sort on each instance. If the monitoring software is really good, it will aggregate the metrics that make sense up to the service level. An example of this might be: number of requests per second from all production web instances. Since Stackdriver integrates directly with AWS APIs, it looks at the Auto Scaling Groups and does this automatically.
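To make the "aggregate up to the service level" idea concrete, here is a minimal Python sketch. The sample data, the service names and the `aggregate_by_service` helper are all made up for illustration; this is not how Stackdriver or any other provider actually implements it.

```python
from collections import defaultdict

# Hypothetical per-instance samples: (service, instance_id, requests_per_second)
samples = [
    ("prod-web", "i-0001", 120.0),
    ("prod-web", "i-0002", 95.5),
    ("prod-worker", "i-0003", 10.0),
]


def aggregate_by_service(samples):
    """Sum a per-instance metric up to the service level."""
    totals = defaultdict(float)
    for service, _instance_id, value in samples:
        totals[service] += value
    return dict(totals)


service_totals = aggregate_by_service(samples)
print(service_totals)
```

In practice the grouping key would come from something like the Auto Scaling Group an instance belongs to, rather than a hand-written label.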

Boundary, which I also put into the infrastructure bucket, takes a network-centric approach, monitoring every packet between every instance. This allows them to do a lot of things automatically as well, like mapping dependencies between services and highlighting performance bottlenecks.

These infrastructure monitoring solutions couldn't care less if you are running a Java web app, NodeJS or Ruby on Rails. This is where Application monitoring shines.

Providers like New Relic and AppDynamics will be able to tell you the response time of each method called in your MVC stack of choice during a request and which database queries are causing issues, and they offer stack trace analysis and more. These tools are commonly referred to as Application Performance Monitoring (APM). If you are trying to do root cause analysis, this is usually the tool you would turn to.


Finally, you have external monitoring - that is, using your web site/service from various points around the globe. Here you have no shortage of providers, from the old guys like Keynote and AlertSite (now Smartbear) to the newer fancy ones like Pingdom and Monitis. If it's basic, it might only give you ping times from a few locations around the world. If the service is more advanced, it might be able to log in to your website, execute JavaScript and make sure your AJAXy, Web 2.0 website is in full working order, much like feature specs would when developing Rails locally.

These are the three "what am I monitoring" buckets: the ingestion/input side of it.

For output, most of the providers have graphs, customizable dashboards, timelines, annotations, and other eye candy to put on the big monitor in the office.

Notice that I haven't mentioned alerting yet. Getting the information in is one thing, but knowing what to do with it is another. At the very basic level, you set up a min and max bound for the metrics of interest, and integrate your monitoring SaaS of choice into PagerDuty, with some sucker (sorry, an admin) on call.
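A min/max bound check is about as simple as alerting gets. The sketch below shows the idea in Python; the metric names, the numbers in `BOUNDS` and the `breaches` helper are hypothetical, and a real setup would hand the result to something like PagerDuty rather than print it.

```python
# Hypothetical bounds per metric: (min, max). The values are made up.
BOUNDS = {
    "cpu_percent": (0.0, 90.0),
    "requests_per_second": (5.0, 500.0),
}


def breaches(metrics, bounds=BOUNDS):
    """Return the names of metrics that fall outside their [min, max] band."""
    out = []
    for name, value in metrics.items():
        if name not in bounds:
            continue  # no band configured for this metric
        low, high = bounds[name]
        if value < low or value > high:
            out.append(name)
    return sorted(out)


print(breaches({"cpu_percent": 97.0, "requests_per_second": 50.0}))
```

Note the lower bound on requests per second: a site that suddenly gets no traffic is often as interesting as one that gets too much.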

Oh, while I am at it, let me get all glossary on you. An alert is not a notification. People use these words interchangeably, when they mean two very different things. As that sucker who spent 6 years on call, I want to get the definitions right now:

<rant>

A notification is an event you want to be notified about. An alert means get your butt out of bed at 3am and go fight the fire. If the website can't be reached from Perth, Australia, but is working fine from New York, please notify me, but don't you dare send me an alert. People treating these two words as the same thing causes a lot of lost sleep and leads to alert fatigue. This difference is something not all SaaS providers handle well, so when evaluating them, keep this in mind.

</rant>
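Here is one way the Perth vs New York example could look in code. This is only a sketch: the `classify` function is hypothetical, and the rule it uses (failing everywhere is an alert, failing somewhere is a notification) is my simplification for illustration, not a policy any particular provider ships.

```python
def classify(check_results):
    """Map external check results (location -> reachable?) to a severity.

    Assumed policy: unreachable from every location wakes someone up,
    unreachable from only some locations is just worth knowing about.
    """
    failed = [loc for loc, ok in check_results.items() if not ok]
    if not failed:
        return "ok"
    if len(failed) == len(check_results):
        return "alert"
    return "notification"


print(classify({"Perth": False, "New York": True}))
```

Only the "alert" result should ever page the person on call; "notification" can go to email or chat and wait until morning.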

What we are starting to see now are easy ways to add automation when alerts are triggered. For example, an instance is detected to be using all available memory. An alert is fired, and rather than a human dealing with it, the instance is rebooted or terminated. #TreatServersLikeCattle #LetThereBeSleep
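A rough sketch of that kind of automation is below. Everything here is an assumption for illustration: the `handle_memory_alert` name, the 95% threshold, and the `reboot` callable, which stands in for whatever your cloud API or monitoring provider gives you to restart or replace an instance.

```python
def handle_memory_alert(instance_id, memory_percent, reboot, threshold=95.0):
    """Reboot an instance instead of paging a human when memory is exhausted.

    `reboot` is any callable taking an instance id (hypothetical hook).
    Returns True if the remediation was triggered.
    """
    if memory_percent >= threshold:
        reboot(instance_id)
        return True
    return False


# Usage with a stand-in action that just records what would be rebooted.
rebooted = []
handle_memory_alert("i-0001", 99.0, rebooted.append)
handle_memory_alert("i-0002", 40.0, rebooted.append)
print(rebooted)
```

Passing the action in as a callable keeps the decision separate from the side effect, which makes it easy to dry-run the rule before letting it touch production.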

The more advanced SaaS offerings are also getting into anomaly detection, which is really surprising and refreshing (in the sense that I didn't have to manually set a good band for the metric; the system learned what normal was). It's just the beginning for this, and certainly not perfect, but it is progress.

OK, so that is my long-winded way of saying "you need to monitor, and monitor in different ways to address the different issues that come up".
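To give a feel for "the system learned what normal was", here is a deliberately naive sketch: it derives a band from recent history (mean plus or minus a few standard deviations) instead of a hand-set min and max. The `is_anomaly` helper and the choice of `k=3.0` are assumptions of mine; what the SaaS providers actually run is surely more sophisticated than this.

```python
from statistics import mean, stdev


def is_anomaly(history, value, k=3.0):
    """Flag `value` if it is more than k standard deviations from the
    mean of `history` (a list of recent samples of the same metric)."""
    if len(history) < 2:
        return False  # not enough data to learn a band yet
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


recent = [100, 102, 98, 101, 99]
print(is_anomaly(recent, 150), is_anomaly(recent, 101))
```

A simple band like this ignores daily and weekly traffic patterns, which is one reason this area is still far from perfect.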


There is one more thing. I never said "manage your own monitoring solution". When it comes to the dynamic nature of the cloud, AWS in particular, Nagios, Cacti, Ganglia and Zabbix all need to go the way of the Dodo. They are all terrible, and end up costing you far more in engineering effort than you would ever spend on any decent SaaS. Focus on your product and your users. #LeanStartup

When evaluating your monitoring solutions, keep in mind that you will probably need more than one. Some started out in one bucket and stayed there, others have branched out into multiple buckets. Choose wisely.

Disclaimer: Part of this is controversial. I have probably missed things. If so, let me know. Also, don't take the above companies as recommendations. Do your research. I'll say it again: Choose wisely. Of course, being a consultant, I'm happy to help, so please contact me :)

Tuesday 03/04/2014 at 04:50pm | Peter Sankauskas