Let’s face it, monitoring is hard. We want to know every detail of our systems, from the state of the RAID, the latency on the storage and the size of the Java heap, through to the number of failed login attempts and the speed of the database transactions. Possibly even hotspots in the code, although pre-production profiling should really have caught those. That is a huge range of metrics to try to capture in one place. We also want to be alerted to problems without being spammed by false positives. Add the fact that every organisation has its own configuration and architecture, and that a myriad of products is available (both open source and commercial), and it’s hard to know where to start.

This article documents some approaches to bringing a level of insight into a web farm based on Tomcat, running on Linux. We’re not talking web-scale here: the companies deploying thousands of these things have the staffing to build their own custom solutions. But a medium-sized organisation running a web farm of 20 or so web servers probably doesn’t have that level of resource.

So what do we want from a monitoring solution?

The most basic question is obviously “is my website up?” Great, if that’s all you want then a simple external uptime check is your answer. But knowing your website is up (or down) isn’t that helpful. If it went down, you’d probably know soon enough anyway. So the next question is “how fast is my page response?”, and more specifically “are the average response times in the last minute OK/bad/terrible?” At least this might give you a chance to catch problems before the site blows up in your face. From a Java application point of view, you might want to know whether the number of garbage collections per minute has suddenly increased, or whether you’re about to run out of PermGen space.
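To make the “OK/bad/terrible” question concrete, here is a minimal sketch of a one-minute rolling average of response times. The class name `ResponseTimeWindow` and the two thresholds are assumptions made up for illustration; sensible thresholds depend entirely on your own site, and a real monitoring tool would do this for you.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: keeps the last minute of response times and grades the average.
public class ResponseTimeWindow {
    private static final long WINDOW_MS = 60_000;

    // Each entry is {timestampMs, durationMs}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();
    private long totalMs = 0;
    private final double badMs;       // assumed threshold, tune for your site
    private final double terribleMs;  // assumed threshold, tune for your site

    public ResponseTimeWindow(double badMs, double terribleMs) {
        this.badMs = badMs;
        this.terribleMs = terribleMs;
    }

    // Record one request that finished at nowMs and took durationMs.
    public synchronized void record(long nowMs, long durationMs) {
        samples.addLast(new long[] {nowMs, durationMs});
        totalMs += durationMs;
        evict(nowMs);
    }

    // Drop samples that are a full window (or more) older than nowMs.
    private void evict(long nowMs) {
        while (!samples.isEmpty() && samples.peekFirst()[0] <= nowMs - WINDOW_MS) {
            totalMs -= samples.removeFirst()[1];
        }
    }

    // Average response time over the last minute, or 0 if there were no requests.
    public synchronized double average(long nowMs) {
        evict(nowMs);
        return samples.isEmpty() ? 0.0 : (double) totalMs / samples.size();
    }

    public synchronized String status(long nowMs) {
        double avg = average(nowMs);
        if (avg >= terribleMs) return "terrible";
        if (avg >= badMs) return "bad";
        return "OK";
    }
}
```

In a Tomcat application something like this could be fed from a servlet filter that times each request, passing `System.currentTimeMillis()` as `nowMs`. Taking the time as a parameter rather than reading the clock inside the class keeps the sketch easy to test.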

But crucially, you also need to know what the baseline norms are. There’s little point in an alert saying the number of page faults/sec is 150 if you have no idea whether that’s high or low. I’m a massive fan of graphs: if your website crashes at 9am and your monitoring application shows a particular metric skyrocketing at the same time, you know where to look, even without knowing anything else about the system. Simply logging onto a box and running [top|iostat|vmstat|etc] isn’t much help without knowing what the numbers are supposed to be. And who has time to do all that when there is a problem?

So far this has all been about system monitoring, but as an application developer you will also have application metrics you want to capture. Say the number of completed purchases per minute, or the number of cute cat uploads per second. So how do we expose those in an easy way?
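One option on a Java stack is JMX, which ships with the JDK. The sketch below exposes a purchase counter as a standard MBean; the names `AppMetrics`, `PurchaseCounter` and the object name passed to `register` are assumptions invented for this example, not anything Tomcat or a particular monitoring product requires.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical holder for application metrics exposed over JMX.
public class AppMetrics {

    // Standard MBean convention: the interface is named <ImplementationClass>MBean.
    public interface PurchaseCounterMBean {
        long getCompletedPurchases(); // appears as the attribute "CompletedPurchases"
    }

    public static class PurchaseCounter implements PurchaseCounterMBean {
        private final AtomicLong completed = new AtomicLong();

        // Call this from the application code whenever a purchase completes.
        public void purchaseCompleted() {
            completed.incrementAndGet();
        }

        @Override
        public long getCompletedPurchases() {
            return completed.get();
        }
    }

    // Registers a new counter with the JVM's platform MBean server.
    // objectName is an assumed name such as "com.example.shop:type=PurchaseCounter".
    public static PurchaseCounter register(String objectName) throws Exception {
        PurchaseCounter counter = new PurchaseCounter();
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(counter, new ObjectName(objectName));
        return counter;
    }
}
```

The counter only ever goes up, so “purchases per minute” comes from whatever polls the attribute and takes the difference between two readings. Once registered, the value is visible in any JMX client, such as JConsole attached to the Tomcat JVM.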

Finally, there is log monitoring. Everything that moves writes to a log file, often in an unstructured, inconsistent way. There’s no point having these logs scattered across file systems and servers. We need a quick, searchable interface for reacting to problems, and some way of raising automatic alerts from the logs.