Everyone ends up using nagios or a derivative just because... well, everyone else does. The size of your org matters a lot here; Zabbix might fit you perfectly or not at all.
Lately I've been setting up nagios with a graphite backend for people, then writing custom nagios plugins that send data to both systems. You can throw a lot of data at graphite and make some super pretty graphs if that is what you are after. For example, imagine having all the contents of a vmstat/iostat every X seconds... for ALL your servers, queryable with less than a minute of latency. You can do that with nagios+graphite+your-own-fixins. ... and then you show Dev how easy it is to log data into carbon/graphite and become a superhero.
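To make the "log data into carbon" bit concrete, here is a minimal sketch of what a plugin or a dev's code would do: open a socket to Carbon's plaintext listener (port 2003 by default) and write one line per datapoint. The hostname and metric path below are made up; swap in your own.

```python
#!/usr/bin/env python
# Minimal sketch: send one value to Carbon's plaintext listener.
# CARBON_HOST and the metric path are placeholders for your own setup.
import socket
import time

CARBON_HOST = "graphite.example.com"   # hypothetical Carbon host
CARBON_PORT = 2003                     # default plaintext protocol port

def send_metric(path, value, timestamp=None):
    """Carbon's plaintext protocol is just '<path> <value> <timestamp>\n'."""
    timestamp = int(timestamp or time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

if __name__ == "__main__":
    # e.g. a plugin logging a timing, or a dev logging a business number
    send_metric("app.login.time_ms", 123)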
When you start hoarding this much data you can start asking some really
detailed questions about disk performance, network latencies, system
resources, etc... that before were just guesstimates. Now you have the data and the graphs to back them up.
I'm also a big fan of Pandora FMS, but I've never implemented it anywhere professionally, and its scope is pretty large.
(I should note: nagios is pretty terrible; it's no better than what we had a decade ago.)
The real truth here is that all the current monitoring systems are
pretty terrible given that they are no better than what we had a decade
ago. Every good sysadmin group makes them work well enough, but there is
a lot of making them work. Great sysadmins go on to combine a couple of
them with their own bits to make the system a bit more proactive than
reactive, which is what most people expect out of monitoring.
Reactive monitoring is fine for certain companies and certain situations, and it is easily obtainable with nagios, zabbix, home-brew, a stupid-spend-money solution, etc... However, reactive monitoring is just the baseline for most; it certainly doesn't handle big problems well, nor does it have the capacity to predict events shortly before they happen. This level of monitoring also doesn't give you much data after an event to figure out what went wrong.
Great admins go on to add proactive systems monitoring and, in some cases, basic logic monitoring. This is what a lot of us do all the time, to avoid getting paged in the middle of the night, or to know what to pick up at Fry's on the way into the office. Proactive monitoring covers a lot more than the basics, and it is essentially the level everyone works at now, with nagios, etc... That's certainly fine for today and tomorrow. But it doesn't tell you anything about next quarter, and when you ask queries about events in the past they are often very basic in scope.
The other amazingly huge drawback with current monitoring is that if you want to monitor business or application logic, it is going to be something you custom fit into whatever monitoring system you have. That tends to be unwieldy, and while it is effective for answering basic questions like "What's the impact on sales if we lose the east coast data center and everything routes through the west?", that's a fine question but it isn't one that will get you to the next level, ahead of your competitors.
So what's next? I'll tell you where I think we should be going and how I am sort of implementing it at some places.
Predictive monitoring on systems AND business logic, with lots of data, and very complex questions being answered. This can be done right now with nagios, graphite and carbon. Nagios fills the monitoring and alerting needs. Carbon stores lots of numerical data, very fast, from a lot of sources. Finally, with Graphite you can start asking really serious questions like "How did the code push affect overall page performance time while one colo site was down? What's the business cost of the loss? Where were the bottlenecks in our environment? Server? Disk? Memory? Network? Code? Traffic?" Once you've constructed one of these lists of questions in graphite you can save it for the future, and not only monitor it but, because of the historical data kept on so many key points, use it for future predictions.
That said, how do you do all that now? Well, you throw nagios, graphite and carbon out there and then you CREATE a whole lot of stuff that is specific to your org. This is a lot of work and a lot of effort, and it takes time and real understanding of the full application and of what your end SLA goals are.
So how do we do all this?
You as an admin do this by creating custom nagios plugins and data handlers on your systems and throwing the data into carbon. As an admin you measure everything, and I mean everything. Think all of the output from a vmstat and an iostat logged in aggregate one-minute chunks on every single server you have and kept for years.
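As a rough illustration of the "measure everything" idea, here is a sketch of a tiny collector you might run from a one-minute cron: it takes one vmstat sample and fans the columns out into Carbon under the host's name. The Carbon host and the servers.<host>.vmstat.* namespace are assumptions; adapt them to your org.

```python
#!/usr/bin/env python
# Sketch: push one vmstat sample per run into Carbon (e.g. from a one-minute cron).
# The metric namespace ("servers.<host>.vmstat.<column>") is just an example layout.
import socket
import subprocess
import time

CARBON = ("graphite.example.com", 2003)   # hypothetical Carbon host/port

def collect_vmstat():
    # 'vmstat 1 2' prints a header row plus two samples; the last row is the
    # fresh 1-second sample rather than the since-boot averages.
    out = subprocess.check_output(["vmstat", "1", "2"]).decode().splitlines()
    names = out[1].split()          # column names: r b swpd free ... us sy id wa
    values = out[-1].split()        # most recent sample
    return dict(zip(names, values))

def main():
    host = socket.gethostname().replace(".", "_")
    now = int(time.time())
    lines = [
        "servers.%s.vmstat.%s %s %d\n" % (host, name, value, now)
        for name, value in collect_vmstat().items()
    ]
    sock = socket.create_connection(CARBON, timeout=5)
    try:
        sock.sendall("".join(lines).encode("ascii"))
    finally:
        sock.close()

if __name__ == "__main__":
    main()
```

The same pattern works for iostat or anything else that prints numbers; you just change the parser and the metric path.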
On the dev side you get the Lead Dev to agree on some key points where the AppStack should put out some data to carbon. This can be things like time to login, some balance value, whatever metric you want to measure. The key here is to have business logic metrics AND system metrics in the same datastore within Carbon. Now you get to ask questions across both data sets, and you get to ask them frequently and fast. You can easily make predictions about how more load will impact the hardware, i.e. do we need more spindles, more memory, etc...
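On the app side the instrumentation can be just as dumb. A hypothetical sketch, with the login logic stubbed out and the metric names invented, showing a business metric landing in the same Carbon store as the system metrics above:

```python
# Sketch: app-side instrumentation. The metric names, the Carbon host and the
# stubbed-out login are hypothetical; the point is that business metrics land
# in the same Carbon store as the system metrics.
import socket
import time

CARBON = ("graphite.example.com", 2003)   # same hypothetical Carbon host as before

def emit(path, value):
    line = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection(CARBON, timeout=2)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

def do_real_login(user, password):
    return True                            # stand-in for the app's real auth logic

def login(user, password):
    start = time.time()
    ok = do_real_login(user, password)
    emit("app.login.time_ms", (time.time() - start) * 1000.0)
    emit("app.login.success" if ok else "app.login.failure", 1)
    return ok
```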
This is what I have been doing with some companies in SV right now. It's not pretty or fully blown out yet, because it is a big huge problem and our current monitoring sucks. :D But it IS doable with current stuff, and it is quite amazing to know the answers to questions that were previously only dreamed about.
What's after that? The pie-in-the-sky next level would be having an app box in every app group running in debug mode, receiving less traffic through the load balancers of course, and loading all that debug data into carbon. Then you get to ask questions about specific bits of a code release and their performance in your real production environment.
... so those are my initial thoughts. Any comments? :)
Further, once you have all this, you can write nagios plugins that poll carbon for values on questions you have created, and then alert not only on system logic and basic app metrics, but on real, complex queries. Stuff like "How come no one has bought anything off page X in the last two hours, is it related to these other conditions? Oh. It is. Create me an alert in nagios so we can be warned when it looks like this is about to happen again." With much more data across more areas, you can ask about and alert on pretty much anything you can imagine. This is how you make it to the next level.
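As a rough sketch of that last step: one way to do the "poll carbon for values" part is to go through Graphite's render API, with a Nagios-style check that pulls a saved target as JSON and turns the answer into an exit code. The target expression, thresholds and Graphite URL below are all placeholders.

```python
#!/usr/bin/env python
# Sketch of a Nagios-style check that polls Graphite for a saved question.
# The target expression, thresholds and Graphite URL are all placeholders.
import json
import sys
import urllib.parse
import urllib.request

GRAPHITE = "http://graphite.example.com"
TARGET = "sumSeries(app.checkout.page_X.purchases)"   # hypothetical saved question
WARN, CRIT = 5, 1                                      # purchases in the last 2h

def window_sum(target, window="-2hours"):
    qs = urllib.parse.urlencode({"target": target, "from": window, "format": "json"})
    with urllib.request.urlopen("%s/render?%s" % (GRAPHITE, qs), timeout=10) as resp:
        series = json.load(resp)
    # Each series is {"target": ..., "datapoints": [[value, ts], ...]}; skip nulls.
    points = [v for s in series for v, _ in s["datapoints"] if v is not None]
    return sum(points)

def main():
    total = window_sum(TARGET)
    if total <= CRIT:
        print("CRITICAL - only %s purchases from page X in 2h" % total)
        sys.exit(2)          # Nagios exit codes: 0 OK, 1 WARNING, 2 CRITICAL
    if total <= WARN:
        print("WARNING - %s purchases from page X in 2h" % total)
        sys.exit(1)
    print("OK - %s purchases from page X in 2h" % total)
    sys.exit(0)

if __name__ == "__main__":
    main()
```

Hook a check like this up as an ordinary Nagios command and the "complex question" becomes just another alert.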
*Disclosure: I am a real user, and this review is based on my own experience and opinions.
Chris, do you still find this to be true? Is Nagios still a default tool when people are searching for IT Infrastructure Monitoring solutions?