Anders Håål – CTO and Founder
Customers often ask us: “What is the difference between check-based and metrics-based monitoring?” And, as expected, we get the follow-up question: “What is the best choice?”
Let’s start with the first question:
What is the difference between check and metrics based monitoring?
Check-based systems are typically built in the classic Nagios way. The checks are encapsulated in plugins. A plugin is a program or script that the “Nagios” system executes at a set interval. The plugin must typically implement three layers:
- Accessing the remote system that is subject to monitoring.
- Parsing and formatting the data from the remote system.
- Determining whether the data indicates that the state of the system is OK or is in a bad condition, defined as WARNING or CRITICAL.
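The three layers can be sketched as a small plugin. This is a minimal illustration, not a production plugin: the function names, the host name, the stubbed reply and the 80/90 thresholds are all assumptions made for the example. The exit codes 0, 1 and 2 follow the usual Nagios plugin convention for OK, WARNING and CRITICAL.

```python
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def fetch_raw(host: str) -> str:
    """Layer 1: access the remote system (stubbed with a fixed reply here)."""
    return "cpu_util=93.5"


def parse(raw: str) -> float:
    """Layer 2: parse and format the data from the remote system."""
    _, value = raw.split("=", 1)
    return float(value)


def evaluate(value: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Layer 3: decide the state from this single sample only."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(host: str) -> Tuple[int, str]:
    """Run all three layers and return (exit code, plugin output line)."""
    value = parse(fetch_raw(host))
    state = evaluate(value)
    label = ("OK", "WARNING", "CRITICAL")[state]
    # Text after the "|" is performance data in the Nagios plugin style.
    message = f"CPU {label} - utilization {value:.1f}% | cpu_util={value:.1f}%;80;90"
    return state, message
```

A real plugin would print the message and exit with the returned code, for example `state, message = run_check("server01.example.com")` followed by `print(message)` and `sys.exit(state)`. Note that the health decision lives inside the plugin, next to the access and parsing code.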
So the monitoring system is not responsible for determining the health of the instrumented system; that is the responsibility of the plugin. Typically, this means that the plugin determines the state from the currently collected data, without any history of that metric and without any relation to metrics collected by other plugins. This stateless approach makes the plugin simple in design, but not very advanced in determining good or bad health.
A typical example is a plugin measuring CPU utilization. It determines whether the CPU utilization is good or bad from a single point in time, for example treating anything over 90% as critical. It does not consider whether an application running on the server still performs its expected workload. A typical Nagios-based system can suppress the state for a number of check intervals, normally defined as SOFT vs HARD states, but the fact remains that the scope of what the plugin takes into account is limited in time and in relation to other metrics. It is important to understand that the purpose of most check-based systems was to manage alarms and notifications, not to collect metrics.
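The SOFT vs HARD idea can be illustrated with a simplified model. This is a sketch under assumptions, not the actual Nagios state machine: the class name `StateTracker` is hypothetical, and the only "history" it keeps is a counter of consecutive non-OK results, which is compared with a maximum number of check attempts.

```python
class StateTracker:
    """Simplified SOFT/HARD tracking: counts consecutive non-OK check results."""

    def __init__(self, max_check_attempts: int = 3) -> None:
        self.max_check_attempts = max_check_attempts
        self.attempt = 0  # consecutive non-OK results seen so far

    def update(self, state: int) -> str:
        """Feed one plugin exit code; return "OK", "SOFT" or "HARD"."""
        if state == 0:
            self.attempt = 0
            return "OK"
        self.attempt = min(self.attempt + 1, self.max_check_attempts)
        if self.attempt >= self.max_check_attempts:
            return "HARD"  # the problem is confirmed, notifications would go out
        return "SOFT"  # the problem is seen but not yet confirmed
```

Even with this suppression, the decision is still based on one metric from one plugin, and the raw values behind each result are not available for later queries.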
In metrics-based systems, alarms and notifications represent just one aspect of the system. The focus is on collecting metrics and providing the means to query them. That’s where the big difference lies. In a metrics-based system we still need ways to collect metrics. In the Prometheus world, applications natively expose their metrics, or we create exporters that collect metrics and transform them to the Prometheus or OpenMetrics exposition format. But the exporter itself does not apply any logic to determine whether the metrics are “good or bad” or whether an alarm should be triggered. This task is the responsibility of the platform, completely separate from the collection of the metrics. With a metrics-based platform we use a query language to aggregate and combine different metrics over some time period and compare the result against defined thresholds. This is a clear separation of concerns between collecting metrics and operating on the metrics. So, in systems like Prometheus, InfluxData and Elastic we have storage of metrics that can be queried without any knowledge of how they were collected. This means that with a metrics-based system, alarms and notifications are just one use case. Trend analysis, capacity planning, problem analysis, visualization, etc. are other areas, all applied to the same stored metrics. With a system like Influx we can also store events, and with Elastic we can store logs. And with a tool like Grafana we can combine all the different sources in a visual way.
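The separation between collecting and evaluating can be sketched in two independent pieces. The first function renders a value in the Prometheus text exposition format, as an exporter would, with no thresholds at all. The second part stands in for the platform and evaluates a rule over a time window, roughly what a PromQL expression such as `avg_over_time(cpu_utilization_percent[5m]) > 90` expresses. The metric name, label, window and threshold are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def render_gauge(name: str, help_text: str, labels: Dict[str, str], value: float) -> str:
    """Exporter side: expose a value, with no opinion on good or bad.

    Label values are not escaped here, which a real exporter would need to do.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{{{label_str}}} {value}\n"
    )


Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def avg_over_time(samples: List[Sample], now: float, window_s: float) -> float:
    """Platform side: average of the stored samples within the last window_s seconds."""
    recent = [v for ts, v in samples if now - window_s < ts <= now]
    if not recent:
        raise ValueError("no samples in the window")
    return sum(recent) / len(recent)


def should_alert(samples: List[Sample], now: float,
                 window_s: float = 300.0, threshold: float = 90.0) -> bool:
    """Platform side: the alert rule, defined separately from collection."""
    return avg_over_time(samples, now, window_s) > threshold
```

With samples of 50, 95, 40 and 45 percent inside the window, the average is 57.5, so the single spike to 95 does not raise an alert, while the same stored samples remain available for trend analysis or dashboards.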
Does this mean that existing check based systems are not useful anymore?
From a metrics-based system we can absolutely use our existing check-based system as a source of metrics. At Opsdis we have developed a number of tools that extract metrics from check-based systems. These are tools like the monitor-exporter, icinga-exporter and monitor2influx, so organizations do not have to throw away existing investments and still have a way to start with a metrics-based system as a complement and a way forward.
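To illustrate the general idea of using check results as a metrics source, the sketch below parses the performance data part of a Nagios-style plugin output line into plain metric records. It is not how the tools named above are implemented; the function name, the record layout and the example labels are assumptions, and the parser only handles unquoted perfdata labels.

```python
import re
from typing import Dict, List

# Matches the start of one perfdata item: label=value[unit], e.g. cpu_util=93.5%
_PERF_RE = re.compile(r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<unit>[^;\d]*)")


def perfdata_to_metrics(plugin_output: str, labels: Dict[str, str]) -> List[Dict[str, object]]:
    """Turn the text after "|" into metric records; thresholds are ignored."""
    _, _, perfdata = plugin_output.partition("|")
    metrics: List[Dict[str, object]] = []
    for item in perfdata.split():
        match = _PERF_RE.match(item)
        if match is None:
            continue  # skip anything that does not look like label=value
        metrics.append({
            "name": match.group("label"),
            "value": float(match.group("value")),
            "unit": match.group("unit"),
            "labels": dict(labels),
        })
    return metrics
```

For an output line such as `CPU CRITICAL - utilization 93.5% | cpu_util=93.5%;80;90 load1=0.42;;;0`, this yields two records, `cpu_util` and `load1`, which can then be stored and queried like any other metric, independent of the state the plugin reported.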
In the end it is not the selection of tools that will make you successful. Many companies focus on the tools, but what they should focus on is a strategy and a culture for observability. The question is not Influx, Prometheus, Elastic, Loki or something else; what matters is that you learn how they fit into your observability stack.
What becomes very important when you start using metrics-based systems is the design of your metrics and event model, because tags, labels and keys matter. These are the foundation for combining and aggregating different metrics and events in a consistent way to achieve observability and insight.
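A small sketch shows why consistent label keys matter for aggregation. The `sum_by` helper is hypothetical and only mimics the idea of grouping by a label, similar in spirit to `sum by (service)` in PromQL; the label names and values are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (labels, value)


def sum_by(samples: List[Sample], label: str) -> Dict[str, float]:
    """Sum values per value of the given label; a missing label groups under ""."""
    totals: Dict[str, float] = defaultdict(float)
    for labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


samples: List[Sample] = [
    ({"service": "checkout", "host": "web01"}, 10.0),
    ({"service": "checkout", "host": "web02"}, 20.0),
    ({"service": "search", "host": "web03"}, 5.0),
    ({"svc": "search", "host": "web04"}, 7.0),  # inconsistent key: "svc" instead of "service"
]

per_service = sum_by(samples, "service")
```

Here `per_service` holds 30.0 for `checkout` but only 5.0 for `search`, because the sample labelled with `svc` falls into the empty group. One inconsistent key is enough to make the aggregate silently wrong, which is why the label model deserves design effort up front.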