If you implement infrastructure or application monitoring, you mainly care about three things. First, you want a fast and scalable core that queues check execution and manages state changes as well as notifications. Second, you need a collection of check plugins. Third, you might want to visualize performance data from the past or do some kind of trend monitoring.
The perfect tool for the first part is Icinga2. The last part can be done using InfluxDB and Grafana, for example.
When it comes to the second part, the Icinga project does not provide any check plugins out of the box, so most of you start with nagios-plugins from monitoring-plugins.org. This has consequences, so this article is about the art of check plugins, the drawbacks of nagios-plugins, and why and how Linuxfabrik implemented a replacement.
Imagine a minimal Linux server. This is what IMO should be monitored to increase uptimes and assist you in troubleshooting:
When trying to implement the above-mentioned aspects using nagios-plugins-all, all you get is:
That's it. Besides that, you also get a bunch of mysterious plugins for checking the signal strength of some special wireless equipment, or even game or Novell NetWare servers.
If you have a closer look, the plugins are written in three different languages (C, shell script, Perl), are of different age and quality and differ noticeably in configuration options, check behavior and output details. Assuming you installed nagios-plugins-all on CentOS 7 Minimal, the size of nagios-plugins-all including all (Perl-)dependencies is twice as big as the size of the Icinga2 Core.
With regard to datacenter monitoring, searching for a replacement or complement is even worse: thousands of authors just released one check plugin years ago with a very special feature subset. Most of those plugins are not "Enterprise Grade" in terms of error handling for example, and are written in even more languages like Ruby, Go etc., which leads to more library or tool dependencies.
After years of using all kinds of plugins (including self-written ones) while the world has moved on, we started writing a new Check Collection from scratch, with the following global rules of thumb in mind:
Before we kick-started this project, we defined some essential software requirements. They cover functional and non-functional requirements and describe how users interact with the new check plugins.
Excerpt:
Today (2020-05-26), our checks are written in Python 2, because in a datacenter environment (where those checks are mainly used) the python == python2 side is still more popular. In CentOS 7, Python 2.7 is the default; Python 3 became available in CentOS 7.8. In CentOS 8, there is no default; you have to specify whether you want Python 3 or 2. Support for Python 2 has officially ended, but not in CentOS 8 (Python 2 remains available in CentOS 8 until the late 2020s; for further details have a look here). Nevertheless, providing a Python 3 variant of each check is on our roadmap.
If we have to use 3rd party libraries for various reasons, we stick to official versions. At the time of writing, some check plugins need the Python libs:
Other shared functions are located in our self-written Python library, for example for dealing with:
get_service(), set_downtime() etc.
fetch(), fetch_json() with timeout, proxy and TLS handling
We tested the checks on:
Host Aliveness
In the wild, check_ping is mainly used for checking host aliveness. Because services (and, in host hierarchies, other hosts) depend on a host state, a new ping plugin has to be reliable and tolerant:
Besides that, our ping check is fast: it sends five pings in one second (by default), so it has the shortest plugin execution time amongst all ping checks.
Time Periods
Imagine a "cpu usage" check that reports 100% usage (resulting in crit), 20% (ok), 90% (crit) and so on: because the check plugin doesn't consider past results, we get an annoying flapping behaviour, so that everyone working with Icinga gets used to state changes. It would be better if the check only alerts when the condition has been above the warn/crit threshold for a specific amount of time – much like Prometheus does with its "for: 5m" construct. This behaviour is currently implemented in:
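To make the time-period idea more tangible, here is a minimal sketch in Python. It is not the code of any of our checks; the function names are hypothetical, and it assumes that the start time of a breach is kept between two plugin runs (in a real plugin this value would have to be persisted somewhere, since every check execution is a separate process).

```python
def update_breach_start(breach_start, now, value, threshold):
    """Return the time at which the current breach began, or None if the
    value is below the threshold (hypothetical helper for illustration)."""
    if value < threshold:
        return None  # back to normal, forget any running breach
    # keep the original start time while the breach continues
    return breach_start if breach_start is not None else now


def is_alerting(breach_start, now, period):
    """Alert only if the threshold has been exceeded for `period` seconds."""
    return breach_start is not None and now - breach_start >= period


def evaluate(samples, threshold, period):
    """Replay (timestamp, value) samples and report the alert decision
    after each one."""
    breach_start = None
    results = []
    for now, value in samples:
        breach_start = update_breach_start(breach_start, now, value, threshold)
        results.append(is_alerting(breach_start, now, period))
    return results


if __name__ == '__main__':
    # one CPU usage sample per minute, alert after 5 minutes above 90%
    samples = [(0, 100), (60, 20), (120, 90), (180, 95),
               (240, 97), (300, 99), (360, 98), (420, 96)]
    print(evaluate(samples, threshold=90, period=300))
```

With these samples, the single spike at the beginning never alerts, and only the last sample (five minutes after the usage went up again) does.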
Don't reinvent the Wheel – instead port Well-Known Tools
Before implementing a new check, we always have a look at the source code of monitoring-plugins.org, existing tools that do the job today or even the Linux kernel and try to port the ideas according to our Development Guidelines. Examples:
Communicate with Icinga
Sometimes we just want to be informed about something, for example about a new release on GitHub or a news item on a security portal. Unfortunately there is no simple NOTICE state in Nagios or Icinga, so one way to simulate this functionality is:
This behaviour is currently implemented in:
Checking for Application Updates
When checking for application updates, we have to compare external resources (for example releases on GitHub) to locally installed software. Always be nice when using external resources: even if the check runs every minute, don't re-fetch external URLs whose content rarely changes. Use a local cache to minimize traffic. This behaviour is currently implemented in:
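As an illustration of the caching advice above (again a sketch, not the code of our checks), the following example wraps an arbitrary fetch function with a simple file-based cache. The name cached_fetch(), the JSON cache file layout and the max_age parameter are assumptions made for this example; the key is assumed to be safe to use as a file name.

```python
import json
import os
import tempfile
import time


def cached_fetch(key, fetcher, cache_dir, max_age, now=None):
    """Return the result of fetcher(), reusing a cached copy that is
    younger than max_age seconds (hypothetical helper)."""
    now = time.time() if now is None else now
    path = os.path.join(cache_dir, key + '.json')
    try:
        with open(path) as f:
            entry = json.load(f)
        if now - entry['fetched_at'] < max_age:
            return entry['value']  # fresh enough, no network traffic
    except (OSError, ValueError, KeyError):
        pass  # no usable cache entry, fall through and fetch
    value = fetcher()
    with open(path, 'w') as f:
        json.dump({'fetched_at': now, 'value': value}, f)
    return value


if __name__ == '__main__':
    def fake_fetcher():
        # stands in for a real HTTP request to an external resource
        print('fetching ...')
        return {'tag': 'v1.2.3'}

    cache_dir = tempfile.mkdtemp()
    # the second call is answered from the cache, so the
    # external resource is contacted only once
    print(cached_fetch('release', fake_fetcher, cache_dir, max_age=3600))
    print(cached_fetch('release', fake_fetcher, cache_dir, max_age=3600))
```

A check that runs every minute with max_age=3600 would then contact the external resource roughly once per hour.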
systemd-unit
The "Swiss Army Knife" among our checks: it replaces legacy plugins checking for services, mounts, devices etc.
Some popular questions this check can answer:
Debugging and Troubleshooting
Checks that provide some additional information to assist you in debugging and troubleshooting:
Releases:
Please ensure that you always use an official release, and always the same release for the checks and the libraries.
Besides defining deliverables and development patterns like "naming conventions" or "prefer percentages over absolute values to assist users in comparing different systems with different absolute sizes", we also make use of some established Python coding styles:
pydoc lib/base.py work.
pylint for the libraries, and with pylint --disable=C0103,C0114,C0116 for the check plugins, on a more regular basis.
For details, have a look at CONTRIBUTING.rst.
Hands on: we want to implement a simple plugin that checks the current SELinux enforcement state. If it is not equal to the default (enforcing) or to the state given via a parameter, it returns a warning.
A first iteration that does nothing, simply returns OK and serves as a development template looks like this:
01: #! /usr/bin/env python3
02: # -*- encoding: utf-8; py-indent-offset: 4 -*-
03:
04: import argparse
05: import sys
06: from traceback import print_exc
07:
08: from lib.globals import STATE_UNKNOWN, STATE_OK
09: import lib.base
10:
11: __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
12: __version__ = '2020051501'
13:
14: DESCRIPTION = '''Lorem ipsum.'''
15:
16:
17: def parse_args():
18:     parser = argparse.ArgumentParser(description=DESCRIPTION)
19:     parser.add_argument(
20:         '-V', '--version',
21:         action='version',
22:         version='%(prog)s: v{} by {}'.format(__version__, __author__),
23:     )
24:     return parser.parse_args()
25:
26:
27: def main():
28:     try:
29:         args = parse_args()
30:     except SystemExit as e:
31:         sys.exit(STATE_UNKNOWN)
32:
33:     lib.base.oao('It works.', STATE_OK)
34:
35:
36: if __name__ == '__main__':
37:     try:
38:         main()
39:     except Exception as e:
40:         print_exc()
41:         sys.exit(STATE_UNKNOWN)
On lines 04 to 06 we import some Python core libraries. On lines 08 and 09 we do the same for some of the Linuxfabrik libs.
After defining how to parse command line arguments, in the main() function at line 33 we simply say "Over and Out (oao)", print "It works." and fire OK.
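To illustrate what an "Over and Out" helper does conceptually, here is a minimal stand-in. It is not the implementation from our library, just a sketch of the idea: print the plugin output and terminate with the exit code that Icinga interprets as the service state. The numeric values follow the usual plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN); the always_ok parameter mirrors the way oao() is called in the final listing of this article, and its behaviour here is an assumption.

```python
import sys

# conventional plugin exit codes
STATE_OK = 0
STATE_WARN = 1
STATE_CRIT = 2
STATE_UNKNOWN = 3


def oao(msg, state=STATE_OK, always_ok=False):
    """Over and out: print the message and exit with the given state.

    Stand-in for illustration only. With always_ok=True the message is
    still printed, but the exit code is forced to OK (assumed behaviour).
    """
    print(msg.strip())
    sys.exit(STATE_OK if always_ok else state)


if __name__ == '__main__':
    oao('It works.', STATE_OK)
```

Because the helper never returns, everything after a call to it is only reached if the call was not made, which keeps the main() function of a check flat and easy to read.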
Now, let's improve.
01: #! /usr/bin/env python3
02: # -*- encoding: utf-8; py-indent-offset: 4 -*-
03:
04: import argparse
05: import sys
06: from traceback import print_exc
07:
08: from lib.globals import STATE_UNKNOWN, STATE_OK
09: import lib.base
10:
11: __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
12: __version__ = '2020051901'
13:
14: DESCRIPTION = '''Lorem ipsum.'''
15:
16: CMD = 'getenforce'
17: DEFAULT_SELINUX_MODE = 'enforcing'
18:
19:
20: def parse_args():
21:     parser = argparse.ArgumentParser(description=DESCRIPTION)
22:     parser.add_argument(
23:         '-V', '--version',
24:         action='version',
25:         version='%(prog)s: v{} by {}'.format(__version__, __author__),
26:     )
27:     return parser.parse_args()
28:
29:
30: def main():
31:     try:
32:         args = parse_args()
33:     except SystemExit as e:
34:         sys.exit(STATE_UNKNOWN)
35:
36:     stdout, stderr, retc = lib.base.coe(lib.base.shell_exec(CMD))
37:     if (stderr or retc != 0):
38:         lib.base.oao('Bash command `{}` failed.\nStdout: {}\nStderr: {}'.format(
39:             CMD, stdout, stderr), STATE_UNKNOWN)
40:     selinux_mode = stdout.strip().lower()
41:
42:     lib.base.oao('It works.', STATE_OK)
43:
44:
45: if __name__ == '__main__':
46:     try:
47:         main()
48:     except Exception as e:
49:         print_exc()
50:         sys.exit(STATE_UNKNOWN)
Line 16 defines the shell command we want to use, line 17 what we expect if we don't get a command line argument from the operator later on.
Line 36 uses shell_exec() to execute the external command, returning the complete output as strings (stdout, stderr) and the program exit code (retc). It is surrounded by "continue or exit" (lib.base.coe), meaning if anything fails, the check exits here, returning UNKNOWN and the system error message.
The last thing we have to do is to provide a help text and some real-world command line params, and check against them:
01: #! /usr/bin/env python3
02: # -*- encoding: utf-8; py-indent-offset: 4 -*-
03:
04: import argparse
05: import sys
06: from traceback import print_exc
07:
08: from lib.globals import STATE_UNKNOWN, STATE_OK, STATE_WARN
09: import lib.base
10:
11: __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
12: __version__ = '2020051901'
13:
14: DESCRIPTION = '''Checks the current mode of SELinux against a desired mode,
15: and returns a warning on a non-match.'''
16:
17: CMD = 'getenforce'
18: DEFAULT_SELINUX_MODE = 'enforcing'
19:
20:
21: def parse_args():
22:     parser = argparse.ArgumentParser(description=DESCRIPTION)
23:     parser.add_argument(
24:         '-V', '--version',
25:         action='version',
26:         version='%(prog)s: v{} by {}'.format(__version__, __author__),
27:     )
28:     parser.add_argument(
29:         '--always-ok',
30:         dest='ALWAYS_OK',
31:         action='store_true',
32:         default=False,
33:     )
34:     parser.add_argument(
35:         '--mode',
36:         default=DEFAULT_SELINUX_MODE,
37:         dest='SELINUX_MODE',
38:         choices=['enforcing', 'permissive', 'disabled'],
39:     )
40:     return parser.parse_args()
41:
42:
43: def main():
44:     try:
45:         args = parse_args()
46:     except SystemExit as e:
47:         sys.exit(STATE_UNKNOWN)
48:
49:     stdout, stderr, retc = lib.base.coe(lib.base.shell_exec(CMD))
50:     if (stderr or retc != 0):
51:         lib.base.oao('Bash command `{}` failed.\nStdout: {}\nStderr: {}'.format(
52:             CMD, stdout, stderr), STATE_UNKNOWN)
53:     selinux_mode = stdout.strip().lower()
54:
55:     if selinux_mode == args.SELINUX_MODE.lower():
56:         lib.base.oao('SELinux mode is {} (as expected).'.format(
57:             selinux_mode), STATE_OK)
58:     lib.base.oao('SELinux mode is {}, but supposed to be {}.'.format(
59:         selinux_mode, args.SELINUX_MODE), STATE_WARN, always_ok=args.ALWAYS_OK)
60:
61:
62: if __name__ == '__main__':
63:     try:
64:         main()
65:     except Exception as e:
66:         print_exc()
67:         sys.exit(STATE_UNKNOWN)
Assuming you save this as mycheck, you can call it like so:
./mycheck
./mycheck --version
./mycheck --help
./mycheck --mode permissive
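If you wonder how the "continue or exit" pattern around shell_exec() in the listings above can work, the following sketch shows one possible shape. It is not the code of our library: the (success, payload) tuple contract is an assumption derived from how the two functions are combined in the listings, and the use of subprocess.run() and shlex.split() is just one way to run a command without a shell.

```python
import shlex
import subprocess
import sys

STATE_UNKNOWN = 3  # conventional plugin exit code for UNKNOWN


def shell_exec(cmd):
    """Run cmd and return (True, (stdout, stderr, retc)) on success,
    or (False, error_message) if it could not be started (assumed contract)."""
    try:
        proc = subprocess.run(
            shlex.split(cmd), capture_output=True, text=True)
    except OSError as e:
        return (False, 'Running `{}` failed: {}'.format(cmd, e))
    return (True, (proc.stdout, proc.stderr, proc.returncode))


def coe(result, state=STATE_UNKNOWN):
    """Continue or exit: hand back the payload on success, otherwise
    print the error message and terminate with the given state."""
    success, payload = result
    if success:
        return payload
    print(payload)
    sys.exit(state)


if __name__ == '__main__':
    # exits with UNKNOWN if getenforce is not installed on this machine
    stdout, stderr, retc = coe(shell_exec('getenforce'))
    print(stdout.strip().lower())
```

The advantage of this split is that library functions never terminate the process themselves; the check decides, via coe(), whether a failure is fatal.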
The check plugins and the libraries are constantly evolving. We publish new releases frequently, so stay informed.