Retry interval for passive checks

Hello everyone,

I have a problem with the behavior of passive checks in Check_MK 1.2.6p16.
I would like to configure a certain type of passive check, collected via a plugin (batch file) from a monitored Windows Server 2012 R2 host, so that a failed plugin run (for example, one that hit its timeout) is retried only at a given interval (5 minutes) and only a defined number of times (2 at most).

I have already tried to set “Retry check interval for service checks” and “Maximum number of check attempts for Service” via WATO, only to find out that the retry interval applies only to active checks. The maximum number of check attempts seems to be ignored too. I get alerted instantly as soon as one of the plugins on the monitored host times out. I would expect it to at least try to run the plugin two more times after the first failure before alerting, but that doesn’t happen.

My check_mk.ini on the Windows server looks like this:

[global]
  async_script_execution = sequential

[plugins]
  execution check*.bat = async
  timeout check*.bat = 90
  cache_age check*.bat = 1800
  retry_count check*.bat = 2

I also thought retry_count would change the behavior, but it was ignored as well. Or is my assumption wrong, and that setting does not affect the retry behavior at all? I’m a little confused here.
By the way, all the other settings in the check_mk.ini are working perfectly.

My setup:
Monitoring Host: Debian 8.4, Check_MK 1.2.6p16
Monitored Client: Windows Server 2012 R2, 1.2.6p16 (Agent Version)

I hope someone can help me out.
Thanks in advance.

For testing purposes, make sure your plugins/ dir contains only one plugin, one that appears to have timeout issues. Remove the cache_age option from your .ini. Set the timeout option for the plugin (use the full name for testing) to a value that you know is sufficient for the plugin to finish. Restart the check_mk_agent service.
Additionally, you should not need any WATO rules, so delete or deactivate them as well.

Thank you for the quick reply @TheLucKy.
I did as you said: I configured the check_mk.ini and left only one plugin file in the plugins folder.
This way the timeout value is not hit and the check runs through normally.

I hope I wasn’t too unclear in my first post and that I understood you correctly. The problem is not that the timeout is hit constantly (normally it isn’t). The issue is that when the server is under heavy load (not caused by the plugins) and some checks therefore reach their timeout, the Check_MK agent keeps starting these checks anew and so adds even more load to the machine. I want to prevent this as far as possible.

Oh, OK. But the agent in general does not run into a timeout, even when the server is under heavy load, does it? The default service check timeout for the active Check_MK service is 60 seconds (unless you configured it otherwise). So you should first check that the host’s cmk agent (without executing any plugins!) responds in time under heavy load.
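
One way to check this from the monitoring host is to time a full read of the agent output yourself. The following is only a minimal sketch: the function name time_agent_response is made up for illustration, and it assumes the agent listens on the usual TCP port 6556 and closes the connection once it has sent all of its data.

```python
import socket
import time


def time_agent_response(host, port=6556, timeout=60.0):
    """Return (seconds, bytes_received) for one full read of the agent output.

    Hypothetical helper, not part of Check_MK. The timeout applies to each
    socket operation (connect, recv), not to the read as a whole.
    """
    started = time.monotonic()
    received = 0
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # the peer closed the connection: output is complete
                break
            received += len(chunk)
    return time.monotonic() - started, received
```

Calling time_agent_response("your-host") a few times while the machine is idle and again while it is busy gives a rough idea of how close the agent gets to the configured check timeout.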

I indeed changed that some time ago to 240 seconds.

I ran a test on a very similar test machine today and kept the CPU utilization at 100% for 45 minutes. Without any plugins the agent was fine and replied in time with the expected results.
Surprisingly, with some and even with all plugins the agent was able to process everything in time. There was a delay of a few seconds, but nothing more. The 240-second threshold of the Check_MK timeout was never reached; it was not even close.

We then copied the plugins until there were too many for the system to process and the system became unresponsive to the monitoring. The agent was no longer able to collect all the plugin data and repeated the process of collecting data indefinitely without ever succeeding (Cannot get data from TCP port xxx.xxx.xxx.xxx:6556: [Errno 111] Connection refused.). This exact same state recently occurred on our production system too, caused not by too many plugins but by a high workload created by some power users.

I’d also like to mention that these machines are highly available virtual machines on a Hyper-V 2012 R2 cluster and their only job is running Oracle databases. The related plugins are all DB plugins for Oracle.

I am not a specialist in networking, but doesn’t “connection refused” indicate that the huge load on the system breaks the whole TCP socket/connection in the first place? If only a plugin times out, just the corresponding service should show something like “plugin timed out”. I am not sure whether it would help to set a higher retry count and retry interval on the active Check_MK service of the affected host(s).
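
To tell these cases apart from the monitoring host, a small probe can help. This is only a sketch with a made-up function name (classify_agent_port), assuming the agent port is 6556. Errno 111 on Linux is ECONNREFUSED, which Python raises as ConnectionRefusedError; it means the connection attempt itself was rejected, which is a different failure from an agent that accepts the connection and then answers slowly.

```python
import socket


def classify_agent_port(host, port=6556, timeout=5.0):
    """Classify a single TCP connection attempt to the agent port.

    Hypothetical helper, not part of Check_MK. Returns "open", "refused",
    "timeout", or "error: <reason>".
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # The connection attempt was actively rejected (ECONNREFUSED).
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc
    conn.close()
    return "open"
```

If the probe reports "refused" while the machine is overloaded, the problem sits at the connection level and not with an individual plugin.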

Hi,
I just wanted to update this thread.
I somewhat solved my problem by writing a plugin that checks whether a certain process on a monitored host has been running for a time x.
It is not the solution I was aiming for, but it raises an alert if one or multiple plugins run longer than expected, so I can see whether a plugin process is constantly running or not.
Nevertheless, thanks for the help and the suggestions.
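
For anyone who wants to build something similar, here is a rough sketch of the decision logic only. It is not the plugin described above: the function name, the service name Plugin_Runtime and the input format (a list of process name and start time pairs) are assumptions for illustration, and how the process list is gathered on Windows is left out. The output follows the local check line format: status, service name, performance data, text.

```python
def long_running_check(processes, now, max_age_seconds, service="Plugin_Runtime"):
    """Build one local-check line from (name, start_time) pairs.

    Hypothetical helper: `processes` holds (process name, start timestamp in
    seconds) tuples, `now` is the current timestamp in the same unit.
    """
    stale = [
        (name, now - started)
        for name, started in processes
        if now - started > max_age_seconds
    ]
    if stale:
        detail = ", ".join("%s (%ds)" % (name, age) for name, age in stale)
        # Status 2 = CRIT: at least one process exceeded the allowed runtime.
        return "2 %s count=%d %d process(es) over %ds: %s" % (
            service, len(stale), len(stale), max_age_seconds, detail)
    # Status 0 = OK: nothing has been running for too long.
    return "0 %s count=0 no process over %ds" % (service, max_age_seconds)
```

For example, with a limit of 300 seconds, a process that started 600 seconds ago is reported as critical, while one that started 100 seconds ago is ignored.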