Sporadic issue with some snmp checks


(Pablo Venturini) #1

I’m running Icinga2 v 2.5.3 with Director in a distributed top down configuration. All is working mostly well. I’m seeing some sporadic issues with only one of my check commands and it seems to be limited to a few systems…

The check command script is check_snmp_win.pl. For a few systems on which I’m monitoring multiple services, I sporadically get “Process name table : No response from remote host…” The service checks change from OK to UNKNOWN to OK repeatedly. They only remain UNKNOWN for about 5 minutes on average. This seems to happen for all of the services being monitored but not all at the same time.

Could there perhaps be some SNMP settings on these few systems that are not set up properly, causing some requests to be ignored? I’m using SNMP for most other checks like CPU, disk and memory, and I don’t have this problem with any of those checks. So if it was SNMP related I think the same issue would occur with these other checks, but this is not the case. It only happens with these service checks.
Any help/tips would be appreciated. Thanks.


(Michael Friedrich) #2

I would analyse which check plugin call is responsible for that, and which parameters are used. Some specific SNMP OID calls may return a huge dataset, and this error normally happens if the device is busy with other tasks (the work it was actually made for; SNMP interfaces tend to get few resources on network hardware).

Maybe a different plugin or OID would solve the issue. Without knowing the details mentioned above, it will be hard to give helpful advice.


(Pablo Venturini) #3

Thank you Michael. I will keep digging into this and report back what I find.


(watermelon) #4

To add on to dnsmichi’s answer, it sounds like you have an issue with the plugin timing out because of something like a huge dataset that gets returned or because of the machine being busy for whatever reason (as dnsmichi says). Perhaps these checks also stay “UNKNOWN” for 5 minutes or so because of how you’ve set up the check/retry intervals and check attempts as well, hard to say without seeing your config.

Perhaps configure a longer timeout for the plugin using the -t flag because the default is only 5 seconds? Here’s a list of all of the parameters that can be tweaked for check_snmp_win.pl (assuming that you’re using this version).


(Pablo Venturini) #5

I tried adjusting both the Check timeout property on the service and also setting the timeout with the -t argument on the check command itself. Previously I had nothing set for Check timeout for any service. I set them both to 20 and now I get a different error “ERROR: Alarm signal (Nagios time-out)” instead of “Process name table : No response from remote host…”.
It seems to be the same systems having the problem as before. I did notice that the execution times for these checks have now increased from an average of 10s to 15s. That seems to be the only effect so far of changing the timeout values.


(Thomas Widhalm) #6

Try to set both timeouts to a very high level. This way you can do some baselining for the timeouts. Just use a grapher for collecting the execution time and then after some time (at least several days) set the timeout higher than what you see as a “normal” or average execution time.
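
The baselining idea above could be sketched roughly as follows. This is a minimal illustration in Python, assuming the execution times have already been exported from the grapher as a plain list of seconds. The function name `suggest_timeout`, the use of a 95th percentile instead of the average, and the 1.5x headroom factor are all assumptions made for illustration; none of them come from Icinga 2 or the plugin.

```python
import math


def suggest_timeout(execution_times, percentile=95.0, headroom=1.5):
    """Suggest a timeout in whole seconds from observed execution times.

    Takes the given percentile of the samples (nearest-rank method) and
    multiplies it by a headroom factor. Both defaults are illustrative
    assumptions, not recommended values.
    """
    if not execution_times:
        raise ValueError("need at least one execution time sample")
    ordered = sorted(execution_times)
    # Nearest-rank percentile: the smallest sample that has at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    baseline = ordered[rank - 1]
    return math.ceil(baseline * headroom)


if __name__ == "__main__":
    # Made-up sample values in seconds, for demonstration only.
    samples = [9.8, 10.4, 11.0, 10.1, 14.7, 10.9, 12.3, 10.0, 13.5, 10.6]
    print(suggest_timeout(samples))
```

A percentile is used here rather than the plain average so that a handful of slow but legitimate runs still fit under the suggested value; an average-based variant would work the same way with a larger headroom factor.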


(Pablo Venturini) #7

Makes sense. I set them both to 300s. Let me see how that works. I will adjust it higher if needed. Thanks.


(Pablo Venturini) #8

Now I get this… ERROR: The timeout value 300 is out of range (1…60) :slight_smile:
Not sure if that’s for the check command or the service property. I will change one at a time.


(Pablo Venturini) #9

I adjusted the timeout argument for check_snmp_win.pl to -t 60 and that removed the “out of range” error. Now I’m back to getting the “Nagios time-out” error.
What’s strange is that all of the checks with the “Nagios time-out error” have execution times around 15 seconds even though the check command is passed the -t 60 argument and the service check timeout is set to 300.


(Pablo Venturini) #10

I found an old post discussing timeouts of check_snmp_win.pl in Nagios.
https://sourceforge.net/p/nagios-snmp/discussion/458328/thread/678db0e9/

It mentions that in Nagios there is a “global” timeout which is set to 15 seconds. Is there any such global timeout in Icinga2? Something must be limiting these checks to 15 seconds. I can’t figure it out.


(watermelon) #11

So you say that this problem is only occurring for certain servers right? Is there anything special about them? Do they exist in a separate zone than your other servers? Are you able to reproduce the error via command line?

Somehow your check is still timing out even with a 60-second timeout, yet you say it returns the error after only about 15s? Is it a HARD or SOFT Unknown state?


(Michael Friedrich) #12

CheckCommand objects have the timeout attribute which iirc defaults to 1m. You can either create your own CheckCommand definition with a custom timeout, or override this on a host/service level with the attribute check_timeout.


(Pablo Venturini) #13

Michael, thanks. check_timeout is what I have set at the service level to override the check command timeout. I’m still not sure why the execution times are limited to 15 seconds.

Watermelon, thanks. Nothing special about these servers. They are in the same zone as other systems where the problem is occurring. Yes, I am able to reproduce the same sporadic results from the command line if I execute the check command from the Icinga node. Most are in a soft Unknown state, since the checks sometimes succeed before going into a hard state. For one server the checks have all been in a hard Unknown state for several days.

I will be doing more troubleshooting today. Thanks
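
For anyone wanting to reproduce this kind of sporadic result from the command line in a repeatable way, a small timing harness along these lines might help. It is only a sketch: the helper name `time_runs` and the `hard_limit` parameter are made up for illustration, and the demo command is a stand-in that would need to be replaced with the real plugin call. Exit code 3 is the usual plugin convention for UNKNOWN.

```python
import subprocess
import sys
import time


def time_runs(command, runs=5, hard_limit=120.0):
    """Run `command` repeatedly and record how each run ends.

    Returns one (exit_code, seconds, first_output_line) tuple per run.
    `hard_limit` is the harness's own safety limit in seconds.
    """
    results = []
    for _ in range(runs):
        started = time.monotonic()
        try:
            proc = subprocess.run(
                command, capture_output=True, text=True, timeout=hard_limit
            )
            code, output = proc.returncode, proc.stdout.strip()
        except subprocess.TimeoutExpired:
            # The harness limit was hit before the command finished;
            # report that as UNKNOWN (3).
            code, output = 3, "harness limit reached"
        elapsed = time.monotonic() - started
        first_line = output.splitlines()[0] if output else ""
        results.append((code, round(elapsed, 2), first_line))
    return results


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in the plugin
    # call (script path, host and other arguments) for real use.
    demo = [sys.executable, "-c", "print('OK - demo')"]
    for code, seconds, line in time_runs(demo, runs=3):
        print(code, seconds, line)
```

Comparing the recorded seconds against the configured timeouts shows quickly whether runs are being cut off at a fixed point, such as the roughly 15 seconds seen here.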


(Michael Friedrich) #14

15 seconds could be the plugin timeout itself, hardcoded in the plugin, which would be a bug. Can you share a link to the exact plugin you are using? I’d like to know where to look.


(Pablo Venturini) #15

I’ve been using this version of the check command.
http://nagios.manubulon.com/snmp_windows.html

I did, however, make one adjustment to this script due to buffer size errors as described in this thread:
https://support.nagios.com/forum/viewtopic.php?f=6&t=11645
Two lines changed
From: $session->max_msg_size(5000);
To: $session->max_msg_size(10000);

I did see that you have a new version of this script which I have not yet tested. Will do so now.


(Pablo Venturini) #16

ok. Some progress with using the new version of the script. Note, I also adjusted the max_msg_size line in this version as well…
Now I’m able to get the execution time greater than 15. So the timeout values I set at the service or command definitions seem to be working. I’m back to getting the “no answer from host” error now. I will try to adjust timeouts to see if I can extend the execution time until it works.


(Michael Friedrich) #17

Ok, that one. I remember that there was a bug with the timeout parameter in the old scripts. I’ve just “forked” the dead project years ago to allow community members to use and provide patches. Primarily this was just a base for the Debian packages. I for myself don’t have an infrastructure to test these plugins unfortunately. In case you’ve patched something working, kindly send a PR then :slight_smile:

As said a few replies above, the “no answer from host” error normally results from a huge dataset being collected on the host, which then terminates the connection. This is the most common pitfall, and I believe that some OIDs in that script generate too much data on the Windows host … it might be worthwhile to look into different OIDs, or a different plugin that meets the requirement.


(Pablo Venturini) #18

The old script had this line near the top which is why my checks were getting limited to 15s regardless of what I was setting in the service or command definitions :slight_smile:

# Icinga specific
my $TIMEOUT = 15; 

The new script still has a max of 60s for the timeout argument which is not long enough for my few systems having the problem. Not clear to me where in the script this limit is set but I will look into it and perhaps put in a PR with an updated version including the buffer size adjustment.

I will also do some troubleshooting on the affected systems to see why only these systems are having the problem. Two of them are domain controllers by the way and perhaps that’s contributing to large amount of data being returned by the OIDs.

Thanks for the help. I will post my findings here in case others run into similar issues in the future.


(Michael Friedrich) #19

Seems Net::SNMP as Perl library chose to cap the timeout to 60s at max. Just found this blogpost but I would think that patching the Perl library is not a good idea, as this will be overridden on updates.
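
To summarise how the limits discussed in this thread stack up, here is a deliberately simplified model in Python. It only encodes what was observed above: the old script's hardcoded 15 seconds, the 1 to 60 range behind the "out of range" error, and the service-level check_timeout. The function `effective_limit` and its parameter names are hypothetical, and the real plugin and Icinga 2 internals are more involved than a simple minimum.

```python
# Range quoted in the "out of range" error message earlier in this thread.
NET_SNMP_TIMEOUT_RANGE = (1, 60)


def effective_limit(plugin_t, check_timeout, hardcoded_alarm=None):
    """Return (seconds, source) for whichever limit fires first.

    Simplified model: plugin_t is the value passed with -t,
    check_timeout is the service-level setting, and hardcoded_alarm
    stands for a constant inside the script such as the old
    `my $TIMEOUT = 15;` (None if the script has no such constant).
    """
    low, high = NET_SNMP_TIMEOUT_RANGE
    if not low <= plugin_t <= high:
        raise ValueError(
            f"The timeout value {plugin_t} is out of range ({low}..{high})"
        )
    limits = {"plugin -t": plugin_t, "Icinga check_timeout": check_timeout}
    if hardcoded_alarm is not None:
        # Assumption: a hardcoded constant applies regardless of -t.
        limits["hardcoded $TIMEOUT"] = hardcoded_alarm
    source = min(limits, key=limits.get)
    return limits[source], source


if __name__ == "__main__":
    # Old script: -t 60, check_timeout 300, hardcoded 15 seconds.
    print(effective_limit(60, 300, hardcoded_alarm=15))
    # New script: same settings, no hardcoded constant.
    print(effective_limit(60, 300))
```

In this model, raising check_timeout alone never helps once a smaller limit exists elsewhere in the chain, which matches the behaviour reported with the old script.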


(Pablo Venturini) #20

Yes. I was just about to comment that the limit is with Net::SNMP. Thanks for that blogpost. I agree and want to avoid patching Perl. So I’ll explore other options for now. thanks.