Monitoring via SNMP works, then fails after some time

  • Hey all,


    I would like to preface this thread by saying that this may not be an Icinga2 problem. I'm asking here because others may have run into it while using Icinga2, or the network monitoring experts on this forum may recognize it. Thank you in advance for any help or suggestions.


    Background

    - I am monitoring multiple port interfaces on multiple firewalls and switches using check_snmp_int.pl

    - I have set up the commands and services correctly and applied them to the hosts

    - The "problem" firewall and switch interfaces could be monitored using these definitions until just recently

    - The "problem" firewall and switch settings are untouched (community string is the same, settings allow the Icinga2 server to use SNMP)

    - I am still monitoring some switches using the same method

    - I was able to use SNMP to monitor the port interfaces seamlessly before but now I'm not able to, despite not changing any configurations regarding the SNMP monitoring


    Proof that it works on one of the switches:



    Problem

    - I am unable to initiate SNMP connections to the "problem" firewalls and switches at all (meaning I can't even run snmpwalk, snmpget, check_snmp, etc.)


    What the problem looks like:


    Note: This is just one example. I currently have 5 network devices that have this same problem.


    Current leads

    1. I don't think that the problem lies in the plugin that I'm using or my command/service definitions because it works for other network devices

    2. I know this is not limited to this environment: the same problem also occurs in a different environment where I run another Icinga2 instance


    I simply have no clue what is wrong. Could somebody give a hand?


    If more information is needed, let me know. Thanks!

  • It sounds like a firewall issue if you can't use snmpwalk from your Icinga2 machine to the firewall and switches. I'm going to guess the switches are behind the firewall. Your best bet is to look at the logs of the firewall when this is happening and filter for your Icinga2 machine's IP address to see what the firewall is doing with the traffic (and verify the traffic is reaching the firewall). I'm not too familiar with SonicWALL, but all firewalls have a way to filter real-time logs or run debug commands to narrow down issues like this. You will want to filter for just the IP of the Icinga2 machine and the SNMP service, then watch the output as you run snmpwalk commands from the Icinga2 machine.

  • Network hardware tends to limit resources for SNMP services, i.e. if your router is under heavy load, SNMP queries have low priority. I've seen that with Cisco Catalyst a couple of years ago. I'd look into the hardware's console log if you see anything suspicious after the query attempt.

  • nobodynew


    Based on the firewall logs, it doesn't seem like the SNMP traffic is reaching the firewall at all, which is strange. Also strange is that the logs for the SNMP connections seem to be nonexistent (because at one point, this setup worked), other than a log for the persistent SNMP connection that we set up for the initial use of SNMP. We tried running debug commands to filter the real-time monitoring logs to no avail. Perhaps we are doing this wrong?


    Our switches are behind the firewall, but since the Icinga2 server sits on the same subnet as the other switches, the traffic does not go through the firewall, so I don't think that it could be a firewall issue.
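The same-subnet reasoning above can be checked rather than assumed. A minimal Python sketch, assuming hypothetical addresses from the documentation ranges and a /24 prefix (none of these values come from the thread; substitute the real server address, device address and prefix length):

```python
import ipaddress


def same_subnet(server_ip: str, device_ip: str, prefix_len: int) -> bool:
    """Return True if device_ip lies in the network that server_ip/prefix_len belongs to."""
    # ip_interface keeps the host bits; .network yields the enclosing subnet.
    server_net = ipaddress.ip_interface(f"{server_ip}/{prefix_len}").network
    return ipaddress.ip_address(device_ip) in server_net


# Hypothetical addresses, for illustration only.
print(same_subnet("192.0.2.10", "192.0.2.200", 24))   # both inside 192.0.2.0/24
print(same_subnet("192.0.2.10", "198.51.100.7", 24))  # different network
```

If this returns False for a failing device, the traffic is routed, and a firewall or ACL could sit in the path after all.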


    Also, the settings for the firewall/switches are untouched, which means that I can't see why the connection would suddenly fail.


    dnsmichi


    Since our network is internal, it doesn't go out onto the Internet, which means that the router could not be a factor in this problem. Plus, from monitoring the firewall interfaces over time, the internal network traffic never seemed to be under heavy load at all.

  • Have you figured this out yet? I'm seeing the same "UNKNOWN" response from the servers I am monitoring in my environment, so not even network devices.


    I suggest you try the following :


    1. Increase the timeout on the plugin; I think the default is 10s.

    2. Make sure you define an MIB file that the plugin needs to look into.

    3. You can also try to specify an OID for each item you want to monitor.


    https://docs.icinga.com/icinga…/agent-based-checks-addon

  • Please DON'T use MIB files; use OIDs instead. Using MIB files requires lookups, which increase the execution time.


    OK, thanks. This appears to contradict what the documentation says:




    If no snmp_miblist is specified, the plugin will default to ALL. As the number of available MIB files on the system increases so will the load generated by this plugin if no MIB is specified. As such, it is recommended to always specify at least one MIB.


    https://docs.icinga.com/icinga…/agent-based-checks-addon






  • Well, to be honest I don't know the logic inside.


    If the search is limited to ONE file it is surely far better than if ALL files are scanned but if the OID is just numerical a lookup isn't necessary at all.


    Edit: The manual page of check_snmp says

    Quote
        -m, --miblist=STRING
            List of MIBS to be loaded (default = none if using numeric OIDs or 'ALL'
            for symbolic OIDs.)
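To tell which of the two cases in the quoted help text applies to a given check, one can test whether the configured OID is purely numeric. A rough Python sketch: the regular expression is my own approximation of "numeric OID", not the logic check_snmp actually uses, and the function name is hypothetical.

```python
import re

# Approximation: dot-separated decimal arcs, optionally with a leading dot.
_NUMERIC_OID = re.compile(r"\.?\d+(?:\.\d+)*")


def is_numeric_oid(oid: str) -> bool:
    """Return True if oid is purely numeric, i.e. needs no MIB file to resolve."""
    return _NUMERIC_OID.fullmatch(oid) is not None


print(is_numeric_oid("1.3.6.1.2.1.1.3.0"))       # numeric form of sysUpTime.0
print(is_numeric_oid("IF-MIB::ifOperStatus.1"))  # symbolic name, needs IF-MIB
```

With numeric OIDs throughout, no MIB list should need loading at all, which matches the "default = none" wording above.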
  • mrwest


    No, I have not figured this out yet and now this problem has expanded into two more network switches as of Saturday. I have no idea what to do here...


    Quote

    1. Increase the timeout on the plugin; I think the default is 10s.

    2. Make sure you define an MIB file that the plugin needs to look into.

    3. You can also try to specify an OID for each item you want to monitor.

    1. Why would increasing the timeout on the plugin do anything? The checks were working before and only recently have started acting up.

    2 & 3. No need for a MIB file or OID because I am using a plugin called check_snmp_int.pl to check it (it's configured strictly for checking port interfaces)

  • Wolfgang


    Is this actually a problem? If so, how do I fix it?


    Surely it can't be fixed by increasing the timeout of the plugin... or increasing the time between checks. What do you suggest?


    Is there such a thing as an "SNMP overload" on a switch/firewall to the point of rejecting all SNMP traffic?

  • It sounds as if an increasing number of checks with more SNMP traffic might be causing problems.

    It's just a guess based on your description.


    now this problem has expanded into two more network switches as of Saturday

    So what has changed in your environment?

  • Wolfgang

    Quote

    It's just a guess based on your description.

    Yes, I was also just wondering whether or not that could actually be a problem? Is it possible at all?


    Quote


    So what has changed in your environment?

    I actually just fixed this problem. It turned out that the switches had somehow rebooted or reset randomly, which caused the SNMP configurations to reset as well. I have since reconfigured the SNMP settings on those switches and am able to monitor them via SNMP again. However, I cannot say the same for the other switches that are having problems.


    I have already gone into those switches' settings and verified that the SNMP settings that I configured a long time ago still exist and should theoretically be working.

  • Wolfgang


    They seem to fail at a certain point and stay down forever.


    For instance, I have not had SNMP monitoring to one of my switches since January 31st. One day, all of a sudden, it decided to fail as mentioned.


    There seems to be no pattern other than the fact that they all go down and have stayed down ever since.


    Here's a timeline:

    January 22, iSCSI switch monitoring fails

    January 31, Dell PowerConnect switch monitoring fails

    February 4, VRTX Chassis switch module monitoring fails

    February 22, Dell SonicWall monitoring fails

    February 23, Dell SonicWall (separate) monitoring fails

    March 6 (Today), Dell X1052 switch monitoring fails, but I was able to remediate it
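To look for a pattern in the timeline above, the intervals between failures can be computed directly. A small Python sketch; the thread gives no year, so `YEAR` below is an assumption (any non-leap year yields the same gaps):

```python
from datetime import date

YEAR = 2018  # ASSUMPTION: not stated in the thread; chosen only as a non-leap year.

# Failure dates and devices as listed in the timeline above.
failures = [
    (date(YEAR, 1, 22), "iSCSI switch"),
    (date(YEAR, 1, 31), "Dell PowerConnect switch"),
    (date(YEAR, 2, 4), "VRTX Chassis switch module"),
    (date(YEAR, 2, 22), "Dell SonicWall"),
    (date(YEAR, 2, 23), "Dell SonicWall (separate)"),
    (date(YEAR, 3, 6), "Dell X1052 switches"),
]


def gaps_in_days(events):
    """Return the number of days between each pair of consecutive events."""
    days = [day for day, _ in events]
    return [(later - earlier).days for earlier, later in zip(days, days[1:])]


print(gaps_in_days(failures))
```

Under that assumption the gaps are 9, 4, 18, 1 and 11 days (12 for the last one in a leap year), so there is no fixed interval such as a lease or certificate expiry that the failures obviously line up with.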

  • Wolfgang


    I mean that the devices are still on and functioning, but I am not able to execute SNMP checks on them (see the error I have from the first post).


    The snmpwalk command times out when I try to run it.


    Thanks for the help so far, by the way.

  • The snmpwalk command times out when I try to run it.

    Since you were having problems with devices losing their SNMP configuration I'd double check the switches in question.

    Verify that no firewalls or other blocking factors are between the monitoring server and the device to be tested.

    Executing snmpwalk/snmpget using v1 and "simple" OIDs like uptime or something similar would be my next approach. If that is successful run your plugin.
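The basic test described above could be scripted roughly like this. It is a sketch under assumptions: it relies on the net-snmp `snmpget` binary being on the PATH, the host and community string are hypothetical placeholders, and the 30-second outer limit is an arbitrary choice.

```python
import subprocess

SYS_UPTIME_OID = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0 from MIB-II, a "simple" OID


def build_snmpget_cmd(host, community, oid=SYS_UPTIME_OID, timeout_s=5, retries=1):
    """Build a net-snmp snmpget command line for an SNMP v1 query."""
    return [
        "snmpget", "-v", "1", "-c", community,
        "-t", str(timeout_s), "-r", str(retries),
        host, oid,
    ]


def probe(host, community):
    """Run the query and return (ok, output). Needs snmpget installed."""
    cmd = build_snmpget_cmd(host, community)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except FileNotFoundError:
        return False, "snmpget not found on PATH"
    except subprocess.TimeoutExpired:
        return False, "snmpget did not finish within 30 s"
    return result.returncode == 0, (result.stdout or result.stderr).strip()


# Hypothetical host and community string; prints the command without running it.
print(" ".join(build_snmpget_cmd("192.0.2.20", "public")))
```

Calling `probe(...)` against a working switch and a failing one from the Icinga2 server gives a plugin-independent comparison before check_snmp_int.pl is involved.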

  • I have already gone into those switches' settings and verified that the SNMP settings that I configured a long time ago still exist and should theoretically be working.

    The switch settings are the same, they have not been reset.


    The network devices are on the same subnet as the Icinga2 server, thus making it so that there can be no firewall in the way of the SNMP checks.