SNMP trap receiver and correlation options

  • Hello everyone,


    This question may not be completely related to icinga2, but I am sure I am not the first one to have these doubts/need for this functionality, so I thought maybe someone could share their findings/solutions.


    I have run some tests with SNMPTT and snmptrapd in order to receive SNMP traps in icinga2; however, I think we all agree icinga2 is not really trap-focused. The system works, and after receiving a trap you can modify a service in icinga2, send an email, etc., but what I cannot figure out is how to correlate incidents. I will try to give an example.
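    For reference, one way to turn a received trap into a service state change is to have the trap handler call the Icinga2 API's process-check-result action; a minimal sketch (the API user, host name and service name are placeholders):

        # Sketch only: submit a passive result for a service when a trap arrives.
        # "root:icinga", "switch01" and "interfaces" are placeholders for your
        # own API user, host object and service name.
        curl -k -s -u root:icinga \
             -H 'Accept: application/json' \
             -X POST 'https://localhost:5665/v1/actions/process-check-result' \
             -d '{ "type": "Service",
                   "filter": "host.name==\"switch01\" && service.name==\"interfaces\"",
                   "exit_status": 2,
                   "plugin_output": "CRITICAL: linkDown trap received for port 10" }'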


    Let us suppose that 3 ports on a switch go down (ports 1, 2 and 10); that will generate 3 traps indicating the ports going down. After receiving the traps we could set a service or services to a critical state, for example.


    Then after 5 minutes ports 1 and 2 go up, and new traps are generated. After receiving these new traps we would ideally clear the critical status for ports 1 and 2, while keeping the status for port 10.


    In this situation, I see two options.


    1. Create one service for every switch port. The drawback is that this would mean literally 1000+ services for our monitored network.

    2. Have some logic correlate the Up and Down traps.


    I know we could potentially create the logic needed for this correlation, but I do not plan on reinventing the wheel, so is there any open-source software around that does this correlation? I have taken a look at SEC and eventDB, but what do people usually use in this scenario?


    Thanks for your time

  • I have not configured SNMP traps, but I have configured SNMP monitoring with "one service per switch port" like you mentioned.


    I am not sure about your environment but for mine, configuring it this way makes perfect sense because I only have a handful of switches and firewalls to monitor. I would be happy to share my configurations with you if you like.


    My question for you is in what fashion would you want to correlate the up/down states of ports? Do you use multiple ports for the same service? Otherwise, I don't see how much "better" it would be to create some kind of correlation logic as you have described it.


    These are just my thoughts. Please enlighten me!

  • Hi,


    In our environment we have around 200 devices with an average of 24 ports each; that would mean roughly 4800 checks if we use the one-port-to-one-check/service relation.


    The idea behind the correlation is having 1 service per device, but then the problem is keeping track of what is up and what is down. One example:


    t=0  All ports are up --> the service is in an OK state.


    t=5  Ports 1, 2 and 3 go down --> the service goes to a CRITICAL state.


    t=10 Ports 1 and 2 go up, port 3 remains down --> the service is still in a CRITICAL state, but two of the port problems have been resolved.


    In this situation, if you look at the service history tab after the incident, it can be troublesome to identify the order of the ports going up/down in order to understand what happened and what caused the error, especially since the history tab does not show multiline output (at least not that I know of).


    In this situation the only option I see is a setup like yours, where every service shows the up/down times per port, but I do not see this as a scalable solution (with 5000 services and growing), which is why we are wondering whether there are other options available.
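    Just to make the correlation idea more concrete, I imagine something along these lines: a small handler that snmptrapd/SNMPTT calls for every linkUp/linkDown trap, which keeps per-port state on disk and submits one aggregated passive result per device via the Icinga2 API. This is only a rough, untested sketch; all paths, names and credentials are placeholders.

        #!/bin/sh
        # Hypothetical trap handler, called as: handler <hostname> <ifIndex> <up|down>
        # Keeps one file per down port and pushes the aggregate state to a single
        # "interfaces" service on the device.
        HOST="$1"; PORT="$2"; STATE="$3"
        STATEDIR="/var/lib/trap-correlate/$HOST"
        mkdir -p "$STATEDIR"

        if [ "$STATE" = "down" ]; then
            touch "$STATEDIR/$PORT"      # remember this port as down
        else
            rm -f "$STATEDIR/$PORT"      # a linkUp trap clears it again
        fi

        DOWN=$(ls "$STATEDIR" | sort -n | tr '\n' ' ')
        if [ -z "$DOWN" ]; then
            EXIT=0; OUTPUT="OK: all ports up"
        else
            EXIT=2; OUTPUT="CRITICAL: ports down: $DOWN"
        fi

        # One passive result per device instead of one service per port
        curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
             'https://localhost:5665/v1/actions/process-check-result' \
             -d "{ \"type\": \"Service\",
                   \"filter\": \"host.name==\\\"$HOST\\\" && service.name==\\\"interfaces\\\"\",
                   \"exit_status\": $EXIT,
                   \"plugin_output\": \"$OUTPUT\" }"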

  • I think in either case you'll be monitoring the same number of ports, so for you it seems to be mostly a matter of presenting the information. If I were you, I would do it as I have configured it in my environment, without any correlation logic, because there is an option to search through the services I have configured. For example, say I need to find out what's going on with port 0/48 on my Cisco Catalyst. I'll go to the "Services" tab and use the filter option to search for services named "Port 0/48" on hosts named "Cisco Catalyst".


    You can refine this search even further if you like, as I'm sure you might need to. An alternative would be to use the "Hosts" tab instead, search for "Cisco Catalyst", then go to that host's "Services" tab, where you can find port 0/48 with relative ease, especially if it is flagged critical.
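    (If it helps: those filters can also be typed straight into the URL of the service list, something like the line below. The exact filter column names may differ between Icinga Web 2 versions, so treat this as an assumption rather than a recipe.)

        # Hypothetical example: filter the service list by host and service name
        https://icinga.example.com/icingaweb2/monitoring/list/services?host_name=*Catalyst*&service_description=*Port*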


    Furthermore (if you configure it like I have), you can present the information from a broader perspective with NagVis. It's a network mapping addon that you can integrate with Icinga2 to visualize your network better than a text-based interface does. Here's an example:




    You can see here that the host is critical because one of its services is down. This could be the logic you're looking for with your switches. From there, you can create a link to your host in the Icinga2 web interface so that you can remediate it easily. I just started using this addon and so far it's very nice, apart from a few minor bugs.


    Let me know what you think.


    P.S.


    Regarding the multiline output -


    I'm not sure what plugin you're using to check your port interfaces, but I use "check_snmp_int.pl" and the output is usually only 1 line (interface name, up/down status, throughput percentages, and overall status of the port). Regardless, I have seen multiline output in the "History" tab with a different check that I use, check_wmi.
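    For what it's worth, a typical call on my side looks roughly like the line below (host, community and interface name are placeholders; check the plugin's help output for the options your version actually supports):

        # Hypothetical example call for the Manubulon interface plugin
        # -H host, -C community, -2 SNMPv2c, -n interface-name regexp, -f perfdata
        ./check_snmp_int.pl -H 192.0.2.10 -C public -2 -n "GigabitEthernet0/1" -f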

  • Hi,


    The problem I see with your approach, which we have also considered, is that it does not really scale. At the moment we would have around 5000 checks, but that number will increase soon. If we do not find another solution we may end up using several icinga satellite servers to balance the load, but that is something we would like to avoid if possible.
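    (Just to illustrate what I mean by that: as far as I understand, such a setup boils down to zone/endpoint definitions roughly like the sketch below, where checks are load-balanced between the endpoints of a zone. The names and addresses are placeholders.)

        # Hypothetical zones.conf sketch: two satellites in one zone share the checks
        object Endpoint "satellite1.example.com" { host = "192.0.2.11" }
        object Endpoint "satellite2.example.com" { host = "192.0.2.12" }

        object Zone "network-devices" {
          endpoints = [ "satellite1.example.com", "satellite2.example.com" ]
          parent = "master"    # the zone of the master server(s)
        }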


    Regarding NagVis: it is a very good piece of software that we already use.


    And lastly, about the multiline history information of a service: can you tell me which icinga/Icingaweb2 versions you are using? In our current version we can see multiline output for the check's current output, but as soon as it becomes history the output is "cut" to only the first line. I have attached two images showing this "problem".


    Current Output information





    History information; we can only see the first line.


  • I see, but I feel like in either case discussed above, the same number of checks would have to be executed to check the same number of ports. Or perhaps there is a better way, like you describe. Maybe dnsmichi might know?


    I'm currently on version 2.6.1 of Icinga2 and 2.4.1 of Icingaweb2.


    Here's a screenshot of the History tab:



    I think it may also depend on the plugin that you're using, but I am not sure.

  • Although your icinga2 and icingaweb2 versions are newer than ours, I am not sure that is really multiline history output; let me explain.


    In our case, if the plugin output returns a very long line, the history shows all the information; however, you see it as multiline because of web-browser wrapping. If you had a gigantic screen where this whole long line fits, you would probably see it as a single line.


    On the other hand, if your plugin output contains CRLFs or newlines, the history tab only shows the first line, possibly for better readability, as in the example I showed.


    Nevertheless, this is a bit off-topic from the original question, and I am afraid it seems no one has yet found a better solution to the problem I described in my original post :(.

  • Hey,

    based on your example, I can say that I also do it like watermelon and check every port separately.

    For this, I use the plugin check_snmp_int.pl ( http://nagios.manubulon.com/snmp_int.html )

    If you are worried about the number of checks, what about using an "apply ... for" rule and templates?

    For example, if you have 300 24-port switches, you can just create a template where all the ports are already defined. Because the ports are tied to the hostname, there will be no problems (internally it runs as switch1!FastEthernet1/1 and so on).

    So, as you have to define all the switches anyway, you can reduce the effort to three lines per switch: object name, template, IP.
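    A minimal sketch of what I mean (assuming the Manubulon "snmp-interface" CheckCommand from the ITL; the template names, variable names and interface names are just examples):

        # Hypothetical host definition: the per-port settings live in a host variable
        object Host "switch1" {
          import "24-port-switch-template"
          address = "192.0.2.21"
          vars.interfaces["FastEthernet1/1"] = {}
          vars.interfaces["FastEthernet1/2"] = {}
          # ... up to FastEthernet1/24, or defined once in the template
        }

        # One rule generates a service per port (switch1!FastEthernet1/1 and so on)
        apply Service for (interface => config in host.vars.interfaces) {
          import "generic-service"
          check_command = "snmp-interface"     # requires: include <manubulon>
          vars.snmp_interface = interface
          vars += config
        }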

    That's it :)


    Regards,


    Marcus

  • pacofer


    Ah, I understand. Perhaps support for multiline output in the History tab will be a feature in the future. You should suggest it!


    MarcusCaepio


    I think pacofer's problem is not necessarily with deploying the definitions, but with scaling the resources for them. 5000+ checks would probably require considerably more resources than he would like to use (which is why he mentioned distributed monitoring with several satellites). I'm not sure that your solution would reduce the number of checks; it just aids in the deployment of the checks themselves (tell me if I'm wrong).

  • watermelon


    You are right, I do not have problems with using "apply for" rules, templates, etc. In fact I think this would be the "easy" part. Of course, not all devices have the same number of interfaces and/or interface names, but that is not our concern at the moment. Our concern is the high number of checks in terms of compute resources for the monitoring server.


    I will look for information regarding the necessary/recommended resources relative to the number of checks.


    Thank you all for your help and suggestions.