Icinga2 Massive Load Issues monitoring multiple router/switch Interfaces

  • Hi,

    First of all, I want to apologize for any formatting issues I might cause, since I am new to this portal.

    I'm in the middle of setting up a new, more flexible monitoring environment in my company. For this I need to monitor a variety of routers and switches as well as the traffic counters of their individual interfaces. I found the plugin check_nwc_health to be almost exactly what I have been looking for.

    But in a larger environment the plugin seems to eat up a really high amount of resources on the monitoring host (CentOS 7). I've got a distributed setup with hosts connecting to my master and issuing commands on its behalf.

    Everything works really well as long as I use the plugin without any additional parameters (just the router address and SNMP community, although I needed to raise the timeout to 60 seconds for nwc_health to return in time). On the router I have excluded the ARP tables from SNMP, because check_nwc_health was really driving up the router's CPU while walking through them.

    Now I would like to measure the interfaces on routers/switches and have the dashboards in Grafana adjusted overnight (via cron), so that configuration changes on the interfaces are covered automatically. I would like to run a script that logs in to every router at night, gathers information based on the interface descriptions, generates a configuration from those parameters (for each router and switch) and (re-)generates the dashboards for each connection implemented in the setup.
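
    To make the generation step more concrete, here is a minimal sketch of how the nightly script could turn the gathered interface data into Icinga 2 service definitions. The `Interface` record, the `quote` and `render_service` helpers and the example values are made up for illustration. The check command name `nwc_health` and the `vars.nwc_health_*` custom variables are assumptions about how the plugin is wired up on the monitoring host and may differ in a real setup.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    host: str          # name of the Icinga 2 host object (hypothetical)
    name: str          # interface name as reported by the device
    description: str   # free-text interface description from the device


def quote(value: str) -> str:
    """Return value as a double-quoted config string with escapes applied."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def render_service(iface: Interface) -> str:
    """Render one Service object for a single interface."""
    # Fall back to the raw interface name when no description is set.
    label = iface.description.strip() or iface.name
    lines = [
        f"object Service {quote(label)} {{",
        f"  host_name = {quote(iface.host)}",
        # Assumed command and variable names, adjust to the local setup.
        '  check_command = "nwc_health"',
        '  vars.nwc_health_mode = "interface-usage"',
        f"  vars.nwc_health_name = {quote(iface.name)}",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_service(Interface("router1", "Gi0/1", "uplink to core")))
```

    The script would write one such block per interface into a file per router or switch and reload Icinga 2 afterwards. How the interface data is collected from the devices is left out here.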

    I have been playing around with the idea of developing a "description-based" naming scheme for my checks, to make searching easier in icingaweb2 and Grafana. Since I have to implement different scenarios, like HSRP or VLAN-based connections on the same router or across different routers, I want to measure each interface individually and, in particular, give each check a name based on the usage of the connection, something a human can work with, instead of the bare interface name that nwc_health uses by default. (Login on the routers is limited, and it would be great if the documentation could be automated from the descriptions of the individual interfaces.)
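
    As a rough illustration of the naming idea, the sketch below derives a searchable check name from an interface description. The `if-` prefix, the function name `service_name` and the example description are assumptions of mine and not an existing convention of nwc_health or Icinga 2.

```python
import re


def slug(text: str) -> str:
    """Lowercase text and collapse every run of other characters into '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def service_name(description: str, ifname: str) -> str:
    """Build a check name from the description, falling back to the interface name."""
    return "if-" + (slug(description) or slug(ifname))


if __name__ == "__main__":
    print(service_name("HSRP: Uplink to Core (VLAN 120)", "Gi0/1"))
    print(service_name("", "Gi0/1"))
```

    Two interfaces with the same description on the same router would end up with the same name, so the descriptions would have to be unique per device, or the interface name would have to be appended.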

    Up to now, I have been trying the following:

    In the worst-case scenario (no services exist yet and everything is generated for the first time) I am trying to set up roughly 2500 new instances of check_nwc_health. This really breaks the system and results in a load average of > 600 ;( (the host has 4 cores and 16 GB of memory).

    As far as I have read, there is no longer an option to limit the number of checks Icinga 2 tries to fire off at once. I would like to limit them to a maximum of x at a time, so that new checks only start once running ones have finished instead of all being started at the same time. It would not be critical if the whole run took longer (2-3 hours in the worst case would be acceptable), as long as the load does not exceed a certain limit and other (already working) checks are not affected in the meantime.
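
    To show the behaviour I am after, here is a minimal sketch of "at most x at a time" using a fixed-size worker pool. This is not an Icinga 2 feature and not a drop-in solution, since Icinga 2 schedules its checks itself. The function name `run_throttled` and the example values are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_throttled(jobs: Iterable[Callable[[], T]], max_parallel: int) -> List[T]:
    """Run all jobs, but never more than max_parallel at the same time.

    Results are returned in the order in which the jobs were given.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Every job is queued immediately, but only max_parallel worker
        # threads exist, so the rest waits until a worker becomes free.
        futures = [pool.submit(job) for job in jobs]
        return [future.result() for future in futures]


def plugin_job(argv: List[str]) -> Callable[[], int]:
    """Wrap one plugin call so that it can be queued; returns its exit code."""
    def job() -> int:
        return subprocess.run(argv, capture_output=True, timeout=60).returncode
    return job


if __name__ == "__main__":
    # Placeholder jobs instead of real plugin calls.
    print(run_throttled([lambda n=n: n * n for n in range(5)], max_parallel=2))
```

    With 2500 checks that each take up to 60 seconds and, say, 20 running in parallel, the worst case would be about 2500 × 60 / 20 = 7500 seconds, a little over two hours, which would still be within what I could accept. The value 20 is only an example.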

    If there is no such functionality anymore, would it be possible to reimplement it? Or is nwc_health the wrong way to go?

    Would it be possible to use another plugin that does not use up so many resources? Or would it be the same with, let's say, 2500 plain SNMP checks at the same time?

    I am pretty stuck at this moment. Any help would be greatly appreciated :thumbsup: