False perfdata in icingaweb2/pnp

  • For specific services on specific hosts (there is no obvious pattern) we get false perfdata. The data we get are physically impossible (e.g. a throughput of 4 Pbit/s where only 2 Gbit/s are connected). After some research I am sure that the perfdata returned by the plugin is correct. To me it seems that icingaweb2 (or even icinga2) processes the perfdata incorrectly.


    Keep in mind that the same check plugin works fine on other hosts.


    Here is an example output of the check plugin:

    I am not really sure which information would be helpful in this case. Please let me know what you need; I'll be happy to provide the data.

  • We use the following versions:

    icinga2: v2.5.4-2-g86d14ed (Update to 2.6.2 is already planned)

    pnp Plugin: 1.0.1

    icingaweb2: 2.3.4


    When I say I'm sure that the check plugin works fine, I mean that I've manually recorded the plugin output over several days and it has never failed. If there is a possibility to record the check results in icinga (a debug log, maybe?), we can try that as well.


    From Icingaweb2 I can only show the history. I've marked some values:


    pnp view from the same time:


  • Hm. I'd do the following ...


    - Enable debuglog and ensure enough disk space is available for /var/log/icinga2

    - Enable debug level in NPCD to gather the process_perfdata.pl calls

    - Correlate those peaks with the logged events
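
    A minimal sketch of the first two steps, assuming a default Icinga 2 / pnp4nagios layout (service names and the exact npcd.cfg option may differ on your system):

    Code
    # Icinga 2: enable the debuglog feature (writes to /var/log/icinga2/debug.log)
    icinga2 feature enable debuglog
    systemctl restart icinga2
    # NPCD: raise the log level in npcd.cfg so the process_perfdata.pl calls are
    # logged to npcd.log; on a default pnp4nagios install this should be the
    # log_level option (e.g. log_level = -1 for "log everything"), but check the
    # comments in your npcd.cfg
    systemctl restart npcd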


    Such high values might come from a value overflow (e.g. only 32 bits are used somewhere, but the plugin returns more than that).
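

    Just to give a feel for the orders of magnitude involved (all numbers below are invented, not taken from your setup): if, for example, a counter reset is treated as a wrap of a 64-bit counter, the computed rate immediately lands in the petabit range.

    Code
    awk 'BEGIN {
        prev     = 1000000000000   # previous octet counter sample (invented)
        curr     = 500000000       # counter restarted, e.g. after a reboot
        interval = 300             # seconds between two checks (invented)
        delta    = (2^64 - prev) + curr   # reset wrongly handled as a 64-bit wrap
        printf "apparent rate: %.0f Pbit/s\n", delta * 8 / interval / 1e15
    }'
    # prints roughly 492 Pbit/s, which is physically impossible on a 2 Gbit/s link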


    Which plugin are we talking about? I'd like to see its source code to get a better idea of what you're talking about.


    Could you also attach the entire host, service and check command object configuration?

    It would also help to get the executed command line from Icinga 2 itself - https://docs.icinga.com/icinga…g#checks-executed-command
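
    With the debuglog feature enabled, the executed command lines end up in the debug log at notice level, so something like this should pull them out (log path from a default install):

    Code
    grep "notice/Process: Running command" /var/log/icinga2/debug.log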

  • We are talking about check_nwc_health with mode interface-usage. https://github.com/lausser/che…ent/InterfaceSubsystem.pm


    It uses the SNMP high-capacity interface counters. These counters are 64 bits wide.
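

    To rule out the device itself, I could also sample the same HC counters manually and compare two polls, roughly like this (the ifIndex 2 is just a placeholder, and the IF-MIB has to be available to snmpget):

    Code
    snmpget -v2c -c <community> <ip-address> IF-MIB::ifHCInOctets.2 IF-MIB::ifHCOutOctets.2
    sleep 300
    snmpget -v2c -c <community> <ip-address> IF-MIB::ifHCInOctets.2 IF-MIB::ifHCOutOctets.2
    # rate [bit/s] = (second sample - first sample) * 8 / 300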


    debuglog is already enabled with severity "debug". I will also enable NPCD debug and try to correlate the data.



    These are the configurations (resolved preview from the Director):



    This is the command line:

    Code
    [2017-03-03 03:18:03 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_nwc_health' '--community' 'xxxxxx' '--hostname' '<ip-address>' '--mode' 'interface-usage' '--report' 'short' '--timeout' '90': PID 17201
    [2017-03-03 03:18:03 +0100] debug/CheckerComponent: Check finished for object 'DE-KA-DMZ-Filter01!Interface Usage'

  • OK, I've now tried to correlate the logs:



    This is /var/log/icinga2/debug.log


    This is /var/log/pnp4nagios/perfdata.log


    /var/log/pnp4nagios/npcd.log (nothing special)



    As you can see in debug.log, the check returns with exit code 2. Later, perfdata.log shows a very high usage of the bond1 interface (far above 100%). This may be the reason for the critical state. That would mean the check plugin itself miscalculates something.


    Given that I've run the check on this host manually for several days without seeing any unexpected values, and that this service runs on hundreds of different hosts without any issues, I'm fairly sure that the plugin is not the root cause. Nevertheless, maybe I just haven't observed the right moment yet. I think it would be useful to log the exact output that icinga2 receives from the plugin. Is that possible?
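

    One idea would be to wrap the plugin in a small script that logs the raw output and exit code before handing them back to icinga2, something like this (wrapper and log paths are just an example):

    Code
    #!/bin/sh
    # Log the raw output and exit code that icinga2 receives from check_nwc_health,
    # then pass both through unchanged.
    LOG=/var/log/icinga2/nwc_health_raw.log
    OUT=$(/usr/lib64/nagios/plugins/check_nwc_health "$@" 2>&1)
    RC=$?
    {
        printf '%s rc=%d args:' "$(date '+%Y-%m-%d %H:%M:%S')" "$RC"
        printf ' %s' "$@"
        printf '\n%s\n' "$OUT"
    } >> "$LOG"
    printf '%s\n' "$OUT"
    exit $RC

    The check command would then point at the wrapper instead of the plugin (via the Director in our case), so every execution gets recorded.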

  • It took a long time, but I now have evidence that the plugin script or the device itself is messing up the perfdata. Thank you for your fast support.