Nagios 4.3.1 Crashes when receiving data through broker_module

  • I have upgraded Nagios from 4.2.4 to 4.3.1 (luckily only on my development box) and now it crashes with a SIGSEGV / SIGTERM repeatedly (about once a minute).

    To me it looks like the problem occurs when a broker_module sends data "back" to Nagios.

    I base this on the following observations:

    1) If I disable both (mod_gearman & mk-livestatus) in nagios.cfg, everything works OK.

    2) If I enable mk-livestatus, but do not feed data back to nagios (thruk - business processes) through mk-livestatus, it still works OK.

    3) If I enable mod_gearman, or let thruk feed data back to nagios through mk-livestatus, it starts crashing.


    Sadly, the only things I can see in the Nagios log are:

    Caught SIGSEGV, shutting down...

    Caught SIGTERM, shutting down...


    In the debug-log I do not see anything strange.

    Here are my SW releases:

    OS: RHEL 7.3

    Nagios 4.3.1 (built from source)

    mod_gearman 3.0.1-1 (labs.consol.de)

    gearmand 0.33-5 (labs.consol.de)

    mk-livestatus 1.2.8p18 (built from source)


    Is anybody out there using Nagios 4.3.1 with mod_gearman and/or mk-livestatus?

    Any other suggestions?


    Regards



  • It might be worth comparing the header files bundled with mod_gearman against those in Nagios itself. Maybe Nagios broke binary compatibility for NEB modules by changing the structs defined in those header files.


    https://github.com/sni/mod_gea…ee/master/include/nagios4 vs https://github.com/NagiosEnter…score/tree/master/include
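
    To illustrate why a changed struct in a header can break an already compiled module, here is a minimal sketch. The two structs and the added member are hypothetical and are not taken from the Nagios or mod_gearman headers; they only show how one extra field moves every member after it.

```c
#include <assert.h>
#include <stddef.h>

struct customvar; /* opaque list node type, hypothetical */

/* Hypothetical layout the module was compiled against (older headers). */
struct service_old {
    char *description;
    int check_interval;
    struct customvar *custom_variables;
};

/* Hypothetical layout the running core uses (newer headers). */
struct service_new {
    char *description;
    int check_interval;
    char *added_in_new_release; /* hypothetical member added upstream */
    struct customvar *custom_variables;
};

/* Number of bytes by which custom_variables moved between the two layouts. */
size_t custom_vars_offset_shift(void) {
    size_t old_off = offsetof(struct service_old, custom_variables);
    size_t new_off = offsetof(struct service_new, custom_variables);
    assert(new_off >= old_off); /* an added member can only push it further out */
    return new_off - old_off;
}
```

    If the shift is non-zero, a module built with the old layout reads or writes custom_variables at the wrong offset inside an object created by the core, so it would pick up whatever bytes happen to sit there.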

  • Doing a quick (yes, really quick) check reveals that some header files have changed.

    Here are the changes that I found doing a diff (#ifndef/#define skipped):


    I am no programmer (any more - that was a long time ago), but some of those diffs do not look so good.

  • You could try running nagios in foreground with gdb and then generate a full backtrace. I've never installed Nagios4, but I know some coding and debugging foo.


    If you're looking for gdb instructions, you may borrow some from icinga2 and adjust the paths / runtime arguments (ignore the pretty printer section).

    https://docs.icinga.com/icinga…ent#development-debug-gdb



  • If you run


    Code
    (gdb) p this_customvariablesmember

    this probably returns 0x0, right?


    Frame #4 performs the NEB callback and jumps into mod_gearman's code in frame #3, handle_svc_check().


    This calls clear_volatile_macros_r() and clear_contact_macros_r() back inside Nagios code.


    https://github.com/NagiosEnter…ter/common/macros.c#L2843


    That code is really old, so I would guess that memory has somehow been corrupted in the linked list that is passed in via the start pointer


    Code
    customvariablesmember **vars


    Also, the code lacks any null-pointer checks, so a null pointer there leads directly to a SIGSEGV.
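
    For comparison, this is roughly what a defensive version of such a list cleanup could look like. It is only a sketch: cv_node, cv_push and cv_clear are made-up names, and the node type is a simplified stand-in rather than the real customvariablesmember struct.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for a custom variable list node (hypothetical). */
typedef struct cv_node {
    char *name;
    char *value;
    struct cv_node *next;
} cv_node;

/* Heap copy of a string; returns NULL if allocation fails. */
static char *copy_string(const char *s) {
    char *c = malloc(strlen(s) + 1);
    if (c != NULL)
        strcpy(c, s);
    return c;
}

/* Prepend a node; returns 0 on success, -1 on allocation failure. */
int cv_push(cv_node **head, const char *name, const char *value) {
    assert(head != NULL && name != NULL && value != NULL);
    cv_node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->name = copy_string(name);
    n->value = copy_string(value);
    n->next = *head;
    *head = n;
    return 0;
}

/* Free every node and reset the head; returns the number of nodes freed. */
int cv_clear(cv_node **head) {
    int freed = 0;
    if (head == NULL)
        return 0; /* nothing to do without a start pointer */
    cv_node *cur = *head;
    while (cur != NULL) {
        cv_node *next = cur->next; /* read next before freeing cur */
        free(cur->name);
        free(cur->value);
        free(cur);
        cur = next;
        freed++;
    }
    *head = NULL; /* leave the list in a known empty state */
    return freed;
}
```

    Note that checks like these only protect against NULL. If a node pointer holds garbage that is not NULL, the loop still dereferences it and crashes.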


    I'd navigate inside gdb and use "up" to step up until frame #3 is reached and then print local variables in this scope to debug further.


    https://github.com/sni/mod_gea…module/mod_gearman.c#L849


    This code snippet calls clear_volatile_macros_r(&mac). Maybe mod_gearman sets its own custom vars, or the structs have changed in this region.


    This is far beyond my knowledge, but I would collect that information and open an issue over at mod_gearman's issue tracker on GitHub.

  • Code
    Program received signal SIGSEGV, Segmentation fault.
    clear_custom_vars (vars=vars@entry=0x7ffffffed940) at ../common/macros.c:2851
    2851 my_free(this_customvariablesmember->variable_name);
    Missing separate debuginfos, use: debuginfo-install boost-system-1.53.0-26.el7.x86_64 gearmand-0.33-5.x86_64 glibc-2.17-157.el7_3.1.x86_64 libgcc-4.8.5-11.el7.x86_64 libstdc++-4.8.5-11.el7.x86_64 libuuid-2.23.2-33.el7.x86_64 sssd-client-1.14.0-43.el7_3.11.x86_64
    (gdb) p this_customvariablesmember
    $1 = (customvariablesmember *) 0x2d33302d37313032

    So no, it does not return 0x0. :(
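
    A pointer value like this is worth reading as raw bytes, because it may not be an address at all. The helper below is a small sketch (the function name is made up) that lists the eight bytes of a 64-bit value lowest byte first, which is the order they have in memory on x86_64, and shows them as text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write the 8 bytes of value into out, lowest byte first, followed by NUL.
   Bytes that are not printable ASCII are shown as '.'. */
void pointer_bytes_as_text(uint64_t value, char out[9]) {
    assert(out != NULL);
    for (size_t i = 0; i < 8; i++) {
        unsigned char b = (unsigned char)((value >> (8 * i)) & 0xffu);
        out[i] = (b >= 0x20 && b < 0x7f) ? (char)b : '.';
    }
    out[8] = '\0';
}
```

    Called with 0x2d33302d37313032 this yields "2017-03-", which looks like the start of a date string. That would be consistent with string data sitting where the code expects a list pointer, as you would see if the module and the core disagree about a struct layout, although the output alone does not prove it.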


    OK, I'll head over to GitHub and open an issue.


    Thanks again for your support.


    D/\N

  • F.Y.I.

    Compiling mod_gearman with the Nagios-4.3.2 headers (replacing all headers in include/ and include/lib/, except epn_utils.h, with the ones from the Nagios sources) seems to fix the issue for me. I will let it run on my test rig for a few days, then I will update my production rig.


    D/\N