gearman is killing nagios

  • Guys, help.


    It looks like mod_gearman is killing the nagios process.


    Server usage


    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r b swpd     free buff   cache si so bi bo  in   cs us sy  id wa st
     1 0    0 51323468 1016 4935160  0  0  0  0  89  387  2  1  97  0  0
     0 0    0 51316132 1016 4935656  0  0  0  0  54  249  1  0  99  0  0
     0 0    0 51312808 1016 4935672  0  0  0  0 605 1458  2  3  95  0  0
     0 0    0 51314396 1016 4935656  0  0  0 60 105  386  1  0  99  0  0
     0 0    0 51314396 1016 4935656  0  0  0  0  58  230  1  0  99  0  0
     0 0    0 51314164 1016 4935676  0  0  0  0  45  201  0  0 100  0  0
     0 0    0 51314220 1016 4935676  0  0  0  0  47  184  0  0 100  0  0
     0 0    0 51312760 1016 4935700  0  0  0 48 668 1680  4  3  93  0  0
     0 0    0 51317140 1016 4935720  0  0  0 53  51  199  0  0 100  0  0







    OMD status up to the point where the problem occurs


    OMD[rc_prod]:~$ omd status

    gearmand: running

    rrdcached: running

    gearman_worker: running

    npcd: running

    nagios: running

    apache: running

    xinetd: running

    crontab: running

    -----------------------

    Overall state: running

    OMD[rc_prod]:~$ while true ; date ; do omd status | grep stopped ; sleep 5 ; done

    Mon Jul 24 16:20:04 UTC 2017

    Mon Jul 24 16:20:09 UTC 2017

    Mon Jul 24 16:20:14 UTC 2017

    Mon Jul 24 16:20:20 UTC 2017

    Mon Jul 24 16:20:25 UTC 2017

    Mon Jul 24 16:20:31 UTC 2017

    Mon Jul 24 16:20:36 UTC 2017

    Mon Jul 24 16:20:42 UTC 2017

    Mon Jul 24 16:20:47 UTC 2017

    Mon Jul 24 16:20:52 UTC 2017

    Mon Jul 24 16:20:58 UTC 2017

    Mon Jul 24 16:21:03 UTC 2017

    Mon Jul 24 16:21:09 UTC 2017

    Mon Jul 24 16:21:14 UTC 2017

    Mon Jul 24 16:21:19 UTC 2017

    Mon Jul 24 16:21:25 UTC 2017

    Mon Jul 24 16:21:30 UTC 2017

    Mon Jul 24 16:21:36 UTC 2017

    Mon Jul 24 16:21:41 UTC 2017

    Mon Jul 24 16:21:46 UTC 2017

    Mon Jul 24 16:21:52 UTC 2017

    Mon Jul 24 16:21:57 UTC 2017

    Mon Jul 24 16:22:03 UTC 2017

    Mon Jul 24 16:22:08 UTC 2017

    Mon Jul 24 16:22:14 UTC 2017

    Mon Jul 24 16:22:19 UTC 2017

    Mon Jul 24 16:22:24 UTC 2017

    Mon Jul 24 16:22:30 UTC 2017

    Mon Jul 24 16:22:35 UTC 2017

    Mon Jul 24 16:22:41 UTC 2017

    Mon Jul 24 16:22:46 UTC 2017

    Mon Jul 24 16:22:51 UTC 2017

    Mon Jul 24 16:22:57 UTC 2017

    nagios: stopped

    Mon Jul 24 16:23:02 UTC 2017

    nagios: stopped
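A minimal sketch of the same kind of watch loop, assuming that "omd status" prints one "name: state" line per service as in the output above. The helper name report_stopped is made up for this example; it reads the status text on stdin and prints a timestamp only for services that are reported as stopped, which keeps the log short.

```shell
#!/bin/sh
# report_stopped (hypothetical helper)
# Reads "omd status"-style output on stdin and prints a UTC timestamp
# followed by each line that reports a stopped service.
report_stopped() {
  while IFS= read -r line; do
    case "$line" in
      *stopped*) printf '%s %s\n' "$(date -u '+%F %T')" "$line" ;;
    esac
  done
}

# Usage as the site user (loops until interrupted with Ctrl-C):
#   while true; do omd status | report_stopped; sleep 5; done
```

Reading from stdin rather than calling omd inside the function makes it possible to try the filter on saved output first.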



    Here it is possible to see the external command sent by gearman:


    munmap(0x7fb5158ca000, 4096) = 0

    gettimeofday({1500913375, 62329}, NULL) = 0

    gettimeofday({1500913375, 62418}, NULL) = 0

    gettimeofday({1500913375, 62496}, NULL) = 0

    gettimeofday({1500913375, 62558}, NULL) = 0

    gettimeofday({1500913375, 62598}, NULL) = 0

    stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0

    stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0

    gettimeofday({1500913375, 62731}, NULL) = 0

    stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0

    gettimeofday({1500913375, 62881}, NULL) = 0

    gettimeofday({1500913375, 62914}, NULL) = 0

    gettimeofday({1500913375, 62946}, NULL) = 0

    gettimeofday({1500913375, 62978}, NULL) = 0

    gettimeofday({1500913375, 63009}, NULL) = 0

    writev(2, [{"/omd/sites/rc_prod/bin/nagios", 29}, {": ", 2}, {"symbol lookup error", 19}, {": ", 2}, {"/omd/sites/rc_prod/lib/mod_gearm"..., 56}, {": ", 2}, {"undefined symbol: notification_r"..., 42}, {"", 0}, {"", 0}, {"\n", 1}], 10) = 153

    exit_group(127)
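The writev call above is the dynamic linker reporting an undefined symbol whose name starts with notification_r in a file under the mod_gearman library directory; strace has truncated both the path and the symbol name (running strace with a larger -s value would show them in full). A rough sketch of how one might compare what the module needs with what the nagios binary exports, using GNU nm. The helper names and the MODULE path are placeholders invented for this example, not taken from the trace.

```shell
#!/bin/sh
# symbols_matching FLAG FILE PREFIX (hypothetical helper)
# Lists the dynamic symbols of FILE whose names start with PREFIX.
# FLAG is --undefined-only (symbols FILE needs) or
# --defined-only (symbols FILE provides).
symbols_matching() {
  nm -D "$1" "$2" | awk -v p="$3" '$NF ~ "^" p { print $NF }' | sort -u
}

# missing_symbols NEEDED_LIST PROVIDED_LIST (hypothetical helper)
# Prints every name in NEEDED_LIST (one per line) that does not
# appear in PROVIDED_LIST.
missing_symbols() {
  grep -Fxv -f "$2" "$1"
}

# Usage (MODULE is a placeholder; the real path is truncated in the trace):
#   MODULE=/path/to/the/mod_gearman/module
#   symbols_matching --undefined-only "$MODULE" notification_r > needed.txt
#   symbols_matching --defined-only /omd/sites/rc_prod/bin/nagios notification_r > provided.txt
#   missing_symbols needed.txt provided.txt
```

Any name printed by the last step would be a symbol the module expects from the core but the installed nagios binary does not provide, which is what a "symbol lookup error" at runtime indicates.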



    I did not find any issue like this here in the forum.


    Machine OS


    [root@azr-weul2001 ~]# cat /etc/redhat-release

    Red Hat Enterprise Linux Server release 7.3 (Maipo)

    [root@azr-weul2001 ~]# uname -a

    Linux azr-weul2001 3.10.0-514.26.1.el7.x86_64 #1 SMP Tue Jun 20 01:16:02 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

  • OMD[rc_prod]:~$ omd version

    OMD - Open Monitoring Distribution Version 2.40-labs-edition

  • I set


    notifications=no


    in


    /omd/sites/rc_prod/etc/mod-gearman/server.cfg


    and now it looks like the problem has stopped, but I'm interested to know why this happens.
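A minimal sketch of applying that change from the shell, assuming server.cfg already contains a line starting with "notifications=". The function name set_notifications_off is made up for this example, and the restart step is an assumption about how the change gets picked up.

```shell
#!/bin/sh
# set_notifications_off CFG (hypothetical helper)
# Rewrites an existing "notifications=..." line in CFG to "notifications=no",
# keeps a backup as CFG.bak, and prints the resulting line with its line
# number so the change can be checked.
set_notifications_off() {
  sed -i.bak 's/^notifications=.*/notifications=no/' "$1" &&
  grep -n '^notifications=' "$1"
}

# Usage as the site user:
#   set_notifications_off /omd/sites/rc_prod/etc/mod-gearman/server.cfg
#   omd restart nagios   # assumption: the core has to be restarted to reload the module settings
```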

  • I'm with you.


    We were trying to set downtimes on a host, with a command like this:


    curl -k -u auto_user:auto_pass "https://1.2.3.4/sitename/check_mk/view.py?_username=auto_user&_do_actions=yes&_down_coment=setting+downtime&_down_to_date=2017-07-26&output_format=json&_do_confirm=yes&_down_from_date=2017-07-26&_down_to_time=06%3A23&_down_custom=Custom%2Btime_range&_secret=auto_pass&view_name=hoststatus&host=hostname&_transid=-1&_down_from_time=04%3A23"


    Every time we executed the command, "omd status" reported that nagios was stopped. The last entry in the nagios log file just before the crash was this:


    [1501043730] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;hostname;1501042980;1501050180;1;0;0;auto_user;setting downtime


    and there were no errors in the nagios logs.
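For reference, the fields of that log entry follow the Nagios external command format SCHEDULE_HOST_DOWNTIME;host_name;start_time;end_time;fixed;trigger_id;duration;author;comment. Read as UTC, the start and end values 1501042980 and 1501050180 are 2017-07-26 04:23 and 06:23, which matches the times in the URL. A small sketch that rebuilds the same line; the function name downtime_command is made up for this example.

```shell
#!/bin/sh
# downtime_command NOW HOST START END AUTHOR COMMENT (hypothetical helper)
# Prints a SCHEDULE_HOST_DOWNTIME external command line.
# NOW, START and END are Unix epoch seconds.
downtime_command() {
  # fixed=1, trigger_id=0, duration=0 (duration only matters for flexible downtimes)
  printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;0;%s;%s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# Usage:
#   downtime_command 1501043730 hostname 1501042980 1501050180 auto_user 'setting downtime'
# With GNU date, an epoch value can be decoded for checking:
#   date -u -d @1501042980
```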


    Here is what we got from the output of "strace -p <pid>" while we were crashing Nagios:


    writev(2, [{"/omd/sites/sitename/bin/nagios", 26}, {": ", 2}, {"symbol lookup error", 19}, {": ", 2}, {"/omd/sites/sitename/lib/mod_gearman/"..., 53}, {": ", 2}, {"undefined symbol: notification_r"..., 42}, {"", 0}, {"", 0}, {"\n", 1}], 10) = 147


    exit_group(127) = ?


    +++ exited with 127 +++



    After paring that error message down enough to search for it, we found this thread.


    From the note earlier in this thread:


    # defines if the module should distribute execution of
    # notifications.
    notifications=no


    And now our Nagios engine doesn't crash when we put hosts in downtime with the above URL.


    Thanks,

    Rob







  • No, in fact we are using the stable version


    biffhero, I am not using the notification capability yet.


    If you are using this capability, can you tell me whether anything was affected?

  • Thanks for your clarification, sni.


    Do you know when this fix will be available in the stable version?

  • zanoefel, thanks for posting that fix. I'd discovered the problem recently and put a cron job in place to restart nagios until I could get around to researching it. Luckily I came upon your post, which saved me a lot of effort. :)


    I tried to install a copy of the nightly build, as mentioned earlier, to see if this fix had been applied, but unfortunately that now seems to require that mariadb is installed. I run this version of OMD on a CentOS 7 server that also has a PerconaDB server on it, and obviously the install of mariadb fails. Is mariadb going to be a requirement in future? If it is, could the installer check for an already installed version (or variant) of mysql and use that?


  • Thanks for your reply.


    That's strange. I can't remember the sequence of what I did with that server, but I don't remember seeing mariadb get installed, and I subsequently installed Percona Server on there. I might well have replaced mariadb when I installed Percona, as this is a cluster test. What is the database used for in OMD?


    A couple of questions for the future: should it tie itself to mariadb, or could it check whether another variant of mysql has been installed and use that instead? Perhaps it would also be useful to have the ability to use a cluster instead of a single mysql instance?