Gearman does not appear to be receiving jobs to process

  • Hi all


    I have an instance of OMD running in my environment that I had to create in a hurry because of an issue we had with MS SCOM.


    This is the version of my OMD:


    OMD - Open Monitoring Distribution Version 2.40-labs-edition


    mod_gearman is enabled in the configuration.


    Checking the status of Gearman, it looks like it is not receiving jobs to process:


    2017-09-06 16:35:05 - localhost:4730 - v0.33


    Queue Name          | Worker Available | Jobs Waiting | Jobs Running
    --------------------|------------------|--------------|-------------
    check_results       | 1                | 0            | 0
    eventhandler        | 1000             | 0            | 0
    host                | 1000             | 0            | 0
    notification        | 1000             | 0            | 0
    service             | 1000             | 0            | 0
    worker_azr-weul2001 | 1                | 0            | 0


    How can I validate this?


    I did not find any issue in nagios.log, even after enabling all event types to be written to the Gearman log.
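    One way to validate this mechanically: the status table shows workers registered on every queue but zero jobs waiting or running, which points at the Nagios side (the NEB module) not submitting jobs, rather than at the workers. A minimal sketch that flags such idle queues, run here against the sample status output quoted above (in practice you would feed it real gearman_top output; option names vary by version):

```shell
#!/bin/sh
# Sketch: flag queues where workers are connected but nothing is waiting
# or running -- the symptom shown in the status table above.
idle=$(awk -F'|' '
NF == 4 && $1 !~ /Queue Name/ {
    # strip padding spaces from every column
    gsub(/ /, "", $1); gsub(/ /, "", $2); gsub(/ /, "", $3); gsub(/ /, "", $4)
    # workers present, but no jobs queued and none running
    if ($2 ~ /^[0-9]+$/ && $2 > 0 && $3 == 0 && $4 == 0)
        printf "idle queue: %s\n", $1
}' <<'EOF'
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------
check_results | 1 | 0 | 0
service | 1000 | 0 | 0
EOF
)
echo "$idle"
```

    If every queue is flagged like this while checks are due, the broker module is the place to look, not the worker.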

  • My team called me, and when I checked omd status, gearman_worker was stopped.

    Here is the evidence of the core dumps.



    The gearman service was stopped.



    /var/log/messages



    Sep 11 03:14:46 azr-weul2001 kernel: mod_gearman_wor[4176]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:46 azr-weul2001 abrt-hook-ccpp: Process 4176 (mod_gearman_worker) of user 990 killed by SIGSEGV - ignoring (repeated crash)

    Sep 11 03:14:47 azr-weul2001 kernel: mod_gearman_wor[4159]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:47 azr-weul2001 kernel: mod_gearman_wor[4205]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:47 azr-weul2001 abrt-hook-ccpp: Process 4205 (mod_gearman_worker) of user 990 killed by SIGSEGV - ignoring (repeated crash)

    Sep 11 03:14:47 azr-weul2001 abrt-hook-ccpp: Process 4159 (mod_gearman_worker) of user 990 killed by SIGSEGV - ignoring (repeated crash)

    Sep 11 03:14:56 azr-weul2001 kernel: mod_gearman_wor[4728]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634e0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:56 azr-weul2001 abrt-hook-ccpp: Process 4728 (mod_gearman_worker) of user 990 killed by SIGSEGV - dumping core

    Sep 11 03:14:56 azr-weul2001 abrt-server: Package 'mod_gearman' isn't signed with proper key
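    All of these segfaults report the same instruction pointer (ip 000000000040e708), which usually means one reproducible bug in the worker binary rather than random memory corruption. A quick way to confirm that across a large /var/log/messages is to extract and de-duplicate the ip values; a minimal sketch over the sample lines quoted above:

```shell
#!/bin/sh
# Sketch: collect the unique instruction pointers from kernel segfault
# lines. One unique value = one crash site. In practice:
#   grep 'mod_gearman_wor.*segfault' /var/log/messages | <this awk>
ips=$(awk '/segfault/ {
    for (i = 1; i <= NF; i++)
        if ($i == "ip") print $(i + 1)   # token after "ip" is the address
}' <<'EOF' | sort -u
Sep 11 03:14:46 azr-weul2001 kernel: mod_gearman_wor[4176]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]
Sep 11 03:14:47 azr-weul2001 kernel: mod_gearman_wor[4159]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]
Sep 11 03:14:56 azr-weul2001 kernel: mod_gearman_wor[4728]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634e0 error 4 in mod_gearman_worker[400000+1c000]
EOF
)
echo "$ips"
```

    If abrt kept a core, loading it in gdb against the mod_gearman_worker binary and running bt would name the crashing function (the exact binary path depends on the install).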



    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39793(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39791(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39857(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39875(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39825(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39797(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39804(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump
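    The "over core_pipe_limit ... Skipping core dump" messages mean the kernel refused to pipe more cores to the core_pattern helper (abrt here) because too many worker processes were crashing at the same time: the kernel.core_pipe_limit sysctl caps how many crashing processes may be piped concurrently, with 0 meaning no limit. If cores from such a crash storm are needed for debugging, a hypothetical sysctl fragment could raise it:

```ini
# Hypothetical /etc/sysctl.d/90-coredump.conf fragment (value is an
# example, not a recommendation): allow more concurrent crashing
# processes to be piped to the core_pattern helper. 0 = unlimited.
kernel.core_pipe_limit = 16
```

    Applied with sysctl --system (or sysctl -p on the file), this only matters while abrt-style piped core collection is in use.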





    OMD[rc_prod]:~/var/log/gearman$ vim worker.log

    [2017-09-17 00:00:07][9133][INFO ] reloading config was successful

    [2017-09-17 00:02:05][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:04:07][9133][INFO ] no checks in 2minutes, restarting all workers

    [... the same "no checks in 2minutes, restarting all workers" message repeats every two minutes until 00:52:37 ...]

    [2017-09-17 00:54:39][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:54:53][9133][ERROR] exiting by SIGKILL...

    [2017-09-17 02:18:36][19493][INFO ] mod_gearman worker daemon started with pid 19493




    gearmand.log


    Thousands of these errors:


    ERROR 2017-08-15 09:28:35.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

    ERROR 2017-08-15 09:28:35.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100



    ERROR 2017-08-17 00:45:51.000000 [ proc ] Dropped job due to max retry count: H:azr-weul2001:3550318 4ab1bfaa-a3c8-4c56-a954-bc26951eb15a -> libgearman-server/job.c:377
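    These two symptoms fit together: each time a worker segfaults mid-job, gearmand sees the connection drop (the EPIPE/ECONNRESET errors) and requeues the job; once the retry limit is exceeded it drops the job, as in the last line (gearmand's --job-retries option controls that limit, per its help output). When triaging a large gearmand.log, tallying errors by the source location after the "->" is a quick way to see which failure dominates. A minimal sketch over the three sample lines quoted above (in practice, point the awk at the real gearmand.log):

```shell
#!/bin/sh
# Sketch: tally gearmand ERROR lines by the source-code location that
# emitted them (the token after " -> ").
summary=$(awk -F' -> ' '/^ERROR/ { count[$2]++ }
    END { for (loc in count) printf "%d %s\n", count[loc], loc }' <<'EOF'
ERROR 2017-08-15 09:28:35.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
ERROR 2017-08-15 09:28:35.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
ERROR 2017-08-17 00:45:51.000000 [ proc ] Dropped job due to max retry count: H:azr-weul2001:3550318 4ab1bfaa-a3c8-4c56-a954-bc26951eb15a -> libgearman-server/job.c:377
EOF
)
echo "$summary"
```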

  • I upgraded my OMD from 2.40 to 2.60.


    Today I had the issue again, but this time Gearman did not die.


    I restarted the Nagios process and Gearman started processing jobs again. I had to repeat this restart at least 7 times until the environment became normal again.


    I saved the logs and config files, but they are too large to upload here (according to the forum warning).

    Can I send them to you directly? Then we can discuss the solution here.

  • Issue fixed


    First crash - gearmand core dump


    Installed the stable module (from the EPEL repository); the issue continued.

    Made the following adjustments in /omd/sites/rc_prod/etc/mod-gearman/nagios.cfg:


    event_broker_options=-1

    #broker_module=/omd/sites/rc_prod/lib/mod_gearman/mod_gearman_nagios3.o config=/omd/sites/rc_prod/etc/mod-gearman/server.cfg

    broker_module=/usr/lib64/mod_gearman/mod_gearman_nagios3.o config=/omd/sites/rc_prod/etc/mod-gearman/server.cfg


    Activated monitoring of the application.

    Turned on all debug modes (Nagios & Gearman).



    Spikes 1 & 2 - application crashed again


    Spike 1


    Nothing appeared in the debug logs.

    The application came back after restarting only the Nagios process a few times.

    Some flags were turned on:

    use_retained_scheduling_info=1



    Spike 2


    Nothing in the debug logs.

    Migrated OMD from 2.40 to 2.60.

    Lots of defunct processes ¬¬


    Spikes 3 & 4 - cache clean


    The attached graph caught my attention, because the others looked OK.


    We went to the Nagios Core tuning guide (https://assets.nagios.com/down…gioscore/3/en/tuning.html) and item 16 caught our attention.

    After checking with:


    hdparm -Tt <device>

    AND

    iostat


    We discovered that cache I/O over tmpfs was the bottleneck.


    To confirm our suspicion, we cleaned the cache and restarted the application:

    free -m && sync && echo 3 > /proc/sys/vm/drop_caches && free -m
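    To quantify what the drop_caches pass actually frees, you can compare the buff/cache column of the two free -m snapshots the one-liner already prints. A minimal sketch with hypothetical numbers (column positions assume a procps-ng free; in practice capture the real before/after output from the command above):

```shell
#!/bin/sh
# Sketch: extract the buff/cache column (6th field after "Mem:") from
# hypothetical before/after `free -m` lines and report the difference.
before='Mem:           7821        1204         310         120        6306        4210'
after='Mem:           7821        1198        6020         120         602        6350'
cache_before=$(echo "$before" | awk '{print $6}')
cache_after=$(echo "$after" | awk '{print $6}')
freed=$((cache_before - cache_after))
echo "page cache freed: ${freed} MiB"
```

    Note that drop_caches only releases clean, reclaimable cache; it is a diagnostic aid, not a fix for an undersized or slow-backed tmpfs.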



    After the cache clean, the environment became stable again.

    We are working on migrating to another machine; until then we are watching the graphs & the server-performance snapin.



    Thanks to Anderson Silva Brejeiro and Celson Vieira for helping with the troubleshooting.