Gearman does not appear to be receiving jobs to process

  • Hi all


    I have an OMD instance running in my environment that I had to set up in a hurry because of an issue we had with MS SCOM.


    This is my OMD version:


    OMD - Open Monitoring Distribution Version 2.40-labs-edition


    mod_gearman is set to ON in the configuration.


    Checking the status of gearman, it looks like it is not receiving any jobs to process:


    2017-09-06 16:35:05 - localhost:4730 - v0.33


    Queue Name | Worker Available | Jobs Waiting | Jobs Running

    ----------------------------------------------------------------------

    check_results | 1 | 0 | 0

    eventhandler | 1000 | 0 | 0

    host | 1000 | 0 | 0

    notification | 1000 | 0 | 0

    service | 1000 | 0 | 0

    worker_azr-weul2001 | 1 | 0 | 0

    ----------------------------------------------------------------------


    How can I validate this?


    I did not find any issue in nagios.log, nor in the gearman log, even after enabling logging for all event types.
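
    One way to cross-check the gearman_top figures is to ask gearmand directly with its plain-text "status" command on port 4730. The following is only a minimal sketch: the function names parse_status and fetch_status are hypothetical, and it assumes the reply consists of tab-separated lines (queue name, total jobs, running jobs, available workers) terminated by a line containing a single dot. Treating "waiting" as total minus running is also an assumption about how the total is counted.

```python
import socket


def parse_status(text):
    """Parse the reply to gearmand's text 'status' command.

    Assumed line format: name<TAB>total<TAB>running<TAB>available workers.
    The reply is assumed to end with a line containing only '.'.
    """
    queues = {}
    for line in text.splitlines():
        if line == ".":
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a queue line
        name, total, running, workers = parts
        queues[name] = {
            # assumption: 'total' counts queued plus running jobs
            "waiting": int(total) - int(running),
            "running": int(running),
            "workers": int(workers),
        }
    return queues


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # read until the terminating '.' line or until the peer closes
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")
```

    Calling parse_status(fetch_status()) as the site user a few times in a row should show whether the "waiting" and "running" counters of the host and service queues ever move away from 0. If they never do, the core is probably not submitting checks to gearmand at all.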

  • My team called me, and when I checked omd status, gearman_worker was stopped.

    Here is the evidence of the core dumps.



    The gearman service was stopped.



    /var/log/messages



    Sep 11 03:14:46 azr-weul2001 kernel: mod_gearman_wor[4176]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:46 azr-weul2001 abrt-hook-ccpp: Process 4176 (mod_gearman_worker) of user 990 killed by SIGSEGV - ignoring (repeated crash)

    Sep 11 03:14:47 azr-weul2001 kernel: mod_gearman_wor[4159]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:47 azr-weul2001 kernel: mod_gearman_wor[4205]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634a0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:47 azr-weul2001 abrt-hook-ccpp: Process 4205 (mod_gearman_worker) of user 990 killed by SIGSEGV - ignoring (repeated crash)

    Sep 11 03:14:47 azr-weul2001 abrt-hook-ccpp: Process 4159 (mod_gearman_worker) of user 990 killed by SIGSEGV - ignoring (repeated crash)

    Sep 11 03:14:56 azr-weul2001 kernel: mod_gearman_wor[4728]: segfault at 0 ip 000000000040e708 sp 00007ffc5e6634e0 error 4 in mod_gearman_worker[400000+1c000]

    Sep 11 03:14:56 azr-weul2001 abrt-hook-ccpp: Process 4728 (mod_gearman_worker) of user 990 killed by SIGSEGV - dumping core

    Sep 11 03:14:56 azr-weul2001 abrt-server: Package 'mod_gearman' isn't signed with proper key



    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39793(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39791(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39857(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39875(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39825(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39797(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Pid 39804(mod_gearman_wor) over core_pipe_limit

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump

    Sep 17 00:54:43 azr-weul2001 kernel: Skipping core dump
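
    To see whether every crash happens at the same place, the kernel segfault lines can be grouped by fault address and instruction pointer. This is a rough sketch only: the regular expression is written against the line format shown above, and the name summarize_segfaults is hypothetical. In the lines above the fault address is 0 and the instruction pointer is always 40e708, which would be consistent with a NULL pointer being read at one fixed spot in mod_gearman_worker, but that is an interpretation, not something the log proves.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   kernel: mod_gearman_wor[4176]: segfault at 0 ip 000000000040e708 sp ... error 4 in ...
SEGFAULT_RE = re.compile(
    r"kernel: (?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def summarize_segfaults(lines):
    """Count segfaults per (process name, fault address, instruction pointer)."""
    counts = Counter()
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if m:
            key = (m.group("comm"), int(m.group("addr"), 16), int(m.group("ip"), 16))
            counts[key] += 1
    return counts
```

    Passing an open file handle for /var/log/messages to summarize_segfaults and printing the counter gives one row per distinct crash location. A single row with a large count points at one reproducible bug rather than random memory corruption.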





    OMD[rc_prod]:~/var/log/gearman$ vim worker.log

    [2017-09-17 00:00:07][9133][INFO ] reloading config was successful


    [2017-09-17 00:00:07][9133][INFO ] reloading config was successful

    [2017-09-17 00:02:05][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:04:07][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:06:08][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:08:10][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:10:11][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:12:13][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:14:15][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:16:17][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:18:18][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:20:20][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:22:22][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:24:24][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:26:25][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:28:27][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:30:26][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:32:27][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:34:29][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:36:31][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:38:31][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:40:32][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:42:33][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:44:34][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:46:34][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:48:35][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:50:36][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:52:37][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:54:39][9133][INFO ] no checks in 2minutes, restarting all workers

    [2017-09-17 00:54:53][9133][ERROR] exiting by SIGKILL...

    [2017-09-17 02:18:36][19493][INFO ] mod_gearman worker daemon started with pid 19493
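
    The worker.log excerpt above can be summarized by extracting the timestamps of the "restarting all workers" messages and measuring the gaps between them. This is a small sketch under the assumption that every line follows the [timestamp][pid][LEVEL] message layout shown above. The names restart_events and gaps_in_seconds are hypothetical.

```python
import re
from datetime import datetime

# [2017-09-17 00:02:05][9133][INFO ] no checks in 2minutes, restarting all workers
LINE_RE = re.compile(
    r"^\s*\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\s*\] (.*)$"
)


def restart_events(lines, needle="restarting all workers"):
    """Return the timestamps of log lines whose message contains `needle`."""
    stamps = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and needle in m.group(4):
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return stamps


def gaps_in_seconds(stamps):
    """Return the time between consecutive events, in seconds."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]
```

    Run against the excerpt above, the gaps should come out at roughly two minutes each, which matches the "no checks in 2minutes" message: the worker restarts its children on every cycle because no check job ever reaches it.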




    gearmand.log


    There are thousands of these errors:


    ERROR 2017-08-15 09:28:35.000000 [ 2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109

    ERROR 2017-08-15 09:28:35.000000 [ 2 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100



    ERROR 2017-08-17 00:45:51.000000 [ proc ] Dropped job due to max retry count: H:azr-weul2001:3550318 4ab1bfaa-a3c8-4c56-a954-bc26951eb15a -> libgearman-server/job.c:377