mod_gearman orphaned host_check lost?

  • Hi,


    I had an orphaned host check for 3 days.

    Now I'm looking for some information on how this might happen.

    os: debian jessie
    Naemon-core: 1.0.3

    mod-gearman: 2.1.5

    gearman-job-server: 0.33-4


    naemon.cfg:

    check_for_orphaned_services=1

    check_for_orphaned_hosts=1


    gearman-job-server (module.cfg):

    orphan_host_checks=yes

    orphan_service_checks=yes

    use_uniq_jobs=on


    worker.conf:

    max-age=0

    (no settings for uniq)


    Situation:

    According to naemon.log, the host check was initially OK and then became orphaned.

    The host was up the whole time. We have a separate ping check, and all other services were fine as well.

    Code
    [1502494202] Warning: The check of host 'testhost' looks like it was orphaned (results never came back). I'm scheduling an immediate check of the host..
    [1502496302] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'testqueue' running?)

    At that time 2 different workers were serving the queue in question. Both were connected to gearman-job-server and doing their job.

    As you can see in the 2nd log entry, after log rotation the host was assumed to be DOWN in soft state 2/3.

    After this the host stayed in that state for some days. It was never recognized as orphaned again and no results were submitted.


    Problem:

    As I understand orphaned checks, naemon should always flag a check as orphaned if it is removed from the scheduling queue without a result coming back.

    According to the logs, no result was submitted for days. For days I only see the CURRENT HOST STATE log entry, always in soft state 2/3.

    Restarting the gearman_worker solved the problem. (A colleague did that, so I am missing some information, such as the gearman_top output at that time.)

    But queues and workers are monitored. The number of workers per queue is also monitored (to avoid having zombies running).

    So I am pretty sure everything was properly connected at that time.
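
    For reference, the kind of queue monitoring described here can be sketched as follows. This assumes the tab-separated layout of the gearmand admin `status` output (function name, total jobs, running jobs, available workers, terminated by a single `.`); the sample text, the threshold and the helper names `parse_status` and `suspicious_queues` are made up for illustration.

```python
from typing import Dict, List, NamedTuple


class QueueStatus(NamedTuple):
    total: int      # jobs queued or running
    running: int    # jobs currently handed to a worker
    workers: int    # workers registered for this function


def parse_status(text: str) -> Dict[str, QueueStatus]:
    """Parse 'name<TAB>total<TAB>running<TAB>workers' lines, stopping at a lone '.'."""
    queues = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == ".":
            break
        name, total, running, workers = line.split("\t")
        queues[name] = QueueStatus(int(total), int(running), int(workers))
    return queues


def suspicious_queues(queues: Dict[str, QueueStatus], max_waiting: int = 10) -> List[str]:
    """Return names of queues with no workers or with too many waiting jobs."""
    flagged = []
    for name, q in queues.items():
        waiting = q.total - q.running
        if q.workers == 0 or waiting > max_waiting:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical sample, not taken from the affected system.
SAMPLE = "hostgroup_test.queue\t1\t1\t2\ncheck_results\t0\t0\t1\nhost\t25\t0\t0\n.\n"

queues = parse_status(SAMPLE)
print(suspicious_queues(queues))
```

    Note that a single job sitting in "running" while workers are registered, as in the first sample line, looks healthy to a check like this, so green monitoring would not rule out one stuck job.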


    Question:

    Well... how could this happen? ;)
    None of the scenarios I can imagine explains this.

    My only possible explanation would be:

    t0: gearman-job-server sent the orphaned result to naemon but somehow left the job in its own queue (maybe because it is in jobs_running?)

    t0: naemon got the orphaned result from gearman-job-server and rescheduled the check.

    t1: gearman-job-server removed the job from the naemon queue and discarded it because use_uniq_jobs is active.

    t2: naemon forgot about it. gearman-job-server had already sent its orphaned result...

    t3: nobody cares.


    Would be very nice if somebody can help or explain the behavior ;)

  • Hi sni, sorry for the late response...


    Yes, there is only one naemon-core and one gearman-job-server running (master-server).
    The queue I am talking about is handled by 4 gearman_worker_daemons running on 4 separate machines (workers).


    Well... this setup is duplicated.

    So the 4 worker machines are, in the end, also connected to another master_server through 4 additional gearman_worker_daemons.

    But I think this does not matter at all, because each of the gearman_worker_daemons runs as its own process.


    The fact that the queue is handled by more than one worker might lead to the problem... but still I cannot imagine how this happened.


    Let me know if you have an idea or need more information ;) Any help would be very nice ;)

  • Hi sni, thx for your answer.

    Just to make this clear: the 2 naemon cores run on 2 different physical machines. Each naemon core has its own gearman-job-server.

    So they do not really know about each other.

    On the worker side there are (to keep it simple) 2 mod-gearman-worker daemons running on the same virtual machine.

    Each of the worker daemons is only connected to one of the gearman-servers.


    As the worker for each gearman-server has its own config and process, I don't understand what could be the problem here.
    Can you maybe explain it a bit deeper?

    It only appears on one master server. The other has no orphaned services.
    As I am still using naemon 1.0.3, maybe it's already fixed... I saw some orphan-related stuff in the changelog of 1.0.6.

    It's just very strange that naemon-core does nothing for days:

    Full log from the beginning until the restart of the worker:

    Code
    naemon.log-20170812:[1502494202] Warning: The check of host 'testhost' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...
    naemon.log-20170813:[1502496302] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)
    naemon.log-20170814:[1502582702] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)
    naemon.log-20170815:[1502668803] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)
    naemon.log-20170816:[1502755500] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)
    naemon.log-20170817:[1502841900] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)
    naemon.log-20170818:[1502928303] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)
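
    The spacing of these entries can be checked with a short script. This is only a sketch: it assumes each line has an optional `file:` prefix (as produced by grep over several files) followed by `[epoch] message`, and the names `parse_line` and `gaps_in_hours` are made up for illustration. Only the first three lines from the log are used.

```python
import re
from datetime import datetime, timezone

# Optional "file:" prefix, then "[epoch] message".
LOG_LINE = re.compile(r"^(?:(?P<file>[^:\[]+):)?\[(?P<epoch>\d+)\] (?P<msg>.*)$")


def parse_line(line):
    """Return (file prefix or None, UTC timestamp, message) for one log line."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    ts = datetime.fromtimestamp(int(m.group("epoch")), tz=timezone.utc)
    return m.group("file"), ts, m.group("msg")


def gaps_in_hours(lines):
    """Hours between consecutive log entries, rounded to two decimals."""
    stamps = [parse_line(line)[1] for line in lines]
    return [round((b - a).total_seconds() / 3600, 2) for a, b in zip(stamps, stamps[1:])]


LINES = [
    "naemon.log-20170812:[1502494202] Warning: The check of host 'testhost' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the host...",
    "naemon.log-20170813:[1502496302] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)",
    "naemon.log-20170814:[1502582702] CURRENT HOST STATE: testhost;DOWN;SOFT;2;(host check orphaned, is the mod-gearman worker on queue 'hostgroup_test.queue' running?)",
]

for line in LINES:
    source, when, message = parse_line(line)
    print(source, when.isoformat(), message[:40])
print(gaps_in_hours(LINES))
```

    The first CURRENT HOST STATE entry follows the orphan warning by 35 minutes, and the next one comes exactly 24 hours later, which fits entries written at log rotation rather than by new check results.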


    Well... it was the first occurrence and I do not have that much time to investigate further. So I'll upgrade naemon soon and see if it happens again.
     
    Thx for your help sni ;)

    There are 2 worker processes assigned to "hostgroup_test.queue". The number of waiting jobs and the number of workers are monitored separately for each queue and were OK during this time.
    We do ~1000 checks/sec and only one was orphaned. So it's very hard to debug.
    And nothing was logged by gearman-job-server or any worker (debug=0).

    Just to be clear... if the gearman NEB module hands a check to gearman-job-server, will the check immediately be removed from the naemon scheduling queue?
    Does mod-gearman report anything to naemon-core about uniq jobs?


    Btw: I just checked what happens without any worker on a queue.
    With no worker and unique jobs enabled, naemon-core recognized the orphaned services shortly after check_interval and rescheduled them.
    So something very weird must have happened once on my side.
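
    The behaviour seen in this test fits a simple model of orphan detection: a check that is flagged as executing but whose result is overdue gets its flag cleared and is rescheduled. The sketch below is only that model, not Naemon's actual code; the `Check` fields, the `timeout` and `slack` values and the name `find_orphans` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    next_check: float        # when the check was due (epoch seconds)
    is_executing: bool       # True while a result is still expected
    timeout: float = 30.0    # hypothetical check timeout in seconds


def find_orphans(checks, now, slack=600.0):
    """Reschedule checks that are flagged executing but overdue; return their names."""
    orphaned = []
    for c in checks:
        if not c.is_executing:
            continue  # nothing outstanding, so nothing can be orphaned
        expected_by = c.next_check + c.timeout + slack
        if expected_by < now:
            c.is_executing = False   # give up waiting for the old result
            c.next_check = now       # schedule an immediate re-check
            orphaned.append(c.name)
    return orphaned


checks = [
    Check("testhost", next_check=1000.0, is_executing=True),
    Check("otherhost", next_check=1000.0, is_executing=False),
    Check("recenthost", next_check=1900.0, is_executing=True),
]
print(find_orphans(checks, now=2000.0))   # ['testhost']
print(find_orphans(checks, now=2000.0))   # []
```

    In this model, at least, a check is only detected a second time if it gets flagged as executing again; the second call above finds nothing because the flag was cleared by the first.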

  • I guess there is nothing in the logs because it seems like this one host job has gone missing somehow.


    And yes, mod-gearman cancels the usual processing of checks and immediately returns to the naemon core with a return code indicating that the check is now being worked on.


    What exactly do you mean by "uniq-jobs"? There is a uniq-job feature in Gearman, and its key is set to the "hostname" and "service description" of a check. So if you add multiple requests for the same host or service, there will still be only one job. I've seen rare cases where gearman hangs on one job which never finishes. Force-rescheduling doesn't help then, because there is already a job with that uniq-id.

    What helped here was to stop naemon, stop gearmand and remove the gearman retention file.
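
    The effect of such a stuck job can be shown with a toy model. This is not gearmand's implementation: the class `ToyJobServer`, its methods and the plain hostname used as the unique key are all hypothetical, and clearing the dictionary merely stands in for restarting with the retention file removed.

```python
class ToyJobServer:
    """Toy model of unique-key job coalescing; not gearmand's implementation."""

    def __init__(self):
        self.jobs = {}  # unique key -> job state ("queued" or "running")

    def submit(self, unique_key):
        """Add a job unless one with the same key exists; return True if added."""
        if unique_key in self.jobs:
            return False  # coalesced with the existing job, nothing new is queued
        self.jobs[unique_key] = "queued"
        return True

    def grab(self, unique_key):
        """A worker takes the job; the key stays registered until completion."""
        self.jobs[unique_key] = "running"

    def complete(self, unique_key):
        """The worker reports a result, which frees the unique key."""
        self.jobs.pop(unique_key, None)


server = ToyJobServer()
print(server.submit("testhost"))   # True: first request creates the job
server.grab("testhost")            # a worker takes it but never finishes
print(server.submit("testhost"))   # False: the forced re-check is swallowed
server.jobs.clear()                # stand-in for a restart without retained state
print(server.submit("testhost"))   # True: the key is free again
```

    As long as the hung job never calls `complete`, every new request for the same key is absorbed, which matches the description that force-rescheduling does not help.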

    Thx for the explanation.
    Indeed I mean the uniq jobs of gearmand. I have it enabled to avoid growing queues when workers are not reachable.

    Well... it has happened only once so far and I cannot reproduce it. So thanks very much for your help and information.
    I'll keep an eye on it and try to get more infos if it happens again.

    btw... your thruk changelog on thruk.org is not up-to-date.

    and thanks for all the cool stuff you invented ;)