Host state doesn't update when Dependency is applied

This forum was archived to /woltlab and is now in read-only mode.
  • Hello,


    I'd like to ask a question about dependency settings, more specifically the disable_checks attribute. My host dependency configuration looks like this:

    Code
    apply Dependency "host-dependency" for (parent in host.vars.parents) to Host {
      parent_host_name = parent
      disable_notifications = true
      disable_checks = true
      ignore_soft_states = false
      assign where parent
    }


    With the value of true, when the parent host goes down the Icingaweb2 interface shows its state as DOWN, which is correct. However, the child host's state doesn't change to UNREACHABLE; it keeps showing UP. I waited for a long time and refreshed the web page, but nothing changed. The only time the status changes is when I click Check now. The same thing happens when the parent host comes back up: the parent host's state changes to UP, but the child host's doesn't until I click Check now.


    On the other hand, when I change disable_checks to false, the expected behaviour happens. When the parent host goes down, it shows DOWN as the parent host's state and, at the same time, the child host shows UNREACHABLE. When the parent host comes back up, both change to UP in Icingaweb2.


    So my question is:

    Although disable_checks = true is configured, shouldn't the child host's status change to UP when the parent is UP? It looks like the check for the child never happens with the true setting. If I set it to true, does the check never run against the child host at all until I trigger it manually? I guess setting the value to false will make the system run unnecessary checks, which uses resources inefficiently. Could someone please advise on this issue? Thanks a lot.

  • UPDATE: there is also another problem with the interface. It is happening against only one host at the moment and I just can't figure out why. Please see the attached diagram. The check source shows unreachable, and it sometimes changes to reachable. The host is UP at the moment. Could someone please explain how I should interpret this? I am a bit confused.

  • Do not get impatient.


    First of all, we need to talk about what a dependency does:

    • It (by default) suppresses notifications
    • It renders a checkable unreachable; that is the small red dot
    • It does *not* render the check status UNKNOWN!
    • Depending on disable_checks, it either lets the checkable continue to run checks or disables them
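    The attributes above map onto the Dependency object like this (a minimal sketch, not taken from this thread; the host names follow the CCTV/ClaraNetDNS scenario described further down, and the comments state the standard attribute semantics):

    ```
    object Dependency "cctv-depends-on-dns" {
      parent_host_name = "ClaraNetDNS"
      child_host_name  = "CCTV"
      disable_notifications = true   // suppress notifications for the child while the parent is down
      disable_checks = false         // if true, the child's checks are paused while the dependency triggers
      states = [ Up ]                // parent states that satisfy the dependency
      ignore_soft_states = false     // soft states of the parent also trigger the dependency
    }
    ```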

    I made some screenshots for you.

    Consider the following scenario:

    Host CCTV depends on ClaraNetDNS being in UP state.

    However, CCTV may or may not continue to run checks even with ClaraNetDNS being down and will display that check result just fine.


    The first screenshot shows exactly that:

    • Workstation21 schedules and executes the check.
    • This endpoint is reachable.
    • disable_checks is true, so no new check results are coming in; we have stale last check / next check times.
    • The check result displayed is outdated.
    • The "is not reachable" is followed by a red dot marking that host CCTV might not be reachable because ClaraNetDNS is down.




    Now, the same story but with disable_checks set to false:

    • Workstation21 schedules and executes the check.
    • This endpoint is reachable.
    • disable_checks is false, so new check results are coming in; we have good last check / next check times.
    • The check result displayed is current.
    • The "is not reachable" is followed by a red dot marking that host CCTV might not be reachable because ClaraNetDNS is down.



    Finally, let's assume that CCTV returns a critical check result:

    Here we see that the check result is CRITICAL, but the overall check status is overridden to show unreachable instead of critical.

    Again, understand the difference:

    Unreachable is a flag in the host/service object.

    UNKNOWN is a status like UP, DOWN for hosts or CRITICAL, WARNING, OK for services.



    As for your "check does not restart by itself" issue, I fear you did not wait long enough; for me it works.


    You should have learned:

    The flag "reachable" is represented as the red dot.

    The overall status ("top colored box") may be outdated or overridden and thus is not reliable.

    The line "CheckingEndpointName is (not) reachable" is difficult to read. It should be interpreted as:

    "Endpoint running the check: CheckingEndpointName "

    "CheckableName is (not) reachable"


    Let us come back to the first line of that post:

    Answering that question took me exactly 2 hours with creating screenshots etc.

    Sometimes it just needs a bit of time to get an answer, even if a weekend is in between.


  • Well, I have been doing a test from this morning for more than three hours and this is my conclusion.


    1. When disable_checks is set to true, the child host never shows UNREACHABLE.

    2. When disable_checks is set to true, the check doesn't restart. Please don't get me wrong, I have been waiting for more than three hours.

    3. When disable_checks is set to false, the status change for the child host happens immediately, as soon as the system notices the status of the parents. This applies to recovery as well.
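    Conclusion 3 corresponds to the apply rule from the first post with only the disable_checks flag flipped (a sketch of that variant, nothing else changed):

    ```
    apply Dependency "host-dependency" for (parent in host.vars.parents) to Host {
      parent_host_name = parent
      disable_notifications = true   // notifications stay suppressed while a parent is down
      disable_checks = false         // checks keep running, so the child's state updates promptly
      ignore_soft_states = false
      assign where parent
    }
    ```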


    One thing I have to ask is this:

    The line "CheckingEndpointName is (not) reachable" is difficult to read. It should be interpreted as:

    "Endpoint running the check: CheckingEndpointName "

    "CheckableName is (not) reachable"


    The Check Source state shows reachable when the parents are down (capture1.png). On the other hand, the child shows not reachable (capture2.png). How do I interpret this situation?

  • How do I interpret this situation?

    capture1:

    MONCORE01.TEST is scheduling and executing a check against host object ROUTER2.INTHEMIDDLE.

    Either there are no dependencies, or none of the dependencies on the way to ROUTER2.INTHEMIDDLE triggered; thus the "reachable" flag of host ROUTER2.INTHEMIDDLE is true (green dot).

    The check against ROUTER2.INTHEMIDDLE failed for some reason, perhaps it is currently down.


    capture2:

    MONCORE01.TEST is scheduling and executing a check against host object TESTHOST.MYDOMAIN.

    At least one dependency on the way to TESTHOST.MYDOMAIN triggered; thus the "reachable" flag of host TESTHOST.MYDOMAIN is false (red dot).

    The check against TESTHOST.MYDOMAIN failed for some reason, perhaps it is currently down.

    Because the reachable flag is false and the check result is not "UP", we see an overall "status" of unreachable.


    As for your statement that checks never start to run again if disable_checks=true: sorry, I just rechecked and I cannot reproduce that. In my environment checks continue to run as soon as the dependency no longer triggers (green dot).


  • Great, so it looks like all my questions have been answered. After testing for more than three months, I guess I can now deploy it into production. Thanks again. Cheers.

  • Had the same results here. With disable_checks = true some hosts stay UP instead of unreachable, so I switched it to false and all looks good now. I hope that will limit my notifications for unknown service check results caused by "Zone 'myzone.zone.zone' is not connected. Log lag: 5 hours, 39 minutes and 2 seconds".

  • With disable_checks = true some host stay on UP instead of unreachable

    State changes happen after an active or passive check provides a result.

    With disable_checks=true no (active?) checks are running any more, thus no result and thus no state change.


    I am imprecise here in that "unreachable" is not a state (like UNKNOWN) but a flag.

  • Thanks for clarifying. I'm aware of this, but my problem is that I'm getting notifications for "unknown" services because of "endpoint is not connected". So I'm getting hundreds of mails for services of hosts that are not reachable. I hope the unreachable flag prevents that from happening; in my short tests over the last few minutes it looks pretty good.

  • I might be wrong, but whenever that unreachable flag is set on a host or service, I have always assumed that a dependency kicked in.