Downtime does not get set when Zone is Offline

  • Hello!


    I have a problem with downtimes. We have systems that are only online for a couple of hours a day. When they come online, they send a message to an HTTP API. This API deletes the downtimes set for that system and schedules a new downtime starting in 10 hours in Icinga 2 via the Livestatus interface. If the system is shut down correctly, it sends another message to the HTTP API, which puts it into a downtime. The scheduled downtime also works if the system stays online for more than 10 hours.

    The problem occurs when the zone is no longer reachable at the time the scheduled downtime should start: the zone gets set into the scheduled downtime, as do the services applied to the zone, but the server and its services do not start their scheduled downtime and are displayed as CRITICAL (when they were CRITICAL before the zone lost its connection) in Icinga Web 2 and on the NagVis map.


    The downtime is set as follows:


    Shell-Script
    COMMAND [Unix timestamp] SCHEDULE_HOST_SVC_DOWNTIME;<hostname>;36000;473040000;1;0;0;LiveStatus;"blah";\n\n
    # Description:
    # COMMAND [Unix timestamp] SCHEDULE_HOST_SVC_DOWNTIME;<hostname>;<start_time>;<end_time>;<fixed>;<trigger_id>;<duration>;<author>;<comment>;\n\n
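
    For illustration, such commands can be fed to the Livestatus socket roughly like this (the socket path, host name and times below are placeholders, not the exact values from our setup; nc needs a build with Unix-socket support):

    Shell-Script
    # Rough sketch: clear existing downtimes for a host, then schedule a new
    # downtime that starts in 10 hours. Placeholder values throughout.
    SOCKET=/var/run/icinga2/cmd/livestatus
    HOST=myhost
    NOW=$(date +%s)
    START=$((NOW + 36000))        # 10 hours from now
    END=$((NOW + 36000 + 3600))   # placeholder end time

    # Remove downtimes previously set for this host.
    printf 'COMMAND [%d] DEL_DOWNTIME_BY_HOST_NAME;%s\n\n' "$NOW" "$HOST" | nc -U "$SOCKET"

    # Schedule a fixed downtime for all of the host's services.
    printf 'COMMAND [%d] SCHEDULE_HOST_SVC_DOWNTIME;%s;%d;%d;1;0;0;LiveStatus;scheduled via script\n\n' \
      "$NOW" "$HOST" "$START" "$END" | nc -U "$SOCKET"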

    To me it seems that this behaviour has something to do with the implicit dependency between the host and the zone: when the zone is not reachable, no action is triggered for the host.


    Is it possible to configure Icinga 2 in a way that, even though the zone is DOWN and in a downtime, the scheduled downtimes for the hosts and their services "below" this zone are also triggered?


    Thanks for any help in advance,
    regards
    MoBro

  • As far as I understand your description, you have a master-satellite cluster setup.


    There's a manual script involved which tries to set a downtime once it detects a zone failure, possibly running in its own cronjob.


    Icinga 2 itself executes the checks in the satellite zone and replays them to the master zone.


    The timing issue here is:


    1) The satellite check results are synced to the master while they are still OK
    2) The satellite zone drops the connection
    3) The script enforces a downtime schedule
    4) No more check results are replicated to trigger the downtime


    If the checks were CRITICAL in 1), the scheduled downtime should trigger immediately in 3).


    Can you extract some logs for that scenario, and also fetch the downtime's state from either the API or the IDO database backend?
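
    Something along these lines (credentials, paths and the database name are placeholders, adjust them to your environment):

    Shell-Script
    # Downtime objects via the REST API:
    curl -k -s -u root:icinga 'https://localhost:5665/v1/objects/downtimes'

    # Or via the IDO MySQL backend: was_started / actual_start_time show
    # whether a scheduled downtime actually started.
    mysql icinga -e "SELECT internal_downtime_id, scheduled_start_time, scheduled_end_time, was_started, actual_start_time FROM icinga_scheduleddowntime"

    # Plus the relevant log lines on the master:
    grep -i downtime /var/log/icinga2/icinga2.log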


    PS: Versions and OS as always.

  • Hi!


    We have a master-satellite setup. The OS is RHEL 6.7 (64-bit) on the master and Debian 7 (32-bit)/8 (64-bit) on the satellite hosts. The Icinga version in use is 2.4.1-1. There is a (PHP) script involved to schedule downtimes. It is not run by a cronjob; the script is accessible via Apache. When a system comes online, it sends an HTTP message to the master, which takes the system out of its current downtime and schedules a new downtime starting in 10 hours. This is done for every host in the satellite system with the COMMAND posted in my first post.


    When the system loses its connection within these 10 hours and does not come back online, the scheduled downtime gets applied to the endpoint zone. The problem is that it does not get applied to the host (of the zone), its services, and the hosts/services "below" the endpoint host.


    Today I cannot fetch the requested logs and downtime states; I will do so on Monday. I don't know how to query either the API or the IDO, so I will give you the output of the object list.


    regards and thanks
    mobro

  • So your issue is that you want to schedule multiple downtimes for an endpoint host and all hosts/services belonging to that endpoint zone. Why don't you schedule multiple downtimes then and let them trigger from the endpoint host's downtime (trigger_id)?
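
    Roughly like this (host names, the socket path and the times are placeholders; the trigger_id is the internal ID of the downtime already scheduled on the endpoint host):

    Shell-Script
    SOCKET=/var/run/icinga2/cmd/livestatus
    NOW=$(date +%s)

    # Look up the internal ID of the endpoint host's existing downtime.
    PARENT_ID=$(printf 'GET downtimes\nColumns: id\nFilter: host_name = satellite-zone\n\n' \
                | nc -U "$SOCKET" | head -n 1)

    # Schedule the child host's downtime with that ID as trigger_id, so it
    # gets triggered together with the endpoint host's downtime.
    printf 'COMMAND [%d] SCHEDULE_HOST_DOWNTIME;satellite-host;%d;%d;1;%d;0;LiveStatus;triggered downtime\n\n' \
      "$NOW" $((NOW + 36000)) $((NOW + 36000 + 3600)) "$PARENT_ID" | nc -U "$SOCKET"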

  • My issue is that systems in downtime are not shown as being in downtime. That is because the zone is not reachable and the scheduled downtime only gets triggered for the zone, not for the "subsystems".


    dnsmichi wrote:

    Why don't you schedule multiple downtimes then and let them trigger from the endpoint host's downtime (trigger_id)?

    I already schedule a downtime for each host of one system. I also see these scheduled downtimes in the object list (I changed the FQDN of the master/client):

    • Zone downtime; this downtime gets triggered if the zone goes offline:
    • Server downtime; this downtime is not triggered when the zone is offline:
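
    These dumps come from the object list on the master, pulled roughly like this (the name pattern is a placeholder):

    Shell-Script
    # List the downtime objects known on the master.
    icinga2 object list --type Downtime --name '*satellite*'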

    If the zone is in downtime, the downtime for the server and its services does not get triggered. I am not sure at the moment whether notifications get sent for services/hosts that were in a critical state before the zone became unavailable, but they are definitely displayed incorrectly in Icinga Web 2 and in NagVis. I hope this clarifies my issue; thanks for your help,


    regards
    MoBro

  • Should I create a bug report for this issue, or does anybody have a solution to trigger the downtime even though the zone is offline?

  • I have a hard time understanding what "zone" means in your context. For me a "zone" references a Zone config object and of course all other checkable objects which are put into that zone context.


    When you say that you are putting a zone into a downtime, I don't know exactly what that means. Is this a specific host in your zone, or something similar?

  • I have two hosts in a Satellite. One is the Host itself, the other is a Zone Host. When I talk about "zone", I mean this Zone Host - not the Zone object. The Zone Host is automatically created by the "icinga2 node update-config" command.
    Host Definition for both Satellite Hosts:

    • Satellite Host Object:
    • Satellite-Zone Host Object:


    I schedule a downtime for both hosts and their services with the command posted in the initial post. The problem is that the downtime only gets triggered for the "Satellite-Zone" host.
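
    To give an idea of the structure, here is a rough sketch of what these generated objects can look like (names, address and check command values are placeholders, not our actual config):

    Source Code
    // The zone host, checked on the master via the cluster-zone health check.
    object Host "satellite-zone" {
      check_command = "cluster-zone"
      vars.cluster_zone = "satellite-zone"
    }

    // The satellite host itself, meant to be checked inside the satellite zone.
    object Host "satellite-host" {
      check_command = "hostalive"
      address = "192.0.2.10"
      zone = "satellite-zone"
    }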


    regards and thanks,
    Mobro

  • The host "Satellite-Zone" is actively checked on the master - no wonder you'll receive check results which will immediately trigger the downtime. The host "Satellite" is supposed to be checked inside the "Satellite-Zone" zone and no check results mean no farther action.


    I thought both hosts are checked inside the satellite zone, but given that the health check works, I'm wondering what's missing? You cannot trigger a downtime if the client doesn't send any active check results on its own. That's by design.

  • dnsmichi wrote:

    I thought both hosts are checked inside the satellite zone, but given that the health check works, I'm wondering what's missing? You cannot trigger a downtime if the client doesn't send any active check results on its own. That's by design.

    What's missing is that the scheduled downtime is not triggered when the zone is offline. I understand now that this behaviour is as intended, even though I would rethink this design decision :)


    Thanks for investigating this problem with patience,


    regards,
    Mobro