Multiple instances of escalation failures

  • I have three notification users, primary, secondary and tertiary. The latter two are set up with times.begin = 1h and 2h respectively to create an escalation chain.

    Most of the time this works just fine, but there are several distinct cases where this fails completely:


    Acknowledgement notifications seem to be sent to all users, completely disregarding the times dictionary. I've worked around this in my notify script by checking the age of the status, but this really should not be necessary. Ideally an acknowledgement should only be sent to whoever has already received the issue notifications, in the same way a recovery notification is.
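
A minimal sketch of that kind of workaround, assuming the notify script is handed the notification type, the age of the current state in seconds, and the contact's escalation delay in seconds. The function name `should_notify` and its argument order are hypothetical, and how the age reaches the script (for example from a last-state-change timestamp passed as an extra argument) is left open:

```shell
#!/bin/sh
# Hypothetical helper: decide whether a contact should receive a notification.
# $1 = notification type (PROBLEM, ACKNOWLEDGEMENT, RECOVERY, ...)
# $2 = seconds since the last state change (age of the problem)
# $3 = this contact's escalation delay in seconds (mirrors times.begin)
# Prints "send" or "skip".
should_notify() (
    type=$1
    age=$2
    begin=$3
    # Only acknowledgements are filtered here; everything else passes through.
    if [ "$type" = "ACKNOWLEDGEMENT" ] && [ "$age" -lt "$begin" ]; then
        echo "skip"
    else
        echo "send"
    fi
)
```

With a one-hour delay, `should_notify ACKNOWLEDGEMENT 60 3600` prints `skip`, while the same call with an age of 4000 seconds prints `send`. Note that this only approximates "has this contact been notified yet" by comparing ages; it does not know what was actually sent.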


    If an issue has a notification period set to e.g. "workhours" but arises some time before that period begins, the time since the state change is still used to determine who gets notified, rather than the time since the start of the "workhours" period. For example, an issue that arose at 6 am will send a notification to all escalation levels as soon as it enters its notification window, which starts at 9 am.


    The third is similar to the second but applies when a host or service is removed from or exits scheduled downtime while still in a failed state. Again the time since the state change seems to be the basis of the escalation rather than the downtime end time. (Luckily I had implemented basic rate limiting in my notification script when this happened to several hundred services yesterday. There was still a lot of noise in the office space, though.)
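
The rate limiting mentioned above is not shown in this thread. As a rough sketch of the idea only, assuming a sliding window backed by a plain state file: the function name `rate_limit_ok`, its arguments, and the file layout are all hypothetical, and there is no locking, so concurrent notification processes could race.

```shell
#!/bin/sh
# Hypothetical rate limiter: allow at most $2 notifications per $3 seconds.
# $1 = state file holding one epoch timestamp per sent notification
# $4 = current time in epoch seconds (passed in to keep the function testable)
# Returns 0 if the notification may be sent, 1 if it should be dropped.
rate_limit_ok() (
    file=$1
    max=$2
    window=$3
    now=$4
    cutoff=$((now - window))
    touch "$file"
    # Keep only the timestamps that are still inside the window.
    awk -v c="$cutoff" '$1 > c' "$file" > "$file.tmp"
    mv "$file.tmp" "$file"
    count=$(awk 'END { print NR }' "$file")
    if [ "$count" -ge "$max" ]; then
        exit 1
    fi
    # Record this notification and allow it.
    echo "$now" >> "$file"
)
```

A notify script could call it as `rate_limit_ok /tmp/notify.rate 20 60 "$(date +%s)" || exit 0` before sending anything, where the path and limits are placeholders.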

  • Acknowledgement notifications seem to be sent to all users, completely disregarding the times dictionary.

    That is what you want, I guess: somebody acks a WARNING/CRITICAL state, and all persons that have been notified about the non-OK state must receive that ack.

  • Everyone who has received the initial issue notification, certainly, but I don't want to be woken up in the middle of the night by the primary on-call acking an issue a minute after receiving it. As it currently is, the ack is always sent to every contact regardless of their times.begin. I have not checked whether it also disregards times.end or period, as we don't use these ourselves.

  • Since there was no response the first time around, I've created a simple setup to better demonstrate and replicate these issues:


    Firstly, a check whose state I can set at will by commenting lines in or out (/usr/local/sbin/check_test):

    Shell-Script
    #!/bin/bash
    # The first uncommented line decides the result; comment lines in or out to pick a state.
    echo "Check is OK"; exit 0
    #echo "Check is WARNING"; exit 1
    echo "Check is CRITICAL"; exit 2
    echo "Check is UNKNOWN"; exit 3

    And a simple notification script that just logs to a file, so I can see the timeline (/usr/local/sbin/test_notification):

    Shell-Script
    #!/bin/bash
    echo "[$(date +%T)] $1 - $2: $3 ($4)" >> /tmp/messagelog

    Finally the test config


    I then ran the following four tests:


    1: Acking after escalation

    This works more or less as expected, although the timing is a bit off, both in general and for when secondary receives the notification.



    2: Acking before escalation:

    Code
    [14:19:16] primary - Test service: CRITICAL (PROBLEM)
    [14:19:28] secondary - Test service: CRITICAL (ACKNOWLEDGEMENT)
    [14:19:28] primary - Test service: CRITICAL (ACKNOWLEDGEMENT)
    [14:19:40] primary - Test service: OK (RECOVERY)

    Notice here that secondary receives the ack even though he hasn't yet received any problem notification. He also does not get a recovery message, which will leave him somewhat confused.



    3: Service goes critical more than 5 minutes before the end of a scheduled downtime, which in this case ended at 14:30

    Code
    [14:30:06] primary - Test service: CRITICAL (PROBLEM)
    [14:30:26] secondary - Test service: CRITICAL (PROBLEM)
    [14:30:46] secondary - Test service: OK (RECOVERY)
    [14:30:46] primary - Test service: OK (RECOVERY)

    Both primary and secondary got notified (more or less) immediately. I'm guessing it counts the time since the start of the problem, rather than the time since notifications could first be sent. Effectively this completely voids the notion of escalation.



    4: Swapped from "24x7" to the "test" period in the notification template, which was set to begin ten minutes in the future, and then triggered a critical state:

    Code
    [14:45:13] primary - Test service: CRITICAL (PROBLEM)
    [14:45:13] secondary - Test service: CRITICAL (PROBLEM)
    [14:45:41] secondary - Test service: OK (RECOVERY)
    [14:45:41] primary - Test service: OK (RECOVERY)

    Similar result to test 3. Notification is immediately escalated upon entering the notification period, again voiding escalation completely.



    In conclusion I think that, for Icinga2 to be able to successfully mimic the Nagios/Icinga 1 notion of escalation, the notification.start interval needs to be measured from the earliest point at which a notification can be sent, not necessarily from when the state changes.
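
To make the proposal concrete, here is a small sketch of the arithmetic, not of anything Icinga2 actually does. It assumes the escalation reference time is the latest of the state change, the end of a scheduled downtime, and the start of the notification period. The function names and the convention of passing 0 for "not applicable" are hypothetical, and the numbers in the example are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: pick the reference time for escalation delays.
# All arguments are epoch seconds; pass 0 for "not applicable".
# $1 = last state change
# $2 = end of scheduled downtime
# $3 = start of the current notification period
escalation_start() (
    start=$1
    if [ "$2" -gt "$start" ]; then start=$2; fi
    if [ "$3" -gt "$start" ]; then start=$3; fi
    echo "$start"
)

# Prints "yes" if a contact whose delay is $2 seconds should be notified
# at time $3, given the reference time $1; otherwise prints "no".
escalation_due() (
    if [ $(($3 - $1)) -ge "$2" ]; then
        echo "yes"
    else
        echo "no"
    fi
)
```

For a problem that started at second 1000 while a downtime ran until second 1600, `escalation_start 1000 1600 0` prints `1600`. A contact with a 300 second delay checked at second 1620 is then not yet due (`escalation_due 1600 300 1620` prints `no`), whereas measuring from the state change would escalate immediately (`escalation_due 1000 300 1620` prints `yes`), which matches what tests 3 and 4 show.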

    The post was edited 3 times, last by tfylling.