Disabling notifications on state changes to UNKNOWN and back

We use a software load balancer called HAProxy, which health-checks the backend services and exposes the results on a status page. I have a custom check that queries this status page for the status of a load-balanced service. The result is either UP (i.e. OK), DOWN (i.e. CRITICAL) or “UNKNOWN”.

“UNKNOWN” occurs when HAProxy is reloading and has yet to determine the state of the backend service.
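For reference, a minimal sketch of what such a custom check could look like (the stats URL, the CSV field positions and the script itself are assumptions for illustration, not the actual check):

```bash
#!/bin/bash
# Hypothetical HAProxy status check: query the stats page in CSV form and
# map the backend's status to Nagios/Icinga exit codes.
# Assumption: the stats endpoint serves CSV when ";csv" is appended.

STATS_URL="$1"   # e.g. http://lb01:8404/stats;csv
BACKEND="$2"     # pxname column
SERVER="$3"      # svname column (or "BACKEND" for the aggregate line)

line=$(curl -sf "$STATS_URL" | awk -F, -v px="$BACKEND" -v sv="$SERVER" '$1 == px && $2 == sv')

if [ -z "$line" ]; then
    echo "UNKNOWN - no entry for $BACKEND/$SERVER on the HAProxy status page"
    exit 3
fi

# "status" is the 18th CSV field in current HAProxy versions; verify against
# the header line ("# pxname,svname,...") of your installation.
status=$(echo "$line" | awk -F, '{print $18}')

case "$status" in
    UP*)   echo "OK - $BACKEND/$SERVER is $status";       exit 0 ;;
    DOWN*) echo "CRITICAL - $BACKEND/$SERVER is $status"; exit 2 ;;
    *)     echo "UNKNOWN - $BACKEND/$SERVER is $status";  exit 3 ;;
esac
```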

The problem is that when a service is in CRITICAL state, it can transition to UNKNOWN and then back to CRITICAL. This sends a new notification every time.

How can I disable this? I want Icinga to ignore any transitions to and from UNKNOWN unless the OK/CRITICAL state has actually changed.

So basically:
CRITICAL -> UNKNOWN -> CRITICAL = do nothing
OK -> UNKNOWN -> OK = do nothing
OK -> UNKNOWN -> CRITICAL = notify
CRITICAL -> UNKNOWN -> OK = notify

thanks!

Remove the Unknown entry from your states filter.

Unfortunately that didn’t work. It still sent me a notification on the transition from UNKNOWN -> CRITICAL.

Ah, the transition … no, that cannot be filtered away. For such a thing you would need some external logic to filter those state changes out.
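For example (just a sketch of what such external logic could be, not a ready-made feature): wrap the HAProxy check in a small script that caches the last hard OK/CRITICAL result and replays it whenever the status page reports UNKNOWN. The cache path and argument layout below are made up for illustration:

```bash
#!/bin/bash
# Hypothetical "external logic" wrapper: run the real HAProxy check, but when
# it reports UNKNOWN (exit 3), replay the last hard OK/CRITICAL result so that
# CRITICAL -> UNKNOWN -> CRITICAL never shows up as a state change in Icinga.

SERVICE="$1"; shift                                  # cache key, e.g. "my_backend"
CACHE="/var/cache/icinga2/haproxy_${SERVICE}.state"  # must be writable by the icinga user

output=$("$@")    # run the real check command with all remaining arguments
rc=$?

if [ "$rc" -eq 3 ] && [ -f "$CACHE" ]; then
    # HAProxy is reloading: keep the previous definitive state and output.
    rc=$(head -n 1 "$CACHE")
    output="$(tail -n +2 "$CACHE") (held while HAProxy reports UNKNOWN)"
elif [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]; then
    # Remember the latest OK/CRITICAL result for the next reload window.
    { echo "$rc"; echo "$output"; } > "$CACHE"
fi

echo "$output"
exit "$rc"
```

With such a wrapper, CRITICAL -> UNKNOWN -> CRITICAL never reaches Icinga as a state change, while OK -> UNKNOWN -> CRITICAL still turns CRITICAL on the next real result.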

Hello,

We are using Icinga (version: r2.9.2-1).

Last week we also faced a similar issue, where the state of a particular service kept transitioning between UNKNOWN <-> CRITICAL during the night and we received thousands of emails (luckily no phone calls). (The re-notification interval was configured for 7d.)

To handle this, I am thinking about writing a wrapper plugin for check_by_ssh that always returns CRITICAL if the exit status is 3 or greater (i.e. UNKNOWN or anything unexpected). Not sure if this kind of extra wrapper is what @dnsmichi meant by “external logic”.
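In case it helps, a minimal sketch of what I have in mind (path and file name are made up and would need adapting):

```bash
#!/bin/bash
# Hypothetical wrapper around check_by_ssh: downgrade any exit status of 3 or
# higher (UNKNOWN or anything unexpected) to CRITICAL, so the service cannot
# flap between CRITICAL and UNKNOWN.

output=$(/usr/lib64/nagios/plugins/check_by_ssh "$@")
rc=$?

if [ "$rc" -ge 3 ]; then
    echo "CRITICAL (original exit status $rc): $output"
    exit 2
fi

echo "$output"
exit "$rc"
```

Note that this only helps when check_by_ssh itself exits with UNKNOWN; if Icinga kills the whole process group on check_timeout, the wrapper is killed along with it.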

@markvr, may I know if you have already found a workaround for this state-transition problem?

Just to mention the issue that resulted in the CRITICAL/UNKNOWN state transitions (I will open a separate thread to discuss this):

Strangely, for me check_by_ssh sometimes returns CRITICAL and sometimes UNKNOWN after the configured timeout of 30 seconds. This triggered several emails.

Note: Both examples below belong to the same service. (1) shows OK -> UNKNOWN and (2) UNKNOWN -> CRITICAL.

(1) check_by_ssh returns an UNKNOWN state after the timeout of 30 seconds (after being killed by a signal):

[2019-02-25 14:14:20 +0100] warning/Process: Killing process group 96501 ('/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'sudo /tools/icinga/check_service_status.sh "tomcat7.HRS"' '-H' 'search.internal.de' '-i' '~/.ssh/id_rsa_icinga_deployer' '-l' 'icinga_deployer' '-o' 'StrictHostKeyChecking=no' '-q' '-t' '30') after timeout of 30 seconds
[2019-02-25 14:14:20 +0100] warning/Process: PID 96501 was terminated by signal 9 (Killed)
[2019-02-25 14:14:20 +0100] warning/PluginCheckTask: Check command for object 'search.internal.de!check-linux-service-tomcat7.HRS' (PID: 96501, arguments: '/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'sudo /tools/icinga/check_service_status.sh "tomcat7.HRS"' '-H' 'search.internal.de' '-i' '~/.ssh/id_rsa_icinga_deployer' '-l' 'icinga_deployer' '-o' 'StrictHostKeyChecking=no' '-q' '-t' '30') terminated with exit code 128, output: <Timeout exceeded.><Terminated by signal 9 (Killed).>
[2019-02-25 14:14:20 +0100] debug/Checkable: Update checkable 'search.internal.de!check-linux-service-tomcat7.HRS' with check interval '300' from last check time at 2019-02-25 14:14:20 +0100 (1.5511e+09) to next check time at 2019-02-25 14:19:10 +0100 (1.5511e+09).
[2019-02-25 14:14:20 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-02-25 14:14:20 +0100] debug/DbEvents: add log entry history for 'search.internal.de!check-linux-service-tomcat7.HRS'
[2019-02-25 14:14:20 +0100] debug/DbEvents: add checkable check history for 'search.internal.de!check-linux-service-tomcat7.HRS'
[2019-02-25 14:14:20 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2019-02-25 14:14:20 +0100] debug/DbEvents: add state change history for 'search.internal.de!check-linux-service-tomcat7.HRS'
[2019-02-25 14:14:20 +0100] notice/Checkable: State Change: Checkable 'search.internal.de!check-linux-service-tomcat7.HRS' soft state change from OK to UNKNOWN detected.

(2) check_by_ssh returns CRITICAL after the timeout of 30 seconds, causing the UNKNOWN -> CRITICAL state change:

[2019-02-25 14:19:40 +0100] warning/Process: Killing process group 101159 ('/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'sudo /tools/icinga/check_service_status.sh "tomcat7.HRS"' '-H' 'search.internal.de' '-i' '~/.ssh/id_rsa_icinga_deployer' '-l' 'icinga_deployer' '-o' 'StrictHostKeyChecking=no' '-q' '-t' '30') after timeout of 30 seconds
[2019-02-25 14:19:40 +0100] notice/Process: PID 101159 ('/usr/lib64/nagios/plugins/check_by_ssh' '-C' 'sudo /tools/icinga/check_service_status.sh "tomcat7.HRS"' '-H' 'search.internal.de' '-i' '~/.ssh/id_rsa_icinga_deployer' '-l' 'icinga_deployer' '-o' 'StrictHostKeyChecking=no' '-q' '-t' '30') terminated with exit code 2
[2019-02-25 14:19:40 +0100] debug/Checkable: Update checkable 'search.internal.de!check-linux-service-tomcat7.HRS' with check interval '300' from last check time at 2019-02-25 14:19:40 +0100 (1.5511e+09) to next check time at 2019-02-25 14:24:30 +0100 (1.5511e+09).
[2019-02-25 14:19:40 +0100] debug/DbEvents: add log entry history for 'search.internal.de!check-linux-service-tomcat7.HRS'
[2019-02-25 14:19:40 +0100] debug/DbEvents: add checkable check history for 'search.internal.de!check-linux-service-tomcat7.HRS'
[2019-02-25 14:19:40 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-02-25 14:19:40 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2019-02-25 14:19:40 +0100] notice/JsonRpcConnection: Received 'event::Heartbeat' message from 'SI0BOS404.de.bosch.com'
[2019-02-25 14:19:40 +0100] debug/DbEvents: add state change history for 'search.internal.de!check-linux-service-tomcat7.HRS'
[2019-02-25 14:19:40 +0100] notice/Checkable: State Change: Checkable 'search.internal.de!check-linux-service-tomcat7.HRS' soft state change from UNKNOWN to CRITICAL detected.

Hi @vish-c,

I would guess that the plugin hangs in different stages, e.g. 1) a timeout while establishing the connection, or 2) a timeout during login or execution of the actual check. But this is just a guess and not a qualified statement; it would need a bit more debugging. The access log of the monitored server could be interesting.

Such a wrapper already exists: check_negate. It wraps your check and can be used to modify the actual check result before it reaches Icinga - in your case, to return CRITICAL when the check_by_ssh plugin returns UNKNOWN.
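Assuming the negate plugin from the monitoring-plugins package (usually installed as negate next to the other plugins), a call could look roughly like this, reusing the check_by_ssh arguments from the logs above; please verify the path and the -u option against your installed version:

```bash
# Hedged example: let negate rewrite an UNKNOWN result to CRITICAL.
/usr/lib64/nagios/plugins/negate -u CRITICAL \
    /usr/lib64/nagios/plugins/check_by_ssh \
        -C 'sudo /tools/icinga/check_service_status.sh "tomcat7.HRS"' \
        -H search.internal.de -i ~/.ssh/id_rsa_icinga_deployer \
        -l icinga_deployer -o StrictHostKeyChecking=no -q -t 30
```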

Cheers,
Marcel

Thanks for your tip. I will check this.

Regards,
Vish