Distributed monitoring: remote zone down, but the state of its services does not change


#1

Hi All

I have a distributed monitoring setup with one master zone and one remote zone, configured as “Top Down Config Sync”. All services in the remote zone have “Check Source: nms16”. (nms16 is an endpoint of the remote zone.)

Basically, it works as expected.

Now I shut down the remote zone server and expect the main server to tell me about it. Yes, the main server reports that the remote host has become unavailable, but all services from the remote zone still show the “OK” status from their last check.

There is nothing like “last service check timestamp too old” in the logs, only messages about the unsuccessful connection to the remote zone.

I expect to be told about such a situation, and to have the services from the remote site marked as “UNKNOWN” with a suitable info message.

Is this a misconfiguration on my side, or is it how Icinga 2 distributed monitoring is meant to work?


(Matthias) #2

I think this is the expected behaviour as services have an implicit dependency on their host: https://icinga.com/docs/icinga2/latest/doc/03-monitoring-basics/#implicit-dependencies-for-services-on-host

Edit: never mind, the service checks should still be executed and therefore change state. I usually disable service checks using an explicit dependency, hence the confusion.


#3

But if I shut down the remote zone endpoint, then all hosts monitored by this endpoint should switch to the UNKNOWN state, IMHO.


(Rafael Voss) #4

As the remote zone executes the checks, the master only receives the check results and does not actively check the endpoint. As a result, you will see the last check result (which was OK).

This has been discussed from time to time, but there is no solution yet.

What you can do yourself is edit the host_state entry in “/etc/icingaweb2/enabledModules/monitoring/library/Monitoring/Backend/Ido/Query/HoststatusQuery.php” and add “WHEN hs.is_reachable = 0 OR hs.next_check < now()-30 THEN 2” to it.

For example:

 'host_state'                => 'CASE WHEN hs.has_been_checked = 0 OR hs.has_been_checked IS NULL THEN 99 WHEN hs.is_reachable = 0 OR hs.next_check < now()-120 THEN 2 ELSE hs.current_state END',

You can also use 99 instead of 2 to get the “pending” state.

This only changes the view inside Icinga Web (and nothing more), so that the host is shown as unreachable/pending. As far as I know, there is no way to use the “unknown” state on a host.

Edit: I forgot that we are not living in a perfect world. I put -120 (seconds) into the entry; otherwise you will get a lot of unreachable flapping…

Edit 2: Out of curiosity I tried to give the host an unknown state. That was quite easy by editing the host.php. If you are interested, I can send you a patch.


#5

It is really a very bad idea to show the “last OK” status while we know nothing about the real state of the hosts and services.

If we are building a new distributed architecture, why not add new logic / states as well?

For example, the main server knows that an endpoint has not responded for some time, so it could set a new host state, “Unreachable_Remote_Endpoint”.

When an endpoint reaches this state, all hosts checked by this endpoint could be marked with the same status, and all services checked by this endpoint could be marked either as UNKNOWN or as “Unreachable_Remote_Endpoint” too.

Yes, this logic can be achieved with a Dependency, but having it as the default (maybe configurable via icinga2.conf) looks more suitable than the “last OK” status.

It also looks like a good idea to check the reachability of all remote endpoints automatically when distributed monitoring is used. (Again, yes, this can be achieved now by manually creating a Host for every endpoint.)

And of course Icinga users must be notified about all these situations through the standard notification logic.

Thanks for your responses!


(Rafael Voss) #6

You can use the cluster check for that:
https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#itl-icinga-cluster
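For the setup in this thread, a minimal sketch could look like the following. It assumes that the remote zone is also named nms16 (the thread only confirms nms16 as the endpoint name), that a Host object named nms16 exists in the master zone, and that the master zone is called "master". The service name "cluster-zone-health" is only an illustrative choice. Because the service lives in the master zone, the check should run on the master and keep reporting while the remote zone is down.

```
// /etc/icinga2/zones.d/master/health.conf (file name and zone name are assumptions)
apply Service "cluster-zone-health" {
  // ITL check command that reports whether a zone is connected
  check_command = "cluster-zone"

  // Zone whose connection state is checked; "nms16" is assumed here
  vars.cluster_zone = "nms16"

  // Attach the service to the host that represents the remote endpoint
  assign where host.name == "nms16"
}
```

With normal notifications applied to this service, the master should then alert when the connection to the remote zone is lost.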


#7

The cluster and cluster-zone checks look good for checking the connection state of endpoints and zones. Thanks!

But we still need an automatic dependency that forces all services of an endpoint / zone into the UNKNOWN state when the endpoint / zone is not connected to the main server.

I’ll try to do it manually (but it should be done “by default”, IMHO).
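A rough, untested sketch of such a manual dependency might look like this. It assumes that a service named "cluster-zone-health" using the cluster-zone check command already exists on a host named nms16, and that the hosts checked by the remote zone carry a custom variable vars.remote_zone; all of these names are hypothetical. Note that a Dependency does not set the child services to UNKNOWN: it marks them as unreachable and suppresses their notifications while the parent is not OK. Where the file has to live so that both the master and the remote zone accept the config is not covered here.

```
apply Dependency "remote-zone-health" to Service {
  // Parent: the cluster-zone service on the host representing the remote endpoint
  // (both names are assumptions, not taken from the thread)
  parent_host_name = "nms16"
  parent_service_name = "cluster-zone-health"

  // The dependency fails as soon as the parent service is not OK
  states = [ OK ]

  // Suppress notifications for the child services while the zone is disconnected
  disable_notifications = true

  // host.vars.remote_zone is a hypothetical marker for hosts checked by nms16
  assign where host.vars.remote_zone == "nms16"

  // Avoid a self reference from child to parent
  ignore where service.name == "cluster-zone-health"
}
```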

Thanks again!