Change check source for hostalive check for endpoints


(Axel577) #1

Hey all,
I’m running an Icinga 2 2.10.2 master and a lot of endpoints that have the Icinga agent installed.
Currently, all endpoints are checked with the ping4 service, so when an endpoint (Windows or Linux server) is down, the ping service is critical.
Furthermore, every endpoint has the hostalive check. It looks like this:


The check source of that check is the endpoint itself, so when the endpoint is down, the hostalive check doesn’t react and the endpoint is still shown as “UP”.
I’m wondering if there is any possibility to change the check source of that hostalive check? The check source is supposed to be the Icinga master, not the endpoint.
The result would be that the hostalive check reacts and the endpoint is shown as “Down”. Furthermore, it would appear as a host problem on the dashboard, not as a service problem anymore. It’s confusing when a complete endpoint is down but it’s not shown as a host down.


#2

Hi,

please add your configuration (host object, service object/apply rule, zone/endpoint configuration).

If you are using the Top Down Command Endpoint method to execute checks on agents, you can leave out the command_endpoint attribute for this service; the check will then be executed on your master.

If you are using the Top Down Config Sync method, move the service into your master zone.
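
As a rough sketch of the difference (the service names and the vars.agent custom variable are only placeholders here, and both apply rules are assumed to live in the master zone):

// Command Endpoint method: the master schedules the check,
// the agent executes it (requires host name == endpoint name).
apply Service "disk" {
  import "generic-service"
  check_command = "disk"
  command_endpoint = host.name
  assign where host.vars.agent
}

// Without command_endpoint the master schedules and executes
// the check itself, e.g. a reachability check.
apply Service "ping4-from-master" {
  import "generic-service"
  check_command = "ping4"
  assign where host.address && host.vars.agent
}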

Best regards
Michael


(Axel577) #3

Hi, that’s my configuration:

/etc/icinga2/zones.conf:

object Endpoint "syslog" {
  host = "192.168.1.1"
}

object Zone "syslog" {
  endpoints = [ "syslog" ]
  parent = "icinga2"
}

/etc/icinga2/zones.d/syslog/hosts.conf:

object Host "syslog" {
  import "generic-host"
  address = "192.168.1.1"
  vars.os = "Windows"
  vars.ping = "yes"
}

/etc/icinga2/zones.d/global-templates/services.conf:

apply Service "CPU" {
  import "generic-service"
  check_command = "load-windows"
  vars.load_win_warn = "90"
  vars.load_win_crit = "95"
  assign where host.vars.os == "Windows"
}

apply Service "RAM" {
  import "generic-service"
  check_command = "memory-windows"
  vars.memory_win_warn = "10%"
  vars.memory_win_crit = "5%"
  assign where host.vars.os == "Windows"
}

apply Service "ping4" {
  import "generic-service"
  check_command = "ping4"
  assign where host.address && host.vars.ping == "yes"
  zone = "icinga2"
}

The hostalive check is defined in templates.conf:
/etc/icinga2/conf.d/templates.conf:

template Host "generic-host" {
  max_check_attempts = 5
  check_interval = 1m
  retry_interval = 30s
  check_command = "hostalive"
}

The hostalive check, CPU check and RAM check are executed on the endpoint; the check source is “syslog”.
Only the ping service is executed from the Icinga master.
For that service I defined

zone = "icinga2"

Maybe I can do the same with the hostalive check?


#4

Hi,

in a distributed environment (e.g. master - agents) it is recommended to disable the conf.d directory on the master and on the agents.

vim /etc/icinga2/icinga2.conf

// include_recursive "conf.d"

Now you have to consider which configuration mode you want to use; there are two: Top Down Command Endpoint and Top Down Config Sync.

You should read about the advantages and disadvantages of both.

Short summary:

  • If you use the Command Endpoint method, the scheduling of the checks is done on the master; the actual execution is done on the agent.

  • If you use the Config Sync method, the configuration objects for the agent zone are synced to the agent. The scheduling and execution are done on the agent.

The following Service apply rules are good examples of the Config Sync method. Since they are placed in the global-templates zone, which is a global zone by default, they are synced to all agents. Thereby every agent (where the apply rule matches) gets this service, and the actual scheduling and execution is done on the respective agent.

cat /etc/icinga2/zones.d/global-templates/services.conf

apply Service "CPU" {
  import "generic-service"
  check_command = "load-windows"
  vars.load_win_warn = "90"
  vars.load_win_crit = "95"
  assign where host.vars.os == "Windows"
}

apply Service "RAM" {
  import "generic-service"
  check_command = "memory-windows"
  vars.memory_win_warn = "10%"
  vars.memory_win_crit = "5%"
  assign where host.vars.os == "Windows"
}

The zone attribute enforces that the check is executed in the icinga2 zone, which seems to be your master. This also works in the host object.

apply Service "ping4" {
  import "generic-service"
  check_command = "ping4"
  assign where host.address && host.vars.ping == "yes"
  zone = "icinga2"
}

Conclusion:

Disable the conf.d directory to get a single point of truth for your configuration. Leaving the conf.d directory enabled in distributed environments quickly leads to weird and hard-to-debug behavior, for example when you define a new hostalive service in global-templates with settings that compete with the local hostalive service.

Since it seems that you have already started with the Top Down Config Sync method, I wouldn’t change this. To get the hostalive check for your host object executed by the master, you have to add the zone attribute to enforce that it is executed in the given zone.

object Host "syslog" {
  import "generic-host"
  address = "192.168.1.1"
  vars.os = "Windows"
  vars.ping = "yes"
  zone = "icinga2"
}

Best regards
Michael


(Axel577) #5

Hi Michael,
thank you for your detailed answer!
I added

zone = "icinga2"

to my hosts.conf, as you said.

Now the hostalive check uses icinga2 as its check source.

But: all other checks can’t be performed anymore, because icinga2 is their check source, too:

The master tries to check itself now.

I tested the same for a Linux endpoint, where I use the load check, but the load values are those of the master, not of the endpoint.


#6

This is because the service objects inherit the zone attribute from the host object.

To pin the checks to a specific endpoint you can use the command_endpoint attribute.

apply Service "RAM" {
  import "generic-service"
  check_command = "memory-windows"

  command_endpoint = host.name

  vars.memory_win_warn = "10%"
  vars.memory_win_crit = "5%"
  assign where host.vars.os == "Windows"
}

This follows the convention that the host object name is equal to the endpoint name.


(Axel577) #7

Could you tell me the exact line?
Since it’s an apply rule that is applied to a lot of Windows endpoints, I would need a variable.


#8

What exactly do you mean?

The added line for the apply rule is:

command_endpoint = host.name

This requires that the host object name is equal to your endpoint object name, e.g.

object Host "syslog" {
  [...]
}

object Endpoint "syslog" {
  [...]
}

(Axel577) #9

Hi Michael,
sorry, you are totally right. I think I was confused yesterday.
So I added

command_endpoint = host.name

to all apply rules.
Now the hostalive check is carried out by the master, and all local checks are carried out by the endpoint. When the endpoint is down, it is shown as a host problem, since the hostalive check reacts.
That’s what I wanted to accomplish. Great.
I’ve got a question though:
When I restart or stop the Icinga service on an endpoint, all services are shown as “Unknown” under the “Services” tab of the host, in the dashboard and in the history.
Screenshot from dashboard:


Is there a way to change that? It clutters everything; especially the history is flooded with these Unknown events.

Before I added

zone = "icinga2"

to the hosts.conf, the services remained UP even though the Icinga service was stopped.


#10

This happened because the scheduling for the checks was done on the agent; now, with the zone attribute set, the scheduling happens on the master.

You can implement a health check that checks whether the agent (zone) is connected. Then you can apply a dependency, based on that health check, to the checks that are executed on the agents.

  • health check not ok -> do not execute the agent checks.

You can find an example in the documentation.

https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#health-checks

The example only holds back the notifications for the agent checks. If you want to keep the history clean and don’t want them executed while the health check is not OK, you can add disable_checks = true to the dependency.

https://icinga.com/docs/icinga2/latest/doc/09-object-types/#dependency
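
As a sketch, based on the dependency example from the documentation linked above, with the check execution suppressed as well (the names follow that example):

apply Dependency "health-check" to Service {
  parent_service_name = "child-health"

  states = [ OK ]
  disable_notifications = true
  disable_checks = true    // also skip executing the agent checks while the parent is not OK

  assign where host.vars.client_endpoint
  ignore where service.name == "child-health"
}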


(Axel577) #11

I already have the cluster check:

object Service "Icinga Agent not running" {
  check_command = "cluster"
  max_check_attempts = 5
  check_interval = 10m
  retry_interval = 30s
  host_name = "icinga2"
}

I called it “Icinga Agent not running”. You can see it on my last screenshot.

Now I added:

apply Dependency "health-check" to Service {
  parent_service_name = "child-health"
  states = [ OK ]
  disable_notifications = true
  assign where host.vars.client_endpoint
  ignore where service.name == "child-health"
}

But I think that’s wrong, isn’t it? I don’t understand where “child-health” comes from.


#12

The cluster check won’t work here, since it checks whether all agents are connected. You need the cluster-zone check to check whether a specific agent (zone) is connected.

apply Service "child-health" {
  check_command = "cluster-zone"

  display_name = "child-health-" + host.name

  /* This follows the convention that the client zone name is the FQDN which is the same as the host object name. */
  vars.cluster_zone = host.name

  assign where host.vars.client_endpoint
}

This requires the custom variable client_endpoint in your host object to identify the hosts that are checked via command_endpoint.

Also keep in mind that the child-health check should be executed more frequently than the agent checks, otherwise it is possible that the dependency doesn’t apply because the child-health check has not yet noticed that the agent is no longer connected.

object Host "syslog" {
  [...]

  vars.client_endpoint = name //follows the convention that host name == endpoint name
}

Now you can apply the dependency.

apply Dependency "health-check" to Service {
  parent_service_name = "child-health"

  states = [ OK ]
  disable_notifications = true

  assign where host.vars.client_endpoint
  ignore where service.name == "child-health"
}

(Axel577) #13

I did it exactly that way now, but it doesn’t work yet.
Where do I have to put

apply Service "child-health"

and

apply Dependency "health-check" to Service

?
In global-templates/services.conf or in local conf files on the master?


#14

What exactly doesn’t work? Did you get a config compile error?

global-templates should be fine. Put the service in global-templates/services.conf and the dependency in global-templates/dependencies.conf to keep a clean structure.

For now only the notifications are held back by the dependency. Add disable_checks = true to also hold back the execution of the agent checks.


(Axel577) #15

I don’t get a config compile error and I added disable_checks = true:

apply Service "child-health" {
  import "generic-service"
  check_command = "cluster-zone"
  display_name = "child-health-" + host.name
  vars.cluster_zone = host.name
  assign where host.vars.client_endpoint
}

apply Dependency "health-check" to Service {
  parent_service_name = "child-health"
  states = [ OK ]
  disable_notifications = true
  disable_checks = true
  assign where host.vars.client_endpoint
  ignore where service.name == "child-health"
}

object Host "nessus" {
  import "generic-host"
  address = "192.168.1.1"
  vars.os = "Linux"
  zone = "icinga2"
  vars.client_endpoint = name
}

But when I stop the Icinga service on the nessus endpoint, all the Unknown services still appear in the dashboard, the history and the Hosts overview.

The checks aren’t executed anymore:


So I think the dependency partly works, but I don’t know why the services still become Unknown.


#16

Try the following.

apply Service "child-health" {
  [...]
  
  check_interval = 5s
  retry_interval = 5s
  
  [...]
}

The disk service, for example, has been unknown for 47 seconds, but the child-health service only for 19 seconds. So the following happened:

  1. You stopped the Icinga 2 daemon on your agent.
  2. The master scheduled the disk check.
  3. There was no check result for the disk check, meaning the agent is not reachable/down/stopped.
  4. The disk check is now Unknown.
  5. The master scheduled the child-health check.
  6. The child-health check result is Critical, meaning the agent is not reachable/down/stopped.
  7. child-health is now Critical.
  8. The dependency health-check is now active and holds back notifications and check executions.

The child-health check needs to be executed more frequently to avoid this.


(Axel577) #17

You are totally right, that was the problem. I adjusted the check interval as you suggested.
May I ask you some more questions:

  1. Can you tell me the difference between the cluster check and the cluster-zone check? Both tell me when the service is stopped.
  2. You told me I should disable conf.d. Currently, I have configured some conf files in that folder, for example notifications.conf. Should I move all conf files from the conf.d folder to /etc/icinga2/global-templates/?
    Furthermore, I have some switch and access point hosts in /etc/icinga2/conf.d/hosts/. Where should I move these files to?

(Rafael Voss) #18

Hi,

The cluster check checks all connections in the zone and all parent connections. The cluster-zone check checks only the connection of the one zone you want.

See the link posted by @mcktr:

Well, that depends: place them where you need them. If notifications.conf is only needed for notifications on your master, then put it in your master zone folder (in my configuration that’s /etc/icinga2/zones.d/master) and move the ones that are required on every agent to global-templates (in my configuration that’s /etc/icinga2/zones.d/global-templates/).
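
As a rough sketch of that layout (the folder names follow my examples above; in your setup the master zone seems to be called “icinga2” instead of “master”):

/etc/icinga2/zones.d/
  master/                    # objects only the master needs to know about
    notifications.conf       # notification apply rules handled by the master
    hosts.conf               # e.g. switches/access points that are only checked from the master
  global-templates/          # synced to all agents and satellites
    templates.conf           # generic host/service templates
    services.conf            # service apply rules
    dependencies.conf        # dependency apply rules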


(Axel577) #19

@unic
Hi Rafael, but is there any impact if I don’t disable conf.d and leave all files in the conf.d folder?

@mcktr
Hi Michael, I found out that the dependency is only active when the health check is in a hard state. Is that correct?


(Rafael Voss) #20

Yes, the impact is that all of these configs are only available on the master. If you need some generic/global configuration files (like templates), agents or other satellites will not know about them and you will need to copy them manually.

You can configure in which state the dependency will be “active”:

https://icinga.com/docs/icinga2/latest/doc/03-monitoring-basics/#dependencies

If the dependency should already be triggered while the parent object is in a soft state, you need to set ignore_soft_states to false.
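
Applied to the dependency from this thread, that would look roughly like this (ignore_soft_states defaults to true):

apply Dependency "health-check" to Service {
  parent_service_name = "child-health"

  states = [ OK ]
  disable_notifications = true
  disable_checks = true

  ignore_soft_states = false   // trigger the dependency already while the parent is in a soft state

  assign where host.vars.client_endpoint
  ignore where service.name == "child-health"
}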