Executing remote checks in a distributed monitoring setup


(Stacy Dunegan) #1

Hello. I am having an issue getting checks returned from the “client” to the “satellite”. I know there have been many topics on this, but I can’t find one that solves my problem.

Here is an overview of my setup:

icinga2-master (also set up as a satellite)

icinga2-satellite-2
icinga2-satellite-3
icinga2-satellite-4

icinga2-client-[1-1000]

icinga2-master:zones.conf
object Endpoint "icinga2-master"  {
  host = "127.xxx.xxx.1"
}

object Endpoint "icinga2-satellite-2"  {
  host = "127.xxx.xxx.1"
}

object Endpoint "icinga2-satellite-3"  {
  host = "127.xxx.xxx.1"
}

object Endpoint "icinga2-satellite-4"  {
  host = "127.xxx.xxx.1"
}

object Zone "master"  {
  endpoints = [ "icinga2-master", ]
}

object Zone "sat-2"  {
  endpoints = [ "icinga2-satellite-2", ]
  parent = "master"
}

object Zone "sat-3"  {
  endpoints = [ "icinga2-satellite-3", ]
  parent = "master"
}

object Zone "sat-4"  {
  endpoints = [ "icinga2-satellite-4", ]
  parent = "master"
}

object Zone "global-templates"  {
  global = true
}

Under the zones.d directory I have the following:

icinga2-master/hosts.conf
icinga2-satellite-2/hosts.conf
icinga2-satellite-3/hosts.conf
icinga2-satellite-4/hosts.conf
global-templates (has my .conf files)

In each hosts.conf I have the following config for the clients in each respective zone:

object Endpoint "icinga2-client-1" {
  host = "127.xxx.xxx.1"
}
object Zone "icinga2-client-1" {
  endpoints = [ "icinga2-client-1", ]
  parent = <whatever zone it needs to be in>
}
object Host "icinga2-client-1" {
  address = "127.xxx.xxx.1",
  display_name = "icinga2-client-1"
  check_command = "hostalive"
  vars.client_endpoint = "icinga2-client-1"
  vars.disks["disk"] = {}
}
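
For completeness, the zones.conf on each client follows the standard agent pattern and looks roughly like this (just a sketch; the satellite and client names are placeholders and vary per host):

// sketch only: endpoint/zone names are placeholders for each client
object Endpoint "icinga2-satellite-2" {
  host = "127.xxx.xxx.1"
}

object Endpoint "icinga2-client-1" {
}

object Zone "sat-2" {
  endpoints = [ "icinga2-satellite-2" ]
}

object Zone "icinga2-client-1" {
  endpoints = [ "icinga2-client-1" ]
  parent = "sat-2"
}

object Zone "global-templates" {
  global = true
}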

My main issue is that the top-down configuration is working for the hostalive/icinga2/load checks, but when I attempt to use check_disk, all of the client servers show the disk usage of the satellite server instead. From other posts I have seen the recommendation to use NRPE or the icinga2 agent to get that info. I would prefer the icinga2 agent route, but I can’t figure out where to start with that.


(Markus Frosch) #2

I guess you haven’t read the right documentation for command_endpoint yet.

You only set a var in your example here:

  vars.client_endpoint = "icinga2-client-1"

You’d have to set this for every service that should be executed on the “target” system:

object Service "bla" {
...
  command_endpoint = host.name
  // or
  command_endpoint = host.vars.client_endpoint
}

Hosts should be defined in the zone that is responsible for monitoring the target (the same as the parent of the client).

Have you seen Top Down Command Endpoint?
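
The pattern from that chapter boils down to roughly this (a sketch reusing your client_endpoint var; the service name is just an example):

apply Service "disk-agent" {
  check_command = "disk"
  // execute the check on the agent endpoint, not on the satellite
  command_endpoint = host.vars.client_endpoint
  // only apply to hosts that actually have an agent endpoint configured
  assign where host.vars.client_endpoint
}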

Maybe show us a Service example?


(Stacy Dunegan) #3

Sure. Here is what’s in my zones.d/global-templates/services.conf:

apply Service for (disk_name => config in host.vars.disks) to Host {
  check_command = "disk"
  check_interval = 5m
  command_endpoint = host.name
  vars = vars + config
  assign where host.vars.os == "Linux"
}

apply Service "icinga" to Host {
  check_command = "icinga"
  check_interval = 5m
  assign where host.vars.os == "Linux"
}

apply Service "load" to Host {
  check_command = "cpu_load"
  check_interval = 5m
  vars.backup_downtime = "02:00-03:00"
  assign where host.vars.os == "Linux"
}

I attempted to add “command_endpoint = host.name” earlier. With this configuration the load/icinga and hostalive checks all work and show a “check source” of the satellite that is responsible for them. I have a handful of servers (~10) that are showing the disk usage of the “client” server (what I want). However, on the other servers (~180) I see this message: “Remote Icinga instance ‘icinga2-client’ is not connected to ‘icinga2-master’”.


(Markus Frosch) #4

Uhm okay, so it seems like the satellites are sending the check command, but not receiving an answer.

Is the client logging anything? Is accept_commands enabled on the clients? See ApiListener
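
The relevant bit on the client is the ApiListener feature, something along these lines (a sketch; on most installs the file lives at /etc/icinga2/features-enabled/api.conf, but your paths may differ):

object ApiListener "api" {
  // required so the satellite can send check commands to this agent
  accept_commands = true
  // also accept config pushed from the parent zone
  accept_config = true
}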


(Stacy Dunegan) #5

On the clients, in api.conf, I see:

accept_commands = true
accept_config = true

It might be worth noting that I use Puppet to manage the installation and configuration of the master, satellites, and clients. So my question is: why does it seem to be working on 10 servers but not the other 180?


(Markus Frosch) #6

I see, so the config should be clean.

Is there nothing in the logs, in terms of warnings or errors, on either the client or the satellite?


(Stacy Dunegan) #7

So it seems like after making the configuration change. I need to restart the icinga2 agent on the client server and then from icingaweb2 I need to force an updated check. After trying that on 5 different servers it seems to be working. I’ll update this post after I verify that this was the solution.