Choosing a new Icinga2 architecture - Scalable over multiple clusters


I am trying to figure out whether it is possible to balance check execution across multiple child zones.

For now, we only have the first two levels and here is the situation :

  • Endpoints inside the Tier 2 cluster are hitting maximum CPU capacity
  • We already have two Endpoints per Tier 2 Zone
  • We want to keep our configuration grouped per Tier 2 zone, as the zones are a logical separation between different teams
  • All Icinga2 instances are running version 2.6.3 (the version we currently have in production)

So here are our questions :

  • Is there an architecture that allows us to keep our configuration organisation?
  • Can we have a three-tier architecture without physical endpoints inside Tier 2?
  • Is a Tier 2 zone capable of splitting checks across multiple Tier 3 zones (each containing 2 endpoints)?

At the moment we have tried the zone configuration below, but hosts configured inside the checkers zone stay in the PENDING state. We assume that checks do not reach the Tier 3 zones at this stage.

object Endpoint "icinga2-master-a" {
  host = ""
  port = "5665"
}

object Endpoint "icinga2-master-b" {
  host = ""
  port = "5665"
}

object Endpoint "icinga2-checker-1-a" {
  host = ""
  port = "5665"
}

object Endpoint "icinga2-checker-1-b" {
  host = ""
  port = "5665"
}

object Endpoint "icinga2-checker-2-a" {
  host = ""
  port = "5665"
}

object Endpoint "icinga2-checker-2-b" {
  host = ""
  port = "5665"
}

object Zone "master" {
  endpoints = [ "icinga2-master-a", "icinga2-master-b" ]
}

object Zone "checkers" {
  parent = "master"
}

object Zone "checkers-1" {
  endpoints = [ "icinga2-checker-1-a", "icinga2-checker-1-b" ]
  parent = "checkers"
}

object Zone "checkers-2" {
  endpoints = [ "icinga2-checker-2-a", "icinga2-checker-2-b" ]
  parent = "checkers"
}

object Zone "global-templates" {
  global = true
}

zones.conf is the same across the whole infrastructure.

Thanks in advance for any help.


Icinga2 does not support the concept of sub-zones (which is what you are trying to build).
The way to spread checks is to have multiple satellites in each zone; they load-balance the check execution between them, which reduces the load and removes bottlenecks.

The 3rd tier is the monitored hosts themselves, which can run as Icinga agents (defined as Endpoints on the satellites) or can be monitored by other agents.
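As a minimal sketch of that pattern (the endpoint names and addresses here are made up, not from the original post), a single zone with two satellites that share the check load looks like this in zones.conf:

```
object Endpoint "satellite-a" {
  host = "satellite-a.example.com"
}

object Endpoint "satellite-b" {
  host = "satellite-b.example.com"
}

// two endpoints in one zone: Icinga2 automatically distributes
// check execution between them, no extra configuration needed
object Zone "team1" {
  endpoints = [ "satellite-a", "satellite-b" ]
  parent = "master"
}
```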

Thanks for the answer Aflatto.

Can I have more details on how it would work ?

Like :

  • Do I have one folder per satellite cluster on the master for the configuration?
  • Can I have multiple client zones behind a satellite zone?
  • Does it mean that satellites are able to split the load across any number of client zones?
  • Can clients execute commands against the hosts to monitor, and send the perfdata back to the masters (to write it to the metrics backend)?

Thanks again for your time.

The best way is described in the docs:

To answer your individual questions :

  1. A directory under zones.d on the master = a zone. A zone can have multiple satellites in it, and they share and balance the check execution for the hosts and services defined for that zone.

  2. Clients are defined in the same zone as the satellites; again, you cannot define a “sub-zone”.

  3. See points 1 & 2.

  4. Yes, the perfdata will “travel” all the way back to the masters to be parsed and presented on a metrics dashboard.
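To illustrate point 1 (directory, zone and host names here are hypothetical examples), the master's configuration tree could look like:

```
# /etc/icinga2/zones.d/ on the master:
#   master/           -> checks executed by the masters themselves
#   team1/            -> checks executed by the satellites of zone "team1"
#   global-templates/ -> synced to every zone

# zones.d/team1/hosts.conf
object Host "switch-01" {
  check_command = "hostalive"
  address = "192.0.2.10"
}
```

Hosts placed under zones.d/team1/ are synced to the "team1" zone, and their checks are balanced across that zone's satellites.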

In fact, that is the point on which the documentation lost me a bit.
To be precise: in my situation there is no point talking about agents, because I only monitor proprietary network devices.

Therefore, can we assume that “clients” are Icinga2 instances that perform given checks with network devices as targets?
Or did I get it wrong, and the clients are the network devices themselves?
The documentation confuses me on this part:

  • A client node which works as an agent connected to master and/or satellite nodes.
  • A client node only has a parent node.
  • A client node will either run its own configured checks or receive command execution events from the parent node.

And my other question is about “sub-zones”: you said this concept does not exist within Icinga, but in my opinion the documentation leaves some hints that would imply it does:

[root@icinga2-master1.localdomain /etc/icinga2/zones.d/satellite]# vim icinga2-client1.localdomain.conf

object Endpoint "icinga2-client1.localdomain" {
  host = "" // the satellite actively tries to connect to the client
}

object Zone "icinga2-client1.localdomain" {
  endpoints = [ "icinga2-client1.localdomain" ]
  parent = "satellite"
}
Can the “icinga2-client1.localdomain” zone be considered a sub-zone in this case?

Agent means a Linux or Windows machine that has Icinga installed locally and configured in client mode. You have proprietary network devices, hence no Icinga agent.

You’ll have your checks, e.g. check_snmp, executed on your master or satellites.
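For example (host name, OID and community string below are made up for illustration), a check against a network device can be defined on the master so that it runs on the satellites of a given zone, using the `snmp` CheckCommand from the ITL:

```
// zones.d/team1/switch-01.conf on the master
object Host "switch-01" {
  check_command = "hostalive"
  address = "192.0.2.10"
}

object Service "uptime" {
  host_name = "switch-01"
  check_command = "snmp"  // ITL wrapper around the check_snmp plugin
  vars.snmp_oid = "1.3.6.1.2.1.1.3.0"
  vars.snmp_community = "public"
}
```

Because the file lives under zones.d/team1/, check_snmp is executed by the team1 satellites against the device, not on the device itself.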

The design of your Icinga environment depends on a few facts, e.g. network connections (especially firewalls) and the expected load on the masters and satellites.

Your configured zone “checkers” does not have any endpoints, and this is not allowed.
Following this chapter you’ll see that a satellite can have another satellite as its parent, which means you can have (many) more than 3 tiers.
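A sketch of such a chain (the zone and endpoint names here are hypothetical; the corresponding Endpoint objects are omitted for brevity), where the intermediate tier is a real satellite zone rather than an endpoint-less “checkers” zone:

```
object Zone "master" {
  endpoints = [ "icinga2-master-a", "icinga2-master-b" ]
}

object Zone "satellite-l1" {
  endpoints = [ "icinga2-sat-l1-a", "icinga2-sat-l1-b" ]
  parent = "master"
}

// third tier: its parent is another satellite zone,
// not an intermediate zone without endpoints
object Zone "satellite-l2" {
  endpoints = [ "icinga2-sat-l2-a", "icinga2-sat-l2-b" ]
  parent = "satellite-l1"
}
```

Every zone in the chain has its own endpoints; that is the difference from the configuration in the first post.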

Thanks for the information Roland.

Given the information gathered here, I have only one more question:
Can an Icinga instance combine multiple roles, like master and satellite (to save costs on VMs)?

No, one Icinga instance can be configured as master, satellite OR client/agent. What are you trying to achieve?

Precisely that: bringing scalability to the load per zone inside my architecture at the lowest possible cost in virtual machines. Here’s its state today:

  • 2 Masters in a master zone
  • 2 Clients in a team1 zone
  • 3 Clients in a team2 zone (I know the limitation but it works though xD)
  • 2 Clients in a team3 zone

We even tried to have 4 clients in the team2 zone, but after some reloads client 2.4 would not take load anymore.

So now we just want to raise our client count to gain load capacity in every team zone, without changing the configuration architecture, for organisational purposes. It is as follows for now:

  • zones.d/
    • master/
    • global-templates/
    • team1/
    • team2/
    • team3/

You keep using the term “clients” in the wrong context: a “client”, in your case, is a network device.

The nodes you place in the zones to load-balance the checks are called “satellites”; using the wrong term will cause the answers you get not to match what you ask.

So to make it clear: more than 2 satellites per zone is not recommended. 3 may work, but that is pushing it; a 4th will cause an infinite sync loop that makes the nodes ineffective and stalls monitoring.

The number of clients per zone is limited by the amount of checks the satellites can process and execute without creating lag in execution and reporting.

Ok so that kind of misunderstanding is exactly why I don’t like the Icinga2 documentation.

Nevertheless, now that this is sorted out, we can loop back to my original problem:
How can I add processing capacity to a satellite zone that already contains 2 endpoints which are maxed out given our infrastructure specs?

At the moment I can only see one way out: spreading a given team’s configuration across a satellite and its sub-satellites, as in the scheme I put in the first post.

I’d recommend to add team4, team5, … as much as you need.

Of course, but that is not acceptable from the usability perspective of the application.
It would force people who want to add hosts (and associated services) to figure out themselves which satellite zone they should be put in, assuming they could correctly estimate the number of checks associated with those services and the load increase it would imply.

For context, and to explain my statement above: we have 5000+ hosts and 26000+ services, and we need to expand both in the months to come.

A behaviour one could only expect from the application itself.

Are the satellites you have virtual or bare metal?

You can always upgrade them to better specs; virtual of course makes that easier.
Another thing is to upgrade the software: 2.10 has some nice performance improvements, and 2.11 is supposed to have even more (when it comes out as stable).

I have implemented a setup with similar numbers, and 2 satellite nodes in a zone with adequate specs were not an issue and still had capacity for more.

What are the specs of those satellite nodes?

We are working with virtual machines.
They are already using half the vCPUs of the ESXi hosts they are placed on, and as Icinga checks create constant load, we cannot take more without risking vCPU starvation for the other VMs on the ESX.
So we have more or less reached the max specs of our infrastructure already.

We are using custom Python scripts for SNMP polling in our Icinga commands, to ensure that the minimal number of SNMP polls is used and the polled devices are protected.
Do you use checks from plugins ?

About the interval between checks: we mainly use 1m. We could compare properly using check density, but I don’t have this information at the moment.

About specifications: all satellites are set up with 12 vCPUs and 4 GB of RAM.

The RAM is WAY too low for what you are doing; a good minimal spec for a satellite is 4 cores and 16 GB of RAM. You guys went the reverse way and starved the RAM while loading up on cores.

I’d recommend doubling the RAM (at least) on the satellites, and if it is not too much of an issue giving each one 16 or even 24 GB of RAM, and then checking the load on the devices.

Do you have references to justify those RAM requirements?

Because we originally had much more RAM, but a few months after the start of our monitoring, our virtualization team came to us saying that the RAM was underused on our satellite VMs and could be brought down without performance issues.

And as we were constantly adding new services and hosts during that time, we did not see any specific link between lowering the RAM and the increase in CPU usage.

I will try to diagnose the RAM usage myself on the satellites and compare with an analysis on their side.

After a quick diagnosis on my part and on the side of my virtualization team:

  • VM with 12 cores and 4 GB of RAM (satellite A of team1)

And according to this source on RAM management on Linux, I see no sign of a problem :

On the other side, I tried to retrieve the check density over 1 minute; I chose “Active Hosts Checks 1min” and “Active Services Checks 1min” from the “icinga” check command:

Do you have approximately the same amount of checks per minute?
It would help us narrow down what options we have left to deal with this.