Icinga High Availability - requirements and architecture choices

Hi, we currently have Icinga2 with Icingaweb2 / Director running on a set of Docker containers:

  • icinga node (includes webserver)
  • postgres node (IDO)
  • graphite node (graphs)

We recently stood up a dev instance to prepare for an HA deployment. We considered using our in-house VMWare VPLEX HA, but that mainly covers host HA, not the application, so it doesn’t really help if the application itself goes down. In reviewing https://www.icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#high-availability-master-with-clients, I have a few architectural questions before I devote time to tackling HA.

Our requirements:

  • Icinga2 HA running with Director and Icingaweb2
  • Postgres IDO backend
  • No clients are necessary, unless that is good practice

Questions:

  1. Should each icinga master be on a different cluster for safer HA failover?
  2. Should each postgres node (if HA desired) be on a different server?
  3. Do we need, or should we have, a postgres HA failover solution like http://clusterlabs.github.io/PAF ?
  4. If master-1 is the “config master,” is this really true HA? From what I understand, Director pushes to the config master, which then replicates to master-2.
  5. Will we need a proxy in front of the masters like NGINX to load-balance/failover to the right master in the event of failure? How is this typically done?

I am thinking I should stand up Icinga on two different hosts and “tie them together.” I admit I need to study the first linked doc more, but I wanted to conceptually wrap my head around the high-level requirements first. Our other thought was to leverage VMWare VIC (vendor Kubernetes) for some kind of HA, but we are not at that stage yet (testing first with a Jenkins node).

Containers add more complexity to your setup, especially when you consider high availability. That is something best handled by a container orchestrator like k8s, so you don’t have to solve it yourself.

Icinga 2 expects two running nodes inside an HA-enabled zone; when one goes down, the other one takes over the entire feature set. Since the IDO database backend is running on a dedicated instance, you can address it via a virtual/floating IP address.
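As a rough sketch (endpoint names and addresses below are made up for illustration), an HA-enabled master zone is just two endpoints sharing one zone in zones.conf on both nodes:

```
// /etc/icinga2/zones.conf on both masters (hypothetical names/addresses)
object Endpoint "master-1.example.com" {
  host = "10.0.0.11"
}

object Endpoint "master-2.example.com" {
  host = "10.0.0.12"
}

// Two endpoints in the same zone is what enables HA for features like the IDO
object Zone "master" {
  endpoints = [ "master-1.example.com", "master-2.example.com" ]
}
```

With both masters running the ido-pgsql feature against the same database and enable_ha left at its default (true), only one of them writes to the IDO at a time.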

This failover can be tested very easily with your containers: just shut down one of the master instances and watch the logs and Icinga Web 2’s monitoring health.

In terms of configuration, Icinga 2 and its cluster are built to keep running even if the assigned configuration master holding the config in “zones.d” is gone. If that node doesn’t come back, you can promote another node to configuration master using the configuration it has stored (read on in the docs how this works).
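A rough, hedged sketch of what promoting a surviving node could look like (paths follow the config-sync part of the docs; verify the exact layout on your version before copying anything):

```
# On the surviving master: the zone config synced from the old config master
# is kept under /var/lib/icinga2/api/zones/
ls /var/lib/icinga2/api/zones/master/
cp -r /var/lib/icinga2/api/zones/master/_etc/. /etc/icinga2/zones.d/master/
icinga2 daemon -C          # validate the configuration
# then restart icinga2 (systemctl, or supervisord inside the containers)
```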

If you’re pushing things via the REST API and want to target a specific node with a fallback, I would put an HTTP proxy in front which forwards requests to the primary node, or to the backup node if the primary is unavailable. This applies to Director deployments and other API tasks, e.g. incoming check results from external scripts.
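A minimal NGINX sketch of that idea (hostnames, ports and certificate paths are placeholders, not taken from your setup): API traffic goes to master-1 and only fails over to master-2 when master-1 is unreachable.

```
upstream icinga_api {
    server master-1.example.com:5665;
    server master-2.example.com:5665 backup;   # only used when master-1 is down
}

server {
    listen 5665 ssl;
    ssl_certificate     /etc/nginx/ssl/proxy.crt;
    ssl_certificate_key /etc/nginx/ssl/proxy.key;

    location / {
        proxy_pass https://icinga_api;
        # the Icinga API itself speaks HTTPS; configure upstream certificate
        # verification (proxy_ssl_verify and friends) as needed
    }
}
```

Director and any external scripts then talk to the proxy address instead of a specific master.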

In theory, this works. The best thing is to try it out and learn more about how it behaves.

Ok, it seems like I do need to push for an NGINX proxy to handle the main instance and to always present the “correct” address for external API calls. I did this previously with Artifactory, so I should be able to figure that out. One of my main stated concerns was the IDO / IDO HA, which you touched upon with the note regarding k8s. Unfortunately, we are still in the very early stages with VMWare VIC (vendor K8s).

However, instead of pushing hard with a project like https://github.com/paunin/PostDock (Postgres on Docker with full HA/failover), we will carry on and get VMWare VIC up and functioning. Assuming VMWare VIC is working, we would just add another node to our docker-compose YAML setup, acting as a client following the config master (rough sketch below). The reason I was looking at HA/failover for the IDO is data integrity, since Icinga is going to be front and center in monitoring critical areas of our Hadoop clusters.
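A rough docker-compose sketch of that direction (service names, images and credentials are hypothetical, just to illustrate the layout):

```
version: "3"
services:
  icinga-master-1:
    image: our-registry/icinga2:latest   # placeholder image
    hostname: icinga-master-1
    ports:
      - "5665:5665"
  icinga-master-2:
    image: our-registry/icinga2:latest
    hostname: icinga-master-2
  postgres:
    image: postgres:12
    environment:
      POSTGRES_DB: icinga_ido
      POSTGRES_USER: icinga
      POSTGRES_PASSWORD: example         # placeholder
  graphite:
    image: graphiteapp/graphite-statsd
```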

Thanks for the pointers. I’m trying my best to nail down the architectural requirements before I head full steam into each piece. I fully understand the complexity Docker adds, but beyond using supervisord in place of systemd, the setup hasn’t been radically different. Instead of the hostname, the Docker container name is used for inter-communication with the postgres/graphite servers.
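For example, the IDO feature on the Icinga containers just points at the postgres container by name (credentials below are placeholders):

```
// /etc/icinga2/features-enabled/ido-pgsql.conf (hypothetical values)
object IdoPgsqlConnection "ido-pgsql" {
  host = "postgres"        // Docker container/service name instead of a hostname
  port = 5432
  user = "icinga"
  password = "example"
  database = "icinga_ido"
}
```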