Icinga2 Distributed Architecture question

This forum was archived to /woltlab and is now in read-only mode.
  • Hi all


    I am experiencing strange behaviour with my icinga2 installation.

    I am using icinga v2.6.3 + icingaweb 2.4.1

    Here is how the architecture is set up:

    I have two master nodes running in an HA setup: Icinga-master01, which also holds all the configuration, and icinga-master02. Both are part of the master zone.
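    For reference, a two-endpoint master zone of this shape would typically be declared in zones.conf on both masters roughly like this (a minimal sketch; the host values are assumptions based on the hostnames above):

    ```
    // zones.conf on both masters -- endpoint/zone names taken from the post,
    // the host addresses are assumptions
    object Endpoint "Icinga-master01" {
      host = "Icinga-master01"
    }

    object Endpoint "icinga-master02" {
      host = "icinga-master02"
    }

    object Zone "master" {
      endpoints = [ "Icinga-master01", "icinga-master02" ]
    }
    ```

    With both endpoints in one zone, the two masters must be able to connect to each other directly; checks are then load-balanced between them.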

    Then there is one satellite that sits in the same network as the masters, but has a public address (the reason for this setup is that we had a very limited number of public IPs, and I did not expect it to cause any problems). This satellite is connected to a single host outside the environment and executes HTTP checks against our applications.

    The satellite name is icinga01 (zone icinga01), and there is one client, hops-icinga2-us-master (the host dedicated to running the HTTP checks; it checks about 30 webpages).
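    The satellite and its client would then hang off the master zone roughly like this (again a sketch; the parent/child relations follow the description above, the addresses are assumptions):

    ```
    // zones.conf -- satellite in its own zone, parented to "master"
    object Endpoint "icinga01" {
      host = "icinga01"   // the satellite's public address (assumed)
    }

    object Zone "icinga01" {
      endpoints = [ "icinga01" ]
      parent = "master"
    }

    // the HTTP checker client, parented to the satellite zone
    object Endpoint "hops-icinga2-us-master" { }

    object Zone "hops-icinga2-us-master" {
      endpoints = [ "hops-icinga2-us-master" ]
      parent = "icinga01"
    }
    ```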

    The rest of the monitored clients (about 100 of them, with 1100 services) are in the same isolated network as the masters, connected directly to the master zone.

    Both masters are connected to the same IDO database.
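    Two masters writing to the same IDO database is a supported setup: when the idomysqlconnection feature is enabled on both, Icinga 2 holds an HA election so that only one endpoint writes at a time. The relevant attribute is enable_ha in the IDO feature config (a sketch; the connection details are placeholders, not the actual values):

    ```
    // /etc/icinga2/features-enabled/ido-mysql.conf on both masters
    object IdoMysqlConnection "ido-mysql" {
      host = "ido-db.example.com"   // placeholder for the shared DB host
      user = "icinga"
      password = "secret"           // placeholder
      database = "icinga"
      enable_ha = true              // default; only the elected endpoint writes
    }
    ```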

    Then there is another host running all our UI services: icingaweb2, grafana and kibana. Icingaweb2 uses the connection to the IDO database and provides the UI for our Icinga environment.


    Now the odd behaviour: when both master hosts' icinga2 services are up and running, some services do not seem to work; there are no notifications from them. Everything looks OK and there are no alerts, but I have experimentally confirmed that services are randomly "registered" at either master1 or master2. The problem is that some services appear not to run their checks at all: although the service looks OK, it will not notify when there is a problem, because it never receives check data. The "last check" and "next check" values just keep rising in the icingaweb2 interface.

    When I shut down the icinga2 service on master02, the system is operational again and all checks are executed correctly, but there is no HA.
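    In an HA master zone, checks are load-balanced between the two endpoints. If the cluster connection between master01 and master02 is broken while both are running, each endpoint keeps executing only its own share of the checks and the results of the other half never arrive, which would match the ever-growing "next check" values described above. A cluster check on both masters makes such a split visible (a sketch using the built-in ITL "cluster" CheckCommand; the assign expression is an assumption based on the hostnames above):

    ```
    // alert when the HA cluster connection between the masters is broken
    apply Service "cluster" {
      check_command = "cluster"
      assign where host.name == "Icinga-master01" || host.name == "icinga-master02"
    }
    ```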

    Another weird behaviour: checks executed via the satellite -> HTTP checker machine will sometimes produce two notifications each, both when the webpage goes down and when it comes back up. As only master1 is online, both notifications come from master1.
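    Notification HA is controlled separately from check HA: the NotificationComponent object has its own enable_ha attribute, and when it is true (the default), only the endpoint responsible for a check also sends its notifications. It may be worth verifying this setting on both masters (a sketch of the default config, not necessarily what is deployed here):

    ```
    // /etc/icinga2/features-enabled/notification.conf on both masters
    object NotificationComponent "notification" {
      enable_ha = true   // default; only one endpoint sends a given notification
    }
    ```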

    Also, when I reconfigure master1 and restart the icinga2 service on it, it raises an alarm for about 10 minutes saying that the CPU on both master01 and master02 is maxed out. The strange part is that the master02 service is off at that time...


    I can provide any logs or configuration files. I have not been able to find a way to fix this; I have tried to look for any possible cause, and I am not sure whether both Icinga2 masters may talk to the same IDO database, even though I did not find any article saying otherwise. Maybe I also have a problem with the PKI handling on the machines. I have used the community Chef cookbook to handle the Icinga configuration, plus some added code to handle the client and satellite configuration in distributed environments. It seemed to be working until I added the satellite with the HTTP checker node.


    I would be grateful for any hints here.

    Thanks in advance

    David


  • Please show the feature list of Icinga-master01 and Icinga-master02


    Please also add the zone configuration of Icinga-master01 and Icinga-master02 (and please format the output with the code tags </>)

  • Thanks for your time, here are the answers.


    Feature list:

    Icinga-master01

    ```
    apilistener
    command
    compatlog
    checker
    mainlog
    notification
    graphitewriter
    idomysqlconnection
    ```

    Icinga-master02

    ```
    apilistener
    command
    compatlog
    checker
    mainlog
    notification
    graphitewriter
    idomysqlconnection
    ```

  • Here are the zone config files. I have renamed the FQDNs, as we have our customers' names in there :)

    I also had to shorten the file, as there is a 10,000-character limit; however, I have only removed some customer VMs that are configured in the same way as those that remain.