Need to understand the best way to apply distributed monitoring to two sites

  • Hello, and thanks in advance for any input.


    I've had multiple reads through the Distributed Monitoring guide, but am having enough trouble wrapping my head around it that I want to see if anyone here has some advice. From what I can tell, I'll need something along the lines of what's outlined in the "High-Availability Master with Clients" section of the guide, but could be wrong about that as I'm still struggling to understand it all.


    Right now, I'm running Icinga2 on an Ubuntu 16.04 Server virtual machine with 1 CPU and 4 GB RAM (additional resources will be added as needed), monitoring 10 clients with 66 services. So far my network has been a staging area for a project which will soon grow rapidly. I recently added a remote network, connected through a site-to-site OpenVPN tunnel, and monitoring hosts on the other side of the tunnel is working perfectly.


    A long-standing issue I've had is that if anything happens that takes the monitoring server offline, notifications can't happen. I believe having this second site available puts me in a good position to solve this problem, but I need to figure out the best way to configure Icinga2 for this functionality. If I understand right, I'll need one "Master" node at each site, since Satellites and Clients aren't used to send notifications. The High-Availability Master with Clients seems like it may be applicable, but I don't fully understand it, and my main concern there is having a master node running on a database that's on the other side of a potentially slow VPN tunnel.


    Would love to hear ideas about this, even if it's completely different from what I've figured out so far.


    Thanks again!

  • "since Satellites and Clients aren't used to send notifications"

    Wrong. It is up to you to run "icinga2 feature enable notification" on satellites and clients as well.

    The issue is avoiding the same notification being sent from different endpoints.

    Read:

    https://www.icinga.com/docs/ic…bility-with-notifications
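
    As a rough sketch of what that looks like in configuration: the notification feature is a NotificationComponent object, and per the docs its enable_ha attribute defaults to true. The file path below assumes a stock package install, and spelling out the default is only for illustration.

    ```
    // features-available/notification.conf (path assumed from a stock install)
    // enable_ha already defaults to true; it is written out here only to
    // make the HA behaviour visible in the config.
    object NotificationComponent "notification" {
      enable_ha = true
    }
    ```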



    "I'll need one "Master" node at each site"

    I think that should be the way to go. The masters would fail over nicely and resync once the network is back.

    Regarding the database, keep in mind that icinga2 only writes to it and never reads from it. You could also disable the HA on the IDO_xxxSQL feature of icinga2 so that only the master that is at the same location as the database will write to it.

    If a failover to the slow site happens, the remote master will write a backlog. Once that is replayed to the local master, this local master would write state changes to the database.


    But as icingaweb2 *is* reading from the database, it will not see the latest changes until the backlog is replayed.

    I would first test how bad the slow writes really are, and let both masters fail over the "active database writer" role, as described in the docs.
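
    For the "disable HA on the IDO feature" variant, a minimal sketch could look like the following. It assumes the MySQL flavour (IdoMysqlConnection) and that the ido-mysql feature is enabled only on the master at the database's site; the credentials and database name are placeholders, not values from this thread.

    ```
    // features-available/ido-mysql.conf on the master next to the database
    // (all connection values are placeholders)
    object IdoMysqlConnection "ido-mysql" {
      host = "127.0.0.1"
      user = "icinga"
      password = "CHANGEME"
      database = "icinga"

      // Defaults to true. With false, this endpoint always writes to its
      // database instead of taking part in the writer failover.
      enable_ha = false
    }
    ```

    Leaving enable_ha at its default and enabling the feature on both masters gives the failover of the "active database writer" role instead.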


  • I would use two masters in an HA zone. By default, both will load-balance the notifications being sent. If one goes offline, the other master continues to a) schedule/execute checks b) receive check results from the clients c) trigger alert notifications d) update your historical (maybe SLA relevant) backends.


    That is exactly what the second scenario in the docs describes. If you need to apply certain services across all master zone members, either put them inside the master zone or use the global-templates zone for just that. The only things disallowed in global zones are static host/service objects.
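
    A two-master HA zone with a global zone could be sketched in zones.conf roughly like this. The endpoint names and addresses are made up for illustration; in practice the endpoint names have to match each node's certificate name.

    ```
    // zones.conf, the same on both masters (names and addresses are placeholders)
    object Endpoint "master-site-a.example.org" {
      host = "192.0.2.10"
    }

    object Endpoint "master-site-b.example.org" {
      host = "198.51.100.10"
    }

    // Two endpoints in one zone make it an HA zone
    object Zone "master" {
      endpoints = [ "master-site-a.example.org", "master-site-b.example.org" ]
    }

    // Global zone for templates and apply rules shared by all nodes
    object Zone "global-templates" {
      global = true
    }
    ```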

  • Thank you both very much for your help. That definitely gets me on the right track.


    I've been working on setting that up, and am running into some issues that I believe are not directly related to this. I'm going to open a separate thread for that.


    I'll update here as I make more progress on the subject of this thread. Thanks again!