Icinga2 - Distributed Mode - Architecture with more than two nodes per zone

  • Hello guys,


    A colleague and I are currently working on a monitoring platform, and we are facing a problem.


    Specs:

    System: RHEL/CentOS 7

    Icinga2: 2.6.3


    Our current architecture is a master cluster, where we put the configuration, plus three child zones, each containing another cluster of checkers.

    The problem is that one of these zones already has too many checks to perform and is running at almost 100% CPU load.


    The context:

    - we can't scale up those checkers any further (already 24 vCPUs per checker)

    - we can't put 3 checkers in a zone, because Icinga2 itself warns about poor CPU performance in that case

    - we can't increase the time between checks

    - we tried a 3-level architecture (with satellites and clients), but satellites aren't supposed to perform checks


    How can we deal with this issue?

    Why can't Icinga2 balance the load between more than two endpoints?


    Please, any help would be appreciated.


    AlexJ

  • - we tried a 3-level architecture (with satellites and clients), but satellites aren't supposed to perform checks

    That is exactly the way to go.

    Put n zones with 2 endpoints each below the master and delegate the checks to these.


    Pseudocode:

    Code
    Zone_Master(endpoints[m1,m2])
    Zone_Satellite1(endpoints[s1_1,s1_2])
    Zone_Satellite2(endpoints[s2_1,s2_2])
    Zone_Satellite3(endpoints[s3_1,s3_2])
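
    In real zones.conf syntax, that layout could look roughly like this (a sketch only; the host addresses are placeholders, and satellite2/satellite3 follow the same pattern as satellite1):

    Code
    object Endpoint "m1" {
      host = "10.0.0.1"
    }

    object Endpoint "m2" {
      host = "10.0.0.2"
    }

    object Zone "master" {
      endpoints = [ "m1", "m2" ]
    }

    object Endpoint "s1_1" {
      host = "10.0.1.1"
    }

    object Endpoint "s1_2" {
      host = "10.0.1.2"
    }

    object Zone "satellite1" {
      // checks for objects in this zone are load-balanced between its two endpoints
      endpoints = [ "s1_1", "s1_2" ]
      parent = "master"
    }

    // satellite2 and satellite3 are declared the same way, each with parent = "master"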

  • Hello sru,


    That is exactly how we did it, but the problem is that, even with your answer, we have Zone_Satellite2 (for example) that is overloaded because it has too many checks to perform.


    Can we:

    - add more endpoints to a zone?

    - or make a configuration apply to more than one zone, but not to all of them like global-templates does?


    e.g. have a common configuration for two different zones, like Satellite1 and Satellite2 in your example

  • - add more endpoints to a zone?

    Currently, the suggestion is to limit the number of endpoints per zone to 2.

    I am not sure if that would change in 2.7.x:

    Icinga2 Cluster with master and clients using Kubernetes


    - or make a configuration apply to more than one zone, but not to all of them like global-templates does?

    No. A zone is a boundary of trust and thus cannot be "merged" with other zones.

    But wait:

    What about putting apply rules in the global-templates zone that have an assign where clause matching multiple zones?

    That should be fairly legal, I guess:


    assign where (host.zone=="sat1") || (host.zone=="sat2")
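
    For instance, placed in a file inside the global-templates zone, such a rule could look roughly like this (the ping4 check is only an example):

    Code
    apply Service "ping4" {
      check_command = "ping4"

      // matches hosts that belong to either satellite zone
      assign where (host.zone == "sat1") || (host.zone == "sat2")
    }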


  • Are you sure it can be recommended?

    Because according to the documentation it is not possible.


    Moreover, I don't think your solution would do anything more than apply the same service to hosts in two different zones, and that would still require us to manually split the hosts between two zones.


    What we want is a way to balance the load of a given list of hosts and services across more than two endpoints (one zone).


    In fact, I am starting to worry, because I think we will have to split all our hosts manually, and that is not something you should expect from a modern app.

  • What is preventing you from splitting your zones into smaller pieces?

    Also, from the way you wrote your initial question, I am assuming you are using some kind of remote check to check other hosts.

    What is preventing you from letting the hosts do all the checking and report back to a satellite server?


  • What is preventing you from splitting your zones into smaller pieces?

    We want, if possible, to keep a logical separation between zones. Right now it is "one team of our department, one zone", and cutting that into smaller pieces would mean more work for the operators. It would also require us to find a split that achieves an almost perfect balance in the number of checks to perform. I don't know how to do that, so if you know a way, please share.

    What is preventing you from letting the hosts do all the checking and report back to a satellite server?

    Currently, we have a 2-level architecture: a master cluster (holding the configuration); clusterB (receiving confB and checking hostsB), clusterA (receiving confA and checking hostsA), clusterC (receiving confC and checking hostsC).

    Are you suggesting we go for a 3-level architecture?

  • The 3-level architecture is the way to go.


    To answer your first part: as I do not know your setup, I cannot suggest a better way to separate your zones.

    From the way you describe it, I would really suggest bringing in a consultant who can help you with your specific problems.


  • That's where I am a bit confused by the documentation, because I didn't fully understand the work done by the satellites in a 3-level architecture.

    If it means that the satellite level's job is just to receive configurations and pass the checks on to the checkers, I am a bit skeptical that those tasks would take a sufficient amount of work off my checkers.


    Can you explain this part to me in detail?

  • If configured correctly, the satellite does no work at all. It is just there to distribute the configuration, schedule the checks on all hosts in its zone, receive their results and pass them on to the master.

    This is crucial to make sure that all data gets to the master, even if the connection to it is lost.


    Depending on the number of checks you are running, the load on the clients is not much.

    On my Linux servers, which are running around 40 checks each, the load barely exceeds 2% on average.
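
    To give a concrete picture of the config-sync layout (host name, address and file path below are just illustrative, not from your setup): on the master you place the object definitions in the satellite zone's directory, and the satellite endpoints then schedule and execute those checks between themselves and forward the results upwards.

    Code
    // on the master, in /etc/icinga2/zones.d/satellite1/hosts.conf
    object Host "web01" {
      check_command = "hostalive"
      address = "10.0.1.50"
    }

    // the two satellite1 endpoints receive this config, load-balance the check
    // between themselves, and send the result up to the master zone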


  • So what would I gain from this architecture, given that I don't have any problem getting check results back to the master?


    About the number of checks, that is where the problem is: when I said in my first post that one of our clusters was overloaded, I was talking about 100% (or very close) average CPU load (^^') with 13452 services to check every minute. That's why we are looking for a way to spread the check load across more servers.


  • apply the same service to hosts in two different zones, and that would still require us to manually split the hosts between two zones.

    correct.

    That's why we are looking for a way to spread the check load across more servers.

    Back to the basics:

    An endpoint can do one or both of 2 things:

    - Schedule a check for execution

    - execute a check *or* delegate that execution to a subordinate endpoint.


    You would like to have one zone with all your objects in it.


    If the reason is "because that looks nice for the operator", you could place all your checks in the master zone and set a command_endpoint attribute for each service.

    That way, all objects are located and scheduled in the same zone, but actually executed on a given subordinate endpoint.


    But you still have to manually set, for each check, the destination endpoint that should run it.

    Because that is not acceptable for you, I currently see no solution for that scenario.
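
    For completeness, such a service could look roughly like this (the check, the endpoint name and the filter are only examples):

    Code
    apply Service "disk" {
      check_command = "disk"

      // the object lives (and is scheduled) in the master zone,
      // but the check command itself runs on this endpoint
      command_endpoint = "s2_1"

      assign where host.vars.os == "Linux"
    }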


  • Hi guys! I work with AlexJ on the same project, so I'll keep posting for him :)

    Our decision is to reduce the number of services and to check our interfaces less often. But because we have two environments running Icinga2, I decided to try running Icinga2 with a 3-VM zone in the test environment.


    I have to say I am kind of surprised by the outcome. The test bench contains 2 VMs as masters and a zone with 3 checker VMs.

    So far, it looks like Icinga is running like a charm on our 3 VMs and on the masters...


    Checker1
    [Image: https://img15.hostingpics.net/pics/931792rdcdal.png]

    Checker2
    [Image: https://img15.hostingpics.net/pics/977728rdcdid.png]

    Checker3
    [Image: https://img15.hostingpics.net/pics/288403rdcdid2.png]

    Master1
    [Image: https://img15.hostingpics.net/pics/884660master1.png]

    Master2
    [Image: https://img15.hostingpics.net/pics/818653master2.png]

    As we can see, Checker3 is the one I added to the cluster to make it a 3-VM cluster. Each graph is divided into the three sections I marked on the first graph.

    • First section is the CPU load in a 2-VM cluster with a reduced number of services
    • Second section is the CPU load in a 3-VM cluster with a reduced number of services
    • Third section is the CPU load in a 3-VM cluster with all services running


    We can see a slight drop when switching from a 2-VM cluster to a 3-VM cluster, and then the CPU load jumps when all our services are added.

    But more importantly, there is no sign of Icinga misusing the CPU time, either on low load or full load.


    How can we explain that? Would the CPU time be affected a lot more if we used more than 3 VMs in the cluster?

  • How can we explain that? Would the CPU time be affected a lot more if we used more than 3 VMs in the cluster?

    While the graphs are amazing, I am not sure what you are

    a) expecting (do you expect the load to fall to 2/3 if you add another endpoint?)

    b) asking.


    We can see that 2 endpoints have a slightly lower load than 3 endpoints in the low-check-count scenario.

    I would like to stress that it is currently not recommended to run a zone with more than 2 endpoints, as per the observations stated above.

    Regarding the masters, that load is what I would expect.

  • We can see that 2 endpoints have a slightly lower load than 3 endpoints in the low-check-count scenario.

    I would like to stress that it is currently not recommended to run a zone with more than 2 endpoints, as per the observations stated above.

    There is just one part missing, but Korbs thought it wasn't needed, because we had already explained several times in this same thread that we were at FULL load with only 2 endpoints.


    Just as a reminder, THIS is our 2 servers (2-endpoint mode) at full load:



    [Image: https://cdn-images-1.medium.com/max/1600/1*NZaNH0SFe67HtTxKrPXkAg.png]


    So when we use 3 endpoints it is actually still better than that hell, even though it is not recommended.


    That is why we are asking "How can we explain that?"


  • Our current architecture is a master cluster, where we put the configuration, plus three child zones, each containing another cluster of checkers.

    The problem is that one of these zones already has too many checks to perform and is running at almost 100% CPU load.

    So this is the first post.


    If it had been in the screenshot, it would have carried a comment like "BEFORE the 3-VM cluster, with full load", which again wasn't shown but was written about several times.


    So, just to move forward: if someone is able to explain this situation, we would appreciate it. Or is it recommended, in this kind of "extreme" case, to get in touch with the Icinga2 developers?

  • We think it is really a matter of the number of checks rather than the type of checks, because most of our checks are just Python scripts calling a single snmpget.


    Regarding the number of checks, we already brought it down by changing the check interval from 1m to 5m for a large number of services.
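
    For reference, that kind of interval change is just something like this in a service template (the template name is made up):

    Code
    template Service "snmp-interface-service" {
      check_interval = 5m   // previously 1m
      retry_interval = 1m
    }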

    We also dropped some others, because we found them useless in the end.


    Right now we are in a position where we feel we have done our best to reduce the load, and our last option is either to scale up our servers (without knowing whether it would be very effective) or to scale out the architecture to achieve better load balancing.