Posts by kunsjef


    Unfortunately the API didn't work either. The downtime start is set correctly, but the end is still set to the time the command is run. Maybe this is a bug?


    This is still something I want to do, but I have no idea how without tinkering with the database.

    However, the command below seems to set a downtime in the past:

    Code
    curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/schedule-downtime?type=Service&filter=service.name==%22procs%22' -d '{ "start_time": 1482220000, "end_time": 1482223800, "duration": 1000, "author": "icingaadmin", "comment": "IPv4 network maintenance" }'
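
    The start_time and end_time values are plain UNIX timestamps. Assuming GNU date is available, a human-readable time can be converted like this (the time string is just an example):

    Shell-Script
    # convert the start of the maintenance window to a UNIX timestamp
    date -d '2016-12-20 05:00' +%s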

    I will definitely try this. Thanks a bunch! I didn't consider the API.

    Hi all,


    Is there a way to register "historical" or past downtimes in Icinga2? We make a lot of reports for our customers. These are based on Icinga2, and we want them to be 100% correct. If a customer disconnects his line for local maintenance and we don't find out about it until later, we need to register this as a downtime from the time the host went down until the host came back up. Also, sometimes our engineers (not me :whistling: ) forget to add downtime to a specific host or service in a maintenance window, and we need to be able to set it afterwards or during the job.


    When we try to add downtime in the past from Icingaweb2 we get this error: A downtime must not be in the past.


    We have tried adding downtimes using External Commands, but that fails in another way: if we set the downtime between, say, 5 am and 6 am this morning, it will start the downtime at 5 am, but the downtime will not end at 6 am - it ends shortly after the command is run. If I run the command at 9:15 am, the downtime will end a couple of minutes after that, for example at 9:18 am.
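
    For reference, a downtime via the external command pipe looks roughly like this (host name and times are just examples, and the pipe path assumes the default layout):

    Shell-Script
    # SCHEDULE_HOST_DOWNTIME;<host>;<start>;<end>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
    echo "[$(date +%s)] SCHEDULE_HOST_DOWNTIME;host1;1482210000;1482213600;1;0;3600;icingaadmin;local maintenance" \
      > /var/run/icinga2/cmd/icinga2.cmd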


    Is there a way to do this right? :)


    Icinga version: r2.6.0-1, Debian 8.6 (Jessie), 1 master and 6 satellite checkers

    I'm only trying to help you here. That bug you're encountering is not fixed, so the best advice I can give you is to reduce the number of checkers. It is probably also a good idea to limit the number of concurrent checks (concurrent_checks in your CheckerComponent configuration).
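
    For illustration, that setting lives in the checker feature configuration - the value below is only an example, not a recommendation:

    Code
    # /etc/icinga2/features-available/checker.conf
    object CheckerComponent "checker" {
      concurrent_checks = 256
    }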

    I know you try to help and we really appreciate that. We would not run Icinga2 if this community did not exist.


    We have limited the number of checkers to 2 in each zone. I can confirm that this does not help with my problem of high load and, subsequently, a large number of alert messages. At 03:20 tonight one of our checkers stopped (Worker 2). "Worker 2" is in the same zone as "Worker 1": 1 and 2 are in zone 1, 3 and 4 are in zone 2, and 5 and 6 are in zone 3. These are the load and number-of-processes graphs from the rest of the checkers when Worker 2 stopped and restarted:



    My initial question was: is this expected behaviour - not the crashing, but the loads? If you restart one checker, should the others go bananas? Should checks time out, and should Icinga2 send alert messages?


    These are all notifications about timeouts:

    Shell-Script
    # grep "Sending notification" icinga2.log.1 | wc -l
    2141


    The notifications look like this:

    dnsmichi wrote:

    As far as I can see you're solely using those 6 nodes inside a zone for load-balancing the checks. If you reduce the number of endpoints in that zone to two, all your dependencies on host objects will still work.


    Right now you don't use any methods to pin your host checks to any of those nodes, so it does not matter where the host check runs or how many endpoints are in that zone, from a technical point of view.

    Ehm. We do not run 6 nodes for load balancing just for fun or just because we can. We run 6 nodes for load balancing because we have to. The reason is load. If I could remove 4 nodes I would be happy, but if I do that, Icinga2 won't run. We don't have a super big network, but even so, 5 checkers is just marginally OK. We added the 6th just in case one of the others fails for some reason. The smallest checker has 2 Intel Xeon 4-core 2.33GHz processors, and with 5 checkers it has a hard time keeping the load below 8.



    dnsmichi wrote:

    Our recommendation is to only have 2 endpoints in a zone until we investigate further. If you want to stay at 6 and live with the bug, it is up to you. Probably you prefer the recommendation over the bug :p

    We have removed a lot of dependency configuration and split the config into 3 separate zones with 2 checkers in each zone. Hopefully this will resolve the issues we have seen.



    dnsmichi wrote:

    Note aside: Your CheckCommand definitions could really use arrays and arguments. That sort of looks like a modified 1.x exporter.

    We really want to optimize our config, so an example of how to do it better would be greatly appreciated :)
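
    I guess the idea is something like this (the command name, plugin and parameters here are made up, just to illustrate the arrays/arguments style)?

    Code
    object CheckCommand "snmp-interface" {
      // command as an array instead of one long string
      command = [ PluginDir + "/check_snmp_int.pl" ]

      // each option maps to a custom variable the host/service can override
      arguments = {
        "-H" = "$address$"
        "-C" = "$snmp_community$"
        "-n" = "$snmp_interface$"
        "-w" = "$snmp_warn$"
        "-c" = "$snmp_crit$"
      }

      vars.snmp_community = "public"
    }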

    I am sorry to nag about this, but our Icinga2 installation and all monitoring is down at the moment because of this problem. I am just wondering if anybody else has had the same problem...? I am guessing that most people have all their network objects connected with dependencies? And I am also guessing there are others out there with more than 2 satellites?

    Our Inventory system generates all our Icinga2 configs based on templates that we have made. In the Inventory system we connect hosts via interfaces, and this is the basis for the dependency config in Icinga2. We also choose a specific function and select the correct model, and both of these are the basis for the service configs in Icinga2.


    This is an example config of one host:


    Another example:


    This is an example of what our network topology looks like. We have 2600 hosts that are all connected:
    http://www.graphviz.org/Gallery/twopi/twopi2.svg


    If you look at the configs above, both of these hosts have a dependency on host6. If host1 and host6 are in zone1 and host2100 is in zone2, the config check for host2100 fails because the satellites in zone2 have no config for host6.
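
    For clarity, a stripped-down sketch of the kind of object involved (just an illustration, not the exact generated config):

    Code
    # deployed with the zone2 config, but the parent host only exists in zone1
    object Dependency "host2100_to_host6" {
      child_host_name  = "host2100"
      parent_host_name = "host6"
    }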


    Then we get errors like this on the satellites in zone2:


    Shell-Script
    critical/config: Error: Validation failed for object 'host2100!host2100_to_host6' of type 'Dependency'; Attribute 'parent_host_name': Object 'host6' of type 'Host' does not exist.

    I don't know what you want to see, but the problem is that everything in our network is connected. Every host and service in Icinga2 has a parent/child dependency in a tree structure. Today we have all our hosts and services in one zone with 6 satellites. To work around this bug we have to divide our 6 satellites into 3 separate zones with two satellites in each zone.


    I was thinking that our 3000 hosts could be spread across 3 zones, with ~1000 hosts in each zone. But when every host is connected in a tree structure, some host will always have a dependency that crosses zones, and Icinga2 won't let us have a dependency between zones.


    So I am wondering if there is a way around this without deleting some of the dependencies? I am sure there are others out there who have all hosts connected with parent/child dependencies and who have more than 2 satellites.

    Hello,


    that's not a crash though, just Icinga restarting after receiving a config change:

    Code
    [2016-08-18 15:00:29 +0200] information/ApiListener: Restarting after configuration change.

    This has nothing to do with your initial crash. Try starting Icinga with gdb when there are no config changes pending.

    Well, this is a production environment, and our Icinga2 installation automatically receives new configuration from our Inventory system. We have 20+ users updating the Inventory, so configuration changes happen all the time. I will start it again though, and hope for the best :-)
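
    For reference, this is roughly how I start it in the foreground under gdb (the path is the Debian default, if I remember right - /usr/sbin/icinga2 is only a wrapper script, so gdb needs the real binary):

    Shell-Script
    # stop the service first, then run the daemon under gdb
    gdb --args /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 daemon
    (gdb) run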

    This was not as easy as I thought. In our Icinga2 config we have dependencies between all our hosts, and all services have dependencies on their respective hosts. When everything has dependencies, we cannot split the configuration into several zones without getting error messages like this (host1 is in one zone, and host2 in another):

    Shell-Script
    Aug 19 12:20:02 worker1 icinga2[28835]: /var/lib/icinga2/api/zones/worker/_etc/hosts/host1.conf(23): object Dependency "host1_to_host2" {
    Aug 19 12:20:02 worker1 icinga2[28835]: /var/lib/icinga2/api/zones/worker/_etc/hosts/host1.conf(24): parent_host_name = "host2"
    Aug 19 12:20:02 worker1 icinga2[28835]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    When everything is connected with dependencies, Icinga2 will not allow us to split the config into different zones.


    Is it possible to solve this without breaking the dependencies?

    Ok so now it crashed again. This is the output from gdb when it crashed:

    Unfortunately, when I run "thread apply all bt full" this happens:


    Shell-Script
    (gdb) thread apply all bt full
    Cannot find new threads: generic error
    (gdb)

    Any tips? My gdb window is still open...

    Hi,


    All nodes are updated with the latest Icinga2 package for Debian

    Master zones.conf

    Satellites zones.conf (they are all exact copies of this):
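
    For reference, the structure is roughly like this (endpoint names and addresses are made up, and only two of the checkers are shown):

    Code
    object Endpoint "master1" {
      host = "master1.example.com"
    }

    object Endpoint "worker1" {
      host = "worker1.example.com"
    }

    object Endpoint "worker2" {
      host = "worker2.example.com"
    }

    object Zone "master" {
      endpoints = [ "master1" ]
    }

    object Zone "workers" {
      endpoints = [ "worker1", "worker2" ]
      parent = "master"
    }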

    Hi all,


    We currently have 1 master Icinga2 server that does no checking, and 6 checkers that do all the hard work. As we understand it, the load balancing algorithm in Icinga2 divides all the checks evenly across the available checkers. In our case, each checker will do about 17% of all the checks, independent of its hardware resources. We had a lot of trouble with this in the beginning, when we thought that checks were divided based on how much each checker could actually do (we use gearman in our Icinga1 installation).


    Anyway, we currently have 2600 hosts and 9000 services in Icinga2. Most of these checks (~6000) are interface checks that take about 500 ms to complete. With 6 checker servers that each have 8+ powerful CPUs, this works fine when everything is stable. There is no latency, and the load average on the checkers is around 4-6, depending on the number of CPUs.


    The problems start if one of the checkers restarts or if the icinga2 process stops somehow (we currently have a problem with this). When this happens, some - not all - of the other checkers experience high CPU loads (500-1000) and the number of processes rises drastically (1000+). This leads to check timeouts and multiple (again 1000+) alert messages being sent out - often in the middle of the night. The high loads last anywhere from 15 to 60 minutes before everything goes back to normal.


    Is this expected behaviour? If not, where do we start troubleshooting this problem? Of course we need to solve the problem I linked to above, but it should still be possible to restart the icinga2 process on a checker without getting thousands of email alerts.

    Ok now I have the debug logs. I have to admit that this is beyond my understanding. I hope this makes some sense to you :-)


    First this is what gdb said when icinga2 crashed:


    And this is the output from "thread apply all bt full". It was more than I was allowed to post here, so here is a link:
    http://pastebin.com/eA6U7fj5


    -Thomas

    We have a Load Distribution cluster setup with 1 master server and 6 checker servers. Every now and then (as often as once per day) the icinga2 service on the checkers stops with the following error:


    We have a script that automatically restarts the service if it stops, but this fault still puts stress on the other checker servers, which in turn causes check timeouts. Has anyone else seen these segfaults? What could be the cause of this issue?


    All servers run version r2.4.10-1 on Debian version 8.5.


    By the way - no crashlog is generated on the checkers when this happens.
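
    If I understand the defaults correctly, a crash report would normally end up under the crash directory, but there is nothing there:

    Shell-Script
    # default crash report location (assuming standard paths)
    ls -l /var/log/icinga2/crash/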

    Wolfgang wrote:

    Icinga1 uses only one core. When notifications are blocking, the other processes will wait. Running gearman distributes the monitoring processes and you probably know what happens if it's down.

    I can see that this might be a problem. But we use gearman for most of our checking, so our Icinga1 server mainly handles the web interface, notifications and keeping track of statuses. The only real problem we see in Icinga1 is that check latency can be too high every now and then. That never happens in Icinga2, and that is a good selling point for Icinga2. We need to know when important hosts or services go down, and we need to know it immediately. But I am guessing that if we used the same amount of servers for our Icinga1 setup, we would not have any latency issues.



    Wolfgang wrote:

    The load values themselves don't tell much. Are the servers "responsive" when you experience high values?

    When the load is "normal", around 5-7 on the checkers, they are responsive and there is no problem. Before we installed the task spooler, when we had loads of 100+ and even 1000+, the servers would stop responding and just die. Well, they were alive, but they stopped responding even to ping, and the console on them was more or less frozen. Even with the task spooler active we see load peaks when we reload Icinga2 on the master, or if we reload one of the checkers, but only temporarily for 5-10 minutes. During that time the checkers with high load are sluggish, but they work.


    The Icinga2 master is sluggish now and then when the load exceeds 4, and also the web GUI is a little slow now and then with high loads. I am hoping that an upgrade to 8 CPUs will solve this problem.


    The MySQL server sees a LOT more activity with Icinga2. During Icinga2 reloads, before we had the task spooler, the MySQL server was trampled by Icinga2, and this caused a lot of problems for other systems using the same server. Now we have a separate MySQL server for Icinga2, and we had to upgrade to Percona MySQL to prevent error messages in Icinga2 telling us that the "database server cannot keep up". This server has 16GB RAM, and this is not enough, so we are going to double the amount of RAM soon.




    Wolfgang wrote:

    Having check timeouts: can you name a specific service / type of services that could be the cause of the problem?

    The problem is that the checkers get too much to do (too much load, and too many tasks for the task spooler to handle). All kinds of checks time out - even ping. The errors we see look like this:
    Exception occured while checking 'ytrsa0014': Error: Function call 'fork' failed with error code 11, 'Resource temporarily unavailable'
    We run quite a lot of checks, but most of them are either ping checks or fairly simple SNMP queries. We have maybe 40-50 checks that require a bit more resources, but we have tried turning those off without seeing any change.
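
    As far as I understand, error code 11 is EAGAIN, which usually means the per-user process limit or memory was exhausted at that moment. Checking the limit for the user the checks fork as (the Debian packages run as the nagios user by default, as far as I know) would be something like:

    Shell-Script
    # show the per-user process limit for the monitoring user
    sudo -u nagios bash -c 'ulimit -u'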


    What I don't understand is how other people can run Icinga and monitor 5000+ hosts and 30000+ services. Do they see the same problems we have, or do they run Icinga1?