Posts by cyberkov


    Hi!


    Thank you Mikesch :) I will try to get more memory into the satellites (they are virtual machines), but this will take time :'(

    Until then I will raise the limit to 25 concurrent checks as well.
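
    As far as I understand, on 2.6 that setting lives in the checker feature, so the change is roughly this (a sketch; the exact file path may differ per distribution):

        # /etc/icinga2/features-enabled/checker.conf (path may vary)
        object CheckerComponent "checker" {
          concurrent_checks = 25
        }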


    For the version difference: unfortunately there is no 2.6.3 package for Ubuntu 12 anymore :( Upgrading the satellites will be much less of a pain once Icinga2 HA is in place on them. I can't wait for it.


    Cheers Hannes

    Hello!


    I started using the Director (importing data from the OTRS CMDB), currently with around 3800 hosts and 7700 services, monitoring a worldwide network of devices.

    For this I have one master running MySQL and Icinga Web 2, plus 6 satellites with the following specs:


    Machine   Specs             Icinga2 version
    master    4 CPUs, 4GB RAM   r2.6.3-1
    sat1      4 CPUs, 4GB RAM   r2.6.2-1
    sat2      2 CPUs, 2GB RAM   r2.6.2-1
    sat3      2 CPUs, 8GB RAM   r2.6.2-1
    sat4      4 CPUs, 4GB RAM   r2.6.3-1
    sat5      4 CPUs, 4GB RAM   r2.6.3-1
    sat6      2 CPUs, 4GB RAM   r2.6.2-1



    Since most of the checks are located in one zone, sat2 and sat5 both serve that zone as its endpoints.
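
    Roughly, that zone looks like this in zones.conf (the zone name and the addresses are placeholders):

        object Endpoint "sat2" {
          host = "sat2.example.com"   // placeholder address
        }

        object Endpoint "sat5" {
          host = "sat5.example.com"   // placeholder address
        }

        object Zone "satellite-zone-a" {
          endpoints = [ "sat2", "sat5" ]
          parent = "master"
        }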

    Nearly all of the checks are currently based on check_nwc_health, check_snmp_storage.pl (Manubulon plugins) and check_icmp.
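
    In plain config terms, the services generated via the Director boil down to something like this (the mode and the assign condition are only examples; I use the nwc_health CheckCommand from the ITL):

        apply Service "interfaces" {
          import "generic-service"
          check_command = "nwc_health"
          // example mode, the actual mode differs per service
          vars.nwc_health_mode = "interface-usage"
          assign where host.vars.device_type == "network"   // example condition
        }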


    Every time I reload the configuration on the master, the Icinga2 satellites spawn around 70 or more check plugins at once, as if they had forgotten their state before the reload.

    It takes around 20-40 minutes until things settle down again, or, even worse, some of the satellites crash. Sometimes the OOM killer kicked in and killed the icinga2 process, which isn't very helpful as you can imagine :-)

    [Image: load graph - https://preview.ibb.co/dwDrVF/load.png]

    The check_interval/retry_interval for generic-host is 3m/1m and for generic-service it is 5m/2m.
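
    In template form that is roughly this (only the interval attributes are shown):

        template Host "generic-host" {
          check_interval = 3m
          retry_interval = 1m
        }

        template Service "generic-service" {
          check_interval = 5m
          retry_interval = 2m
        }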

    The log_duration for the satellite endpoints is set to 2h. The small number of Icinga2 agents have a log_duration of 0, of course. The compat feature is turned off.
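
    For reference, the endpoint objects look roughly like this (hostnames are placeholders):

        object Endpoint "sat1" {
          host = "sat1.example.com"   // placeholder
          log_duration = 2h
        }

        object Endpoint "agent01" {
          host = "agent01.example.com"   // placeholder
          log_duration = 0
        }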

    I tried setting concurrent_checks down to 20 while I was testing, but even that caused rather high load.

    I added a tmpfs on /var/tmp/check_nwc_health to reduce disk writes, but it didn't help; iowait never seemed to be the problem anyway.
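
    The mount is roughly this (the size value here is only an example):

        # /etc/fstab (size is an example value)
        tmpfs  /var/tmp/check_nwc_health  tmpfs  rw,size=256m  0  0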


    As you can see I have tried a lot, and the documentation (and the Icinga2 book), while quite helpful, didn't tell me how to properly approach such a large-scale environment.

    For any pointers on where I could find and remove the bottleneck, I would be very thankful. If you need any more information, please let me know :) I didn't attach the icinga2 troubleshoot output for now, as I think this is just some setting I am missing. Of course I can add it if needed.


    Here is the output of https://localhost:5665/v1/status from all satellites for reference: https://gist.github.com/cyberk…970ec47d8b9ccfebc7a57c3ff



    Thanks in advance!

    Sorry for the wall of text :)


    Cheers Hannes

    Hello!


    It is probably my fault, but I am still having a hard time figuring out the Icinga2 Director's import process:


    Currently I have 3 datasources of varying quality:

    - PuppetDB

    - Active Directory

    - OTRS ITSM (v3, soon to be KIX2016/OTRS5)


    For the first two I followed the instructions and it works pretty flawlessly.

    The only thing I do not understand is how to apply filters to the incoming data here. We have a server naming convention of "$CITY(3)$TYPE(2)$NUMBER(3)", so a server in Graz would be "grzsv001" and one in Salzburg "sbgsv002". But Puppet holds notebooks as well, which would be "viepc001" for instance, and which I obviously do not intend to monitor.

    I expected to set the filter expression in the sync rule to hostname!=*pc*&!ip=127.* or (better) hostname!=???pc*&!ip=127.*, but neither seems to do the job.


    For OTRS it is even worse. I am exporting the devices to a CSV (via otrs.ImportExport.pl) as I know that the database model will change in KIX2016 so there is no point in writing advanced queries for OTRS's DB. So I end up with 3 files:

    - Locations

    - DeviceTypes

    - Devices

    Each row has a unique ID ("Number"). Devices belong to a Location and a DeviceType (on which I'd like to base my assign rules sooner or later).

    Devices also have a "Deployment State", which can be either "Active", "Inactive" or "Planned". Unfortunately the rule "Deployment State"="Active" does not work. I also tried mapping it to a "state" field and lowercasing it, but neither state=active nor state!=inactive works :(


    I couldn't get that information from the docs, so I am unsure whether I need to create another sync rule for hosts (one for the Puppet import, one for the AD import and one for the CSV import of the host objects), or whether I should put all mappings into one rule and use "purging"?


    Sorry for this wall of text. I really tried hard to get as much information as possible from the documentation, the issue tracker and the forum, but I am really stuck :(


    Thanks in advance!


    Cheers Hannes