Posts by Hactar

This forum was archived to /woltlab and is now in read-only mode.

    Ah, that is the difference between our setups. I also run the configuration top-down (master sending out config to the clients on change and reload), but in my setup the satellites actively connect to the master, since most of them are behind firewalls. They initiate the TLS connection; once it is established, the master sees them as online and sends out the config. The master itself never actively initiates a connection, since it would hit a firewall in virtually every case. Maybe that is the way to go for you too?


    If that does not work, I am missing a key point somewhere in your config, as I too cannot see where that might come from.

    Update: Changed the IDO resource over to latin1; filtering with "_host_mylocation=location" for example is now working. No crash or bugs so far, but we only tested it with one host for two hours. I will update and set this to solved after a few days of further testing on more hosts / zones.
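    For reference, the change amounts to setting the charset on the IDO resource in Icingaweb2's resources.ini, roughly like this (a sketch; the resource name and credentials are placeholders):

```ini
[icinga_ido]
type     = "db"
db       = "mysql"
host     = "localhost"
dbname   = "icinga"
username = "icinga"
password = "secret"
charset  = "latin1"
```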

    Ah, I had not seen that issue on GitHub when searching for my specific one; thanks for pointing it out!


    I see what the workaround is, but do I understand this correctly: any comment containing UTF-8 characters could then potentially crash the icinga2 instance? I have control over the host names etc., since the rollout runs only through us, but if a client side adds a comment, that could then crash my master? At least 2 sites so far are based in Austria, so we will have UTF-8 characters, and there will be no feasible way to filter client input. I do not really like the idea of my master crashing after one of the Austrian clients saves the "über comment of death", so to speak, to a downtime or notification acknowledgement.


    Is my assumption of the risk correct here?

    We are trying to set up several users that get filtered views in Icingaweb2, filtered by a custom variable on the host that basically contains the zone (which is a client site/location in our case).


    With the configs looking like this (which should be correct according to this post):



    we are getting this beauty of an error upon login with the user in question:


    Code
    1. SQLSTATE[42000]: Syntax error or access violation: 1253 COLLATION 'latin1_general_ci' is not valid for CHARACTER SET 'utf8', query was: SELECT so.name1 AS host_name, h.display_name COLLATE latin1_general_ci AS host_display_name, CASE WHEN hs.has_been_checked = 0 OR hs.has_been_checked IS NULL THEN 99 ELSE hs.current_state END AS host_state, so.name2 AS service_description, s.display_name COLLATE latin1_general_ci AS service_display_name, CASE WHEN ss.has_been_checked = 0 OR ss.has_been_checked IS NULL THEN 99 ELSE ss.current_state END AS service_state, CASE WHEN (ss.scheduled_downtime_depth = 0 OR ss.scheduled_downtime_depth IS NULL) THEN 0 ELSE 1 END AS service_in_downtime, ss.problem_has_been_acknowledged AS service_acknowledged, CASE WHEN (ss.problem_has_been_acknowledged + ss.scheduled_downtime_depth + COALESCE(hs.current_state, 0)) > 0 THEN 1 ELSE 0 END AS service_handled, ss.output AS service_output, ss.perfdata AS service_perfdata, ss.current_check_attempt || '/' || ss.max_check_attempts AS service_attempt, UNIX_TIMESTAMP(ss.last_state_change) AS service_last_state_change, s.icon_image AS service_icon_image, s.icon_image_alt AS service_icon_image_alt, ss.is_flapping AS service_is_flapping, ss.state_type AS service_state_type, CASE WHEN ss.current_state = 0 THEN CASE WHEN ss.has_been_checked = 0 OR ss.has_been_checked IS NULL THEN 16 ELSE 0 END + CASE WHEN ss.problem_has_been_acknowledged = 1 THEN 2 ELSE CASE WHEN ss.scheduled_downtime_depth > 0 THEN 1 ELSE 4 END END ELSE CASE WHEN ss.has_been_checked = 0 OR ss.has_been_checked IS NULL THEN 16 WHEN ss.current_state = 1 THEN 32 WHEN ss.current_state = 2 THEN 128 WHEN ss.current_state = 3 THEN 64 ELSE 256 END + CASE WHEN hs.current_state > 0 THEN 1024 ELSE CASE WHEN ss.problem_has_been_acknowledged = 1 THEN 512 ELSE CASE WHEN ss.scheduled_downtime_depth > 0 THEN 256 ELSE 2048 END END END END AS service_severity, ss.notifications_enabled AS service_notifications_enabled, ss.active_checks_enabled AS 
service_active_checks_enabled, ss.passive_checks_enabled AS service_passive_checks_enabled FROM icinga_objects AS so
    2. INNER JOIN icinga_services AS s ON s.service_object_id = so.object_id AND so.is_active = 1 AND so.objecttype_id = 2
    3. INNER JOIN icinga_hosts AS h ON h.host_object_id = s.host_object_id
    4. INNER JOIN icinga_hoststatus AS hs ON hs.host_object_id = s.host_object_id
    5. INNER JOIN icinga_servicestatus AS ss ON ss.service_object_id = so.object_id
    6. LEFT JOIN icinga_customvariablestatus AS hcv_mylocation ON s.host_object_id = hcv_mylocation.object_id AND hcv_mylocation.varname = 'mylocation' COLLATE latin1_general_ci WHERE ( (hcv_mylocation.varvalue = 'location1') AND CASE WHEN COALESCE(ss.current_state, 0) = 0 THEN 0 ELSE 1 END = '1') ORDER BY CASE WHEN ss.current_state = 0 THEN CASE WHEN ss.has_been_checked = 0 OR ss.has_been_checked IS NULL THEN 16 ELSE 0 END + CASE WHEN ss.problem_has_been_acknowledged = 1 THEN 2 ELSE CASE WHEN ss.scheduled_downtime_depth > 0 THEN 1 ELSE 4 END END ELSE CASE WHEN ss.has_been_checked = 0 OR ss.has_been_checked IS NULL THEN 16 WHEN ss.current_state = 1 THEN 32 WHEN ss.current_state = 2 THEN 128 WHEN ss.current_state = 3 THEN 64 ELSE 256 END + CASE WHEN hs.current_state > 0 THEN 1024 ELSE CASE WHEN ss.problem_has_been_acknowledged = 1 THEN 512 ELSE CASE WHEN ss.scheduled_downtime_depth > 0 THEN 256 ELSE 2048 END END END END DESC, UNIX_TIMESTAMP(ss.last_state_change) DESC, s.display_name COLLATE latin1_general_ci ASC, h.display_name COLLATE latin1_general_ci ASC LIMIT 10


    During setup, everything that we had the option to choose was configured with utf8, so I'm at a bit of a loss as to where the error comes from. We tried most combinations of the syntax that we could think of (like host.mylocation, _host_vars.mylocation, _host_vars_mylocation, etc.), but aside from the occasional "Got invalid custom var" we got the above message. Filtering on host names works, so filtering in general is working, but since there is no filter for zones (aside from custom vars) we are at a bit of a dead end here.
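    For context, the restriction we are trying to apply lives in Icingaweb2's roles.ini and looks roughly like this (a sketch; the role and user names are placeholders):

```ini
[location1_viewers]
users = "someuser"
monitoring/filter/objects = "_host_mylocation=location1"
```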


    Running Icinga2 2.7.1.1 and Icingaweb2 2.4.2 on Debian 8, everything from the official apt packages.


    Does anyone have any input that might point me to an error on my side, or will I have to file this as a bug on GitHub, since I could not find any open issue regarding this on there?

    That much was clear; what I meant was that the backup server running beside the master should have not only the MySQL data synced to it (as it does now as a MySQL slave) but also the InfluxDB data. At the moment we are running a carbon relay pushing the data to both the master and the backup server. This functionality I still have to look into for InfluxDB; sorry if I formulated that unclearly.


    For the rest I am absolutely with you; I have to look into how I could replicate InfluxDB to another server if we really switch over. Well, xFilesFactor first, with the rest of the bugs still to iron out, and then onwards from there.

    Thanks for the answer, so it is kind of like I expected. I'll try setting a much lower xFilesFactor then. Most issues of this kind we have gotten rid of by just building the graphs differently or using other representative metrics, like the metadata of the service.
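    A lower xFilesFactor would go into storage-aggregation.conf, roughly like this (a sketch; the section name, pattern and value are assumptions for our icinga2 metrics):

```ini
[icinga2_average]
pattern = ^icinga2\.
xFilesFactor = 0.1
aggregationMethod = average
```

    With xFilesFactor = 0.1, an aggregated datapoint is written as long as at least 10% of the underlying slots are non-null, instead of the default 50%.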


    As for InfluxDB, I initially dropped it as I am not too firm on databases myself (getting better, but I'm still in a private little war with them) and I would have to go and research how to set it up to work in our environment. Every satellite (aka client zone) that wants graphing needs one, which is easy enough I'd guess, but our master setup has a DB slave that gets all our MySQL data and Whisper data, and I have no idea how to handle that with InfluxDB. Well, maybe in time, once everything is running for the first few zones.

    I'm not using the Director, so I'm not really deep into this, but if you add something manually to a file, don't you need to re-run the Director's import process so it imports the changes as "external objects", which can then in turn not be modified by the Director?

    Yes, either do that, or at least what I can think of to make it look nicer and more readable: set Y-min to -1 and Y-max to 4 so you have a bit of room above and below. What might also be a good idea is to set the threshold to 2.7 or so, so that the graph really clears the threshold line visually, and to set said threshold line not to fill the space below with its color.


    But that's just my two cents; cool that it works for now!

    Well, what you could do is visualize it as a graph and set thresholds (found on the left under the Display tab when editing your metric) and name those thresholds accordingly. This gives you lines within the graph representing those levels, which you can even use for alerting via Grafana, or at least for alert lists as an overview, should you have more of those graphs.


    My question would be whether you need the history (the time interval) or just the state the service has now. If you need the latter, why not pull the state it has right now (without the min) and display that in a SingleStat panel? With value-to-text mapping you can easily represent what you want, and with color thresholds you can even fill the background or text however you see fit for the task. With sparklines on, you'd even see if there was a state change recently (at least with Graphite as a datasource you do, but I think that should be true for InfluxDB as well).


    Maybe I misunderstood what you want to represent (the time interval and the min operator in front of the value confuse me in the context of service metadata); if so, please go into more detail and we'll go from there.

    Good Morning community.


    We've encountered the following problem with our bandwidth graphing, which we only fully realized late Friday afternoon when checking used-up internet bandwidth (since there had been complaints about slowdowns). We build our bandwidth graphs for the few interesting ports using the inOctet and outOctet metrics reported back by the trusty old "check_interfaces" plugin (transformed with "nonNegativeDerivative"), and so far all data seemed to be within reasonable numbers (only 10Gb links going over 1Gb/s transfers, the 1Gb links sometimes going to 300-400Mb/s on file transfers, normal stuff) until we looked specifically at the LAN port connected to our firewall.


    Said LAN port drew a graph at around 400Mb/s and above on a 40Mb/s line coming in... Odd, we thought, and checked the internal bandwidth graph on the firewall, which showed the same graph but with 8-10x lower values. Scaling the graph by 0.125 (i.e. down to 1/8th, for those not in the mood to calculate), we get a perfect match.


    I'd say I may have misunderstood the value coming out of the plugin for the octets: I thought it would report 1 octet when 8 bits come through, whereas it seems to report on a 1:1 basis. The question now is: did I miss a flag in the plugin call that I can't find despite reading all the plugin documentation I could find? Is it that the various HP switch series, despite being different internally and MIB-wise, report the value wrong? Or do I just have to scale every graph by 0.125 because "that's just how it is"?
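    The arithmetic involved can be sketched like this (a minimal sketch, assuming a plain SNMP-style counter; the function name is made up): an octet counter is in bytes, so the per-second rate has to be multiplied by 8 to get bits/s. If a graph built that way comes out 8x too high, the counter was most likely already in bits, which matches the 0.125 scaling fix above.

```python
def rate_bits_per_sec(prev_count, curr_count, interval_s, counter_in_octets=True):
    """Non-negative derivative of an SNMP counter, converted to bits/s.

    If the counter is in octets (bytes), multiply the rate by 8; if the
    device already reports bits, the factor must be dropped (or the graph
    scaled by 0.125 after the fact, as described above).
    """
    delta = curr_count - prev_count
    if delta < 0:          # counter wrapped or reset: skip this sample
        return None
    rate = delta / interval_s
    return rate * 8 if counter_in_octets else rate

# 5,000,000 octets in 60 s is roughly 83,333 octets/s, i.e. about 666,666 bits/s
print(rate_bits_per_sec(0, 5_000_000, 60))
```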


    Thanks.

    Does your fan status ( "normal (1)" ) by chance correspond to the check output of the plugin, or can you narrow it down so the command only checks the fans? If so, you can just pull the information you need from the metadata of the plugin instead of the performance data. I had to use this approach with a ping check that wasn't showing consistent perfdata (basically flipping between data and null), while the metadata was stable and representative.


    Not much I can help with on your main problem though, sorry; I'm still working out templates myself...

    Good afternoon,


    I am currently (still) in the process of rolling out our monitoring setup, and this time I got stuck with an interesting (at least to me) Grafana / Graphite problem.


    First of all the setup:


    One master with, at the moment, 3 satellites in their respective zones. The zones represent one client / site each. Every server is running icinga2, icingaweb2, graphite and grafana, as the clients want to see everything locally and we of course want to have all the zones represented on the master. Versions are as follows:


    icinga2: r2.7.0-1

    icingaweb2: 2.4.1

    Django: 1.7.11

    graphite: 0.9.12-3

    grafana: 4.4.3


    Replication from the zones to the master works; all graphs / panels look exactly the same data-wise on the master and the satellites. All servers run Debian 8.


    I started with a very basic overview dashboard for one of the zones. The client just wants a Grafana dashboard showing the ping RTA (the simple ping4 check metric out of icinga2) for all his switches and main control devices on the network: green with RTA below 500ms, red above 1s or on complete loss of connection, just an easy-to-read dashboard for whatever tech is on site (trained or not). Sounded easy enough, so I slapped a few SingleStat panels on there, only to see them pop to N/A (NULL) while still being green. To solve this I tried "transformNull(2)" as a workaround to reliably get an "error" (a red panel) shown. Remember, we are talking potentially untrained techs here, so we need a simple red = bad. After this the panels started randomly switching between "2s - red" and whatever actual value the check reported back. It seemed that every so often the value gets interpreted as NULL for some reason.
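    The workaround target looked roughly like this (a sketch; the metric path is an assumption based on the icinga2 Graphite writer's naming scheme). transformNull replaces NULL datapoints with a fixed value, here 2, which is above the 1s red threshold:

```
transformNull(icinga2.myhost.services.ping4.ping4.perfdata.rta.value, 2)
```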


    Next I tried StatusPanel ( https://github.com/Vonage/Grafana_Status_panel ), since it has a "set inactive if no data" option and you can choose the color to set it to. Sounded perfect at first, but this only resulted in a panel showing "Invalid number" whenever SingleStat showed NULL, with the panel still highlighted green. The inactive state never occurred, even when the device was physically off and no RTA metrics were written.


    My next thought was that maybe there really is no data recorded at those times, so I created a graph for the same metric. While the graph had the occasional gap of a minute where data is actually missing (data is aggregated in 1-min intervals), which is also a problem in need of solving, the times where SingleStat and StatusPanel reported NULL did not match those gaps.


    Since I did not want to be defeated by a ping4 metric, I tried setting an alert on the graph itself: alert when the max of the last 10s is above 1s, show "no data" when no data is written, and show the alert, no-data and OK states in an alert list as a replacement for the red/green SingleStat panels. Or so I thought... Now the graphs show a bunch of gray "no data" indicator lines all over the graph, even at times with values present. I am at a bit of a loss. The only thing I can still think of is that the xFilesFactor option is messing with me somehow, since it is still at its default of 0.5 with average as its aggregation method.


    My storage-schemas.conf looks as follows:


    Code
    1. [carbon]
    2. pattern = ^carbon\.
    3. retentions = 60:90d
    4. [icinga2]
    5. pattern = ^icinga2\.
    6. retentions = 60s:90d,5m:180d,20m:1y
    7. [default_1min_for_1day]
    8. pattern = .*
    9. retentions = 60s:1d

    The logs from /var/log/graphite for the time gap shown in the screenshot:

    Is data aggregation f...ing me over somehow? It looks like datapoints are received even during the time (12:09) the graph shows no data, and more than one at that, so even with xFilesFactor at 0.5 this should be enough to generate a datapoint here. Why does every panel think there is no data even when the graph shows data? This should be so simple, and yet I managed to make it complicated.


    Any input is greatly appreciated. If any logs / configs are missing, just ask and I will deliver asap.

    Ok, as foretold I found out mostly everything by myself (a bit of sleep does wonders, who would have guessed), but I still ran into a problem I did not see coming.


    The satellite I want to integrate into my setup was used as a test server and has its own Director running. This throws off deployment greatly, as there are configs in the global zones of that server that should not be there. I need to get rid of the Director instance on the satellite server and really have no time to re-install (shipment is in 13 hours). What do I need to delete? I keep finding files from the Director and I am not sure what is actually safe to kill. My Google-fu was not very sufficient on this topic so far.


    /Edit: Got access to the locked warehouse instead of the remote session from before; the re-install is running. It would still be interesting what the solution is, for future reference. Another night shift it is then...

    Ok, it has been way too long, but again everything was more important to the boss than the Icinga2 setup, up until exactly now, when it has to be able to do configs and zone rollouts via Icinga Director and finish the first zone rollout by Friday at the latest. The fun never ends :D


    Anyway, I got the database and the users created, the API user is up, the global zone is on both masters, and I ran the kickstart wizard. Now it wants to do 200-something changes that include deploying the endpoints for my two masters again, as well as the master and director-global zones. Is my brain too mushed to get it, or is this correct? I thought it would just leave those be; or does it just re-create the current config via the API and nothing breaks?


    The next question would be (after the changes are deployed, if this is correct): how do I add another satellite zone? I got the endpoint for the satellite zone up and running (getting internet there tomorrow), and the endpoint / agent is able to reach both master servers. The zone contains a good number of hosts, all of which should be checked from the satellite endpoint, with data sent to the masters. The zone should only be able to see itself. Manually I would just use the node wizard to join the two zones, but since the day was a bit too long I don't quite get how to do it using the Director.
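    For reference, the manual equivalent in zones.conf would look roughly like this (a sketch; the endpoint and zone names are assumptions, and the Director should end up generating something similar):

```
object Endpoint "satellite1.example.com" {
}

object Zone "satellite1" {
  endpoints = [ "satellite1.example.com" ]
  parent = "master"
}
```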


    Any help would be appreciated, as always. I guess if I have more time tomorrow (which most likely will not happen) I might be able to solve some of it on my own, but since the sudden deadline does not allow for a full reconfiguration if I f... up, I'm a bit on the fence.

    I have experimented around a bit with the Director (on a separate install), and since our HA cluster setup (the masters + DB backend) is now up and running, it is time to get the Director on there (which I would prefer to do correctly the first time around).


    Since I see a good few questions coming up, and I have to document everything anyway, I thought that with my questions and experiences and the answers from the community I'd compile a little how-to / best-practice guide to send over to DNSMichi to make a sticky post out of, to be extended bit by bit. This way my documentation is useful for more people than it would be now.


    On to my questions. I have a 2-master HA setup (one config master where the node wizard ran as "master" and a backup master where the node wizard ran as "satellite") with the following config:


    There is nothing in /etc/icinga2/zones.d so far on either master. When the Config_Master is the active checking master, I see its local checks in icingaweb2; when the Backup_Master is the active checking master, I see only its local checks in icingaweb2. This tells me that a) failover is working and the servers communicate, and b) I need to check my servers via ssh and get rid of the local checks if I want to see both servers no matter which one is active (since the non-active one does not execute its local default checks).


    My first step now would be creating a Director API user on the config master, and since the Director should handle the config of all future zones (it will only be two zones for initial deployment aside from the master zone), I have to set

    Code
    1. object Zone "director-global" {
    2. global = true
    3. }

    but do I need to get rid of the other global zone (for regular config sync) as well?
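    For context, Icinga 2 allows multiple global zones side by side, so a zones.conf carrying both would look roughly like this (assuming the default "global-templates" zone created by the node wizard):

```
object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}
```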


    Furthermore, before I run the kickstart wizard I would like to adapt a few values in the standard configs for the "generic-host" template to fit my needs (for example, the warning / critical thresholds for processes are way too low for our setup). Is this a good idea, or is this better handled another way?


    As one can see, this is rather basic stuff for a lot of you guys, but I think I can build an easy how-to out of it to be used alongside the documentation. My initial thoughts on the structure were:


    • Preparation before install
    • Install using GitHub
    • Kickstart
    • How to use host templates, datatypes / fields and what to watch out for
    • How to set up services
    • Integrate custom commands
    • Create new zones using the Director
    • User management (permissions for Director users)


    Any further input is welcome; I am sure I missed quite a lot.

    Ok, a little update. Since this:

    Code
    1. Fatal error: Undefined class constant 'MYSQL_ATTR_INIT_COMMAND' in /usr/share/php/Icinga/Data/Db/DbConnection.php on line 171

    developed by itself over the weekend without anyone touching the system (the error comes up when you access <my-ip>/icingaweb2), and I do not have the time to diagnose everything again, I will probably kill the whole setup and rebuild with MariaDB 10.1.x.


    That seems to be the best way forward in terms of stability. So if anyone has the same bright idea as I did to use the 10.2.x RC: do yourself a favour and don't :D