Grafana Alerting and TransformNull problem - Data points missing / being interpreted as NULL (Graphite source)

  • Good afternoon,

    I am currently (still) in the process of rolling out our monitoring setup, and this time I got stuck with an interesting (at least to me) Grafana / Graphite problem.

    First of all the setup:

    One master with, at the moment, 3 satellites in their respective zones. The zones represent one client / site each. Every server is running icinga2, icingaweb2, Graphite and Grafana, as the clients want to see everything locally and we of course want to have all the zones represented on the master. Versions are as follows:

    icinga2: r2.7.0-1

    icingaweb2: 2.4.1

    Django: 1.7.11

    graphite: 0.9.12-3

    grafana: 4.4.3

    Replication from the zones to the master works; all graphs / panels look exactly the same, data-wise, on the master and the satellites. All servers run Debian 8.

    I started with a very basic overview panel for one of the zones. The client just wants a Grafana dashboard showing the ping RTA (the simple ping4 check metric out of icinga2) for all his switches and main control devices on the network: green with an RTA below 500 ms, red above 1 s or on complete loss of connection, just an easy-to-read dashboard for whatever tech is on site (trained or not). Sounded easy enough, so I slapped a few SingleStat panels on there, only to see them pop to N/A (NULL) while still being green. To work around this I tried transformNull(2), so that a loss of connection reliably shows as an error (a red panel). Remember, we are potentially talking untrained techs here, so we need a simple red = bad. After this the panels started randomly switching between "2 s - red" and whatever actual value the check reported back. It seemed that every so often the value gets interpreted as NULL for some reason.
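    For reference, the workaround was just wrapping the target in Graphite's transformNull function. The metric path below is only an example; the real path depends on how the Icinga 2 graphite writer is configured:

```
transformNull(icinga2.myswitch01.services.ping4.ping4.perfdata.rta.value, 2)
```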

    Next I tried StatusPanel, since it has a "set inactive if no data" option and you can choose the color you want to set it to. Sounded perfect at first, but this only resulted in a panel showing "Invalid number" whenever SingleStat showed NULL, with the panel still being highlighted green. The inactive state never occurred, even when the device was physically off and no RTA metrics were written.

    My next thought was that maybe there is really no data recorded in the graph at said times, so I created a graph for the same metric. While the graph had the occasional gap of a minute where data is actually missing (data is aggregated in 1 min intervals), which is also a problem in need of solving, the times where SingleStat and StatusPanel reported NULL did not match those gaps.

    Since I did not want to be defeated by a ping4 metric, I tried setting an alert on the graph itself: alert when the max of the last 10 s is higher than 1 s, show "no data" when no data is written, and show the alert, no-data and OK states in an alert list as a replacement for the red/green SingleStat panels. Or so I thought... Now the graphs show a bunch of gray "no data" indicator lines all over the graph, even at times with values present. I am at a bit of a loss. The only thing I can still think of is that the xFilesFactor option is messing with me somehow, since it is still at its default of 0.5, with average as its aggregation method.
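    One thing worth ruling out here (a sketch, not Grafana's actual evaluation code): with a 60 s storage resolution, a 10 s alert window will usually contain no datapoint at all, which by itself can produce the gray "no data" markers. The timestamps below are made up for illustration:

```python
from datetime import datetime, timedelta

def points_in_window(points, window_start, window_end):
    """Return the datapoints whose timestamp falls inside the alert window."""
    return [(ts, v) for ts, v in points if window_start <= ts < window_end]

# A metric written once per minute (60 s resolution), at :00 of every minute.
points = [(datetime(2017, 9, 1, 12, m, 0), 0.4) for m in range(0, 11)]

# Alert condition "max() of query(A, 10s, now)" evaluated at 12:10:45:
now = datetime(2017, 9, 1, 12, 10, 45)
window = points_in_window(points, now - timedelta(seconds=10), now)
print(len(window))  # -> 0: a 10 s window between two 60 s datapoints is empty
```

    Widening the alert window to at least one storage interval (e.g. the last 1m or 5m) avoids evaluating an empty range.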

    My storage-schemas.conf looks as follows:

    [carbon]
    pattern = ^carbon\.
    retentions = 60:90d

    [icinga2]
    pattern = ^icinga2\.
    retentions = 60s:90d,5m:180d,20m:1y

    [default_1min_for_1day]
    pattern = .*
    retentions = 60s:1d

    The logs from /var/log/graphite for the time gap shown in the screenshot:

    Is data aggregation screwing me over somehow? It looks like datapoints are received even in the time (12:09) where the graph does not show data, and more than one at that, so even with xFilesFactor at 0.5 this should be enough to generate a datapoint here. Why does every panel think there is no data, even when the graph shows data? This should be so simple, and yet I managed to make it complicated.

    Any input is greatly appreciated. If any logs / configs are missing to provide answers, just ask and I will deliver ASAP.

  • Hi,

    I used an xFilesFactor of 0 to get rid of a lot of gaps. The problem is that the xFilesFactor will set the value to NULL when most of the values are NULL. For example, if Graphite expects a value every minute (retentions = 60s:1d), but your ping check delivers only one value every 5 minutes, Graphite has four NULL values and one real value. As more than 50% of the values are NULL, Graphite will normalize the value to NULL.
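    The roll-up rule can be sketched like this (a simplification of what Whisper does when condensing a fine archive into a coarser one, not its actual code):

```python
def aggregate(values, xff, method="average"):
    """Roll a window of fine-resolution values up into one coarse point,
    roughly the way Whisper does: if the fraction of non-null inputs is
    below xFilesFactor, the aggregate itself becomes null."""
    known = [v for v in values if v is not None]
    if not known or len(known) / len(values) < xff:
        return None
    if method == "average":
        return sum(known) / len(known)
    raise ValueError("unsupported aggregation method: %s" % method)

# One real ping value and four misses in a five-slot window
# (a check writing every 5 min into a 60 s schema):
slot = [None, None, 0.4, None, None]
print(aggregate(slot, xff=0.5))  # -> None  (only 20% known, below 0.5)
print(aggregate(slot, xff=0.0))  # -> 0.4   (any known value survives)
```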

    As I never got rid of every gap, I switched to InfluxDB. All the pain is gone since then. I strongly recommend InfluxDB; it's easy to install.


    Thanks for the answer, so it is kind of like I expected. I'll try and set a way lower xFilesFactor then. Most issues of this kind we have gotten rid of by just building the graphs differently or using other representative metrics, like the metadata of the service.

    As for InfluxDB, I initially dropped it as I am not too firm on databases myself (getting better, but I'm still in a private little war with them) and I would have to go and research how to set it up to work in our environment. Every satellite (aka client zone) that wants graphing needs one, which is easy enough I'd guess, but our master setup has a DB slave that gets all our MySQL and Whisper data, which I have no idea how to handle with InfluxDB. Well, maybe in time, once everything is running for the first few zones.

    You need to convert your existing Whisper DB, or just delete it. After deletion, a new database with the xFilesFactor from your configuration will be created.
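    If you want to keep the history instead of deleting it, the Whisper tools can rewrite the files in place with a new xFilesFactor. The path and retentions below are examples and need to match your own storage-schemas.conf:

```
# Example only -- adjust the path and retentions to your own schema.
find /var/lib/graphite/whisper/icinga2 -name '*.wsp' \
  -exec whisper-resize.py {} 60s:90d 5m:180d 20m:1y --xFilesFactor=0.0 \;
```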

    With InfluxDB it's the same as with Whisper: if you have a master, you only need one InfluxDB. For Icinga it's just a matter of switching/configuring the feature from the graphite writer to the InfluxDB writer.
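    Enabling that on the Icinga 2 side looks roughly like this (hostname, port and database name are examples), followed by `icinga2 feature enable influxdb` and a reload:

```
object InfluxdbWriter "influxdb" {
  host = "127.0.0.1"
  port = 8086
  database = "icinga2"
}
```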

    That much was clear; what I meant was that the backup server running besides the master should have not only the MySQL data synced to it (as it does now as a MySQL slave) but also the InfluxDB data. At the moment we are running a carbon relay pushing the data to both the master and the backup server. This functionality I would have to look into with InfluxDB; sorry if I formulated that unclearly.
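    For comparison, the duplication we do today is just a carbon relay rule listing both destinations (hostnames below are examples):

```
# carbon.conf
[relay]
RELAY_METHOD = rules
DESTINATIONS = master.example.com:2004, backup.example.com:2004

# relay-rules.conf -- one rule matching everything, sent to both destinations
[default]
pattern = .*
destinations = master.example.com:2004, backup.example.com:2004
```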

    For the rest I am absolutely with you; I have to look into how I could replicate InfluxDB to another server if we really switch over. Well, xFilesFactor first, with the rest of the bugs still to iron out, and then onwards from there.