Retention policies and continuous queries made simple


(Carsten Köbke) #1

This is a small howto about downsampling data for InfluxDB.

Use this howto on your own risk!
And as always -> create a backup first before you change anything!

Setup

InfluxDB name: icinga2
default retention policy: rp_1_year for new databases, or autogen for existing.

Terms

continuous query (CQ) An query that runs automatically and periodically within a database, normaly every 30 minutes if you didnt configure other times in your influxdb.ini

retention policy (RP) The part of InfluxDB’s data structure that describes for how long InfluxDB keeps data (duration).

Actual situation for most InfluxDB setups used with Icinga2

Most users setup InfluxDB like in this blog post, which works perfect.

Your Icinga2 server will send all the metrics (and meta data) to your InfluxDB, InfluxDB store all data inside the “DEFAULT” retention policy and keep it forever (or until you have enough and delete it). That is how InfluxDB works, you store data in a retention policy!

But what if you don’t want to store all the data indefinitely? You will need a retention policy (or more) and if you also want to downsample your data older then x hours|days|weeks|years you also need continuous queries :slight_smile: Lets create a setup that will downsample all data older then 52 weeks (1 hour to one entry) and downsample again after 104 weeks (1 day to one entry) deleting data older then 3 years.

Explanation:
Icinga2 will send performance data to InfluxDB -> InfluxDB will store all data inside the default retention policy.

The first continuous queries will pull data, that is older then 52 weeks, every hour from the default retention policy and write it into the retention policy “rp_2_years”.

The second continuous queries “cq_after_2_years” will pull data, that is older then 104 weeks, every day from the retention policy “rp_2_years” and write it into the retention policy “rp_3_years”.

The default retention policy will delete all data from the default policy that is older then 52 weeks and one day.
The retention policy “rp_2_years” will delete all data from the retention policy “rp_2_years” that is older then 104 weeks and 2 days.
The retention policy “rp_3_years” will delete all data from the retention policy “rp_3_years” that is older then 156 weeks.

Prepare a new InfluxDB database

  1. Log into influx cli
  2. Create database with non default retention policy named “one_year”
CREATE DATABASE "icinga2" WITH DURATION 52w1d REPLICATION 1 SHARD DURATION 168h NAME "rp_1_year"
  1. Change to your newly created database
use icinga2
  1. Create second retention policy “rp_2_years” and “rp_3_years”
CREATE RETENTION POLICY "rp_2_years" ON "icinga2" DURATION 104w2d REPLICATION 1
CREATE RETENTION POLICY "rp_3_years" ON "icinga2" DURATION 156w REPLICATION 1
  1. Check if all retention policies are there
> SHOW RETENTION POLICIES
name       duration   shardGroupDuration replicaN default
----       --------   ------------------ -------- -------
rp_1_year  8760h0m0s  168h0m0s           1        true
rp_2_years 17520h0m0s 168h0m0s           1        false
rp_3_years 26208h0m0s 168h0m0s           1        false
  1. Create the first continuous query that will downsample all metric data (meta data will not be taken) older 52 weeks and downsample 1 hour of data into 1 entry writing it into the retention policy named rp_2_years
CREATE CONTINUOUS QUERY "cq_after_1_year" ON "icinga2" BEGIN SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "icinga2"."rp_2_years".:MEASUREMENT FROM "icinga2"."rp_1_year"./.*/ WHERE time < now() -52w GROUP BY time(1h),* END
  1. For the next continuous query we want to use data from the retention policy “rp_2_years”
CREATE CONTINUOUS QUERY "cq_after_2_year" ON "icinga2" BEGIN SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "icinga2"."rp_3_years".:MEASUREMENT FROM "icinga2"."rp_2_years"./.*/ WHERE time < now() -104w GROUP BY time(1d),* END
  1. Check if all continuous are set like we wanted it.
> SHOW CONTINUOUS QUERIES
name: _internal
name query
---- -----

name: icinga2
name            query
----            -----
cq_after_1_year CREATE CONTINUOUS QUERY cq_after_1_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_2_years.:MEASUREMENT FROM icinga2.rp_1_year./.*/ WHERE time < now() - 52w GROUP BY time(1h), * END
cq_after_2_year CREATE CONTINUOUS QUERY cq_after_2_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_3_years.:MEASUREMENT FROM icinga2.rp_2_years./.*/ WHERE time < now() - 104w GROUP BY time(1d), * END

For existing databases with a retention policy “autogen”

Hint

Its not possible to rename existing retention policies, but we can alter them.
Do not drop the default retention policy, you will loose all your data!

  1. Log into influx cli
  2. Change to your existing database
use icinga2
  1. Create second retention policy “rp_2_years” and “rp_3_years”
CREATE RETENTION POLICY "rp_2_years" ON "icinga2" DURATION 104w2d REPLICATION 1
CREATE RETENTION POLICY "rp_3_years" ON "icinga2" DURATION 156w REPLICATION 1
  1. Check if all retention policies are there
> SHOW RETENTION POLICIES
name       duration   shardGroupDuration replicaN default
----       --------   ------------------ -------- -------
autogen    0s         168h0m0s           1        true
rp_2_years 17520h0m0s 168h0m0s           1        false
rp_3_years 26208h0m0s 168h0m0s           1        false
  1. Create the first continuous query that will downsample all metric data (meta data will not be taken) older 52 weeks and downsample 1 hour of data into 1 entry writing it into the retention policy named rp_2_years
CREATE CONTINUOUS QUERY "cq_after_1_year" ON "icinga2" BEGIN SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "icinga2"."rp_2_years".:MEASUREMENT FROM "icinga2"."autogen"./.*/ WHERE time < now() -52w GROUP BY time(1h),* END
  1. For the next continuous query we want to use data from the retention policy “rp_2_years”
CREATE CONTINUOUS QUERY "cq_after_2_year" ON "icinga2" BEGIN SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "icinga2"."rp_3_years".:MEASUREMENT FROM "icinga2"."rp_2_years"./.*/ WHERE time < now() -104w GROUP BY time(1d),* END
  1. Check if all continuous are set like we wanted it.
> SHOW CONTINUOUS QUERIES
name: _internal
name query
---- -----

name: icinga2
name            query
----            -----
cq_after_1_year CREATE CONTINUOUS QUERY cq_after_1_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_2_years.:MEASUREMENT FROM icinga2.autogen./.*/ WHERE time < now() - 52w GROUP BY time(1h), * END
cq_after_2_year CREATE CONTINUOUS QUERY cq_after_2_year ON icinga2 BEGIN SELECT mean(value) AS value, mean(crit) AS crit, mean(warn) AS warn INTO icinga2.rp_3_years.:MEASUREMENT FROM icinga2.rp_2_years./.*/ WHERE time < now() - 104w GROUP BY time(1d), * END
  1. Wait for at least 2 days, because second CQ runs only every day once, or more days depending on your amount of data, before you alter the default retention policy to delete all data older 52w and 1 day
ALTER RETENTION POLICY "autogen" ON "icinga2" DURATION 52w1d REPLICATION 1 SHARD DURATION 168h DEFAULT
  1. Check if the “autogen” policy has changed
> show retention policies
name       duration   shardGroupDuration replicaN default
----       --------   ------------------ -------- -------
autogen    8760h0m0s  168h0m0s           1        true
rp_2_years 17520h0m0s 168h0m0s           1        false
rp_3_years 26208h0m0s 168h0m0s           1        false

If you still have questions, just ask here or open a new topic.

Cheers,
Carsten


(Bodo Schulz) #2

big love! :slight_smile:


Retention - how to do it?
(Kevin Honka) #3

(Nicolò Festa) #4

Hi Carsten,
I’m looking for a way to downsampling influxdb points and I agree with your solution, but I have the following doubt.

The only way to prevent data loss is to execute the CQ on a time range less than the RP.
Infact, the WHERE condition of cq_after_1_year excludes the first day of data from the default RP.
Does these queries work as you expect?
I’m not sure because the InfluxDB documantion says:

InfluxDB automatically generates a time range for the cq_query when it executes the CQ. Any user-specified time ranges in the cq_query’s WHERE clause will be ignored by the system.


(Carsten Köbke) #5

They worked in my test. I used 1 day and 2 days as RP. The Documentation is right when you put a time range (start -> endtime), i use only a “older then” where clause.

For the CQ, its not only the first day excluded. Let me explain:
Given is a entry from first january 2017, the timestamp (time) is then 1483225200 and an entry from 01.01.2018 -> 1514761200

timestamp data cq timestamp used
1483225200 entry1 1491911056 yes
1514761200 entry1 1491911056 no

“Now - 52w” timestamp is 1491911056", so the first entry would be used by the CQ because its “older” (<) then “now - 52w”, the second entry would not be usedby the CQ. So the CQ says any data with timestampes “older” then “now - 52w” will be downsampled (or timestamp is smaller then XY).

If you stil have questions, just ask :slight_smile:


(Denny Fuchs) #6

Hi Karsten,

we are now also trying out downsampling and retention, but first for our DB Telegraf, because we running out diskspace (~100 servers with 10sec intervall + Icinga 60sec / 300s) …

Can you take short look if it has no bad / false configuration ? We want to use the same policies / GC for Icinga2DB too :slight_smile:

-- 1 year, resample 1h
CREATE RETENTION POLICY "rp_1_years" ON "telegraf" DURATION 52w1d REPLICATION 1
-- 5 years, resample 1d
CREATE RETENTION POLICY "rp_5_years" ON "telegraf" DURATION 260w REPLICATION 1

-- check everything looks as expected
SHOW RETENTION POLICIES

CREATE CONTINUOUS QUERY "cq_after_1_month" ON "telegraf" BEGIN
SELECT mean(*) -- all numeric field keys will be aggregated
INTO "telegraf"."rp_1_years".:MEASUREMENT -- into same measurement name as source
FROM "telegraf"."autogen"./.*/ -- from all measurements
WHERE time < now() -4w -- everything older than 4w
GROUP BY time(1h), -- resample to 1h granularity
* END

CREATE CONTINUOUS QUERY "cq_after_1_year" ON "telegraf" BEGIN
SELECT mean(*) -- all numeric field keys will be aggregated
INTO "telegraf"."rp_5_years".:MEASUREMENT -- into same measurement name as source
FROM "telegraf"."rp_1_years"./.*/ -- from all measurements
WHERE time < now() -52w -- everything older than 52w
GROUP BY time(1d), -- resample to 1d granularity
* END

-- check everything looks as expected
SHOW CONTINUOUS QUERIES

-- AFTER the top of the next hour, check data in rp_1_years, should have aggregates for everything older than 4w
-- if it looks good, then you can alter DEFAULT retention policy.
-- Why not way a few days? Because the next RP is based on rp_1_years data, not default
-- 1 month, full data granularity
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 4w1d REPLICATION 1 SHARD DURATION 168h DEFAULT

-- check everything looks as expected
SHOW RETENTION POLICIES

(Denny Fuchs) #7

Hi again,

we have also a question: our database(s) has now over 100GB. What happens if the GC run the second time ? The GC copies from the default RT to everything older than 1 month and put its the next RP. What happens, if the next hour it runs, if the source data is not gone ? The source is still the default RP. We thinking, that InfluxDB copies the data each time, the GC runs.

cu denny


(Carsten Köbke) #8

Hi Dennys,

I will take a look this evening.


#9

Thx Carsten.

Denny and I studied your post in detail and came up with an approach for telegraf, however as @linuxmail mentioned, it looks like we need a one-time set of queries to get the data copied into the first non default retention bucket, and then remove the time in the WHERE clause and we should be fine after that. The concern is the current CQ’s will do the same work over and over again.


(Carsten Köbke) #10

Hi @linuxmail & @kylem,

He will take all data that is older then 4weeks from default RP and downsample 1h of entries to 1 entry . Keep in mind that InfluxDB will not save duplicates, so no data will be saved twice. The autogen RP will later delete all data older then 4 weeks + 1 day. So over the time in autogen you will have only data of 4 weeks and 1 day.
And yes your database will grow first before you change your autogen RP.

It just takes time to downsample everything, thats why i wrote you should let it run for some time (maybe 1 days for 100gb, but i have no expierence how long this could take) and then alter the autogen. So data gets downsampled before it gets deleted.

Same is then for the second RP of 5 years, just have a backup and let it run. If things go bad.

For me your QCs and RPs look good (but i dont know telegraf) it should work as designed :slight_smile:

Btw “running out of disk space” with 100GB data? My small QNAP has 24TB if you need some more space :wink:

Regards,
Carsten


(sam) #11

Thanks for the writeup @Carsten!

unfortunately it is no working for me, as influx (v1.6.2) is also downsampling data which is younger than 2 weeks (in my case). Running the following query by hand works as expected:

SELECT mean(value) AS value INTO "test"."agg_1h".:MEASUREMENT FROM "test"."autogen"./.*/ WHERE time < now() -2w GROUP BY time(1h),*

But if I run it as this CQ:

CREATE CONTINUOUS QUERY "cq_agg_1h" ON "test" BEGIN SELECT mean(value) AS value INTO "test"."agg_1h".:MEASUREMENT FROM "test"."autogen"./.*/ WHERE time < now() -2w GROUP BY time(1h),* END

There is also downsampled data which is only 1h old in the test.agg_1h measurement.
Any hints why this is not working for me?

Thanks for your help!


(Carsten Köbke) #12

Hello @sam

do you get any error message when you try to create the CQ?
Can you run this and post it here please

SHOW RETENTION POLICIES
SHOW CONTINUOUS QUERIES 

Regards,
Carsten


(sam) #13

Dear @Carsten,

thanks for your quick reply. Unfortunately I caught a flu and haven’t been able to get the information. Next week my annual holiday is starting. So it’ll be three weeks till I can get back to you. Sorry for that!

All the best
Sam


(sam) #14

Dear @Carsten,

before leaving I added some other CQs and RPs to the “icinga2” database - same result :frowning:

 > SHOW RETENTION POLICIES
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 0s        168h0m0s           1        true
agg_1h  2208h0m0s 168h0m0s           1        false
agg_5h  8736h0m0s 672h0m0s           1        false

> SHOW CONTINUOUS QUERIES
name: icinga2
name      query
----      -----
cq_agg_1h CREATE CONTINUOUS QUERY cq_agg_1h ON icinga2 BEGIN SELECT mean(value) AS value INTO icinga2.agg_1h.:MEASUREMENT FROM icinga2.autogen./.*/ WHERE time < now() - 2w GROUP BY time(1h), * END
cq_agg_5h CREATE CONTINUOUS QUERY cq_agg_5h ON icinga2 BEGIN SELECT mean(value) AS value INTO icinga2.agg_5h.:MEASUREMENT FROM icinga2.agg_1h./.*/ WHERE time < now() - 13w GROUP BY time(5h), * END

Looks correct to me. Thanks for your help!


(Carsten Köbke) #15

Looks also OK for me, but i need to test it, will tkae some time.