Missing performance data

  • Hi All,



    Currently we are migrating from Nagios to Icinga 2 with Icinga Web 2 and Grafana.

    On one of our hosts we have a few Docker containers running, but their network interfaces do not seem to end up in Grafana.


    On the machine we get the data from (via NRPE):

    Code
    1. /usr/lib/nagios/plugins/check_iftraffic_nrpe.sh -D lo -p -a -s 1000 -w 95 -i veth
    2. OK: Stats: veth4cbf651(723K/30K) veth7a0688f(26K/3K) vethae15efa(311/49) veth2702962(1K/66) veth463053a(111K/5K) veth3d75744(9K/2K) vethabcc38c(0/0) veth0427cd3(294K/224K) vetha8fded5(0/0) veth07fc40a(20/26) veth2e06c7e(285/30) veth1e4d50d(335/340) veth312b8bc(0/0) veth43eaf9f(5/1) vetheed9990(127/21) vethdbfffcb(0/0) vethf713819(6K/4K) veth51efa65(0/0) vethe2a89b8(0/0) veth09b5153(24/12) veth592447f(0/0) (in/out in bytes/s) | in-veth4cbf651=723869 out-veth4cbf651=30790 in-veth7a0688f=26746 out-veth7a0688f=3443 in-vethae15efa=311 out-vethae15efa=49 in-veth2702962=1346 out-veth2702962=66 in-veth463053a=111410 out-veth463053a=5131 in-veth3d75744=9454 out-veth3d75744=2309 in-vethabcc38c=0 out-vethabcc38c=0 in-veth0427cd3=294455 out-veth0427cd3=224088 in-vetha8fded5=0 out-vetha8fded5=0 in-veth07fc40a=20 out-veth07fc40a=26 in-veth2e06c7e=285 out-veth2e06c7e=30 in-veth1e4d50d=335 out-veth1e4d50d=340 in-veth312b8bc=0 out-veth312b8bc=0 in-veth43eaf9f=5 out-veth43eaf9f=1 in-vetheed9990=127 out-vetheed9990=21 in-vethdbfffcb=0 out-vethdbfffcb=0 in-vethf713819=6862 out-vethf713819=4533 in-veth51efa65=0 out-veth51efa65=0 in-vethe2a89b8=0 out-vethe2a89b8=0 in-veth09b5153=24 out-veth09b5153=12 in-veth592447f=0 out-veth592447f=0

    Here you can still find the items that later go missing: vethe2a89b8, veth592447f, veth09b5153, veth51efa65


    In the Icinga Web 2 interface you can find them as well:

    Code
    1. Plugin Output
    2. OK: Stats: veth4cbf651(559K/49K) veth7a0688f(34K/1K) vethae15efa(125/51) veth2702962(0/0) veth463053a(0/0) veth3d75744(0/0) vethabcc38c(0/0) veth0427cd3(379K/111K) vetha8fded5(0/0) veth07fc40a(0/0) veth2e06c7e(3K/242) veth1e4d50d(26/18) veth312b8bc(0/0) veth43eaf9f(0/0) vetheed9990(0/0) vethdbfffcb(0/0) vethf713819(9K/7K) veth51efa65(0/0) vethe2a89b8(5/8) veth09b5153(0/0) veth592447f(0/0) (in/out in bytes/s)


    However, over in the performance data the items are missing. :(


    Is there a limit or setting that we have missed and need to adjust?



    Much appreciated in advance!


    William.

  • Hi Wolfgang,


    Yes, I know how that construction works.


    And if I split the two, you can see that the missing items are also behind the '|' and thus should be sent to Icinga 2:

    Code
    1. OK: Stats: veth4cbf651(723K/30K) veth7a0688f(26K/3K) vethae15efa(311/49) veth2702962(1K/66) veth463053a(111K/5K) veth3d75744(9K/2K) vethabcc38c(0/0) veth0427cd3(294K/224K) vetha8fded5(0/0) veth07fc40a(20/26) veth2e06c7e(285/30) veth1e4d50d(335/340) veth312b8bc(0/0) veth43eaf9f(5/1) vetheed9990(127/21) vethdbfffcb(0/0) vethf713819(6K/4K) veth51efa65(0/0) vethe2a89b8(0/0) veth09b5153(24/12) veth592447f(0/0) (in/out in bytes/s)

    |

    Code
    1. in-veth4cbf651=723869 out-veth4cbf651=30790 in-veth7a0688f=26746 out-veth7a0688f=3443 in-vethae15efa=311 out-vethae15efa=49 in-veth2702962=1346 out-veth2702962=66 in-veth463053a=111410 out-veth463053a=5131 in-veth3d75744=9454 out-veth3d75744=2309 in-vethabcc38c=0 out-vethabcc38c=0 in-veth0427cd3=294455 out-veth0427cd3=224088 in-vetha8fded5=0 out-vetha8fded5=0 in-veth07fc40a=20 out-veth07fc40a=26 in-veth2e06c7e=285 out-veth2e06c7e=30 in-veth1e4d50d=335 out-veth1e4d50d=340 in-veth312b8bc=0 out-veth312b8bc=0 in-veth43eaf9f=5 out-veth43eaf9f=1 in-vetheed9990=127 out-vetheed9990=21 in-vethdbfffcb=0 out-vethdbfffcb=0 in-vethf713819=6862 out-vethf713819=4533 in-veth51efa65=0 out-veth51efa65=0 in-vethe2a89b8=0 out-vethe2a89b8=0 in-veth09b5153=24 out-veth09b5153=12 in-veth592447f=0 out-veth592447f=0

    If you search for vethe2a89b8 as an example, you can see it is behind the pipe as well:


    Code
    1. in-vethe2a89b8=0 out-vethe2a89b8=0
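
    For reference, this is the general plugin output convention that Nagios-style plugins (and therefore Icinga 2) use: everything before the first '|' is the human-readable status text, everything after it is parsed as performance data, where only the label and the value are mandatory:

    Code
    1. STATUS TEXT | 'label'=value[UOM];[warn];[crit];[min];[max] 'label2'=value2 ...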


    But if you look at the perf data in the web interface or in Grafana, you cannot find them:

    I was missing more items from this list, but I separated the normal devices (eth0, eth1, eth2, eth3 and lo) from this check and added those in a separate check.

    After that, fewer "veth" devices were missing, which indicates that there must be a limit somewhere that I am hitting.

  • InfluxDB is currently installed at the latest version. Nothing much is going on there:

  • After enabling the debug log and forcing a check on the service, I get the following:

    Code
    1. [2017-08-10 17:26:09 +0200] debug/IdoMysqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = '0', active_checks_enabled = '1', check_command = 'check_nrpe', check_source = 'HOSTNAME', check_type = '0', current_check_attempt = '1', current_notification_number = '0', current_state = '0', endpoint_object_id = 235, event_handler = '', event_handler_enabled = '1', execution_time = '0.635247', flap_detection_enabled = '0', has_been_checked = '1', instance_id = 1, is_flapping = '0', is_reachable = '1', last_check = FROM_UNIXTIME(1502378769), last_hard_state = '0', last_hard_state_change = FROM_UNIXTIME(1502378739), last_notification = FROM_UNIXTIME(1502357273), last_state_change = FROM_UNIXTIME(1502378739), last_time_ok = FROM_UNIXTIME(1502378769), last_time_unknown = FROM_UNIXTIME(1502378714), last_time_warning = FROM_UNIXTIME(1502378663), latency = '0.000670',
    2. long_output = '', max_check_attempts = '1', next_check = FROM_UNIXTIME(1502378798), next_notification = FROM_UNIXTIME(1502378744), normal_check_interval = '0.500000', notifications_enabled = '0', original_attributes = '{\"enable_notifications\":true}',
    3. output = 'OK: Stats: veth4cbf651(5K/5K) veth7a0688f(3K/3K) vethae15efa(0/0) veth2702962(0/0) veth463053a(0/0) veth3d75744(8K/6K) vethabcc38c(0/0) veth0427cd3(90K/18K) vetha8fded5(0/0) veth07fc40a(107/136) veth2e06c7e(1/1) veth1e4d50d(68/217) veth312b8bc(0/0) veth43eaf9f(0/0) vetheed9990(0/0) vethdbfffcb(0/0) vethf713819(5K/4K) veth51efa65(0/0) vethe2a89b8(0/0) veth09b5153(0/0) veth592447f(0/0) (in/out in bytes/s) ', passive_checks_enabled = '1', percent_state_change = '7', perfdata = 'in-veth4cbf651=5758 out-veth4cbf651=5245 in-veth7a0688f=3323 out-veth7a0688f=3413 in-vethae15efa=0 out-vethae15efa=0 in-veth2702962=0 out-veth2702962=0 in-veth463053a=0 out-veth463053a=0 in-veth3d75744=8324 out-veth3d75744=6301 in-vethabcc38c=0 out-vethabcc38c=0 in-veth0427cd3=90134 out-veth0427cd3=18873 in-vetha8fded5=0 out-vetha8fded5=0 in-veth07fc40a=107 out-veth07fc40a=136 in-veth2e06c7e=1 out-veth2e06c7e=1 in-veth1e4d50d=68 out-veth1e4d50d=217 in-veth312b8bc=0 out-veth312b8bc=0 in-veth43eaf9f=0 out-veth43eaf9f=0 in-vetheed9990=0 out-vetheed9990=0 in-vethdbfffcb=0 out-vethdbfffcb=0 in-vethf713819=5893',
    4. problem_has_been_acknowledged = '0', process_performance_data = '1', retry_check_interval = '0.500000', scheduled_downtime_depth = '0', service_object_id = 1731, should_be_scheduled = '1', state_type = '1', status_update_time = FROM_UNIXTIME(1502378769) WHERE service_object_id = 1731

    I have added a few line breaks for readability.

    Here vethe2a89b8 (and the others as well) is present.

    Does this mean that the problem lies within Grafana?

  • I did a little digging in the InfluxDB database (1000 rows, to be sure):

    SQL
    1. https://jpst.it/12X-c

    Unfortunately the network interfaces are still missing here. So I looked back at the debug log and saw that I had overlooked some parts:


    Code
    1. https://jpst.it/12X-y

    Here you can see the missing items in the output of the command, but not in the:

    Code
    1. [2017-08-10 17:25:39 +0200] debug/InfluxdbWriter: Add to metric list:

    So for some reason the interfaces that I am missing are not sent to InfluxDB, and I cannot seem to find an error about it.
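
    For reference, one way to check directly in InfluxDB which metrics have actually been stored for this check is to list the values of the "metric" tag. The measurement name check_nrpe and the metric tag are what the Icinga 2 InfluxdbWriter uses here; the database name "icinga2" is an assumption, adjust it to your setup:

    Code
    1. # list every metric tag value that exists for the check_nrpe measurement;
    2. # the missing interfaces should show up here if they ever reached InfluxDB
    3. influx -database icinga2 -execute 'SHOW TAG VALUES FROM "check_nrpe" WITH KEY = "metric"'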

  • Hmm, works for me. I wrote a real quick plugin just to echo your perfdata.


    Shell-Script
    1. #!/bin/bash
    2. echo "OK: Test | in-veth4cbf651=723869 out-veth4cbf651=30790 in-veth7a0688f=26746 out-veth7a0688f=3443 in-vethae15efa=311 out-vethae15efa=49 in-veth2702962=1346 out-veth2702962=66 in-veth463053a=111410 out-veth463053a=5131 in-veth3d75744=9454 out-veth3d75744=2309 in-vethabcc38c=0 out-vethabcc38c=0 in-veth0427cd3=294455 out-veth0427cd3=224088 in-vetha8fded5=0 out-vetha8fded5=0 in-veth07fc40a=20 out-veth07fc40a=26 in-veth2e06c7e=285 out-veth2e06c7e=30 in-veth1e4d50d=335 out-veth1e4d50d=340 in-veth312b8bc=0 out-veth312b8bc=0 in-veth43eaf9f=5 out-veth43eaf9f=1 in-vetheed9990=127 out-vetheed9990=21 in-vethdbfffcb=0 out-vethdbfffcb=0 in-vethf713819=6862 out-vethf713819=4533 in-veth51efa65=0 out-veth51efa65=0 in-vethe2a89b8=0 out-vethe2a89b8=0 in-veth09b5153=24 out-veth09b5153=12 in-veth592447f=0 out-veth592447f=0"
    3. exit 0


    And I see it in InfluxDB/Grafana.
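
    In case you want to reproduce the test on your side, a minimal sketch for wiring such a dummy plugin into Icinga 2 could look like this (the script path, the object names and the example host are placeholders, not your actual configuration):

    Code
    1. object CheckCommand "test-veth-perfdata" {
    2.   // placeholder path to a script like the one shown above
    3.   command = [ "/usr/local/bin/test_veth_perfdata.sh" ]
    4. }
    5. object Service "test veth perfdata" {
    6.   host_name      = "DOCKER-www01"        // example host name from this thread
    7.   check_command  = "test-veth-perfdata"
    8.   check_interval = 1m
    9. }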






    I think something is wrong with your InfluxDB setup. Please provide the output of these commands:

    Code
    1. icinga2 --version

    Code
    1. influx --version

    Code
    1. curl -G 'http://192.168.200.8:8086/query?pretty=true' -u genericuser:someramdonpasswordthatichanged --data-urlencode "db=genericdb" --data-urlencode "q=SELECT * FROM check_nrpe WHERE metric =~ /^in-vethe2a89b8$/ LIMIT 1"


  • Hi, thank you for your test!

    Great to see that it can work and that this is probably a PEBKAC. ;)


    Icinga version:

    InfluxDB version:

    Code
    1. InfluxDB shell version: 1.3.2

    The curl statement (for vethe2a89b8, veth592447f and veth09b5153):

    Code
    1. {
    2.     "results": [
    3.         {
    4.             "statement_id": 0
    5.         }
    6.     ]
    7. }

    Curl statement 2 (for a working one):

    So somehow it is still hitting a limit. If you look at the list from post 1, the ones that are missing are the last few metrics:

    Code
    1. OK: Stats: veth4cbf651(723K/30K) veth7a0688f(26K/3K) vethae15efa(311/49) veth2702962(1K/66) veth463053a(111K/5K) veth3d75744(9K/2K) vethabcc38c(0/0) veth0427cd3(294K/224K) vetha8fded5(0/0) veth07fc40a(20/26) veth2e06c7e(285/30) veth1e4d50d(335/340) veth312b8bc(0/0) veth43eaf9f(5/1) vetheed9990(127/21) vethdbfffcb(0/0) vethf713819(6K/4K) veth51efa65(0/0) vethe2a89b8(0/0) veth09b5153(24/12) veth592447f(0/0) (in/out in bytes/s) | in-veth4cbf651=723869 out-veth4cbf651=30790 in-veth7a0688f=26746 out-veth7a0688f=3443 in-vethae15efa=311 out-vethae15efa=49 in-veth2702962=1346 out-veth2702962=66 in-veth463053a=111410 out-veth463053a=5131 in-veth3d75744=9454 out-veth3d75744=2309 in-vethabcc38c=0 out-vethabcc38c=0 in-veth0427cd3=294455 out-veth0427cd3=224088 in-vetha8fded5=0 out-vetha8fded5=0 in-veth07fc40a=20 out-veth07fc40a=26 in-veth2e06c7e=285 out-veth2e06c7e=30 in-veth1e4d50d=335 out-veth1e4d50d=340 in-veth312b8bc=0 out-veth312b8bc=0 in-veth43eaf9f=5 out-veth43eaf9f=1 in-vetheed9990=127 out-vetheed9990=21 in-vethdbfffcb=0 out-vethdbfffcb=0 in-vethf713819=6862 out-vethf713819=4533 in-veth51efa65=0 out-veth51efa65=0
    2. (from here on they don't work any longer)
    3. in-vethe2a89b8=0 out-vethe2a89b8=0 in-veth09b5153=24 out-veth09b5153=12 in-veth592447f=0 out-veth592447f=0

    The entire list is 1233 characters long, and without the items that go missing it is 1126 characters long.
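
    For completeness, such lengths can be checked quickly on the shell; the perfdata string is abbreviated here and would be pasted in full:

    Code
    1. # count the bytes of the perfdata part only (everything after the '|')
    2. echo -n 'in-veth4cbf651=723869 out-veth4cbf651=30790 ... in-veth592447f=0 out-veth592447f=0' | wc -c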

  • Hi,


    Of course I see what the results will be, but your list is different from mine, as you have:

    Code
    1. OK: Test

    before the "|"

    and I have:

    Code
    1. OK: Stats: veth4cbf651(723K/30K) veth7a0688f(26K/3K) vethae15efa(311/49) veth2702962(1K/66) veth463053a(111K/5K) veth3d75744(9K/2K) vethabcc38c(0/0) veth0427cd3(294K/224K) vetha8fded5(0/0) veth07fc40a(20/26) veth2e06c7e(285/30) veth1e4d50d(335/340) veth312b8bc(0/0) veth43eaf9f(5/1) vetheed9990(127/21) vethdbfffcb(0/0) vethf713819(6K/4K) veth51efa65(0/0) vethe2a89b8(0/0) veth09b5153(24/12) veth592447f(0/0) (in/out in bytes/s)

    which saves 420 characters, bringing the total from 1233 down to 813.

  • Hm, does InfluxDB have a message limit in characters? The messages are sent in bulk, which means that multiple metrics are collected and then flushed (the flush_* settings allow you to control that).
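
    For reference, those knobs live in the InfluxdbWriter object on the Icinga 2 side; a sketch with values from this thread (host, port and database taken from the curl example above, the flush values shown are merely the illustrative defaults):

    Code
    1. object InfluxdbWriter "influxdb" {
    2.   host            = "192.168.200.8"
    3.   port            = 8086
    4.   database        = "genericdb"
    5.   flush_threshold = 1024   // data points to buffer before forcing a flush
    6.   flush_interval  = 10s    // maximum time to buffer before flushing
    7. }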


    You could put tcpdump in the middle and extract which messages are sent, and whether they are perhaps truncated by InfluxDB (though that would probably generate a log warning on their side; I don't know).
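
    A capture along those lines could look like this (the InfluxDB address is taken from the curl example earlier in the thread; adjust the interface if "any" is not available on your system):

    Code
    1. # print the HTTP payloads that the InfluxdbWriter sends to the InfluxDB API
    2. tcpdump -i any -s 0 -A 'dst host 192.168.200.8 and tcp dst port 8086'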

  • Instead of solving this weird mystery, I worked around it by having 3 checks:


    1 without veth devices, but with the physical cards

    1 with veth[0-9], without the physical cards

    1 with veth[a-z], without the physical cards


    Far from ideal, because in an ideal world you want one check for it all. But at least now I can find all interfaces in Grafana.

    Luckily we don't expect a lot of servers to have this problem, so it won't expand into having 200 extra checks.
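
    For illustration, the two veth checks of that split could be defined as NRPE commands roughly like this (the command names are placeholders; the flags are copied from the first post, and it is assumed that -i accepts the interface name patterns described above):

    Code
    1. command[check_iftraffic_veth_num]=/usr/lib/nagios/plugins/check_iftraffic_nrpe.sh -D lo -p -a -s 1000 -w 95 -i 'veth[0-9]'
    2. command[check_iftraffic_veth_alpha]=/usr/lib/nagios/plugins/check_iftraffic_nrpe.sh -D lo -p -a -s 1000 -w 95 -i 'veth[a-z]'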

  • I installed InfluxDB with a plain apt-get install, so I would assume there is no such limit configured on the InfluxDB side.

    Also, if you look at the Icinga log, you can clearly see that it sends everything to InfluxDB with one query per interface:

    Code
    1. Add to metric list: 'check_nrpe,hostname=DOCKER-www01,service=check\ network\ usage\ docker,metric=in-veth07fc40a value=0 1502378739'.

    So it seems that Icinga is not adding or sending all of it. Either way, I have a workaround for now; let's hope I won't hit this limitation any more.
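
    For what it is worth, one way to rule out a limit on the InfluxDB side would be to post one of the "missing" points to the write endpoint by hand and then query it back. The line below reuses the address, user and database from the earlier curl example (the password is a placeholder) and the tags from the debug log:

    Code
    1. # write a single data point for one of the missing interfaces using the InfluxDB line protocol
    2. curl -i -XPOST 'http://192.168.200.8:8086/write?db=genericdb' -u genericuser:password \
    3.   --data-binary 'check_nrpe,hostname=DOCKER-www01,service=check\ network\ usage\ docker,metric=in-vethe2a89b8 value=0'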

  • "add to metric list" just says that the single parsed performance data key/value pair is added to the string buffer. The final message which is sent is not logged again - that's why I wanted you to look into the network traffic with tcp dump and extract the sent message.


    NRPE is a good call. Before investigating there, I'd suggest fetching the "last_check_result" for that service via the Icinga 2 REST API and checking whether all performance data keys are available there. If not, the plugin provided garbage.
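
    A call along these lines would do that; the host and service names are taken from the debug log above, while the API user, password and port are placeholders for your own API setup:

    Code
    1. # fetch the last check result (including the parsed performance data) via the Icinga 2 REST API
    2. curl -k -s -u apiuser:apipassword -H 'Accept: application/json' \
    3.   'https://localhost:5665/v1/objects/services/DOCKER-www01!check%20network%20usage%20docker?attrs=last_check_result' \
    4.   | python -m json.tool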

  • If I put the data into a file and look at its byte size, it is 1126 bytes of data that are being sent to the Grafana server.

    So the NRPE limitation is an interesting one, but I suspect it is not entirely correct, as that limit would include everything before and after the "|".


    I would rather avoid SNMP, and was hoping that something we already use, like NRPE, would suffice for everything.


    I will have a look at the tcpdump output and see what is there and how many bytes it actually wants to send!

  • You can also install NRPE 3; that version does not have the limitation. But keep in mind that you need to install it on both sides.

    But why not SNMP? It's easy to manage, and you can also restrict access. From my point of view, it's not only about how many bytes are going in and out; I also want to know whether packets are dropped or discarded, the load on the interface, etc.