Diagnosing late check results and deleting future entries in MySQL

Hi all,

I’ve been noticing issues with late check results for some of my services. I queried the API for the overall status and have pasted the output below. I also queried the API to see how many late check results there were; there happened to be ~350, all of which exist on one of my satellite zones.
I thought there might be a resource issue, but when I look at load and memory usage for that satellite, there is no indication that it is being overloaded.
This problem seems to be relatively recent; my distributed monitoring setup has been in place for a few months now.
Does anybody know what my problem could be?
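
For reference, the status output below came from the /v1/status endpoint; the query looked roughly like this, run on the master (credentials are placeholders):

curl -k -s -u root:PASSWORD 'https://localhost:5665/v1/status' | python -m json.tool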

{
  "results": [{
        "name": "ApiListener",
        "perfdata": [{
            "counter": false,
            "crit": null,
            "label": "api_num_conn_endpoints",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 2.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_endpoints",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 2.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_http_clients",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 1.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_clients",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_relay_queue_item_rate",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 14.716666666666666785,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_relay_queue_items",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_sync_queue_item_rate",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.050000000000000002776,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_sync_queue_items",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_work_queue_count",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 1.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_work_queue_item_rate",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 16.06666666666666643,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_json_rpc_work_queue_items",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "api_num_not_conn_endpoints",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.0,
            "warn": null
          }
        ],
        "status": {
          "api": {
            "conn_endpoints": ["icinga2sat1.DOMAIN", "icinga2sat2.DOMAIN"],
            "http": {
              "clients": 1.0
            },
            "identity": "icinga2master.DOMAIN",
            "json_rpc": {
              "clients": 0.0,
              "relay_queue_item_rate": 14.716666666666666785,
              "relay_queue_items": 0.0,
              "sync_queue_item_rate": 0.050000000000000002776,
              "sync_queue_items": 0.0,
              "work_queue_count": 1.0,
              "work_queue_item_rate": 16.06666666666666643,
              "work_queue_items": 0.0
            },
            "not_conn_endpoints": [],
            "num_conn_endpoints": 2.0,
            "num_endpoints": 2.0,
            "num_not_conn_endpoints": 0.0,
            "zones": {
              "icinga2sat1.DOMAIN": {
                "client_log_lag": 0.0,
                "connected": true,
                "endpoints": ["icinga2sat1.DOMAIN"],
                "parent_zone": "master"
              },
              "icinga2sat2.DOMAIN": {
                "client_log_lag": 0.0,
                "connected": true,
                "endpoints": ["icinga2sat2.DOMAIN"],
                "parent_zone": "master"
              },
              "master": {
                "client_log_lag": 0.0,
                "connected": true,
                "endpoints": ["icinga2master.DOMAIN"],
                "parent_zone": ""
              }
            }
          }
        }
      },
      {
        "name": "CIB",
        "perfdata": [],
        "status": {
          "active_host_checks": 0.5,
          "active_host_checks_15min": 837.0,
          "active_host_checks_1min": 30.0,
          "active_host_checks_5min": 191.0,
          "active_service_checks": 1.6666666666666667407,
          "active_service_checks_15min": 4281.0,
          "active_service_checks_1min": 100.0,
          "active_service_checks_5min": 827.0,
          "avg_execution_time": 0.61404339165476085061,
          "avg_latency": 0.0015682505606240638649,
          "max_execution_time": 60.232937097549438477,
          "max_latency": 0.021145105361938476562,
          "min_execution_time": 0.0,
          "min_latency": 4.07695770263671875e-05,
          "num_hosts_acknowledged": 1.0,
          "num_hosts_down": 3.0,
          "num_hosts_flapping": 0.0,
          "num_hosts_in_downtime": 0.0,
          "num_hosts_pending": 0.0,
          "num_hosts_unreachable": 0.0,
          "num_hosts_up": 125.0,
          "num_services_acknowledged": 18.0,
          "num_services_critical": 12.0,
          "num_services_flapping": 0.0,
          "num_services_in_downtime": 0.0,
          "num_services_ok": 671.0,
          "num_services_pending": 0.0,
          "num_services_unknown": 10.0,
          "num_services_unreachable": 0.0,
          "num_services_warning": 6.0,
          "passive_host_checks": 0.0,
          "passive_host_checks_15min": 0.0,
          "passive_host_checks_1min": 0.0,
          "passive_host_checks_5min": 0.0,
          "passive_service_checks": 0.0,
          "passive_service_checks_15min": 0.0,
          "passive_service_checks_1min": 0.0,
          "passive_service_checks_5min": 0.0,
          "uptime": 2229.8544561862945557
        }
      },
      {
        "name": "CheckResultReader",
        "perfdata": [],
        "status": {
          "checkresultreader": {}
        }
      },
      {
        "name": "CheckerComponent",
        "perfdata": [{
            "counter": false,
            "crit": null,
            "label": "checkercomponent_checker_idle",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 8.0,
            "warn": null
          },
          {
            "counter": false,
            "crit": null,
            "label": "checkercomponent_checker_pending",
            "max": null,
            "min": null,
            "type": "PerfdataValue",
            "unit": "",
            "value": 0.0,
            "warn": null
          }
        ],
        "status": {
          "checkercomponent": {
            "checker": {
              "idle": 8.0,
              "pending": 0.0
            }
          }
        }
      },
      {
        "name": "CompatLogger",
        "perfdata": [],
        "status": {
          "compatlogger": {}
        }
      },
      {
        "name": "ElasticsearchWriter",
        "perfdata": [],
        "status": {
          "elasticsearchwriter": {}
        }
      },
      {
        "name": "ExternalCommandListener",
        "perfdata": [],
        "status": {
          "externalcommandlistener": {
            "command": 1.0
          }
        }
      }, {
        "name": "FileLogger",
        "perfdata": [],
        "status": {
          "filelogger": {
            "main-log": 1.0
          }
        }
      }, {
        "name": "GelfWriter",
        "perfdata": [],
        "status": {
          "gelfwriter": {}
        }
      }, {
        "name": "GraphiteWriter",
        "perfdata": [],
        "status": {
          "graphitewriter": {}
        }
      }, {
        "name": "IcingaApplication",
        "perfdata": [],
        "status": {
          "icingaapplication": {
            "app": {
              "enable_event_handlers": true,
              "enable_flapping": true,
              "enable_host_checks": true,
              "enable_notifications": true,
              "enable_perfdata": true,
              "enable_service_checks": true,
              "node_name": "icinga2master.DOMAIN",
              "pid": 30513.0,
              "program_start": 1518463887.527602911,
              "version": "r2.8.0-1"
            }
          }
        }
      }, {
        "name": "IdoMysqlConnection",
        "perfdata": [{
          "counter": false,
          "crit": null,
          "label": "idomysqlconnection_ido-mysql_queries_rate",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 7.9833333333333333925,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "idomysqlconnection_ido-mysql_queries_1min",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 479.0,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "idomysqlconnection_ido-mysql_queries_5mins",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 2806.0,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "idomysqlconnection_ido-mysql_queries_15mins",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 8595.0,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "idomysqlconnection_ido-mysql_query_queue_items",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 0.0,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "idomysqlconnection_ido-mysql_query_queue_item_rate",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 13.033333333333333215,
          "warn": null
        }],
        "status": {
          "idomysqlconnection": {
            "ido-mysql": {
              "connected": true,
              "instance_name": "default",
              "query_queue_item_rate": 13.033333333333333215,
              "query_queue_items": 0.0,
              "version": "1.14.3"
            }
          }
        }
      }, {
        "name": "InfluxdbWriter",
        "perfdata": [{
          "counter": false,
          "crit": null,
          "label": "influxdbwriter_influxdb_work_queue_items",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 0.0,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "influxdbwriter_influxdb_work_queue_item_rate",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 5.1666666666666669627,
          "warn": null
        }, {
          "counter": false,
          "crit": null,
          "label": "influxdbwriter_influxdb_data_queue_items",
          "max": null,
          "min": null,
          "type": "PerfdataValue",
          "unit": "",
          "value": 91.0,
          "warn": null
        }],
        "status": {
          "influxdbwriter": {
            "influxdb": {
              "data_buffer_items": 91.0,
              "work_queue_item_rate": 5.1666666666666669627,
              "work_queue_items": 0.0
            }
          }
        }
      }, {
        "name": "NotificationComponent",
        "perfdata": [],
        "status": {
          "notificationcomponent": {
            "notification": 1.0
          }
        }
      }, {
        "name": "OpenTsdbWriter",
        "perfdata": [],
        "status": {
          "opentsdbwriter": {}
        }
      }, {
        "name": "PerfdataWriter",
        "perfdata": [],
        "status": {
          "perfdatawriter": {}
        }
      }, {
        "name": "StatusDataWriter",
        "perfdata": [],
        "status": {
          "statusdatawriter": {}
        }
      }, {
        "na

That is not enough (and you don’t say which tools you used to gather these metrics).

  • Check all metrics in top / htop, including irq/sirq. A machine (especially a router) might be 98% idle yet spend a lot of time in IRQs (bad device drivers, faulty or cheap hardware), which shows up as high load values in the output of uptime. Use iotop and iftop in addition.
  • Using the above tools, look for large amounts of check_* processes and find out why they queue up to 350. What makes them slow? (See the example commands just below this list.)
  • Play with the concurrent_checks setting.
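
For the check_* hunt, something simple like this is usually enough (adjust the pattern to your plugin names):

ps -eo pid,etime,cmd | grep '[c]heck_'   # list running check plugins and how long each has been running
ps -eo cmd | grep -c '[c]heck_'          # rough count of plugin processes at this moment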

https://www.icinga.com/docs/icinga2/latest/doc/09-object-types/#checkercomponent
https://www.tummy.com/articles/isolating-heavy-load/


I assume you haven’t looked here yet.

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/
https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#late-check-results-in-distributed-environments

Hi sru and dnsmichi,

I initially used htop to get the load and memory usage metrics; however, I was only looking at the CPU% and MEM% columns rather than the others.

  • There doesn’t seem to be anything about irq/sirq in htop, so I used top to see those. irq stays at 0.0, and sirq occasionally spikes from 0.0 up to 0.2 - 0.5.
  • Executing uptime showed fairly low averages, with the 1-min average staying under 2 (~1.5), the 5-min average staying under 2 (~1.2), and the 15-min average staying under 3 (~2.7).
  • With iotop, no process exceeded 5% I/O load, and the processes that did reach around 5% (mysqld and influxdb) only did so for a couple of seconds at a time.
  • With iftop, the 2-second average was ~200 Kb/s, the 10-second average ~250 Kb/s, and the 40-second average ~250 Kb/s.
  • As for the check_* processes, they appear sporadically in the htop output. I’ve attached a screenshot of htop showing some check_ping processes, since I can’t copy/pipe the output to a file. These processes show up for up to 5 seconds before disappearing.
  • With the concurrent_checks setting, I tried values as high as 1024 and as low as 16, but that didn’t seem to influence anything (see just below this list for a way to check which value is actually in effect).
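
To check which concurrent_checks value is actually configured on the satellite, something along these lines should do (paths can differ per distribution):

grep -R concurrent_checks /etc/icinga2/   # find where the setting is defined
icinga2 daemon -C                         # validate the configuration after changing it
systemctl restart icinga2                 # apply the change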

From what I can tell, most of what you had me check looks average… I compared the results from that server with my “working” instance and they seem pretty similar. The only difference between the two is that the “working” instance has 8 CPU cores (might be overkill) and 4 GB RAM, while the non-working instance has 2 cores and 8 GB RAM.

dnsmichi,
I did look at that part of the docs; that’s where I got the initial count of late check results (~350) and the API status output above. I did look at the docs again, though, and found the query for hosts whose service checks use command endpoints that are not connected. It yielded a strange result: it showed a host under my non-working satellite that doesn’t even use command_endpoint (verified with icinga2 object list --type=host).
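
In case it helps anyone else, the late-check count can be pulled from the console roughly like this (endpoint and credentials are placeholders; the filter follows the lines of what the troubleshooting docs suggest):

icinga2 console --connect 'https://root:PASSWORD@localhost:5665/'
<1> => get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval).len()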

Thanks for the help so far! That article you linked about isolating system load was interesting @sru.

Hi @watermelon,

Did you check whether the icinga user has a limit on “max user processes”, or any other limit that may cause trouble?
I know that icinga2 on SLES 12 also has a limit imposed by systemd …
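
Something like this shows the relevant limits (the service user may be called nagios on Debian-based systems, and systemd property names vary by version):

su -s /bin/bash icinga -c 'ulimit -a'                  # per-user limits as seen by the icinga user
systemctl show icinga2 | grep -i -E 'limit|tasksmax'   # limits applied by the systemd unit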

Regards,
Carsten

Great, but…

With two cores, a load value of 2 means the system is fully loaded.
At exactly 2 no process has to wait for execution, but anything above 2 starts queuing up processes.

  • A load of 1.5 over the last minute I would call pretty normal.
  • A load of 1.2 over 5 minutes means more than 50% combined load across both cores; that seems high to me, since 5 minutes is a long period. I’d guess one core was at 100% and the other at 20%.
  • 2.7 in the 15-minute field? Wow! I mean, if the last 5 minutes averaged only 1.2, how high must the load have been over the preceding 10 minutes to pull the 15-minute average up to 2.7? Roughly (2.7 x 15 - 1.2 x 5) / 10 ≈ 3.5!

I can’t tell what causes this load on your machine, but roughly 3.5 sustained over a 10-minute stretch on a 2-core box really is a heavy load in my book.


Hi sru,

Thanks for the input. I hadn’t realized that 2.7 was actually quite high for the 15-minute field…

I think I have found the root cause of the issue: one of the other virtual machines living on the same host as my Icinga satellite was acting up and hogging the CPU resources of the other VMs, which overloaded Icinga as shown above and caused the late check results. Rebooting the “problem” VM fixed that, and Icinga’s load averages balanced out.

Unfortunately… the problems don’t stop there. A couple of weeks ago our entire host chassis crashed and was down for two weeks because we weren’t able to get the correct parts any sooner. I didn’t realize it until now, but because of this the date and time on my Icinga satellite ended up off by two weeks (into the future!). After we rebooted the “problem” VM mentioned earlier, checks started coming in again as normal. However, my master received those checks as results from March 2nd (two weeks from now).

Please ask for clarification if needed, because I’m unsure how to approach the problem I’m about to describe: I changed the date back to today (Feb 14) and now checks aren’t going through, because the master thinks the last check result is from March 2nd. Since my satellite is on Feb 14th, checks executed NOW look like they are in the PAST (March 2 > Feb 14), so Icinga sees no reason to update a check result received from the “past”. Note: I tested this theory by setting the date to March 3, and check results started going through again, so it seems to be correct.
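
For anyone hitting something similar: the future-dated timestamps can be spotted by querying last_check on the objects API, e.g. (credentials are placeholders):

curl -k -s -u root:PASSWORD 'https://localhost:5665/v1/objects/services?attrs=last_check' | python -m json.tool | less

last_check is a Unix timestamp, so any value larger than the output of date +%s lies in the future.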

Possible solutions I’ve thought of:

  • Roll back the satellite and master to yesterday’s state and set the date to Feb 14 before any checks go through (though no backup of the master was ever taken).
  • Delete the entries for all dates after Feb 14 from the MySQL and InfluxDB databases, but I’m unsure which tables to delete from.
  • Rebuild the master and satellite (save the configs, rebuild the virtual machines, and redeploy the configs on fresh VMs).
  • (Kind of ridiculously) wait until March 2nd for the satellite checks to catch back up.

Does anyone have any other ideas? Or could someone guide me through deleting those entries from the databases? That currently seems like the most feasible option.

Thanks so much.

P.S.
When the checks started going through again, I read these load averages from htop:
[screenshot: htop showing very high load averages]

Insane!

Perhaps this [link] will help you?


Great, thanks @sru!

This solution worked. I ran the commands as follows (from @sru’s solution):

show databases;
use icinga;   # adjust the database name to your IDO database!
delete from icinga_hoststatus where last_check > now();      # remove host status rows stamped in the future
delete from icinga_servicestatus where last_check > now();   # remove service status rows stamped in the future

I also removed the state file (rm /var/lib/icinga2/icinga2.state) on both the satellite and master instances, and everything started working fine again.
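
A quick way to verify nothing future-dated is left (database name as above, credentials are placeholders):

mysql -u root -p icinga -e 'select count(*) from icinga_servicestatus where last_check > now();'   # should print 0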

This also exposed a different problem I didn’t know about: time synchronization in a distributed monitoring setup. I fixed that as well by synchronizing each server to an NTP server.
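
On systemd-based systems, something along these lines does the trick, depending on whether chrony or ntpd is in use:

timedatectl set-ntp true   # enable the system's NTP client
timedatectl status         # confirm the clock is reported as synchronized
chronyc tracking           # or: ntpq -p, to verify the sync source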

Thanks everyone!

Note: I’ll mark this post as the solution since it includes the remediation steps, but all credit goes to sru!