Icinga HA setup pending checks

hi

The Icinga setup architecture:

I have an Icinga 2 HA cluster setup, in which the primary master has the configs in zones.d (the master & other zone directories). The secondary master uses the checker module to sync zones.d from the primary master.
The satellite node has both masters defined in zones.conf.

But the issue I am getting here is that in Icinga Web the check source for the remote nodes' checks is blank and they are shown as Pending, while on the same node some checks are working with their hostname as the check source.

I referred to this link: https://monitoring-portal.org/woltlab/index.php?thread/40804-successful-ping-pending-indefinitely/

and used:

 curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' -d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^disk" }, "attrs": [ "__name", "last_check_result" ] }' | python -m json.tool

The output here is correct, i.e. the checks are working; I could see the last check status as OK.

Then why is Icinga Web 2 showing the checks as Pending? Please suggest.
Below is a screenshot of the issue:
icinga_partial_check_source

Verify that the satellite node

  • has the entire configuration synced (via the REST API, objects should be visible; see the query sketch below)
  • is connected to the master nodes
  • has the clients connected to it

More hints on late check results can be found in the troubleshooting docs.
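
For the first point, a minimal way to verify this on the satellite itself is to list a few synced objects through its local REST API (a sketch, assuming the api feature is enabled there and reusing the root:icinga credentials from the curl example above):

    # run on the satellite: the synced host objects and their zones should show up here
    curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/hosts' -d '{ "attrs": [ "__name", "zone" ] }' | python -m json.tool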

Yes, on the satellite node I can see the config in /var/lib/icinga2/api/zones/satellitehostname/client-nodes.conf for all IPs.

Yes, the satellite is connected; I checked using telnet on port 5665 as well as with icinga2 object list --type service --name satellitehostname, which shows the proper zones.

Yes, the clients are connected, as I can see the check results in the debug log and also on the master by running the curl command to check for late check results.
But here, even though the output shows the last check time and status OK, inside the flapping vars there is reachable: false. Does that have anything to do with the Pending checks?

Set up cluster health checks to ensure that the zones are connected.
https://www.icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#health-checks
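
In addition to the health checks, the ApiListener status on each node tells you whether the cluster endpoints are actually connected (a sketch, again reusing the root:icinga API user from this thread):

    # run on a master and on the satellite; check which endpoints are reported as connected or not connected
    curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener' | python -m json.tool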

Boil it down: set up the API on the satellite and fetch a specific service object which has late check results on the master. Verify that the satellite is actively checking the client node.

  • if that's ok, there's something wrong with the check result sync from the satellite to the master (not connected, connection slow, the master denies the satellite's check results, etc.)
  • if that's not ok, try to manually reschedule a check on the satellite via the /v1/actions/reschedule-check API endpoint (see the sketch below) and watch that in the debug log. Maybe the checker feature is not enabled, or the client does not return a check result on its own.
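
For the manual reschedule, something along these lines should work against the satellite's API (a sketch reusing the root:icinga user; client-host and disk in the filter are placeholders for one of your client services):

    # reschedule a check for a single service right away, then follow it in the satellite's debug log
    curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/reschedule-check' -d '{ "type": "Service", "filter": "host.name==\"client-host\" && service.name==\"disk\"" }' | python -m json.tool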

In any case, since you've mentioned reachable, ensure that there are no dependencies configured which would prevent the checks.
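
Listing the Dependency objects over the API is a quick way to rule that out (a sketch; an empty result set means none are configured):

    # reachable: false usually points at a parent object that is down or not yet checked
    curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/dependencies' -d '{ "attrs": [ "__name", "parent_host_name", "child_host_name" ] }' | python -m json.tool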


Hi @dnsmichi,

I followed the steps but still the same issue.

Actually, what I have found so far is that the satellite is somehow checking the client nodes, i.e. out of the 128 nodes configured the satellite monitors all of them, but gives errors for some nodes, possibly because of certs, port 5665, etc. That can be solved.

But now the MAJOR issue here is that both master nodes themselves are not showing all metrics as OK. Referring to the screenshot: MGA022 is my satellite, which despite being connected on port 5665 and having a correct zones.conf is not reflected in the UI. Also, the cluster check, which is defined on both masters, is Pending. Output of this command:

 icinga2 object list --type service --name cluster
 		Object 'XXXXXXXXX-MGA020!cluster' of type 'Service':
	  % declared in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf', lines 31:1-31:24
	  * __name = "XXXXXXXXX-MGA020!cluster"
	  * action_url = ""
	  * check_command = "cluster"
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf', lines 32:3-32:27
	  * check_interval = 5
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf', lines 33:3-33:21
	  * check_period = ""
	  * check_timeout = null
	  * command_endpoint = ""
	  * display_name = "cluster"
	  * enable_active_checks = true
	  * enable_event_handler = true
	  * enable_flapping = false
	  * enable_notifications = true
	  * enable_passive_checks = true
	  * enable_perfdata = true
	  * event_command = ""
	  * flapping_threshold = 0
	  * flapping_threshold_high = 30
	  * flapping_threshold_low = 25
	  * groups = [ ]
	  * host_name = "XXXXXXXXX-MGA020"
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf', lines 36:3-36:33
	  * icon_image = ""
	  * icon_image_alt = ""
	  * max_check_attempts = 3
	  * name = "cluster"
	  * notes = ""
	  * notes_url = ""
	  * package = "_etc"
	  * retry_interval = 1
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf', lines 34:3-34:21
	  * source_location
		* first_column = 1
		* first_line = 31
		* last_column = 24
		* last_line = 31
		* path = "/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf"
	  * templates = [ "cluster" ]
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA020.conf', lines 31:1-31:24
	  * type = "Service"
	  * vars = null
	  * volatile = false
	  * zone = "master"

	Object 'XXXXXXXXX-MGA021!cluster' of type 'Service':
	  % declared in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf', lines 31:1-31:24
	  * __name = "XXXXXXXXX-MGA021!cluster"
	  * action_url = ""
	  * check_command = "cluster"
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf', lines 32:3-32:27
	  * check_interval = 5
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf', lines 33:3-33:21
	  * check_period = ""
	  * check_timeout = null
	  * command_endpoint = ""
	  * display_name = "cluster"
	  * enable_active_checks = true
	  * enable_event_handler = true
	  * enable_flapping = false
	  * enable_notifications = true
	  * enable_passive_checks = true
	  * enable_perfdata = true
	  * event_command = ""
	  * flapping_threshold = 0
	  * flapping_threshold_high = 30
	  * flapping_threshold_low = 25
	  * groups = [ ]
	  * host_name = "XXXXXXXXX-MGA021"
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf', lines 36:3-36:37
	  * icon_image = ""
	  * icon_image_alt = ""
	  * max_check_attempts = 3
	  * name = "cluster"
	  * notes = ""
	  * notes_url = ""
	  * package = "_etc"
	  * retry_interval = 1
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf', lines 34:3-34:21
	  * source_location
		* first_column = 1
		* first_line = 31
		* last_column = 24
		* last_line = 31
		* path = "/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf"
	  * templates = [ "cluster" ]
		% = modified in '/etc/icinga2/zones.d/master/XXXXXXXXX-MGA021.conf', lines 31:1-31:24
	  * type = "Service"
	  * vars = null
	  * volatile = false
	  * zone = "master"

(screenshot)

To troubleshoot further, I removed the satellite zone's client-nodes folder so that I can first see the configs of only Icinga Web and the satellite node.
And you were correct, the output for the pending checks is null. Can you please suggest where the config could be wrong?
In the output below, both null results belong to pending checks. I tried rm -rf /var/lib/icinga2/api and restarted icinga2, but it did not help.

Output:
		XXXXXXXXX-MGA020 master]#  curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' -d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^users" }, "attrs": [ "__name", "last_check_result" ] }' | python -m json.tool
		{
			"results": [
				{
					"attrs": {
						"__name": "XXXXXXXXX-MGA021!users",
						"last_check_result": null
					},
					"joins": {},
					"meta": {},
					"name": "XXXXXXXXX-MGA021!users",
					"type": "Service"
				},
				{
					"attrs": {
						"__name": "XXXXXXXXX-MGA020!users",
						"last_check_result": null
					},
					"joins": {},
					"meta": {},
					"name": "XXXXXXXXX-MGA020!users",
					"type": "Service"
				},
				{
					"attrs": {
						"__name": "XXXXXXXXX-MGA022!users",
						"last_check_result": {
							"active": true,
							"check_source": "XXXXXXXXX-MGA021",
							"command": [
								"/usr/lib64/nagios/plugins/check_users",
								"-c",
								"50",
								"-w",
								"20"
							],
							"execution_end": 1530718822.35236,
							"execution_start": 1530718822.347281,
							"exit_status": 0.0,
							"output": "USERS OK - 2 users currently logged in ",
							"performance_data": [
								"users=2;20;50;0"
							],
							"schedule_end": 1530718822.352463,
							"schedule_start": 1530718822.346828,
							"state": 0.0,
							"type": "CheckResult",
							"vars_after": {
								"attempt": 1.0,
								"reachable": true,
								"state": 0.0,
								"state_type": 1.0
							},
							"vars_before": {
								"attempt": 1.0,
								"reachable": true,
								"state": 0.0,
								"state_type": 1.0
							}
						}
					},
					"joins": {},
					"meta": {},
					"name": "XXXXXXXXX-MGA022!users",
					"type": "Service"
				}
			]
		}

I would actually fix that before doing anything else, or disable the clients' attempts to connect to the satellites. This likely influences the problem and puts more chaos into your setup and, specifically, into the logs you have to look into.

In terms of pending checks, also extract the full runtime state and debug further with last_check and next_check - you can do so by removing the attrs key in your request body.

Maybe the object is paused and the master node doesn't feel responsible for it. Check that on the secondary master as well. There's a debug console hint in the troubleshooting docs for late check results within distributed environments; make sure to check that as well.

https://www.icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#late-check-results-in-distributed-environments
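
Concretely, a query along these lines (a sketch reusing the credentials and regex filter from your earlier curl, not the exact query from the docs) shows the scheduling state and the paused flag for the affected services:

    # last_check stays null for pending checks, next_check is when the scheduler plans to run them,
    # and paused = true means this node is not responsible for the object in the HA zone
    curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' -d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^users" }, "attrs": [ "__name", "last_check", "next_check", "paused" ] }' | python -m json.tool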


I changed the satellite's zones.conf so that it actively connects to the primary master, then checked and reached three conclusions:

  1. The objects on the primary master, from where the checks are executed, are NOT paused:

    <5> => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
    {
     @false = 27.000000
    }
    
  2. Removed the attrs key and checked the result: last_check is null while next_check is 1530718707.605208.

  3. The MySQL cluster is running and connected, but the output of curl -k -s -u root:icinga 'https://localhost:5665/v1/status' | python -m json.tool | less
    shows ido-mysql { "connected": false }:

        "name": "CIB",
         "perfdata": [],
         "status": {
             "active_host_checks": 0.03333333333333333,
             "active_host_checks_15min": 30.0,
             "active_host_checks_1min": 2.0,
             "active_host_checks_5min": 10.0,
             "active_service_checks": 0.2833333333333333,
             "active_service_checks_15min": 245.0,
             "active_service_checks_1min": 17.0,
             "active_service_checks_5min": 82.0,
             "avg_execution_time": 0.754389762878418,
             "avg_latency": 0.0008933279249403211,
             "max_execution_time": 9.427098989486694,
             "max_latency": 0.0016641616821289062,
             "min_execution_time": 0.003062009811401367,
             "min_latency": 0.0003197193145751953,
             "num_hosts_acknowledged": 0.0,
     	"num_hosts_down": 1.0,
             "num_hosts_flapping": 0.0,
             "num_hosts_in_downtime": 0.0,
     	"num_hosts_pending": 1.0,
             "num_hosts_unreachable": 0.0,
         "num_hosts_up": 2.0,
             "num_services_acknowledged": 0.0,
             "num_services_critical": 1.0,
             "num_services_flapping": 0.0,
             "num_services_in_downtime": 0.0,
         "num_services_ok": 11.0,
         "num_services_pending": 27.0,
         "num_services_unknown": 33.0,
             "num_services_unreachable": 0.0,
             "num_services_warning": 0.0,
             "passive_host_checks": 0.0,
             "passive_host_checks_15min": 0.0,
             "passive_host_checks_1min": 0.0,
             "passive_host_checks_5min": 0.0,
             "passive_service_checks": 0.0,
             "passive_service_checks_15min": 0.0,
             "passive_service_checks_1min": 0.0,
             "passive_service_checks_5min": 0.0,
             "uptime": 61263.725548028946
         }
     },
     
         "status": {
             "idomysqlconnection": {
                 "ido-mysql": {
                     "connected": false,
                     "instance_name": "default",
                     "query_queue_item_rate": 1.2333333333333334,
                     "query_queue_items": 0.0,
                     "version": ""
                 }
             }
    

Whereas the debug log clearly shows it updating the DB:

[2018-07-05 17:12:11 +0530] notice/DbConnection: Updating programstatus table.
[2018-07-05 17:12:15 +0530] debug/DbEvents: add checkable check history for 'XXXX-MGA022!cpu'
[2018-07-05 17:12:18 +0530] debug/DbEvents: add checkable check history for 'XXXX-MGA022!icingaport5665'

This is a unix timestamp you can compare to your current time, or reformat it with date -r 1530718707. Find out more about scheduled checks and why they’re not executed (debug log).
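
On GNU/Linux coreutils date, the equivalent epoch conversion uses -d (there the -r flag reads a file's timestamp instead):

    # render the next_check epoch value as a readable timestamp (GNU date)
    date -d @1530718707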

Put aside any other findings and focus on why the checks are late. The IDO backend isn't important here, nor is Icinga Web 2.

Thanks @dnsmichi. Isolating the nodes really helped.

Actually, the issue was that the satellite and one of the MySQL cluster nodes had the same hostname, hence the client nodes configured to send metric results to the satellite were split-brained between the satellite and the MySQL node.

Resolution: changing the node name of the MySQL node and restarting the MySQL service worked. I had to create new certs and a key for that.

Now, after renaming the hostname, when adding the satellite zone directory (which contains all the child node information), no hosts are shown in Icinga Web.

I referred to this link:

I can see in the icinga_objects table that is_active=0 for all the child nodes, but the child nodes are pushing data to the satellite, as I can see in the debug log.
Also, there are no issues with the MySQL service.
If I add a node directly in the master zone, it starts showing in Icinga Web.
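
For reference, the check against the IDO database looks roughly like this (a sketch; the icinga database name and user are placeholders for the actual IDO credentials):

    # list IDO objects currently flagged as inactive
    mysql -u icinga -p icinga -e 'SELECT objecttype_id, name1, name2, is_active FROM icinga_objects WHERE is_active = 0 LIMIT 20;'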

Any suggestions?