Host disappearing intermittently


#1

So this is a weird one.

I’m in the process of automating satellite deployment with Ansible. So far my playbook does everything right, but I’ve run into a weird issue and I’m at a loss.

I provisioned a satellite with the playbook. Icinga2 reloads successfully on both the satellite and the master, the connection direction is set to master > satellite only, and both nodes are running the same versions of Icinga2 and Debian. I can see a successful TLS handshake and zone sync in the logs, and I can verify with netstat that the satellite is connected to the master. I’ve even gone as far as logging into MySQL and making sure data for it exists in the IDO.
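For what it’s worth, the connection direction is just a matter of which Endpoint objects carry a host attribute. Roughly, the generated config looks like this (hostnames are placeholders, trimmed for brevity):

// On the master: the satellite endpoint has "host" set, so the master dials out
object Endpoint "satellite1.example.org" {
  host = "satellite1.example.org"
}

object Zone "satellite1" {
  endpoints = [ "satellite1.example.org" ]
  parent = "master"
}

// On the satellite: the master endpoint has no "host" attribute,
// so the satellite only accepts the incoming connection
object Endpoint "master1.example.org" { }

object Zone "master" {
  endpoints = [ "master1.example.org" ]
}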

But, for some reason, the host disappears from Icinga Web 2. If I restart Icinga2 on either node, it usually comes back for 5-15 minutes, then disappears again. While it’s present I can view its services, inspect it, and do everything else you would with a normal host. Then it suddenly disappears.

Occasionally it comes back on its own. Then disappears again.

There doesn’t seem to be anything in icinga2.log indicating concurrent disconnections, DB errors, or anything else. Icinga Web 2 was logging to the syslog user facility at the Warning level and didn’t reveal any errors, either. I’ve since set it to log at Information instead of Warning in the hope that it will reveal something…
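For reference, the logging section of Icinga Web 2’s config.ini now looks roughly like this (from memory, so adjust as needed):

[logging]
log = "syslog"
facility = "user"
application = "icingaweb2"
level = "INFO"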

I spent about 1.5 hours googling and haven’t found anything similar so far. I’m hoping someone here has run into this before (and fixed it)?

After a little more digging, I found that in the icinga_objects table of the IDO, is_active is set to 0 for this host during the times it’s not displaying in the web ui. So the question becomes, what’s setting it to 0?
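In case anyone wants to reproduce the check, this is roughly the query I’ve been using against the IDO (the host name is a placeholder):

select name1, is_active
from icinga_objects
where objecttype_id = 1   -- 1 = host objects in the IDO schema
  and name1 = 'problem_host';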


(Roland) #2

I’ve had another strange behaviour which might appear similar to yours. We finally came to the conclusion that the icinga2 db was corrupt (unfortunately, without finding the root cause). So I dropped the db and recreated it from scratch. And the issue disappeared.


#3

I’m not quite willing to drop the whole database, but I did try deleting the host and its services from the hosts, services, and objects tables. This revealed another interesting bit of unexpected behavior.
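For the record, the cleanup was roughly the following, run with Icinga2 stopped ('problem_host' is a placeholder for the real host name):

delete from icinga_services
  where host_object_id = (select object_id from icinga_objects
                          where objecttype_id = 1 and name1 = 'problem_host');
delete from icinga_hosts
  where host_object_id = (select object_id from icinga_objects
                          where objecttype_id = 1 and name1 = 'problem_host');
delete from icinga_objects where name1 = 'problem_host';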

I had disabled notifications in the web UI for this host while I resolve this issue and get it fully configured with warn/crit thresholds. I had expected to have to tick that checkbox again after deleting the row from icinga_hosts, but instead, after deleting the row and restarting Icinga2, the host came back with notifications already disabled. While that’s convenient, it does suggest there is some residual config source, somewhere in the IDO or in a file, that stores that value and can apply it even when the host_id and host_object_id values have changed.

Probably a dead end, but I am curious what that source is and if it might include some setting or error that’s causing this intermittent issue.


#4

Wow, what timing. Right after I posted my last update and right before the end of my shift, the host disappeared again and I was able to view the logs with much less noise. Twice, actually, which gives better confirmation.

So this is a little odd, still, but it’s a good lead.

In both instances where the host dropped out of the web UI, Icinga2 logged a MySQL deadlock. I saw these errors earlier today as well… but the weird part is that I’ve seen them in the logs for months and they’ve never seemed to cause any real issues until now, which is why I had been dismissing them. I would also think that if the deadlocks were going to cause problems for one host, they’d cause problems for several or all hosts…

The IDO, of course, lives on one of the master endpoints. The master zone includes two endpoints, and the troublesome satellite is actually the 11th Icinga instance outside of the master zone sending data that needs to be written to the DB.

I guess I’ll need to look into ways to minimize these deadlocks when I come back in next week. Any tips would be appreciated. I’m also happy to provide more details/context if it helps devs/other users understand the issue and react to it.


(Roland) #5

What about dropping the database temporarily (and restoring the current DB after some testing)?


#6

Hmm, I’m still reluctant to drop the DB without understanding the problem. Even if doing so fixes the immediate symptom, it doesn’t tell me anything about what caused it in the first place or how I can avoid it happening again in the future.

I can boil this issue down to two things that I need to understand:

  1. What is causing Icinga2 to run UPDATE queries that set these objects’ is_active field to 0, and
  2. Is #1 related to the MySQL deadlocks?

I ended up completely removing the config from Icinga over the weekend so we didn’t get spurious alerts. Hopefully tomorrow I can add it back in and capture better logs of the issue as it happens.

After that, maybe I will drop the db and let it repopulate on its own and gather new logs. If the disappearing issue is resolved by this, that would be a useful datapoint–maybe good enough for a bug report :slight_smile: .

That still leaves the deadlock issue, though. It looks like there’s a pretty long history of people running into that one, with gradual improvements over time.


#7

An update: as I continued to investigate, I went back to the docs, walked through the HA master setup guide, and confirmed that I had correctly followed each step. Lo and behold, I somehow had not set enable_ha = true in the IdoMysqlConnection object. I set that and restarted Icinga2 on both master endpoints, and I think the issue is resolved. At least, I haven’t seen a single deadlock for 50 minutes, which is the longest it has gone without one in about 5 days.
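For reference, the relevant part of the IDO feature config (typically /etc/icinga2/features-enabled/ido-mysql.conf) now looks roughly like this, with placeholder credentials:

object IdoMysqlConnection "ido-mysql" {
  user = "icinga"
  password = "icinga"
  host = "localhost"
  database = "icinga"

  enable_ha = true   // the attribute I had missed
}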

I think this has resolved the issue. If I come back tomorrow and find it’s still happening, then it’s back to the drawing board :stuck_out_tongue: .

Edit to check in a day later: Yep, have not seen a single deadlock error in the logs all day. The hosts that were disappearing have also been stable in the web ui.


#8

There was one more step that I missed. I almost want to blame the docs a little for this one, but really I should have thought of it way sooner.

The secondary master node did not have zone/endpoint definitions for the two new hosts that have been giving me trouble.

Any time one instance or the other took over writing to the IDO, it would update the records with the data it had: effectively a split-brain problem originating from outside of MySQL itself. This explains why the switching happened much more frequently when enable_ha was false and both instances were stepping on each other to write to the DB.

When I re-read the docs the other day to check that my setup process had been correct, I didn’t even think about making sure zones.conf was in sync between the two master nodes. I didn’t think to check because the HA cluster setup tutorial only shows editing zones.conf on master1 and on the satellites/clients. There doesn’t seem to be any mention of what needs to be done with zones.conf on the other master… but it’s easy enough to extrapolate from the HA satellite cluster example that zones.conf needs to be synced between nodes in the same zone, so that’s my bad.

Again extrapolating from the HA satellite example, I tried moving zone configuration into a file under zones.d/master. I found that /etc/icinga2/zones.conf still has to exist, so that’s where the node’s own zone and endpoint, plus the other endpoints in the same zone, get defined. I then created zones.d/master/child_zones.conf, copied all of the satellite/client zones and endpoints into it, and restarted Icinga2. Voilà, it works!
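To make that concrete, the layout I ended up with looks roughly like this (hostnames are placeholders), identical on both masters:

// /etc/icinga2/zones.conf -- the node's own endpoint/zone plus its peer
object Endpoint "master1.example.org" { }
object Endpoint "master2.example.org" { }

object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org" ]
}

// /etc/icinga2/zones.d/master/child_zones.conf -- replicated by the cluster
// config sync, so both masters know about every satellite/client zone
object Endpoint "satellite1.example.org" {
  host = "satellite1.example.org"
}

object Zone "satellite1" {
  endpoints = [ "satellite1.example.org" ]
  parent = "master"
}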




(The following is what I had originally written hoping that someone could make sense of these symptoms. In the process of writing I had a few epiphanies that led to understanding and the conclusion above. I figured I’d keep these observations in the post for the sake of future googlers who might run into the same issues and be as stumped as I’ve been.)

Maybe this wasn’t as fixed as I thought. On day 2, I came in to find one of the problem hosts had again been deactivated. There are no MySQL errors now, and the frequency of these random changes is much lower. Some new observations:

From the API console:

=> get_service("problem_host", "procs").next_check
1547236808.330137

(which translates to Fri Jan 11 15:00:08 local 2019)

From the IDO:

select next_check from icinga_servicestatus where service_object_id = 7293; 
+---------------------+
| next_check          |
+---------------------+
| 2019-01-09 12:40:07 |
+---------------------+
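(As a sanity check on the comparison, the API’s epoch value can be converted in MySQL; with the session time zone set to local time it matches the translation above:)

select from_unixtime(1547236808) as api_next_check;
-- 2019-01-11 15:00:08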

So the service hasn’t been updated in 3 days, even after I forcibly activated the host and all its services this morning with an UPDATE command in MySQL, and despite Icinga2 seemingly doing its job of monitoring the host.
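(For the record, the forced reactivation was roughly this; it flips is_active back on for the host object and its service objects, since they all share the same name1:)

update icinga_objects set is_active = 1 where name1 = 'problem_host';   -- placeholder host name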

The stale IDO data also seems to occur when I try to silence notifications or acknowledge problems from the web UI for the problem hosts. Before enabling HA for the IDO, that was occurring for other hosts as well, but currently it seems isolated to these two.

So what’s special about these two hosts? [[and here is where I stopped to double check my config and had an epiphany]]

[[google keywords: can’t acknowledge alert, can’t disable notification, ido doesn’t update, cluster ido]]