Windows Agents crashing

Hello everyone,

Is there a way to debug crashes of the Windows agent?
Every time I deploy a new configuration through the Director, some of my Windows agents crash and stop working until I restart the services.

I have an infrastructure with several hundred servers and I simply can’t connect to every one of them to restart the services.

Cheers,
Kevin

Are there any hints in the event logs?
Are these crashes limited to a certain OS version/architecture?
Do these crashes occur on every restart on specific hosts?

  • No hints at all
  • Crashes happened on 2012, 2012R2 and 2008R2 so far
  • Crashes only happen sometimes, more often if I generate two or more configs in quick succession (within 2 minutes or so)

Sorry…

There should be at least an application log written to %ProgramData%\icinga2\var\log\icinga2\icinga2.log that provides insight into what happened before the crash.

Hi,

What exactly do you mean by a crash? If you deploy a new configuration and the Icinga Agent crashes, are you able to start the Icinga service again without any changes to the configuration?

Please provide a config validation output from one of the crashed Icinga Agents.

  1. Open a command prompt
  2. Navigate into the Icinga installation directory (e.g. C:\Program Files\Icinga2\sbin\)
  3. Run icinga2.exe daemon --validate

Please also have a look at the Icinga 2 log files. They are located at C:\ProgramData\icinga2\var\log\.

Kind regards
Michael


are you able to start the Icinga service again without any changes to the configuration?

Yes, I am.

Have a look at the screenshots I made:
[Screenshot: 2018-01-09 23_01_28-Windows 2016 RDS - Desktop Viewer]

And the second, because of the “new user restrictions”:

The output in your first screenshot points to the reason why the Agent cannot start anymore: the port on which Icinga 2 wants to listen (5665) is already in use, so Icinga 2 cannot add a listener for this port.

You can verify that by running netstat -ano | find "5665" after your Icinga 2 Agent has crashed. You should see that port 5665 is still open, although the Agent isn’t running.
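If you want to run the same check from a script on many hosts, here is a minimal Python sketch. It is only an illustration, not part of Icinga: the function name port_in_use is made up, and it assumes Python is available on the host. It only detects a socket that still accepts connections, so it will not show other states that netstat lists.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration only; it does not tell you
    which process owns the port (netstat -ano shows the PID).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use(5665) right after a crash would then tell you whether something is still listening on the Icinga port of that machine.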

Which node opens the connection? I assume it’s Master -> Client? Under which user is the Icinga Agent running, the default?

Best regards
Michael

Thanks Michael, I had the same thoughts.

But I think the Agent is blocking itself: I can start the Agent almost immediately after it crashes because of the new configuration.

Edit:

Nothing special in the screenshot.
I can easily restart the agent without problems.

Both can open a connection, they know each other.
User is the local SYSTEM account.

It is key that only one connection direction is attempted (docs: “endpoint connection direction”). Can you share the Endpoint definitions from the master’s and the client’s zones.conf? I guess all of them have the host attribute specified.

I don’t have to share them. I know that I configured the endpoint attribute on both sides.

In other words: If I remove the endpoint attribute from the client, everything should work as expected?

The “I know everything, so I don’t show you” behaviour isn’t really helpful here. If you want help for free, be so kind and answer questions in a gentle way.

Regards,
Michael


Don’t be so negative… It’s just easier to answer things I already know.

Master:

object Endpoint "shdeofgvim01" {
        host = "10.21.244.200"
}

object Zone "xxxxxxx Master-Zone" {
        endpoints = [ "shdeofgvim01" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Satellite 1:

object Endpoint "shdeofgvim01" {
        host = "10.21.244.200"
        port = "5665"
}

object Zone "master" {
        endpoints = [ "shdeofgvim01" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Endpoint NodeName {
}

object Zone ZoneName {
        endpoints = [ NodeName ]
        parent = "master"
}

Client:

object Endpoint "shdeofgvim01" {
        host = "10.21.244.200"
}

object Zone "xxxxxx Master-Zone" {
        endpoints = [ "shdeofgvim01" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

As I said, everyone knows each other. ;)

Edit:
The TCP listener is enabled through the setup wizard, if it helps.

The client is connected to the master zone, not the satellite.

object Endpoint "shdeofgvim01" {
        host = "10.21.244.200"
}

object Zone "xxxxxx Master-Zone" {
        endpoints = [ "shdeofgvim01" ]
}

It actively connects there using the host attribute.

This explains why the satellite does not have any client Endpoint object.

The master node doesn’t have a client Endpoint object.

It seems the configuration is not entirely complete. If you don’t manage everything in zones.conf, the output of

icinga2 object list --type Endpoint
icinga2 object list --type Zone

from all three instances is also an option.

Ok, then explain the endpoint connection direction in your configuration in three short sentences, adding the reasoning for each.

Whoops, a copy/paste error that happens if you copy through RDP in a VMware remote console through Citrix…:

Client:

object Endpoint "shdeofgnag05.xxxxx.org" {
	host = "10.150.0.10"
	port = "5665"
}

object Zone "master" {
	endpoints = [ "shdeofgnag05.xxxxx.org" ]
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}

object Endpoint NodeName {
}

object Zone ZoneName {
	endpoints = [ NodeName ]
	parent = "master"
}

shdeofgvim01 is the master.
shdeofgnag05 is the satellite.

So:
The client knows its satellite.
The satellite knows its clients.
The master knows everything.

The endpoints attribute tells the instance to actively connect to it, or am I wrong?

I manage everything in the Director, and it’s working fine for that.
The only problem is that the agent sometimes stops working if I deploy a new config.
About 50% of the time everything works fine. The other 50% of the time, about half of the agents don’t start again.

Regarding the first screenshot you made (with the critical ApiListener log entries): are you sure you ran the right command, icinga2 daemon --validate? If you forgot the --validate parameter, you started an Icinga 2 instance in the foreground, which will fail because Icinga 2 is already running in the background and listening on port 5665.

Yes. The host attribute inside the Endpoint object is the parameter which defines the connection direction. For example, if you set the host attribute (in the client Endpoint object) on your master, the master will actively build a connection to your client. If the host attribute (in the master Endpoint object) is set on your client, the client will actively build a connection.
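To make the rule easier to check across several zones.conf files, here is a small Python sketch of it. It is not Icinga code: the function names, the node names master1 and agent1, and the addresses are all made up for illustration. It assumes you have written down, per node, which Endpoint objects that node defines and whether each one has a host attribute.

```python
def connection_attempts(configs):
    """List (from_node, to_node) pairs implied by the host attributes.

    configs maps a node name to the Endpoint objects in that node's
    configuration, as {endpoint_name: host_or_None}. A node tries to
    connect to every foreign endpoint that has a host set.
    """
    attempts = []
    for node, endpoints in configs.items():
        for endpoint, host in endpoints.items():
            if endpoint != node and host:
                attempts.append((node, endpoint))
    return sorted(attempts)


def both_directions(attempts):
    """Return node pairs where each side tries to connect to the other."""
    pairs = set(attempts)
    return sorted({tuple(sorted(p)) for p in pairs if (p[1], p[0]) in pairs})


# Hypothetical example: host is set on both sides
configs = {
    "master1": {"master1": None, "agent1": "192.0.2.10"},
    "agent1": {"agent1": None, "master1": "192.0.2.1"},
}
attempts = connection_attempts(configs)
print(attempts)
print(both_directions(attempts))
```

Any pair reported by both_directions is a place where both sides have the host attribute set, so both will try to open the connection.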

When the Icinga Agent crashes, how do you find this out? If you open the Windows Services Manager (services.msc), is the Icinga service running or is it stopped?

Best regards
Michael

The service is stopped, and in Icingaweb2 I see that the service states begin to change to unknown.

Maybe I forgot the --validate parameter, my bad.

Hi,

Can you please execute the command again with the parameter and post the output here? Note: Do this after the Icinga 2 Agent has crashed.

When the Agent crashed, was there any log message in the icinga2.log file? Search for the exact timestamp at which the crash happened.
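If the log is large, a short script can pull out the lines around the crash time. This is a rough Python sketch under some assumptions: the function name lines_near is made up, log lines are expected to start with a timestamp like [2018-01-09 23:01:10 +0100], and the time zone offset is ignored, so the crash time has to be given in the same local time the log uses. Lines without a leading timestamp are skipped.

```python
import re
from datetime import datetime, timedelta

# Matches a leading "[YYYY-MM-DD HH:MM:SS +ZZZZ]" timestamp
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\]")
FORMAT = "%Y-%m-%d %H:%M:%S"


def lines_near(lines, crash_time, window_seconds=60):
    """Return log lines stamped within window_seconds of crash_time."""
    crash = datetime.strptime(crash_time, FORMAT)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # continuation lines carry no timestamp
        stamp = datetime.strptime(match.group(1), FORMAT)
        if abs(stamp - crash) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

You would call it with the opened log file, for example lines_near(open(path, encoding="utf-8", errors="replace"), "2018-01-09 23:01:28"), where path points to your icinga2.log.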

Best regards
Michael

Hey guys,

sorry for the delay.

Attached you’ll find icinga2.log.

icinga2.exe daemon --validate is clean, except for the apply rule warnings. No errors at all :(