Icinga2 delayed checks

(Lucas Possamai) #1

Hi all.

My icinga server has some delayed checks and I don’t know where to look to find out what’s going on.
Basically, ping4 and ping6 are fine. All other services are delayed, with “Last check” showing more than 20 minutes ago.

I’m monitoring my hosts via icinga2 client agent connecting to its master.

Troubleshooting steps, which did not correct the problem:

  • restarted all services on the master and on the monitored host
  • “Force check” does not work; the service is not checked again after the force; it falls back into Late status

My Environment:
icinga2-ido-pgsql
icingaweb2 2.6.2-1
icinga2 2.10.2-1

  • Version used ( icinga2 --version ): version: r2.10.2-1
  • Operating System and version: Ubuntu 18.04.1 LTS (Bionic Beaver)
  • Enabled features ( icinga2 feature list ): Enabled features: api checker command ido-pgsql mainlog notification perfdata
  • Icinga Web 2 version and modules (System - About): 2.6.2
  • Config validation ( icinga2 daemon -C ):
[2018-12-20 13:23:22 +1300] information/cli: Icinga application loader (version: r2.10.2-1)
[2018-12-20 13:23:22 +1300] information/cli: Loading configuration file(s).
[2018-12-20 13:23:22 +1300] information/ConfigItem: Committing config item(s).
[2018-12-20 13:23:22 +1300] information/ApiListener: My API identity: icinga.datacentre.example.com
[2018-12-20 13:23:22 +1300] warning/ApplyRule: Apply rule 'ping6' (in /etc/icinga2/conf.d/services.conf: 34:1-34:21) for type 'Service' does not match anywhere!
[2018-12-20 13:23:22 +1300] warning/ApplyRule: Apply rule 'ssh' (in /etc/icinga2/conf.d/services.conf: 47:1-47:19) for type 'Service' does not match anywhere!
[2018-12-20 13:23:22 +1300] warning/ApplyRule: Apply rule 'apt' (in /etc/icinga2/zones.d/master/services.conf: 51:1-51:19) for type 'Service' does not match anywhere!
[2018-12-20 13:23:22 +1300] warning/ApplyRule: Apply rule '' (in /etc/icinga2/zones.d/master/services.conf: 69:1-69:66) for type 'Service' does not match anywhere!
[2018-12-20 13:23:22 +1300] warning/ApplyRule: Apply rule '' (in /etc/icinga2/zones.d/master/services.conf: 92:1-92:68) for type 'Service' does not match anywhere!
[2018-12-20 13:23:22 +1300] warning/ApplyRule: Apply rule '' (in /etc/icinga2/zones.d/master/services.conf: 191:1-191:86) for type 'Service' does not match anywhere!
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 ScheduledDowntime.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 92 Services.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 9 Hosts.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 FileLogger.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 2 NotificationCommands.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 176 Notifications.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 4 HostGroups.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 ApiListener.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 Downtime.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 Comment.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 11 Zones.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 9 Endpoints.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 3 ApiUsers.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 2 Users.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 215 CheckCommands.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 2 UserGroups.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 3 ServiceGroups.
[2018-12-20 13:23:22 +1300] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-12-20 13:23:22 +1300] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-12-20 13:23:22 +1300] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone ) from all affected nodes.

zones.conf:

object Endpoint "icinga.datacentre.example.com" {
}

object Zone "master" {
	endpoints = [ "icinga.datacentre.example.com" ]
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}

services.conf (original file from /etc/icinga2/conf.d/services.conf):

// Ping Check
apply Service "Ping" {
  check_command = "ping4"
  assign where host.address // check is executed on the master node
}

// System Load
apply Service "System Load" {
  check_command = "load"
  command_endpoint = host.vars.client_endpoint // Check executed on client node
  assign where host.vars.client_endpoint
}

// System Process Count
apply Service "Process" {
  check_command = "procs"
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

// Logged in User Count
apply Service "Users" {
  check_command = "users"
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

// Disk Usage Check
apply Service "Disk" {
  check_command = "disk"
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

// Disk Usage Check for Specific Partition
apply Service for (disk => config in host.vars.local_disks) {
  check_command = "disk"
  vars += config
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

// Icinga 2 Service Check
apply Service "Icinga2 Service" {
  check_command = "icinga"
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

// Apt service check
apply Service "apt" {
  import "generic-service"
  check_command = "apt"
  display_name = "apt-get updates"
  assign where host.vars.os == "ubuntu"
  command_endpoint = host.vars.client_endpoint
}

// YUM service check
apply Service "yum" {
  import "generic-service"
  check_command = "yum"
  display_name = "yum updates"
  assign where host.vars.os == "centos"
  command_endpoint = host.vars.client_endpoint
}

// TCP Port Check
apply Service for (tcp_port => config in host.vars.local_tcp_port) {
  check_command = "tcp"
  vars += config
  display_name = + vars.service_name + " - " + vars.port_number
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

//
// API check commands
//


// Apache VirtualHost Check
apply Service for (http_vhost => config in host.vars.local_http_vhosts) {
  check_command = "http"
  vars += config
  display_name = "API Status"
//  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

Why are those checks taking so long?

UPDATE 1:

You may notice that import "generic-service" is missing from some of the services above.
However, it is set in the /etc/icinga2/conf.d/services.conf file, which apparently is still being used by Icinga. So that is not the issue.


#2

Please run “icinga2 object list --type Service” on the CLI and check whether the services have a check and retry interval. Please also post /etc/icinga2/conf.d/services.conf.


(Lucas Possamai) #3

Yup… they do have the check and retry interval:

Object 'ns2.datacentre.example.com!/' of type 'Service':
  % declared in '/etc/icinga2/zones.d/master/services.conf', lines 41:1-41:59
  * __name = "ns2.datacentre.example.com!/"
  * action_url = ""
  * check_command = "disk"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 43:3-43:24
  * check_interval = 60
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 28:3-28:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "ns2.datacentre.example.com"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 45:3-45:46
  * display_name = "/"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "ns2.datacentre.example.com"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 41:1-41:59
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 5
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 27:3-27:24
  * name = "/"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 41:1-41:59
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 41:1-41:59
  * retry_interval = 30
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 29:3-29:22
  * source_location
    * first_column = 1
    * first_line = 41
    * last_column = 59
    * last_line = 41
    * path = "/etc/icinga2/zones.d/master/services.conf"
  * templates = [ "/", "generic-service" ]
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 41:1-41:59
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 26:1-26:34
  * type = "Service"
  * vars
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 44:3-44:16
    * disk_partitions = "/"
  * volatile = false
  * zone = "master"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 41:1-41:59

/etc/icinga2/conf.d/services.conf:
apply Service "ping4" {
  import "generic-service"

  check_command = "ping4"

  assign where host.address
}


/*
apply Service "ping6" {
  import "generic-service"

  check_command = "ping6"

  assign where host.address6
}
*/


/*
 * Apply the `ssh` service to all hosts
 * with the `address` attribute defined and
 * the custom attribute `os` set to `Linux`.
 */
apply Service "ssh" {
  import "generic-service"

  check_command = "ssh"

  assign where (host.address || host.address6) && host.vars.os == "Linux"
}



apply Service for (http_vhost => config in host.vars.http_vhosts) {
  import "generic-service"

  check_command = "http"

  vars += config
}

apply Service for (disk => config in host.vars.disks) {
  import "generic-service"

  check_command = "disk"

  vars += config
}

apply Service "icinga" {
  import "generic-service"

  check_command = "icinga"

  assign where host.name == NodeName
}

apply Service "load" {
  import "generic-service"

  check_command = "load"

  /* Used by the ScheduledDowntime apply rule in `downtimes.conf`. */
  vars.backup_downtime = "02:00-03:00"

  assign where host.name == NodeName
}

apply Service "procs" {
  import "generic-service"

  check_command = "procs"

  assign where host.name == NodeName
}

apply Service "swap" {
  import "generic-service"

  check_command = "swap"

  assign where host.name == NodeName
}

apply Service "users" {
  import "generic-service"

  check_command = "users"

  assign where host.name == NodeName
}

#4

You have to import generic-service, or at least set the check and retry interval, for every new service. What does
icinga2 object list --type Service --name *tcp*
say about the check and retry interval? To me it looks like they are just missing for a few services.


(Lucas Possamai) #5

Since my last post I’ve added the “import generic-service” to each service I have in /etc/icinga2/zones.d/master/hostname.conf and restarted icinga. But I’m still having the same issue.

icinga2 object list --type Service --name *tcp* - Returns nothing.
icinga2 object list --type Service --name *proc* (just as an example):

Object 'ns.datacentre.example.com!Process' of type 'Service':
  % declared in '/etc/icinga2/zones.d/master/services.conf', lines 17:1-17:23
  * __name = "ns.datacentre.example.com!Process"
  * action_url = ""
  * check_command = "procs"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 19:3-19:25
  * check_interval = 60
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 28:3-28:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "ns.datacentre.example.com"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 20:3-20:46
  * display_name = "Process"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "ns.datacentre.example.com"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 17:1-17:23
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 5
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 27:3-27:24
  * name = "Process"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 17:1-17:23
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 17:1-17:23
  * retry_interval = 30
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 29:3-29:22
  * source_location
    * first_column = 1
    * first_line = 17
    * last_column = 23
    * last_line = 17
    * path = "/etc/icinga2/zones.d/master/services.conf"
  * templates = [ "Process", "generic-service" ]
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 17:1-17:23
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 26:1-26:34
  * type = "Service"
  * vars = null
  * volatile = false
  * zone = "master"
    % = modified in '/etc/icinga2/zones.d/master/services.conf', lines 17:1-17:23

(Lucas Possamai) #6

UPDATE 23-12-2018:

I realised that all the checks executed on the server (icinga) itself are fine. The delayed ones are the services checked on the Linux hosts themselves, as shown in the examples below.

Also, this happens only with some hosts, not all of them. I’ve got another host with the same OS which is fine: no late checks at all.

on the server:

apply Service "Ping" {
  import "generic-service"
  check_command = "ping4"
  assign where host.address // check is executed on the master node
}

on the client (linux host):

// Disk Usage Check
apply Service "Disk" {
  import "generic-service"
  check_command = "disk"
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

// Disk Usage Check for Specific Partition
apply Service for (disk => config in host.vars.local_disks) {
  import "generic-service"
  check_command = "disk"
  vars += config
  command_endpoint = host.vars.client_endpoint
  assign where host.vars.client_endpoint
}

As you can see, I’ve added the import "generic-service" option to both services, but that did not solve the problem.

My generic-service template looks like this:

template Service "generic-service" {
  max_check_attempts = 5
  check_interval = 1m
  retry_interval = 30s
}
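
The template uses duration literals (1m, 30s), while the "icinga2 object list" output earlier in the thread reports the same attributes in seconds (check_interval = 60, retry_interval = 30). A minimal Python sketch of that conversion follows, assuming the suffixes ms, s, m, h and d and treating a bare number as seconds. The helper name duration_to_seconds is hypothetical and not part of Icinga.

```python
import re

# Duration suffixes as used in Icinga 2 config literals, mapped to seconds.
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_to_seconds(literal: str) -> float:
    """Convert a literal such as '1m' or '30s' to seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(ms|s|m|h|d)?", literal.strip())
    if match is None:
        raise ValueError(f"not a duration literal: {literal!r}")
    value, unit = match.groups()
    # A literal without a suffix is treated as a plain number of seconds.
    return float(value) * UNIT_SECONDS[unit or "s"]


# The two intervals from the generic-service template above.
for name, literal in (("check_interval", "1m"), ("retry_interval", "30s")):
    print(name, "=", duration_to_seconds(literal))
```

With a 60 second check interval, a "Last check" that is more than 20 minutes old means that at least 20 scheduled results never showed up.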

icinga2 daemon -C output:

[2018-12-24 00:43:50 +1300] information/cli: Icinga application loader (version: r2.10.2-1)
[2018-12-24 00:43:50 +1300] information/cli: Loading configuration file(s).
[2018-12-24 00:43:50 +1300] information/ConfigItem: Committing config item(s).
[2018-12-24 00:43:50 +1300] information/ApiListener: My API identity: icinga.datacentre.example.com
[2018-12-24 00:43:51 +1300] warning/ApplyRule: Apply rule 'apt' (in /etc/icinga2/zones.d/master/services.conf: 62:1-62:19) for type 'Service' does not match anywhere!
[2018-12-24 00:43:51 +1300] warning/ApplyRule: Apply rule '' (in /etc/icinga2/zones.d/master/services.conf: 80:1-80:66) for type 'Service' does not match anywhere!
[2018-12-24 00:43:51 +1300] warning/ApplyRule: Apply rule '' (in /etc/icinga2/zones.d/master/services.conf: 105:1-105:68) for type 'Service' does not match anywhere!
[2018-12-24 00:43:51 +1300] warning/ApplyRule: Apply rule '' (in /etc/icinga2/zones.d/master/services.conf: 215:1-215:86) for type 'Service' does not match anywhere!
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 8 ScheduledDowntimes.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 99 Services.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 9 Hosts.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 FileLogger.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 2 NotificationCommands.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 188 Notifications.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 4 HostGroups.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 ApiListener.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 8 Downtimes.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 PerfdataWriter.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 11 Zones.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 ExternalCommandListener.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 9 Endpoints.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 3 ApiUsers.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 8 Users.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 215 CheckCommands.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 4 UserGroups.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 3 ServiceGroups.
[2018-12-24 00:43:51 +1300] information/ConfigItem: Instantiated 3 TimePeriods.
[2018-12-24 00:43:51 +1300] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-12-24 00:43:51 +1300] information/cli: Finished validating the configuration file(s).

The debuglog on the Icinga server shows me this:

[2018-12-24 00:49:40 +1300] debug/CheckerComponent: Executing check for 'db2.datacentre.example.com!Disk'
[2018-12-24 00:49:40 +1300] debug/Checkable: Update checkable 'db2.datacentre.example.com!Disk' with check interval '60' from last check time at 2018-12-24 00:22:50 +1300 (1.54556e+09) to next check time at 2018-12-24 00:50:39 +1300 (1.54557e+09).
[2018-12-24 00:49:40 +1300] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2018-12-24 00:49:40 +1300] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'db2.datacentre.example.com'
[2018-12-24 00:49:40 +1300] debug/CheckerComponent: Check finished for object 'db2.datacentre.example.com!Disk'

However, in the web interface I do not see a last check at 00:49:40.
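
The "Update checkable" line in the master's debug log already contains what is needed to measure the delay: the time the line was logged, the check interval, and the last check time the master knows about. A minimal Python sketch that extracts those three values and computes the gap follows. The function name check_gap and the regular expression are illustrative assumptions based only on the single log line quoted above. The pattern accepts both straight and typographic quotes.

```python
import re
from datetime import datetime

# Pattern derived from the one "Update checkable" line quoted in this thread.
UPDATE_RE = re.compile(
    r"^\[(?P<logged>[^\]]+)\] debug/Checkable: Update checkable "
    r"['‘](?P<name>[^'’]+)['’] with check interval "
    r"['‘](?P<interval>[\d.]+)['’] from last check time at "
    r"(?P<last>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def check_gap(line):
    """Return (checkable, seconds since last check, missed intervals) or None."""
    match = UPDATE_RE.match(line)
    if match is None:
        return None
    logged = datetime.strptime(match["logged"], TIME_FORMAT)
    last = datetime.strptime(match["last"], TIME_FORMAT)
    gap = (logged - last).total_seconds()
    return match["name"], gap, gap / float(match["interval"])


SAMPLE = (
    "[2018-12-24 00:49:40 +1300] debug/Checkable: Update checkable "
    "'db2.datacentre.example.com!Disk' with check interval '60' from last "
    "check time at 2018-12-24 00:22:50 +1300 (1.54556e+09) to next check "
    "time at 2018-12-24 00:50:39 +1300 (1.54557e+09)."
)

name, gap, intervals = check_gap(SAMPLE)
print(f"{name}: last result is {gap:.0f}s old ({intervals:.1f} check intervals)")
```

For the quoted line the gap is 1610 seconds, so at 00:49:40 the master had not recorded a result for this service for almost 27 minutes, even though it keeps sending the check to the client.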

The debuglog on the db2 host shows me this:

[2018-12-24 00:34:58 +1300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_disk' '-c' '10%' '-w' '15%' '-X' 'none' '-X' 'tmpfs' '-X' 'sysfs' '-X' 'proc' '-X' 'configfs' '-X' 'devtmpfs' '-X' 'devfs' '-X' 'mtmfs' '-X' 'tracefs' '-X' 'cgroup' '-X' 'fuse.gvfsd-fuse' '-X' 'fuse.gvfs-fuse-daemon' '-X' 'fdescfs' '-X' 'overlay' '-X' 'nsfs' '-X' 'squashfs' '-m' '-p' '/prodBackup': PID 20775
[2018-12-24 00:34:58 +1300] notice/Process: PID 20775 ('/usr/lib64/nagios/plugins/check_disk' '-c' '10%' '-w' '15%' '-X' 'none' '-X' 'tmpfs' '-X' 'sysfs' '-X' 'proc' '-X' 'configfs' '-X' 'devtmpfs' '-X' 'devfs' '-X' 'mtmfs' '-X' 'tracefs' '-X' 'cgroup' '-X' 'fuse.gvfsd-fuse' '-X' 'fuse.gvfs-fuse-daemon' '-X' 'fdescfs' '-X' 'overlay' '-X' 'nsfs' '-X' 'squashfs' '-m' '-p' '/prodBackup') terminated with exit code 0
[2018-12-24 00:34:58 +1300] notice/ApiListener: Sending message 'event::CheckResult' to 'icinga.datacentre.example.com'
[2018-12-24 00:35:01 +1300] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0

So, the checks are happening but are not shown on the web?
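
One way to narrow this down is to line up the cluster messages from both debug logs: event::ExecuteCommand sent by the master, and event::CheckResult sent back by the client. A minimal Python sketch that pulls those ApiListener entries out of log lines follows. The function name api_messages and the pattern are assumptions derived only from the lines quoted in this thread, not from any Icinga tooling.

```python
import re

# Matches the "Sending message ... to ..." and "Relaying ... message" lines
# quoted above; tolerates typographic quotes and a missing closing quote.
MESSAGE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] notice/ApiListener: "
    r"(?P<action>Sending|Relaying) (?:message )?"
    r"['‘](?P<kind>[^'’]+)['’](?: message)?"
    r"(?: to ['‘](?P<peer>[^'’\s]+))?"
)


def api_messages(lines):
    """Return (timestamp, action, message type, peer) for each ApiListener line."""
    found = []
    for line in lines:
        match = MESSAGE_RE.match(line)
        if match:
            found.append((match["ts"], match["action"], match["kind"], match["peer"]))
    return found


MASTER_LOG = [
    "[2018-12-24 00:49:40 +1300] notice/ApiListener: Relaying 'event::SetNextCheck' message",
    "[2018-12-24 00:49:40 +1300] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'db2.datacentre.example.com'",
]
CLIENT_LOG = [
    "[2018-12-24 00:34:58 +1300] notice/ApiListener: Sending message 'event::CheckResult' to 'icinga.datacentre.example.com'",
    "[2018-12-24 00:35:01 +1300] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0",
]

for entry in api_messages(MASTER_LOG + CLIENT_LOG):
    print(entry)
```

Run over the full logs of both nodes, a listing like this would show whether every ExecuteCommand from the master is followed by a CheckResult from the client, or whether messages go missing in one direction.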


(Lucas Possamai) #7

UPDATE 24-12-2018:

Uninstalled the icinga2 server, deleted the databases, ran the setup again (maybe something was misconfigured), and added all the hosts back. The problem persists.

UPDATE 01-01-2019:

I haven’t found a solution for this yet.

UPDATE 11-01-2019: I have double-checked that all the services have “import generic-service” and the hosts have “import generic-hosts”.

And they both have:

max_check_attempts = 5
check_interval = 1m
retry_interval = 30s

Still haven’t found a solution for this. :(