Icinga2 check_by_ssh with persistent SSH connections


Hi all,
if you monitor your Linux clients agentless via check_by_ssh, you may run into the problem that /var/log/syslog or /var/log/auth.log fills up with messages, because every check opens a new SSH session. Instead, you can use a persistent SSH connection.

From the OpenSSH 5.6 release notes:
Added a ControlPersist option to ssh_config(5) that automatically
starts a background ssh(1) multiplex master when connecting. This
connection can stay alive indefinitely, or can be set to
automatically close after a user-specified duration of inactivity.
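To see what these options do outside of Icinga, you can try multiplexing by hand with a plain ssh_config entry. This is a sketch; the host alias and socket path are placeholders (`%h` is the standard ssh_config token for the remote host name):

```
# ~/.ssh/config -- example entry, host alias and path are placeholders
Host monitored-client
    ControlMaster auto
    ControlPath /var/run/icinga2/%h
    ControlPersist 10m
```

After the first `ssh monitored-client`, subsequent sessions reuse the background master instead of opening a new TCP/SSH connection.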

As I did not find any topics about persistent SSH connections, here is a little how-to (OS: Ubuntu 16.04):
On your checking servers (master, satellites), create a directory for the control socket. I use /var/run/icinga2 (which is already there).
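Note that on Ubuntu /var/run is a tmpfs, so a directory created there by hand disappears on reboot. The icinga2 package already takes care of /var/run/icinga2 itself; if you use a different socket directory, a tmpfiles.d entry can recreate it at boot (a sketch; file name, owner, and mode are assumptions to adjust for your setup):

```
# /etc/tmpfiles.d/icinga2-ssh.conf -- hypothetical file name
# d <path> <mode> <user> <group> <age>
d /var/run/icinga2/ssh 0750 nagios nagios -
```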

Create a Service Template for your SSH Checks:

template Service "by_ssh" {
  import "generic-service"
  check_command = "by_ssh"
  vars.by_ssh_logname = "nagios"
  vars.by_ssh_identity = "/<path_to_your>/id_rsa"
  vars.by_ssh_options = [ "ControlMaster=auto", "ControlPath=/var/run/icinga2/$host.name$", "ControlPersist=10m" ]
}
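After adding the template, it is worth validating the configuration before reloading (standard icinga2 CLI):

```
# Validate the configuration, then reload only if validation passes
icinga2 daemon -C && systemctl reload icinga2
```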


  • ControlMaster=auto: create the control master socket automatically
  • ControlPersist=10m: enable ControlPersist and spawn an ssh process in the background that keeps the connection alive for 10 minutes after the last SSH session on that connection has exited
  • ControlPath=/var/run/icinga2/$host.name$: path to the control socket (must match the directory created above)
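You can inspect and manage the multiplex master from the shell with OpenSSH's control commands (host name and user here are placeholders):

```
# Check whether a master connection is alive for this control socket
ssh -O check -o ControlPath=/var/run/icinga2/myhost nagios@myhost

# Tell the master to exit, e.g. before rotating keys
ssh -O exit -o ControlPath=/var/run/icinga2/myhost nagios@myhost
```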

Then use it for your further SSH checks:

apply Service "load" {
	import "by_ssh"
	vars.by_ssh_command = "/usr/lib/nagios/plugins/check_load"
	vars.by_ssh_arguments = {
		"-w" = {
			value = "$load_wload1$,$load_wload5$,$load_wload15$"
			description = "Exit with WARNING status if load average exceeds WLOADn"
		}
		"-c" = {
			value = "$load_cload1$,$load_cload5$,$load_cload15$"
			description = "Exit with CRITICAL status if load average exceeds CLOADn; the load average format is the same used by 'uptime' and 'w'"
		}
		"-r" = {
			set_if = "$load_percpu$"
			description = "Divide the load averages by the number of CPUs (when possible)"
		}
	}
	assign where host.vars.os == "linux"
}

and so on…


(Alireza J Erfani) #2

Thanks @anon9062050, I found your post very helpful. However, I still get an error in Icinga Web saying "Remote command execution failed: Host key verification failed.", which is apparently caused by the key not being readable. I can't figure out what's wrong — do you have any idea? I've included my template and service:
template Service "by_ssh" {
	import "generic-service"
	check_command = "by_ssh"
	vars.by_ssh_logname = "root"
	vars.by_ssh_identity = "/var/lib/nagios/.ssh/id_rsa"
}

apply Service "computeGrideDiskCheck" {
	import "by_ssh"
	display_name = "disks"
	check_command = "by_ssh"
	vars.disk_wfree = "25%"
	vars.disk_cfree = "15%"
	vars.disk_partitions = ["/tmp"]
	vars.by_ssh_command = ["/usr/lib/nagios/plugins/check_disk"]
	vars.by_ssh_arguments = {
		"-w" = "$disk_wfree$"
		"-c" = "$disk_cfree$"
		"-p" = "$disk_partitions$"
	}
	assign where host.vars.os == "dev" && host.vars.check_byssh == 1 && host.vars.address == ""
}
Also, since the nagios user by default has neither a shell nor a password, is it really safe to use it as the login user?
Thanks very much.


I’d doubt that.

In most cases the reason is that the target's host key has not been accepted yet. The public key has been transferred to the target machine (client), but there has not been a successful first connection from the source machine (monitoring server), which prompts you to confirm that you actually trust the client's host key. Once this approval has been given, there are no further prompts until the key changes.

So log in to the monitoring machine as the user that executes the checks, run "ssh <user>@<target>", accept the host key, and log out of the target machine. Repeat for the other clients.
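If logging in to every client by hand is impractical, the host key check can also be handled non-interactively. A sketch, extending the options array from the by_ssh template — note that the `accept-new` value requires OpenSSH 7.6 or later (Ubuntu 16.04 ships 7.2), so check your version first:

```
  vars.by_ssh_options = [ "ControlMaster=auto",
                          "ControlPath=/var/run/icinga2/$host.name$",
                          "ControlPersist=10m",
                          "StrictHostKeyChecking=accept-new" ]
```

On older OpenSSH versions you can instead pre-populate the check user's known_hosts file with `ssh-keyscan <target>`.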

(Alireza J Erfani) #4

Thanks very much for the info. I just had to give the nagios user a shell and a password — it's working now!