Faulty Backend takes down Thruk


(rick.browne@gmail.com) #1

We have 35 Nagios servers linked to two Thruk servers (one running version 2.14 and one running 2.18).

Occasionally one of the Nagios servers leaks memory and hangs.

The issue is that this seems to knock out both of the Thruk servers as well. The site just stops responding, with this error:

Service Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

The logs record the backend going down-

[2018/04/25 11:31:58][opthruk01][ERROR][Thruk] No Backend available
[2018/04/25 11:31:58][opthruk01][ERROR][Thruk] on page: opthruk01/thruk/cgi-bin/extinfo.cgi?type=2&host=server&service=Hardware - Health&backend=f50e2&_=1524652316900
[2018/04/25 11:31:58][opthruk01][ERROR][Thruk] opnag01: ERROR: not a valid header (no content-length)
[2018/04/25 11:31:58][opthruk01][ERROR][Thruk] got: No UNIX socket /usr/local/nagios/var/rw/live existing. (10.2.50.247:6557)
[2018/04/25 11:32:28][opthruk01][ERROR][Thruk] No Backend available
[2018/04/25 11:32:28][opthruk01][ERROR][Thruk] on page: opthruk01/thruk/cgi-bin/extinfo.cgi?type=2&host=server&service=Hardware - Health&backend=f50e2&_=1524652346896
[2018/04/25 11:32:28][opthruk01][ERROR][Thruk] opnag01: ERROR: not a valid header (no content-length)
[2018/04/25 11:32:28][opthruk01][ERROR][Thruk] got: No UNIX socket /usr/local/nagios/var/rw/live existing. (10.2.50.247:6557)
[2018/04/25 11:35:37][opthruk01][ERROR][Thruk] No Backend available
[2018/04/25 11:35:37][opthruk01][ERROR][Thruk] on page: opthruk01/thruk/cgi-bin/extinfo.cgi?type=2&host=server&service=Hardware - Health&backend=f50e2&_=1524652497900
[2018/04/25 11:35:37][opthruk01][ERROR][Thruk] opnag01: ERROR: socket error. (10.2.50.247:6557)

I’ve had Nagios servers go offline before, and they don't affect Thruk this way. It only seems to happen with this particular backend when it has memory issues.

(FYI: I had to chop the https:// off the URL so the forum would let me post.)


(Sven Nierlein) #2

There are two possible solutions for this:

  • state_hosts
  • LMD

state_hosts requires at least one local nagios/naemon installation which has a host defined for each connection. The idea is to ask the local nagios/naemon which remote sites are available and then query only those. Read more here: https://www.thruk.org/documentation/configuration.html#check_local_states
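
To make the state-host idea concrete, here is a minimal Python sketch of the filtering step. It is not Thruk's implementation: the `Backend` class, its `state_host` field and the `reachable_backends` helper are hypothetical names for illustration, and it assumes the local core reports host states with the usual Nagios codes (0 = UP).

```python
from dataclasses import dataclass

# Nagios/Naemon host state codes: 0 = UP, 1 = DOWN, 2 = UNREACHABLE
HOST_UP = 0


@dataclass
class Backend:
    name: str        # label of the remote site, e.g. "opnag01"
    address: str     # livestatus address, e.g. "10.2.50.247:6557"
    state_host: str  # hypothetical: host in the local core that represents this site


def reachable_backends(backends, local_host_states):
    """Return only the backends whose state host is UP in the local core.

    local_host_states maps host name -> state code, as reported by the
    local nagios/naemon. Hosts missing from the map are treated as down.
    """
    return [b for b in backends if local_host_states.get(b.state_host) == HOST_UP]
```

Called with a list of backends and a state map such as `{"opnag01": 1, "opnag02": 0}`, this returns only the `opnag02` entry, so the site reported as down is never queried.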

LMD more or less completely replaces Thruk's backend handling with a separate daemon which runs the updates for each backend in a separate goroutine, so a single failed backend does not slow down Thruk.
See https://www.thruk.org/documentation/configuration.html#use_lmd_core
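
The isolation that LMD provides can be illustrated with a rough Python analogue. This is only a sketch of the pattern (one independent updater per backend feeding a shared cache), written with threads instead of goroutines; `BackendCache`, `start_updaters` and the `fetchers` mapping are hypothetical names, not part of LMD or Thruk.

```python
import threading
import time


class BackendCache:
    """Holds the last known result per backend; readers never wait on a backend."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def store(self, name, value):
        with self._lock:
            self._data[name] = value

    def snapshot(self):
        # Return a copy so callers can read it without holding the lock.
        with self._lock:
            return dict(self._data)


def start_updaters(fetchers, cache, interval=0.05):
    """Start one daemon thread per backend; each refreshes the cache on its own.

    fetchers maps backend name -> zero-argument callable returning fresh data.
    A fetcher that hangs or raises only affects its own thread.
    """
    def loop(name, fetch):
        while True:
            try:
                cache.store(name, fetch())
            except Exception as exc:
                cache.store(name, f"error: {exc}")
            time.sleep(interval)

    for name, fetch in fetchers.items():
        threading.Thread(target=loop, args=(name, fetch), daemon=True).start()
```

The frontend reads only from `cache.snapshot()`, so a backend whose fetcher blocks forever simply stops being refreshed while the others keep updating.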

With 35 remote sites, I’d suggest using LMD for performance reasons alone.