We have a relatively large setup: almost 500 hosts and over 5500 services, monitored primarily by Nagios Core.
One of my colleagues set up Thruk (version 1.80-3 at installation), and another scheduled some downtime for certain hosts (8 hosts, each with 30 minutes of downtime at about 00:20 each night, as they were being rebooted).
Whilst I'm aware of Thruk, my main area is Nagios, and I got involved because I knew that Nagios was dying "sometimes", approximately once or twice per month.
I have set up cron jobs to detect failure of Nagios and to collect various logs (/usr/local/nagios/var/nagios.log, /var/log/messages etc.) at the time of the failure, before restarting Nagios (as it is a production system).
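As a rough illustration of this kind of watchdog (not the actual cron job, which is not shown here), a minimal Python sketch could check the process and snapshot the logs before any restart. The function names, the snapshot directory layout and the lock-file path mentioned in the usage note are assumptions made for the example.

```python
import os
import shutil
import time
from pathlib import Path


def nagios_is_running(pid_file):
    """Return True if the PID stored in pid_file belongs to a live process."""
    try:
        pid = int(Path(pid_file).read_text().strip())
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks the PID exists
    except PermissionError:
        return True  # the process exists but belongs to another user
    except (OSError, ValueError):
        return False  # no PID file, unreadable contents, or no such process
    return True


def collect_logs(log_paths, dest_root, stamp=None):
    """Copy each existing log into dest_root/<stamp>/ and return the copies."""
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / stamp
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in map(Path, log_paths):
        if path.is_file():
            target = dest / path.name
            shutil.copy2(path, target)  # copy2 keeps the modification time
            copied.append(target)
    return copied
```

A cron-driven wrapper would call `nagios_is_running()` with the configured lock file (assumed here to be something like /usr/local/nagios/var/nagios.lock), and on failure call `collect_logs()` with the two log paths above before restarting Nagios by whatever means the host normally uses.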
Having investigated the logs, I was able to tie each failure down to the same time, and the only thing running at that time was the Thruk scheduled downtimes.
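For anyone wanting to do the same correlation, here is a minimal sketch of pulling the last entries from nagios.log before a failure. It relies only on the standard `[epoch] message` line format of nagios.log; the function name, the two-minute window and the idea of passing in a failure time are assumptions for illustration.

```python
import re
from datetime import datetime

# nagios.log lines start with a Unix timestamp in square brackets.
LINE_RE = re.compile(r"^\[(\d+)\] (.*)$")


def entries_before(log_lines, failure_epoch, window_seconds=120):
    """Return (local_time, message) pairs logged shortly before a failure."""
    hits = []
    for line in log_lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip anything that is not a timestamped entry
        stamp = int(match.group(1))
        if failure_epoch - window_seconds <= stamp <= failure_epoch:
            local_time = datetime.fromtimestamp(stamp).isoformat()
            hits.append((local_time, match.group(2)))
    return hits
```

Passing in the lines of a collected nagios.log together with the time the watchdog noticed the failure gives a short, human-readable list that can be compared against the cron schedule.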
I have "managed" the apache cron file, which launches the jobs, and found that by increasing the number of jobs I was able to get more reproductions, and therefore more evidence, whilst the auto-restart meant this wasn't causing too much of a problem.
The cron entry is
To force the failure, I have 10 instances of this line per minute for 5 minutes (so a total of 50 entries). This typically increases the failure rate to once or twice a week.
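The arithmetic of that stress schedule can be sketched in Python as below. `COMMAND` is a hypothetical stand-in for the real cron command, and the 00:20 start time is an assumption based on the nightly downtime window; only the 10-per-minute and 5-minute figures come from the description above.

```python
def cron_lines(command, hour=0, start_minute=20, minutes=5, per_minute=10):
    """Build user-crontab lines: per_minute copies for each of `minutes` minutes."""
    if start_minute + minutes > 60:
        raise ValueError("schedule would run past the end of the hour")
    lines = []
    for minute in range(start_minute, start_minute + minutes):
        # Identical lines in a crontab are each launched at the same minute.
        lines.extend([f"{minute} {hour} * * * {command}"] * per_minute)
    return lines


entries = cron_lines("COMMAND")  # placeholder for the actual job
print(len(entries))
print(entries[0])
```

Generating the lines this way makes it easy to raise or lower the load when trying to reproduce the failure more or less often.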
I initially raised this with Nagios, but so far we have no further information.
Last week, I upgraded Thruk to 2.00, and to date there have been no repeats.
Any suggestions (additional logs to collect, settings to change, etc.) would be much appreciated.