Scheduled tasks (set by thruk) cause nagios to die, sometimes

  • We have a relatively large setup: almost 500 hosts and over 5500 services, monitored primarily by nagios core.


    One of my colleagues set up thruk (version 1.80-3 at installation), and another scheduled recurring downtime for 8 hosts (30 minutes each, at about 00:20 every night, as they were being rebooted).


    Whilst I'm aware of thruk, my main area is nagios... and I got involved because I was aware that nagios was dying "sometimes" - approximately once or twice per month.


    I have set up cron jobs to detect failure of nagios and to collect various logs (/usr/local/nagios/var/nagios.log, /var/log/messages etc) at the time of the failure, before restarting nagios (as it is a production system).
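

    A minimal sketch of what such a watchdog could look like, for anyone wanting to do the same. This is not the actual script in use: the two log paths are the ones mentioned above, while the archive directory, the restart command and the function names are assumptions.

    ```shell
    #!/bin/sh
    # Watchdog sketch: detect a dead nagios, save log tails, then restart it.
    # The two log paths come from the post; everything else is an assumption.
    NAGIOS_LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"
    SYSLOG="${SYSLOG:-/var/log/messages}"
    ARCHIVE_DIR="${ARCHIVE_DIR:-/var/tmp/nagios-crash}"    # hypothetical location
    RESTART_CMD="${RESTART_CMD:-service nagios restart}"   # assumed init command

    # True when a process named exactly "nagios" exists.
    nagios_running() {
        pgrep -x nagios >/dev/null 2>&1
    }

    # Copy the last 2000 lines of each readable log into a timestamped
    # directory and print that directory's path.
    collect_logs() (
        dest="$ARCHIVE_DIR/$(date +%Y%m%d-%H%M%S)"
        mkdir -p "$dest" || exit 1
        for f in "$NAGIOS_LOG" "$SYSLOG"; do
            if [ -r "$f" ]; then
                tail -n 2000 "$f" > "$dest/$(basename "$f").tail"
            fi
        done
        echo "$dest"
    )

    # Collect evidence first, restart second.
    watchdog() {
        if ! nagios_running; then
            collect_logs >/dev/null && $RESTART_CMD
        fi
    }

    # Only act when invoked with --run, so the file can also be sourced.
    if [ "${1:-}" = "--run" ]; then
        watchdog
    fi
    ```

    Run from cron once a minute with the --run argument, this collects the evidence before the restart overwrites the state of the system.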


    Having investigated the logs, I was able to tie each failure to the same time of day, and the only thing running at that time was the thruk scheduled downtimes.


    I have "managed" the apache cron file, which launches the jobs, and found that by increasing the number of jobs I was able to get more reproductions, and therefore more evidence, while the auto-restart kept the impact small.


    The cron entry is


    Code
    20 0 * * * cd /usr/share/thruk && /bin/bash -l -c '/usr/bin/thruk -a downtimetask="1"' >/dev/null 2>>/var/lib/thruk/cron.log

    To force the failure, I have 10 instances of this line per minute for 5 minutes (so a total of 50 entries). This typically increases the failure rate to once or twice a week.
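

    For anyone wanting to try the same stress set-up, the 50 entries can be generated rather than typed. A sketch only: the job is the cron entry above, but the function name and the choice of which five minutes to use (00:20 to 00:24 in the usage line) are assumptions, as the exact minutes are not stated.

    ```shell
    #!/bin/sh
    # The job itself is copied from the cron entry in the post.
    JOB="cd /usr/share/thruk && /bin/bash -l -c '/usr/bin/thruk -a downtimetask=\"1\"' >/dev/null 2>>/var/lib/thruk/cron.log"

    # Print COPIES identical crontab lines for each of MINUTES consecutive
    # minutes starting at START_MIN, all in hour 0.
    # Usage: gen_entries START_MIN MINUTES COPIES
    gen_entries() (
        start_min=$1; minutes=$2; copies=$3
        m=$start_min
        while [ "$m" -lt $((start_min + minutes)) ]; do
            c=0
            while [ "$c" -lt "$copies" ]; do
                printf '%s 0 * * * %s\n' "$m" "$JOB"
                c=$((c + 1))
            done
            m=$((m + 1))
        done
    )
    ```

    Calling gen_entries 20 5 10 prints 50 lines that can be reviewed and then pasted into the crontab.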


    I initially raised this with nagios.... but currently we have no further information....


    Last week, I upgraded thruk to 2.00, and to date there have been no repeats.


    Any suggestions (additional logs to collect, settings to set etc etc), much appreciated.


    Thanks, Malcolm

  • nagios core version 4.2.4
    livestatus: Livestatus 1.2.0p3 by Mathias Kettner. Socket: '/usr/local/nagios/var/rw/live'
    Thruk 2.00-2


    The problem only occurs at the time when "downtimes" scheduled from within thruk either start or end (from what I've found).


    Typically, it is seen at the scheduled start time - but I have seen a failure at the end of a downtime.


    The problem occurred relatively regularly (once or twice a week) when using Thruk 1.80-3 (with all other software the same), but currently, there has been no recurrence since the thruk upgrade (3-Jan-2017).
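

    One way to check the start/end correlation against a given failure is to pull the downtime entries out of nagios.log around the time of the crash. A rough sketch, assuming the usual "[epoch] HOST DOWNTIME ALERT: ..." / "[epoch] SERVICE DOWNTIME ALERT: ..." log lines; the function name and the 120-second default window are assumptions.

    ```shell
    #!/bin/sh
    # Print every DOWNTIME ALERT line logged within WINDOW seconds of
    # CRASH_EPOCH (both given in seconds since the epoch).
    # Usage: downtime_events_near CRASH_EPOCH [WINDOW] [LOGFILE]
    downtime_events_near() (
        crash_epoch=$1
        window=${2:-120}
        log=${3:-/usr/local/nagios/var/nagios.log}
        awk -v t="$crash_epoch" -v w="$window" '
            /DOWNTIME ALERT:/ {
                # $1 looks like "[1483402800]"; strip the brackets.
                ts = substr($1, 2, length($1) - 2) + 0
                if (ts >= t - w && ts <= t + w) print
            }' "$log"
    )
    ```

    If every crash has a STARTED or STOPPED entry just before it, that supports the downtime theory; a crash with none nearby would argue against it.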


    The problem is no longer critical, as I have scripts in place to detect the failure, collect logs, and restart nagios, and I suspect I can stop the problem completely by removing all thruk scheduled downtimes. But because it is not a show-stopper, I am interested in actually fixing the problem.


    Hence this thread.


    Should I be looking to update livestatus ?


    Malcolm

  • We can try. I'm assuming that what you are suggesting is to stop the current nagios, then start nagios from within gdb, without any breakpoints. Is that correct?


    For the meantime, I would prefer to await a reproduction first with the latest configuration (just in case the problem is solved).


    And then, assuming we get a reproduction, run it under gdb for "more data".


    Does that make sense?
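

    For reference, this is roughly how I understand the gdb suggestion. A sketch only: the binary and config paths are the defaults for a source install and are assumptions here, as is the log location. nagios is started without -d so that it stays in the foreground under gdb, and the running instance has to be stopped first.

    ```shell
    #!/bin/sh
    # Paths below are assumed defaults for a source install of nagios.
    NAGIOS_BIN="${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}"
    NAGIOS_CFG="${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}"
    BT_LOG="${BT_LOG:-/var/tmp/nagios-gdb.log}"   # hypothetical output file

    # Run nagios in the foreground under gdb with no breakpoints. When the
    # process stops on a fatal signal, dump a full backtrace of every
    # thread into BT_LOG, then let gdb exit.
    run_under_gdb() {
        gdb -batch \
            -ex run \
            -ex 'thread apply all bt full' \
            --args "$NAGIOS_BIN" "$NAGIOS_CFG" > "$BT_LOG" 2>&1
    }
    ```

    The backtrace is only readable if the nagios binary still has its debug symbols.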

  • For the record: we had a reproduction during working hours yesterday, again related (it seems) to scheduled downtime, at either the beginning or the end of multiple, overlapping downtimes.


    This time, they were scheduled using the nagios programmatic interface (SCHEDULE_HOSTGROUP_HOST_DOWNTIME) - and the only thruk involvement was that people were using it.
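

    For anyone wanting to attempt a reproduction without thruk, overlapping downtimes can be submitted straight to the external command file. A sketch under assumptions: the command file path is the default for a source install, and the hostgroup, author and comments are placeholders. The field order is hostgroup;start;end;fixed;trigger_id;duration;author;comment.

    ```shell
    #!/bin/sh
    # Assumed default location of the nagios external command file.
    CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

    # Print one SCHEDULE_HOSTGROUP_HOST_DOWNTIME command line for a fixed
    # downtime (fixed=1, trigger_id=0). Times are seconds since the epoch.
    # Usage: format_hostgroup_downtime NOW HOSTGROUP START DURATION AUTHOR COMMENT
    format_hostgroup_downtime() (
        now=$1; hostgroup=$2; start=$3; duration=$4; author=$5; comment=$6
        end=$((start + duration))
        printf '[%s] SCHEDULE_HOSTGROUP_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
            "$now" "$hostgroup" "$start" "$end" "$duration" "$author" "$comment"
    )

    # Submit two 30-minute downtimes for hostgroup $1 that start 30 seconds
    # apart, so that they overlap.
    submit_overlapping() {
        now=$(date +%s)
        format_hostgroup_downtime "$now" "$1" "$((now + 60))" 1800 tester "overlap A" > "$CMD_FILE"
        format_hostgroup_downtime "$now" "$1" "$((now + 90))" 1800 tester "overlap B" > "$CMD_FILE"
    }
    ```

    This should only be pointed at a test instance, since the whole point is to try to make nagios fall over.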


    As such, despite nagios support stating "it's probably thruk", I believe this problem is related to downtime and nagios core, and that thruk was an innocent bystander.


    As such, I am happy that this thread can be closed - thank you for taking the time to read !