Performance decreasing over time (Icinga 1.7.1)

  • Hi everyone,


    I am running Icinga 1.7.1 in a rather large environment (~40k service checks, ~3k host checks), and therefore using mod_gearman (1.3.6). I am also using ido2db with a MySQL backend. The installation is tweaked for large environments following all the tips I could find in the wiki and docs, and the MySQL database is optimized as well.


    As a result, when I start Icinga, I get good performance, with a latency < 2 s. This stays very stable for roughly 12 hours, and then things degrade very fast, and my latency easily reaches 60 s after a few hours. I thought this could come from the ido2db module, but removing it just slows down the degradation.
    From the icingastats output, I can say that the execution time remains constant, but what puzzles me is the average number of checks over the last 5 min (actchk5m): it is not constant over time at all, but follows a regular pattern. The typical values are 31k max and 15k average for the services, and 2k max and 1.2k average for the hosts. I can provide graphs of all these measurements if needed.


    I have no idea whether this actchk5m behavior is related to my problem, but I have monitored all the basic performance metrics of the server running Icinga & gearman, of several workers, and of the machine running MySQL.
    According to all those graphs, it would seem that Icinga is doing more and more checks: the network traffic increases, the interrupts increase as well as the context switches, MySQL gets more and more requests, the NFS server through which the checks are shared with the workers gets more and more requests... All those values increase linearly, and it is when a certain threshold is reached that Icinga's performance starts degrading. What is funny is that the memory consumption and the CPU load are more or less constant. A restart of Icinga (and only Icinga, not gearmand or ido2db) cures the problem.


    I can't really explain this increase in activity. I should also point out that those machines do nothing but monitoring, so there is no interaction with other software.
    Does anybody else suffer from similar behavior?


    Thanks in advance
    Chris
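
Editor's sketch: the post above describes metrics that grow linearly until a threshold is reached. One way to check such a claim on sampled data is a least-squares line fit. The function names, the (timestamp, value) sample format, and the numbers in the usage note are assumptions made for illustration; none of them come from Icinga or mod_gearman.

```python
from statistics import mean


def linear_trend(samples):
    """Return (slope, intercept) of the least-squares line through samples.

    samples is a list of (timestamp_seconds, value) pairs, for example
    the number of checks per 5 minutes read at regular intervals.
    """
    xs = [t for t, _ in samples]
    ys = [v for _, v in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct timestamps")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def seconds_until(samples, threshold):
    """Estimate the seconds from the last sample until the fitted line
    reaches threshold. Returns None if the trend is flat or decreasing,
    and 0.0 if the fitted line is already at or above the threshold.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - samples[-1][0])
```

Usage: with hypothetical samples such as `[(0, 100), (300, 130), (600, 160)]`, `linear_trend` gives the growth per second, and `seconds_until(samples, 200)` gives a rough estimate of how long it takes to reach a chosen limit. A real series would be noisier, so the estimate is only indicative.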

  • blind guess - any possible relation to those reports?
    https://dev.icinga.org/issues/2993


    and yet, possible fix as well?
    https://git.icinga.org/?p=icin…95590350ccb309b3ec79212da

  • Hum, I don't know the part of the code that is mentioned, so I should have a closer look. Nevertheless, I can say that this happens without any forced check, so if this part of the code is not reached "naturally", then the root cause is different. I can apply the patch anyway and give you feedback, but it will have to wait until Monday :-)
    Thanks for the very quick reply! :-)


  • There is an ongoing discussion on the mod-gearman mailing list about this problem. Seems like there are 2 problems here:
    - icinga (1.7.1) reschedules an immediate check on some conditions
    - mod-gearman (1.3.6) sends 2 results when a timeout is hit and mod-gearman has to kill the check plugin


    This leads to the current behaviour: the number of checks done per minute increases over time until all workers are busy all the time or your machines' resources reach their limits.


    The mod-gearman part is fixed in the current snapshots (http://mod-gearman.org/daily/2012-08-18/). The only change is the fix for this problem, so the beta release is stable.
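
Editor's sketch: the feedback loop described above (a timed-out check yields two results, and the extra result leads to an extra scheduled check) can be illustrated with a toy model. This is not Icinga's scheduler or mod-gearman's code. The function name, the parameters, and the assumption that each duplicate result leaves exactly one extra check scheduled for good are all hypothetical simplifications.

```python
def simulate_check_load(base_checks, timeout_fraction, worker_capacity, intervals):
    """Toy model of the check load per scheduling interval.

    Assumptions (hypothetical, for illustration only):
    - the workers can run at most worker_capacity checks per interval;
    - a fixed timeout_fraction of the executed checks hits the timeout;
    - each timed-out check produces one duplicate result, and each
      duplicate leaves one extra check scheduled from then on.

    Returns the number of scheduled checks at the start of each interval.
    """
    scheduled = base_checks
    history = []
    for _ in range(intervals):
        history.append(scheduled)
        executed = min(scheduled, worker_capacity)
        timed_out = int(executed * timeout_fraction)
        scheduled += timed_out  # duplicates pile up until a restart
    return history
```

Usage: `simulate_check_load(1000, 0.01, 5000, 5)` yields a load that climbs every interval, while a `timeout_fraction` of 0 keeps it flat. In this model only resetting `scheduled` brings the load back down, which is at least consistent with the report that restarting Icinga cures the problem.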

  • This sounds interesting. I will apply the patch asap.
    And under which condition does icinga reschedule the check?
    I also just found many messages like "(Could Not Start Check In Time)" in the logs that were not there previously. Is this one of the symptoms of the problem you mention?

  • i can only debug and speak for the icinga side; the mod_gearman part is too far outside my area at the moment. from what i can tell, the event reschedule bug could also happen on (re)start, as this comment remarks: https://dev.icinga.org/issues/2993#note-8 though i need to get this running on my dev boxes asap.

  • Wow, thanks a lot for the very quick reaction :-)
    I will deploy those changes asap (unfortunately I cannot touch the workers whenever I want).

  • So... I could finally deploy the changes yesterday evening, and I had icinga + mod_gearman + ido2db running overnight. The versions used are those suggested above.
    The situation seems to be pretty similar to before... actually maybe even worse, as it degraded faster.
    I restarted everything just now to get other plots. I will keep you posted.

  • icinga 1.7.2 will be out on monday. meanwhile you can fetch the nightly builds of r1.7 or the git branch. from my reports, various mentioned issues are resolved, but yours is still missing feedback on the icinga side.

  • So I confirm, the bug still hits me. But it shows up much faster.
    I will try right now with the latest git repository, to see if it changes anything.


    Also, could you tell me if those variations I observe in the actchk5m plots are something to worry about?

  • The behaviour seems a bit different though... I need to let it run longer to make sure, but it seems that the performance drops faster and fluctuates more than before. Before it was a linear decrease, whereas now it looks more like a step function. I will let you know.


    About the latest git version: did it pass the nightly build? I get this when trying to package it:


    + cp -pr README LICENSE Changelog UPGRADING README.RHEL /home/chaen/rpmbuild/BUILDROOT/icinga-1.7.1-1.el6.x86_64/usr/share/doc/icinga-doc-1.7.1
    + exit 0
    Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1
    Checking for unpackaged file(s): /usr/lib/rpm/check-files /home/chaen/rpmbuild/BUILDROOT/icinga-1.7.1-1.el6.x86_64
    error: Installed (but unpackaged) file(s) found:
    /usr/share/icinga/jquery-ui/jquery.ui.core.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.datepicker.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.mouse.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.slider.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.timepicker-addon.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.widget.min.js
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_flat_0_aaaaaa_40x100.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_flat_75_ffffff_40x100.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_55_fbf9ee_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_65_ffffff_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_75_dadada_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_75_e6e6e6_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_95_fef1ec_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_highlight-soft_75_cccccc_1x100.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_222222_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_2e83ff_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_454545_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_888888_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_cd0a0a_256x240.png
    /usr/share/icinga/jquery-ui/theme/jquery.ui.all.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.base.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.core.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.datepicker.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.slider.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.theme.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.timepicker-addon.css



    RPM build errors:
    File listed twice: /var/log/icinga/archives
    Installed (but unpackaged) file(s) found:
    /usr/share/icinga/jquery-ui/jquery.ui.core.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.datepicker.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.mouse.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.slider.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.timepicker-addon.min.js
    /usr/share/icinga/jquery-ui/jquery.ui.widget.min.js
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_flat_0_aaaaaa_40x100.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_flat_75_ffffff_40x100.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_55_fbf9ee_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_65_ffffff_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_75_dadada_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_75_e6e6e6_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_glass_95_fef1ec_1x400.png
    /usr/share/icinga/jquery-ui/theme/images/ui-bg_highlight-soft_75_cccccc_1x100.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_222222_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_2e83ff_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_454545_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_888888_256x240.png
    /usr/share/icinga/jquery-ui/theme/images/ui-icons_cd0a0a_256x240.png
    /usr/share/icinga/jquery-ui/theme/jquery.ui.all.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.base.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.core.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.datepicker.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.slider.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.theme.css
    /usr/share/icinga/jquery-ui/theme/jquery.ui.timepicker-addon.css

  • you are using the git master build instead of the r1.7 branch i pointed you to. the git master remains under heavy development and is not tested for packaging yet. thanks for the note on the new files, though.

  • Sorry, it's Friday.... I deployed the latest version of the proper git branch. Let's see how it goes :-)

  • I am pleased to announce that it has now been running for 2 days with no loss of performance! :-)
    Big congratulations and thank you for this very fast and efficient support :-)