I am running Icinga 1.7.1 on a rather large environment (~40k service checks, ~3k host checks), and therefore using mod_gearman (1.3.6). I am also using ido2db with a MySQL backend. The installation is tuned for a large environment following all the tips I could find in the wiki and the documentation, and the MySQL database is optimized as well.
As a result, when I start Icinga, performance is good, with a latency < 2 s. This stays very stable for roughly 12 hours, and then things degrade very quickly: latency easily reaches 60 s within a few hours. I thought this could come from the ido2db module, but removing it only slows down the degradation.
From the icingastats output, I can say that the execution time remains constant, but what puzzles me is the number of active checks over the last 5 minutes (actchk5m): it is not constant over time at all, but follows a regular pattern. Typical values are 31k max and 15k average for the services, and 2k max and 1.2k average for the hosts. I can provide graphs of all these measurements if needed.
I have no idea whether this actchk5m behavior is related to my problem, but I have monitored all the basic performance metrics of the server running Icinga and Gearman, of several workers, and of the machine running MySQL.
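For anyone who wants to compare numbers, here is a minimal sketch of how these counters could be sampled periodically. It assumes that icingastats accepts the nagiostats-style `-c`, `--mrtg` and `--data=` options and prints one value per line in the order requested. The variable names and the config path are assumptions to check against your own installation, and `parse_mrtg` and `sample` are hypothetical helper names.

```python
import subprocess
from typing import Dict, List

# Assumed nagiostats-style variable names; verify them with `icingastats --help`.
VARS = ["NUMACTSVCCHECKS5M", "NUMACTHSTCHECKS5M", "AVGACTSVCLAT", "AVGACTSVCEXT"]


def parse_mrtg(output: str, names: List[str]) -> Dict[str, float]:
    """Map MRTG-style output (one value per line) onto the requested names."""
    values = [line.strip() for line in output.splitlines() if line.strip()]
    if len(values) < len(names):
        raise ValueError("fewer values than requested variables")
    return {name: float(value) for name, value in zip(names, values)}


def sample(binary: str = "icingastats",
           config: str = "/etc/icinga/icinga.cfg") -> Dict[str, float]:
    """Run icingastats once and return the requested counters.

    The binary name and config path are assumptions; adjust them locally.
    """
    cmd = [binary, "-c", config, "--mrtg", "--data=" + ",".join(VARS)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_mrtg(result.stdout, VARS)
```

Calling `sample()` from cron every few minutes and appending the result to a file would give a time series of the check counts and latency to compare against the system graphs.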
According to all these graphs, it would seem that Icinga is doing more and more checks: the network traffic increases, the interrupts and the context switches increase, MySQL gets more and more requests, the NFS server through which the checks are shared with the workers gets more and more requests... All these values increase linearly, and it is when a certain threshold is reached that Icinga's performance starts to degrade. What is odd is that the memory consumption and the CPU load stay more or less constant. A restart of Icinga (and only Icinga, not gearmand or ido2db) cures the problem.
I can't really explain this increase in activity. I should also point out that these machines do nothing but monitoring, so there is no interaction with other software.
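To put a number on the linear growth, one could fit a straight line to any of these metrics and estimate when it will cross the level at which latency starts to climb. This is only a sketch using ordinary least squares. The function names and the sample values in the usage note are hypothetical, and the threshold has to be read off your own graphs.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the samples."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all samples have the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def time_until(threshold: float, times: Sequence[float],
               values: Sequence[float]) -> Optional[float]:
    """Estimate the time left after the last sample before the fitted line
    reaches the threshold. Returns None if the trend is flat or decreasing."""
    slope, intercept = linear_fit(times, values)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - times[-1])
```

For example, with hourly samples `times = [0, 1, 2, 3]` and context switches `values = [100, 110, 120, 130]`, `time_until(200, times, values)` estimates how many more hours remain before the metric reaches 200, in the same unit as `times`.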
Does anybody else suffer from similar behavior?
Thanks in advance