jump to 100% CPU + database isn't able to keep up

  • Hi,


    I manage a monitoring setup running Icinga 2 version r2.6.3-1 (from the 2.6.3-1~debmon8+1 Debian package) in a 2-node master-client setup with 153 hosts and 1456 services, using a PostgreSQL BDR database.


    After the icinga2 process is started or reloaded, its CPU usage is normal (15%). After an arbitrary amount of time (between ~10 min and ~6 hours), the CPU load of the icinga2 process jumps to 100% and its query queue starts growing, with the log complaining that the queue will be "empty in infinite time, your database isn't able to keep up". It remains like this until the next reload, at which point the old process stops monitoring, quickly flushes its query queue into the database, and a new process is forked which starts monitoring like it should. As far as I've been able to see, no data gets lost, and checks keep running normally both in the normal state and in the high-CPU state.


    A dirty workaround for this issue would be to monitor the icinga2 log and let icinga2 reload itself whenever it finds a complaint about the time it expects to need to empty its queue, along the lines of the sketch below. A real fix is preferred :-).
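
    Something like this (untested) would probably do; the log path, the trigger string and the systemctl reload call are just assumptions on my side, and it ignores log rotation:

    Code
        import subprocess
        import time

        LOG_FILE = "/var/log/icinga2/icinga2.log"   # assumed default log location
        TRIGGER = "empty in infinite time"          # substring of the complaint in our log
        COOLDOWN = 600                              # don't reload more than once per 10 minutes

        def follow(path):
            """Yield lines appended to the log file, tail -f style."""
            with open(path, "r") as f:
                f.seek(0, 2)  # start at the end of the file
                while True:
                    line = f.readline()
                    if not line:
                        time.sleep(1)
                        continue
                    yield line

        last_reload = 0.0
        for line in follow(LOG_FILE):
            if TRIGGER in line and time.time() - last_reload > COOLDOWN:
                # Ask systemd to reload icinga2 so the old process flushes its queue.
                subprocess.call(["systemctl", "reload", "icinga2"])
                last_reload = time.time()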


    I've not been able to find a recent post with similar issues, though it could be (related to) this bug: https://github.com/Icinga/icinga2/issues/4133


    What action/info would you need on my end to help fix this?


    Kind regards,


    Lukas


  • Try upgrading to v2.7 and monitor your host system, especially how the PostgreSQL database backend performs. If the IDO feature cannot keep up with processing the monitoring data and dumping it into the database, that is a normal error message which you should investigate.

  • Thank you for your response.


    Since your answer yesterday I've upgraded to the 2.7.2-1.jessie packages and added monitoring for the Postgres database. The only two databases in this Postgres instance are those for Icinga 2 and Icinga Web 2.


    What I see when CPU usage jumps to 100%:

    - the IdoPgsqlConnection query rate jumps from ~27/s to ~190,000/s over the course of one minute and remains that high (see the attached query-rate.txt for this transition)

    - from the perspective of the Postgres database, transactions per second increase by ~10%, the number of inserts, updates, hot updates and deletes per second remains at the same level, and the general load on the database increases by ~10%; nothing worrying at all, the Postgres process still shows low CPU usage and low I/O wait


    What I see when I reload Icinga 2:

    - there is a short spike in the load on the database when the IdoPgsqlConnection worker quickly finishes its entire queue

    - after the reload everything is back to normal


    Also attached: strace-output.txt of the thread at 100% CPU while the icinga2 process is under high CPU load.


    Because this happens regularly on our setup, we're in a good position to investigate. What information would you like me to gather (and how) to help pinpoint the origin of this issue?


    Thanks and cheers!

  • Can you add some numbers here, e.g. the output of "icinga2 daemon -C" (I'm interested in object counts) and the frequency of checks and other intervals?


    The query rate numbers sound wrong; 200k/s is huge compared to the number of items which remain in the queue at each ten-second measurement point.



    That sounds like some query cannot be executed at that point because it has a dependency on a related object or insert id (e.g. a statehistory entry should be written but the object does not exist yet, or notifications are sent and contactnotifications should be updated but the notification_id is not yet set).


    Two ideas:


    - Enable the debug log, and look for specific IDO queries which are not immediately executed, but re-inserted into the workqueue

    - Please also show the configuration of the ido-pgsql feature. Especially, I'd like to know whether "categories" is set to something other than the default.

  • Most check_intervals are at 120, many at 300 and some higher; we're averaging 135 for active checks. The retry_intervals are mostly at 30, averaging 47. Notification intervals are at 300.


    Please see the attachments for the ido-pgsql config and the "icinga2 daemon -C" output (sensitive data removed from both).


    After enabling the debug log it took roughly two minutes before the high-CPU state was triggered. In the debug log I've not been able to find any rapidly repeating queries, or messages indicating that re-inserting into the queue might be going on.


    After stripping the timestamp at the start of each line, the most frequent identical messages I found were BEGIN and COMMIT queries, averaging about one per second. Sampling through the rest, the queries all seem to handle objects, and I haven't found a repeating pattern in them.


    There are ~100,000 lines in the debug.log (I've turned it back off since). The IdoPgsqlConnection worker is responsible for ~5,000 of those lines. Are there specific strings you'd suggest I look for in that file?


    Thanks!

  • Hm, 10,000 notification objects for 2,000 hosts/services and only 10 users. That sounds like a lot; how is that configuration built?


    Unfortunately the PgSQL driver does not log as much as the MySQL feature does. I'm mainly interested in the queries then, to get an idea of what's going on.

  • A sample of a notification scheme for one particular service:

    Object 'cap1!load!call-noc-service' of type 'Notification':

    Object 'cap1!load!callsms-l1-daytime-service' of type 'Notification':

    Object 'cap1!load!callsms-l1-night-service' of type 'Notification':

    Object 'cap1!load!callsms-l1-service' of type 'Notification':

    Object 'cap1!load!callsms-l2-daytime-service' of type 'Notification':

    Object 'cap1!load!callsms-l2-night-service' of type 'Notification':

    Object 'cap1!load!callsms-l2-service' of type 'Notification':

    Object 'cap1!load!mail-noc-service' of type 'Notification':


    This setup covers notification of our NOC and escalation to our two levels of on-call, with different settings for office hours, evening and night. Not all hosts/services have evening or night notifications enabled. Notifications are applied per host/service.


    I'm happy to share the query part of the debug log with you, but not with the rest of the internet. How would you like me to send you this info?

  • Ah ok, so lots of different notifications. That's fine, just curious about such high numbers in comparison.


    Do you have some sort of ownCloud/Nextcloud where you can share a URL in private? Depending on the size and content, your company policy might require you to obfuscate the content anyway. Keep in mind that this here is on a private basis; I don't sign NDAs or similar.


    Not sure if we can learn something from it. One thing you can do of course is grep the log and count the queries by table. Group them also into config, status and historical table queries (the DB IDO schema is documented here: https://www.icinga.com/docs/icinga1/latest/en/db_model.html); a rough sketch of such an analysis follows below.
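
    Something along these lines might do; the debug log path, the regex and the (incomplete) table grouping are guesses on my side, so adjust them to what your log actually contains.

    Code
        import re
        from collections import Counter

        # Pull the target table out of INSERT/UPDATE/DELETE statements.
        TABLE_RE = re.compile(r"(?:INSERT INTO|UPDATE|DELETE FROM)\s+(\w+)", re.IGNORECASE)

        # Rough, incomplete grouping based on the DB IDO schema docs.
        HISTORICAL = {"icinga_statehistory", "icinga_notifications",
                      "icinga_contactnotifications", "icinga_logentries"}
        STATUS = {"icinga_hoststatus", "icinga_servicestatus", "icinga_programstatus"}

        per_table = Counter()
        per_group = Counter()

        with open("/var/log/icinga2/debug.log") as f:  # assumed debug log path
            for line in f:
                if "IdoPgsqlConnection" not in line:   # only look at the IDO worker
                    continue
                m = TABLE_RE.search(line)
                if not m:
                    continue
                table = m.group(1).lower()
                per_table[table] += 1
                if table in HISTORICAL:
                    per_group["historical"] += 1
                elif table in STATUS:
                    per_group["status"] += 1
                else:
                    per_group["config/other"] += 1

        for table, count in per_table.most_common():
            print("%8d  %s" % (count, table))
        print(dict(per_group))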


    It might be a general performance issue with the IDO feature in combination with PostgreSQL, or some queries which never get executed. Could be a bug too.


    ido-pgsql looks fine btw. I was worried about missing categories, which could cause a scenario where historical queries are fired but no config objects are actually dumped.


    Just to be sure - is PostgreSQL running on the same host, or is there some network latency involved here? Which PostgreSQL version is running?


    Also interesting - which hardware is used for that server, and are there any signs of performance trouble, especially with disk I/O?

    https://www.icinga.com/docs/ic…#analyze-your-environment

  • Thanks again for your in-depth response!


    I've written a quick script to analyse the debug log (see attachments for both the script and the output).


    Our Postgres BDR is running on the same host (both on the Icinga 2 master and on the Icinga 2 client). We're using Postgres version 9.4.14-1.jessie+1 and BDR plugin version 1.0.3-1.jessie+1.


    Master and client both have an 8-core Xeon and 16 GB RAM. We have many performance graphs, and other than the high CPU (user), the accompanying load (+1) and the rise in temperature in the high-CPU state, there's nothing out of the ordinary. Looking specifically at I/O wait, it rarely rises above 3% for longer than a minute, and when it does, it's because of a file system backup or because we're generating 100,000 lines of debug log in under 5 minutes :-).


    Please let me know if you have suggestions for additions to the analysis.

  • I honestly have no idea what BDR is; I had to google it. Could that replication influence the overall performance, or affect table sequence counts/increments?


    8 cores and 16 GB RAM is fine. I was worried about the usual 2-core, 4 GB RAM VMs some people throw into the wild.


    I suspect a problem with notifications -> insert_id (which is the SELECT sequence value in your analysis) -> contactnotifications.


    If a notification for a specific user happens, first an insert into icinga_notifications takes place (or it should). icinga_contactnotifications splits that up by user, but needs a reference to the history table icinga_notifications (that's the "insert id", or the current sequence value in PgSQL). If there is none, the query is not executed but re-inserted into the query queue. I believe you've hit some sort of bug or problem, so that the contactnotification queries are never executed.
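
    Roughly, the dependency handling described above behaves like this toy sketch (just an illustration of the mechanism, not Icinga 2's actual code):

    Code
        import collections

        queue = collections.deque()
        notification_ids = {}   # notification object -> insert id (sequence value)
        next_seqval = 0

        def insert_notification(query):
            """Stand-in for INSERT INTO icinga_notifications plus fetching the sequence value."""
            global next_seqval
            next_seqval += 1
            return next_seqval

        def process(query):
            if query["table"] == "icinga_notifications":
                notification_ids[query["object"]] = insert_notification(query)
            elif query["table"] == "icinga_contactnotifications":
                parent_id = notification_ids.get(query["object"])
                if parent_id is None:
                    # Dependency not satisfied yet: re-insert into the work queue.
                    # If the parent insert id never shows up, the worker keeps
                    # spinning on this query and the queue only grows -- which
                    # would match the 100% CPU symptom.
                    queue.append(query)
                    return
                print("contactnotification written with notification_id", parent_id)

        # Example: the contactnotification query arrives before its parent notification.
        queue.append({"table": "icinga_contactnotifications", "object": "cap1!load!call-noc-service"})
        queue.append({"table": "icinga_notifications", "object": "cap1!load!call-noc-service"})
        while queue:
            process(queue.popleft())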


    Since you've listed the SELECT seqval queries under "error" ... what does that mean exactly, and is there more info about their error state? I'd like to see that :)


    Can you extract all the queries which happen for anything that has "notification" in the table name? That shouldn't be too much, and any sensitive information can easily be replaced.

  • After reading the first line of your previous post I feel a little guilty that I wrote BDR in lower case in my first post. My apologies for that; you've been a great help so far and I should have made the BDR more obvious.


    Theoretically, BDR could introduce issues I'm currently unaware of. Eliminating BDR from the equation is quite a bit of work and also requires taking Icinga 2 offline altogether, which I'm not very keen on. If we need to, I'll consider it.


    Ah, yeah, I figured "error" might not be exactly the right word for what I meant there. It just means that the method I used to parse the log line found a query, but did not understand it well enough to put it into one of the categories. I've renamed it to "queryparsefailed" :-).


    Attached are the queries on any tables with "notification" in the name.

  • Hmmm, I don't see any contactnotifications table inserts here. There are two queries for notifications where contacts_notified is set to '1'; those should trigger an additional insert here.


    I know this might be bad, but ... can you disable notifications globally, e.g. with "icinga2 feature disable notification", and perform those reloads again, to see whether the queries and behaviour stay the same? Don't do that for long; notifications are needed.


    It might be worthwhile to reproduce this on a test system (without client checks, but with some dummy/random checks for hosts and services, and dummy users, just to trigger the notifications). Then you can play around more if the issue is reproducible there as well.


    To me, it sounds like the "notification_id" is somehow not fetched, or zero, and that could explain the behaviour.


    Different thought - disable the notification category from the IDO feature.


    https://www.icinga.com/docs/ic…types/#idopgsqlconnection

    Code
    categories = [ DbCatConfig, DbCatState, DbCatAcknowledgement, DbCatComment, DbCatDowntime, DbCatFlapping, DbCatProgramStatus, DbCatRetention, DbCatStateHistory ]

    This leaves out DbCatNotification.

  • During office hours I don't mind turning off notifications; we still have Icinga Web 2 that we can watch. But it's not something we can do all day.


    With notifications turned off via "icinga2 feature disable notification" there are indeed no queries to any table with "notification" in the name. Also, the high-CPU issue doesn't show itself within the first 10 minutes. The attachment query-analysis-output-notification-feature-disabled.txt shows the analysis of this run.

    Different thought - disable the notification category from the IDO feature.

    Do you mean to do this instead of turning off the notification feature altogether?


    And is my assumption correct that notifications are then still sent to the users, but are not logged to the database and so don't end up in the recorded history?

  • Both methods work, and they can even be combined. If only the IDO feature stops writing notification history to the backend, that's the less invasive change.


    I have the feeling that there's a bug inside the PgSQL implementation with dumping this kind of notification history. I cannot investigate further right now, as I also have to prepare the v2.8 release and OSMC slides & demos. Can you please collect your findings here and then put them into a GitHub issue?

  • Thanks for the clarification.


    I'll collect the info and create a Github issue for it.


    Good luck with preparations for v2.8 and OSMC!


    Thank you for your help!

  • One last question: as we haven't been able to reproduce this behavior reliably yet, does it make sense to create the GitHub issue now? Or would you rather I hold off on that until we've put in the time to create a lab setup? (To which I must confess, I'm not sure when that will be.)

  • Even if you can't reliably reproduce the issue yet, it will help. You will be asked for feedback in that issue by different developers, and it makes little to no sense to have to say "cannot reproduce it right now".


    Since you have sort of a workaround in place, waiting shouldn't hurt for now. But this should be tackled at some point, or other members may join here and add their findings too. I haven't seen that for a while now; those issues are >1 year old.


    Keeping track and refining the analysis surely helps.