Posts by ibrewster

This forum was archived to /woltlab and is now in read-only mode.

    Yep, just tested the enable_ha=false setting, and it didn't appear to have any effect - but, then, I didn't test sending any reminder notifications either.

    That said, while the observed behavior arguably could be a bug, as I think about it more it's also arguable that having notifications disabled on one of the masters is kind of counterintuitive, somewhat defeating the purpose of having multiple masters in the first place. So for now, I've just re-written my notifications to be host agnostic, which works well enough.

    I agree. Just snapped that one while skimming through...

    Back to the problem:

    What about setting enable_ha = false ?

    Ah, thanks. I'm not sure about the enable_ha option. According to the documentation (…es/#notificationcomponent)


    Disabling this currently only affects reminder notifications

    Which would make me think it won't help with the initial notifications. Worth a shot just to see what happens, though. Especially since the documentation also says there are "no configurable options", and then proceeds to list this configurable option... :P
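For reference, the setting lives in the notification feature's config file. A minimal sketch, assuming the stock file layout (adjust the path if your install differs):

```
// /etc/icinga2/features-available/notification.conf
object NotificationComponent "notification" {
  enable_ha = false
}
```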

    Sorry, but no - while that may not be ideal syntax (which I'll go ahead and fix), that's not the problem. As I stated in my last post, if I restart the secondary server, or if I enable the notification feature on the secondary server, then ALL notifications come through. Besides, I actually had it as a single line to begin with, and only went to two lines as an experiment when that didn't work. I had actually simply forgotten I still had it as two lines.

    The problem is that the secondary server is attempting to send notifications, even when the notification feature is disabled on the secondary server. The problem is compounded by my use case, which requires that some notifications run on a specific machine (a physical SMS modem is attached via USB to one server, and I'm trying to use notifications as EventHandlers, since EventHandlers are apparently broken in a cluster) - otherwise, the problem could be solved by simply enabling the notification feature on the secondary and letting it send notifications.

    As a stopgap measure, I've re-written the SMS notify script to ssh into the proper machine before attempting to send a notification. This is inefficient, as even if I am on the machine with the modem, it will still open an SSH connection to itself, but it works. I guess I could add more smarts to the script, and pass it the NodeName as well (or check the hostname), so it only does the ssh if needed, but that's just an added layer of complexity to the script. I still need to re-write the service restart script to do this as well, unless the underlying issue has a fix.
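For what it's worth, the "more smarts" version is only a few lines. A rough sketch - the hostname "monitor" and the script path are made-up placeholders, not anything from my actual config:

```shell
#!/bin/sh
# Hypothetical SMS wrapper: run the real notify script locally when we are
# already on the modem host, and only fall back to ssh otherwise.
# MODEM_HOST and SMS_SCRIPT are assumptions -- substitute your own.
MODEM_HOST="monitor"
SMS_SCRIPT="/usr/local/bin/sms_notify.sh"

# Build the command line to run, given the host we are currently on.
build_cmd() {
    current_host="$1"; shift
    if [ "$current_host" = "$MODEM_HOST" ]; then
        echo "$SMS_SCRIPT $*"
    else
        echo "ssh $MODEM_HOST $SMS_SCRIPT $*"
    fi
}

# Real use would be something like:
#   eval "$(build_cmd "$(hostname -s)" "$@")"
```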

    Ideally, I'd be able to use something like the command_endpoint argument on a notification_command, since I really do want the notification feature enabled on the secondary master - otherwise, should the primary fail, I'd have no notifications, which isn't good. Since such an option apparently doesn't exist, however, I guess I just have to make all my host specific scripts ssh. Oh well.

    Further testing results:

    - If I restart the secondary server after missing a notification, the primary server will send the missing notification. Apparently the secondary server knows about the notifications needing to be sent, but it's not until the secondary is restarted that the primary sees this and tries to send them?

    - If I enable the notification feature on the secondary server, it will then actively try to send the formerly "missing" notifications. This is only a partial solution, however, as one notification type is SMS, and the SMS modem is only attached to the primary machine, not the secondary.

    From the documentation I've read, I was under the impression that if the notification feature was disabled on the secondary host, then all notifications would be sent from the primary, but this is apparently not the case.

    Is there a way to restrict certain notifications to only run on a specific host? Otherwise, I'll need to write a wrapper for said notifications that ssh's into the proper host and runs the proper notification command, which could be complicated given the parameters I need to pass through.

    Further testing shows that this issue, like the EventHandler issue I'm having, is related to clustering. That is, if I shut down the secondary master server, then all alerts are generated correctly for all services. It's almost as though icinga is trying to load-balance the alerts, except that the "secondary" master shows no sign of trying to send the missing alerts.

    Given that I'm having issues with both alerts and EventHandlers when clustered, I have to wonder if maybe there's a problem in my cluster setup?

    It seems that my install of icinga2 is picking and choosing which configured notifications it wants to send for any given service. I have two services, configured identically (process checks, just looking for different process names):

    Then I have three different notifications that apply to both services: e-mail, SMS, and a restart (since EventHandlers don't work):

    Given the above, whenever a problem/recovery occurs, I would expect to get three types of "notifications": e-mail, SMS, and then the restart script run. However, that is not what I am observing, nor what icinga2 is logging. According to both the log and observation, MRTG-Main only sends SMS notifications, while MRTG-Outstation sends everything BUT SMS notifications. It's like the two services are splitting the notification types between themselves. Do I have a typo in my config somewhere that could be causing this?

    Well, I've been able to come up with an ugly workaround: use notifications rather than event handlers. Of course, this is a bastardized use of notifications, and comes with a number of caveats, such as not being able to restart the service before it goes into a hard state, and not being able to turn on/off notifications separately from event handlers, but at least it works. It would be really nice to get this fixed, however.

    In further testing, it would also appear that the EventCommand is running on *both* master icinga servers in the cluster, not just "monitor" as I specified in the service configuration. Also, I had to reload icinga2 on the "monitor2" instance (the other host in the cluster) to get the script to stop - just reloading on "monitor" didn't accomplish anything. Nor did adding an explicit "exit 3" command in the script itself - apparently icinga just keeps calling the script, regardless of return value.

    EDIT: Apparently reloading on JUST the secondary machine (the one where the service check and command are NOT supposed to be running at all) is sufficient to get the script to stop firing. I don't have to reload on the primary machine at all. So the sequence of events is now:

    • Trigger service failure on "monitor" for service with "command_endpoint" specified as "monitor", in a two-master cluster setup
    • icinga2 triggers event_command on BOTH "monitor" and "monitor2"
    • script fires repeatedly, many times per second, until "monitor2" (not "monitor", where the failure occurred and where the service check should be running) is issued a reload command.

    I haven't yet tested to see what would happen if the service went both down and up without a reload, i.e. whether the script would still be firing for both the old AND new states, or would just switch over to only firing for the new state - in practical usage, icinga would have crashed before the service state changed again. That said, I could see such a test being informative.

    I have a service, defined as follows:

    So this is monitoring the MRTG service running on one of my hosts, which happens to be one of the two clustered master servers. As such, whenever the status of the service changes, it should run the EventCommand restartMRTG, which is defined as follows (yes, I realize I could generalize that, and probably will at some point):

    object EventCommand "restartMRTG" {
        import "plugin-event-command"
        command = [ PluginDir + "/", "$rc_script$", "$service.state$", "$service.check_attempt$" ]
    }

    The "" script, in turn, is (at the moment) a simple test script I wrote to figure out what was going on that just dumps the arguments to a file:

    #!/usr/local/bin/bash
    SERVICE=$1
    SERVICE_STATE=$2
    CHECK_ATTEMPT=$3
    echo "`date`:`hostname` - $SERVICE $SERVICE_STATE $CHECK_ATTEMPT" >> /tmp/restartMRTG.log

    So fairly straightforward. The problem is that, rather than being called once per check, the restart script is called many, many times per second once it is triggered, repeating until I reload the icinga2 service. Note that this only started happening once I set up icinga2 in a clustered situation. This same configuration, without the `command_endpoint` parameter, and with a real restart script, was running just fine under the non-clustered scenario. It may also be noteworthy that the host for the check is actually one of the icinga2 hosts, and not some third-party host.

    In any case, once the script triggered, it output the following:

    Which, as I said, continued until I issued a reload command to the icinga2 process. Before I put the test script in place, it would repeatedly attempt to restart MRTG until system resources were exhausted and icinga2 crashed (on a critical state, of course, this example just happened to be switching back to OK).

    How can I fix this so the handler only runs once (per check, if something changed)?

    Right, but too early for him.

    Let him reach some confidence first by running a single machine.

    ibrewster: Great that you found it on your own!

    Thanks :) While I can't disagree with your sentiments (I am finding icinga2 to be *quite* confusing over my old icinga1/nagios installations), I was able to get the cluster set up and running. The hardest part was figuring out that I needed to put the configs in a subfolder of zones.d rather than in conf.d where I had been putting them all. So that really does resolve the issue ideally, while also getting some use out of the second machine.
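For anyone else hitting the same wall: the cluster config sync only picks up files placed under a subdirectory of zones.d named after a zone. A sketch of the layout that worked, with "master" standing in for whatever your zone is actually called:

```
/etc/icinga2/
├── conf.d/          # local-only, NOT synced to the cluster
└── zones.d/
    └── master/      # subdirectory named after the zone
        ├── hosts.conf
        └── services.conf
```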

    I have two icinga2 nodes set up in a multi-master cluster, which of course means that icinga2 load balances the checks between the two nodes. This is fine for most of my checks, but I have a couple that check local processes which only run on one of the hosts. Of course, the check_local_procs plugin doesn't pay any attention to what host it is on, it just checks the local process list. So if that particular check gets load-balanced to the other host, the check will fail.

    Is there some way I can limit the check to only run on a specific host? Or do I need to re-work the check to be a ssh check or the like?
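One documented approach is the command_endpoint attribute on the service, which pins execution to a specific endpoint rather than letting the cluster load-balance it. A sketch, with the service/host/endpoint names and the procs parameter all made up for illustration:

```
apply Service "mrtg-procs" {
  check_command = "procs"
  vars.procs_command = "mrtg"
  // Force the check to execute on this endpoint, regardless of
  // cluster load balancing:
  command_endpoint = "monitor"
  assign where host.name == "monitor"
}
```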

    When I try this from the remote machine, I just get an "unauthorized" response. Of course, I can certainly ssh in and run it on localhost if I need to, but if I just need to enable access or something to be able to access it remotely, that sounds ideal.

    So, actually, I just tried this from the localhost as well, and got the same error. So apparently either I'm using the wrong username/password or I don't have things set up correctly or something.
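An "unauthorized" response on localhost usually means the API user in the curl command isn't actually defined. One would normally declare it in the API users config, roughly like this - the user name, password, and permission here are just an example matching the curl command above:

```
// /etc/icinga2/conf.d/api-users.conf
object ApiUser "root" {
  password = "password"
  permissions = [ "status/query" ]
}
```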

    If you have an icinga2 cluster you can enable these checks; just change the assign rules to fit your environment.

    I may actually want/need to look at setting up a cluster, but at the moment I don't have one. The goal is to set up a second machine to "monitor" my monitor box, so should icinga itself die I can get notified of that. So I don't really need another full icinga instance, just some little script I can run periodically to keep an eye on things.

    To get the status that icingaweb2 displays, run at the commandline:

    curl -k -s -u root:password 'https://localhost:5665/v1/status' | python -m json.tool

    It should be easy to wrap a script around this that returns one of 0 1 2 3 to emulate the check you are after.
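A rough sketch of such a wrapper, assuming the same credentials and port as the curl command above; the exit-code mapping is just one reasonable choice, not anything official:

```shell
#!/bin/sh
# Probe the Icinga2 API and translate the HTTP result into a
# Nagios-style plugin exit code (0=OK, 2=CRITICAL, 3=UNKNOWN).

# Map the HTTP status code curl reports to a plugin exit code.
status_to_exit() {
    case "$1" in
        200) echo 0 ;;  # API answered: icinga2 is alive
        000) echo 2 ;;  # no response at all: icinga2 is (probably) down
        *)   echo 3 ;;  # unexpected (e.g. 401): can't tell
    esac
}

# Real use (credentials/port from the post above):
#   code=$(curl -k -s -o /dev/null -w '%{http_code}' \
#       -u root:password 'https://localhost:5665/v1/status')
#   exit "$(status_to_exit "$code")"
```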


    Under icinga1 I could use the check_nagios monitoring plugin from the command line (script, cron, etc) to check on the icinga process itself. With icinga2, I don't have a status.dat file, so I can't just use the plugin. What's the replacement? I'm pretty sure it should be possible, given that the icingaweb2 interface shows the status of the backend process. Thanks.

    Well, you may send fields of config objects as well as the…tification-runtime-macros to your notification scripts.

    That should give you lots of formatting possibilities.

    Indeed it does, and I'll have to look into that more to see if/what I want to do with it, but what I am going for here is getting the output of the check that failed - but only one line, not the full thing. In many cases, the one line is the entire output, so it works perfectly. In cases like the above, however, the output is multi-line, and contains characters that cause issues with the command line. At this point, since no one seems to have other ideas, I guess I'll just have to write a custom script that strips the output back to a clean single line.

    Now if only I could get this forum to actually notify me about replies...

    Is there a way to get a short output in a service notification command? At the moment, I am using $service.output$ in my notify by SMS, which, according to my logs, resulted in the following attempted SMS message:

    Which, as we can see, has two problems. First, that is *WAY* too much data to be trying to send via SMS. I'd still like more than just warning/critical if possible - for example, if a ping check fails, that output fits nicely - but not the full traceback I got here. Secondly, because that traceback contains various special characters, the send completely failed. Is there a different variable I can use to get just a short output (basically just the first line of the output), or is my best bet going to be to throw the whole thing into a script where I can truncate/escape/etc. the output?
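Lacking a built-in short-output macro, one fallback is to clean the output up in the notification script itself. A sketch - the set of characters to strip, and the 140-character cap, are guesses at what upsets an SMS pipeline, not anything from the docs:

```shell
#!/bin/sh
# Reduce a multi-line, special-character-laden check output to
# something safe to hand to an SMS modem.
shorten_output() {
    # keep only the first line, drop shell-hostile characters,
    # and cap the length at 140 characters
    printf '%s\n' "$1" | head -n 1 | tr -d '`"$\\' | cut -c -140
}

# e.g. shorten_output "$SERVICE_OUTPUT"
```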

    Not that it helps any, but I'm seeing the same behavior with NagVis 1.8.5 (the latest non-beta version). Interestingly, if I individually select one and choose "Refresh Status" manually, it will turn green (or whatever appropriate for the members of the group), but it only lasts until the next automatic update, when it goes back to "the object does not exist", which is patently false.