DNS plugins don't return lookup time

plugins

#1

This was just brought to my attention today, and applies to both check_dig and check_dns. The plugins’ output and perf data reflect the execution time of the plugin, rather than the time it took to get a response from the DNS server. Is this something that can be changed with --extra-opts?

Here’s what I’m seeing:

$ time /usr/lib/nagios/plugins/check_dig -l ....  -t 10
DNS OK - 0.015 seconds response time (...)|time=0.014952s;;;0.000000

real    0m0.018s

According to the plugin, it took about 15ms to receive a response from the server. Technically that’s true, if you count the time to actually run the plugin. Compare with:

$ time dig  ...
; <<>> DiG 9.10.3-P4-Debian <<>> ...
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36655
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;... IN A

;; ANSWER SECTION:
... 0 IN A      ...

;; Query time: 0 msec


real    0m0.015s

If I just run dig, it still takes about 15ms overall, but I can see that the time to get a response from the server was less than 1ms. It works out the same with check_dns and nslookup.


(Markus Frosch) #2

If you keep the cache of a local resolver in mind, the results are pretty similar.

I’m not sure exactly how the plugin measures time, but there should be no real overhead from running the code.

$ time /usr/lib/nagios/plugins/check_dig -l mail.lazyfrosch.de -H 8.8.8.8
DNS OK - 0,052 seconds response time (mail.lazyfrosch.de. 3599 IN A 94.130.26.191)|time=0,052351s;;;0,000000

real	0m0,054s

$ dig mail.lazyfrosch.de @8.8.8.8

; <<>> DiG 9.11.5-1-Debian <<>> mail.lazyfrosch.de @8.8.8.8
...
mail.lazyfrosch.de.	3599	IN	A	94.130.26.191

;; Query time: 54 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)

#3

It’s interesting that your two tests returned very similar numbers. I tried again and still got fairly different results. Here, too, dig reports a query time noticeably shorter than the response time returned by check_dig:

$ time ./check_dig -l monitoring-portal.org -H 8.8.8.8 -v
/usr/bin/dig   -p 53 @8.8.8.8 monitoring-portal.org A +retry=2 +time=7
Looking for: 'monitoring-portal.org'
;; ANSWER SECTION:
monitoring-portal.org.  599     IN      A       185.11.254.41
DNS OK - 0.150 seconds response time (monitoring-portal.org. 599 IN A 185.11.254.41)|time=0.150219s;;;0.000000
                       
real    0m0.155s 

$ time dig monitoring-portal.org @8.8.8.8

; <<>> DiG 9.10.3-P4-Debian <<>> monitoring-portal.org @8.8.8.8
;; global options: +cmd                    
;; Got answer:                         
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1990
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;monitoring-portal.org.         IN      A

;; ANSWER SECTION:
monitoring-portal.org.  599     IN      A       185.11.254.41

;; Query time: 129 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
...


real    0m0.152s

The reason this has become a significant issue in my environment is that we’re testing against an Anycast address, where the Icinga2 satellite and the Anycast node are in the same datacenter. Any response time longer than about 2ms would indicate that the local Anycast node isn’t working and some other node is answering. Since the plugin itself takes about 15ms to execute and counts that toward the response time, it throws off the warn/crit thresholds considerably.

(Actually, I would like to just use that 15ms as the baseline for the warn/crit, but it seems our spec for this project demands a higher degree of accuracy.)


(Michael Friedrich) #4

Anycast depends on routing, and there is no guarantee whether you hit the first, second, etc. real DNS instance. I would first check whether there is a genuine problem with your network (routes) that occasionally causes such latency in answers. Also try analyzing the path and the layers below.

The check plugin adds some parameters to the dig call, such as retries and timeouts. These may influence the overall call as well.


#5

True, and that’s actually why the warn/crit thresholds are so low. Within the same datacenter, as long as the Anycast route is correct and the local node is healthy, ICMP response times are generally under 1ms. If the local Anycast node drops and routing fails over to the next nearest node 8ms away, the ping service will alert to indicate that a change has occurred. All the nodes being checked for DNS are also checked with ICMP, so we have visibility into whether it’s network latency (e.g. due to routing to a non-local Anycast endpoint) or something else.

(Although… it would be worth double-checking that ICMP and UDP take the same path. I can’t think of any reason they wouldn’t inside our network, but…)

As for the check_dns and check_dig plugins, it’s looking like there isn’t a straightforward way to make the output reflect the query time. That pretty much leaves hacking the plugins, or writing new ones that parse the output from dig?


(Michael Friedrich) #6

Plugins that execute a binary and parse its output serve 95% of users. If I needed deeper insight and analysis, I would write my own native binary doing the DNS magic at the socket level. I haven’t done that myself, though.

http://www.yonch.com/uncategorized/dns-resolver-libraries lists some of them; https://github.com/c-ares/c-ares looks promising but complicated. For Python, I would look into https://github.com/rthalley/dnspython, for example.
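As a rough illustration of what “DNS magic on socket level” means, here is a minimal query timer in pure Python using only the standard library (no c-ares or dnspython). The names `build_query` and `query_time` are made up for this sketch; it is not an existing plugin:

```python
# Sketch: timing a DNS lookup at the socket level, stdlib only.
# The clock starts at send and stops at receive, so plugin startup
# overhead never enters the measurement.
import socket
import struct
import time

def build_query(name, qid=0x1234):
    """Build a minimal DNS query packet for an A record (RD flag set)."""
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def query_time(server, name, timeout=5.0):
    """Return (seconds, raw_response) for one UDP query round trip."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        start = time.perf_counter()   # start the clock at send, not at exec
        s.sendto(packet, (server, 53))
        data, _ = s.recvfrom(512)
        elapsed = time.perf_counter() - start
    return elapsed, data
```

Something like `query_time("8.8.8.8", "example.com")` would then report only the wire round trip; response parsing and output formatting happen after the clock has stopped.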


#7

With regard to the Response Time metric, that’s not what this plugin does. I am not a C programmer (yet), but looking at the relevant sections of the code, the plugin checks the time of day when it starts its main loop, checks it again just before printing to STDOUT, and returns the delta between the two. This is a reliable, if imprecise, method.
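The measurement described above amounts to wall-clock timing of the whole run. A toy sketch of why that inflates the reported response time (hypothetical names, Python rather than the plugin’s C):

```python
# Sketch: wall-clock timing around a whole call, the way check_dig
# times its run. Interpreter/startup overhead leaks into the number.
import time

def timed(func, *args):
    """Time a callable end to end with a monotonic wall clock."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result

def fake_dns_lookup():
    """Stand-in for the network round trip (~1 ms here)."""
    time.sleep(0.001)
    return "A 185.11.254.41"

elapsed, answer = timed(fake_dns_lookup)
# elapsed is at least 0.001 s, plus whatever overhead the runtime adds --
# the analogue of check_dig reporting 15 ms for a sub-millisecond query.
```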

For 95% of use cases where DNS is being checked, this seems sufficient. It covered all of my own use cases until we started adding Anycast into the mix. Some of the Anycast nodes we monitor are very close together, less than 5ms apart in some cases, so we need more precision to reliably detect when one of them fails over to another.

To that end I wrote a new plugin in Python that parses the response time from the stats section of dig’s output. It should cover our needs for now, but I think I’ll use this opportunity to finally learn some C and submit a pull request for check_dig to change the response-time logic. Might be a while though :slight_smile:
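The core of such a plugin can be sketched in a few lines. This is an illustrative reconstruction, not the author’s actual code; `parse_query_time` and `check_dig_time` are invented names, and the warn/crit handling follows the usual Nagios exit-code convention:

```python
# Sketch: grade the server-reported ";; Query time:" from dig's stats
# section instead of the plugin's own wall clock.
import re
import subprocess

QUERY_TIME_RE = re.compile(r";; Query time: (\d+) msec")

def parse_query_time(dig_output):
    """Extract the server-reported query time in ms, or None if absent."""
    match = QUERY_TIME_RE.search(dig_output)
    return int(match.group(1)) if match else None

def check_dig_time(name, server, warn_ms, crit_ms):
    """Run dig and return a (exit_code, message) pair, Nagios-style."""
    out = subprocess.run(
        ["dig", name, "@" + server, "+stats"],
        capture_output=True, text=True, timeout=10,
    ).stdout
    ms = parse_query_time(out)
    if ms is None:
        return 3, "UNKNOWN - no query time found in dig output"
    if ms >= crit_ms:
        return 2, f"CRITICAL - {ms} ms|time={ms}ms;{warn_ms};{crit_ms}"
    if ms >= warn_ms:
        return 1, f"WARNING - {ms} ms|time={ms}ms;{warn_ms};{crit_ms}"
    return 0, f"OK - {ms} ms|time={ms}ms;{warn_ms};{crit_ms}"
```

Because the number comes from dig’s own stats line, plugin startup and output parsing no longer count toward the measured response time, so thresholds of a few milliseconds become meaningful.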


(Michael Friedrich) #8

On point analysis, very nice :+1: