Host CPU Usage Check_MK RAW 1.5.0p6


(Chris Gregson) #1

Hi guys – I’m running Check_MK RAW 1.5.0p6 on a newly built VM running Ubuntu 18.04 with all updates applied. I’m currently monitoring 40 hosts and 1,700 services. I’m seeing CPU usage in ESXi 6.5 for the VM at around 2.5 GHz with some spikes. The VM has 2 vCPUs and 2 GB RAM. Memory usage is around 500 MB.

I’m just wondering if this is normal CPU usage for Check_MK or am I seeing some issues here? - I should mention that I am not using Hardware/Software inventory. I have set Check_MK Discovery to once every 12 hours.

If I check htop I see /omd/sites/Apollo/bin/python as the top CPU usage.

If I stop the site, CPU usage for the VM drops to 200 MHz in the vSphere Client.

Many thanks


#2

Sounds unusual. Do you have many active checks like DNS, HTTP, SSH, SQL Queries, etc.?


#3

I agree with jat. Our similar installation with 843 services is consuming 200 MHz, and about half that at night.


(Chris Gregson) #4

Thanks for the replies guys - I only have 6 active HTTP checks currently. These are just to check SSL certificate age. Other than that, I have a mix of Linux and Windows VMs along with various SNMP checks. Nothing fancy.

I’m checking Windows Service status - if that has an impact on host CPU?

Finally I’m doing a datasource check for an ESXi host.

Checking htop, this is the top process and the one consuming so much CPU:

17670 apollo 20 0 123M 27244 8356 S 33.3 1.3 0:00.66 /omd/sites/apollo/bin/python /omd/sites/apollo/var/check_mk/precompiled


(Chris Gregson) #5

I’ve tried disabling the HTTP active checks and the CPU usage on the host is the same. I’ve also removed any plugins and that has made no difference either.

Any other ideas guys? This seems weird considering this is a freshly built VM with the latest version of Check_MK RAW.


#6

So what is actually running on that VM now? Try to identify the processes consuming CPU. I doubt that the Python binary alone consumes a considerable share, so one or more plugins, checks, or scripts are probably responsible.
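One way to follow this suggestion is to total CPU usage per command line from a `ps` snapshot, so that many short-lived processes with the same command are counted together. The following is only a minimal sketch: the function names `cpu_by_command` and `snapshot_ps` are made up for illustration, and it assumes a Linux system where `ps -eo pcpu,args` is available. Note that `ps` reports CPU averaged over each process's lifetime, and very short-lived check processes can be missed by a single snapshot.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List, Tuple


def cpu_by_command(ps_output: str) -> List[Tuple[str, float]]:
    """Sum the %CPU column per command line, highest total first.

    Expects text in the shape produced by `ps -eo pcpu,args`:
    a header row, then one row per process ("<pcpu> <command line>").
    """
    totals: Dict[str, float] = defaultdict(float)
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed row
        pcpu, args = parts
        try:
            totals[args.strip()] += float(pcpu)
        except ValueError:
            continue  # first column was not a number
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def snapshot_ps() -> str:
    """Take one process snapshot (assumes procps `ps` on Linux)."""
    result = subprocess.run(
        ["ps", "-eo", "pcpu,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `cpu_by_command(snapshot_ps())[:10]` a few times in a row would show whether the load comes from one command or is spread across many per-host check processes.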


(Chris Gregson) #7

If I stop the site, my CPU usage drops to 90 MHz. So it’s definitely related to Check_MK.

If I do a ps auwwwx I can see the command with the highest usage is 0:00 /omd/sites/apollo/bin/python /omd/sites/apollo/var/check_mk/precompiled/narkissos.apollo.local

I see the same command for each of my hosts - then the usage drops and then spikes back up to 40% usage again - this happens every few seconds.

CPU utilization:-

OK - user: 20.6%, system: 13.5%, wait: 0.0%, steal: 0.0%, guest: 0.0%, total: 34.2% - This spikes to 40% total repeatedly.


#8

Is the system kind of unresponsive (during these spikes) or is it just a “visual” thing?


(Philipp Näther) #9

Not so unusual for me. As you already figured out, some checks consume more CPU than others. SNMP can be a problem sometimes, the hw/sw inventory is a huge factor, and some other checks are too. We have 110 hosts with 2600 services (far fewer direct SNMP checks than TCP checks) on a physical machine with a cripplingly old Xeon E5410 with 4 cores @ 2.33 GHz. Here is my utilization over one year:

[Image: CPU utilization graph over one year]

As you can see, the system is running at well over 20% total, with some heavy peaks as well. As long as your system doesn’t become unresponsive, as Wolfgang mentioned, I wouldn’t bother. Just give it some more cores, that’s it.


(Chris Gregson) #10

I’d say it’s mostly a ‘visual thing’. The Web UI can seem a little unresponsive, especially if I’m making edits to a large number of services. But I’d mostly put that down to I/O, since the VM sits on a shared NFS datastore.


(Chris Gregson) #11

Your CPU utilisation is pretty damn close to what I’m seeing on average (between 35-40% with spikes) both in vSphere and from the RRD graphs.

I’ve given the VM a couple more cores and resigned myself to this just being a given.

I suppose the first couple of responses to this post threw me off a bit. I have no idea how they are seeing so much lower utilisation given the number of hosts/services and/or plugins present.

P.S - I’m not using hardware/software inventory. I was initially, but found the spikes a little worse.


#12

Well, so far everyone has only mentioned the number of objects, but the check interval is a relevant factor, too. Checking the objects every minute creates a different load than checking them every five minutes. Interface traffic and load are more important than disk space, so it might be an option to lengthen the check interval for the less important objects.
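The effect of the check interval can be put into rough numbers. This is only a back-of-the-envelope sketch: the function name `checks_per_minute` is made up, the split of services between intervals is a hypothetical example, and it treats every service as an independent unit of work, which is a simplification of how Check_MK actually schedules its checks.

```python
def checks_per_minute(services_by_interval):
    """Estimate check executions per minute.

    services_by_interval maps a check interval in minutes to the number
    of services checked at that interval, e.g. {1: 700, 5: 1000}.
    """
    total = 0.0
    for interval_minutes, service_count in services_by_interval.items():
        if interval_minutes <= 0:
            raise ValueError("check interval must be positive")
        total += service_count / interval_minutes
    return total


# All 1,700 services from the first post checked every minute:
every_minute = checks_per_minute({1: 1700})

# Hypothetical split: 1,000 less urgent services moved to five minutes.
mixed = checks_per_minute({1: 700, 5: 1000})
```

Under these assumptions the second configuration runs 900 checks per minute instead of 1,700, which is why the number of objects alone does not tell you how much load to expect.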


#13

I have a couple of VMs with Check_MK monitoring my IT infrastructure: one with 1.2.8 and the other with 1.5.0p7 (upgraded from 1.4.0p37). Performance in the new version is pretty bad: the load average on the old one was between 1 and 2, but on the new one, with 3 cores, it moves between 10 and 20.

I hope the developers can solve this issue in future versions.
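Load averages are easier to compare between machines when divided by the number of cores. The sketch below uses a made-up helper name, `load_per_core`. The core count of the older VM is not stated in the post, so only the newer VM's figures can be normalised here.

```python
import os


def load_per_core(load_average, cores):
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures from the post above: load between 10 and 20 on 3 cores.
low = load_per_core(10, 3)
high = load_per_core(20, 3)
```

On a live Linux system the same helper could be fed with `os.getloadavg()[0]` and `os.cpu_count()`. A sustained value well above 1 per core generally means processes are queuing for CPU (or, on Linux, waiting on I/O).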