Perl scripts - random exit code 128 / Timeout exceeded

Hello,

I have an Icinga 2 server that ran on Ubuntu 16.04 without any problems. I upgraded to 18.04 a few weeks ago, and since then I have had a problem with Perl scripts, which often time out.

Icinga2 version: 2.11.1
icingaweb2: up to date
director: up to date

My problem: several times per minute, the Perl check_snmp plugins (check_snmp_storage / int / process …) time out. The timeouts seem random.

An example of a debug command:

'/usr/lib/nagios/plugins/check_snmp_load.pl' '-C' 'xxxxxxx' '-H' '1.2.3.4' '-T' 'netsl' '-c' '20,19,18' '-f' '-t' '60' '-w' '10,9,8') terminated with exit code 128, output: <Terminated by signal 9 (Killed).>

The observation: when I run this script manually, I get the result in less than a second. If I ‘spam’ this command, about once in 50 runs it hangs, and after 60 seconds I get my timeout.
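
To put a number on "about once in 50", the manual test can be scripted. This is only a minimal sketch: it assumes GNU coreutils `timeout` is available (it exits with status 124 when the limit is reached), and `count_hangs` is a hypothetical helper name, not part of Icinga or the plugins.

```shell
#!/bin/sh
# Run a command repeatedly and count how many runs exceed a time limit.
# usage: count_hangs RUNS LIMIT_SECONDS COMMAND [ARGS...]
# Assumption: timeout(1) is the GNU coreutils version, which exits
# with status 124 when the time limit is reached.
count_hangs() {
    runs=$1
    limit=$2
    shift 2
    hung=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        status=0
        # Discard plugin output; only the exit status matters here.
        timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
        if [ "$status" -eq 124 ]; then
            hung=$((hung + 1))
        fi
        i=$((i + 1))
    done
    # Print the number of runs that hit the limit.
    echo "$hung"
}
```

Calling it as `count_hangs 200 5` followed by the plugin path and its usual arguments would print how many of 200 runs took longer than 5 seconds, which makes it easier to compare users, hosts, or SNMP versions.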

It’s extremely random. It happens on different hosts.

I tried to disable some icinga2 features without success (API in particular, which I need though)
I tried to change snmp version from 2c to 1 without success
I tried to increase the interval between checks without success
I tried a lot of things…

It’s been 2 weeks and I’m out of ideas.

It seems to happen only with Perl scripts; all the other services are OK.

I have read and reread the logs of my upgrade from 16.04 to 18.04 to see what was modified or deleted, but I didn’t find anything convincing.

Any ideas?

Thanks

Hello
Did you try running the script in Perl debug mode? Or with strace?

The Perl version changes between 16.04 and 18.04 but is still the same major version (5.22 to 5.26), so that should be compatible.

When you run the script manually, which user do you run it as: root or Icinga?
Exit code 128 usually means that an invalid argument was passed to exit. Can you check that the way you pass and parse the command's arguments has not been altered in some way?
For example, you are missing a closing quote after the '-c' values.
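
A quick sanity check on the exit status may help here. By shell convention, a status above 128 means the process was killed by a signal (status minus 128), and the log line quoted above already pairs code 128 with "Terminated by signal 9 (Killed)", which suggests the plugin was killed when the timeout fired rather than exiting on its own. A minimal sketch of the convention, where `describe_status` is a hypothetical helper name used only for illustration:

```shell
#!/bin/sh
# Describe an exit status following the usual shell convention:
# values above 128 mean "terminated by signal (status - 128)".
describe_status() {
    if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
    else
        echo "exited with code $1"
    fi
}

# Demonstration: a child shell that sends SIGKILL (9) to itself.
status=0
sh -c 'kill -9 $$' || status=$?
describe_status "$status"
```

Running the plugin by hand and passing `$?` to the helper shows whether it exited on its own with an error code or was killed from outside.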

Regards

Thanks for your reply, Aflatto :slight_smile:

I run my script as user nagios. As root, the issue is exactly the same.

The missing " is just a copy/paste error on my part, from the Icinga 2 debug log file.

Perl is unknown territory for me, but I’ll try to run it in debug mode.

Hi,

The error indicates a timeout in the SNMP bulk request. You could try replacing lines 349 to 351 with:

my $resultat =  $session->get_table(Baseoid => $linload_table) or die $session->error();

to see what’s actually causing the error.

Thanks, Raphael, for your answer.

Can you please tell me which file you are talking about?

Thanks :slight_smile:

Well, the check script you are using. I am assuming http://nagios.manubulon.com/check_snmp_load.pl

It happened to me not long ago that checking whether the result hash is defined was not sufficient.

Thanks

My issue affects many Perl scripts: this one, but also check_snmp_process.pl, check_snmp_int.pl, etc.

But I’ll try your ‘patch’ and let you know :slight_smile:

Thanks again

Well, it’s not a patch, it’s error handling. I just ran into this issue when querying big tables. In my case, increasing the message_size of the bulk request was the fix.

But since you are using Icinga, it would probably be far better to let the hosts connect to your master instance and run the checks locally on the clients, rather than querying the data via SNMP.