Hello,
this is interesting. Please collect a couple of coredump files and upload them for inspection;
we need to examine what went wrong.
I will send you instructions for the file upload in a separate e-mail.
Thank you for your time!
Petr Špaček @ CZ.NIC
On 16. 12. 19 9:32, Milan Jeskynka Kazatel wrote:
Hello Tomas,
On Friday I upgraded to the latest version of Knot Resolver, 4.3, as suggested.
The log recorded a few unfortunate restarts, even though DNSSEC validation was
disabled and the bogus_log module was unloaded (disabled since 14.12.2019 20:30).
Packages installed on my server:
Knot Resolver, version 4.3.0
rpm -qa | grep knot
knot-libs-2.9.1-1.el7.x86_64
knot-resolver-4.3.0-1.el7.x86_64
knot-resolver-module-http-4.2.2-2.el7.x86_64
CentOS Linux release 7.7.1908 (Core)
Between 19:00 and 19:08, the VM backup runs.
Each service restart creates a new record in /var/cache/knot-resolver/tty, and the old
one still persists. (This is an unfortunate state of things in CentOS 7 right now. We have
a solution for it in an upcoming 5.0 release. Each instance will have
exactly one deterministic control socket.)
A log cut:
Dec 13 19:03:00 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 13 19:03:01 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=6/ABRT
Dec 13 19:03:01 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 13 19:03:01 dnsserver systemd[1]: kresd@1.service failed.
Dec 13 19:19:25 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 13 19:19:35 dnsserver systemd[1]: kresd@1.service stop-sigabrt timed out. Terminating.
Dec 13 19:19:45 dnsserver systemd[1]: kresd@1.service stop-sigterm timed out. Killing.
Dec 13 19:19:47 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=9/KILL
Dec 13 19:19:47 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 13 19:19:47 dnsserver systemd[1]: kresd@1.service failed.
Dec 14 19:01:23 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 14 19:01:24 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=6/ABRT
Dec 14 19:01:24 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 14 19:01:24 dnsserver systemd[1]: kresd@1.service failed.
Dec 14 19:02:19 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 14 19:02:23 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=6/ABRT
Dec 14 19:02:23 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 14 19:02:23 dnsserver systemd[1]: kresd@1.service failed.
Dec 15 19:03:58 dnsserver systemd[1]: kresd@1.service watchdog timeout (limit 10s)!
Dec 15 19:04:08 dnsserver systemd[1]: kresd@1.service stop-sigabrt timed out. Terminating.
Dec 15 19:04:19 dnsserver systemd[1]: kresd@1.service stop-sigterm timed out. Killing.
Dec 15 19:04:25 dnsserver systemd[1]: kresd@1.service: main process exited, code=killed, status=9/KILL
Dec 15 19:04:25 dnsserver systemd[1]: Unit kresd@1.service entered failed state.
Dec 15 19:04:25 dnsserver systemd[1]: kresd@1.service failed.
--
Smil Milan Jeskyňka Kazatel
---------- Original e-mail ----------
From: Tomas Krizek <tomas.krizek(a)nic.cz>
To: Knot Resolver Users List <knot-resolver-users(a)lists.nic.cz>, Milan Jeskynka
Kazatel <KazatelM(a)seznam.cz>, petr.spacek(a)nic.cz
Date: 12. 12. 2019 16:04:59
Subject: Re: [knot-resolver-users] debugging Knot Resolver crashes on CentOS7
Hi,
first - please try to use more descriptive e-mail subjects. It helps
others to find solutions to same/similar issues in the future.
On 12/12/2019 14.29, Milan Jeskynka Kazatel wrote:
> I'm still facing crashes of the kresd@1 service without any obvious reason.
> Today I made a second attempt to upgrade Knot Resolver to version 4.2.2, and the
> upgrade seems to be OK; the service starts without any difficulties.
The latest released version is 4.3.0. Before any further debugging,
please ensure you're using the latest version. EPEL repositories lag
behind the upstream releases, but there's usually an update waiting
shortly after our upstream release. You can install it using:
yum update knot-resolver --enablerepo epel-testing
Alternately, you can use our upstream package repositories to get the
updates right as they're released:
https://www.knot-resolver.cz/download/
> It runs as expected for more than 3.5 hours, but unfortunately it then starts
> writing the same messages to the log as reported in my previous post, and the
> service restarts by itself.
The auto-restart is a systemd feature we're using to recover from
crashes/failures. It's preferable to a dead service.
However, it'd be interesting to find out the cause of these crashes.
Could you look through the errors in the journal and post the output?
journalctl -u kresd@1 -p notice --since -2w
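To get a quick overview of that output, something like the following could count the watchdog timeouts per day. This is only a sketch: the function name count_watchdog_timeouts is made up for illustration, and it assumes the default short journal format, where each line starts with month, day and time.

```shell
# Count systemd watchdog timeouts per day in journal output read from stdin.
# Assumption: default short journal format, "Mon DD HH:MM:SS host unit[pid]: message".
count_watchdog_timeouts() {
    awk '/watchdog timeout/ { count[$1 " " $2]++ }
         END { for (day in count) print day, count[day] }' | sort
}

# Hypothetical usage:
# journalctl -u kresd@1 -p notice --since -2w | count_watchdog_timeouts
```

Timeouts that cluster at the same time every day would point at something scheduled on the machine rather than at a particular query.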
> Every restart causes a new service PID in /var/cache/knot-resolver/tty,
> the old one was not correctly finished
This is an unfortunate state of things in CentOS 7 right now. We have a
solution for it in an upcoming 5.0 release. Each instance will have
exactly one deterministic control socket.
> and the whole operating system visibly slows down.
I don't see how a knot-resolver crash under systemd would cause any
slowdown. Do you have any evidence of that? Are there any hanging kresd
processes in ps that weren't correctly terminated? What system resources
are they using?
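One way to look for such leftover processes, as a rough sketch: the helper name filter_kresd is made up for illustration, and it assumes ps is asked to print the command name (comm) as the last column.

```shell
# Keep the ps header line and every row whose last column is "kresd".
# Assumption: input comes from a ps call that lists comm as the last column.
filter_kresd() {
    awk 'NR == 1 || $NF == "kresd"'
}

# Hypothetical usage (PID, state, elapsed time, CPU %, resident memory in KiB):
# ps -eo pid,stat,etime,pcpu,rss,comm | filter_kresd
```

A kresd row with an elapsed time much longer than the uptime of the current kresd@1 instance would be a candidate for a process that was not terminated correctly.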
> I don't know how to create an exact service crash dump file, but I can
> provide any log messages if needed.
If the crashes keep happening after the upgrade to 4.3.0 and the journal
messages don't help with debugging, this is how I managed to turn on
coredump collection on CentOS 7:
1. install debug symbols
$ debuginfo-install knot knot-resolver luajit
2. create /etc/sysctl.d/50-core.conf with the following content:
kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
3. modify/uncomment the following parameters in /etc/systemd/system.conf
DumpCore=yes
DefaultLimitCORE=infinity
4. reboot
Please refer to man systemd-coredump for more details.
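Before rebooting, the settings from step 3 could be double-checked with something like the sketch below. The function name check_coredump_conf is made up for illustration; it only looks for the two exact lines from step 3 and does not verify the sysctl setting from step 2.

```shell
# Check that a systemd config file contains the two settings from step 3.
# $1: path to the config file (defaults to /etc/systemd/system.conf).
# Commented-out lines such as "#DumpCore=yes" do not count as set.
check_coredump_conf() {
    conf="${1:-/etc/systemd/system.conf}"
    status=0
    for setting in 'DumpCore=yes' 'DefaultLimitCORE=infinity'; do
        if grep -qx "[[:space:]]*$setting[[:space:]]*" "$conf"; then
            echo "ok: $setting"
        else
            echo "missing: $setting"
            status=1
        fi
    done
    return "$status"
}

# Hypothetical usage:
# check_coredump_conf /etc/systemd/system.conf
```

The function returns a non-zero status if either setting is missing, so it can also be used in a script.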
The next time kresd crashes, there should be a PID listed in
$ coredumpctl list
which can be used to display some information about the coredump:
$ coredumpctl info $PID
Even the stack trace could help us track down the root of the issue. If you
believe you've found a security issue, please report it via a
*confidential* issue at
https://gitlab.labs.nic.cz/knot/knot-resolver/issues or to
knot-resolver(a)labs.nic.cz (non-public list).
Thanks!
--
Tomas Krizek
PGP: 4A8B A48C 2AED 933B D495 C509 A1FB A5F7 EF8C 4869