Hi,
Roland and I ran into a crashing condition for knotd 2.6.[689],
presumably caused by a race condition in the threaded use of PKCS #11
sessions. We use a commercial, replicated, networked HSM and not SoftHSM2.
WORKAROUND:
We do have a work-around, "conf-set server.background-workers 1", so
this is not a blocking condition for us, but we would like to restore
concurrent handling of our ~1700 zones later.
PROBLEM DESCRIPTION:
Without this work-around, we see crashes quite reliably under a load
that issues a number of zone-set/-unset commands, fired by
sequentialised knotc processes at a knotd that continues to perform
zone signing concurrently.
The commands are generated with the knot-aware option -k from ldns-zonediff,
https://github.com/SURFnet/ldns-zonediff
ANALYSIS:
Our HSM reports errors that look as if a session handle is reused and
then repeatedly logged into, though not consistently, which suggests a
race condition on a session variable:
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:02 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:03 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:55:50 | [0000744C:0000744E] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
These errors stopped being reported once the work-around was
configured. Until then we had crashes, one of which produced the
following dump:
Thread 4 "knotd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffcd1bd700 (LWP 27375)]
0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff696902a in __GI_abort () at abort.c:89
#2 0x00007ffff69a97ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7ffff6ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at
../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff69b237a in malloc_printerr (ar_ptr=&lt;optimized out&gt;,
ptr=&lt;optimized out&gt;,
str=0x7ffff6ac2fe8 "double free or corruption (out)", action=3) at
malloc.c:5006
#4 _int_free (av=&lt;optimized out&gt;, p=&lt;optimized out&gt;, have_lock=0) at
malloc.c:3867
#5 0x00007ffff69b653c in __GI___libc_free (mem=&lt;optimized out&gt;) at
malloc.c:2968
#6 0x0000555555597ed3 in ?? ()
#7 0x00005555555987c2 in ?? ()
#8 0x000055555559ba01 in ?? ()
#9 0x00007ffff7120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4
#10 0x00007ffff6d036ba in start_thread (arg=0x7fffcd1bd700) at
pthread_create.c:333
#11 0x00007ffff6a3941d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
DEBUGGING HINTS:
Our suspicion is that the mutex callbacks may not be set when invoking
C_Initialize() on PKCS #11, possibly because intermediate layers of
abstraction hide this from view; this is a common oversight. (For
multi-threaded use, PKCS #11 requires either mutex callbacks or the
CKF_OS_LOCKING_OK flag in CK_C_INITIALIZE_ARGS.) Then again, the
double free might be another hint.
This is on our soon-to-go-live platform, so I'm afraid it will be very
difficult to do much more testing; I hope this suffices for your debugging!
I hope this helps Knot DNS to move forward!
-Rick
Hello,
I have an issue with a zone for which Knot is the slave server. I am
not able to transfer the zone; the refresh fails with "no usable
master". BIND is able to transfer this zone, and AXFR works with the
host command as well. There are more domains on this master and the
others are working. The thing is that I can see in Wireshark that the
AXFR is started and the zone transfer begins, but for some reason Knot
terminates the TCP connection with RST after the 1st ACK to the AXFR
response, so the AXFR fails. The AXFR response is spread over several
TCP segments.
I can provide traces privately.
KNOT 2.6.7-1+0~20180710153240.24+stretch~1.gbpfa6f52
Thanks for help.
BR
Ales Rygl
Dear all,
I use knot 2.7.1 with automatic DNSSEC signing and key management.
For some zones I have used "cds-cdnskey-publish: none".
As .CH/.LI is about to support CDS/CDNSKEY (RFC 8078, RFC 7344), I
thought I should enable publication of the CDS/CDNSKEY RRs for all my
zones. However, the zones which are already secure (trust anchor in
the parent zone) do not publish the CDS/CDNSKEY records when the
setting is changed to "cds-cdnskey-publish: always".
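For reference, the setting in question lives in the policy section of
knot.conf; the policy id and zone name below are just examples:

```yaml
policy:
  - id: signed
    cds-cdnskey-publish: always

zone:
  - domain: example.ch
    dnssec-signing: on
    dnssec-policy: signed
```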
I have not been able to reproduce this error on new zones or new zones
signed and secured with a trust anchor in the parent zone for which I
then change the cds-cdnskey-publish setting from "none" to "always".
This indicates that there seems to be some state error for my existing
zones only.
I tried the following, without success:
knotc zone-sign <zone>
knotc -f zone-purge +journal <zone>
; publish an inactive KSK
keymgr <zone> generate ... ; knotc zone-sign <zone>
Completely removing the zone (and all keys) and restarting obviously
fixes the problem. However, I cannot do this for all my zones, as I
would have to remove the DS records in the parent zone first...
Any idea?
Daniel
Hi all,
Could you please check the state of the Debian repository? It looks a
bit outdated... The latest version available is
2.6.7-1+0~20180710153240.24+stretch~1.gbpfa6f52, while 2.7.0 has
already been released.
Thanks
BR
Ales Rygl
Hey,
We're scripting around Knot, and for that we pipe sequences of commands
to knotc. We've run into a few wishes for improved rigour that look
generic:
1. WAITING FOR TRANSACTION LOCKS
This would make our scripts more reliable, especially when we also need
to do manual operations on the command line. There is no hurry to
detect lock-freeing operations immediately, so retries with exponential
backoff would be quite alright for us.
Deadlocks are an issue when these are nested, so this would best be an
option to knotc; but many applications call for a single level, and
these could benefit from the added assurance of holding the lock.
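To make the wish concrete, here is a minimal sketch of the retry
behaviour we have in mind. retry_with_backoff is a hypothetical helper
name, not an existing knotc option; the knotc invocation at the bottom
is only an illustration:

```shell
#!/bin/sh
# Re-run a command until it succeeds or the attempts are exhausted,
# doubling the delay between tries (exponential backoff: 1s, 2s, 4s, ...).
retry_with_backoff() {
    attempts=$1; shift
    delay=1
    i=1
    while :; do
        "$@" && return 0                  # success, e.g. the lock was acquired
        [ "$i" -ge "$attempts" ] && return 1  # give up after N attempts
        sleep "$delay"
        delay=$((delay * 2))
        i=$((i + 1))
    done
}

# Hypothetical usage:
#   retry_with_backoff 5 knotc zone-begin example.com
```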
2. FAILING ON PARTIAL OPERATIONS
When we script a *-begin, act1, act2, *-commit sequence and pipe it
into knotc, it is not possible to see intermediate results. This could
be solved if any failure (including in a non-locking *-begin) triggered
a *-abort and returned a suitable exit code. Only success in *-commit
would exit(0), and that would allow us to detect overall success.
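As a sketch of the fail-fast semantics we are after, the wrapper below
runs each command separately instead of piping them, aborting the
transaction on the first failure. KNOTC is made overridable and the
zone and commands are examples, so this is an illustration, not a
proposal for the actual implementation:

```shell
#!/bin/sh
# Run a sequence of zone-transaction commands with fail-fast semantics:
# on the first failing command, abort the transaction and return non-zero.
KNOTC="${KNOTC:-knotc}"

run_txn() {
    $KNOTC zone-begin example.com || return 1
    for cmd in "$@"; do
        # Unquoted expansion on purpose: each argument is a whole
        # knotc command line that should be word-split.
        if ! $KNOTC $cmd; then
            $KNOTC zone-abort example.com
            return 1
        fi
    done
    $KNOTC zone-commit example.com   # exit(0) only on overall success
}

# Hypothetical usage:
#   run_txn "zone-set example.com www 3600 A 192.0.2.1"
```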
We've considered making a wrapper around knotc, but that might actually
reduce its quality and stability, so instead we now propose these features.
Just let me know if you'd like to see the above as a patch (and a repo
to use for it).
Cheers,
-Rick
Hello,
I am seeing segfault crashes from knot + libknot7 version 2.6.8-1~ubuntu
for amd64 during a zone commit cycle. The transaction is empty, by the
way, but in general we use a utility to compare the actual state (Ist)
with the desired state (Soll).
This came up while editing a zone that hasn't been configured yet, so we
are obviously doing something strange. (The reason is I'm trying to
switch DNSSEC on/off in a manner orthogonal to the zone data transport,
which is quite clearly not what Knot DNS was designed for. I will post
a feature request that could really help with orthogonality.)
I'll attach two flows, occurring at virtually the same time on our two
machines while doing the same thing locally, so the bug looks
reproducible. If you need more information, I'll try to see what I can do.
Cheers,
-Rick
Jul 24 14:22:59 menezes knotd[17733]: info: [example.com.] control,
received command 'zone-commit'
Jul 24 14:22:59 menezes kernel: [1800163.196199] knotd[17733]: segfault
at 0 ip 00007f375a659410 sp 00007ffde37d46d8 error 4 in
libknot.so.7.0.0[7f375a64b000+2d000]
Jul 24 14:22:59 menezes systemd[1]: knot.service: Main process exited,
code=killed, status=11/SEGV
Jul 24 14:22:59 menezes systemd[1]: knot.service: Unit entered failed state.
Jul 24 14:22:59 menezes systemd[1]: knot.service: Failed with result
'signal'.
Jul 24 14:22:59 vanstone knotd[6473]: info: [example.com.] control,
received command 'zone-commit'
Jul 24 14:22:59 vanstone kernel: [3451862.795573] knotd[6473]: segfault
at 0 ip 00007ffb6e817410 sp 00007ffd2b6e1d58 error 4 in
libknot.so.7.0.0[7ffb6e809000+2d000]
Jul 24 14:22:59 vanstone systemd[1]: knot.service: Main process exited,
code=killed, status=11/SEGV
Jul 24 14:22:59 vanstone systemd[1]: knot.service: Unit entered failed
state.
Jul 24 14:22:59 vanstone systemd[1]: knot.service: Failed with result
'signal'.
Hi,
After updating from 2.6.8 to 2.7.0, none of my zones gets loaded:
failed to load persistent timers (invalid parameter)
error: [nord-west.org.] zone cannot be created
How can I fix this?
Kind Regards
Bjoern