Hi,
Roland and I ran into a crashing condition for knotd 2.6.[689],
presumably caused by a race condition in the threaded use of PKCS #11
sessions. We use a commercial, replicated, networked HSM and not SoftHSM2.
WORKAROUND:
We do have a work-around with "conf-set server.background-workers 1" so
this is not a blocking condition for us, but handling our ~1700 zones
concurrency would be add back later.
PROBLEM DESCRIPTION:
Without this work-around, we see crashes quite reliably, on a load that
does a number of zone-set/-unset commands, fired by sequentialised knotc
processes to a knotd that continues to fire zone signing concurrently.
The commands are generated with the knot-aware option -k from ldns-zonediff,
https://github.com/SURFnet/ldns-zonediff
ANALYSIS:
Our HSM reports errors that look like a session handle is reused and
then repeatedly logged into, but not always, so it looks like a race
condition on a session variable,
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AEE] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:48:59 | [00006AE9:00006AED] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:01 | [00006AE9:00006AED] C_GetAttributeValue
| E: Error CKR_USER_NOT_LOGGED_IN occurred.
27.08.2018 11:49:02 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:49:03 | [00006AE9:00006AEE] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
27.08.2018 11:55:50 | [0000744C:0000744E] C_Login
| E: Error CKR_USER_ALREADY_LOGGED_IN occurred.
These errors stopped being reported with the work-around configured.
Until that time, we have crashes, of which the following dumps one:
Thread 4 "knotd" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffcd1bd700 (LWP 27375)]
0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 0x00007ffff6967428 in __GI_raise (sig=sig@entry=6) at
../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007ffff696902a in __GI_abort () at abort.c:89
#2 0x00007ffff69a97ea in __libc_message (do_abort=do_abort@entry=2,
fmt=fmt@entry=0x7ffff6ac2ed8 "*** Error in `%s': %s: 0x%s ***\n") at
../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff69b237a in malloc_printerr (ar_ptr=,
ptr=,
str=0x7ffff6ac2fe8 "double free or corruption (out)", action=3) at
malloc.c:5006
#4 _int_free (av=, p=, have_lock=0) at
malloc.c:3867
#5 0x00007ffff69b653c in __GI___libc_free (mem=) at
malloc.c:2968
#6 0x0000555555597ed3 in ?? ()
#7 0x00005555555987c2 in ?? ()
#8 0x000055555559ba01 in ?? ()
#9 0x00007ffff7120338 in ?? () from /usr/lib/x86_64-linux-gnu/liburcu.so.4
#10 0x00007ffff6d036ba in start_thread (arg=0x7fffcd1bd700) at
pthread_create.c:333
#11 0x00007ffff6a3941d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
DEBUGGING HINTS:
Our suspicion is that you may not have set the mutex callbacks when
invoking C_Initialize() on PKCS #11, possibly due to the intermediate
layers of abstraction hiding this from view. This happens more often.
Then again, the double free might pose another hint.
This is on our soon-to-go-live platform, so I'm afraid it'll be very
difficult to do much more testing, I hope this suffices for your debugging!
I hope this helps Knot DNS to move forward!
-Rick