We're experiencing occasional failures with Knot crashing while running as a slave.
The behavior is as follows: the slave will run for 2 months or so and then segfault. Our
system automatically restarts the process, but after 15 minutes or less, the segfault
happens again. This repeats until we remove the /var/lib/knot/journal and
/var/lib/knot/timers directories. This seems to fix the problem for a while: a newly started
process will run fine for another couple of months.

More details on our setup: These systems serve a little less than a hundred zones, some of
which change at a rapid rate. We have configured the servers to not flush the zone data to
regular files. The server software is Knot 2.5.7, but with the changes from the
"ecs-patch" branch applied.

A while back, I tried a release from the newer branch (I'm pretty sure it was 2.6.4),
but I had a problem there where some servers were falling behind the master, as evidenced
by their SOA serial number. Diagnosing this on a more recent branch probably makes more
sense, but I'd be a little leery of dealing with two problems, not just one.

I can provide various data: the (gigantic) seemingly "corrupt" journal/timer
files and the segfault messages from the syslog. I don't have any coredumps, but
I'll turn those on today. Given the nature of the problem, it might take a while for
it to manifest.

Chuck