Hello Chuck,
Could you please send me (privately) your configuration and
the latest logs, including the segfault message?
How much RAM does the server have?
What is the size of the journal database?
What do you mean by "some servers were falling behind the master,
as evidenced by their SOA serial number"? Do you have any logs?
FYI, the upcoming Knot DNS 2.7.0 implements ECS functionality.
Thanks,
Daniel
On 2018-07-03 18:46, Chuck Musser wrote:
We're experiencing occasional failures with Knot crashing while
running as a slave. The behavior is as follows: the slave will run for
2 months or so and then segfault. Our system automatically restarts
the process, but after 15 minutes or less, the segfault happens again.
This repeats until we remove the /var/lib/knot/journal and
/var/lib/knot/timers directories. This seems to fix it up for a while:
a newly started process will run fine for another couple of months.
More details on our setup: These systems serve a little less than a
hundred zones, some of which change at a rapid rate. We have
configured the servers to not flush the zone data to regular files.
The server software is 2.5.7, but with the changes from the
"ecs-patch" branch applied.
A while back, I tried a release from the newer branch (I'm pretty sure
it was 2.6.4), but I had a problem there where some servers were
falling behind the master, as evidenced by their SOA serial number.
Diagnosing this on a more recent branch probably makes more sense, but
I'd be a little leery of dealing with two problems, not just one.
I can provide various data: the (gigantic) seemingly "corrupt"
journal/timer files and the segfault messages from the syslog. I don't
have any coredumps, but I'll turn those on today. Given the nature of
the problem, it might take a while for it to manifest.
Chuck