Hello KNOT folks,
We've found an issue 1.3 with bootstrapping. We're using FreeBSD 9.x, but we
also quickly confirmed it exists on Ubuntu 12.x to confirm it was not
isolated to FreeBSD. We're testing with about 3000 to 4000 zones, so our
environment is not even very large at this point and the bootstrapping
failures are very problematic. There are three causes that we've seen thus
far:
1. If the AXFR TCP connect is interrupted by a signal, the whole AXFR is
aborted and the bootstrap is rescheduled instead of selecting on the socket
to either get the successful connection, or until it times out/fails. This
can result in a flood of connects, with little to no progress in the
bootstrapping.
2. When connected, if a recv() is interrupted by a signal, it isn't retried.
This results in connections being dropped that don't need to be dropped.
3. If a successful connect is made, but the remote end subsequently drops it
(e.g., resets the connection), then the bootstrap fails without being
rescheduled. This was found when slaving from a non-KNOT DNS server that may
have TCP rate limiting enabled, or something of that nature. Either way, the
fact that it is not rescheduled is very undesirable.
I suspect that there are other cases of interrupted system calls not being
handled correctly.
Here is some additional info that may help find the root cause:
- The greater the latency between the master and slave, the worse the
problem is. We tested with a slave 80 ms RTT away and it was very bad.
- The more worker threads you have, the worse the problem is. So even
locally (slave 0 ms away from master) we could reproduce the issue fairly
easily.
Hopefully this can be remedied!
Cheers,
Jonathan