Hi Juha,
thanks for reaching out to us about this issue, and for describing your
problem so clearly!
Indeed, in version 3.3 the Knot routines for selecting which primary
server to use, and in which order, were slightly refactored and improved.
This is in general a really complicated piece of design, since various
things may happen at each primary...
One specific situation we need to handle is when one of the primaries is
chronically out-of-date -- our secondary must not stick with it, but must
try the other primaries to find the most recent data.
And this is exactly in conflict with what you would want: Knot happily
accepting one up-to-date primary and not treating the failure to contact
the second one as an error.
I agree that the message "no usable master" might be imprecise in
various situations. After all, this message has not been touched for ages :)
I'd recommend reconsidering your setup -- perhaps it would be better to
keep the backup primary up and running as well?
You might also check out
https://www.knot-dns.cz/docs/3.3/singlehtml/index.html#master-pin-tolerance
which makes the secondary stick with one primary until that primary
appears to be broken. This would help achieve the "default vs. backup"
scenario.
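For illustration, here is a minimal sketch of how that might look on the
secondary side, reusing the remote names from your configuration (the
tolerance value is purely illustrative -- pick whatever suits your setup):
template:
  - id: default
    storage: /var/lib/knot/zones
    # both primaries, in preferred order
    master: [ hidden-ns-1.internal, hidden-ns-2.internal ]
    # keep refreshing from the currently pinned primary and tolerate its
    # outages for up to this many seconds before switching (example value)
    master-pin-tolerance: 3600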
Thanks!
Libor
On 08. 02. 24 at 17:23, Juha Suhonen wrote:
Hello fellow Knot users,
We're using Knot on some of our public authoritative servers. We operate a hidden
primary configuration where two internal primary servers send notifies to the
publicly accessible servers whenever zones change (of course, the internal servers
allow zone transfers & queries from the Knot servers).
By design, our hidden internal primary servers operate in a hot/cold setup - one of
them answers zone transfers / sends out notifies and the other one does not (the DNS
server software is not running on the cold server).
This configuration causes Knot to log alarming errors:
Feb 1 17:28:45 ns2 knotd[624]: info: [zone.fi.] refresh, remote 10.54.54.1@8054, remote serial 2021083015, zone is up-to-date, expires in 864000 seconds
Feb 1 17:28:45 ns2 knotd[624]: info: [zone.fi.] refresh, remote hidden-ns-2.internal, address 10.54.54.2@8054, failed (connection reset)
Feb 1 17:28:45 ns2 knotd[624]: info: [zone.fi.] refresh, remote hidden-ns-2.internal, address [ipv6_address]@8054, failed (connection reset)
Feb 1 17:28:45 ns2 knotd[624]: warning: [zone.fi.] refresh, remote hidden-ns-2.internal not usable
Feb 1 17:28:45 ns2 knotd[624]: error: [zone.fi.] refresh, failed (no usable master), next retry at 2024-02-01T17:58:45+0200
Feb 1 17:28:45 ns2 knotd[624]: error: [zone.fi.] zone event 'refresh' failed (no usable master)
Specifically, the error "(no usable master)" is worrying - Knot is able to reach
hidden-ns-1.internal and verify that the zone it has is up-to-date. Zone updates also
work normally, and the zones don't seem to expire (as they should if no master is
reachable for an extended period of time).
It looks like this error appeared in version 3.3.0. 3.2.9 did not log similar errors
(except in cases where all primaries really were unreachable). Has there been some
design change in 3.3.0 (i.e. is this intentional), or could this be a bug? Or could
it be related to our configuration?
Our configuration relies heavily on templates. These should be the important bits
from Knot's config, with IPv6 addresses hidden:
acl:
  - id: "hidden-ns-1.internal"
    address: [ 10.54.54.1, ipv6_address ]
    action: notify
  - id: "hidden-ns-2.internal"
    address: [ 10.54.54.2, ipv6_address ]
    action: notify
remote:
  - id: "hidden-ns-1.internal"
    address: [ 10.54.54.1@8054, ipv6_address@8054 ]
    via: [ x.x.x.x, ipv6_address ]
  - id: "hidden-ns-2.internal"
    address: [ 10.54.54.2@8054, ipv6_address@8054 ]
    via: [ x.x.x.x, ipv6_address ]
template:
  - id: default
    storage: /var/lib/knot/zones
    master: hidden-ns-1.internal
    master: hidden-ns-2.internal
    acl: hidden-ns-1.internal
    acl: hidden-ns-2.internal
    semantic-checks: false
    global-module: mod-rrl/default
zone:
  - domain: zone.fi