Ahoj,
I've searched through the archives and read documentation+RFCs to
no avail, so I hope you can help out.
I run the authoritative DNS servers for an associative ISP in
Barcelona (eXO.cat), and we are running them in FreeBSD using Knot
DNS (2.9.5, but .6 and .7's changelogs do not point to a similar
issue being solved).
Since then, we have received a few reports from different parties
of "DNS issues". I am reasonably sure I have pinned this down to
badly configured DNS resolvers, but we carry too little weight to
force any change, and the doubt remains as to whether or not this
is on us. Without access to the resolvers, it is also a tad tricky
for us to reproduce.
Things do work beautifully with pretty much all of the internet.
This appears to be related to:
https://tools.ietf.org/html/rfc7873#section-5.2.3
(sometimes in connection with 5.2.4)
Maybe someone with more experience and running at a larger scale
has pointers on this topic.
Specifically, I have observed:
Case A:
1. (Probably Bind) Resolver contacts Knot DNS with Client Cookie
and old Server Cookie
2. Knot DNS responds BADCOOKIE with the provided Client Cookie,
Old Server Cookie, and adds the new server cookie.
3. Resolver contacts Knot DNS with the same Client Cookie and old
Server Cookie
4. Steps 2 and 3 repeat for a long time.
5. Domains end up not resolving.
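To make the difference concrete, here is a minimal Python sketch of
how I read the exchange in RFC 7873. It is a toy model, not Knot's or
Bind's actual code: the Server class, the resolve() function and the
fixed 8-byte cookie values are all made up for illustration. A client
that adopts the Server Cookie from the BADCOOKIE response succeeds on
its second query; a client that keeps resending the old one never
gets an answer, which is what case A looks like.

```python
BADCOOKIE = 23  # extended RCODE assigned by RFC 7873
NOERROR = 0


class Server:
    """Toy authoritative server that accepts only its current cookie."""

    def __init__(self, current_cookie):
        self.current_cookie = current_cookie

    def answer(self, client_cookie, server_cookie):
        # The reply echoes the Client Cookie and always carries the
        # Server Cookie that the server currently considers valid.
        ok = server_cookie == self.current_cookie
        rcode = NOERROR if ok else BADCOOKIE
        return rcode, client_cookie, self.current_cookie


def resolve(server, stale_cookie, adopt_new_cookie, max_tries=3):
    """Return how many queries were needed, or None if all failed."""
    client_cookie = b"client00"  # hypothetical fixed value
    server_cookie = stale_cookie
    for attempt in range(1, max_tries + 1):
        rcode, _, fresh = server.answer(client_cookie, server_cookie)
        if rcode == NOERROR:
            return attempt
        if adopt_new_cookie:
            # What I understand a well-behaved client to do:
            # retry with the Server Cookie taken from the response.
            server_cookie = fresh
    return None


server = Server(current_cookie=b"new-cook")
print(resolve(server, b"old-cook", adopt_new_cookie=True))   # 2
print(resolve(server, b"old-cook", adopt_new_cookie=False))  # None
```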
Case B:
https://github.com/matrix-org/synapse/issues/8581
Their supplier says the DNS server "replies with BADCOOKIE to UDP
queries", as if that were a bad thing; but from reading RFC 7873, I
understand that this is expected, documented behaviour and that the
client ought to try again with the given Server Cookie.
They say: "AIUI the resolver does retry, but again receives a
BADCOOKIE response."
I sadly don't have the tcpdump for those, but it could be just
like case A again.
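In case it helps anyone reproduce this without the resolvers, below
is a rough Python sketch (standard library only) that builds a query
carrying a Client Cookie plus a deliberately made-up Server Cookie,
and extracts the extended RCODE from a response. All function names
are my own, the cookie bytes are arbitrary, and the 1232-byte UDP
payload size is just an assumption. I would expect a server that
rejects the made-up cookie to answer with 23 (BADCOOKIE), but I have
not verified that against every setup.

```python
import socket
import struct

COOKIE_OPTION = 10  # EDNS option code for COOKIE (RFC 7873)
OPT_TYPE = 41       # OPT pseudo-RR type (RFC 6891)
BADCOOKIE = 23      # extended RCODE (RFC 7873)


def encode_name(name):
    """Encode a domain name in uncompressed wire format."""
    out = b""
    for label in name.rstrip(".").split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"


def build_query(name, client_cookie, server_cookie=b"", qtype=6):
    """Build a query (qtype 6 = SOA) with an EDNS COOKIE option."""
    # ID, flags, 1 question, 0 answers, 0 authority, 1 additional (OPT)
    header = struct.pack("!HHHHHH", 0x1234, 0, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, 1)
    cookie = client_cookie + server_cookie
    option = struct.pack("!HH", COOKIE_OPTION, len(cookie)) + cookie
    # OPT RR: root name, type, UDP payload size, TTL field, RDLENGTH
    opt = b"\x00" + struct.pack("!HHIH", OPT_TYPE, 1232, 0, len(option))
    return header + question + opt + option


def skip_name(msg, pos):
    """Return the offset just past a (possibly compressed) name."""
    while True:
        length = msg[pos]
        if length == 0:
            return pos + 1
        if length & 0xC0 == 0xC0:  # compression pointer, 2 bytes
            return pos + 2
        pos += 1 + length


def extended_rcode(msg):
    """Combine the header RCODE with the upper bits from the OPT RR."""
    _, flags, qd, an, ns, ar = struct.unpack("!HHHHHH", msg[:12])
    rcode = flags & 0x0F
    pos = 12
    for _ in range(qd):
        pos = skip_name(msg, pos) + 4  # skip QTYPE and QCLASS
    for _ in range(an + ns + ar):
        pos = skip_name(msg, pos)
        rtype, _, ttl, rdlen = struct.unpack("!HHIH", msg[pos:pos + 10])
        pos += 10 + rdlen
        if rtype == OPT_TYPE:
            rcode |= (ttl >> 24) << 4  # top 8 bits of the TTL field
    return rcode


def query_udp(server, payload, timeout=3.0):
    """Send one UDP query over IPv4 and return the raw response."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, (server, 53))
        return sock.recvfrom(4096)[0]
```

Something along the lines of
extended_rcode(query_udp("ns3.exo.cat", build_query("exo.cat",
b"\x01" * 8, b"\x02" * 8))) would then show how the server reacts to
a cookie it never issued.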
These are the relevant bits from the config:
server:
    rundir: "/var/run/knot"
    user: knot:knot
    listen: [ 0.0.0.0@53, ::@53 ]
log:
  - target: syslog
    any: info
statistics:
    append: off
database:
    storage: "/var/db/knot"
mod-cookies:
  - id: default
    badcookie-slip: 1
mod-rrl:
  - id: default
    rate-limit: 200 # Requests per second
template:
  - id: default
    storage: "/var/db/knot/primary"
    semantic-checks: on
    disable-any: on
    serial-policy: dateserial
    file: "%s.zone"
    global-module: mod-cookies/default
    global-module: mod-rrl/default
    global-module: mod-stats
  - id: secondary
    storage: "/var/db/knot/secondary"
    semantic-checks: on
    disable-any: on
    serial-policy: dateserial
    file: "%s.zone"
    module: mod-cookies/default
    module: mod-rrl/default
    module: mod-stats
  - id: signed
    storage: "/var/db/knot/primary"
    dnssec-signing: on
    zonefile-load: difference
    semantic-checks: on
    disable-any: on
    serial-policy: dateserial
    file: "%s.zone"
    module: mod-cookies/default
    module: mod-rrl/default
    module: mod-stats
We are not using anycast clusters or anything like that.
A quick solution would be to disable cookies or to experiment with
noudp, but that's something we'd like to avoid.
The name servers in question are:
- ns3.exo.cat
- ns4.exo.cat
- ns1.unchat.cat (not association, but identical setup)
All tests I've run against the servers and the associated domains
(e.g. exo.cat + unchat.cat) appear to be fine.
There is an odd one with
https://dnsviz.net/d/unchat.cat/dnssec/,
but other tests like
https://www.zonemaster.net/result/d530cb253298b56c do not report
any issues.
Thank you in advance for any pointers you may have (and for Knot
DNS!),
--
Evilham