[knot-dns-users] Fixing broken AXFR zone transfer with knot slave and dnsmasq as master

25 Zář 2021

Hi knot-dns-users,
I have knotd as a slave for a dnsmasq and I'm seeing fallback from IXFR to
AXFR not working properly. TLDR: I have patches and while I'm waiting for
my gitlab access to be allowed to submit MRs I wrote this email :)
What happens is that the first AXFR goes through and everything looks happy
dandy, but if I restart dnsmasq to force a SOA update knot tries to do an
IXFR which fails and doesn't fall back to AXFR.
First the AXFR works:
    knotd: info: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, started
    knotd: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, finished, 0.00
seconds, 1 messages, 870 bytes
    knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, zone updated, 0.00
seconds, serial none -> 1632501860
when I restart dnsmasq to force a SOA update, the following IXFR fails:
    knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, remote serial
1632503553, zone is outdated
    knotd: warning: [myzone.example.net.] IXFR, incoming, remote 2001:db8::2@53, malformed
response SOA
    knotd: warning: [myzone.example.net.] refresh, remote myremote-dnsmasq not usable
    knotd: error: [myzone.example.net.] refresh, failed (no usable master)
I've been sitting on this bug for a while now since I couldn't make heads
or tails of what the code is doing but my third try at debugging this
finally worked out.
If we look at the fallback logic in src/knot/events/handlers/refresh.c, in
try_refresh(), it revolves around this kind of gnarly while statement:
        // while loop runs 0x or 1x; IXFR to AXFR failover
        while (ret = knot_requestor_exec(&requestor, req, timeout),
               ret = (data.ret == KNOT_EOK ? ret : data.ret),
               ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR
&&
               data.state != STATE_SOA_QUERY) {
If the while loop is taken fallback to AXFR happens otherwise not.
Attaching gdb to my running knot I can see that in my failure case all
conditionals except `data.xfr_type == _IXFR` are true but xfr_type is in
fact not _IXFR but _ERROR. As far as I can tell the only reason for this
check is to terminate the loop once the fallback to AXFR has happened as
`_AXFR != _IXFR`. I shall elaborate on this below.
So where does xfr_type become _ERROR? In my case determine_xfr_type()
returns _ERROR due to the first check, `answer->count < 1` and ixfr_consume
assigns this to xfr_type which in turn happens as part of the
knot_requestor_exec() call in the while.
So how does the fallback usually happen then? Well ixfr_consume checks the
RCODE and fails immediately, without modifying xfr_type that is, if it's no
good. That way the while statement will run the fallback code.
    // Check RCODE
    if (knot_pkt_ext_rcode(pkt) != KNOT_RCODE_NOERROR) {
        [...]
        return KNOT_STATE_FAIL;
    }
I'm not really familliar with the DNS RFCs so I'm not sure if dnsmasq is
supposed to repond with an error to unsupported query types but from some
quick experiments against prominent authorative DNS servers around the
internet it does seem like it's behaviour, reponding with NOERROR, an empty
answer section and a SOA in authority is what everyone else does too when
they get an rrtype they don't care for.
Why now would the fallback logic not want to fallback to AXFR in this case?
I can't think of a reason it would really want to do that.  Like I said
above it seems to me the only reason for the _IXFR check in the while is to
make sure the loop terminates. Though IMO this is a very roundabout way of
doing that and just using an if would be cleaner to boot.
Anyway, let's do some git archeology on that point. The first related
commit I can find is
    commit 39e447e4590c7d5fc12a5c05ed11a45fd6561dda
    Author: Libor Peltan &lt;libcha.p(a)gmail.com&gt;
    Date:   Wed Oct 17 10:08:53 2018 +0200
        ixfr/axfr: failover even when the ixfr reply wasn't valid
        before: ixfr falls back to axfr only if valid, but negative reply, e.g. NOTAUTH,
journal inconsistency..
        after: ixfr falls back to axfr in any case ixfr fails, e.g. malformed reply packet
This is the commit that introduced the while statement for the fallback in
try_refresh. It looked like so back then:
    // while loop runs 0x or 1x; IXFR to AXFR failover
    while ((ret = knot_requestor_exec(&requestor, req, timeout)) != KNOT_EOK
&&
           data.xfr_type == XFR_TYPE_IXFR) {
Much simpler, if knot_requestor_exec returns a failure and xfr_type is
_IXFR the fallback to AXFR happens. Now what I don't understand is that
from the commit message this fix sounds just like my issue. Dnsmasq sends
back what knot considers an invalid reply so why didn't this fix my problem
too? Weird.
Looking further I find
    commit 5c1a5e28797f9c2eab1300247c052451ccd3be3e
    Author: Libor Peltan &lt;libor.peltan(a)nic.cz&gt;
    Date:   Tue Jan 31 18:07:28 2017 +0100
        XFR: proper handling of single-SOA first response message
which introduced determine_xfr_type and the XFR_TYPE_ERROR distinction. If
I build the commit just before that one
    a8237b1f8 xfr: separated functions consuming one rrset of [ai]xfr response
and setup a config to reproduce my setup, then XFR against dnsmasq actually
starts to work. Back at 5c1a5e287 things are broken with the "malformed
SOA" message again. So we likely found the culprit.
Ok so back at the bad commit 5c1a5e287
      // IXFR to AXFR failover
    - if (data->is_ixfr && next == KNOT_STATE_FAIL) {
    + if (data->xfr_type == XFR_TYPE_IXFR && next == KNOT_STATE_FAIL) {
This is what introduced the `xfr_type == _IXFR` check which later got moved
from this if statement in transfer_consume() to the while loop in
try_refresh() by 39e447e45. The check used to be just for is_ixfr which got
initialized in refresh_begin to start with (in the good commit 5c1a5e287~1):
    if (data->soa) {
        data->state = STATE_SOA_QUERY;
        data->is_ixfr = true;
    } else {
in bad commit 5c1a5e287:
    if (data->soa) {
        data->state = STATE_SOA_QUERY;
        data->xfr_type = XFR_TYPE_IXFR;
        data->initial_soa_copy = NULL;
    } else {
and then got reset to false by the fallback code.
So I'm pretty convinced at this point that 5c1a5e287 just messed up that
logic and the specificity of the XFR_TYPE_IXFR check really is unintended.
To fix this we simply replace that check with a counter that ensures the
loop will run at most twice like so:

--- a/src/knot/events/handlers/refresh.c
+++ b/src/knot/events/handlers/refresh.c
@@ -1188,12 +1188,12 @@ static int try_refresh(conf_t *conf, zone_t *zone, const
conf_remote_t *master,
        int timeout = conf->cache.srv_tcp_remote_io_timeout;
-       int ret;
+       int ret, i;
        // while loop runs 0x or 1x; IXFR to AXFR failover
        while (ret = knot_requestor_exec(&requestor, req, timeout),
               ret = (data.ret == KNOT_EOK ? ret : data.ret),
-              ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR
&&
+              ixfr_error_failover(ret) && i++ <= 1 &&
               data.state != STATE_SOA_QUERY) {
                REFRESH_LOG(LOG_WARNING, data.zone->name, data.remote,
                            "fallback to AXFR (%s)", knot_strerror(ret));
---
Thoughts? Am I missing something this check is doing?
--Daniel
    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

[knot-dns-users] Fixing broken AXFR zone transfer with knot slave and dnsmasq as master