Re: [knot-dns-users] Fixing broken AXFR zone transfer with knot slave and dnsmasq as master

27 Zář 2021

Hi Daniel,
thanks much for your report and for thorough insight of the unexpected
answer processing of Knot.
While IXFR RFC is pretty poor on details, the handling of various answer
possibilities is pretty complicated, which relates to the complex and
changing code.
Despite you will get contributing permissions in our repo probably soon,
I have already employed your patch in
https://gitlab.nic.cz/knot/knot-dns/-/commits/ixfr_notimp .
May you please re-verify once again that the fix is working in your
environment?
Thank you,
Libor
Dne 25. 09. 21 v 22:01 Daniel Gröber napsal(a):
...
  Hi knot-dns-users,
 I have knotd as a slave for a dnsmasq and I'm seeing fallback from IXFR to
 AXFR not working properly. TLDR: I have patches and while I'm waiting for
 my gitlab access to be allowed to submit MRs I wrote this email :)
 What happens is that the first AXFR goes through and everything looks happy
 dandy, but if I restart dnsmasq to force a SOA update knot tries to do an
 IXFR which fails and doesn't fall back to AXFR.
 First the AXFR works:
      knotd: info: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, started
      knotd: [myzone.example.net.] AXFR, incoming, remote 2001:db8::2@53, finished, 0.00
seconds, 1 messages, 870 bytes
      knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, zone updated,
0.00 seconds, serial none -> 1632501860
 when I restart dnsmasq to force a SOA update, the following IXFR fails:
      knotd: info: [myzone.example.net.] refresh, remote 2001:db8::2@53, remote serial
1632503553, zone is outdated
      knotd: warning: [myzone.example.net.] IXFR, incoming, remote 2001:db8::2@53,
malformed response SOA
      knotd: warning: [myzone.example.net.] refresh, remote myremote-dnsmasq not usable
      knotd: error: [myzone.example.net.] refresh, failed (no usable master)
 I've been sitting on this bug for a while now since I couldn't make heads
 or tails of what the code is doing but my third try at debugging this
 finally worked out.
 If we look at the fallback logic in src/knot/events/handlers/refresh.c, in
 try_refresh(), it revolves around this kind of gnarly while statement:
        // while loop runs 0x or 1x; IXFR to AXFR failover
        while (ret = knot_requestor_exec(&requestor, req, timeout),
               ret = (data.ret == KNOT_EOK ? ret : data.ret),
               ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR
&&
               data.state != STATE_SOA_QUERY) {
 If the while loop is taken fallback to AXFR happens otherwise not.
 Attaching gdb to my running knot I can see that in my failure case all
 conditionals except `data.xfr_type == _IXFR` are true but xfr_type is in
 fact not _IXFR but _ERROR. As far as I can tell the only reason for this
 check is to terminate the loop once the fallback to AXFR has happened as
 `_AXFR != _IXFR`. I shall elaborate on this below.
 So where does xfr_type become _ERROR? In my case determine_xfr_type()
 returns _ERROR due to the first check, `answer->count < 1` and ixfr_consume
 assigns this to xfr_type which in turn happens as part of the
 knot_requestor_exec() call in the while.
 So how does the fallback usually happen then? Well ixfr_consume checks the
 RCODE and fails immediately, without modifying xfr_type that is, if it's no
 good. That way the while statement will run the fallback code.
      // Check RCODE
      if (knot_pkt_ext_rcode(pkt) != KNOT_RCODE_NOERROR) {
          [...]
          return KNOT_STATE_FAIL;
      }
 I'm not really familliar with the DNS RFCs so I'm not sure if dnsmasq is
 supposed to repond with an error to unsupported query types but from some
 quick experiments against prominent authorative DNS servers around the
 internet it does seem like it's behaviour, reponding with NOERROR, an empty
 answer section and a SOA in authority is what everyone else does too when
 they get an rrtype they don't care for.
 Why now would the fallback logic not want to fallback to AXFR in this case?
 I can't think of a reason it would really want to do that.  Like I said
 above it seems to me the only reason for the _IXFR check in the while is to
 make sure the loop terminates. Though IMO this is a very roundabout way of
 doing that and just using an if would be cleaner to boot.
 Anyway, let's do some git archeology on that point. The first related
 commit I can find is
      commit 39e447e4590c7d5fc12a5c05ed11a45fd6561dda
      Author: Libor Peltan &lt;libcha.p(a)gmail.com&gt;
      Date:   Wed Oct 17 10:08:53 2018 +0200
          ixfr/axfr: failover even when the ixfr reply wasn't valid
          before: ixfr falls back to axfr only if valid, but negative reply, e.g. NOTAUTH,
journal inconsistency..
          after: ixfr falls back to axfr in any case ixfr fails, e.g. malformed reply
packet
 This is the commit that introduced the while statement for the fallback in
 try_refresh. It looked like so back then:
      // while loop runs 0x or 1x; IXFR to AXFR failover
      while ((ret = knot_requestor_exec(&requestor, req, timeout)) != KNOT_EOK
&&
             data.xfr_type == XFR_TYPE_IXFR) {
 Much simpler, if knot_requestor_exec returns a failure and xfr_type is
 _IXFR the fallback to AXFR happens. Now what I don't understand is that
 from the commit message this fix sounds just like my issue. Dnsmasq sends
 back what knot considers an invalid reply so why didn't this fix my problem
 too? Weird.
 Looking further I find
      commit 5c1a5e28797f9c2eab1300247c052451ccd3be3e
      Author: Libor Peltan &lt;libor.peltan(a)nic.cz&gt;
      Date:   Tue Jan 31 18:07:28 2017 +0100
          XFR: proper handling of single-SOA first response message
 which introduced determine_xfr_type and the XFR_TYPE_ERROR distinction. If
 I build the commit just before that one
      a8237b1f8 xfr: separated functions consuming one rrset of [ai]xfr response
 and setup a config to reproduce my setup, then XFR against dnsmasq actually
 starts to work. Back at 5c1a5e287 things are broken with the "malformed
 SOA" message again. So we likely found the culprit.
 Ok so back at the bad commit 5c1a5e287
        // IXFR to AXFR failover
      - if (data->is_ixfr && next == KNOT_STATE_FAIL) {
      + if (data->xfr_type == XFR_TYPE_IXFR && next == KNOT_STATE_FAIL) {
 This is what introduced the `xfr_type == _IXFR` check which later got moved
 from this if statement in transfer_consume() to the while loop in
 try_refresh() by 39e447e45. The check used to be just for is_ixfr which got
 initialized in refresh_begin to start with (in the good commit 5c1a5e287~1):
      if (data->soa) {
          data->state = STATE_SOA_QUERY;
          data->is_ixfr = true;
      } else {
 in bad commit 5c1a5e287:
      if (data->soa) {
          data->state = STATE_SOA_QUERY;
          data->xfr_type = XFR_TYPE_IXFR;
          data->initial_soa_copy = NULL;
      } else {
 and then got reset to false by the fallback code.
 So I'm pretty convinced at this point that 5c1a5e287 just messed up that
 logic and the specificity of the XFR_TYPE_IXFR check really is unintended.
 To fix this we simply replace that check with a counter that ensures the
 loop will run at most twice like so:
 --- a/src/knot/events/handlers/refresh.c
 +++ b/src/knot/events/handlers/refresh.c
 @@ -1188,12 +1188,12 @@ static int try_refresh(conf_t *conf, zone_t *zone, const
conf_remote_t *master,
          int timeout = conf->cache.srv_tcp_remote_io_timeout;
 -       int ret;
 +       int ret, i;
          // while loop runs 0x or 1x; IXFR to AXFR failover
          while (ret = knot_requestor_exec(&requestor, req, timeout),
                 ret = (data.ret == KNOT_EOK ? ret : data.ret),
 -              ixfr_error_failover(ret) && data.xfr_type == XFR_TYPE_IXFR
&&
 +              ixfr_error_failover(ret) && i++ <= 1 &&
                 data.state != STATE_SOA_QUERY) {
                  REFRESH_LOG(LOG_WARNING, data.zone->name, data.remote,
                              "fallback to AXFR (%s)", knot_strerror(ret));
 ---
 Thoughts? Am I missing something this check is doing?
 --Daniel

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [knot-dns-users] Fixing broken AXFR zone transfer with knot slave and dnsmasq as master