On 11 March 2013 17:43, Anand Buddhdev <anandb(a)ripe.net> wrote:
  On 11/03/2013 15:27, Marek Vavruša wrote:
 Hi Marek,
  Well, you probably hit the weak spot of the current
implementation.
 We regularly test bootstrap speed of about 5k small zones and it
 finishes in about 1 minute or so, 
 Oh, interesting. How many parallel XFRs does Knot try? Our master is
 configured to allow a maximum of 500 XFRs in parallel, but of course
 there are other clients as well, so Knot would get a share of that. And
 then the master will refuse additional connections. 
There is no finite upper bound. At any time only 3 transfers can be
processed, but
others may be pending, for example waiting for data. When a
transfer is pending for a long
time without data, it gets discarded (I think the limit is about 5 minutes
between packets).
The congestion is "solved" really primitively using jittered timers,
but that may or may not work
and gives no guarantee; that's why I want to rework it.
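
To illustrate the jittered-timer idea mentioned above, here is a minimal
sketch in Python. It is not Knot's code: the function name
jittered_schedule and the parameters step and jitter are assumptions made
up for this example. It only spreads out the start times of the bootstrap
requests; it puts no bound on how many transfers end up running at once,
which is why this approach gives no guarantee.

```python
import random


def jittered_schedule(zones, step=0.5, jitter=0.25, rng=None):
    """Return (start_time, zone) pairs sorted by start time.

    Hypothetical helper, not part of Knot. Zone number i gets a base
    delay of i * step seconds (the "stepping") plus a random offset of
    up to jitter * step seconds (the "jitter").
    """
    rng = rng or random.Random()
    schedule = []
    for i, zone in enumerate(zones):
        base = i * step
        offset = rng.uniform(0, jitter * step)
        schedule.append((base + offset, zone))
    schedule.sort()
    return schedule


if __name__ == "__main__":
    for start, zone in jittered_schedule(["a.example", "b.example", "c.example"]):
        print(f"{start:6.3f}s  start AXFR for {zone}")
```

With a slow line or large zones, transfers started early may still be
running when later timers fire, so the load on the master is not capped.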
   but the
problem is that this is done over 1 GbE. The thing is we do
 not handle congestion very efficiently when there are a large
 number of larger zones or the line is slower. 
 In our case, we have a mixture of zones. Some are small, while others
 are quite large. Additionally, not all the zones can be loaded. For many
 zones, the master replies with SERVFAIL, because the upstream master of
 that zone has not provided a zone transfer, and the zone has gone stale
 on our intermediate distribution master.
 The connection between our Knot instance and the master is a 1 GbE
 connection, but as I explained, the master cannot cope with a thundering
 herd of incoming AXFR requests from Knot. 
 
I see, that would be the case.
  In the current implementation, bootstrap requests
are scheduled with a
 jittered timer and some stepping,
 but over non-ideal lines it may happen that the transfer rate is
 slower, packets may be lost, connections may be interrupted and so on.
 We are working on a new implementation with a fixed queue, that would
 handle this situation efficiently (it will be self-throttling) but it
 probably won't get into 1.2.0.
 For what it's worth, the problem is most evident on a bootstrap; once
 you have most of the zones and
 reasonable refresh timers, it will get up to speed again. Sorry for that. 
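
A self-throttling queue of the kind described above could look roughly
like the following Python sketch. This is only an assumption about the
general shape, not the planned Knot implementation: run_transfers, the
transfer callback, and max_parallel are hypothetical names, and the
default of 3 is borrowed from the number of transfers processed at a time
mentioned earlier in the thread. All zones go into one queue, and a fixed
set of workers pulls from it, so a new transfer starts only when an
earlier one has finished or failed.

```python
import queue
import threading


def run_transfers(zones, transfer, max_parallel=3):
    """Run transfer(zone) for every zone, at most max_parallel at a time.

    Hypothetical sketch. Returns a dict mapping each zone to the value
    returned by transfer(zone), or to the exception it raised.
    """
    pending = queue.Queue()
    for zone in zones:
        pending.put(zone)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                zone = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                outcome = transfer(zone)
            except Exception as exc:  # e.g. refused connection or SERVFAIL
                outcome = exc
            with lock:
                results[zone] = outcome

    workers = [threading.Thread(target=worker) for _ in range(max_parallel)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return results
```

Because the number of workers is fixed, the request rate follows the
speed of the line and of the master: slow or failing transfers simply
delay the next dequeue instead of piling up more connections.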
 Okay fair enough. We don't expect to have to bootstrap a server too
 often, but when we do have to, it's not ideal to have to wait so long
 for it to be ready, so better queuing of the AXFR requests would be good.
 Any idea which release you expect to put this code in?
 Regards,
 Anand 
 
I'll try my best to put it into 1.3.
Kind regards,
Marek