Let's start step by step.
On Wed, 2019-04-03 at 16:56 +0200, Petr Špaček wrote:
Hello,
as you have already found out, it is complicated ;-)
The Linux kernel has its own magic algorithms for scheduling work on
multi-core/multi-socket/NUMA machines, and DNS benchmarking also very
much depends on the network card, its drivers, etc.
If we were going to fine-tune your setup we would have to go into details:
What is your CPU architecture? Number of sockets, CPUs in them, etc.?
It's a Russian Elbrus CPU; I have little information about its architecture. It is a
4-socket motherboard with an 8-core CPU in each socket.
elbrus01 ~/src/dnsperf # uname -a
Linux elbrus01 4.9.0-2.2-e8c #1 SMP Mon Nov 12 10:52:48 GMT 2018 e2k E8C
E8C-SWTX GNU/Linux
elbrus01 ~/src/dnsperf # cat /etc/mcst_version
4.0-rc2
elbrus01 ~/src/dnsperf #
How is the main memory connected to the CPUs?
Is it NUMA?
I think it is NUMA. I can see some memory skew across NUMA nodes:
elbrus01 ~/src/dnsperf # numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 32 33 34 35 36 37
38 39 48 49 50 51 52 53 54 55
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3
elbrus01 ~/src/dnsperf # numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64433 MB
node 0 free: 63261 MB
node 1 cpus: 16 17 18 19 20 21 22 23
node 1 size: 64467 MB
node 1 free: 62363 MB
node 2 cpus: 32 33 34 35 36 37 38 39
node 2 size: 64467 MB
node 2 free: 63768 MB
node 3 cpus: 48 49 50 51 52 53 54 55
node 3 size: 64467 MB
node 3 free: 63811 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
elbrus01 ~/src/dnsperf #
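As a side note, which NUMA node a PCI network card hangs off can usually be read from sysfs, and numactl can restrict a process to that node; eth4 and node 0 below are only examples, not a recommendation:

cat /sys/class/net/eth4/device/numa_node      # -1 means the kernel does not know
numactl --cpunodebind=0 --membind=0 <command> # run <command> with CPUs and memory from node 0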
Do you have irqbalance enabled?
I tried to use irqbalance on the previous operating system, which was based on the
3.11 Linux kernel. The new OS, based on the 4.9 Linux kernel, has no irqbalance
binary at all.
Have you somehow configured IRQ affinity?
What is your network card (how many IO queues does it have)?
The network card is an Intel 540 10GbE PCI card. Its PCI lanes are connected to the
CPU0 socket. NSD prefers all the IRQs bound to CPU0, memory allocated on
NUMA node 0, AND workers bound to CPU0/CPU1. Knot prefers the IRQs spread
across all 4 CPUs and workers using all CPUs.
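For reference, the Knot worker count itself comes from the server section of knot.conf; a rough sketch only, with the listen address and numbers as placeholders rather than my actual config:

server:
    listen: 10.0.0.4@53
    udp-workers: 64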
Did you configure network card queues and other driver
settings explicitly?
etc.
I spread incoming UDP traffic across all 32 RX queues:
elbrus01 ~/src/dnsperf # ethtool -N eth4 rx-flow-hash udp4 sdfn
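The resulting hash fields and the configured queue count can be double-checked with ethtool too; just a sketch, the exact output depends on the driver:

ethtool -n eth4 rx-flow-hash udp4   # show which header fields the UDP4 hash uses
ethtool -l eth4                     # show current vs. maximum channel (queue) counts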
And pin each eth4 IRQ to its own CPU core:
elbrus01 ~/src/dnsperf # echo 00800000,00000000 > /proc/irq/120/smp_affinity
elbrus01 ~/src/dnsperf # echo 00400000,00000000 > /proc/irq/119/smp_affinity
elbrus01 ~/src/dnsperf # echo 00200000,00000000 > /proc/irq/118/smp_affinity
elbrus01 ~/src/dnsperf # echo 00100000,00000000 > /proc/irq/117/smp_affinity
elbrus01 ~/src/dnsperf # echo 00080000,00000000 > /proc/irq/116/smp_affinity
elbrus01 ~/src/dnsperf # echo 00040000,00000000 > /proc/irq/115/smp_affinity
elbrus01 ~/src/dnsperf # echo 00020000,00000000 > /proc/irq/114/smp_affinity
elbrus01 ~/src/dnsperf # echo 00010000,00000000 > /proc/irq/113/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000080,00000000 > /proc/irq/112/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000040,00000000 > /proc/irq/111/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000020,00000000 > /proc/irq/110/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000010,00000000 > /proc/irq/109/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000008,00000000 > /proc/irq/108/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000004,00000000 > /proc/irq/107/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000002,00000000 > /proc/irq/106/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000001,00000000 > /proc/irq/105/smp_affinity
elbrus01 ~/src/dnsperf # echo 00800000 > /proc/irq/104/smp_affinity
elbrus01 ~/src/dnsperf # echo 00400000 > /proc/irq/103/smp_affinity
elbrus01 ~/src/dnsperf # echo 00200000 > /proc/irq/102/smp_affinity
elbrus01 ~/src/dnsperf # echo 00100000 > /proc/irq/101/smp_affinity
elbrus01 ~/src/dnsperf # echo 00080000 > /proc/irq/100/smp_affinity
elbrus01 ~/src/dnsperf # echo 00040000 > /proc/irq/99/smp_affinity
elbrus01 ~/src/dnsperf # echo 00020000 > /proc/irq/98/smp_affinity
elbrus01 ~/src/dnsperf # echo 00010000 > /proc/irq/97/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000080 > /proc/irq/96/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000040 > /proc/irq/95/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000020 > /proc/irq/94/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000010 > /proc/irq/93/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000008 > /proc/irq/92/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000004 > /proc/irq/91/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000002 > /proc/irq/90/smp_affinity
elbrus01 ~/src/dnsperf # echo 00000001 > /proc/irq/89/smp_affinity
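The same pinning could also be scripted instead of typed by hand. The sketch below is only an illustration: it assumes the exact IRQ range 89-120 and the core list shown above, and writes CPU numbers to smp_affinity_list instead of hex masks to smp_affinity:

#!/bin/sh
# Pin one NIC IRQ per core: IRQs 89..120 -> cores 0-7,16-23,32-39,48-55,
# matching the manual commands above. Check /proc/interrupts first; the
# IRQ numbers differ between machines and boots.
CORES="0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23 32 33 34 35 36 37 38 39 48 49 50 51 52 53 54 55"
irq=89
for core in $CORES; do
    echo "$core" > "/proc/irq/$irq/smp_affinity_list"
    irq=$((irq + 1))
done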
The /proc/interrupts output is attached.
The load is started with:
[nikor@kaa5 dnsperf]$ ./dnsperf -s 10.0.0.4 -d out -n 20 -c 103 -T72 -t 500 -S 1 -q 1000 -D
Different runs show a random number of unused cores, from 4 down to 1. The
requests per second change accordingly: fewer unused cores means more
performance.
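Whether the RSS hash actually fills all 32 queues can be checked from the per-queue counters during a run; the counter names below are the ixgbe-style ones and vary by driver:

ethtool -S eth4 | grep -E 'rx_queue_[0-9]+_packets'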
Fine-tuning always has to take into account your specific environment,
and it is hard to provide general advice.
If you find a specific reproducible problem, please report it to our GitLab:
https://gitlab.labs.nic.cz/knot/knot-dns/issues/
Please understand that the amount of time and hardware we can allocate for
free support is limited. In case you require fine-tuning for your
specific deployment, please consider buying professional support:
https://www.knot-dns.cz/support/
It is a reproducible case, but in a very specific environment. The overall
result may be good enough for production usage. In that case professional
support would be a good option.
I am myself interested in such strange behavior. I hope this case may be
useful for you too. I will forward this thread to the OS developers; maybe
they can clear up this issue.
Thank you for your attention.
Thank you for understanding.
Petr Špaček @ CZ.NIC
On 03. 04. 19 16:37, Sergey Petrov wrote:
I reversed the client and the server, so the server is now a 36-core Intel
box (72 HT cores).
Starting with small loads, I see Knot use the lower-numbered cores, except core 0.
When adding more load, I see that cores 0-17 AND 37-54 are used, but not at the
100% level. At maximum load I see all cores at about 100% usage.
It seems to me to be a system scheduler feature: first it starts with the
lower-numbered cores, then it adds cores from the second CPU socket, and
after that the HT cores.
On Wed, 2019-04-03 at 12:53 +0300, Sergey Petrov wrote:
> On Wed, 2019-04-03 at 10:52 +0200, Petr Špaček wrote:
>> On 03. 04. 19 10:45, Sergey Petrov wrote:
>>> I perform benchmarks with knot-dns as an authoritative server and dnsperf
>>> as the workload client. The Knot server has 32 cores. Interrupts from the 10Gb
>>> network card are spread across all 32 cores. Knot is configured with
>>> 64 udp-workers. Each Knot thread is assigned to one core, so there are at
>>> least two Knot threads assigned to each core. Then I start dnsperf with the
>>> command
>>>
>>> ./dnsperf -s 10.0.0.4 -d out -n 20 -c 103 -T 64 -t 500 -S 1 -q 1000 -D
>>>
>>> htop on the Knot server shows 3-4 cores completely unused. When I restart
>>> dnsperf, the unused cores change.
>>>
>>> What is the reason for the unused cores?
>>
>> Well, sometimes dnsperf is too slow :-)
>>
>> I recommend checking the following:
>> - Make sure dnsperf (the "source machine") is not 100% utilized.
>> - Try to increase the number of sockets used by dnsperf, i.e. the -c parameter.
>> I would also try values like 500 and 1000 to see if it makes any
>> difference. It might change the results significantly, because the Linux
>> kernel uses hashes over some packet fields, and a low number of sockets
>> might result in uneven query distribution.
>>
>> Please let us know what your new results are.
>>
>
> The source machine is about 15% utilized.
>
> ./dnsperf -s 10.0.0.4 -d out -n 20 -c 512 -T 512 -t 500 -S 1 -q 1000 -D
>
> gives us some performance penalty (260000 rps vs. 310000 rps) and a more even
> distribution across all cores, with 100% usage of all eight cores on the
> last CPU socket, while the cores of the other CPU sockets are approximately
> 60% loaded.
>
> Using "-c 1000 -T 1000" parameters of dnsperf i see practicaly the same
> core load distribution and even more performance penalty.
>
> Using "-c 16 -T 16" parameters i see 14 0% utilized cores, 16 100%
> utilized cores and 2 50% utilized cores with about 300000 rps
>
> The question is: what prevents a Knot thread on a 0%-utilized core from
> serving a packet that arrived via an IRQ bound to another core? Maybe you
> have some developer guide that can answer this question?
>