Mirror of https://github.com/torvalds/linux.git
(synced 2024-11-28 15:11:31 +00:00)
15c2d45d17
I noticed ksm spending quite a lot of time in memcmp on a large KVM box. The current memcmp loop is very unoptimised - byte at a time compares with no loop unrolling. We can do much much better.

Optimise the loop in a few ways:

- Unroll the byte at a time loop
- For large (at least 32 byte) comparisons that are also 8 byte aligned, use an unrolled modulo scheduled loop using 8 byte loads. This is similar to our glibc memcmp.

A simple microbenchmark testing 10000000 iterations of an 8192 byte memcmp was used to measure the performance:

baseline: 29.93 s
modified: 1.70 s

Just over 17x faster.

v2: Incorporated some suggestions from Segher:

- Use andi. instead of rldicl.
- Convert bdnzt eq, to bdnz. It's just duplicating the earlier compare and was a relic from a previous version.
- Don't use cr5, we have plans to use that CR field for fast local atomics.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

File | ||
---|---|---|
alloc.c | ||
checksum_32.S | ||
checksum_64.S | ||
checksum_wrappers_64.c | ||
code-patching.c | ||
copy_32.S | ||
copypage_64.S | ||
copypage_power7.S | ||
copyuser_64.S | ||
copyuser_power7.S | ||
crtsavres.S | ||
div64.S | ||
feature-fixups-test.S | ||
feature-fixups.c | ||
hweight_64.S | ||
ldstfp.S | ||
locks.c | ||
Makefile | ||
mem_64.S | ||
memcmp_64.S | ||
memcpy_64.S | ||
memcpy_power7.S | ||
ppc_ksyms.c | ||
rheap.c | ||
sstep.c | ||
string_64.S | ||
string.S | ||
usercopy_64.c | ||
vmx-helper.c | ||
xor_vmx.c |
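
The two optimisations named in the commit message (an aligned fast path using 8 byte loads for comparisons of at least 32 bytes, and a byte at a time fallback) can be illustrated with a rough portable C sketch. This is not the code in memcmp_64.S: the real routine is hand-written PowerPC assembly with a modulo scheduled loop, and none of that scheduling is reproduced here. The names sketch_memcmp and cmp_bytes are hypothetical and exist only for this illustration. The sketch assumes nothing about endianness: when a 32 byte block differs, it falls back to a byte compare to find the first differing byte.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte at a time compare. Returns <0, 0 or >0, like memcmp.
 * Hypothetical helper, not taken from the kernel source. */
static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return (int)a[i] - (int)b[i];
    }
    return 0;
}

/* Illustrative memcmp with an aligned fast path (hypothetical name). */
int sketch_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *a = s1;
    const unsigned char *b = s2;

    /* Fast path: at least 32 bytes and both pointers 8 byte aligned. */
    if (n >= 32 && ((((uintptr_t)a) | ((uintptr_t)b)) & 7) == 0) {
        while (n >= 32) {
            uint64_t x[4], y[4];

            /* Four 8 byte loads per side, i.e. the loop is unrolled 4x. */
            memcpy(x, a, sizeof x);
            memcpy(y, b, sizeof y);

            /* Any set bit in the XORs means the 32 byte blocks differ. */
            if ((x[0] ^ y[0]) | (x[1] ^ y[1]) | (x[2] ^ y[2]) | (x[3] ^ y[3]))
                return cmp_bytes(a, b, 32); /* locate the first differing byte */

            a += 32;
            b += 32;
            n -= 32;
        }
    }

    /* Short, unaligned or leftover bytes: plain byte compare. */
    return cmp_bytes(a, b, n);
}
```

Calling sketch_memcmp on two identical 8 byte aligned 8192 byte buffers exercises only the fast path, which is the case the microbenchmark in the commit message measures. Resolving a mismatch with a byte compare keeps the sketch endian-neutral, at the cost of rescanning up to 32 bytes once per call.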