linux/arch/x86
Matt Fleming bb5570ad3b x86/asm/64: Align start of __clear_user() loop to 16-bytes
x86 CPUs can suffer severe performance drops if a tight loop, such as
the ones in __clear_user(), straddles a 16-byte instruction fetch
window, or worse, a 64-byte cacheline. This issues was discovered in the
SUSE kernel with the following commit,

  1153933703 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")

which increased the code object size from 10 bytes to 15 bytes and
caused the 8-byte copy loop in __clear_user() to be split across a
64-byte cacheline.

Aligning the start of the loop to 16-bytes makes this fit neatly inside
a single instruction fetch window again and restores the performance of
__clear_user() which is used heavily when reading from /dev/zero.

Here are some numbers from running libmicro's read_z* and pread_z*
microbenchmarks which read from /dev/zero:

  Zen 1 (Naples)

  libmicro-file
                                        5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
                                                    revert-1153933703d9+               align16+
  Time mean95-pread_z100k       9.9195 (   0.00%)      5.9856 (  39.66%)      5.9938 (  39.58%)
  Time mean95-pread_z10k        1.1378 (   0.00%)      0.7450 (  34.52%)      0.7467 (  34.38%)
  Time mean95-pread_z1k         0.2623 (   0.00%)      0.2251 (  14.18%)      0.2252 (  14.15%)
  Time mean95-pread_zw100k      9.9974 (   0.00%)      6.0648 (  39.34%)      6.0756 (  39.23%)
  Time mean95-read_z100k        9.8940 (   0.00%)      5.9885 (  39.47%)      5.9994 (  39.36%)
  Time mean95-read_z10k         1.1394 (   0.00%)      0.7483 (  34.33%)      0.7482 (  34.33%)

Note that this doesn't affect Haswell or Broadwell microarchitectures
which seem to avoid the alignment issue by executing the loop straight
out of the Loop Stream Detector (verified using perf events).

Fixes: 1153933703 ("x86/asm/64: Micro-optimize __clear_user() - Use immediate constants")
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # v4.19+
Link: https://lkml.kernel.org/r/20200618102002.30034-1-matt@codeblueprint.co.uk
2020-06-19 18:32:11 +02:00
..
boot Rebase locking/kcsan to locking/urgent 2020-06-11 20:02:46 +02:00
configs
crypto There are a lot of objtool changes in this cycle, all across the map: 2020-06-01 13:13:00 -07:00
entry The X86 entry, exception and interrupt code rework 2020-06-13 10:05:47 -07:00
events treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
hyperv x86/entry: Convert various hypervisor vectors to IDTENTRY_SYSVEC 2020-06-11 15:15:15 +02:00
ia32 Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-06-04 14:07:08 -07:00
include x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup 2020-06-15 14:18:37 +02:00
kernel x86/cpu: Use pinning mask for CR4 bits needing to be 0 2020-06-18 11:41:32 +02:00
kvm Kbuild updates for v5.8 (2nd) 2020-06-13 13:29:16 -07:00
lib x86/asm/64: Align start of __clear_user() loop to 16-bytes 2020-06-19 18:32:11 +02:00
math-emu
mm The X86 entry, exception and interrupt code rework 2020-06-13 10:05:47 -07:00
net bpf, i386: Remove unneeded conversion to bool 2020-05-07 16:29:14 +02:00
oprofile
pci Merge branch 'pci/virtualization' 2020-06-04 12:59:13 -05:00
platform x86/entry: Convert various system vectors 2020-06-11 15:15:14 +02:00
power x86/cpu: Reinitialize IA32_FEAT_CTL MSR on BSP during wakeup 2020-06-15 14:18:37 +02:00
purgatory
ras treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
realmode Rebase locking/kcsan to locking/urgent 2020-06-11 20:02:46 +02:00
tools
um mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
video
xen xen: Move xen_setup_callback_vector() definition to include/xen/hvm.h 2020-06-11 15:15:19 +02:00
.gitignore
Kbuild
Kconfig Kbuild updates for v5.8 (2nd) 2020-06-13 13:29:16 -07:00
Kconfig.assembler x86/delay: Introduce TPAUSE delay 2020-05-07 16:06:20 +02:00
Kconfig.cpu treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
Kconfig.debug treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
Makefile
Makefile_32.cpu
Makefile.um