linux/arch
Linus Torvalds c228d294f2 x86: explicitly align IO accesses in memcpy_{to,from}io
In commit 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
I made our copy from IO space use a separate copy routine rather than
rely on the generic memcpy.  I did that because our generic memory copy
isn't actually well-defined when it comes to internal access ordering or
alignment, and will in fact depend on various CPUID flags.

In particular, the default memcpy() for a modern Intel CPU will
generally be just a "rep movsb", which works reasonably well for
medium-sized memory copies of regular RAM, since the CPU will turn it
into fairly optimized microcode.

However, for non-cached memory and IO, "rep movs" ends up being
horrendously slow and will just do the architectural "one byte at a
time" accesses implied by the movsb.

At the other end of the spectrum, if you _don't_ end up using the "rep
movsb" code, you'd likely fall back to the software copy, which does
overlapping accesses for the tail, and may copy things backwards.
Again, for regular memory that's fine, for IO memory not so much.

The thinking was that clearly nobody really cared (because things
worked), but some people had seen horrible performance due to the byte
accesses, so let's just revert back to our long ago version that dod
"rep movsl" for the bulk of the copy, and then fixed up the potentially
last few bytes of the tail with "movsw/b".

Interestingly (and perhaps not entirely surprisingly), while that was
our original memory copy implementation, and had been used before for
IO, in the meantime many new users of memcpy_*io() had come about.  And
while the access patterns for the memory copy weren't well-defined (so
arguably _any_ access pattern should work), in practice the "rep movsb"
case had been very common for the last several years.

In particular Jarkko Sakkinen reported that the memcpy_*io() change
resuled in weird errors from his Geminilake NUC TPM module.

And it turns out that the TPM TCG accesses according to spec require
that the accesses be

 (a) done strictly sequentially

 (b) be naturally aligned

otherwise the TPM chip will abort the PCI transaction.

And, in fact, the tpm_crb.c driver did this:

	memcpy_fromio(buf, priv->rsp, 6);
	...
	memcpy_fromio(&buf[6], &priv->rsp[6], expected - 6);

which really should never have worked in the first place, but back
before commit 170d13ca3a it *happened* to work, because the
memcpy_fromio() would be expanded to a regular memcpy, and

 (a) gcc would expand the first memcpy in-line, and turn it into a
     4-byte and a 2-byte read, and they happened to be in the right
     order, and the alignment was right.

 (b) gcc would call "memcpy()" for the second one, and the machines that
     had this TPM chip also apparently ended up always having ERMS
     ("Enhanced REP MOVSB/STOSB instructions"), so we'd use the "rep
     movbs" for that copy.

In other words, basically by pure luck, the code happened to use the
right access sizes in the (two different!) memcpy() implementations to
make it all work.

But after commit 170d13ca3a, both of the memcpy_fromio() calls
resulted in a call to the routine with the consistent memory accesses,
and in both cases it started out transferring with 4-byte accesses.
Which worked for the first copy, but resulted in the second copy doing a
32-bit read at an address that was only 2-byte aligned.

Jarkko is actually fixing the fragile code in the TPM driver, but since
this is an excellent example of why we absolutely must not use a generic
memcpy for IO accesses, _and_ an IO-specific one really should strive to
align the IO accesses, let's do exactly that.

Side note: Jarkko also noted that the driver had been used on ARM
platforms, and had worked.  That was because on 32-bit ARM, memcpy_*io()
ends up always doing byte accesses, and on 64-bit ARM it first does byte
accesses to align to 8-byte boundaries, and then does 8-byte accesses
for the bulk.

So ARM actually worked by design, and the x86 case worked by pure luck.

We *might* want to make x86-64 do the 8-byte case too.  That should be a
pretty straightforward extension, but let's do one thing at a time.  And
generally MMIO accesses aren't really all that performance-critical, as
shown by the fact that for a long time we just did them a byte at a
time, and very few people ever noticed.

Reported-and-tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Tested-by: Jerry Snitselaar <jsnitsel@redhat.com>
Cc: David Laight <David.Laight@aculab.com>
Fixes: 170d13ca3a ("x86: re-introduce non-generic memcpy_{to,from}io")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01 09:07:48 -08:00
..
alpha Kbuild late updates for v4.21 2019-01-06 16:33:10 -08:00
arc ARCv2: lib: memeset: fix doing prefetchw outside of buffer 2019-01-17 16:24:39 -08:00
arm pci-v5.0-fixes-3 2019-01-31 23:06:17 -08:00
arm64 arm64/xen: fix xen-swiotlb cache flushing 2019-01-23 22:14:56 +01:00
c6x arch: restore generic-y += shmparam.h for some architectures 2019-01-06 18:16:11 -08:00
csky csky: fixup compile error with CPU 810. 2019-01-10 04:37:37 -08:00
h8300 kbuild: remove unused archmrproper 2019-01-16 23:31:17 +09:00
hexagon arch: restore generic-y += shmparam.h for some architectures 2019-01-06 18:16:11 -08:00
ia64 ia64: remove redundant 'export AWK' 2019-01-16 23:31:18 +09:00
m68k arch: restore generic-y += shmparam.h for some architectures 2019-01-06 18:16:11 -08:00
microblaze arch: restore generic-y += shmparam.h for some architectures 2019-01-06 18:16:11 -08:00
mips A few MIPS fixes for 5.0: 2019-01-20 10:33:18 +12:00
nds32 nds32: remove unneeded code in arch/nds32/Makefile 2019-01-17 23:42:37 +09:00
nios2 arch: remove redundant UAPI generic-y defines 2019-01-06 10:22:15 +09:00
openrisc openrisc: remove unneeded code in arch/openrisc/Makefile 2019-01-17 23:42:59 +09:00
parisc arch: remove redundant UAPI generic-y defines 2019-01-06 10:22:15 +09:00
powerpc A single build fix for powerpc due to device_node.type removal 2019-01-20 10:28:46 +12:00
riscv Fix a handful of audit-related issue 2019-01-07 08:45:47 -08:00
s390 s390/smp: Fix calling smp_call_ipl_cpu() from ipl CPU 2019-01-11 17:12:03 +01:00
sh Kbuild late updates for v4.21 2019-01-06 16:33:10 -08:00
sparc arch: remove redundant UAPI generic-y defines 2019-01-06 10:22:15 +09:00
um Merge branch 'akpm' (patches from Andrew) 2019-01-05 09:16:18 -08:00
unicore32 arch: restore generic-y += shmparam.h for some architectures 2019-01-06 18:16:11 -08:00
x86 x86: explicitly align IO accesses in memcpy_{to,from}io 2019-02-01 09:07:48 -08:00
xtensa arch: remove redundant UAPI generic-y defines 2019-01-06 10:22:15 +09:00
.gitignore
Kconfig jump_label: move 'asm goto' support test to Kconfig 2019-01-06 09:46:51 +09:00