__clear_user and copy_page load from the TOC and are also exported
to modules. This means we have to use _GLOBAL_TOC() so that we
create the global entry point that sets up the TOC.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I blame Mikey for this. He elevated my slightly dubious testcase:
to benchmark status. And naturally we need to be number 1 at creating
zeros. So lets improve __clear_user some more.
As Paul suggests we can use dcbz for large lengths. This patch gets
the destination cacheline aligned then uses dcbz on whole cachelines.
Before:
10485760000 bytes (10 GB) copied, 0.414744 s, 25.3 GB/s
After:
10485760000 bytes (10 GB) copied, 0.268597 s, 39.0 GB/s
39 GB/s, a new record.
Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Olof Johansson <olof@lixom.net>
Acked-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I noticed __clear_user high up in a profile of one of my RAID stress
tests. The testcase was doing a dd from /dev/zero which ends up
calling __clear_user.
__clear_user is basically a loop with a single 4 byte store which
is horribly slow. We can do much better by aligning the desination
and doing 32 bytes of 8 byte stores in a loop.
The following testcase was used to verify the patch:
http://ozlabs.org/~anton/junkcode/stress_clear_user.c
To show the improvement in performance I ran a dd from /dev/zero
to /dev/null on a POWER7 box:
Before:
# dd if=/dev/zero of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 3.72379 s, 2.8 GB/s
After:
# time dd if=/dev/zero of=/dev/null bs=1M count=10000
10485760000 bytes (10 GB) copied, 0.728318 s, 14.4 GB/s
Over 5x faster.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>