efdb25efc7
Improve the performance of the crc32() asm routines by getting rid of most
of the branches and small sized loads on the common path.

Instead, use a branchless code path involving overlapping 16 byte loads to
process the first (length % 32) bytes, and process the remainder using a
loop that processes 32 bytes at a time.

Tested using the following test program:

    #include <stdlib.h>

    extern void crc32_le(unsigned short, char const*, int);

    int main(void)
    {
      static const char buf[4096];

      srand(20181126);

      for (int i = 0; i < 100 * 1000 * 1000; i++)
        crc32_le(0, buf, rand() % 1024);

      return 0;
    }

On Cortex-A53 and Cortex-A57, the performance regresses but only very
slightly. On Cortex-A72 however, the performance improves from

    $ time ./crc32

    real  0m10.149s
    user  0m10.149s
    sys   0m0.000s

to

    $ time ./crc32

    real  0m7.915s
    user  0m7.915s
    sys   0m0.000s

Cc: Rui Sun <sunrui26@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
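The split described in the commit message (handle the first `length % 32` bytes, then loop over 32-byte blocks) can be illustrated in portable C. This is only a sketch under stated assumptions, not the arm64 assembly from the commit: the names `crc32_bytes` and `crc32_le_sketch` are hypothetical, the per-byte update is a plain bitwise CRC-32 (reflected polynomial 0xEDB88320) standing in for the hardware CRC instructions, and the head is processed with an ordinary loop rather than the branchless overlapping 16 byte loads the commit uses.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fold n bytes into a reflected CRC-32
 * (polynomial 0xEDB88320), one bit at a time. It stands in for the
 * hardware CRC instructions used by the real assembly routine.
 */
static uint32_t crc32_bytes(uint32_t crc, const unsigned char *p, size_t n)
{
	while (n--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Hypothetical sketch of the head-then-blocks structure: consume the
 * first (len % 32) bytes, then the remainder 32 bytes per iteration.
 * No seed inversion is applied here; the caller chooses the seed.
 */
uint32_t crc32_le_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
	size_t head = len % 32;

	/* Head: the first (len % 32) bytes. */
	crc = crc32_bytes(crc, p, head);
	p += head;
	len -= head;

	/* Remainder: len is now a multiple of 32. */
	while (len) {
		crc = crc32_bytes(crc, p, 32);
		p += 32;
		len -= 32;
	}
	return crc;
}
```

With a seed of `0xFFFFFFFF` and a final XOR with `0xFFFFFFFF` applied by the caller, the sketch yields the standard CRC-32 check value `0xCBF43926` for the nine ASCII bytes `"123456789"`. Doing the odd-sized head first means the main loop never needs a tail check, which is the property the commit relies on to keep the common path free of branches.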

- atomic_ll_sc.c
- clear_page.S
- clear_user.S
- copy_from_user.S
- copy_in_user.S
- copy_page.S
- copy_template.S
- copy_to_user.S
- crc32.S
- delay.c
- Makefile
- memchr.S
- memcmp.S
- memcpy.S
- memmove.S
- memset.S
- strchr.S
- strcmp.S
- strlen.S
- strncmp.S
- strnlen.S
- strrchr.S
- tishift.S
- uaccess_flushcache.c