linux

Author	SHA1	Message	Date
Ard Biesheuvel	660d206219	crypto - shash: reduce minimum alignment of shash_desc structure Unlike many other structure types defined in the crypto API, the 'shash_desc' structure is permitted to live on the stack, which implies its contents may not be accessed by DMA masters. (This is due to the fact that the stack may be located in the vmalloc area, which requires a different virtual-to-physical translation than the one implemented by the DMA subsystem) Our definition of CRYPTO_MINALIGN_ATTR is based on ARCH_KMALLOC_MINALIGN, which may take DMA constraints into account on architectures that support non-cache coherent DMA such as ARM and arm64. In this case, the value is chosen to reflect the largest cacheline size in the system, in order to ensure that explicit cache maintenance as required by non-coherent DMA masters does not affect adjacent, unrelated slab allocations. On arm64, this value is currently set at 128 bytes. This means that applying CRYPTO_MINALIGN_ATTR to struct shash_desc is both unnecessary (as it is never used for DMA), and undesirable, given that it wastes stack space (on arm64, performing the alignment costs 112 bytes in the worst case, and the hole between the 'tfm' and '__ctx' members takes up another 120 bytes, resulting in an increased stack footprint of up to 232 bytes.) So instead, let's switch to the minimum SLAB alignment, which does not take DMA constraints into account. Note that this is a no-op for x86. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-22 14:58:01 +11:00
Daniele Alessandrelli	e2811196fb	crypto: keembay-ocs-hcu - Add dependency on HAS_IOMEM and ARCH_KEEMBAY Add the following additional dependencies for CRYPTO_DEV_KEEMBAY_OCS_HCU: - HAS_IOMEM to prevent build failures - ARCH_KEEMBAY to prevent asking the user about this driver when configuring a kernel without Intel Keem Bay platform support. Signed-off-by: Daniele Alessandrelli <daniele.alessandrelli@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:43:10 +11:00
Dan Carpenter	2aa3da2d34	crypto: keembay-ocs-hcu - Fix a WARN() message The first argument to WARN() is a condition and the messages is the second argument is the string, so this WARN() will only display the __func__ part of the message. Fixes: `ae832e329a` ("crypto: keembay-ocs-hcu - Add HMAC support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Daniele Alessandrelli <daniele.alessandrelli@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:30 +11:00
Ard Biesheuvel	a04ea6f7ff	crypto: x86 - use local headers for x86 specific shared declarations The Camellia, Serpent and Twofish related header files only contain declarations that are shared between different implementations of the respective algorithms residing under arch/x86/crypto, and none of their contents should be used elsewhere. So move the header files into the same location, and use local #includes instead. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:30 +11:00
Ard Biesheuvel	64ca771cd6	crypto: x86 - remove glue helper module All dependencies on the x86 glue helper module have been replaced by local instantiations of the new ECB/CBC preprocessor helper macros, so the glue helper module can be retired. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	165f357334	crypto: x86/twofish - drop dependency on glue helper Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	ea55cfc3f9	crypto: x86/cast6 - drop dependency on glue helper Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	674d40abac	crypto: x86/cast5 - drop dependency on glue helper Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	9ad58b46f8	crypto: x86/serpent - drop dependency on glue helper Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	407d409a81	crypto: x86/camellia - drop dependency on glue helper Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	827ee47228	crypto: x86 - add some helper macros for ECB and CBC modes The x86 glue helper module is starting to show its age: - It relies heavily on function pointers to invoke asm helper functions that operate on fixed input sizes that are relatively small. This means the performance is severely impacted by retpolines. - It goes to great lengths to amortize the cost of kernel_fpu_begin()/end() over as much work as possible, which is no longer necessary now that FPU save/restore is done lazily, and doing so may cause unbounded scheduling blackouts due to the fact that enabling the FPU in kernel mode disables preemption. - The CBC mode decryption helper makes backward strides through the input, in order to avoid a single block size memcpy() between chunks. Consuming the input in this manner is highly likely to defeat any hardware prefetchers, so it is better to go through the data linearly, and perform the extra memcpy() where needed (which is turned into direct loads and stores by the compiler anyway). Note that benchmarks won't show this effect, given that the memory they use is always cache hot. - It implements blockwise XOR in terms of le128 pointers, which imply an alignment that is not guaranteed by the API, violating the C standard. GCC does not seem to be smart enough to elide the indirect calls when the function pointers are passed as arguments to static inline helper routines modeled after the existing ones. So instead, let's create some CPP macros that encapsulate the core of the ECB and CBC processing, so we can wire them up for existing users of the glue helper module, i.e., Camellia, Serpent, Twofish and CAST6. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	c0a64926c5	crypto: x86/blowfish - drop CTR mode implementation Blowfish in counter mode is never used in the kernel, so there is no point in keeping an accelerated implementation around. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:29 +11:00
Ard Biesheuvel	768db5fee3	crypto: x86/des - drop CTR mode implementation DES or Triple DES in counter mode is never used in the kernel, so there is no point in keeping an accelerated implementation around. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	89b7ba5c8b	crypto: x86/glue-helper - drop CTR helper routines The glue helper's CTR routines are no longer used, so drop them. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	f43dcaf2c9	crypto: x86/twofish - drop CTR mode implementation Twofish in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	7a6623cc68	crypto: x86/cast6 - drop CTR mode implementation CAST6 in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	e2d60e2f59	crypto: x86/cast5 - drop CTR mode implementation CAST5 in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	2e9440ae6e	crypto: x86/serpent - drop CTR mode implementation Serpent in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	a1f91ecf81	crypto: x86/camellia - drop CTR mode implementation Camellia in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:28 +11:00
Ard Biesheuvel	31d49c448a	crypto: x86/glue-helper - drop XTS helper routines The glue helper's XTS routines are no longer used, so drop them. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Ard Biesheuvel	da4df93a94	crypto: x86/twofish - switch to XTS template Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Twofish in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Ard Biesheuvel	9ec0af8aa6	crypto: x86/serpent- switch to XTS template Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Serpent in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Ard Biesheuvel	2cc0fedb81	crypto: x86/cast6 - switch to XTS template Now that the XTS template can wrap accelerated ECB modes, it can be used to implement CAST6 in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Ard Biesheuvel	55a7e88f01	crypto: x86/camellia - switch to XTS template Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Camellia in XTS mode as well, which turns out to be at least as fast, and sometimes even faster. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Bhaskar Chowdhury	4d6a5a4b1e	crypto: marvell/cesa - Fix a spelling s/fautly/faultly/ in comment s/fautly/faulty/p Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Kai Ye	34932a6033	crypto: hisilicon/sec - register SEC device to uacce Register SEC device to uacce framework for user space. Signed-off-by: Kai Ye <yekai13@huawei.com> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com> Reviewed-by: Zaibo Xu <xuzaibo@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:27 +11:00
Kai Ye	bedd04e4aa	crypto: hisilicon/hpre - register HPRE device to uacce Register HPRE device to uacce framework for user space. Signed-off-by: Kai Ye <yekai13@huawei.com> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com> Reviewed-by: Zaibo Xu <xuzaibo@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Kai Ye	f8408d2b79	crypto: hisilicon - add ZIP device using mode parameter Add 'uacce_mode' parameter for ZIP, which can be set as 0(default) or 1. '0' means ZIP is only registered to kernel crypto, and '1' means it's registered to both kernel crypto and UACCE. Signed-off-by: Kai Ye <yekai13@huawei.com> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com> Reviewed-by: Zaibo Xu <xuzaibo@huawei.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Kai Ye	0d61c3f144	crypto: hisilicon/qm - SVA bugfixed on Kunpeng920 Kunpeng920 SEC/HPRE/ZIP cannot support running user space SVA and kernel Crypto at the same time. Therefore, the algorithms should not be registered to Crypto as user space SVA is enabled. Signed-off-by: Kai Ye <yekai13@huawei.com> Reviewed-by: Zaibo Xu <xuzaibo@huawei.com> Reviewed-by: Zhou Wang <wangzhou1@hisilicon.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Jiri Olsa	f7f2b43eaf	crypto: bcm - Rename struct device_private to bcm_device_private Renaming 'struct device_private' to 'struct bcm_device_private', because it clashes with 'struct device_private' from 'drivers/base/base.h'. While it's not a functional problem, it's causing two distinct type hierarchies in BTF data. It also breaks build with options: CONFIG_DEBUG_INFO_BTF=y CONFIG_CRYPTO_DEV_BCM_SPU=y as reported by Qais Yousef [1]. [1] https://lore.kernel.org/lkml/20201229151352.6hzmjvu3qh6p2qgg@e107158-lin/ Fixes: `9d12ba86f8` ("crypto: brcm - Add Broadcom SPU driver") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Adam Guerin	e48767c177	crypto: qat - reduce size of mapped region Restrict size of field to what is required by the operation. This issue was detected by smatch: drivers/crypto/qat/qat_common/qat_asym_algs.c:328 qat_dh_compute_value() error: dma_map_single_attrs() '&qat_req->in.dh.in.b' too small (8 vs 64) Signed-off-by: Adam Guerin <adam.guerin@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Adam Guerin	80fccf18fe	crypto: qat - change format string and cast ring size Cast ADF_SIZE_TO_RING_SIZE_IN_BYTES() so it can return a 64 bit value. This issue was detected by smatch: drivers/crypto/qat/qat_common/adf_transport_debug.c:65 adf_ring_show() warn: should '(1 << (ring->ring_size - 1)) << 7' be a 64 bit type? Signed-off-by: Adam Guerin <adam.guerin@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Adam Guerin	1aaae055d4	crypto: qat - fix potential spectre issue Sanitize ring_num value coming from configuration (and potentially from user space) before it is used as index in the banks array. This issue was detected by smatch: drivers/crypto/qat/qat_common/adf_transport.c:233 adf_create_ring() warn: potential spectre issue 'bank->rings' [r] (local cap) Signed-off-by: Adam Guerin <adam.guerin@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Wojciech Ziemba	0db0d797ab	crypto: qat - configure arbiter mapping based on engines enabled The hardware specific function adf_get_arbiter_mapping() modifies the static array thrd_to_arb_map to disable mappings for AEs that are disabled. This static array is used for each device of the same type. If the ae mask is not identical for all devices of the same type then the arbiter mapping returned by adf_get_arbiter_mapping() may be wrong. This patch fixes this problem by ensuring the static arbiter mapping is unchanged and the device arbiter mapping is re-calculated each time based on the static mapping. Signed-off-by: Wojciech Ziemba <wojciech.ziemba@intel.com> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:26 +11:00
Ard Biesheuvel	d6cbf4eaa4	crypto: aesni - replace function pointers with static branches Replace the function pointers in the GCM implementation with static branches, which are based on code patching, which occurs only at module load time. This avoids the severe performance penalty caused by the use of retpolines. In order to retain the ability to switch between different versions of the implementation based on the input size on cores that support AVX and AVX2, use static branches instead of static calls. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Ard Biesheuvel	83c83e6588	crypto: aesni - refactor scatterlist processing Currently, the gcm(aes-ni) driver open codes the scatterlist handling that is encapsulated by the skcipher walk API. So let's switch to that instead. Also, move the handling at the end of gcmaes_crypt_by_sg() that is dependent on whether we are encrypting or decrypting into the callers, which always do one or the other. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Ard Biesheuvel	2694e23ffd	crypto: aesni - clean up mapping of associated data The gcm(aes-ni) driver is only built for x86_64, which does not make use of highmem. So testing for PageHighMem is pointless and can be omitted. While at it, replace GFP_ATOMIC with the appropriate runtime decided value based on the context. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Ard Biesheuvel	30f2c18eb5	crypto: aesni - drop unused asm prototypes Drop some prototypes that are declared but never called. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Ard Biesheuvel	a13ed1d15b	crypto: aesni - prevent misaligned buffers on the stack The GCM mode driver uses 16 byte aligned buffers on the stack to pass the IV to the asm helpers, but unfortunately, the x86 port does not guarantee that the stack pointer is 16 byte aligned upon entry in the first place. Since the compiler is not aware of this, it will not emit the additional stack realignment sequence that is needed, and so the alignment is not guaranteed to be more than 8 bytes. So instead, allocate some padding on the stack, and realign the IV pointer by hand. Cc: <stable@vger.kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Marco Chiappero	4f1a02e75a	crypto: qat - replace CRYPTO_AES with CRYPTO_LIB_AES in Kconfig Use CRYPTO_LIB_AES in place of CRYPTO_AES in the dependences for the QAT common code. Fixes: `c0e583ab20` ("crypto: qat - add CRYPTO_AES to Kconfig dependencies") Reported-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Marco Chiappero <marco.chiappero@intel.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Herbert Xu	81064c96d8	crypto: stm32 - Fix last sparse warning in stm32_cryp_check_ctr_counter This patch changes the cast in stm32_cryp_check_ctr_counter from u32 to __be32 to match the prototype of stm32_cryp_hw_write_iv correctly. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-14 17:10:25 +11:00
Herbert Xu	622aae879c	crypto: vmx - Move extern declarations into header file This patch moves the extern algorithm declarations into a header file so that a number of compiler warnings are silenced. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-08 15:39:47 +11:00
Ard Biesheuvel	2481104fe9	crypto: x86/aes-ni-xts - rewrite and drop indirections via glue helper The AES-NI driver implements XTS via the glue helper, which consumes a struct with sets of function pointers which are invoked on chunks of input data of the appropriate size, as annotated in the struct. Let's get rid of this indirection, so that we can perform direct calls to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e., provide a helper which can consume inputs of any size, provided that the penultimate, full block is passed via the last call if ciphertext stealing needs to be applied. This also allows us to enable the XTS mode for i386. Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-08 15:39:47 +11:00
Ard Biesheuvel	86ad60a65f	crypto: x86/aes-ni-xts - use direct calls to and 4-way stride The XTS asm helper arrangement is a bit odd: the 8-way stride helper consists of back-to-back calls to the 4-way core transforms, which are called indirectly, based on a boolean that indicates whether we are performing encryption or decryption. Given how costly indirect calls are on x86, let's switch to direct calls, and given how the 8-way stride doesn't really add anything substantial, use a 4-way stride instead, and make the asm core routine deal with any multiple of 4 blocks. Since 512 byte sectors or 4 KB blocks are the typical quantities XTS operates on, increase the stride exported to the glue helper to 512 bytes as well. As a result, the number of indirect calls is reduced from 3 per 64 bytes of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup when operating on 1 KB blocks (measured on a Intel(R) Core(TM) i7-8650U CPU) Fixes: `9697fa39ef` ("x86/retpoline/crypto: Convert crypto assembler indirect jumps") Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-08 15:39:47 +11:00
Rob Herring	fecff3b931	crypto: picoxcell - Remove PicoXcell driver PicoXcell has had nothing but treewide cleanups for at least the last 8 years and no signs of activity. The most recent activity is a yocto vendor kernel based on v3.0 in 2015. Cc: Jamie Iles <jamie@jamieiles.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-crypto@vger.kernel.org Signed-off-by: Rob Herring <robh@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-03 09:03:36 +11:00
Eric Biggers	1862eb0073	crypto: arm/blake2b - add NEON-accelerated BLAKE2b Add a NEON-accelerated implementation of BLAKE2b. On Cortex-A7 (which these days is the most common ARM processor that doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as SHA-256, and slightly faster than SHA-1. It is also almost three times as fast as the generic implementation of BLAKE2b: Algorithm Cycles per byte (on 4096-byte messages) =================== ======================================= blake2b-256-neon 14.0 sha1-neon 16.3 blake2s-256-arm 18.8 sha1-asm 20.8 blake2s-256-generic 26.0 sha256-neon 28.9 sha256-asm 32.0 blake2b-256-generic 38.9 This implementation isn't directly based on any other implementation, but it borrows some ideas from previous NEON code I've written as well as from chacha-neon-core.S. At least on Cortex-A7, it is faster than the other NEON implementations of BLAKE2b I'm aware of (the implementation in the BLAKE2 official repository using intrinsics, and Andrew Moon's implementation which can be found in SUPERCOP). It does only one block at a time, so it performs well on short messages too. NEON-accelerated BLAKE2b is useful because there is interest in using BLAKE2b-256 for dm-verity on low-end Android devices (specifically, devices that lack the ARMv8 Crypto Extensions) to replace SHA-1. On these devices, the performance cost of upgrading to SHA-256 may be unacceptable, whereas BLAKE2b-256 would actually improve performance. Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which is intended for 32-bit platforms), on 32-bit ARM processors with NEON, BLAKE2b is actually faster than BLAKE2s. This is because NEON supports 64-bit operations, and because BLAKE2s's block size is too small for NEON to be helpful for it. The best I've been able to do with BLAKE2s on Cortex-A7 is 18.8 cpb with an optimized scalar implementation. (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but they're more complex as they require running multiple hashes at once. Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7, so I expect that any speedup from BLAKE2sp or BLAKE3 would come only from the smaller number of rounds, not from the extra parallelism.) For now this BLAKE2b implementation is only wired up to the shash API, since there is no library API for BLAKE2b yet. However, I've tried to keep things consistent with BLAKE2s, e.g. by defining blake2b_compress_arch() which is analogous to blake2s_compress_arch() and could be exported for use by the library API later if needed. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-03 08:41:39 +11:00
Eric Biggers	0cdc438e6e	crypto: blake2b - update file comment The file comment for blake2b_generic.c makes it sound like it's the reference implementation of BLAKE2b with only minor changes. But it's actually been changed a lot. Update the comment to make this clearer. Reviewed-by: David Sterba <dsterba@suse.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-03 08:41:39 +11:00
Eric Biggers	28dcca4cc0	crypto: blake2b - sync with blake2s implementation Sync the BLAKE2b code with the BLAKE2s code as much as possible: - Move a lot of code into new headers <crypto/blake2b.h> and <crypto/internal/blake2b.h>, and adjust it to be like the corresponding BLAKE2s code, i.e. like <crypto/blake2s.h> and <crypto/internal/blake2s.h>. - Rename constants, e.g. BLAKE2B__DIGEST_SIZE => BLAKE2B__HASH_SIZE. - Use a macro BLAKE2B_ALG() to define the shash_alg structs. - Export blake2b_compress_generic() for use as a fallback. This makes it much easier to add optimized implementations of BLAKE2b, as optimized implementations can use the helper functions crypto_blake2b_{setkey,init,update,final}() and blake2b_compress_generic(). The ARM implementation will use these. But this change is also helpful because it eliminates unnecessary differences between the BLAKE2b and BLAKE2s code, so that the same improvements can easily be made to both. (The two algorithms are basically identical, except for the word size and constants.) It also makes it straightforward to add a library API for BLAKE2b in the future if/when it's needed. This change does make the BLAKE2b code slightly more complicated than it needs to be, as it doesn't actually provide a library API yet. For example, __blake2b_update() doesn't really need to exist yet; it could just be inlined into crypto_blake2b_update(). But I believe this is outweighed by the benefits of keeping the code in sync. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-03 08:41:39 +11:00
Eric Biggers	a64bfe7ad4	wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM When available, select the new implementation of BLAKE2s for 32-bit ARM. This is faster than the generic C implementation. Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-03 08:41:39 +11:00
Eric Biggers	5172d322d3	crypto: arm/blake2s - add ARM scalar optimized BLAKE2s Add an ARM scalar optimized implementation of BLAKE2s. NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too small for NEON to help. Each NEON instruction would depend on the previous one, resulting in poor performance. With scalar instructions, on the other hand, we can take advantage of ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an implementation get runs much faster than the C implementation. Performance results on Cortex-A7 in cycles per byte using the shash API: 4096-byte messages: blake2s-256-arm: 18.8 blake2s-256-generic: 26.0 500-byte messages: blake2s-256-arm: 20.3 blake2s-256-generic: 27.9 100-byte messages: blake2s-256-arm: 29.7 blake2s-256-generic: 39.2 32-byte messages: blake2s-256-arm: 50.6 blake2s-256-generic: 66.2 Except on very short messages, this is still slower than the NEON implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8, and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively. However, optimized BLAKE2s is useful for cases where BLAKE2s is used instead of BLAKE2b, such as WireGuard. This new implementation is added in the form of a new module blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it provides blake2s_compress_arch() for use by the library API as well as optionally register the algorithms with the shash API. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2021-01-03 08:41:39 +11:00

1 2 3 4 5 ...

982088 Commits