Convert the x86 implementation of AEGIS-128 to use the AEAD SIMD
helpers, rather than hand-rolling the same functionality. This
simplifies the code and also fixes the bug where the user-provided
aead_request is modified.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AES-NI implementations of "gcm(aes)" and "rfc4106(gcm(aes))"
to use the AEAD SIMD helpers, rather than hand-rolling the same
functionality. This simplifies the code and also fixes the bug where
the user-provided aead_request is modified.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AES-NI glue code to use simd_register_skciphers_compat() to
create SIMD wrappers for all the internal skcipher algorithms at once,
rather than wrapping each one individually. This simplifies the code.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
1-block SSE2 variant of poly1305 stores variables s1..s4 containing key
material on the stack. This commit adds missing zeroing of the stack
memory. Benchmarks show negligible performance hit (tested on i7-3770).
Signed-off-by: Tommi Hirvola <tommi@hirvola.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
gcmaes_crypt_by_sg() dereferences the NULL pointer returned by
scatterwalk_ffwd() when encrypting an empty plaintext and the source
scatterlist ends immediately after the associated data.
Fix it by only fast-forwarding to the src/dst data scatterlists if the
data length is nonzero.
This bug is reproduced by the "rfc4543(gcm(aes))" test vectors when run
with the new AEAD test manager.
Fixes: e845520707 ("crypto: aesni - Update aesni-intel_glue to use scatter/gather")
Cc: <stable@vger.kernel.org> # v4.17+
Cc: Dave Watson <davejwatson@fb.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 MORUS implementations all fail the improved AEAD tests because
they produce the wrong result with some data layouts. The issue is that
they assume that if the skcipher_walk API gives 'nbytes' not aligned to
the walksize (a.k.a. walk.stride), then it is the end of the data. In
fact, this can happen before the end.
Also, when the CRYPTO_TFM_REQ_MAY_SLEEP flag is given, they can
incorrectly sleep in the skcipher_walk_*() functions while preemption
has been disabled by kernel_fpu_begin().
Fix these bugs.
Fixes: 56e8e57fc3 ("crypto: morus - Add common SIMD glue code for MORUS")
Cc: <stable@vger.kernel.org> # v4.18+
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 AEGIS implementations all fail the improved AEAD tests because
they produce the wrong result with some data layouts. The issue is that
they assume that if the skcipher_walk API gives 'nbytes' not aligned to
the walksize (a.k.a. walk.stride), then it is the end of the data. In
fact, this can happen before the end.
Also, when the CRYPTO_TFM_REQ_MAY_SLEEP flag is given, they can
incorrectly sleep in the skcipher_walk_*() functions while preemption
has been disabled by kernel_fpu_begin().
Fix these bugs.
Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Cc: <stable@vger.kernel.org> # v4.18+
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86, arm, and arm64 asm implementations of crct10dif are very
difficult to understand partly because many of the comments, labels, and
macros are named incorrectly: the lengths mentioned are usually off by a
factor of two from the actual code. Many other things are unnecessarily
convoluted as well, e.g. there are many more fold constants than
actually needed and some aren't fully reduced.
This series therefore cleans up all these implementations to be much
more maintainable. I also made some small optimizations where I saw
opportunities, resulting in slightly better performance.
This patch cleans up the x86 version.
As part of this, I removed support for len < 16 from the x86 assembly;
now the glue code falls back to the generic table-based implementation
in this case. Due to the overhead of kernel_fpu_begin(), this actually
significantly improves performance on these lengths. (And even if
kernel_fpu_begin() were free, the generic code is still faster for about
len < 11.) This removal also eliminates error-prone special cases and
makes the x86, arm32, and arm64 ports of the code match more closely.
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add missing static keywords to fix the following sparse warnings:
arch/x86/crypto/aesni-intel_glue.c:197:24: warning: symbol 'aesni_gcm_tfm_sse' was not declared. Should it be static?
arch/x86/crypto/aesni-intel_glue.c:246:24: warning: symbol 'aesni_gcm_tfm_avx_gen2' was not declared. Should it be static?
arch/x86/crypto/aesni-intel_glue.c:291:24: warning: symbol 'aesni_gcm_tfm_avx_gen4' was not declared. Should it be static?
I also made the affected structures 'const', and adjusted the
indentation in the struct definition to not be insane.
Cc: Dave Watson <davejwatson@fb.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull crypto updates from Herbert Xu:
"API:
- Add 1472-byte test to tcrypt for IPsec
- Reintroduced crypto stats interface with numerous changes
- Support incremental algorithm dumps
Algorithms:
- Add xchacha12/20
- Add nhpoly1305
- Add adiantum
- Add streebog hash
- Mark cts(cbc(aes)) as FIPS allowed
Drivers:
- Improve performance of arm64/chacha20
- Improve performance of x86/chacha20
- Add NEON-accelerated nhpoly1305
- Add SSE2 accelerated nhpoly1305
- Add AVX2 accelerated nhpoly1305
- Add support for 192/256-bit keys in gcmaes AVX
- Add SG support in gcmaes AVX
- ESN for inline IPsec tx in chcr
- Add support for CryptoCell 703 in ccree
- Add support for CryptoCell 713 in ccree
- Add SM4 support in ccree
- Add SM3 support in ccree
- Add support for chacha20 in caam/qi2
- Add support for chacha20 + poly1305 in caam/jr
- Add support for chacha20 + poly1305 in caam/qi2
- Add AEAD cipher support in cavium/nitrox"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (130 commits)
crypto: skcipher - remove remnants of internal IV generators
crypto: cavium/nitrox - Fix build with !CONFIG_DEBUG_FS
crypto: salsa20-generic - don't unnecessarily use atomic walk
crypto: skcipher - add might_sleep() to skcipher_walk_virt()
crypto: x86/chacha - avoid sleeping under kernel_fpu_begin()
crypto: cavium/nitrox - Added AEAD cipher support
crypto: mxc-scc - fix build warnings on ARM64
crypto: api - document missing stats member
crypto: user - remove unused dump functions
crypto: chelsio - Fix wrong error counter increments
crypto: chelsio - Reset counters on cxgb4 Detach
crypto: chelsio - Handle PCI shutdown event
crypto: chelsio - cleanup:send addr as value in function argument
crypto: chelsio - Use same value for both channel in single WR
crypto: chelsio - Swap location of AAD and IV sent in WR
crypto: chelsio - remove set but not used variable 'kctx_len'
crypto: ux500 - Use proper enum in hash_set_dma_transfer
crypto: ux500 - Use proper enum in cryp_set_dma_transfer
crypto: aesni - Add scatter/gather avx stubs, and use them in C
crypto: aesni - Introduce partial block macro
..
Passing atomic=true to skcipher_walk_virt() only makes the later
skcipher_walk_done() calls use atomic memory allocations, not
skcipher_walk_virt() itself. Thus, we have to move it outside of the
preemption-disabled region (kernel_fpu_begin()/kernel_fpu_end()).
(skcipher_walk_virt() only allocates memory for certain layouts of the
input scatterlist, hence why I didn't notice this earlier...)
Reported-by: syzbot+9bf843c33f782d73ae7d@syzkaller.appspotmail.com
Fixes: 4af7826187 ("crypto: x86/chacha20 - add XChaCha20 support")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add the appropriate scatter/gather stubs to the avx asm.
In the C code, we can now always use crypt_by_sg, since both
sse and asm code now support scatter/gather.
Introduce a new struct, aesni_gcm_tfm, that is initialized on
startup to point to either the SSE, AVX, or AVX2 versions of the
four necessary encryption/decryption routines.
GENX_OPTSIZE is still checked at the start of crypt_by_sg. The
total size of the data is checked, since the additional overhead
is in the init function, calculating additional HashKeys.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Before this diff, multiple calls to GCM_ENC_DEC will
succeed, but only if all calls are a multiple of 16 bytes.
Handle partial blocks at the start of GCM_ENC_DEC, and update
aadhash as appropriate.
The data offset %r11 is also updated after the partial block.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Introduce READ_PARTIAL_BLOCK macro, and use it in the two existing
partial block cases: AAD and the end of ENC_DEC. In particular,
the ENC_DEC case should be faster, since we read by 8/4 bytes if
possible.
This macro will also be used to read partial blocks between
enc_update and dec_update calls.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Prepare to handle partial blocks between scatter/gather calls.
For the last partial block, we only want to calculate the aadhash
in GCM_COMPLETE, and a new partial block macro will handle both
aadhash update and encrypting partial blocks between calls.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Fill in aadhash, aadlen, pblocklen, curcount with appropriate values.
pblocklen, aadhash, and pblockenckey are also updated at the end
of each scatter/gather operation, to be carried over to the next
operation.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The precompute functions differ only by the sub-macros
they call, merge them to a single macro. Later diffs
add more code to fill in the gcm_context_data structure,
this allows changes in a single place.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
AAD hash only needs to be calculated once for each scatter/gather operation.
Move it to its own macro, and call it from GCM_INIT instead of
INITIAL_BLOCKS.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Merge encode and decode tag calculations in GCM_COMPLETE macro.
Scatter/gather routines will call this once at the end of encryption
or decryption.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add support for 192/256-bit keys using the avx gcm/aes routines.
The sse routines were previously updated in e31ac32d3b (Add support
for 192 & 256 bit keys to AESNI RFC4106).
Instead of adding an additional loop in the hotpath as in e31ac32d3b,
this diff instead generates separate versions of the code using macros,
and the entry routines choose which version once. This results
in a 5% performance improvement vs. adding a loop to the hot path.
This is the same strategy chosen by the intel isa-l_crypto library.
The key size checks are removed from the c code where appropriate.
Note that this diff depends on using gcm_context_data - 256 bit keys
require 16 HashKeys + 15 expanded keys, which is larger than
struct crypto_aes_ctx, so they are stored in struct gcm_context_data.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Macro-ify function save and restore. These will be used in new functions
added for scatter/gather update operations.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add the gcm_context_data structure to the avx asm routines.
This will be necessary to support both 256 bit keys and
scatter/gather.
The pre-computed HashKeys are now stored in the gcm_context_data
struct, which is expanded to hold the greater number of hashkeys
necessary for avx.
Loads and stores to the new struct are always done unlaligned to
avoid compiler issues, see e5b954e8 "Use unaligned loads from
gcm_context_data"
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The GCM_ENC_DEC routines for AVX and AVX2 are identical, except they
call separate sub-macros. Pass the macros as arguments, and merge them.
This facilitates additional refactoring, by requiring changes in only
one place.
The GCM_ENC_DEC macro was moved above the CONFIG_AS_AVX* ifdefs,
since it will be used by both AVX and AVX2.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
To improve responsiveness, yield the FPU (temporarily re-enabling
preemption) every 4 KiB encrypted/decrypted, rather than keeping
preemption disabled during the entire encryption/decryption operation.
Alternatively we could do this for every skcipher_walk step, but steps
may be small in some cases, and yielding the FPU is expensive on x86.
Suggested-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that the x86_64 SIMD implementations of ChaCha20 and XChaCha20 have
been refactored to support varying the number of rounds, add support for
XChaCha12. This is identical to XChaCha20 except for the number of
rounds, which is 12 instead of 20. This can be used by Adiantum.
Reviewed-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation for adding XChaCha12 support, rename/refactor the x86_64
SIMD implementations of ChaCha20 to support different numbers of rounds.
Reviewed-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add an XChaCha20 implementation that is hooked up to the x86_64 SIMD
implementations of ChaCha20. This can be used by Adiantum.
An SSSE3 implementation of single-block HChaCha20 is also added so that
XChaCha20 can use it rather than the generic implementation. This
required refactoring the ChaCha permutation into its own function.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a 64-bit AVX2 implementation of NHPoly1305, an ε-almost-∆-universal
hash function used in the Adiantum encryption mode. For now, only the
NH portion is actually AVX2-accelerated; the Poly1305 part is less
performance-critical so is just implemented in C.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a 64-bit SSE2 implementation of NHPoly1305, an ε-almost-∆-universal
hash function used in the Adiantum encryption mode. For now, only the
NH portion is actually SSE2-accelerated; the Poly1305 part is less
performance-critical so is just implemented in C.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Go over arch/x86/ and fix common typos in comments,
and a typo in an actual function argument name.
No change in functionality intended.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This version uses the same principle as the AVX2 version by scheduling the
operations for two block pairs in parallel. It benefits from the AVX-512VL
rotate instructions and the more efficient partial block handling using
"vmovdqu8", resulting in a speedup of the raw block function of ~20%.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This version uses the same principle as the AVX2 version. It benefits
from the AVX-512VL rotate instructions and the more efficient partial
block handling using "vmovdqu8", resulting in a speedup of ~20%.
Unlike the AVX2 version, it is faster than the single block SSSE3 version
to process a single block. Hence we engage that function for (partial)
single block lengths as well.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This variant is similar to the AVX2 version, but benefits from the AVX-512
rotate instructions and the additional registers, so it can operate without
any data on the stack. It uses ymm registers only to avoid the massive core
throttling on Skylake-X platforms. Nontheless does it bring a ~30% speed
improvement compared to the AVX2 variant for random encryption lengths.
The AVX2 version uses "rep movsb" for partial block XORing via the stack.
With AVX-512, the new "vmovdqu8" can do this much more efficiently. The
associated "kmov" instructions to work with dynamic masks is not part of
the AVX-512VL instruction set, hence we depend on AVX-512BW as well. Given
that the major AVX-512VL architectures provide AVX-512BW and this extension
does not affect core clocking, this seems to be no problem at least for
now.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation for exposing a low-level Poly1305 API which implements
the ε-almost-∆-universal (εA∆U) hash function underlying the Poly1305
MAC and supports block-aligned inputs only, create structures
poly1305_key and poly1305_state which hold the limbs of the Poly1305
"r" key and accumulator, respectively.
These structures could actually have the same type (e.g. poly1305_val),
but different types are preferable, to prevent misuse.
Acked-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In preparation for adding XChaCha12 support, rename/refactor
chacha20-generic to support different numbers of rounds. The
justification for needing XChaCha12 support is explained in more detail
in the patch "crypto: chacha - add XChaCha12 support".
The only difference between ChaCha{8,12,20} are the number of rounds
itself; all other parts of the algorithm are the same. Therefore,
remove the "20" from all definitions, structures, functions, files, etc.
that will be shared by all ChaCha versions.
Also make ->setkey() store the round count in the chacha_ctx (previously
chacha20_ctx). The generic code then passes the round count through to
chacha_block(). There will be a ->setkey() function for each explicitly
allowed round count; the encrypt/decrypt functions will be the same. I
decided not to do it the opposite way (same ->setkey() function for all
round counts, with different encrypt/decrypt functions) because that
would have required more boilerplate code in architecture-specific
implementations of ChaCha and XChaCha.
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This variant builds upon the idea of the 2-block AVX2 variant that
shuffles words after each round. The shuffling has a rather high latency,
so the arithmetic units are not optimally used.
Given that we have plenty of registers in AVX, this version parallelizes
the 2-block variant to do four blocks. While the first two blocks are
shuffling, the CPU can do the XORing on the second two blocks and
vice-versa, which makes this version much faster than the SSSE3 variant
for four blocks. The latter is now mostly for systems that do not have
AVX2, but there it is the work-horse, so we keep it in place.
The partial XORing function trailer is very similar to the AVX2 2-block
variant. While it could be shared, that code segment is rather short;
profiling is also easier with the trailer integrated, so we keep it per
function.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This variant uses the same principle as the single block SSSE3 variant
by shuffling the state matrix after each round. With the wider AVX
registers, we can do two blocks in parallel, though.
This function can increase performance and efficiency significantly for
lengths that would otherwise require a 4-block function.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that all block functions support partial lengths, engage the wider
block sizes more aggressively. This prevents using smaller block
functions multiple times, where the next larger block function would
have been faster.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a length argument to the eight block function for AVX2, so the
block function may XOR only a partial length of eight blocks.
To avoid unnecessary operations, we integrate XORing of the first four
blocks in the final lane interleaving; this also avoids some work in
the partial lengths path.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a length argument to the quad block function for SSSE3, so the
block function may XOR only a partial length of four blocks.
As we already have the stack set up, the partial XORing does not need
to. This gives a slightly different function trailer, so we keep that
separate from the 1-block function.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add a length argument to the single block function for SSSE3, so the
block function may XOR only a partial length of the full block. Given
that the setup code is rather cheap, the function does not process more
than one block; this allows us to keep the block function selection in
the C glue code.
The required branching does not negatively affect performance for full
block sizes. The partial XORing uses simple "rep movsb" to copy the
data before and after doing XOR in SSE. This is rather efficient on
modern processors; movsw can be slightly faster, but the additional
complexity is probably not worth it.
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
aesni-intel_glue.c still calls crypto_fpu_init() and crypto_fpu_exit()
to register/unregister the "fpu" template. But these functions don't
exist anymore, causing a build error. Remove the calls to them.
Fixes: 944585a64f ("crypto: x86/aes-ni - remove special handling of AES in PCBC mode")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
For historical reasons, the AES-NI based implementation of the PCBC
chaining mode uses a special FPU chaining mode wrapper template to
amortize the FPU start/stop overhead over multiple blocks.
When this FPU wrapper was introduced, it supported widely used
chaining modes such as XTS and CTR (as well as LRW), but currently,
PCBC is the only remaining user.
Since there are no known users of pcbc(aes) in the kernel, let's remove
this special driver, and rely on the generic pcbc driver to encapsulate
the AES-NI core cipher.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
In the quest to remove all stack VLA usage from the kernel[1], this
replaces struct crypto_skcipher and SKCIPHER_REQUEST_ON_STACK() usage
with struct crypto_sync_skcipher and SYNC_SKCIPHER_REQUEST_ON_STACK(),
which uses a fixed stack size.
[1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
Cc: x86@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch fixes gcmaes_crypt_by_sg so that it won't use memory
allocation if the data doesn't cross a page boundary.
Authenticated encryption may be used by dm-crypt. If the encryption or
decryption fails, it would result in I/O error and filesystem corruption.
The function gcmaes_crypt_by_sg is using GFP_ATOMIC allocation that can
fail anytime. This patch fixes the logic so that it won't attempt the
failing allocation if the data doesn't cross a page boundary.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
It turns out OSXSAVE needs to be checked only for AVX, not for SSE.
Without this patch the affected modules refuse to load on CPUs with SSE2
but without AVX support.
Fixes: 877ccce7cb ("crypto: x86/aegis,morus - Fix and simplify CPUID checks")
Cc: <stable@vger.kernel.org> # 4.18
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
As it turns out, the AVX2 multibuffer SHA routines are currently
broken [0], in a way that would have likely been noticed if this
code were in wide use. Since the code is too complicated to be
maintained by anyone except the original authors, and since the
performance benefits for real-world use cases are debatable to
begin with, it is better to drop it entirely for the moment.
[0] https://marc.info/?l=linux-crypto-vger&m=153476243825350&w=2
Suggested-by: Eric Biggers <ebiggers@google.com>
Cc: Megha Dey <megha.dey@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull crypto fixes from Herbert Xu:
- Check for the right CPU feature bit in sm4-ce on arm64.
- Fix scatterwalk WARN_ON in aes-gcm-ce on arm64.
- Fix unaligned fault in aesni on x86.
- Fix potential NULL pointer dereference on exit in chtls.
- Fix DMA mapping direction for RSA in caam.
- Fix error path return value for xts setkey in caam.
- Fix address endianness when DMA unmapping in caam.
- Fix sleep-in-atomic in vmx.
- Fix command corruption when queue is full in cavium/nitrox.
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: cavium/nitrox - fix for command corruption in queue full case with backlog submissions.
crypto: vmx - Fix sleep-in-atomic bugs
crypto: arm64/aes-gcm-ce - fix scatterwalk API violation
crypto: aesni - Use unaligned loads from gcm_context_data
crypto: chtls - fix null dereference chtls_free_uld()
crypto: arm64/sm4-ce - check for the right CPU feature bit
crypto: caam - fix DMA mapping direction for RSA forms 2 & 3
crypto: caam/qi - fix error path in xts setkey
crypto: caam/jr - fix descriptor DMA unmapping
A regression was reported bisecting to 1476db2d12
"Move HashKey computation from stack to gcm_context". That diff
moved HashKey computation from the stack, which was explicitly aligned
in the asm, to a struct provided from the C code, depending on
AESNI_ALIGN_ATTR for alignment. It appears some compilers may not
align this struct correctly, resulting in a crash on the movdqa
instruction when attempting to encrypt or decrypt data.
Fix by using unaligned loads for the HashKeys. On modern
hardware there is no perf difference between the unaligned and
aligned loads. All other accesses to gcm_context_data already use
unaligned loads.
Reported-by: Mauro Rossi <issor.oruam@gmail.com>
Fixes: 1476db2d12 ("Move HashKey computation from stack to gcm_context")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull x86 asm updates from Thomas Gleixner:
"The lowlevel and ASM code updates for x86:
- Make stack trace unwinding more reliable
- ASM instruction updates for better code generation
- Various cleanups"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Add two more instruction suffixes
x86/asm/64: Use 32-bit XOR to zero registers
x86/build/vdso: Simplify 'cmd_vdso2c'
x86/build/vdso: Remove unused vdso-syms.lds
x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinder
x86/unwind/orc: Detect the end of the stack
x86/stacktrace: Do not fail for ORC with regs on stack
x86/stacktrace: Clarify the reliable success paths
x86/stacktrace: Remove STACKTRACE_DUMP_ONCE
x86/stacktrace: Do not unwind after user regs
x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize code generation
It turns out I had misunderstood how the x86_match_cpu() function works.
It evaluates a logical OR of the matching conditions, not logical AND.
This caused the CPU feature checks for AEGIS to pass even if only SSE2
(but not AES-NI) was supported (or vice versa), leading to potential
crashes if something tried to use the registered algs.
This patch switches the checks to a simpler method that is used e.g. in
the Camellia x86 code.
The patch also removes the MODULE_DEVICE_TABLE declarations which
actually seem to cause the modules to be auto-loaded at boot, which is
not desired. The crypto API on-demand module loading is sufficient.
Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Fixes: 6ecc9d9ff9 ("crypto: x86 - Add optimized MORUS implementations")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Tested-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Some ahash algorithms set .cra_type = &crypto_ahash_type. But this is
redundant with the C structure type ('struct ahash_alg'), and
crypto_register_ahash() already sets the .cra_type automatically.
Apparently the useless assignment has just been copy+pasted around.
So, remove the useless assignment from all the ahash algorithms.
This patch shouldn't change any actual behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Many ahash algorithms set .cra_flags = CRYPTO_ALG_TYPE_AHASH. But this
is redundant with the C structure type ('struct ahash_alg'), and
crypto_register_ahash() already sets the type flag automatically,
clearing any type flag that was already there. Apparently the useless
assignment has just been copy+pasted around.
So, remove the useless assignment from all the ahash algorithms.
This patch shouldn't change any actual behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Many shash algorithms set .cra_flags = CRYPTO_ALG_TYPE_SHASH. But this
is redundant with the C structure type ('struct shash_alg'), and
crypto_register_shash() already sets the type flag automatically,
clearing any type flag that was already there. Apparently the useless
assignment has just been copy+pasted around.
So, remove the useless assignment from all the shash algorithms.
This patch shouldn't change any actual behavior.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With all the crypto modules enabled on x86, and with a CPU that supports
AVX-2 but not SHA-NI instructions (e.g. Haswell, Broadwell, Skylake),
the "multibuffer" implementations of SHA-1, SHA-256, and SHA-512 are the
highest priority. However, these implementations only perform well when
many hash requests are being submitted concurrently, filling all 8 AVX-2
lanes. Otherwise, they are incredibly slow, as they waste time waiting
for more requests to arrive before proceeding to execute each request.
For example, here are the speeds I see hashing 4096-byte buffers with a
single thread on a Haswell-based processor:
generic avx2 mb (multibuffer)
------- -------- ----------------
sha1 602 MB/s 997 MB/s 0.61 MB/s
sha256 228 MB/s 412 MB/s 0.61 MB/s
sha512 312 MB/s 559 MB/s 0.61 MB/s
So, the multibuffer implementation is 500 to 1000 times slower than the
other implementations. Note that with smaller buffers or more update()s
per digest, the difference would be even greater.
I believe the vast majority of people are in the boat where the
multibuffer code is much slower, and only a small minority are doing the
highly parallel, hashing-intensive, latency-flexible workloads (maybe
IPsec on servers?) where the multibuffer code may be beneficial. Yet,
people often aren't familiar with all the crypto config options and so
the multibuffer code may inadvertently be built into the kernel.
Also the multibuffer code apparently hasn't been very well tested,
seeing as it was sometimes computing the wrong SHA-256 digest.
So, let's make the multibuffer algorithms low priority. Users who want
to use them can either request them explicitly by driver name, or use
NETLINK_CRYPTO (crypto_user) to increase their priority at runtime.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add explicit RETs to the tail calls of AEGIS and MORUS crypto algorithms
otherwise they run into INT3 padding due to
("x86/asm: Pad assembly functions with INT3 instructions")
leading to spurious debug exceptions.
Mike Galbraith <efault@gmx.de> took care of all the remaining callsites.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The x86 assembly implementations of Salsa20 use the frame base pointer
register (%ebp or %rbp), which breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.
Recent (v4.10+) kernels will warn about this, e.g.
WARNING: kernel stack regs at 00000000a8291e69 in syzkaller047086:4677 has bad 'bp' value 000000001077994c
[...]
But after looking into it, I believe there's very little reason to still
retain the x86 Salsa20 code. First, these are *not* vectorized
(SSE2/SSSE3/AVX2) implementations, which would be needed to get anywhere
close to the best Salsa20 performance on any remotely modern x86
processor; they're just regular x86 assembly. Second, it's still
unclear that anyone is actually using the kernel's Salsa20 at all,
especially given that now ChaCha20 is supported too, and with much more
efficient SSSE3 and AVX2 implementations. Finally, in benchmarks I did
on both Intel and AMD processors with both gcc 8.1.0 and gcc 4.9.4, the
x86_64 salsa20-asm is actually slightly *slower* than salsa20-generic
(~3% slower on Skylake, ~10% slower on Zen), while the i686 salsa20-asm
is only slightly faster than salsa20-generic (~15% faster on Skylake,
~20% faster on Zen). The gcc version made little difference.
So, the x86_64 salsa20-asm is pretty clearly useless. That leaves just
the i686 salsa20-asm, which based on my tests provides a 15-20% speed
boost. But that's without updating the code to not use %ebp. And given
the maintenance cost, the small speed difference vs. salsa20-generic,
the fact that few people still use i686 kernels, the doubt that anyone
is even using the kernel's Salsa20 at all, and the fact that a SSE2
implementation would almost certainly be much faster on any remotely
modern x86 processor yet no one has cared enough to add one yet, I don't
think it's worthwhile to keep.
Thus, just remove both the x86_64 and i686 salsa20-asm implementations.
Reported-by: syzbot+ffa3a158337bbc01ff09@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Commit 56e8e57fc3 ("crypto: morus - Add common SIMD glue code for
MORUS") accidetally consiedered the glue code to be usable by different
architectures, but it seems to be only usable on x86.
This patch moves it under arch/x86/crypto and adds 'depends on X86' to
the Kconfig options and also removes the prompt to hide these internal
options from the user.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
AEGIS-256 key is two blocks, not one.
Fixes: 1d373d4e8e ("crypto: x86 - Add optimized AEGIS implementations")
Reported-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds optimized implementations of MORUS-640 and MORUS-1280,
utilizing the SSE2 and AVX2 x86 extensions.
For MORUS-1280 (which operates on 256-bit blocks) we provide both AVX2
and SSE2 implementation. Although SSE2 MORUS-1280 is slower than AVX2
MORUS-1280, it is comparable in speed to the SSE2 MORUS-640.
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
This patch adds optimized implementations of AEGIS-128, AEGIS-128L,
and AEGIS-256, utilizing the AES-NI and SSE2 x86 extensions.
Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Trivial fix to spelling mistake in module description text
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
There are no users of the original glue_fpu_begin() anymore, so rename
glue_skwalk_fpu_begin() to glue_fpu_begin() so that it matches
glue_fpu_end() again.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that all glue_helper users have been switched from the blkcipher
interface over to the skcipher interface, remove the versions of the
glue_helper functions that handled the blkcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now that all users of lrw_crypt() have been removed in favor of the LRW
template wrapping an ECB mode algorithm, remove lrw_crypt(). Also
remove crypto/lrw.h as that is no longer needed either; and fold
'struct lrw_table_ctx' into 'struct priv', lrw_init_table() into
setkey(), and lrw_free_table() into exit_tfm().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AESNI AVX and AESNI AVX2 implementations of Camellia from
the (deprecated) ablkcipher and blkcipher interfaces over to the
skcipher interface. Note that this includes replacing the use of
ablk_helper with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the x86 asm implementation of Camellia from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The XTS template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic XTS code themselves via xts_crypt().
Remove the xts-camellia-asm algorithm which did this. Users who request
xts(camellia) and previously would have gotten xts-camellia-asm will now
get xts(ecb-camellia-asm) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-camellia-asm algorithm which did this. Users who request
lrw(camellia) and previously would have gotten lrw-camellia-asm will now
get lrw(ecb-camellia-asm) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-camellia-aesni-avx2 algorithm which did this. Users who
request lrw(camellia) and previously would have gotten
lrw-camellia-aesni-avx2 will now get lrw(ecb-camellia-aesni-avx2)
instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-camellia-aesni algorithm which did this. Users who
request lrw(camellia) and previously would have gotten
lrw-camellia-aesni will now get lrw(ecb-camellia-aesni) instead, which
is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the x86 asm implementation of Triple DES from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the x86 asm implementation of Blowfish from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX implementation of CAST6 from the (deprecated) ablkcipher
and blkcipher interfaces over to the skcipher interface. Note that this
includes replacing the use of ablk_helper with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-cast6-avx algorithm which did this. Users who request
lrw(cast6) and previously would have gotten lrw-cast6-avx will now get
lrw(ecb-cast6-avx) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX implementation of CAST5 from the (deprecated) ablkcipher
and blkcipher interfaces over to the skcipher interface. Note that this
includes replacing the use of ablk_helper with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
With ecb-cast5-avx, if a 128+ byte scatterlist element followed a
shorter one, then the algorithm accidentally encrypted/decrypted only 8
bytes instead of the expected 128 bytes. Fix it by setting the
encryption/decryption 'fn' correctly.
Fixes: c12ab20b16 ("crypto: cast5/avx - avoid using temporary stack buffers")
Cc: <stable@vger.kernel.org> # v3.8+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX implementation of Twofish from the (deprecated)
ablkcipher and blkcipher interfaces over to the skcipher interface.
Note that this includes replacing the use of ablk_helper with
crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-twofish-avx algorithm which did this. Users who request
lrw(twofish) and previously would have gotten lrw-twofish-avx will now
get lrw(ecb-twofish-avx) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the 3-way implementation of Twofish from the (deprecated)
blkcipher interface over to the skcipher interface.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The XTS template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic XTS code themselves via xts_crypt().
Remove the xts-twofish-3way algorithm which did this. Users who request
xts(twofish) and previously would have gotten xts-twofish-3way will now
get xts(ecb-twofish-3way) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-twofish-3way algorithm which did this. Users who request
lrw(twofish) and previously would have gotten lrw-twofish-3way will now
get lrw(ecb-twofish-3way) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the AVX and AVX2 implementations of Serpent from the
(deprecated) ablkcipher and blkcipher interfaces over to the skcipher
interface. Note that this includes replacing the use of ablk_helper
with crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-serpent-avx algorithm which did this. Users who request
lrw(serpent) and previously would have gotten lrw-serpent-avx will now
get lrw(ecb-serpent-avx) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-serpent-avx2 algorithm which did this. Users who request
lrw(serpent) and previously would have gotten lrw-serpent-avx2 will now
get lrw(ecb-serpent-avx2) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Convert the SSE2 implementation of Serpent from the (deprecated)
ablkcipher and blkcipher interfaces over to the skcipher interface.
Note that this includes replacing the use of ablk_helper with
crypto_simd.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The XTS template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic XTS code themselves via xts_crypt().
Remove the xts-serpent-sse2 algorithm which did this. Users who request
xts(serpent) and previously would have gotten xts-serpent-sse2 will now
get xts(ecb-serpent-sse2) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The LRW template now wraps an ECB mode algorithm rather than the block
cipher directly. Therefore it is now redundant for crypto modules to
wrap their ECB code with generic LRW code themselves via lrw_crypt().
Remove the lrw-serpent-sse2 algorithm which did this. Users who request
lrw(serpent) and previously would have gotten lrw-serpent-sse2 will now
get lrw(ecb-serpent-sse2) instead, which is just as fast.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add ECB, CBC, and CTR functions to glue_helper which use skcipher_walk
rather than blkcipher_walk. This will allow converting the remaining
x86 algorithms from the blkcipher interface over to the skcipher
interface, after which we'll be able to remove the blkcipher_walk
versions of these functions.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Add gcmaes_crypt_by_sg routine, that will do scatter/gather
by sg. Either src or dst may contain multiple buffers, so
iterate over both at the same time if they are different.
If the input is the same as the output, iterate only over one.
Currently both the AAD and TAG must be linear, so copy them out
with scatterlist_map_and_copy. If first buffer contains the
entire AAD, we can optimize and not copy. Since the AAD
can be any size, if copied it must be on the heap. TAG can
be on the stack since it is always < 16 bytes.
Only the SSE routines are updated so far, so leave the previous
gcmaes_en/decrypt routines, and branch to the sg ones if the
keysize is inappropriate for avx, or we are SSE only.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
The asm macros are all set up now, introduce entry points.
GCM_INIT and GCM_COMPLETE have arguments supplied, so that
the new scatter/gather entry points don't have to take all the
arguments, and only the ones they need.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
We can fast-path any < 16 byte read if the full message is > 16 bytes,
and shift over by the appropriate amount. Usually we are
reading > 16 bytes, so this should be faster than the READ_PARTIAL
macro introduced in b20209c91e for the average case.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Before this diff, multiple calls to GCM_ENC_DEC will
succeed, but only if all calls are a multiple of 16 bytes.
Handle partial blocks at the start of GCM_ENC_DEC, and update
aadhash as appropriate.
The data offset %r11 is also updated after the partial block.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
HashKey computation only needs to happen once per scatter/gather operation,
save it between calls in gcm_context struct instead of on the stack.
Since the asm no longer stores anything on the stack, we can use
%rsp directly, and clean up the frame save/restore macros a bit.
Hashkeys actually only need to be calculated once per key and could
be moved to when set_key is called, however, the current glue code
falls back to generic aes code if fpu is disabled.
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>