Commit Graph

209 Commits

Author SHA1 Message Date
Jussi Kivilinna
6574e6c64e crypto: des_3des - add x86-64 assembly implementation
Patch adds x86_64 assembly implementation of Triple DES EDE cipher algorithm.
Two assembly implementations are provided. First is regular 'one-block at
time' encrypt/decrypt function. Second is 'three-blocks at time' function that
gains performance increase on out-of-order CPUs.

tcrypt test results:

Intel Core i5-4570:

des3_ede-asm vs des3_ede-generic:
size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
16B     1.21x   1.22x   1.27x   1.36x   1.25x   1.25x
64B     1.98x   1.96x   1.23x   2.04x   2.01x   2.00x
256B    2.34x   2.37x   1.21x   2.40x   2.38x   2.39x
1024B   2.50x   2.47x   1.22x   2.51x   2.52x   2.51x
8192B   2.51x   2.53x   1.21x   2.56x   2.54x   2.55x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-20 21:27:58 +08:00
George Spelvin
473946e674 crypto: crc32c-pclmul - Shrink K_table to 32-bit words
There's no need for the K_table to be made of 64-bit words.  For some
reason, the original authors didn't fully reduce the values modulo the
CRC32C polynomial, and so had some 33-bit values in there.  They can
all be reduced to 32 bits.

Doing that cuts the table size in half.  Since the code depends on both
pclmulq and crc32, SSE 4.1 is obviously present, so we can use pmovzxdq
to fetch it in the correct format.

This adds (measured on Ivy Bridge) 1 cycle per main loop iteration
(CRC of up to 3K bytes), less than 0.2%.  The hope is that the reduced
D-cache footprint will make up the loss in other code.

Two other related fixes:
* K_table is read-only, so belongs in .rodata, and
* There's no need for more than 8-byte alignment

Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: George Spelvin <linux@horizon.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-06-20 21:27:57 +08:00
Herbert Xu
0ea481466d crypto: ghash-clmulni-intel - Use u128 instead of be128 for internal key
The internal key isn't actually in big-endian format so let's switch
to u128 which also happens to allow us to remove a sparse warning.

Based on suggestion by Ard Biesheuvel.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
2014-04-04 21:06:14 +08:00
Ard Biesheuvel
8ceee72808 crypto: ghash-clmulni-intel - use C implementation for setkey()
The GHASH setkey() function uses SSE registers but fails to call
kernel_fpu_begin()/kernel_fpu_end(). Instead of adding these calls, and
then having to deal with the restriction that they cannot be called from
interrupt context, move the setkey() implementation to the C domain.

Note that setkey() does not use any particular SSE features and is not
expected to become a performance bottleneck.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Fixes: 0e1227d356 (crypto: ghash - Add PCLMULQDQ accelerated implementation)
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-04-01 17:22:47 +08:00
Mathias Krause
37b2894717 crypto: x86/sha1 - reduce size of the AVX2 asm implementation
There is really no need to page align sha1_transform_avx2. The default
alignment is just fine. This is not the hot code but only the entry
point, after all.

Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-25 20:25:43 +08:00
Mathias Krause
6c8c17cc7a crypto: x86/sha1 - fix stack alignment of AVX2 variant
The AVX2 implementation might waste up to a page of stack memory because
of a wrong alignment calculation. This will, in the worst case, increase
the stack usage of sha1_transform_avx2() alone to 5.4 kB -- way to big
for a kernel function. Even worse, it might also allocate *less* bytes
than needed if the stack pointer is already aligned bacause in that case
the 'sub %rbx, %rsp' is effectively moving the stack pointer upwards,
not downwards.

Fix those issues by changing and simplifying the alignment calculation
to use a 32 byte alignment, the alignment really needed.

Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-25 20:25:43 +08:00
Mathias Krause
6ca5afb8c2 crypto: x86/sha1 - re-enable the AVX variant
Commit 7c1da8d0d0 "crypto: sha - SHA1 transform x86_64 AVX2"
accidentally disabled the AVX variant by making the avx_usable() test
not only fail in case the CPU doesn't support AVX or OSXSAVE but also
if it doesn't support AVX2.

Fix that regression by splitting up the AVX/AVX2 test into two
functions. Also test for the BMI1 extension in the avx2_usable() test
as the AVX2 implementation not only makes use of BMI2 but also BMI1
instructions.

Cc: Chandramouli Narayanan <mouli@linux.intel.com>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-25 20:25:42 +08:00
chandramouli narayanan
7c1da8d0d0 crypto: sha - SHA1 transform x86_64 AVX2
This git patch adds x86_64 AVX2 optimization of SHA1
transform to crypto support. The patch has been tested with 3.14.0-rc1
kernel.

On a Haswell desktop, with turbo disabled and all cpus running
at maximum frequency, tcrypt shows AVX2 performance improvement
from 3% for 256 bytes update to 16% for 1024 bytes update over
AVX implementation.

This patch adds sha1_avx2_transform(), the glue, build and
configuration changes needed for AVX2 optimization of
SHA1 transform to crypto support.

sha1-ssse3 is one module which adds the necessary optimization
support (SSSE3/AVX/AVX2) for the low-level SHA1 transform function.
With better optimization support, transform function is overridden
as the case may be. In the case of AVX2, due to performance reasons
across datablock sizes, the AVX or AVX2 transform function is used
at run-time as it suits best. The Makefile change therefore appends
the necessary objects to the linkage. Due to this, the patch merely
appends AVX2 transform to the existing build mix and Kconfig support
and leaves the configuration build support as is.

Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
Reviewed-by: Marek Vasut <marex@denx.de>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-03-21 21:54:30 +08:00
Dan Carpenter
b3bd5869fd crypto: remove a duplicate checks in __cbc_decrypt()
We checked "nbytes < bsize" before so it can't happen here.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Acked-by: Johannes Götzfried <johannes.goetzfried@cs.fau.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-02-27 05:56:54 +08:00
Tim Chen
79ba451d66 crypto: aesni - fix build on x86 (32bit)
We rename aesni-intel_avx.S to aesni-intel_avx-x86_64.S to indicate
that it is only used by x86_64 architecture.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2014-01-15 11:36:34 +08:00
Andy Shevchenko
8610d7bf60 crypto: aesni - fix build on x86 (32bit)
It seems commit d764593a "crypto: aesni - AVX and AVX2 version of AESNI-GCM
encode and decode" breaks a build on x86_32 since it's designed only for
x86_64. This patch makes a compilation unit conditional to CONFIG_64BIT and
functions usage to CONFIG_X86_64.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-12-31 19:47:46 +08:00
Tim Chen
d764593af9 crypto: aesni - AVX and AVX2 version of AESNI-GCM encode and decode
We have added AVX and AVX2 routines that optimize AESNI-GCM encode/decode.
These routines are optimized for encrypt and decrypt of large buffers.
In tests we have seen up to 6% speedup for 1K, 11% speedup for 2K and
18% speedup for 8K buffer over the existing SSE version.  These routines
should provide even better speedup for future Intel x86_64 cpus.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-12-20 20:06:24 +08:00
Daniel Borkmann
fed286110f crypto: arch - use crypto_memneq instead of memcmp
Replace remaining occurences (just as we did in crypto/) under arch/*/crypto/
that make use of memcmp() for comparing keys or authentication tags for
usage with crypto_memneq(). It can simply be used as a drop-in replacement
for the normal memcmp().

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: James Yonan <james@openvpn.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-12-20 20:06:24 +08:00
Oliver Neukum
16c0c4e165 crypto: sha256_ssse3 - also test for BMI2
The AVX2 implementation also uses BMI2 instructions,
but doesn't test for their availability. The assumption
that AVX2 and BMI2 always go together is false. Some
Haswells have AVX2 but not BMI2.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-10-07 14:17:10 +08:00
Ard Biesheuvel
801201aa25 crypto: move x86 to the generic version of ablk_helper
Move all users of ablk_helper under x86/ to the generic version
and delete the x86 specific version.

Acked-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-24 06:02:24 +10:00
Jussi Kivilinna
58497204aa crypto: x86 - restore avx2_supported check
Commit 3d387ef08c (Revert "crypto: blowfish - add AVX2/x86_64 implementation
of blowfish cipher") reverted too much as it removed the 'assembler supports
AVX2' check and therefore disabled remaining AVX2 implementations of Camellia
and Serpent. Patch restores the check and enables these implementations.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-13 21:43:52 +10:00
Jussi Kivilinna
7d444909a2 crypto: sha256_ssse3 - use correct module alias for sha224
Commit a710f761f (crypto: sha256_ssse3 - add sha224 support) attempted to add
MODULE_ALIAS for SHA-224, but it ended up being "sha384", probably because
mix-up with previous commit 340991e30 (crypto: sha512_ssse3 - add sha384
support). Patch corrects module alias to "sha224".

Reported-by: Pierre-Mayeul Badaire <pierre-mayeul.badaire@m4x.org>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-13 21:43:52 +10:00
Herbert Xu
68411521cc Reinstate "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
This patch reinstates commits
	67822649d7
	39761214ee
	0b95a7f857
	31d939625a
	2d31e518a4

Now that module softdeps are in the kernel we can use that to resolve
the boot issue which cause the revert.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-09-07 12:56:26 +10:00
Herbert Xu
eeca9fad52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
Merge upstream tree in order to reinstate crct10dif.
2013-09-07 12:53:35 +10:00
Julia Lawall
2a128b4b74 crypto: camellia-x86-64 - replace commas by semicolons and adjust code alignment
Adjust alignment and replace commas by semicolons in automatically
generated code.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-08-21 21:08:32 +10:00
Andi Kleen
f22d08111a crypto: make tables used from assembler __visible
Tables used from assembler should be marked __visible to let
the compiler know.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-08-14 20:42:03 +10:00
Linus Torvalds
b48a97be8e Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
 "This push fixes a memory corruption issue in caam, as well as
  reverting the new optimised crct10dif implementation as it breaks boot
  on initrd systems.

  Hopefully crct10dif will be reinstated once the supporting code is
  added so that it doesn't break boot"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  Revert "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
  crypto: caam - Fixed the memory out of bound overwrite issue
2013-07-24 11:05:18 -07:00
Herbert Xu
e70308ec0e Revert "crypto: crct10dif - Wrap crc_t10dif function all to use crypto transform framework"
This reverts commits
    67822649d7
    39761214ee
    0b95a7f857
    31d939625a
    2d31e518a4

Unfortunately this change broke boot on some systems that used an
initrd which does not include the newly created crct10dif modules.
As these modules are required by sd_mod under certain configurations
this is a serious problem.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-07-24 17:04:16 +10:00
Linus Torvalds
b2c311075d Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto update from Herbert Xu:
 - Do not idle omap device between crypto operations in one session.
 - Added sha224/sha384 shims for SSSE3.
 - More optimisations for camellia-aesni-avx2.
 - Removed defunct blowfish/twofish AVX2 implementations.
 - Added unaligned buffer self-tests.
 - Added PCLMULQDQ optimisation for CRCT10DIF.
 - Added support for Freescale's DCP co-processor
 - Misc fixes.

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (44 commits)
  crypto: testmgr - test hash implementations with unaligned buffers
  crypto: testmgr - test AEADs with unaligned buffers
  crypto: testmgr - test skciphers with unaligned buffers
  crypto: testmgr - check that entries in alg_test_descs are in correct order
  Revert "crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher"
  Revert "crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher"
  crypto: camellia-aesni-avx2 - tune assembly code for more performance
  hwrng: bcm2835 - fix MODULE_LICENSE tag
  hwrng: nomadik - use clk_prepare_enable()
  crypto: picoxcell - replace strict_strtoul() with kstrtoul()
  crypto: dcp - Staticize local symbols
  crypto: dcp - Use NULL instead of 0
  crypto: dcp - Use devm_* APIs
  crypto: dcp - Remove redundant platform_set_drvdata()
  hwrng: use platform_{get,set}_drvdata()
  crypto: omap-aes - Don't idle/start AES device between Encrypt operations
  crypto: crct10dif - Use PTR_RET
  crypto: ux500 - Cocci spatch "resource_size.spatch"
  crypto: sha256_ssse3 - add sha224 support
  crypto: sha512_ssse3 - add sha384 support
  ...
2013-07-05 12:12:33 -07:00
Linus Torvalds
92616ee654 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes an unaligned crash in XTS mode when using aseni_intel"

* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: aesni_intel - fix accessing of unaligned memory
2013-06-21 06:28:39 -10:00
Herbert Xu
02c0241b60 Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto
Merge crypto to resolve conflict in crypto/Kconfig.
2013-06-21 15:13:27 +08:00
Jussi Kivilinna
99f42f937a Revert "crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher"
This reverts commit cf1521a1a5.

Instruction (vpgatherdd) that this implementation relied on turned out to be
slow performer on real hardware (i5-4570). The previous 8-way twofish/AVX
implementation is therefore faster and this implementation should be removed.

Converting this implementation to use the same method as in twofish/AVX for
table look-ups would give additional ~3% speed up vs twofish/AVX, but would
hardly be worth of the added code and binary size.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-21 14:44:29 +08:00
Jussi Kivilinna
3d387ef08c Revert "crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher"
This reverts commit 6048801070.

Instruction (vpgatherdd) that this implementation relied on turned out to be
slow performer on real hardware (i5-4570). The previous 4-way blowfish
implementation is therefore faster and this implementation should be removed.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-21 14:44:28 +08:00
Jussi Kivilinna
acfffdb803 crypto: camellia-aesni-avx2 - tune assembly code for more performance
Add implementation tuned for more performance on real hardware. Changes are
mostly around the part mixing 128-bit extract and insert instructions and
AES-NI instructions. Also 'vpbroadcastb' instructions have been change to
'vpshufb with zero mask'.

Tests on Intel Core i5-4570:

tcrypt ECB results, old-AVX2 vs new-AVX2:

size    128bit key      256bit key
        enc     dec     enc     dec
256     1.00x   1.00x   1.00x   1.00x
1k      1.08x   1.09x   1.05x   1.06x
8k      1.06x   1.06x   1.06x   1.06x

tcrypt ECB results, AVX vs new-AVX2:

size    128bit key      256bit key
        enc     dec     enc     dec
256     1.00x   1.00x   1.00x   1.00x
1k      1.51x   1.50x   1.52x   1.50x
8k      1.47x   1.48x   1.48x   1.48x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-21 14:44:23 +08:00
Jussi Kivilinna
fe6510b5d6 crypto: aesni_intel - fix accessing of unaligned memory
The new XTS code for aesni_intel uses input buffers directly as memory operands
for pxor instructions, which causes crash if those buffers are not aligned to
16 bytes.

Patch changes XTS code to handle unaligned memory correctly, by loading memory
with movdqu instead.

Reported-by: Dave Jones <davej@redhat.com>
Tested-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-06-13 14:57:42 +08:00
Linus Torvalds
484b002e28 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:

 - Three EFI-related fixes

 - Two early memory initialization fixes

 - build fix for older binutils

 - fix for an eager FPU performance regression -- currently we don't
   allow the use of the FPU at interrupt time *at all* in eager mode,
   which is clearly wrong.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Allow FPU to be used at interrupt time even with eagerfpu
  x86, crc32-pclmul: Fix build with older binutils
  x86-64, init: Fix a possible wraparound bug in switchover in head_64.S
  x86, range: fix missing merge during add range
  x86, efi: initial the local variable of DataSize to zero
  efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse
  efivarfs: Never return ENOENT from firmware again
2013-05-31 09:44:10 +09:00
Jan Beulich
2baad6121e x86, crc32-pclmul: Fix build with older binutils
binutils prior to 2.18 (e.g. the ones found on SLE10) don't support
assembling PEXTRD, so a macro based approach like the one for PCLMULQDQ
in the same file should be used.

This requires making the helper macros capable of recognizing 32-bit
general purpose register operands.

[ hpa: tagging for stable as it is a low risk build fix ]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/51A6142A02000078000D99D8@nat28.tlf.novell.com
Cc: Alexander Boyko <alexander_boyko@xyratex.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org> v3.9
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-05-30 16:36:23 -07:00
Jussi Kivilinna
a710f761fc crypto: sha256_ssse3 - add sha224 support
Add sha224 implementation to sha256_ssse3 module.

This also fixes sha256_ssse3 module autoloading issue when 'sha224' is used
before 'sha256'. Previously in such case, just sha256_generic was loaded and
not sha256_ssse3 (since it did not provide sha224). Now if 'sha256' was used
after 'sha224' usage, sha256_ssse3 would remain unloaded.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-28 15:43:05 +08:00
Jussi Kivilinna
340991e30c crypto: sha512_ssse3 - add sha384 support
Add sha384 implementation to sha512_ssse3 module.

This also fixes sha512_ssse3 module autoloading issue when 'sha384' is used
before 'sha512'. Previously in such case, just sha512_generic was loaded and
not sha512_ssse3 (since it did not provide sha384). Now if 'sha512' was used
after 'sha384' usage, sha512_ssse3 would remain unloaded. For example, this
happens with tcrypt testing module since it tests 'sha384' before 'sha512'.

Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-28 15:43:05 +08:00
Jussi Kivilinna
de614e561b crypto: sha256_ssse3 - fix stack corruption with SSSE3 and AVX implementations
The _XFER stack element size was set too small, 8 bytes, when it needs to be
16 bytes. As _XFER is the last stack element used by these implementations,
the 16 byte stores with 'movdqa' corrupt the stack where the value of register
%r12 is temporarily stored. As these implementations align the stack pointer
to 16 bytes, this corruption did not happen every time.

Patch corrects this issue.

Reported-by: Julian Wollrath <jwollrath@web.de>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Tested-by: Julian Wollrath <jwollrath@web.de>
Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-28 13:46:47 +08:00
Tim Chen
0b95a7f857 crypto: crct10dif - Glue code to cast accelerated CRCT10DIF assembly as a crypto transform
Glue code that plugs the PCLMULQDQ accelerated CRC T10 DIF hash into the
crypto framework.  The config CRYPTO_CRCT10DIF_PCLMUL should be turned
on to enable the feature.  The crc_t10dif crypto library function will
use this faster algorithm when crct10dif_pclmul module is loaded.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-24 17:55:27 +08:00
Tim Chen
31d939625a crypto: crct10dif - Accelerated CRC T10 DIF computation with PCLMULQDQ instruction
This is the x86_64 CRC T10 DIF transform accelerated with the PCLMULQDQ
instructions.  Details discussing the implementation can be found in the
paper:

"Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction"
http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-05-20 20:11:06 +08:00
Jussi Kivilinna
f3f935a76a crypto: camellia - add AVX2/AES-NI/x86_64 assembler implementation of camellia cipher
Patch adds AVX2/AES-NI/x86-64 implementation of Camellia cipher, requiring
32 parallel blocks for input (512 bytes). Compared to AVX implementation, this
version is extended to use the 256-bit wide YMM registers. For AES-NI
instructions data is split to two 128-bit registers and merged afterwards.
Even with this additional handling, performance should be higher compared
to the AES-NI/AVX implementation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:07 +08:00
Jussi Kivilinna
56d76c96a9 crypto: serpent - add AVX2/x86_64 assembler implementation of serpent cipher
Patch adds AVX2/x86-64 implementation of Serpent cipher, requiring 16 parallel
blocks for input (256 bytes). Implementation is based on the AVX implementation
and extends to use the 256-bit wide YMM registers. Since serpent does not use
table look-ups, this implementation should be close to two times faster than
the AVX implementation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:07 +08:00
Jussi Kivilinna
cf1521a1a5 crypto: twofish - add AVX2/x86_64 assembler implementation of twofish cipher
Patch adds AVX2/x86-64 implementation of Twofish cipher, requiring 16 parallel
blocks for input (256 bytes). Table look-ups are performed using vpgatherdd
instruction directly from vector registers and thus should be faster than
earlier implementations. Implementation also uses 256-bit wide YMM registers,
which should give additional speed up compared to the AVX implementation.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:05 +08:00
Jussi Kivilinna
6048801070 crypto: blowfish - add AVX2/x86_64 implementation of blowfish cipher
Patch adds AVX2/x86-64 implementation of Blowfish cipher, requiring 32 parallel
blocks for input (256 bytes). Table look-ups are performed using vpgatherdd
instruction directly from vector registers and thus should be faster than
earlier implementations.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:09:04 +08:00
Jussi Kivilinna
c456a9cd1a crypto: aesni_intel - add more optimized XTS mode for x86-64
Add more optimized XTS code for aesni_intel in 64-bit mode, for smaller stack
usage and boost for speed.

tcrypt results, with Intel i5-2450M:
256-bit key
        enc     dec
16B     0.98x   0.99x
64B     0.64x   0.63x
256B    1.29x   1.32x
1024B   1.54x   1.58x
8192B   1.57x   1.60x

512-bit key
        enc     dec
16B     0.98x   0.99x
64B     0.60x   0.59x
256B    1.24x   1.25x
1024B   1.39x   1.42x
8192B   1.38x   1.42x

I chose not to optimize smaller than block size of 256 bytes, since XTS is
practically always used with data blocks of size 512 bytes. This is why
performance is reduced in tcrypt for 64 byte long blocks.

Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:53 +08:00
Jussi Kivilinna
b5c5b072dc crypto: x86/camellia-aesni-avx - add more optimized XTS code
Add more optimized XTS code for camellia-aesni-avx, for smaller stack usage
and small boost for speed.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.10x   1.01x
64B     0.82x   0.77x
256B    1.14x   1.10x
1024B   1.17x   1.16x
8192B   1.10x   1.11x

Since XTS is practically always used with data blocks of size 512 bytes or
more, I chose to not make use of camellia-2way for block sized smaller than
256 bytes. This causes slower result in tcrypt for 64 bytes.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:52 +08:00
Jussi Kivilinna
70177286e1 crypto: cast6-avx: use new optimized XTS code
Change cast6-avx to use the new XTS code, for smaller stack usage and small
boost to performance.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.01x   1.01x
64B     1.01x   1.00x
256B    1.09x   1.02x
1024B   1.08x   1.06x
8192B   1.08x   1.07x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:52 +08:00
Jussi Kivilinna
18be45270a crypto: x86/twofish-avx - use optimized XTS code
Change twofish-avx to use the new XTS code, for smaller stack usage and small
boost to performance.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.03x   1.02x
64B     0.91x   0.91x
256B    1.10x   1.09x
1024B   1.12x   1.11x
8192B   1.12x   1.11x

Since XTS is practically always used with data blocks of size 512 bytes or
more, I chose to not make use of twofish-3way for block sized smaller than
128 bytes. This causes slower result in tcrypt for 64 bytes.

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:51 +08:00
Jussi Kivilinna
a05248ed2d crypto: x86 - add more optimized XTS-mode for serpent-avx
This patch adds AVX optimized XTS-mode helper functions/macros and converts
serpent-avx to use the new facilities. Benefits are slightly improved speed
and reduced stack usage as use of temporary IV-array is avoided.

tcrypt results, with Intel i5-2450M:
        enc     dec
16B     1.00x   1.00x
64B     1.00x   1.00x
256B    1.04x   1.06x
1024B   1.09x   1.09x
8192B   1.10x   1.09x

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:51 +08:00
Sandy Wu
57ae1b0532 crypto: crc32-pclmul - Use gas macro for pclmulqdq
Occurs when CONFIG_CRYPTO_CRC32C_INTEL=y and CONFIG_CRYPTO_CRC32C_INTEL=y.
Older versions of bintuils do not support the pclmulqdq instruction. The
PCLMULQDQ gas macro is used instead.

Signed-off-by: Sandy Wu <sandyw@twitter.com>
Cc: stable@vger.kernel.org # 3.8+
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:44 +08:00
Tim Chen
87de4579f9 crypto: sha512 - Create module providing optimized SHA512 routines using SSSE3, AVX or AVX2 instructions.
We added glue code and config options to create crypto
module that uses SSE/AVX/AVX2 optimized SHA512 x86_64 assembly routines.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:01:42 +08:00
Tim Chen
5663535b69 crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX2 RORX instruction.
Provides SHA512 x86_64 assembly routine optimized with SSE, AVX and
AVX2's RORX instructions.  Speedup of 70% or more has been
measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:00:58 +08:00
Tim Chen
e01d69cb01 crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX instructions.
Provides SHA512 x86_64 assembly routine optimized with SSE and AVX instructions.
Speedup of 60% or more has been measured over the generic implementation.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2013-04-25 21:00:58 +08:00