Rewrite the AES-NI implementations of AES-GCM, taking advantage of
things I learned while writing the VAES-AVX10 implementations. This is
a complete rewrite that reduces the AES-NI GCM source code size by about
70% and the binary code size by about 95%, while not regressing
performance and in fact improving it significantly in many cases.
The following summarizes the state before this patch:
- The aesni-intel module registered algorithms "generic-gcm-aesni" and
"rfc4106-gcm-aesni" with the crypto API that actually delegated to one
of three underlying implementations according to the CPU capabilities
detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.
- The AES-NI + AVX and AES-NI + AVX2 assembly code was in
aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and
257 KB of binary. This massive binary size was not really
appropriate, and depending on the kconfig it could take up over 1% the
size of the entire vmlinux. The main loops did 8 blocks per
iteration. The AVX code minimized the use of carryless multiplication
whereas the AVX2 code did not. The "AVX2" code did not actually use
AVX2; the check for AVX2 was really a check for Intel Haswell or later
to detect support for fast carryless multiplication. The long source
length was caused by factors such as significant code duplication.
- The AES-NI only assembly code was in aesni-intel_asm.S and consisted
of 1501 lines of source and 15 KB of binary. The main loops did 4
blocks per iteration and minimized the use of carryless multiplication
by using Karatsuba multiplication and a multiplication-less reduction.
- The assembly code was contributed in 2010-2013. Maintenance has been
sporadic and most design choices haven't been revisited.
- The assembly function prototypes and the corresponding glue code were
separate from and were not consistent with the new VAES-AVX10 code I
recently added. The older code had several issues such as not
precomputing the GHASH key powers, which hurt performance.
This rewrite achieves the following goals:
- Much shorter source and binary sizes. The assembly source shrinks
from 4300 lines to 1130 lines, and it produces about 9 KB of binary
instead of 272 KB. This is achieved via a better designed AES-GCM
implementation that doesn't excessively unroll the code and instead
prioritizes the parts that really matter. Sharing the C glue code
with the VAES-AVX10 implementations also saves 250 lines of C source.
- Improve performance on most (possibly all) CPUs on which this code
runs, for most (possibly all) message lengths. Benchmark results are
given in Tables 1 and 2 below.
- Use the same function prototypes and glue code as the new VAES-AVX10
algorithms. This fixes some issues with the integration of the
assembly and results in some significant performance improvements,
primarily on short messages. Also, the AVX and non-AVX
implementations are now registered as separate algorithms with the
crypto API, which makes them both testable by the self-tests.
- Keep support for AES-NI without AVX (for Westmere, Silvermont,
Goldmont, and Tremont), but unify the source code with AES-NI + AVX.
Since 256-bit vectors cannot be used without VAES anyway, this is made
feasible by just using the non-VEX coded form of most instructions.
- Use a unified approach where the main loop does 8 blocks per iteration
and uses Karatsuba multiplication to save one pclmulqdq per block but
does not use the multiplication-less reduction. This strikes a good
balance across the range of CPUs on which this code runs.
- Don't spam the kernel log with an informational message on every boot.
The following tables summarize the improvement in AES-GCM throughput on
various CPU microarchitectures as a result of this patch:
Table 1: AES-256-GCM encryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell | 2% | 8% | 11% | 18% | 31% | 26% |
Intel Skylake | 1% | 4% | 7% | 12% | 26% | 19% |
Intel Cascade Lake | 3% | 8% | 10% | 18% | 33% | 24% |
AMD Zen 1 | 6% | 12% | 6% | 15% | 27% | 24% |
AMD Zen 2 | 8% | 13% | 13% | 19% | 26% | 28% |
AMD Zen 3 | 8% | 14% | 13% | 19% | 26% | 25% |
| 300 | 200 | 64 | 63 | 16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell | 35% | 29% | 45% | 55% | 54% |
Intel Skylake | 25% | 19% | 28% | 33% | 27% |
Intel Cascade Lake | 36% | 28% | 39% | 49% | 54% |
AMD Zen 1 | 27% | 22% | 23% | 29% | 26% |
AMD Zen 2 | 32% | 24% | 22% | 25% | 31% |
AMD Zen 3 | 30% | 24% | 22% | 23% | 26% |
Table 2: AES-256-GCM decryption throughput improvement,
CPU microarchitecture vs. message length in bytes:
| 16384 | 4096 | 4095 | 1420 | 512 | 500 |
-------------------+-------+-------+-------+-------+-------+-------+
Intel Broadwell | 3% | 8% | 11% | 19% | 32% | 28% |
Intel Skylake | 3% | 4% | 7% | 13% | 28% | 27% |
Intel Cascade Lake | 3% | 9% | 11% | 19% | 33% | 28% |
AMD Zen 1 | 15% | 18% | 14% | 20% | 36% | 33% |
AMD Zen 2 | 9% | 16% | 13% | 21% | 26% | 27% |
AMD Zen 3 | 8% | 15% | 12% | 18% | 23% | 23% |
| 300 | 200 | 64 | 63 | 16 |
-------------------+-------+-------+-------+-------+-------+
Intel Broadwell | 36% | 31% | 40% | 51% | 53% |
Intel Skylake | 28% | 21% | 23% | 30% | 30% |
Intel Cascade Lake | 36% | 29% | 36% | 47% | 53% |
AMD Zen 1 | 35% | 31% | 32% | 35% | 36% |
AMD Zen 2 | 31% | 30% | 27% | 38% | 30% |
AMD Zen 3 | 27% | 23% | 24% | 32% | 26% |
The above numbers are percentage improvements in single-thread
throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be
listed as 10%. They were collected by directly measuring the Linux
crypto API performance using a custom kernel module. Note that indirect
benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O)
include more overhead and won't see quite as much of a difference. All
these benchmarks used an associated data length of 16 bytes. Note that
AES-GCM is almost always used with short associated data lengths.
I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or
Intel low-power CPUs, as these weren't readily available to me.
However, based on the design of the new code and the available
information about these other CPU microarchitectures, I wouldn't expect
any significant regressions, and there's a good chance performance is
improved just as it is above.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
124 lines
5.3 KiB
Makefile
124 lines
5.3 KiB
Makefile
# SPDX-License-Identifier: GPL-2.0
|
|
#
|
|
# x86 crypto algorithms
|
|
|
|
obj-$(CONFIG_CRYPTO_TWOFISH_586) += twofish-i586.o
|
|
twofish-i586-y := twofish-i586-asm_32.o twofish_glue.o
|
|
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64) += twofish-x86_64.o
|
|
twofish-x86_64-y := twofish-x86_64-asm_64.o twofish_glue.o
|
|
obj-$(CONFIG_CRYPTO_TWOFISH_X86_64_3WAY) += twofish-x86_64-3way.o
|
|
twofish-x86_64-3way-y := twofish-x86_64-asm_64-3way.o twofish_glue_3way.o
|
|
obj-$(CONFIG_CRYPTO_TWOFISH_AVX_X86_64) += twofish-avx-x86_64.o
|
|
twofish-avx-x86_64-y := twofish-avx-x86_64-asm_64.o twofish_avx_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_586) += serpent-sse2-i586.o
|
|
serpent-sse2-i586-y := serpent-sse2-i586-asm_32.o serpent_sse2_glue.o
|
|
obj-$(CONFIG_CRYPTO_SERPENT_SSE2_X86_64) += serpent-sse2-x86_64.o
|
|
serpent-sse2-x86_64-y := serpent-sse2-x86_64-asm_64.o serpent_sse2_glue.o
|
|
obj-$(CONFIG_CRYPTO_SERPENT_AVX_X86_64) += serpent-avx-x86_64.o
|
|
serpent-avx-x86_64-y := serpent-avx-x86_64-asm_64.o serpent_avx_glue.o
|
|
obj-$(CONFIG_CRYPTO_SERPENT_AVX2_X86_64) += serpent-avx2.o
|
|
serpent-avx2-y := serpent-avx2-asm_64.o serpent_avx2_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_DES3_EDE_X86_64) += des3_ede-x86_64.o
|
|
des3_ede-x86_64-y := des3_ede-asm_64.o des3_ede_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CAMELLIA_X86_64) += camellia-x86_64.o
|
|
camellia-x86_64-y := camellia-x86_64-asm_64.o camellia_glue.o
|
|
obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64) += camellia-aesni-avx-x86_64.o
|
|
camellia-aesni-avx-x86_64-y := camellia-aesni-avx-asm_64.o camellia_aesni_avx_glue.o
|
|
obj-$(CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64) += camellia-aesni-avx2.o
|
|
camellia-aesni-avx2-y := camellia-aesni-avx2-asm_64.o camellia_aesni_avx2_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_BLOWFISH_X86_64) += blowfish-x86_64.o
|
|
blowfish-x86_64-y := blowfish-x86_64-asm_64.o blowfish_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CAST5_AVX_X86_64) += cast5-avx-x86_64.o
|
|
cast5-avx-x86_64-y := cast5-avx-x86_64-asm_64.o cast5_avx_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CAST6_AVX_X86_64) += cast6-avx-x86_64.o
|
|
cast6-avx-x86_64-y := cast6-avx-x86_64-asm_64.o cast6_avx_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o
|
|
aegis128-aesni-y := aegis128-aesni-asm.o aegis128-aesni-glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
|
|
chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o
|
|
chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o
|
|
|
|
obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
|
|
aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
|
|
aesni-intel-$(CONFIG_64BIT) += aes_ctrby8_avx-x86_64.o \
|
|
aes-gcm-aesni-x86_64.o \
|
|
aes-xts-avx-x86_64.o
|
|
ifeq ($(CONFIG_AS_VAES)$(CONFIG_AS_VPCLMULQDQ),yy)
|
|
aesni-intel-$(CONFIG_64BIT) += aes-gcm-avx10-x86_64.o
|
|
endif
|
|
|
|
obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
|
|
sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o
|
|
sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o
|
|
|
|
obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
|
|
sha256-ssse3-y := sha256-ssse3-asm.o sha256-avx-asm.o sha256-avx2-asm.o sha256_ssse3_glue.o
|
|
sha256-ssse3-$(CONFIG_AS_SHA256_NI) += sha256_ni_asm.o
|
|
|
|
obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
|
|
sha512-ssse3-y := sha512-ssse3-asm.o sha512-avx-asm.o sha512-avx2-asm.o sha512_ssse3_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_BLAKE2S_X86) += libblake2s-x86_64.o
|
|
libblake2s-x86_64-y := blake2s-core.o blake2s-glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL) += ghash-clmulni-intel.o
|
|
ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_POLYVAL_CLMUL_NI) += polyval-clmulni.o
|
|
polyval-clmulni-y := polyval-clmulni_asm.o polyval-clmulni_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CRC32C_INTEL) += crc32c-intel.o
|
|
crc32c-intel-y := crc32c-intel_glue.o
|
|
crc32c-intel-$(CONFIG_64BIT) += crc32c-pcl-intel-asm_64.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
|
|
crc32-pclmul-y := crc32-pclmul_asm.o crc32-pclmul_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
|
|
crct10dif-pclmul-y := crct10dif-pcl-asm_64.o crct10dif-pclmul_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
|
|
poly1305-x86_64-y := poly1305-x86_64-cryptogams.o poly1305_glue.o
|
|
targets += poly1305-x86_64-cryptogams.S
|
|
|
|
obj-$(CONFIG_CRYPTO_NHPOLY1305_SSE2) += nhpoly1305-sse2.o
|
|
nhpoly1305-sse2-y := nh-sse2-x86_64.o nhpoly1305-sse2-glue.o
|
|
obj-$(CONFIG_CRYPTO_NHPOLY1305_AVX2) += nhpoly1305-avx2.o
|
|
nhpoly1305-avx2-y := nh-avx2-x86_64.o nhpoly1305-avx2-glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_CURVE25519_X86) += curve25519-x86_64.o
|
|
|
|
obj-$(CONFIG_CRYPTO_SM3_AVX_X86_64) += sm3-avx-x86_64.o
|
|
sm3-avx-x86_64-y := sm3-avx-asm_64.o sm3_avx_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64) += sm4-aesni-avx-x86_64.o
|
|
sm4-aesni-avx-x86_64-y := sm4-aesni-avx-asm_64.o sm4_aesni_avx_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64) += sm4-aesni-avx2-x86_64.o
|
|
sm4-aesni-avx2-x86_64-y := sm4-aesni-avx2-asm_64.o sm4_aesni_avx2_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_ARIA_AESNI_AVX_X86_64) += aria-aesni-avx-x86_64.o
|
|
aria-aesni-avx-x86_64-y := aria-aesni-avx-asm_64.o aria_aesni_avx_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_ARIA_AESNI_AVX2_X86_64) += aria-aesni-avx2-x86_64.o
|
|
aria-aesni-avx2-x86_64-y := aria-aesni-avx2-asm_64.o aria_aesni_avx2_glue.o
|
|
|
|
obj-$(CONFIG_CRYPTO_ARIA_GFNI_AVX512_X86_64) += aria-gfni-avx512-x86_64.o
|
|
aria-gfni-avx512-x86_64-y := aria-gfni-avx512-asm_64.o aria_gfni_avx512_glue.o
|
|
|
|
quiet_cmd_perlasm = PERLASM $@
|
|
cmd_perlasm = $(PERL) $< > $@
|
|
$(obj)/%.S: $(src)/%.pl FORCE
|
|
$(call if_changed,perlasm)
|
|
|
|
# Disable GCOV in odd or sensitive code
|
|
GCOV_PROFILE_curve25519-x86_64.o := n
|