Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
(tested on Broadwell Xeon) while not increasing code size too much.
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>