tg at gmplib.org
Wed Apr 10 18:29:53 CEST 2013
nisse at lysator.liu.se (Niels Möller) writes:
for (rl = 0, i = 0; i < nelems; i++)
rl += table[...] & -(mp_limb_t) (i == k);
rp[...] = rl;
Reduces the number of stores from O(n^2) to O(n), and instead increases
the mask creation from O(n) to O(n^2). Loads are the same, O(n^2), in
Yes. Or more accurately, the number of leads are halved, and the number
of stores go from n*m to n, if n is the vector size and m is the number
At least in assembly, it would make sense to unroll it a couple of
times (depending on how many r limbs one can fit in registers), to
reduce the number of times we have to construct the mask.
Unolling is crucial. I am writing a few 4-way unrolled variants now.
The multitude and pattern of mask computations make side channel leakage
worse if the mask computation is made stupidly. I don't trust compilers
here, since they might use a conditional move or other leaky method.
This new algorithm would make such leakage very pronounced.
More information about the gmp-devel