nisse at lysator.liu.se
Wed Apr 10 15:06:41 CEST 2013
Torbjorn Granlund <tg at gmplib.org> writes:
> It seems better two work in the opposite direction, i.e., to load and
> select first element 0 of each vector, and store the result, then do the
> same with all element 1, etc.
So for each limb position, something like
for (rl = 0, i = 0; i < nelems; i++)
rl += table[...] & -(mp_limb_t) (i == k);
rp[...] = rl;
Reduces the number of stores from O(n^2) to O(n), and instead increases
the mask creation from O(n) to O(n^2). Loads are the same, O(n^2), in
At least in assembly, it would make sense to unroll it a couple of times
(depending on how many r limbs one can fit in registers), to reduce the
number of times we have to construct the mask.
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel