Better tabselect

Wed Apr 10 18:29:53 CEST 2013

nisse at lysator.liu.se (Niels Möller) writes:

    mp_limb_t rl;

    for (rl = 0, i = 0; i < nelems; i++)
       rl += table[...] & -(mp_limb_t) (i == k);
    rp[...] = rl;

  Reduces the number of stores from O(n^2) to O(n), and instead increases
  the mask creation from O(n) to O(n^2). Loads are the same, O(n^2), in
  either case.

Yes.  Or more accurately, the number of leads are halved, and the number
of stores go from n*m to n, if n is the vector size and m is the number
of vectors.

  At least in assembly, it would make sense to unroll it a couple of
  times (depending on how many r limbs one can fit in registers), to
  reduce the number of times we have to construct the mask.

Unolling is crucial.  I am writing a few 4-way unrolled variants now.

The multitude and pattern of mask computations make side channel leakage
worse if the mask computation is made stupidly.  I don't trust compilers
here, since they might use a conditional move or other leaky method.

This new algorithm would make such leakage very pronounced.

-- 
Torbjörn