addmul_k with toom (was: Re: GMP's x86-32 performance)

Sat Jun 17 22:38:19 UTC 2017

tg at gmplib.org (Torbjörn Granlund) writes:

> nisse at lysator.liu.se (Niels Möller) writes:

>   we might also try doing addmul_2 using toom32, which
>   would save 1/3 of the mul instructions. Toom32 is nice because we can
>   use the four easiest evaluation points: 0, infinity, and +/-1.
>   
> Give it a shot!  A concern with any toom stuff is side-channel leakage.
> We currently assume _basecase is non-leaky.  So any low-level trickery
> needs to be applied above the _basecase cutoff points (unless we split
> out _basecase_sec functions).

If we do this with toom32, we'd need a separate basecase_sec. I think
there's little chance to gain any speedup if we enforce side-channel
silence. Looks better for toom2, there I think it costs only a single
instruction in the innerloop to handle all signs in a side-channel
silent fashion. If we don't trust cmov to be side-channel silent, we'd
also need to do conditional negation using a mask and xor + add instead
of cmov, unclear what that means for performance.

>   Or addmul_3 using toom32, which has the additional advantage that more
>   of the evaluation work is loop-invariant, and we could also jump to
>   separate innerloops depending on the carry bits from evaluation.

I've looked a bit more at this, writing some pseudo C code for the
simplest case where both evaluated values v0 + v1 + v2 and v0 - v1 + v2
are even, non-negative, and fit in a single limb after division by two.

One iteration of addmul_3 would then need roughly 5 alu instructions for
evaluation, 4 multiplies, 8 instructions to combine products into odd
and even terms (u0 v0 + u1 v1 + u0 v2) and (u0 v1 + u1 v0 + u1 v2), and
another 14 for the largish final additions, including the three carry
words from the previous iteration.

If we divide by 6 (since this corresponds to the work of 6 iterations of
addmul_1), that's 2/3 multiply instructions and 4.5 alu instructions.

A sketch for addmul_2 using toom2 (and evaluating at the -1 point) gives
3 instructions for evaluation, 3 multiplies, 13 instructions for
interpolation and addition. Divide by 4, to get 3/4 multiply and 4 alu
instructions to do the same amount of work as a single addmul_1
iteration. 

To compare to x86_64/aorsmul_1.asm, which uses one multiply and 3 adc
instructions per iteration, and at best runs at 2.5 cycles per limb, and
x86_64/coreibwl/addmul_1.asm, which uses an iteration consisting of
mulx, adox, adcx, and runs at about 1.66 cycles per limb.

> Slow hardware multiplication helps.  :-)

At the list https://gmplib.org/devel/asm.html, we have the "Intel noc"
(what's that?) with current addmul_1 running at 15 cycles per limb, and
Intel Atom where it runs at 19 cycles / limb.

Writing up an addmul_2 for intel atom using toom2 is probably the most
promising step in this direction.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.