Shared toom evaluation functions

Torbjorn Granlund tg at gmplib.org
Sun Nov 15 22:14:32 CET 2009

bodrato at mail.dm.unipi.it writes:

Ciao,

> Several have mpn_addlsh1_n, and they run up to 2x faster than separate
> lshift and add_n.  (Same goes for sub.)  No machine or almost no has
> mpn_addlsh_n, since it has proven tricky to make fast.

and also a speed-up of 1.5x can be better than nothing...

A addlsh2_n would be slightly easier to write than a general addlsh_n,
since the variable-size shifts need the count in the rcx register, which
means we need some copying.

We can actually use lea for computing a+4b.  This is a possible
inner loop:

L(top):
mov     (vp,i,8), %r8
lea     (%r11,%r8,4), %r12
shl     $62, %r8 mov 8(vp,i,8), %r9 lea (%r8,%r9,4), %r13 shl$62, %r9

mov     16(vp,i,8), %r10
lea     (%r9,%r10,4), %r14
shl     $62, %r10 mov 24(vp,i,8), %r11 lea (%r10,%r11,4), %r15 shl$62, %r11

add     %rax, %rax              C restore carry

sbb     %rax, %rax              C save carry for next

mov     %r12, (rp,i,8)
mov     %r13, 8(rp,i,8)
mov     %r14, 16(rp,i,8)
mov     %r15, 24(rp,i,8)

js      L(top)

With the unrolling above, this should approach 2 c/l.

> We should use mpn_addlsh1_n in more places I think, even for the s,t
> related computations, such as pm2 in toom72.  That will be a bit tricky,

All evaluations in \pm2 and \pm1/2 for operands split in two should better
use the simple trick:
b0 + 2*b1 = (b0 + b1) + b1
b0 - 2*b1 = (b0 - b1) - b1
2*b0 + b1 = (b0 + b1) + b0
2*b0 - b1 = (b0 - b1) + b0
where one needs the evaluations in \pm1, but avoid any shift.

I think we do that already, at least on some of the toom files.

And if we want to continue adding assembly primitives, we could of
course do the above much faster using one single loop, doing two reads
and two writes per limb, for both \pm1 and \pm2.

--
Torbjörn