Shared toom evaluation functions
tg at gmplib.org
Sun Nov 15 22:14:32 CET 2009
bodrato at mail.dm.unipi.it writes:
> Several have mpn_addlsh1_n, and they run up to 2x faster than separate
> lshift and add_n. (Same goes for sub.) No machine or almost no has
> mpn_addlsh_n, since it has proven tricky to make fast.
In Toom evaluations addlsh2_n would be much more useful than addlsh1_n,
and also a speed-up of 1.5x can be better than nothing...
A addlsh2_n would be slightly easier to write than a general addlsh_n,
since the variable-size shifts need the count in the rcx register, which
means we need some copying.
We can actually use lea for computing a+4b. This is a possible
mov (vp,i,8), %r8
lea (%r11,%r8,4), %r12
shl $62, %r8
mov 8(vp,i,8), %r9
lea (%r8,%r9,4), %r13
shl $62, %r9
mov 16(vp,i,8), %r10
lea (%r9,%r10,4), %r14
shl $62, %r10
mov 24(vp,i,8), %r11
lea (%r10,%r11,4), %r15
shl $62, %r11
add %rax, %rax C restore carry
adc (up,i,8), %r12
adc 8(up,i,8), %r13
adc 16(up,i,8), %r14
adc 24(up,i,8), %r15
sbb %rax, %rax C save carry for next
mov %r12, (rp,i,8)
mov %r13, 8(rp,i,8)
mov %r14, 16(rp,i,8)
mov %r15, 24(rp,i,8)
add $4, i
With the unrolling above, this should approach 2 c/l.
> We should use mpn_addlsh1_n in more places I think, even for the s,t
> related computations, such as pm2 in toom72. That will be a bit tricky,
All evaluations in \pm2 and \pm1/2 for operands split in two should better
use the simple trick:
b0 + 2*b1 = (b0 + b1) + b1
b0 - 2*b1 = (b0 - b1) - b1
2*b0 + b1 = (b0 + b1) + b0
2*b0 - b1 = (b0 - b1) + b0
where one needs the evaluations in \pm1, but avoid any shift.
I think we do that already, at least on some of the toom files.
And if we want to continue adding assembly primitives, we could of
course do the above much faster using one single loop, doing two reads
and two writes per limb, for both \pm1 and \pm2.
More information about the gmp-devel