Shared toom evaluation functions

Sun Nov 15 22:14:32 CET 2009

bodrato at mail.dm.unipi.it writes:

  Ciao,

  > Several have mpn_addlsh1_n, and they run up to 2x faster than separate
  > lshift and add_n.  (Same goes for sub.)  No machine or almost no has
  > mpn_addlsh_n, since it has proven tricky to make fast.

  In Toom evaluations addlsh2_n would be much more useful than addlsh1_n,
  and also a speed-up of 1.5x can be better than nothing...

A addlsh2_n would be slightly easier to write than a general addlsh_n,
since the variable-size shifts need the count in the rcx register, which
means we need some copying.

We can actually use lea for computing a+4b.  This is a possible
inner loop:

L(top):
        mov     (vp,i,8), %r8
        lea     (%r11,%r8,4), %r12
        shl     $62, %r8

        mov     8(vp,i,8), %r9
        lea     (%r8,%r9,4), %r13
        shl     $62, %r9

        mov     16(vp,i,8), %r10
        lea     (%r9,%r10,4), %r14
        shl     $62, %r10

        mov     24(vp,i,8), %r11
        lea     (%r10,%r11,4), %r15
        shl     $62, %r11

        add     %rax, %rax              C restore carry

        adc     (up,i,8), %r12
        adc     8(up,i,8), %r13
        adc     16(up,i,8), %r14
        adc     24(up,i,8), %r15

        sbb     %rax, %rax              C save carry for next

        mov     %r12, (rp,i,8)
        mov     %r13, 8(rp,i,8)
        mov     %r14, 16(rp,i,8)
        mov     %r15, 24(rp,i,8)

        add     $4, i
        js      L(top)

With the unrolling above, this should approach 2 c/l.

  > We should use mpn_addlsh1_n in more places I think, even for the s,t
  > related computations, such as pm2 in toom72.  That will be a bit tricky,

  All evaluations in \pm2 and \pm1/2 for operands split in two should better
  use the simple trick:
  b0 + 2*b1 = (b0 + b1) + b1
  b0 - 2*b1 = (b0 - b1) - b1
  2*b0 + b1 = (b0 + b1) + b0
  2*b0 - b1 = (b0 - b1) + b0
  where one needs the evaluations in \pm1, but avoid any shift.

I think we do that already, at least on some of the toom files.

And if we want to continue adding assembly primitives, we could of
course do the above much faster using one single loop, doing two reads
and two writes per limb, for both \pm1 and \pm2.

-- 
Torbjörn