[PATCH] Optimize 32-bit sparc T1 multiply routines.

David Miller <davem at davemloft.net>
Fri Jan 18 21:02:29 CET 2013


From: Torbjorn Granlund <tg at gmplib.org>
Date: Sun, 06 Jan 2013 13:48:42 +0100

> I recommend 4-way unrolling.
> 
> The summation method of mpn/powerpc64/mode64/aorsmul_1.asm might be
> best.

Thanks for all of these pointers and suggestions.

While waiting for the FSF to execute my assignment, I tweaked my
existing 2-way unrolled mul_1 and addmul_1 loops.  Currently on T4 I'm
at:

	mul_1		3.8 cycles/limb

L(top):
        ldx     [up+0], %g1     ! load limb u0
        sub     n, 2, n         ! two limbs per iteration
        ldx     [up+8], %o4     ! load limb u1
        mulx    %g1, v0, %g3    ! g3 = low64(u0 * v0)
        add     up, 16, up
        umulxhi %g1, v0, %g2    ! g2 = high64(u0 * v0)
        mulx    %o4, v0, %g1    ! g1 = low64(u1 * v0)
        add     rp, 16, rp
        addxccc %g3, %o5, %g3   ! add previous high limb plus carry
        umulxhi %o4, v0, %o5    ! o5 = high64(u1 * v0), carried to next pass
        stx     %g3, [rp-16]
        addxccc %g1, %g2, %g1   ! fold in high half of first product
        brgz    n, L(top)
         stx    %g1, [rp-8]     ! store second limb in the delay slot
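
For anyone reading along, this kernel computes the usual GMP mpn
primitive.  As a reference for what the loop above has to produce,
here is a minimal C sketch of mul_1's semantics (the name mul_1_ref is
mine, and it assumes 64-bit limbs and a compiler with unsigned
__int128; it is not the optimized code):

#include <stdint.h>
#include <stddef.h>

/* rp[0..n-1] = up[0..n-1] * v0, returning the carry-out limb.  */
uint64_t
mul_1_ref (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v0)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v0 + cy;
      rp[i] = (uint64_t) p;        /* low half goes to the result */
      cy = (uint64_t) (p >> 64);   /* high half is the next carry */
    }
  return cy;
}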

	addmul_1	5.5 cycles/limb

L(top):
        ldx     [up+0], %l0     ! load limb u0
        ldx     [up+8], %l1     ! load limb u1
        ldx     [rp+0], %l2     ! load destination limb r0
        ldx     [rp+8], %l3     ! load destination limb r1
        mulx    %l0, v0, %o0    ! o0 = low64(u0 * v0)
        add     up, 16, up
        umulxhi %l0, v0, %o1    ! o1 = high64(u0 * v0)
        add     rp, 16, rp
        mulx    %l1, v0, %o2    ! o2 = low64(u1 * v0)
        sub     n, 2, n         ! two limbs per iteration
        umulxhi %l1, v0, %o3    ! o3 = high64(u1 * v0)
        addxccc %o0, %o5, %o0   ! add previous carry limb plus carry flag
        addxccc %o2, %o1, %o2   ! fold in high half of first product
        addxc   %g0, %o3, %o5   ! o5 = carry limb for the next pass
        addcc   %o0, %l2, %o0   ! second chain: add in r0
        stx     %o0, [rp-16]
        addxccc %o2, %l3, %o2   ! add in r1 with carry
        brgz    n, L(top)
         stx    %o2, [rp-8]     ! store second limb in the delay slot
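
And the corresponding sketch for addmul_1, which accumulates into rp
(rp[] += up[] * v0) instead of overwriting it; same caveats and naming
as above:

/* rp[0..n-1] += up[0..n-1] * v0, returning the carry-out limb.  */
uint64_t
addmul_1_ref (uint64_t *rp, const uint64_t *up, size_t n, uint64_t v0)
{
  uint64_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v0 + rp[i] + cy;
      rp[i] = (uint64_t) p;
      cy = (uint64_t) (p >> 64);
    }
  return cy;
}

The extra addition is why the asm needs two carry chains: the addxccc
chain sums the partial products into a carry limb, then addcc/addxccc
folds in the old rp limbs, with that chain's carry consumed at the top
of the next iteration.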

Just FYI...

