gcdext_1 questions

Torbjorn Granlund tg at gmplib.org
Thu Dec 3 23:21:15 CET 2009

nisse at lysator.liu.se (Niels Möller) writes:

    while (u != v)
        unsigned count;
        if (u > v)
            u -= v;
            count_trailing_zeros (count, u);
            u >>= count;
            t0 += t1; t1 <<= count;
            s0 += s1; s1 <<= count;
            v -= u;
            count_trailing_zeros (count, v);
            v >>= count;
            t1 += t0; t0 <<= count;
            s1 += s0; s0 <<= count;
        shift += count;

This code will be speed limited by branch mispredicts (1/2 per
iteration, costing roughly 14 cycles each on current x86 machines) plus
the cost of count_trailing_zeros (7-9 cycles on AMD K7,K8,K9, 4 cycles
on AMD K10, 2-3 cycles on all Intel P6).

This really ought to be written in assembly, just like gcd_1 is.  Then
we can play with various things like cmov, or perhaps put t0*B+s0 in one
128-bit xmm register and t1*B+s1 in another.  The four last statements
in reach if-else arm would then become two.


More information about the gmp-devel mailing list