nisse at lysator.liu.se
Tue May 11 14:38:40 CEST 2010
Torbjorn Granlund <tg at gmplib.org> writes:
> You actually wrote an x86_64 assembly mod_1_1 some time ago that you
> claimed ran in 6 c/l.
It took me some time to dig it up, but I found this in an email I wrote
Aug 8 last year.
mul b2 C 0, 6
and b3, r2 C 1
mov u[j], t0 C
mov r0, t1 C
add r2, t0 C 2
adc $0, t1 C 3
sbb r2, r2 C 4
add %rax, t0 C 4
adc %rdx, t1 C 5
sbb $0, r2 C 6
mov t0, r0
mov t1, %rax
The algorithm is really simple, the iteration is
<r2, r1, r0> <-- <r0, u[j]> + r1 b2 + r2 b3
where b2 = B^2 (mod d), b3 = B^3 (mod d). r2 is one-bit number,
represented so that r2 b3 is an and operation, and the right hand side
is less than B^2 + B d.
Expessing the other algorithm in similar form, it was
<r2, r1, r0> <-- <r0, u[j]> + r1 b2 + r2 b2 (B - d)
We see that it differs only in the constant coefficient for r2, and
clearly b2 (B - d) = b3 (mod d).
The assembler code above needs some rearrangement to get the mov t1,
%rax off the recurrency chain. This variant might work better:
mul b2 C 0 6
and b3, r2 C 0 7
mov u[j], t0 C
add r2, t0 C 1 8
adc $0, t1 C 2 9
sbb r2, r2 C 3 10 *
add %rax, t0 C 4 10 *
mov t1, %rax C 4 10 *
adc %rdx, %rax C 5 11
sbb $0, r2 C 6 12
mov t0, t1 C 7
So this should run in 6 cycles if we're lucky (and like the other code,
there are three independent instructions scheduled in the same cycle,
marked with * above. It would be fine to postpone the first of them, but
the out-of-order machinery won't know that, I'm afraid).
If we let B^2 = m d + b2 (m one limb and a bit) and B^3 = m2 d + b3
(m2 two limbs and a bit) the corresponding q piece is
r2 m2 + r1 m
which is one multiply, one conditional add of the two-limb number m3,
and handling of the most significant implicit bits.
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
More information about the gmp-devel