Alternative div_qr_1
Torbjorn Granlund
tg at gmplib.org
Tue Jun 15 17:30:11 CEST 2010
nisse at lysator.liu.se (Niels Möller) writes:
> The algorithm is really simple, the iteration is
>
> <r2, r1, r0> <-- <r0, u[j]> + r1 b2 + r2 b3
> I've now implemented a slightly different variant,
>
> <r2, r1, r0> <-- <r0, u[j]> + (r2 b2 - c d) B + r1 b2
>
> where c is the carry from r0 + r2 b2 (here, r2 is a single bit, and in
> fact <r2, r1> < B + d - 1). Note that the b3 constant is no longer
> needed.
What is d? I don't see it in the code. And what is B2mb in the code?
I assume one may write your iteration slightly more simply, like this:

<r2, r1, r0> <-- (r0 + r2 b2 - c d) B + r1 b2 + u[j]
> I'm attaching a new x86_64/mod_1_1.asm. The function mpn_mod_1_1p is
> rewritten from scratch, while mpn_mod_1_1p_cps is unchanged.
I have rewritten the cps function, and pushed it. I might also have
made a few cosmetic changes to the main function. Beware of conflicts
if you update!
(I added a SHLD_SLOW m4 parameter, the idea being that we should set
it in gmp-mparam.h for VIA Nano and Intel Atom, instead of having more
separate files. You might want to do the same.)
> Runs at 6 cycles per limb on AMD (panther.gmplib.org).
Nice! It runs faster than the old code also on corei and core2.
> On Intel corei (shell.gmplib.org) results are a little odd; it runs at
> 11 cycles per limb when the divisor is unnormalized, and 11.5 when the
> divisor is normalized. Which is strange, since there's no instruction in
> the loop which I'd expect to have data-dependent timing.
I can only reproduce *some* data dependency, but not quite like that:
   n     unnorm      norm
1053   #11.1112   11.1693
1158   #11.1265   11.1803
1273   #11.1508   11.1606
1400    11.2054   11.0793
1540    11.2064   11.1621
1694    11.1866  #11.0681
1863   #11.0945   11.1504
2049    11.1406  #11.1297
2253   #11.0609   11.3940
2478   #11.1096   11.1361
2725   #11.1079   11.2948
2997   #11.1052   11.2135
So for n=2253 we're at about 11.4, for other sizes things run well.
This seems very reproducible. Since we're reading from just one vector,
alignment should not be an issue, so it ought to just be data related.
I suspect we enter the loop with different pipeline state. Remember the
register file read port limitations of all P6 processors; I suspect that
is where we ought to look for an explanation.
For repentium.gmplib.org, a core2, the effect is more pronounced:
   n     unnorm      norm
1053   #12.3559   12.5564
1158   #12.3411   12.5681
1273   #12.3414   12.5888
1400   #12.3232   12.6318
1540   #12.3447   12.5185
1694   #12.3239   12.5408
1863   #12.3456   12.5639
2049   #12.3635   12.5597
2253   #12.2976   12.4961
2478   #12.2884   12.5203
2725   #12.2947   12.5466
2997   #12.3461   12.4860
> I'd like to extend this to also compute the quotient, then in each
> iteration one needs to also compute (in the normalized case; I haven't
> thought about the unnormalized case yet):
>
> <q2, q1, q0> <-- <r2, r1> <1, v> + c B
>              = r2 <1, v> + r1 v + (r1 + c) B
>
> where c is the same carry out as in the r update. And then accumulate
> those limbs. Unfortunately, the simple way of doing those additions
> takes an awful lot of instructions. The critical path is still the
> dependency on r1, which will be 7 cycles (one additional move on the
> path, to free up %rax for the r1 v multiply in the q computation). So
> we're going to be limited by decoder bandwidth rather than latency. I
> hope that one should be able to get a loop running in 11 or 10 cycles.
> I need to think more about this.
We have had a 'classical' div_qr_1_pi2 running at 10.3 c/l for a long
time. (It is not yet in the repo.)
--
Torbjörn
More information about the gmp-devel mailing list