div_qr_1n_pi1
Torbjörn Granlund
tg at gmplib.org
Thu Jun 3 10:40:28 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
The critical path, via the u1 variable, is
  umul_ppmm (p1, p0, u1, B2);
  add_mssaaaa (cy, u1, u0, u0, up[j], p1, p0);
  u1 -= cy & d;
Nice!
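For anyone following along: add_mssaaaa is like longlong.h's add_ssaaaa,
except that it also returns the carry out as a mask, which is what lets
"u1 -= cy & d" work without a branch.  A plain C paraphrase (not GMP's
actual definition) is roughly:

  #define add_mssaaaa(m, sh, sl, ah, al, bh, bl)                         \
    do {                                                                 \
      /* two-word add sh:sl = ah:al + bh:bl, carry out in m as a mask */ \
      mp_limb_t __sl = (al) + (bl);                                      \
      mp_limb_t __c0 = __sl < (al);          /* carry out of low word */ \
      mp_limb_t __t = (ah) + (bh);                                       \
      mp_limb_t __c1 = __t < (ah);           /* carry out of high add */ \
      mp_limb_t __sh = __t + __c0;                                       \
      __c1 += __sh < __t;                     /* carry from adding __c0 */\
      (m) = -__c1;                            /* 0 or ~(mp_limb_t) 0 */  \
      (sl) = __sl;                                                       \
      (sh) = __sh;                                                       \
    } while (0)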
The (cy & d) term is multiplied in the next iteration by B2, i.e., we
have either the invariant d * B2 or 0 as the contribution to the p1,p0
product. If we can somehow add that to u0,up[j] during the umul_ppmm
then we could save several more cycles.
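To spell the observation out, the mask can be applied to a hoisted
two-word constant d * B2 instead of to u1.  Here is an untested sketch,
not the real div_qr_1n_pi1 loop (quotient bookkeeping, declarations and
exact loop bounds omitted; the names B2d1/B2d0 are invented, and it
presumes the usual range invariants so that u1 * B2 - (cy & d) * B2
cannot underflow):

  mp_limb_t B2d1, B2d0, cy = 0;
  umul_ppmm (B2d1, B2d0, d, B2);        /* invariant, computed once */

  for (j = n - 2; j >= 0; j--)          /* schematic bounds */
    {
      umul_ppmm (p1, p0, u1, B2);       /* u1 left unadjusted */
      sub_ddmmss (p1, p0, p1, p0, cy & B2d1, cy & B2d0);
      add_mssaaaa (cy, u1, u0, u0, up[j], p1, p0);
    }
  u1 -= cy & d;                         /* final correction still needed */

As written the masked sub_ddmmss is of course still serial between the
multiply and the add; cycles would only be saved if that correction can
be folded into the u0,up[j] side while the multiply is in flight, which
is exactly the "somehow" above.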
Your snippet above should translate to the following x86 instructions
(assuming we unroll by 2x to avoid the read-after-write of u0 in the
add_mssaaaa macro; u0 will have to alternate between two registers):
  INSN   CYCLE
  mul    0,8
  add    3
  adc    4
  sbb    5
  and    6
  sub    7
(A mov or two might be missing; these are free on modern x86 CPUs as
they only modify the rename map.)
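For reference, the 2x unrolling assumed above might look roughly like
this at the C level; an untested sketch, with u0a/u0b as invented names
for the two registers u0 alternates between, and with loop bounds and
odd-length handling glossed over:

  for (j = n - 3; j >= 0; j -= 2)       /* schematic bounds */
    {
      /* even step: reads u0a, writes u0b */
      umul_ppmm (p1, p0, u1, B2);
      add_mssaaaa (cy, u1, u0b, u0a, up[j + 1], p1, p0);
      u1 -= cy & d;

      /* odd step: reads u0b, writes u0a */
      umul_ppmm (p1, p0, u1, B2);
      add_mssaaaa (cy, u1, u0a, u0b, up[j], p1, p0);
      u1 -= cy & d;
    }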
Judging from https://gmplib.org/devel/asm, this should give a boost of
about 20% on current AMD and Intel CPUs.
If we dare use cmov (and accept its presumed side-channel leakage), we
could probably shorten the critical path by a cycle. The "sbb" and "and"
would go away.
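At the C level that variant is just a branchless select; a sketch,
assuming the carry is available as a 0/1 value cy_bit rather than as a
mask (and assuming the compiler or asm writer actually turns it into a
cmov):

  mp_limb_t u1_minus_d = u1 - d;        /* computed unconditionally */
  u1 = cy_bit ? u1_minus_d : u1;        /* hoped-for lea + cmov */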
The Arm code (both v7 and v8) should get really neat using their
conditional execution. Again, that might be a side-channel leakage
hazard.
(I am a bit fixated on side-channel leakage; our present
implementations of these particular functions are not side-channel
silent.)
--
Torbjörn
Please encrypt, key id 0xC8601622