div_qr_1n_pi1
Torbjörn Granlund
tg at gmplib.org
Thu Jun 3 10:40:28 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
The critical path, via the u1 variable, is
  umul_ppmm (p1, p0, u1, B2);
  add_mssaaaa (cy, u1, u0, u0, up[j], p1, p0);
  u1 -= cy & d;
Nice!
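For anyone following along: add_mssaaaa is like longlong.h's add_ssaaaa,
except that it also returns the carry out as a mask, which is what lets
"u1 -= cy & d" work without a branch.  A plain C paraphrase (not GMP's
actual definition) is roughly:

  #define add_mssaaaa(m, sh, sl, ah, al, bh, bl)                         \
    do {                                                                 \
      /* two-word add sh:sl = ah:al + bh:bl, carry out in m as a mask */ \
      mp_limb_t __sl = (al) + (bl);                                      \
      mp_limb_t __c0 = __sl < (al);          /* carry out of low word */ \
      mp_limb_t __t = (ah) + (bh);                                       \
      mp_limb_t __c1 = __t < (ah);           /* carry out of high add */ \
      mp_limb_t __sh = __t + __c0;                                       \
      __c1 += __sh < __t;                     /* carry from adding __c0 */\
      (m) = -__c1;                            /* 0 or ~(mp_limb_t) 0 */  \
      (sl) = __sl;                                                       \
      (sh) = __sh;                                                       \
    } while (0)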
The (cy & d) term is multiplied in the next iteration by B2, i.e., we
have either the invariant d * B2 or 0 as the contribution to the p1,p0
product. If we can somehow add that to u0,up[j] during the umul_ppmm
then we could save several more cycles.
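To spell the observation out, the mask can be applied to a hoisted
two-word constant d * B2 instead of to u1.  Here is an untested sketch,
not the real div_qr_1n_pi1 loop (quotient bookkeeping, declarations and
exact loop bounds omitted; the names B2d1/B2d0 are invented, and it
presumes the usual range invariants so that u1 * B2 - (cy & d) * B2
cannot underflow):

  mp_limb_t B2d1, B2d0, cy = 0;
  umul_ppmm (B2d1, B2d0, d, B2);        /* invariant, computed once */

  for (j = n - 2; j >= 0; j--)          /* schematic bounds */
    {
      umul_ppmm (p1, p0, u1, B2);       /* u1 left unadjusted */
      sub_ddmmss (p1, p0, p1, p0, cy & B2d1, cy & B2d0);
      add_mssaaaa (cy, u1, u0, u0, up[j], p1, p0);
    }
  u1 -= cy & d;                         /* final correction still needed */

As written the masked sub_ddmmss is of course still serial between the
multiply and the add; cycles would only be saved if that correction can
be folded into the u0,up[j] side while the multiply is in flight, which
is exactly the "somehow" above.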
Your snippet above should translate to the following x86 instructions
(assuming we unroll by 2x to avoid the read-after-write of u0 in the
add_mssaaaa macro; u0 will have to alternate between two registers):
  INSN   CYCLE
  mul    0,8
  add    3
  adc    4
  sbb    5
  and    6
  sub    7
(A mov or two might be missing; these are free on modern x86 CPUs as
they only modify the rename map.)
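For reference, the 2x unrolling assumed above might look roughly like
this at the C level; an untested sketch, with u0a/u0b as invented names
for the two registers u0 alternates between, and with loop bounds and
odd-length handling glossed over:

  for (j = n - 3; j >= 0; j -= 2)       /* schematic bounds */
    {
      /* even step: reads u0a, writes u0b */
      umul_ppmm (p1, p0, u1, B2);
      add_mssaaaa (cy, u1, u0b, u0a, up[j + 1], p1, p0);
      u1 -= cy & d;

      /* odd step: reads u0b, writes u0a */
      umul_ppmm (p1, p0, u1, B2);
      add_mssaaaa (cy, u1, u0a, u0b, up[j], p1, p0);
      u1 -= cy & d;
    }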
Judging from https://gmplib.org/devel/asm, this should give a boost of
about 20% on current AMD and Intel CPUs.
If we dare use cmov (and accept its presumed side-channel leakage), we
could probably shorten the critical path by a cycle. The "sbb" and "and"
would go away.
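At the C level that variant is just a branchless select; a sketch,
assuming the carry is available as a 0/1 value cy_bit rather than as a
mask (and assuming the compiler or asm writer actually turns it into a
cmov):

  mp_limb_t u1_minus_d = u1 - d;        /* computed unconditionally */
  u1 = cy_bit ? u1_minus_d : u1;        /* hoped-for lea + cmov */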
The Arm code (both v7 and v8) should get really neat using their
conditional execution. Again, that might be a side-channel leakage
hazard.
(I am a bit fixated on side-channel leakage; our present
implementations of these particular functions are not side-channel
silent.)
--
Torbjörn
Please encrypt, key id 0xC8601622