Sandybridge addmul_N challenge

Niels Möller nisse at
Wed Feb 22 19:16:23 CET 2012

Torbjorn Granlund <tg at> writes:

> I recently played with karatsuba-based addmul_2 for s390.  For MUL-
> challenged machines, that might be an option.  Sandybridge is OK there,
> but Bull-dozer is not.  A disadvantage is that such code would be
> unsuitable for code which wants to be "side-channel silent".

I'm pretty sure one can do Karatsuba branchfree. Not sure one can do
that and do it fast (but if the alternative is branches, they're costly


  mov	u0, a
  sub	u1, a
  lea	(u0, u1), t
  cmovc	t, a
  sbb	mask, mask

Now a = |u0 - u1|, with sign in mask. Can easily be done also without
cmov, but with a longer chain of dependent instructions:

  mov	u0, a
  sub	u1, a
  sbb	mask, mask
  xor	mask, a
  sub	mask, a
Get b = |v0 - v1| in the same way, and arrange so that the final mask is
all ones if the term |u0 - u1| * |v0 - v1| should be subtracted, i.e., if
(u0 - u1)(v0 - v1) >= 0.


Add in the signed term (u0 - u1) * (v0 - v1) to the three limbs
<r3,r2,r1>, using two's complement:

  mov	a, %rax
  mul	b
  xor	mask, %rax
  xor	mask, %rdx
  bt	$0, mask	C Set carry from mask. Any better way?
  C <mask, %rdx, %rax> + c = - (u0 - u1) (v0 - v1), in two's complement.
  adc	%rax, r1
  adc	%rdx, r2
  adc	mask, r3

Alternatively, one could use (u0 + u1) * (v0 + v1), with a couple of
conditional adds instead.

Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

More information about the gmp-devel mailing list