Please update addaddmul_1msb0.asm to support ABI in mingw64
Niels Möller
nisse at lysator.liu.se
Wed Oct 6 19:54:56 UTC 2021
nisse at lysator.liu.se (Niels Möller) writes:
> If we have adox/adcx, use same strategy as suggested for
> addaddmul_1msb0, but subtract rather than add in the chain with long
> lived carry.
Here's a sketch of a loop that should work for both addaddmul_1msb0 and
addsubmul_1msb0:
L(top):
	mov	(ap, n, 8), %rdx
	mulx	%r8, alo, hi
	adox	ahi, alo
	mov	hi, ahi		C mov could go away with 2-way unrolling
	adox	zero, ahi	C clears O
	mov	(bp, n, 8), %rdx
	mulx	%r9, blo, hi
	adox	bhi, blo
	mov	hi, bhi
	adox	zero, bhi	C clears O
	adc	blo, alo	C or sbb, for addsubmul_1msb0
	mov	alo, (rp, n, 8)
	inc	n
	jnz	L(top)
L(done):
	adc	bhi, ahi	C no carry out, thanks to msb0
	mov	ahi, %rax	C return value
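To pin down the intended semantics, here's a C model of the addadd
variant (hypothetical names, and assuming a GMP-style interface that
computes {rp,n} = u*{ap,n} + v*{bp,n} and returns the high limb; the
comments map back to the instructions above):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;

limb
addaddmul_1msb0_model (limb *rp, const limb *ap, const limb *bp,
		       size_t n, limb u, limb v)	/* u, v < 2^63 */
{
  limb ahi = 0, bhi = 0, cy = 0;	/* high accumulators and CF */
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 pa = (unsigned __int128) u * ap[i];
      unsigned __int128 pb = (unsigned __int128) v * bp[i];
      /* adox ahi, alo; mov hi, ahi; adox zero, ahi */
      limb alo = (limb) pa + ahi;
      ahi = (limb) (pa >> 64) + (alo < ahi);
      /* the same for the b chain */
      limb blo = (limb) pb + bhi;
      bhi = (limb) (pb >> 64) + (blo < bhi);
      /* adc blo, alo -- the long-lived carry chain */
      limb t = alo + blo;
      limb r = t + cy;
      cy = (t < alo) | (r < t);
      rp[i] = r;
    }
  return ahi + bhi + cy;	/* final adc; no overflow, thanks to msb0 */
}

For the addsub variant, the combining add becomes a subtraction with a
long-lived borrow instead.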
(BTW, do I get the operand order right for mulx? I'm confused by the
docs, which use the generally different Intel conventions.)
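For reference, my reading of the manuals: in Intel syntax the first
operand of mulx is the high destination,

	mulx	rhi, rlo, src		; Intel: rhi:rlo = rdx * src

and AT&T syntax reverses the whole operand list,

	mulx	src, rlo, rhi		C AT&T: rhi:rlo = %rdx * src

which matches the usage in the loop above (high half in the last
operand).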
Note that in this form, I think we could allow full-limb inputs (%r8,
%r9), except that the final adc could then carry out, so we'd need to
return a 65-bit value.
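Spelling out the bounds: with msb0 multipliers u, v <= 2^63 - 1, each
high limb satisfies

	hi(u*a) <= floor ((2^63 - 1)(2^64 - 1) / 2^64) = 2^63 - 2

so the final adc computes at most (2^63 - 1) + (2^63 - 1) + 1 =
2^64 - 1 and cannot carry out. With full-limb u, v the high limbs can
reach 2^64 - 2 each; the in-loop adox zero, ahi still clears O, but the
final sum can reach 2^65 - 1, hence the 65-bit return value.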
For the addadd case, this could be simplified by adding ahi and bhi
together early (since there can be no overflow), eliminating a few of
the adox instructions.
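In C, that simplification looks roughly as follows (same hypothetical
interface as the model above); only one combined high word and one
carry survive each iteration:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb;

limb
addaddmul_1msb0_early (limb *rp, const limb *ap, const limb *bp,
		       size_t n, limb u, limb v)	/* u, v < 2^63 */
{
  limb hi = 0, cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      unsigned __int128 pa = (unsigned __int128) u * ap[i];
      unsigned __int128 pb = (unsigned __int128) v * bp[i];
      limb lo = (limb) pa + (limb) pb;
      limb c1 = lo < (limb) pa;		/* carry from the low halves */
      limb t = lo + hi;			/* add previous high word */
      limb r = t + cy;
      cy = (t < lo) | (r < t);
      rp[i] = r;
      /* combined high word for the next limb position; msb0 means
	 (pa >> 64) + (pb >> 64) + c1 <= 2^64 - 3, so no overflow */
      hi = (limb) (pa >> 64) + (limb) (pb >> 64) + c1;
    }
  return hi + cy;
}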
Now, the question is whether it can beat mul_1 + addmul_1. I don't know.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.