Please update addaddmul_1msb0.asm to support ABI in mingw64

Thu Oct 7 21:35:40 UTC 2021

  And will cause an interesting failure if one can ever afford enough RAM
  to use an input size larger than 2^63 limbs ;-)

Nobody in his right mind will ever need more than 2 EiB of memory.  :-)

  Attaching a version that actually passes some tests (I should commit the
  unit tests, but not today). The loop is only 2-way, and there are three
  spare registers:

  L(top):
  	mov	(ap, n, 8), %rdx
  	mulx	u0, l0, hi
  	mov	(bp, n, 8), %rdx
  	mulx	v0, l1, c1
  	adox	c0, l0
  	adox	hi, c1
  	adc	l0, l1
  	mov	l1, (rp, n, 8)

  	inc	n
  	mov	(ap, n, 8), %rdx
  	mulx	u0, l0, hi
  	mov	(bp, n, 8), %rdx
  	mulx	v0, l1, c0
  	adox	c1, l0
  	adox	hi, c0
  	adc	l0, l1
  	mov	l1, (rp, n, 8)
  	inc	n

  	jnz	L(top)

Neat!

  I've eliminated the zero adds by adding together the high half of the
  products earlier (thanks to the "msb0" condition, there's no carry out
  from the second adox in the pairs). I think the critical recurrence
  involves the alternating carry limbs c0, c1, and the OF flag which is
  live only between the adjacent adox instructions.
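
To make the carry bookkeeping described above concrete, here is a minimal Python sketch of the per-limb step. It is a model under stated assumptions, not the attached code: 64-bit limbs, OF clear on loop entry, and a return value equal to the last carry limb plus CF. The names addaddmul_1msb0_model and limbs_to_int are made up for illustration.

```python
LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1

def limbs_to_int(limbs):
    """Hypothetical helper: little-endian limb list -> Python integer."""
    return sum(x << (LIMB_BITS * i) for i, x in enumerate(limbs))

def addaddmul_1msb0_model(ap, bp, u0, v0):
    """Model of rp = ap*u0 + bp*v0, mirroring the loop's flag usage."""
    # The "msb0" condition: the top bit of both multipliers is clear.
    assert u0 >> 63 == 0 and v0 >> 63 == 0
    rp = []
    c = 0    # carry limb, alternating between c0 and c1 in the asm
    cf = 0   # CF, live across iterations (inc does not touch it)
    for a, b in zip(ap, bp):
        p = a * u0                      # mulx u0, l0, hi
        l0, hi = p & MASK, p >> LIMB_BITS
        q = b * v0                      # mulx v0, l1, c_next
        l1, c_next = q & MASK, q >> LIMB_BITS
        t = l0 + c                      # first adox; OF assumed clear before it
        l0, of = t & MASK, t >> LIMB_BITS
        t = c_next + hi + of            # second adox
        assert t <= MASK                # msb0 guarantees no carry out here
        c_next = t
        t = l1 + l0 + cf                # adc
        rp.append(t & MASK)             # mov l1, (rp, n, 8)
        cf = t >> LIMB_BITS
        c = c_next
    return rp, c + cf
```

Each high half is at most 2^63 - 2 when the multiplier is below 2^63, so the second addition stays below 2^64 even with OF set, which is why the inner assertion never fires in this model.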

  The 3 cycles/limb means 6 cycles for the two-way loop of 19
  instructions, or about 3.2 instructions per cycle. To reach the ideal of 2
  cycles/limb (critical path limit, one iteration of this 2-way loop in 4
  cycles), one would need to issue almost 5 instructions per cycle, and
  that's not very realistic, right?

4 non-regmov insns/cycle is the limit.  Regmovs are usually free.
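
Putting numbers on the figures above, as a rough sketch: it assumes each of the 19 instructions takes one issue slot (no macro-fusion of inc/jnz), and that none of them counts as a register move, since every mov in the quoted loop is a load or a store. The function names are made up for illustration.

```python
from fractions import Fraction

INSNS_PER_ITER = 19   # instructions in the quoted 2-way loop
LIMBS_PER_ITER = 2    # limbs handled per loop iteration

def insns_per_cycle(cycles_per_limb):
    """Issue rate needed to run the loop at the given cycles/limb."""
    return Fraction(INSNS_PER_ITER, LIMBS_PER_ITER * cycles_per_limb)

def issue_bound_cycles_per_limb(issue_width):
    """Best cycles/limb if only the issue width limits the loop."""
    return Fraction(INSNS_PER_ITER, issue_width * LIMBS_PER_ITER)

print(insns_per_cycle(3))               # rate at the measured 3 cycles/limb: 19/6
print(insns_per_cycle(2))               # rate needed for 2 cycles/limb: 19/4
print(issue_bound_cycles_per_limb(4))   # bound at 4 insns/cycle: 19/8
```

Under these assumptions the 4 insns/cycle limit puts this particular loop at no better than 19/8 cycles/limb, so it would have to lose three instructions per iteration to reach 2 cycles/limb.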

  A few instructions can be shaved off by going to k-way for larger k, and
  moving the invariant u0/v0 to %rdx, as you suggested. Is that the way to
  go?

Perhaps.

  This structure doesn't work as is for addsubmul_1msb0; one would need to
  either organize it differently, or negate on the fly (but adding 4 "not"
  instructions into the loop sounds rather expensive, if speed is
  already limited by the number of instructions).

For many recent CPUs, submul_1 is slow.  So the competition is not that
fierce.  :-)

  Ah, and one final question: Where should the mulx code go? I tried putting
  this in x86_64/mulx/adx/, but it wasn't picked up automagically by
  configure; I had to link it in manually.

It is messy, and I don't think it could be any other way.

You could place it in the x86_64/coreibwl directory.  Then all later
Intel CPUs will find it.  You then need to put an include_mpn file for
bd4 and later.

Two versions which I cobbled together are attached.  The first works for
all sizes, the 2nd only for even sizes.  The first one has false
dependencies, the second one fixes that.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: addaddmul_1msb0-mulx-nisse.asm
Type: application/octet-stream
Size: 2051 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20211007/24c660eb/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: addaddmul_1msb0-mulx-nisse-nofdep.asm
Type: application/octet-stream
Size: 1787 bytes
Desc: not available
URL: <https://gmplib.org/list-archives/gmp-devel/attachments/20211007/24c660eb/attachment-0001.obj>
-------------- next part --------------

-- 
Torbjörn
Please encrypt, key id 0xC8601622