div_qr_1 interface

Mon Oct 21 13:56:19 CEST 2013

Torbjorn Granlund <tg at gmplib.org> writes:

> On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
> slower than load+op.  (I understand the register shortage problem.)

The following (untested) variant needs one register too many.

  UP, QP, UN:           Load, store, loop counter.
  DINV, B2, B2md:       Loop invariant constants.
  U2, U1I, U0, Q1I, Q0: Inputs.
  U1O, Q1O:             Outputs.
  Q2, %rax, %rdx:       Locals.

The U1I -> U1O recurrence chain is also marked (with Opteron cycle counts):

	mov	U2, Q2
	mov	U2, Q1O
	neg	Q2
	mov	DINV, %rax
	and	DINV, Q1O
	mul	U1I
	add	Q0, Q1O
	adc	$0, Q2
	mov	%rax, Q0
	add	%rdx, Q1O
	adc	$0, Q2

	mov	B2, %rax
	and	B2, U2
	mul	U1I		C 0 6
	lea	(U0, B2md), U1O
	add	U0, U2
	cmovnc	U2, U1O
	adc	U1I, Q1O
	adc	Q1I, Q2
	mov	Q2, (QP, UN, 8)
	jc	L(incr)

L(incr_done):
	mov	(UP, UN, 8), U0
	add	%rax, U0	C 4
	adc	%rdx, U1O	C 5
	sbb	U2, U2		C 6

That is 25 instructions (27 K10 decoder slots), excluding loop overhead.
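
As a reading aid, here is a rough C model of the carry-mask idiom the
listing relies on, as I read it: sbb U2, U2 leaves 0 or all ones in U2,
the and instructions (and B2, U2 and and DINV, Q1O) then select either
the constant or zero, and neg Q2 turns the mask back into 0 or 1. The
function names are made up for illustration; this is not GMP code and
has no bearing on the cycle counts.

```c
#include <stdint.h>

/* Model of "sbb U2, U2": turn a carry bit (0 or 1) into a mask
   (0 or all ones). */
static uint64_t carry_to_mask(unsigned carry)
{
    return (uint64_t)0 - (uint64_t)(carry & 1);
}

/* Model of "and B2, U2" / "and DINV, Q1O": keep x when the mask is
   all ones, otherwise produce 0. */
static uint64_t mask_select(uint64_t mask, uint64_t x)
{
    return mask & x;
}

/* Model of "neg Q2" applied to a copy of the mask: recover the
   carry as 0 or 1. */
static uint64_t mask_to_bit(uint64_t mask)
{
    return (uint64_t)0 - mask;
}
```

The point of the idiom is that the conditional additions become
branch-free, at the cost of keeping the mask in a register.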

But one variable must be moved out of the registers and into memory.
B2md (used only once) is probably the best candidate. Then

	lea	(U0, B2md), U1O

would have to be replaced by

	mov	(%rsp), U1O	C Can be done very early
	...
	add	U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Alternatively, DINV could be the variable moved to memory,
if one thinks the quotient logic is less critical.
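
To check that spilling B2md leaves the U1O update unchanged, here is a
small C model of that step (and, lea, add, cmovnc) next to the spilled
form (mov from the stack, then add). It is only a sketch: the function
names are hypothetical, B2md is treated as an opaque constant, and
carry_out stands in for the carry flag left by add U0, U2. One thing
the model cannot show is that add, unlike lea, writes the flags, so
add U0, U1O has to stay ahead of add U0, U2, where the lea was.

```c
#include <stdint.h>

/* In-register form:
     and B2, U2 ; lea (U0, B2md), U1O ; add U0, U2 ; cmovnc U2, U1O
   mask is U2 on entry (0 or all ones). */
static uint64_t u1o_with_lea(uint64_t u0, uint64_t mask, uint64_t b2,
                             uint64_t b2md, unsigned *carry_out)
{
    uint64_t u1o = u0 + b2md;          /* lea: flags untouched */
    uint64_t u2 = u0 + (b2 & mask);    /* and, then add */
    *carry_out = u2 < u0;              /* carry out of the add */
    if (!*carry_out)
        u1o = u2;                      /* cmovnc */
    return u1o;
}

/* Spilled form: mov (%rsp), U1O ; ... ; add U0, U1O
   b2md_slot models the stack slot holding B2md. */
static uint64_t u1o_with_spill(uint64_t u0, uint64_t mask, uint64_t b2,
                               const uint64_t *b2md_slot,
                               unsigned *carry_out)
{
    uint64_t u1o = *b2md_slot;         /* mov: can be issued early */
    u1o += u0;                         /* add: flags overwritten below */
    uint64_t u2 = u0 + (b2 & mask);
    *carry_out = u2 < u0;
    if (!*carry_out)
        u1o = u2;
    return u1o;
}
```

Both functions return the same value and carry for the same inputs, so
the difference between the two variants is the extra load and
instruction, not the arithmetic.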

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.