More div_qr_2 assembler

Mon Apr 4 14:09:41 CEST 2011

nisse at lysator.liu.se (Niels Möller) writes:

  > I see.  When writing in assembly, we should probably provide
  > mpn_div_qr_2n as one entry point and mpn_div_qr_2n_pi1 as another, with
  > just a plain jmp to common code.

  An alternative is to have an assembly inversion function, analogous to
  the mod_1_*_cps functions.

Sure, this is a good candidate for assembly implementation.

  We have discussed earlier to have separate,
  preferably short, functions for _qr, _q and _r. And also the
  unnormalized functions would use an inverse of the same type, just with
  shifted input. So if we do dual entry points for all these functions,
  the inversion code will be duplicated in half a dozen places.

Such duplication would be unfortunate, but if it gives a significant
performance boost, then we should consider it nevertheless.

  I think the name mpn_invert_3by2 is reasonable, do you agree?

Sure.

  BTW, for assembly implementation, I wonder if one could have invert_limb
  have the remainder as a second return value, put in some other register
  than %rax. I think one may be able to save a multiply in
  mpn_invert_3by2 (assuming it calls invert_limb, rather than copies that
  code inline).

Are you suggesting that a separate assembly implementation of
mpn_invert_3by2 should call invert_limb?  Should we not put it inline?

Returning extra bits in a register might work, but we need to consult the
ABI (ELF).  It is possible that glue code is generated (at least for
shared libs) that could clobber some of the registers.
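
For reference, here is a small C sketch of where a multiply might be saved.
It is only one reading of the suggestion above, under these assumptions:
limbs are 32 bits so that plain uint64_t arithmetic suffices, the inverse of
a normalized d is v = floor((B^2-1)/d) - B with B = 2^32, and the 3by2
inversion begins by forming the low limb of d1*v.  The names invert_limb_rem
and low_product_matches are hypothetical, not existing GMP functions.  Since
(B+v)*d + r = B^2 - 1, we get d*v = -1 - r (mod B), i.e. the low limb of d*v
is just ~r.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GMP function), 32-bit limbs, B = 2^32.
   For normalized d (high bit set), returns v = floor((B^2-1)/d) - B
   and stores the remainder r = (B^2-1) mod d in *rem.  */
static uint32_t
invert_limb_rem (uint32_t d, uint32_t *rem)
{
  assert (d & 0x80000000u);             /* d must be normalized */
  uint64_t q = UINT64_MAX / d;          /* B <= q < 2B */
  *rem = (uint32_t) (UINT64_MAX % d);
  return (uint32_t) (q - ((uint64_t) 1 << 32));
}

/* Returns 1 if the low limb of d*v equals ~r, i.e. if the remainder
   could stand in for the multiply d*v mod B.  */
static int
low_product_matches (uint32_t d)
{
  uint32_t r;
  uint32_t v = invert_limb_rem (d, &r);
  return (uint32_t) (d * v) == (uint32_t) ~r;
}
```

If invert_limb handed back r in a second register, the caller could
presumably start from ~r with a single not instruction instead of a mul.
Whether that is safe across a call boundary is exactly the ABI question
above.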

-- 
Torbjörn