divrem_1 and mod_1 (Was: Re: gmp-devel list)

Wed Mar 6 11:13:15 CET 2013

Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:

  Finally to increase the technical content on this list Alexander Kruppa
  told me about the following preprint of Ernst Mayer: arxiv.org/abs/1303.0328.
  On a Core 2, he claims a mpn_mod_1 running at 6 cycles per limb, and a
  mpn_divrem_1 at 12.5 cycles per limb.

  On my computer GMP 5.1.1 runs at about 12 and 24 cycles per limb respectively:

  tarte% ./speed -s 1000,2000,5000,10000 -c mpn_mod_1.16031552068505892269 mpn_divrem_1.16031552068505892269
  overhead 6.13 cycles, precision 10000 units of 5.00e-10 secs, CPU freq 2000.00 MHz
          mpn_mod_1.16031552068505892269 mpn_divrem_1.16031552068505892269
  1000        #12682.00      24064.00
  2000        #25024.00      48187.00
  5000        #62084.00     119298.00
  10000      #123786.00     238280.00

  tarte% cat /proc/cpuinfo | grep "model name" | head -1
  model name      : Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz

I assume that what he is suggesting is what we do for mod_1 already do
(since GMP 5.0) and for divrem_1 have publicly proposed to do, and for
which we publish cycle numbers at https://gmplib.org/devel/asm.html.

You have chosen very specific mod_1 operands above, which triggers
slower behaviour.  Does Mayer claim to handle those faster than us, or
why else did you choose 16031552068505892269?

Our mod_1 on (the 5 generations old) Conroe CPU runs at 5 cycles/limb
with GMP since several years, except for operands > 2^62.

Our fastest divrem_1 (see asm.html) indeed runs at 12.5 c/l.  This is
code which we did not optimise for this very obsolete processor, but for
less obsolete processors...  The paper he refers ("Improved invariant
division") would describe what we do.

-- 
Torbjörn