divrem_1 and mod_1 (Was: Re: gmp-devel list)

Wed Mar 6 11:37:40 CET 2013

       Torbjörn,

> From: Torbjorn Granlund <tg at gmplib.org>
> Date: Wed, 06 Mar 2013 11:13:15 +0100
> 
> Zimmermann Paul <Paul.Zimmermann at loria.fr> writes:
> 
>   Finally to increase the technical content on this list Alexander Kruppa
>   told me about the following preprint of Ernst Mayer: arxiv.org/abs/1303.0328.
>   On a Core 2, he claims a mpn_mod_1 running at 6 cycles per limb, and a
>   mpn_divrem_1 at 12.5 cycles per limb.
>   
>   On my computer GMP 5.1.1 runs at about 12 and 24 cycles per limb respectively:
>   
>   tarte% ./speed -s 1000,2000,5000,10000 -c mpn_mod_1.16031552068505892269 mpn_divrem_1.16031552068505892269
>   overhead 6.13 cycles, precision 10000 units of 5.00e-10 secs, CPU freq 2000.00 MHz
>           mpn_mod_1.16031552068505892269 mpn_divrem_1.16031552068505892269
>   1000        #12682.00      24064.00
>   2000        #25024.00      48187.00
>   5000        #62084.00     119298.00
>   10000      #123786.00     238280.00
>   
>   tarte% cat /proc/cpuinfo | grep "model name" | head -1
>   model name      : Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
>   
> I assume that what he is suggesting is what we do for mod_1 already do
> (since GMP 5.0) and for divrem_1 have publicly proposed to do, and for
> which we publish cycle numbers at https://gmplib.org/devel/asm.html.

ok fine.

> You have chosen very specific mod_1 operands above, which triggers
> slower behaviour.  Does Mayer claim to handle those faster than us, or
> why else did you choose 16031552068505892269?

I didn't yet read the details of the paper, but the abstract says
"64-bit-word divisor" without any restriction.

I had first chosen a smaller divisor, and noticed mod_1 was faster (around
5 cycles/limb). Then I tried 2^64-1 and it was slower. Then I generated a
random odd divisor in [2^63, 2^64].

> Our mod_1 on (the 5 generations old) Conroe CPU runs at 5 cycles/limb
> with GMP since several years, except for operands > 2^62.

is divrem_1 also faster for divisors of 62 bits or less?

> Our fastest divrem_1 (see asm.html) indeed runs at 12.5 c/l.  This is
> code which we did not optimise for this very obsolete processor, but for
> less obsolete processors...  The paper he refers ("Improved invariant
> division") would describe what we do.

thank you. Good to see GMP is up-to-date with the state-of-art!

Paul

PS: I wonder why you write "https" above, anyway the certificate expired in
2010.