GCD via Stein's algoriithm

Sun Nov 6 22:38:50 UTC 2016

GMP folks,

Hi, I do work on the computational library ATLAS:
    http://math-atlas.sourceforge.net/

I mainly do floating point optimization, and so am pretty ignorant of 
integral computation, and am entirely innocent of number theory.

However, I found myself needing a relatively fast least common multiple 
for small integer numbers.  I did some websearches, and ya'll's 
documentation was about all I found before I gave up and just used 
Stein's Algorithm for GCD:
    https://en.wikipedia.org/wiki/Binary_GCD_algorithm

Anyway, I produced something that's about 50% faster than straight 
Stein's algorithm (optimized using some bitlevel ops, but no inline 
assembly) on my haswell-E and around 25% faster on my ARM64 a53 for 
small problems.  The main improvement over my earlier algorithm is 
probably using some inline assembly and a loop peel allowing me to skip 
some main-loop checks.

If I remember the doc of yours I read in my searches, you guys are using 
some modified version of this algorithm for your *small* GCD 
computations as well, is that correct?

If it is, I thought you might want to see how badly your implementation 
crushes mine.  I'm happy to donate the code with GPL or BSD license if 
you are interested, or if you don't already have the inline assembly, 
feel free to steal for your better algorithm.

Anyway, I feel sure you already have something a lot better than I could 
come up with in a day and half effort, but I thought I'd check since as 
far as I can tell your main emphasis is precision, not performance. 
Feel free to ignore if this is more hassle than it is worth.

I attach the atlas_iopt.h that defines ATL_iGCD as an inline function, 
as well as the rather embarrassingly primitive tester, that includes the 
more normal Stein's (that one was written with my colleague Antoine 
Petitet back in the 90s) I'm comparing against.

To get the inline assembly, you need to define ATL_GAS_x8664 or 
ATL_GAS_x8632 on x86, and ATL_GAS_ARM64 for any 64-bit ARM (my configure 
autodefines these guys: just define them to anything on compilation).

I'm not a member of this mailing list, so you'll have to send to me 
directly for me to see any replies.  If you want to suggest I RTFM about 
how to do this less stupidly, I might learn something, assuming the 
number theory doesn't defeat me at the beginning . . .

If this is not the right place to post, sorry.  Feel free to post to 
right place, or to not approve the submission.  Just sending this on the 
off chance it might be helpful, not because I need a reply.

Regards,
Clint
-- 
**********************************************************************
** R. Clint Whaley, PhD * Assoc Prof, LSU * www.csc.lsu.edu/~whaley **
*******************************************************************