Improvements for power64/mode64

Sun Mar 26 15:28:59 CEST 2006

On Mar 26, 2006, at 6:40 AM, Torbjorn Granlund wrote:

> Mark Rodenkirch <mgrogue at wi.rr.com> writes:
>
>   I would like to submit the following sources to replace
>   addmul_1.asm and submul_1.asm for the next release, whether 4.2
>   or a patch for 4.2.  These sources take full advantage of the
>   G5's pipeline.  I had integrated these into GMP 4.1.4 early in
>   2005 and have used them extensively with GMP-ECM since then.
>   With them I have found dozens of new factors.
>
> These contributions come too late for 4.2, however much tested
> they are.
>
> Does addmul_1 really run at 10 cycles/limb, as the comments in the
> file say?  Then it is no faster than the current, simpler code.
> Or did you not update the headers?  In that case, what are the
> cycle counts for your addmul and submul?

I never modified the headers.  I know the new code is faster because
gmpbench shows a 20% improvement with it.

Assuming I understand how to use speed correctly, I am getting
between 8.6 and 8.7 cycles per limb for both addmul and submul.
sqr_diagonal is between 7.8 and 7.9 cycles per limb.  If you have a
single speed command line that lets both of us know that I am
comparing apples to apples, that would be great.

BTW, I would like to know where the 10 cycles per limb for addmul and
submul came from.  Using the same parameters for speed, I get between
13.4 and 13.5 cycles per limb for the original versions.  The original
sqr_diagonal is between 8.0 and 8.1.  I think more improvements are
possible with sqr_diagonal.  I just haven't worked on them.
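
For anyone following the numbers above, here is a rough C sketch of the
operation that addmul_1 performs, which is what the cycles-per-limb
figures are timing.  It is not the GMP source or the submitted assembly.
The name ref_addmul_1 is made up for illustration, and it assumes 32-bit
limbs so that the double-width product fits in a uint64_t, whereas the
power64/mode64 code works on 64-bit limbs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not GMP's code): computes
   rp[i] += sp[i] * v over n limbs and returns the carry-out limb.
   Assumes 32-bit limbs purely to keep the sketch portable. */
uint32_t ref_addmul_1(uint32_t *rp, const uint32_t *sp, size_t n, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Worst case is (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1,
           so the 64-bit accumulator cannot overflow. */
        uint64_t t = (uint64_t)sp[i] * v + rp[i] + carry;
        rp[i] = (uint32_t)t;          /* low half goes back to rp */
        carry = (uint32_t)(t >> 32);  /* high half carries to the next limb */
    }
    return carry;
}
```

Each loop iteration handles one limb, so "cycles per limb" is the
average cost of one iteration.  submul_1 has the same shape with the
product subtracted and a borrow propagated instead of a carry.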

--Mark