binvert_limb speedup on 64 bit machines with UHWtype

Sun Feb 27 16:52:13 CET 2022

Ciao John Gartell,

Il 2022-02-25 17:06 John Gatrell ha scritto:
> Hi everyone. I'm new here so I don't know how to submit a new 
> gmp-impl.h

This list is the correct place for such a discussion.

> I tested using UHWtype in the macro for binvert_limb. On a 64 bit 
> machine
> one of my programs gained a 3% speedup. On a 32 bit machine, there was 
> no

Unfortunately, on different platforms, we have different speeds.
Should we use uint8_fast_t, uint16_fast_t, uint32_fast_t for the 
different levels, and let the compiler choose? :-D

Anyway, if you are curious, you can also try what happens with a 
table-lookup-free binvert_limb function. You can find it, named 
sec_binvert_limb, in the file mpn/generic/sec_powm.c .

There are also architectures where smaller operations are NOT faster, 
but, on the other side, more than one mul instruction can run 
concurrently. For those architectures I'd suggest the following, any 
couple of multiplications beyond the 32-bits limit can be computed in 
parallel.

#define binvert_limb(inv,n)                                             
\
   do {                                                                  
\
     mp_limb_t  __n = (n);                                               
\
     mp_limb_t  __inv;                                                   
\
     ASSERT ((__n & 1) == 1);                                            
\

\
     __inv = binvert_limb_table[(__n/2) & 0x7F]; /*  8 */                
\
     if (GMP_NUMB_BITS <= 32)                                            
\
       {                                                                 
\
         if (GMP_NUMB_BITS > 8)                                          
\
           __inv = 2 * __inv - __inv * __inv * __n;                      
\
         if (GMP_NUMB_BITS > 16)                                         
\
           __inv = 2 * __inv - __inv * __inv * __n;                      
\
       }                                                                 
\
     else                                                                
\
       {                                                                 
\
         mp_limb_t  __t, __t2, __p;                                      
\
         int  __invbits = 32;                                            
\
         __t = __n * __inv - 1;  /* x */                                 
\
         __t2= __t * __t;        /* x^2 */                               
\
         __p = __inv * (__t - __t2);     /* (1-x^2)x/(x+1) */            
\

\
         while (__invbits < GMP_NUMB_BITS - 8)                           
\
           {                                                             
\
             __p  *= __t2 + 1;   /* (1-x^{2^n})x/(x+1) */                
\
             __t2 *= __t2;       /* x^{2^n} */                           
\
             __invbits <<= 1;                                            
\
           }                                                             
\
         /* __inv * (x^{2^{n+2}+1}-1)/(x+1) */                           
\
         __inv -= __p * (__t2 + 1);                                      
\
       }                                                                 
\

\
     ASSERT ((__inv * __n & GMP_NUMB_MASK) == 1);                        
\
     (inv) = __inv & GMP_NUMB_MASK;                                      
\
   } while (0)

In any case, for 64-bits machines, all the described functions use 6 
multiplications :-) but with very different paths toward the desired 
result! :-D

Maybe you can try them on your platform... and tell us exactly which CPU 
are you using for your tests, and how do they behave.

Ĝis,
m