binvert_limb speedup on 64 bit machines with UHWtype

John Gatrell gatrelljm at gmail.com
Wed Mar 2 17:42:09 CET 2022


It looks like my suggestions for gmp-impl.h are unfortunately cpu-specific.
So I can no longer recommend changing it. The following notes may help with
cpu specific assembler routines.

In assembler, I suggest the following
1). Replace a divide by 2 and a mask of 0x7F, with load 8-bit register and
shift right
2). If the binvert_limb_table is not available, create a 4-bit inverse from
the formula for a 5-bit inverse, since the 5-bit inverse is simpler and
faster, namely inv5 = (3*n ^ 2) where '^' is the xor function.

Discussing the AMD A8-5500 only:-

I added options for binvert_limb to 'speed' program on cpu type AMD
A8-5500. Results:
overhead 5.81 cycles, precision 1410065408 units of 7.14e-10 secs, CPU freq
1400.00 MHz
         binvert_limb binvert_limb_htype binvert_limb_orig
binvert_limb_5bit binvert_limb_4bit
1               45.74       #0.8200        0.9592        0.9724
 0.9810

Fastest at 0.82 is
u64 __inv;
UHWtype __hinv = binvert_limb_table[(__n & 0xFF)/2];
__hinv *= 2 - __hinv * (UHWtype)__n; /* 16 */
__hinv *= 2 - __hinv * (UHWtype)__n; /* 32 */
__inv = __hinv * (2 - __hinv * __n); /* 64 */

Good is
u64 __inv;
UHWtype __hinv = binvert_limb_table[(__n & 0xFF)/2];
__hinv = 2 * __hinv - __hinv * __hinv * (UHWtype)__n; /* 16 */
__hinv = 2 * __hinv - __hinv * __hinv * (UHWtype)__n; /* 32 */
__inv = 2 * (mp_limb_t)__hinv - (mp_limb_t)__hinv * (mp_limb_t)__hinv *
__n; /* 64 */

In the absence of binvert_limb_table, it is quicker to calculate a 5-bit
inverse and pretend it is 4-bits
UHWtype __hinv = 3 * (UHWtype) __n ^ 2; /* caret is xor function. 5-bits */
__hinv = 2 * __hinv - __hinv * __hinv * (UHWtype)__n; /* 8 of 10 bits */

John Gatrell


More information about the gmp-devel mailing list