Why assembler version of addmul_1 is so fast?

Sat Feb 1 23:45:05 UTC 2020

Andy <borucki.andrzej at gmail.com> writes:

  version C: https://github.com/wbhart/mpir/blob/master/mpn/generic/addmul_1.c
  versus asm:
  https://github.com/wbhart/mpir/blob/master/mpn/x86_64w/haswell/addmul_1.asm
  In my computer assembler version is twice fast as best optimized C version,
  and my assembler trials. What is the riddle of speed?
  loop are partially expanded, but this is not enough. This code is specific
  to Haswell but how obtained this speedup?

The operation of addmul_1, which is the cornerstone of many GMP
opertions, is not well supported by most high-level languages.

The main problems are that the full integer product of two multiplied
integer variables is not accessible even if the underlying hardware can
provide the full product.  Most high-level languages only return the low
half of such products.

Another problem is that summing multi-word values is hard in high-level
languages, while many instruction sets provide special instructions for
that, making use of special hardware carry bit(s).

You measured a 2x speedup from the assembly code.  That means that you
had inline asm for the multiply, which GMP indeed uses (unless
configured with --disable-assembly).  Without inline asm, the speedup
from assembly code would be more like 10x.

Note that I'm not advocating the use of assembly code in general.  GMP
is a very special case.

-- 
Torbjörn
Please encrypt, key id 0xC8601622