Why is the assembler version of addmul_1 so fast?

Sat Feb 1 21:30:53 UTC 2020

C version: https://github.com/wbhart/mpir/blob/master/mpn/generic/addmul_1.c
versus the asm version:
https://github.com/wbhart/mpir/blob/master/mpn/x86_64w/haswell/addmul_1.asm
On my computer the assembler version is twice as fast as both the best
optimized C version and my own assembler attempts. What is the secret of
its speed? The loop is partially unrolled, but that alone is not enough.
The code is specific to Haswell, but how is this speedup obtained?
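
For reference, here is a minimal sketch of what addmul_1 computes, so it is clear which operation is being timed. It is not the MPIR code: the name addmul_1_sketch is made up for this post, and it assumes 64-bit limbs and a compiler that provides unsigned __int128 (GCC or Clang). It multiplies the n-limb number at up by the single limb v, adds the product to the n limbs at rp, and returns the carry-out limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version (not the MPIR source):
 * {rp, n} += {up, n} * v, returning the carry-out limb.
 * Assumes 64-bit limbs and unsigned __int128 support. */
uint64_t addmul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128-bit product plus the old limb and the incoming carry.
         * The sum cannot overflow 128 bits:
         * (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + carry;
        rp[i] = (uint64_t)t;          /* low limb goes back to the result */
        carry = (uint64_t)(t >> 64);  /* high limb carries into the next step */
    }
    return carry;
}
```

Each iteration depends on the carry from the previous one, so the loop is one multiply and two additions per limb chained through carry.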