The assembly subroutines in GMP are the most significant source of speed at small to moderate sizes. At larger sizes, algorithm selection becomes more important, but of course speedups in low-level routines will still speed up everything proportionally.
Carry handling and widening multiplies, both important for GMP, can’t be easily expressed in C. GCC asm blocks help a lot and are provided in longlong.h, but hand-coding low-level routines invariably offers a speedup over generic C, by a factor of anywhere from 2 to 10.
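
To illustrate why these operations are awkward in C, here is a minimal sketch of what generic C code for a multi-limb add and a limb-by-scalar multiply might look like. It is not GMP's code: `limb_t`, `sketch_add_n` and `sketch_mul_1` are hypothetical names chosen for this example, and the limb is assumed to be 32 bits so that `uint64_t` can hold the double-width product.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit limb type for this sketch (not GMP's mp_limb_t). */
typedef uint32_t limb_t;

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1]; returns the carry out (0 or 1).
   C has no carry flag, so each carry is recovered with a comparison. */
limb_t sketch_add_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t a = ap[i];
        limb_t s = a + bp[i];   /* wraps modulo 2^32 */
        limb_t c1 = s < a;      /* carry out of a + b */
        limb_t r = s + carry;
        limb_t c2 = r < s;      /* carry out of adding the incoming carry */
        rp[i] = r;
        carry = c1 | c2;        /* both cannot be set at the same time */
    }
    return carry;
}

/* rp[0..n-1] = low limbs of ap[0..n-1] * b; returns the most significant limb.
   The widening multiply is expressed by casting to a double-width type. */
limb_t sketch_mul_1(limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + (2^32-1) < 2^64, so this sum cannot overflow. */
        uint64_t p = (uint64_t) ap[i] * b + carry;
        rp[i] = (limb_t) p;
        carry = (limb_t) (p >> 32);
    }
    return carry;
}
```

The add needs two comparisons per limb where an add-with-carry instruction would do the whole step at once. The multiply relies on a double-width integer type, and standard C offers no such type once the limb itself is 64 bits wide, which is the gap that macros such as `umul_ppmm` and `add_ssaaaa` in longlong.h are there to fill.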