Performance of addlsh_n and sublsh_n

Thu Feb 3 21:32:16 CET 2011

I think I understand your intended algorithm now.

You want to add an up[] limb to the left-shifted vp[] part, and
propagate carry to the next higher right-shifted vp[] part.

Your algorithm makes a lot of sense.  It would not use more operations
than my current shl+shr+or code, but it would simplify cross-iteration
carry handling.
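
To make the comparison concrete, here is a rough C sketch of the two
structures as I read them.  It is not the assembly code, and it is not
meant as a definitive implementation.  The function names, the 64-bit
limb type and the restriction 1 <= cnt <= 63 are assumptions made for
illustration only.  Both functions compute {up,n} + ({vp,n} << cnt) and
return the limb that falls out at the top.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference functions; the names are NOT GMP symbols.
   Assumes 64-bit limbs and 1 <= cnt <= 63. */

/* Variant A: shl + shr + or builds each shifted limb, then an
   add-with-carry.  Two values cross iterations: prev_hi and cy. */
uint64_t addlsh_shift_or(uint64_t *rp, const uint64_t *up,
                         const uint64_t *vp, size_t n, unsigned cnt)
{
  assert(cnt >= 1 && cnt <= 63);
  uint64_t prev_hi = 0, cy = 0;
  for (size_t i = 0; i < n; i++) {
    uint64_t w = (vp[i] << cnt) | prev_hi;   /* shifted limb */
    prev_hi = vp[i] >> (64 - cnt);           /* bits for the next limb */
    uint64_t s = up[i] + w;
    uint64_t c = s < w;                      /* carry from up + w */
    uint64_t r = s + cy;
    c += r < s;                              /* carry from adding cy */
    rp[i] = r;
    cy = c;                                  /* at most 1 */
  }
  return prev_hi + cy;
}

/* Variant B: add up[i] to the left-shifted part and fold the carry
   into the right-shifted part, so only h crosses iterations.
   Invariant: h <= 2^cnt, hence the limb sum carries out at most 1. */
uint64_t addlsh_carry_fold(uint64_t *rp, const uint64_t *up,
                           const uint64_t *vp, size_t n, unsigned cnt)
{
  assert(cnt >= 1 && cnt <= 63);
  uint64_t h = 0;
  for (size_t i = 0; i < n; i++) {
    uint64_t lo = vp[i] << cnt;
    uint64_t hi = vp[i] >> (64 - cnt);
    uint64_t s = up[i] + lo;
    uint64_t c = s < lo;                     /* carry from up + lo */
    uint64_t r = s + h;
    c += r < s;                              /* carry from adding h */
    rp[i] = r;
    h = hi + c;                              /* carry folded into hi */
  }
  return h;
}
```

In variant B the high part plus carry never exceeds 2^cnt, so it always
fits in a limb and no separate carry has to be kept alive across the
loop boundary.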

Having said that, the best code for AMD K8-K10 will surely be mul-based
(as you so rightly also said).

In the meantime, I have loopmixed shrd-based code.  The numbers are good
for some Intel processors, but awful for AMD processors as well as Intel
Atom and VIA Nano.  Results (in cycles/limb):

dnl Core c/l
dnl PNR  2.9
dnl NHM  2.8
dnl SBR  2.7

These are not bad numbers.  (Only SBR might get an addmul_1 that
competes with this.)

-- 
Torbjörn