Sandybridge addmul_N challenge

Wed Feb 22 17:03:54 CET 2012

I doubt we can make addmul_1 run faster on sandybridge.

But I'd like mul_basecase to run much faster than 3 c/l.  Then
sqr_basecase and redc_1, redc_2 should be fixed.

An addmul_2 running at 3 c/l or better would be great.  That
means we need to handle a "tick" in it using <= 17 insns, probably
avoiding "op reg,mem" and "op mem,reg".  (If we use 18 insns, loop
handling will bring us over 3 c/l.)
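
For reference, here is a rough C model of the work an addmul_2 inner
loop does per limb: two multiplies feeding a two-limb running carry.
It is only a sketch.  The name addmul_2_ref, the limb_t/dlimb_t
typedefs and the calling convention (add into {rp,n}, store limb n,
return limb n+1) are assumptions made for this illustration, and it
relies on the GCC/Clang unsigned __int128 extension.  It says nothing
about insn counts or scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* Assumed convention for this sketch: add {up,n} * {vp,2} to {rp,n},
   store the low n+1 limbs of the result at rp[0..n] (rp[n] is
   overwritten) and return the most significant limb. */
limb_t addmul_2_ref(limb_t *rp, const limb_t *up, size_t n, const limb_t *vp)
{
    assert(n >= 1);
    limb_t v0 = vp[0], v1 = vp[1];
    limb_t c0 = 0, c1 = 0;          /* two-limb running carry */

    for (size_t i = 0; i < n; i++) {
        dlimb_t p0 = (dlimb_t)up[i] * v0;
        dlimb_t p1 = (dlimb_t)up[i] * v1;

        /* Weight B^i: old result limb + low(p0) + low carry limb. */
        dlimb_t t = (dlimb_t)rp[i] + (limb_t)p0 + c0;
        rp[i] = (limb_t)t;

        /* Weight B^(i+1): high(p0) + low(p1) + high carry limb + carry. */
        dlimb_t s = (dlimb_t)(limb_t)(p0 >> 64) + (limb_t)p1 + c1
                    + (limb_t)(t >> 64);
        c0 = (limb_t)s;

        /* Weight B^(i+2): high(p1) + carry; cannot overflow one limb. */
        c1 = (limb_t)(p1 >> 64) + (limb_t)(s >> 64);
    }
    rp[n] = c0;
    return c1;
}
```

Each iteration here has a carry dependency on the previous one, which
is the recurrence an assembly loop has to hide.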

For mul_basecase and sqr_basecase we could perhaps work vertically,
summing into 3 registers.  I.e., pretend we really multiply polynomials,
performing no (recurrent) carry propagation until we reach the bottom.
I haven't tried this, but I think this might be really promising for
Intel's last *two* main generations (Nehalem/Westmere and
Sandybridge/Ivybridge).

Perhaps we could get close to 2 c/l with this approach, unless
register shortage messes things up too badly.
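
To make the vertical idea concrete, here is a minimal C sketch of a
column-wise basecase multiply that sums each column into three limbs
and only emits a result limb when the column is done.  This is my
reading of the idea, not tested assembly.  The name
mul_basecase_vertical and the limb_t/dlimb_t typedefs are invented
for the illustration, and unsigned __int128 (GCC/Clang) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* {rp, un+vn} = {up, un} * {vp, vn}, computed one column at a time.
   rp must not overlap up or vp. */
void mul_basecase_vertical(limb_t *rp, const limb_t *up, size_t un,
                           const limb_t *vp, size_t vn)
{
    assert(un >= 1 && vn >= 1);
    limb_t a0 = 0, a1 = 0, a2 = 0;  /* three-limb column accumulator */

    for (size_t k = 0; k < un + vn - 1; k++) {
        /* Indices i with 0 <= i < un and 0 <= k-i < vn. */
        size_t lo = (k >= vn) ? k - vn + 1 : 0;
        size_t hi = (k < un) ? k : un - 1;

        for (size_t i = lo; i <= hi; i++) {
            dlimb_t p = (dlimb_t)up[i] * vp[k - i];
            /* Accumulate p into (a2,a1,a0): an add/adc/adc chain. */
            dlimb_t t = (dlimb_t)a0 + (limb_t)p;
            a0 = (limb_t)t;
            t = (dlimb_t)a1 + (limb_t)(p >> 64) + (limb_t)(t >> 64);
            a1 = (limb_t)t;
            a2 += (limb_t)(t >> 64);
        }

        /* Column finished: emit one limb, shift the accumulator down. */
        rp[k] = a0;
        a0 = a1;
        a1 = a2;
        a2 = 0;
    }
    rp[un + vn - 1] = a0;           /* top limb of the product */
}
```

The products within a column are independent of each other, so the
only serial dependency is the short accumulate chain.  The result is
written once per column rather than read and rewritten once per row.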

I haven't thought about doing redc_1/redc_2 using this approach.
Hensel lifting on-the-fly could be interesting...

-- 
Torbjörn