Sandybridge addmul_N challenge
tg at gmplib.org
Wed Feb 22 17:03:54 CET 2012
I doubt we can make addmul_1 run faster on sandybridge.
But I'd like mul_basecase to run much faster than 3 c/l. Then
sqr_basecase and redc_1, redc_2 should be fixed.
An addmul_2 running at 3 c/l or better would be great. That
means we need to handle a "tick" in it using <= 17 insns, probably
avoiding "op reg,mem" and "op mem,reg". (If we use 18 insns, loop
handling will bring us over 3 c/l.)
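
For reference, here is a portable C sketch of the arithmetic an addmul_2
loop has to perform per input limb. It is not the assembly under discussion,
and the name addmul_2_sketch, the limb_t typedef and the carry_out parameter
are assumptions made for illustration; GMP's real mpn_addmul_2 has its own
calling convention. It also assumes a compiler with unsigned __int128
(GCC or Clang on a 64-bit target).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* assumed 64-bit limb */

/* Hypothetical reference routine:
   {rp,n} += low n limbs of {up,n} * {vp,2};
   the two limbs that overflow past rp[n-1] go to carry_out[0..1]. */
static void addmul_2_sketch(limb_t *rp, const limb_t *up, size_t n,
                            const limb_t vp[2], limb_t carry_out[2])
{
    limb_t c0 = 0, c1 = 0;          /* pending carry at limb i and i+1 */

    for (size_t i = 0; i < n; i++) {
        /* Column i: up[i]*v0 + rp[i] + c0 cannot overflow 128 bits. */
        unsigned __int128 t = (unsigned __int128)up[i] * vp[0] + rp[i] + c0;
        rp[i] = (limb_t)t;

        /* Column i+1: up[i]*v1 + c1 + high half of t, also < 2^128. */
        unsigned __int128 u = (unsigned __int128)up[i] * vp[1]
                              + c1 + (limb_t)(t >> 64);
        c0 = (limb_t)u;
        c1 = (limb_t)(u >> 64);
    }
    carry_out[0] = c0;
    carry_out[1] = c1;
}
```

Each iteration does one load of up[i], one load and one store of rp[i],
two widening multiplies, and the carry chain linking them; that work plus
the loop overhead is what has to fit in the instruction budget above.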
For mul_basecase and sqr_basecase we could perhaps work vertically,
summing into 3 registers. I.e., pretend we really multiply polynomials,
performing no (recurrency) carry propagation until we reach the bottom.
I haven't tried this, but I think this might be really promising for
Intel's last *two* main generations (Nehalem/Westmere and
Perhaps we could get close to 2 c/l with this approach, unless
register shortage messes things up too badly.
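
To make the vertical idea concrete, here is a minimal C sketch of a
column-wise (product-scanning) basecase multiply that sums each column into
a 3-limb accumulator and emits one result limb per column. It illustrates
the technique only and says nothing about achievable c/l; the name
mul_basecase_vertical and the limb_t typedef are assumptions, and it relies
on unsigned __int128 (GCC or Clang on a 64-bit target).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* assumed 64-bit limb */

/* Hypothetical routine: {rp,un+vn} = {up,un} * {vp,vn}, column by column.
   Requires un >= 1, vn >= 1, and rp not overlapping up or vp. */
static void mul_basecase_vertical(limb_t *rp,
                                  const limb_t *up, size_t un,
                                  const limb_t *vp, size_t vn)
{
    limb_t acc0 = 0, acc1 = 0, acc2 = 0;   /* 3-limb column accumulator */

    for (size_t k = 0; k + 1 < un + vn; k++) {
        /* Indices i with 0 <= i < un and 0 <= k-i < vn. */
        size_t lo = (k >= vn) ? k - vn + 1 : 0;
        size_t hi = (k < un) ? k : un - 1;

        for (size_t i = lo; i <= hi; i++) {
            unsigned __int128 p = (unsigned __int128)up[i] * vp[k - i];

            /* Add the 2-limb product into the 3-limb accumulator. */
            unsigned __int128 s = (unsigned __int128)acc0 + (limb_t)p;
            acc0 = (limb_t)s;
            s = (unsigned __int128)acc1 + (limb_t)(p >> 64)
                + (limb_t)(s >> 64);
            acc1 = (limb_t)s;
            acc2 += (limb_t)(s >> 64);
        }

        /* Column done: emit its low limb and shift the accumulator down. */
        rp[k] = acc0;
        acc0 = acc1;
        acc1 = acc2;
        acc2 = 0;
    }
    rp[un + vn - 1] = acc0;         /* top limb of the product */
}
```

Note that the inner loop has no load or store of rp at all; the only
carry chain is the short one inside the three accumulator registers.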
I haven't thought about doing redc_1/redc_2 using this approach.
Hensel lifting on-the-fly could be interesting...