Sandybridge addmul_N challenge

Thu Feb 23 07:51:26 CET 2012

nisse at lysator.liu.se (Niels Möller) writes:

  The best I find is

          mov     (up, n, 8), %rax
          mul     v
          mov     %rdx, c1
          add     (rp, n, 8), %rax
          adc     $0, c1
          add     %rax, c0
          adc     $0, c1
          mov     c0, (rp, n, 8)

  Unrolled four times, that's 34 instructions. The best result from the
  loop mixer so far has been 3.24 cycles / limb. See
  shell:~nisse/hack/loopmix/lms/addmul_1-2.nlms, which is the same as
  your code, I guess.

My code came out at 3.16 cycles/limb in the loop mixer, but runs at 3.25
outside of it.  If you can make your smaller code actually run at 3.25,
that is an improvement.
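
For anyone reading along, here is a rough C rendering of the quoted
eight-instruction step, one limb per iteration.  It is only a sketch of
what the loop computes, not the code either of us is timing.  The name
ref_addmul_1 is made up for illustration, and it assumes a compiler with
unsigned __int128 (GCC or Clang on a 64-bit target).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: rp[0..n-1] += up[0..n-1] * v,
   returning the carry-out limb.  Comments name the quoted instruction
   that each C statement mirrors. */
uint64_t ref_addmul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t c0 = 0;                          /* carry limb entering the step */
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;   /* mul v */
        uint64_t lo = (uint64_t)p;            /* low half, %rax */
        uint64_t c1 = (uint64_t)(p >> 64);    /* mov %rdx, c1 */
        lo += rp[i];                          /* add (rp, n, 8), %rax */
        c1 += lo < rp[i];                     /* adc $0, c1 */
        c0 += lo;                             /* add %rax, c0 */
        c1 += c0 < lo;                        /* adc $0, c1 */
        rp[i] = c0;                           /* mov c0, (rp, n, 8) */
        c0 = c1;                              /* c0 and c1 swap roles */
    }
    return c0;
}
```

The final assignment costs nothing in an unrolled loop, since the two
carry registers simply alternate from one step to the next.  Four copies
of the eight instructions give 32, so the quoted count of 34 presumably
leaves two for loop control.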

I think we should focus not on addmul_1 but on mul_basecase,
sqr_basecase, redc_1, perhaps redc_2.  I.e., please focus on addmul_2
(or addmul_N, N > 2) or vertical multiplication primitives.
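
To make the addmul_2 idea concrete, here is a plain C sketch of one pass
that multiplies by two limbs at once.  The name sketch_addmul_2 and its
calling convention (add into rp[0..n-1], write the next limb to rp[n],
return the top limb) are assumptions chosen for illustration.  It again
assumes unsigned __int128 and says nothing about how the assembly should
be scheduled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: {rp,n} += {up,n} * (vp[0] + vp[1]*2^64).
   The limb above that is written to rp[n]; the top limb is returned. */
uint64_t sketch_addmul_2(uint64_t *rp, const uint64_t *up, size_t n,
                         const uint64_t vp[2])
{
    uint64_t c0 = 0, c1 = 0;                  /* two-limb running carry */
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p0 = (unsigned __int128)up[i] * vp[0];
        unsigned __int128 p1 = (unsigned __int128)up[i] * vp[1];

        /* Column i: rp[i] + lo(p0) + low carry limb. */
        unsigned __int128 t = (unsigned __int128)rp[i] + (uint64_t)p0 + c0;
        rp[i] = (uint64_t)t;

        /* Column i+1: carry out of t + hi(p0) + lo(p1) + high carry limb. */
        unsigned __int128 s = (t >> 64) + (uint64_t)(p0 >> 64)
                              + (uint64_t)p1 + c1;
        c0 = (uint64_t)s;

        /* Column i+2: carry out of s + hi(p1).  This cannot overflow,
           because the full carry is always below 2^128. */
        c1 = (uint64_t)(s >> 64) + (uint64_t)(p1 >> 64);
    }
    rp[n] = c0;
    return c1;
}
```

Each up[i] is loaded once and each rp[i] is read and written once per two
multiplier limbs, where two addmul_1 passes would touch rp twice.  That is
the memory-traffic argument for building mul_basecase and friends on
addmul_2 or wider.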

-- 
Torbjörn