div_qr_1n_pi1

Thu Jun 3 20:29:04 UTC 2021

Ciao,

Il 2021-06-03 12:40 Torbjörn Granlund ha scritto:
> If we dare use cmov (and its presumed side-channel leakage) we could
> probably shorten the critical path by a cycle.  The "sbb" and "and"
> would go away.

Using masks does not always give the fastest code. I tried the following 
variation on Niels' code, and, on my laptop with "g++-10 -O2 
-mtune=icelake-client -march=icelake-client", the resulting code is 
comparable (faster?) with the current asm.

*************** mpn_div_qr_1n_pi1 (mp_ptr qp, mp_srcptr
*** 245,266 ****
          * +      | q0|
          *   -+---+---+---+
          *    | q2| q1| q0|
          *    +---+---+---+
         */
!       umul_ppmm (p1, t, u1, dinv);
!       add_ssaaaa (q2, q1, -u2, u2 & dinv, CNST_LIMB(0), u1);
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), p1);
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), q0);
!       q0 = t;

         umul_ppmm (p1, p0, u1, B2);
-       ADDC_LIMB (cy, u0, u0, u2 & B2);
-       u0 -= (-cy) & d;

         /* Final q update */
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), cy);
         qp[j+1] = q1;
         MPN_INCR_U (qp+j+2, n-j-2, q2);

         add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);
       }
--- 245,264 ----
          * +      | q0|
          *   -+---+---+---+
          *    | q2| q1| q0|
          *    +---+---+---+
         */
!       ADDC_LIMB (q2, q1, q0, u1);
!       umul_ppmm (t, q0, u1, dinv);
!       ADDC_LIMB (cy, u0, u0, u2 ? B2 : 0);
!       u0 -= cy ? d : 0;
!       add_ssaaaa (q2, q1, q2, q1, -u2, u2 ? dinv : 0);

         umul_ppmm (p1, p0, u1, B2);

         /* Final q update */
!       add_ssaaaa (q2, q1, q2, q1, CNST_LIMB(0), t + cy);
         qp[j+1] = q1;
         MPN_INCR_U (qp+j+2, n-j-2, q2);

         add_mssaaaa (u2, u1, u0, u0, up[j], p1, p0);
       }

$ build/tune/speed -p10000000 -s1-100 -f1.6 -C 
mpn_div_qr_1n_pi1.9999999999999999999 ...
                ASM-code                C-code
1              2.1227	1              3.6125
2              3.1758	2              3.9425
3              3.4567	3              3.8861
4              3.4758	4              3.8606
6              3.7857	6              3.8764
9              3.9912	9              3.9676
14             4.0304	14             4.0531
22             4.3461	22             4.1798
35             4.4161	35             4.2080
56             4.4744	56             4.2833
89             4.4896	89             4.2950

> (I am a bit fixated with side-channel leakage; our present
> implementations of these particular functions are not side-channel
> silent.)

We should write a fast version, and then a sec_ one :-)

Ĝis,
m