Alternative div_qr_1

Sat Jun 19 16:08:31 CEST 2010

nisse at lysator.liu.se (Niels Möller) writes:

> seems to be compiled to a branch rather than a cmov by gcc-4.3.2. Maybe
> gcc-4.4.4 or gcc-4.5.0 is smarter.

I've now tested it on K7, using gcc-4.4.4.

Old C code:

  $ ./speed -c -s 2-10 mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  overhead 6.06 cycles, precision 10000 units of 7.15e-10 secs, CPU freq 1398.94 MHz
          mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  2              158.30       #147.70
  3              168.33       #161.66
  4              185.58       #174.25
  5              197.98       #186.25
  6              210.61       #198.02
  7              228.48       #210.13
  8              239.98       #222.27
  9              252.36       #234.64
  10             264.12       #246.82

  $ ./speed -C -s 1500 mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  overhead 6.07 cycles, precision 10000 units of 7.15e-10 secs, CPU freq 1398.94 MHz
          mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  1500         #12.1487       12.1553

New C code:

  $ ./speed -c -s 2-10 mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  overhead 6.05 cycles, precision 10000 units of 7.15e-10 secs, CPU freq 1398.94 MHz
          mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  2             #113.06        131.28
  3             #124.21        143.37
  4             #134.32        153.46
  5             #148.41        163.58
  6             #156.46        176.68
  7             #165.58        185.76
  8             #177.02        212.00
  9             #198.96        219.04
  10            #211.00        228.17

  $ ./speed -C -s 1500 mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  overhead 6.05 cycles, precision 10000 units of 7.15e-10 secs, CPU freq 1398.94 MHz
          mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  1500          10.3747       #9.5360

I have no explanation for the difference in cycles per limb between the
normalized and the unnormalized divisor (10.37 vs. 9.54 at size 1500).

For comparison, current K7 assembler code:

  $ ./speed -c -s 2-10 mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  overhead 6.06 cycles, precision 10000 units of 7.15e-10 secs, CPU freq 1398.94 MHz
          mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  2              170.84       #136.43
  3              179.93       #145.55
  4              189.04       #153.64
  5              196.11       #161.71
  6              205.40       #169.89
  7              213.45       #178.16
  8              221.65       #186.05
  9              229.71       #194.16
  10             243.76       #197.49

  $ ./speed -C -s 1500 mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  overhead 6.06 cycles, precision 10000 units of 7.15e-10 secs, CPU freq 1398.94 MHz
          mpn_mod_1_1.0xabcdef01 mpn_mod_1_1.0xabcdef1
  1500           8.1747       #8.1527

I don't understand why the normalized case seems to have more expensive
precomputation (but I haven't looked at the code).
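One rough way to separate a fixed per-call cost (which would include any
precomputation) from the per-limb cost is to fit a straight line through two
of the small-size measurements. The following is only a sketch under that
linear-cost assumption; the helper fit_two_points and the struct cost_model
are hypothetical names for illustration, not part of GMP or of the speed
program.

```c
#include <assert.h>

/* Hypothetical helper for illustration; not part of GMP or tune/speed. */
struct cost_model {
  double per_limb;  /* estimated cycles per additional limb */
  double fixed;     /* estimated cycles that do not depend on n */
};

/* Fit cycles(n) = fixed + per_limb * n exactly through two measured
   points (n0, c0) and (n1, c1).  Assumes the cost is linear in n. */
struct cost_model
fit_two_points (int n0, double c0, int n1, double c1)
{
  struct cost_model m;

  assert (n0 != n1);  /* two distinct sizes are needed for a slope */
  m.per_limb = (c1 - c0) / (double) (n1 - n0);
  m.fixed = c0 - m.per_limb * (double) n0;
  return m;
}
```

Applied to the K7 assembler timings at sizes 2 and 10,
fit_two_points (2, 170.84, 10, 243.76) gives roughly 152.6 fixed cycles and
9.1 cycles per limb for the normalized divisor, while
fit_two_points (2, 136.43, 10, 197.49) gives roughly 121.2 fixed cycles and
7.6 cycles per limb for the unnormalized one. The measurements are not
perfectly linear, so these are only estimates, but they suggest a fixed-cost
gap of about 31 cycles.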

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.