> SIZE  | speed-p2 | speed-at
> 500   | 3.80     | 2.81

Is that right?  I'd wanted to get the p6 times down, but never found
anything better than about 3.5, which is no great improvement on
mpn/x86/aors_n.asm at about 3.7.

> It seems that GMP for Athlon is faster in low and huge precision (due
> to overhead and cache, I think).

That's possible.

A good way to compare code is to use tune/ to make a speed
program with both routines in it.  See the comments in that script.
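
As a rough illustration of the side-by-side idea, here is a minimal
standalone sketch in C.  It is not the tune/ speed program and not GMP
code: limb_t, add_n_a, add_n_b and ns_per_limb are made-up names, and the
two routines are just a plain loop and a 2-way unrolled loop standing in
for whatever pair of implementations is being compared.  The timer uses
clock() and keeps the best of a few trials, which is far cruder than
cycle counting.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t limb_t;   /* hypothetical limb type, not GMP's mp_limb_t */
typedef limb_t (*add_fn)(limb_t *, const limb_t *, const limb_t *, size_t);

/* Candidate A: one limb per iteration, carry detected by comparison. */
limb_t add_n_a(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s = ap[i] + bp[i];
        limb_t r = s + carry;
        carry = (limb_t)(s < ap[i]) | (limb_t)(r < s);
        rp[i] = r;
    }
    return carry;
}

/* Candidate B: same arithmetic, unrolled two limbs per iteration. */
limb_t add_n_b(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
    limb_t carry = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        limb_t s0 = ap[i] + bp[i];
        limb_t r0 = s0 + carry;
        carry = (limb_t)(s0 < ap[i]) | (limb_t)(r0 < s0);
        limb_t s1 = ap[i + 1] + bp[i + 1];
        limb_t r1 = s1 + carry;
        carry = (limb_t)(s1 < ap[i + 1]) | (limb_t)(r1 < s1);
        rp[i] = r0;
        rp[i + 1] = r1;
    }
    if (i < n) {               /* odd limb left over */
        limb_t s = ap[i] + bp[i];
        limb_t r = s + carry;
        carry = (limb_t)(s < ap[i]) | (limb_t)(r < s);
        rp[i] = r;
    }
    return carry;
}

/* Time one routine: best of 5 trials, returned as nanoseconds per limb. */
double ns_per_limb(add_fn f, limb_t *rp, const limb_t *ap,
                   const limb_t *bp, size_t n, int reps)
{
    volatile limb_t sink = 0;  /* keeps the calls from being optimised away */
    double best = -1.0;
    for (int trial = 0; trial < 5; trial++) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            sink += f(rp, ap, bp, n);
        clock_t t1 = clock();
        double ns = 1e9 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
        if (best < 0.0 || ns < best)
            best = ns;
    }
    (void)sink;
    return best / ((double)reps * (double)n);
}
```

Calling ns_per_limb once with add_n_a and once with add_n_b, on the same
operands and the same size, gives two figures measured under identical
conditions, which is the point of putting both routines in one program.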

