Possible new T3-T5 mul_1

Tue Apr 2 05:32:04 CEST 2013

For the most critical functions, i.e., mul_1, addmul_1, submul_1, mul_2,
and addmul_2, we should not stick to 2-way unrolling.

I played with a 4-way unrolled mul_1, but not using your multi-pointer
trick, meaning that we will spend two cycles instead of one cycle for
bookkeeping.
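
To make the unrolling idea concrete, here is a minimal C sketch of a
4-way unrolled mul_1.  It is not the attached sparc64 assembly and not
GMP's actual code: the names limb_t, mul_step and mul_1_unroll4 are
hypothetical, 64-bit limbs are assumed, and the 64x64->128 multiply
relies on the unsigned __int128 extension of GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;        /* hypothetical stand-in for mp_limb_t */

/* One limb step: *rp = low(u * v + cy); returns high(u * v + cy).
   The sum cannot overflow 128 bits: (B-1)^2 + (B-1) < B^2 for B = 2^64. */
static inline limb_t mul_step(limb_t *rp, limb_t u, limb_t v, limb_t cy)
{
    unsigned __int128 p = (unsigned __int128)u * v + cy;
    *rp = (limb_t)p;
    return (limb_t)(p >> 64);
}

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb.  Requires n >= 1. */
limb_t mul_1_unroll4(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    size_t i = 0;

    assert(n >= 1);

    /* Peel off the n mod 4 leftover limbs so the main loop sees a multiple of 4. */
    for (size_t r = n & 3; r > 0; r--, i++)
        cy = mul_step(&rp[i], up[i], v, cy);

    /* Main loop: four limbs per iteration, one index update and one branch. */
    for (; i < n; i += 4) {
        cy = mul_step(&rp[i],     up[i],     v, cy);
        cy = mul_step(&rp[i + 1], up[i + 1], v, cy);
        cy = mul_step(&rp[i + 2], up[i + 2], v, cy);
        cy = mul_step(&rp[i + 3], up[i + 3], v, cy);
    }
    return cy;
}
```

The loop bookkeeping is paid once per four limbs here instead of once
per two, which is the whole point of the wider unrolling; the peeled
loop in front is the kind of extra overhead that can hurt small counts.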

Our current code is stated as running at 3.8 c/l.  If this new code runs
ideally, it will reach 3 c/l.  We need to watch out for small-count
speed too, since this code will surely add some overhead compared to
the old code.

If you could please run something like

  tune/speed -p10000000 -s1-1000 -f1.1 -C mpn_mul_1.3

for both the old code and this code, that would be great!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sparc64-mul_1.asm
Type: application/octet-stream
Size: 3616 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130402/2ca2b0fa/attachment.obj>
-------------- next part --------------

The addmul_2 and mul_2 code takes several cycles more per iteration than
ideal.  I'll try to schedule mulx and umulxhi further from their
consumers, since that scheduling issue is the prime suspect for the
stalls.

In the end, I don't think addmul_2 should be the mul_basecase workhorse
for T4/T5.  We should go to at least addmul_4, perhaps as far as
addmul_8.  We can generate such beasts with a small program; I have such
code already for other architectures.
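
For readers unfamiliar with the addmul_k family, the following C sketch
shows what an addmul_2 computes and why a wider k helps: each rp limb
is loaded and stored once while two products are folded into it.  This
is an illustration only, not the sparc64 code and not the generator
mentioned above.  The name addmul_2_sketch and its contract (add into
rp[0..n-1], overwrite rp[n], return the top limb) are assumptions made
for this example, as are 64-bit limbs and unsigned __int128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;             /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;   /* GCC/Clang extension, assumed available */

/* Assumed contract: {rp,n} += {up,n} * {vp,2}.  The low n limbs are added
   into rp[0..n-1], limb n of the result is written to rp[n], and the most
   significant limb is returned.  Requires n >= 1. */
limb_t addmul_2_sketch(limb_t *rp, const limb_t *up, size_t n, const limb_t vp[2])
{
    dlimb_t acc = 0;   /* pending sum for columns i and i+1 */
    limb_t top = 0;    /* overflow out of acc, belongs to column i+2 */

    assert(n >= 1);

    for (size_t i = 0; i < n; i++) {
        dlimb_t t;

        /* Column i: old rp[i] plus up[i] * vp[0]. */
        t = acc + rp[i];
        top += (t < acc);
        acc = t;
        t = acc + (dlimb_t)up[i] * vp[0];
        top += (t < acc);
        acc = t;

        /* Column i is complete: store it and shift the accumulator down. */
        rp[i] = (limb_t)acc;
        acc = (acc >> 64) | ((dlimb_t)top << 64);
        top = 0;

        /* up[i] * vp[1] belongs to column i+1, now the low limb of acc. */
        t = acc + (dlimb_t)up[i] * vp[1];
        top += (t < acc);
        acc = t;
    }

    /* The full result fits in n+2 limbs, so top is zero at this point. */
    rp[n] = (limb_t)acc;
    return (limb_t)(acc >> 64);
}
```

An addmul_4 or addmul_8 would follow the same pattern with more
products per rp limb, which is why generating the code from a small
program is attractive.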

An ideal addmul_8 would run at 2.25 c/l, I think.  Damned be the zany
mpmul instruction!

-- 
Torbjörn