Better tabselect

Fri Apr 12 10:04:35 CEST 2013

David Miller <davem at davemloft.net> writes:

  The existing C code approaches 6 cycles/limb on T4, the best I can do
  without pipelining with this new approach at 4 way unrolling is ~4.5
  cycles/limb:

This function needs to leak no side-channel information about the actual
operand contents.  Conditional execution is a no-no.  Instead, we need to
form the mask using arithmetic operations.

The critical information not to be leaked for tabselect is which vector
is chosen, i.e., the value of the `which' parameter.
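
As an illustration of the requirement above, here is a minimal C sketch of
a table select that touches every entry and forms its mask by arithmetic
only.  The names tabselect_sketch and eq_mask, and the use of uint64_t
limbs, are assumptions made for this example; this is not the GMP code.
Note that a C compiler gives no guarantee about branches in the generated
code, which is one reason to write the function in assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: all ones when a == b, zero otherwise.
   Uses only xor, negate, or, shift and subtract; no comparison. */
static uint64_t eq_mask(uint64_t a, uint64_t b)
{
    uint64_t d = a ^ b;                 /* zero iff a == b */
    uint64_t nz = (d | (0 - d)) >> 63;  /* 1 iff d != 0 */
    return nz - 1;                      /* 0 - 1 wraps to all ones iff equal */
}

/* Hypothetical sketch: copy entry `which' (n limbs) out of a table of
   nents entries into rp.  Every entry is read, whatever `which' is. */
void tabselect_sketch(uint64_t *rp, const uint64_t *tab,
                      size_t n, size_t nents, size_t which)
{
    size_t i, k;

    for (i = 0; i < n; i++)
        rp[i] = 0;

    for (k = 0; k < nents; k++) {
        uint64_t mask = eq_mask((uint64_t) k, (uint64_t) which);
        const uint64_t *tp = tab + k * n;
        for (i = 0; i < n; i++)
            rp[i] |= tp[i] & mask;      /* contributes only when k == which */
    }
}
```

The memory access pattern and the instruction sequence are the same for
every value of `which'; only the mask differs.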

We need to use something like "sub x,y,z, negcc z, subc %g0,%g0,%mask"
rather than your mask creation code.  Just like my x86-sse code,
this will not allow table widthes (strange plural, does it exist?)
greater than 2^32, but that's OK.
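
The three-instruction sequence above can be modelled in C roughly as
follows.  This is a sketch under the assumption that subc consumes the
32-bit (icc) carry, which would explain the 2^32 limit; the function name
neq_mask_model is made up for this example.  The model yields all ones
when x != y and zero when they are equal; which polarity the selection
loop wants is not stated here.

```c
#include <stdint.h>

/* Hypothetical model of:  sub x,y,z ; negcc z ; subc %g0,%g0,%mask
   Assumes the carry consumed by subc is the 32-bit one. */
static uint64_t neq_mask_model(uint64_t x, uint64_t y)
{
    /* sub: only the low 32 bits of the difference affect the carry */
    uint32_t z = (uint32_t) (x - y);

    /* negcc: computing 0 - z borrows exactly when z != 0 */
    uint32_t borrow = (z | (uint32_t) (0u - z)) >> 31;

    /* subc: 0 - 0 - borrow, i.e. all ones or zero */
    return (uint64_t) 0 - borrow;
}
```

With this model, neq_mask_model(x, y) wrongly reports "equal" when x and
y differ by a multiple of 2^32, which is harmless as long as the table has
fewer than 2^32 entries.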

I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
per limb on T4, BTW.  On US1-2 it might run at 7/4 c/l and on US3-4 it
again probably runs at 9/4.  (Both these CPUs would benefit from sparser
ldx scheduling, of course.  I don't think it is worth the effort, at
least not at this point.)

I am quite ignorant about VIS, but doesn't that allow 128-bit
operations?

Here is a tabselect test program:
Name: test-tabselect.c
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130412/3d3d42c0/attachment.obj>

-- 
Torbjörn