Better tabselect
Torbjorn Granlund
tg at gmplib.org
Fri Apr 12 10:04:35 CEST 2013
David Miller <davem at davemloft.net> writes:
  The existing C code approaches 6 cycles/limb on T4, the best I can do
  without pipelining with this new approach at 4 way unrolling is ~4.5
  cycles/limb:

This function must leak no side-channel information about actual operand
contents. Conditional execution is a no-no. Instead, we need to form
the mask using arithmetic operations.
The critical information not to be leaked for tabselect is which vector
is chosen, i.e., the value of the `which' parameter.
We need to use something like "sub x,y,z, negcc z, subc %g0,%g0,%mask"
rather than your mask creation code. Just like my x86-sse code,
this will not allow table widthes (strange plural, does it exist?)
greater than 2^32, but that's OK.
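
To illustrate the idea of forming the mask arithmetically, here is a
minimal sketch in portable C, not the SPARC sequence above and not GMP's
actual code. The names eq_mask and tabselect_sketch, the flat table
layout (nents vectors of n limbs each) and the 64-bit limb type are
assumptions made for the example. Every table entry is read and masked,
so neither the memory access pattern nor the control flow depends on
`which'.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: all-ones if a == b, zero otherwise, no branches.
   Inputs are 32 bits wide, so d below is always less than 2^32. */
static uint64_t eq_mask(uint32_t a, uint32_t b)
{
    uint64_t d = (uint64_t)a ^ b;          /* zero iff a == b */
    /* d - 1 wraps around and sets bit 63 only when d == 0. */
    return (uint64_t)0 - ((d - 1) >> 63);
}

/* Hypothetical sketch: copy vector number `which` (n limbs) out of a
   table holding nents vectors, touching every entry on the way. */
void tabselect_sketch(uint64_t *rp, const uint64_t *tab,
                      size_t n, uint32_t nents, uint32_t which)
{
    for (size_t i = 0; i < n; i++)
        rp[i] = 0;

    for (uint32_t k = 0; k < nents; k++) {
        uint64_t mask = eq_mask(k, which);  /* all-ones only for k == which */
        for (size_t i = 0; i < n; i++)
            rp[i] |= tab[(size_t)k * n + i] & mask;
    }
}
```

A C compiler is free to turn such code into something with branches, so
the generated assembly would need to be inspected; that is one reason to
write this function in assembly in the first place.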
I am quite sure your code runs in the neighbourhood of 9/4 = 2.25 cycles
per limb on T4, BTW. On US1-2 it might run at 7/4 c/l and on US3-4 it
again probably runs at 9/4. (Both these CPUs would benefit from sparser
ldx scheduling, of course. I don't think it is worth the effort, at
least not at this point.)
I am quite ignorant about VIS, but doesn't that allow 128-bit
operations?
Here is a tabselect test program:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-tabselect.c
Type: application/octet-stream
Size: 2495 bytes
Desc: not available
URL: <http://gmplib.org/list-archives/gmp-devel/attachments/20130412/3d3d42c0/attachment.obj>
-------------- next part --------------
--
Torbjörn