PS: mpn_sqrtrem1

Tue Dec 20 21:02:36 UTC 2016

Hi,

On Tue, 20 December 2016 7:09 pm, Adrien Prost-Boucle wrote:
> I'm not sure using a table of invroot*invroot would bring speedup.

> But using a table involves adding a table to the binary + doing a memory
> access.

There is a table already. My proposal is: using a single table, with
fatter values:

static const uint32_t invsqrt8___ [] =
 {(511*511<<10)|511, (509*509<<10)|509, ...};

...

 uint64_t invroot = invsqrt8___[(vsh >> 55) - 128];
 uint64_t squaredinvroot = invroot >> 10;
 invroot &= 0x3ff;  /* the low 10 bits hold the root itself */
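
A minimal sketch of the packing idea, to check that the fields do not overlap. It assumes the 9-bit approximation x0 lies in 256..511 and that vsh is normalized so that (vsh >> 55) falls in 128..511. The names pack_entry, unpack_entry, table_index and roundtrip_ok are hypothetical, not GMP identifiers, and the values are computed on the fly, not taken from the real table. Since the square sits above bit 10, the mask for the root must cover the low 10 bits (0x3ff).

```c
#include <stdint.h>

/* Hypothetical names for illustration; none of these are GMP identifiers. */
#define ROOT_SHIFT 10
#define ROOT_MASK  ((UINT32_C(1) << ROOT_SHIFT) - 1)   /* 0x3ff */

/* Pack a 9-bit approximation x0 (256..511) together with its square.
   511*511 < 2^18, so the packed value needs at most 28 bits. */
static uint32_t pack_entry(uint32_t x0)
{
    return ((x0 * x0) << ROOT_SHIFT) | x0;
}

/* Table index from a normalized 64-bit operand
   (assumption: at least one of the top two bits of vsh is set). */
static unsigned table_index(uint64_t vsh)
{
    return (unsigned)(vsh >> 55) - 128;   /* 0..383 */
}

/* Split one packed entry back into the root and its square. */
static void unpack_entry(uint32_t e, uint64_t *invroot,
                         uint64_t *squaredinvroot)
{
    *squaredinvroot = e >> ROOT_SHIFT;
    *invroot = e & ROOT_MASK;
}

/* Returns 1 if every candidate x0 survives a pack/unpack round trip. */
static int roundtrip_ok(void)
{
    for (uint32_t x0 = 256; x0 <= 511; x0++) {
        uint64_t r, s;
        unpack_entry(pack_entry(x0), &r, &s);
        if (r != x0 || s != (uint64_t)x0 * x0)
            return 0;
    }
    return 1;
}
```

With 384 entries of 4 bytes each, such a table would take 1536 bytes; whether the saved multiplication pays for the larger table is exactly what would need testing.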

> Maybe worth a test?

Maybe.

> On Tue, 2016-12-20 at 15:50 +0100, Torbjörn Granlund wrote:
>> Surely possible.  The cost would be at least 768 bytes.
>>
>> The x0 value which is squared is 9 bits, with the msb predictably 1.
>> With some extra instructions to handle that msb, 384 16-bit entries
>> would work.

Only 16 bits... clever! But I fear we cannot load the 3 bytes at once...
Not without wasting an extra byte, anyway... or using it for another
interlaced table... no, too tricky.

Regards,
m

-- 
http://bodrato.it/