386 optimized bitblit code

Wed Feb 11 08:11:21 CET 2004

Brian Hurt <bhurt at spnz.org> writes:
>
> The 128-bit xmm registers
> were introduced with SSE-1, and so long as they aren't 2x as slow as the
> same mmx instructions, should provide some further speedup.

They're 2x as slow on Pentium 4, in that with mmx you can issue a new
instruction after 1 cycle, but the xmm forms can only issue another
after 2 cycles.  :-(

So you get twice the data width but half the issue rate (128 bits
every 2 cycles versus 64 bits every cycle), meaning basically no gain,
as far as I can tell.  I guess it frees up some decode bandwidth if
you're doing integer stuff at the same time though.

> 3) lshift and rshift, since they determine both the direction of the shift 
> and the direction of the copy, fail on (some) overlapping source and 
> destinations.

They're designed to support the sort of movements mul_2exp and
div_2exp make, i.e. a shift by some number of bits combined with a
move by whole limbs, upwards for mul_2exp and downwards for div_2exp.

> I dislike routines that fail when it's not absolutely 
> necessary for them to do so.  lshift and rshift together are approximately 
> the complexity of bitblit by itself.  But by separating them you've lost 
> flexibility.

In current uses, you know which direction you're moving and can go
straight to the correct direction loop (upwards or downwards).
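
To illustrate the two direction loops, here is a minimal sketch in plain
C.  It is not the GMP source: limb_t, LIMB_BITS, sketch_lshift and
sketch_rshift are made-up names for illustration, and a 64-bit limb is
assumed.  The left shift walks from the high limb down, so it tolerates
dst at or above src; the right shift walks from the low limb up, so it
tolerates dst at or below src.  Both assume n >= 1 and
1 <= cnt < LIMB_BITS.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;      /* hypothetical stand-in for a limb type */
#define LIMB_BITS 64          /* assumed limb width */

/* Shift {src,n} left by cnt bits into {dst,n}; return the bits pushed
   out of the top limb.  Runs from the high limb downwards, so every
   limb is read before it can be overwritten when dst >= src. */
limb_t sketch_lshift(limb_t *dst, const limb_t *src, size_t n, unsigned cnt)
{
    unsigned tnc = LIMB_BITS - cnt;
    limb_t high = src[n - 1];
    limb_t ret = high >> tnc;
    for (size_t i = n - 1; i > 0; i--) {
        limb_t low = src[i - 1];
        dst[i] = (high << cnt) | (low >> tnc);
        high = low;
    }
    dst[0] = high << cnt;
    return ret;
}

/* Shift {src,n} right by cnt bits into {dst,n}; return the bits pushed
   out of the bottom limb, placed in the high end of the return value.
   Runs from the low limb upwards, so it is safe when dst <= src. */
limb_t sketch_rshift(limb_t *dst, const limb_t *src, size_t n, unsigned cnt)
{
    unsigned tnc = LIMB_BITS - cnt;
    limb_t low = src[0];
    limb_t ret = low << tnc;
    for (size_t i = 0; i + 1 < n; i++) {
        limb_t high = src[i + 1];
        dst[i] = (low >> cnt) | (high << tnc);
        low = high;
    }
    dst[n - 1] = low >> cnt;
    return ret;
}
```

A mul_2exp style move by one limb plus 4 bits, done in place, would then
be sketch_lshift(a + 1, a, n, 4), with the return value stored in
a[n + 1] and a[0] zeroed.  The matching div_2exp style move back down is
sketch_rshift(a, a + 1, n, 4).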

> 5) The core loop of GF2 multiplication in an ONB looks like this:

That's probably outside our scope at the moment.  There'd be nothing
wrong with it in principle, but we're more concerned about integers
than GF(2) polynomials.  Is it NTL which has some stuff in that area?

> 6) If people are attempting to subtly hint that it'll be a cold day in 
> hell before my code is accepted, I wish they'd come out and say it, and 
> save me the work.

There might be a place for a bit sequence extract and/or insert.  But
start by thinking what it would look like at the mpz level.  mpn
exists primarily to support mpz.