386 optimized bitblit code

Wed Feb 11 10:12:28 CET 2004

1) I have not.

2) Your code will be same, or slower.  SSE2 is twice as slow. ;)

3) True, many will appreciate greater flexibility

4) See (3)

5) Your work would be a key part of implementing an arbitrary machine code -
customizable word size computer - which would be a very cool toy to play
with!

6) I don't know about code acceptance - I'm trying to help you be more
effective at achieving your goals.

Rickey

Live free or don't

-----Original Message-----
From: Brian Hurt [mailto:bhurt at spnz.org] 
Sent: Tuesday, February 10, 2004 12:38 PM
To: Bowers, Rickey N,,DMDCWEST
Cc: Kevin Ryde; Gnu MP mailing list
Subject: RE: 386 optimized bitblit code

On Tue, 10 Feb 2004, Bowers, Rickey N,,DMDCWEST wrote:

> Brian,
> 
> An arbitrary rotate would consist of only two lshift or rshift, and
special
> handling of end and center limb.  Memory copy is needed when the
destination
> is the same as the source.  Bitblit does not avoid this problem in any
way.
> 
> As Kevin has said and I stated in emails, mpn_lshift/mpn_rshift are three
> times faster than your present implementation.  The bitblit and rotate
code
> should use this to its advantage.  Looking over the MMX x86 code should be
> convincing enough - the MMX speedup already exists.

1) Do not assume rotates are less than one word in length.

2) It's no surprise to me that the mmx versions are faster than non-mmx 
versions.  The core loops are identical, however, so when I finish 
debugging my code and post it, bitblit should be about the same speed 
(bitblit may take a constant extra overhead).  The 128-bit xmm registers 
were introduced with SSE-1, and so long as they aren't 2x as slow as the 
same mmx isntructions, should provide some further speedup.

3) lshift and rshift, since they determine both the direction of the shift 
and the direction of the copy, fail on (some) overlapping source and 
destinations.  I dislike routines that fail when it's not absolutely 
necessary for them to do so.  lshift and rshift together are approximately 
the complexity of bitblit by itself.  But by seperating them you've lost 
flexibility.

4) Long term, I'd like to see gmp support not just big number uses, but
also big bit set uses.  Rather like C treats integers.  C can get away
with requiring that more complex bitwise operations be built up out of 
several primitives, as the primitives are fast.  For larger bitset 
operations, this shouldn't be assumed.

5) The core loop of GF2 multiplication in an ONB looks like this:
    for (int j = 1; j < m; ++j) {
        rot_right(b, 1);
        zidx = lambda[0][j];
        oidx = lambda[1][j];
        for (i = 0; i < n; ++i) {
             c[i] ^= b[i] & (amatrix[zidx][i] ^ amatrix[oidx][i]);
        }
    }

The more I think about it, the more I think that being able to do bitwise 
operations from arbitrary bit sequences to arbitrary bit sequences is even 
more usefull than simple fast rotates.  The question is where to draw the 
line.

6) If people are attempting to subtly hint that it'll be a cold day in 
hell before my code is accepted, I wish they'd come out and say it, and 
save me the work.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian