386 optimized bitblit code

Mon Feb 9 11:29:48 CET 2004

On Tue, 10 Feb 2004, Kevin Ryde wrote:

> Brian Hurt <bhurt at spnz.org> writes:
> >
> > Attached is a first cut at 386-optimized code.
> 
> Unless I've missed something, what you're doing looks like either
> mpn_lshift or mpn_rshift, according to the overlap direction.  See
> that code for what it takes to go fast on x86s.  (In particular mmx is
> usually what you want for the mmx chips, shrdl is sub-optimal on say
> athlon, and very slow on pentium4.)
> 

lshift and rshift not only specify the direction of the shift, but also 
the direction of the copy- forwards or backwards.  For overlapping shifts, 
this is often wrong, especially if you're wanting to shift more than a 
word distance.  lshift and rshift have limitations bitblit doesn't have- 
they aren't good replacements for bitblit.  Nor is bitblit a good 
replacement for lshift/rshift, I comment- lshift and rshift zero the bits 
shifted in, something bitblit doesn't do.

I got started on bitblit when I looked at what was required to do a rotate 
of more than one word's length.  Using just lshift/rshift, it was 
(assuming a rotate left):
	- memcpy the low part of the number into a temporary array
	- memmove the high part down to the low part
	- lshift the high part so the bits are in the right places
	- rshift the temporary array so the bits are in the right places
	- memcpy all but one word of the temprorary array back into the 
	  number
	- or the last word together

Every bit in the low part of the word is moved three times, every bit in 
the high part of the word is move twice.  Now, compare this with the same 
operation using bitblit:
	- memcpy the low part of the word into a temp array
	- bitblit the high part of the word down into place
	- bitblit the low part of the temp array into place
Every bit of the high part of the word is moved only once, while every bit 
of the low part of the word is moved only twice.

I'm also looking at variations of bitblit that perform various bitwise 
logical operations, in addition to just moving bits.  I'm thinking, as 
operations- copy (dst = src), and (dst &= src), or (dst |= src), xor (dst 
^= src), not-and (dst = ~(dst & src)), not-or (dst = ~(dst | src)), equ 
aka not-xor (dst = ~(dst ^ src)), and-not (dst &= ~src), and or-not (dst 
|= ~src).

I agree that MMX is going to be a speedup, and SSE better yet.  I'm
working on that code.  The 386 code I submitted is faster than the C code
on my athlon XP 2200+ system.  It's just that the MMX/SSE code should be
even faster yet.

As a side note, I haven't gone through and scheduled the instructions in 
the 386 optimized code yet- there may still be clock cycles that can be 
squeezed out of that code.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian