386 optimized bitblit code

Tue Feb 10 10:14:31 CET 2004

Brian,

An arbitrary rotate would consist of only two lshift or rshift, and special
handling of end and center limb.  Memory copy is needed when the destination
is the same as the source.  Bitblit does not avoid this problem in any way.

As Kevin has said and I stated in emails, mpn_lshift/mpn_rshift are three
times faster than your present implementation.  The bitblit and rotate code
should use this to its advantage.  Looking over the MMX x86 code should be
convincing enough - the MMX speedup already exists.

Rickey

Live free or don't

-----Original Message-----
From: Brian Hurt [mailto:bhurt at spnz.org] 
Sent: Monday, February 09, 2004 9:30 AM
To: Kevin Ryde
Cc: Gnu MP mailing list
Subject: Re: 386 optimized bitblit code

On Tue, 10 Feb 2004, Kevin Ryde wrote:

> Brian Hurt <bhurt at spnz.org> writes:
> >
> > Attached is a first cut at 386-optimized code.
> 
> Unless I've missed something, what you're doing looks like either
> mpn_lshift or mpn_rshift, according to the overlap direction.  See
> that code for what it takes to go fast on x86s.  (In particular mmx is
> usually what you want for the mmx chips, shrdl is sub-optimal on say
> athlon, and very slow on pentium4.)
> 

lshift and rshift not only specify the direction of the shift, but also 
the direction of the copy- forwards or backwards.  For overlapping shifts, 
this is often wrong, especially if you're wanting to shift more than a 
word distance.  lshift and rshift have limitations bitblit doesn't have- 
they aren't good replacements for bitblit.  Nor is bitblit a good 
replacement for lshift/rshift, I comment- lshift and rshift zero the bits 
shifted in, something bitblit doesn't do.

I got started on bitblit when I looked at what was required to do a rotate 
of more than one word's length.  Using just lshift/rshift, it was 
(assuming a rotate left):
	- memcpy the low part of the number into a temporary array
	- memmove the high part down to the low part
	- lshift the high part so the bits are in the right places
	- rshift the temporary array so the bits are in the right places
	- memcpy all but one word of the temprorary array back into the 
	  number
	- or the last word together

Every bit in the low part of the word is moved three times, every bit in 
the high part of the word is move twice.  Now, compare this with the same 
operation using bitblit:
	- memcpy the low part of the word into a temp array
	- bitblit the high part of the word down into place
	- bitblit the low part of the temp array into place
Every bit of the high part of the word is moved only once, while every bit 
of the low part of the word is moved only twice.

I'm also looking at variations of bitblit that perform various bitwise 
logical operations, in addition to just moving bits.  I'm thinking, as 
operations- copy (dst = src), and (dst &= src), or (dst |= src), xor (dst 
^= src), not-and (dst = ~(dst & src)), not-or (dst = ~(dst | src)), equ 
aka not-xor (dst = ~(dst ^ src)), and-not (dst &= ~src), and or-not (dst 
|= ~src).

I agree that MMX is going to be a speedup, and SSE better yet.  I'm
working on that code.  The 386 code I submitted is faster than the C code
on my athlon XP 2200+ system.  It's just that the MMX/SSE code should be
even faster yet.

As a side note, I haven't gone through and scheduled the instructions in 
the 386 optimized code yet- there may still be clock cycles that can be 
squeezed out of that code.

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian

_______________________________________________
gmp-devel mailing list
gmp-devel at swox.com
https://gmplib.org/mailman/listinfo/gmp-devel