386 optimized bitblit code
Brian Hurt
bhurt at spnz.org
Mon Feb 9 11:29:48 CET 2004
On Tue, 10 Feb 2004, Kevin Ryde wrote:
> Brian Hurt <bhurt at spnz.org> writes:
> >
> > Attached is a first cut at 386-optimized code.
>
> Unless I've missed something, what you're doing looks like either
> mpn_lshift or mpn_rshift, according to the overlap direction. See
> that code for what it takes to go fast on x86s. (In particular mmx is
> usually what you want for the mmx chips, shrdl is sub-optimal on say
> athlon, and very slow on pentium4.)
>
lshift and rshift not only specify the direction of the shift, but also
the direction of the copy- forwards or backwards. For overlapping shifts,
this is often wrong, especially if you're wanting to shift more than a
word distance. lshift and rshift have limitations bitblit doesn't have-
they aren't good replacements for bitblit. Nor is bitblit a good
replacement for lshift/rshift, I comment- lshift and rshift zero the bits
shifted in, something bitblit doesn't do.
I got started on bitblit when I looked at what was required to do a rotate
of more than one word's length. Using just lshift/rshift, it was
(assuming a rotate left):
- memcpy the low part of the number into a temporary array
- memmove the high part down to the low part
- lshift the high part so the bits are in the right places
- rshift the temporary array so the bits are in the right places
- memcpy all but one word of the temprorary array back into the
number
- or the last word together
Every bit in the low part of the word is moved three times, every bit in
the high part of the word is move twice. Now, compare this with the same
operation using bitblit:
- memcpy the low part of the word into a temp array
- bitblit the high part of the word down into place
- bitblit the low part of the temp array into place
Every bit of the high part of the word is moved only once, while every bit
of the low part of the word is moved only twice.
I'm also looking at variations of bitblit that perform various bitwise
logical operations, in addition to just moving bits. I'm thinking, as
operations- copy (dst = src), and (dst &= src), or (dst |= src), xor (dst
^= src), not-and (dst = ~(dst & src)), not-or (dst = ~(dst | src)), equ
aka not-xor (dst = ~(dst ^ src)), and-not (dst &= ~src), and or-not (dst
|= ~src).
I agree that MMX is going to be a speedup, and SSE better yet. I'm
working on that code. The 386 code I submitted is faster than the C code
on my athlon XP 2200+ system. It's just that the MMX/SSE code should be
even faster yet.
As a side note, I haven't gone through and scheduled the instructions in
the 386 optimized code yet- there may still be clock cycles that can be
squeezed out of that code.
--
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
- Gene Spafford
Brian
More information about the gmp-devel
mailing list