A little warning in GMP.h

Mon Sep 27 17:22:26 CEST 2004

On Mon, 27 Sep 2004, Brian Gladman wrote:

> Torbjorn Granlund wrote:
> 
> > Brian Gladman <brg at gladman.plus.com> writes:
> > 
> >   Designing GMP to work well with poor compilers at the expense of good ones
> >   is fine if you really think that this makes sense. But in my view this is a
> >   design philosophy that is from a bygone age.
> > 
> > You would know more about those datk age than me, but it baffles
> > me that people ever behaved that irrationally.  :-)
> > 
> >   And, of course, you can always trust someone on this list to turn a serious
> >   point into a cheap jibe at the Microsoft compiler.
> >   
> > Oh, I misunerstood you.  I thought you were showing a horror
> > example of the poor optimizer of the Microsoft compiler, when you,
> > in fact, where making a serious point.
> 
> Ok, now that we have got that misunderstanding out of the way, I think 
> it is worth asking Brian Hurt's question but 'in reverse'.
> 
> That is, are there still compilers around that: (a) are important for 
> GMP, and (b) cannot produce branchless code from:
> 
> 	return (__gmp_n != 0) ? __gmp_l : 0;
> 
> on those architectures where this matters?
> 
> All the x86 compilers that I use regularly have no problem producing 
> branchless code in this situation.  I don't use GCC but can I assume 
> that it is now good enough on the X86 to do the same?

Yes, it is (at least for GCC 3.2.2), if it feels that it's an advantage.

Note that there can be an advantage to branching: branching doesn't load 
__gmp_l if it doesn't need to.

So given:

mp_limb_t foo(mp_limb_t __gmp_n, mp_limb_t __gmp_l) {
 	return (__gmp_n != 0) ? __gmp_l : 0;
}

does in fact produce a branchless version.  But:

extern mp_limb_t __gmp_n;
extern mp_limb_t __gmp_l;

mp_limb_t foo(void) {
 	return (__gmp_n != 0) ? __gmp_l : 0;
}

produces branching code.

Another comment I'll make is that the cost of a branch is a function of 
how predictable it is.  A correctly predicted branch is very cheap- a 
clock cycle or two at worst.  It's a mispredicted branch which is 
expensive.

So if, for example 90% of the time __gmp_n is non-zero (for example), the
total cost for the branch is 0.9*predicted_cost + 0.1*mispredicted_cost.  
Assuming that predicted_cost is 1 clock, and mispredicted_cost is 20
clocks, the total cost of the branch at this point is 0.9 + 2 = 2.9
clocks.  Note that taking the L-1 cache *hit* of a load is two clocks.  At 
this level of predictability, avoiding a load may be worthwhile, even if 
the load is from cache.

It's only if the branch is mispredicted often that it becomes a problem-
if __gmp_n is non-zero only 60% of the time, the cost is then 0.6 + 
(0.4*20) = 8.6 clocks.  Now it's worthwhile to take the load from L1 cache 
(but not from L2 cache) to avoid the branch.

The worst case scenario is a 50% split- a completely unpredictable branch.  
At this point, the branch takes 10.5 clocks on average.  Note that an L2 
cache *hit* costs 20-30 clock cycles.

Does someone actually have measurements about how often this branch is 
mispredicted?

-- 
"Usenet is like a herd of performing elephants with diarrhea -- massive,
difficult to redirect, awe-inspiring, entertaining, and a source of
mind-boggling amounts of excrement when you least expect it."
                                - Gene Spafford 
Brian