Overhead vs cache effectiveness
Torbjörn Granlund
tege@swox.com
23 Nov 2002 04:08:06 +0100
In the past, the GMP assembly hackers have aimed for maximum
speed for big operands, often adding a constant overhead in
the process. The code has been made to run as fast as
possible for operands that barely fit into L1 cache.
This isn't always clever. Take mpn_addmul_1 as an example.
It is clearly the most important loop of GMP. And the most
important usage for that routine is as a building block for
basecase multiplication.
But basecase multiplication will typically not use operand
sizes bigger than the Karatsuba threshold, which typically
lies between 10 and 60 limbs. We therefore primarily want
our mpn_addmul_1 to run fast for such small sizes. (There
are other usages of mpn_addmul_1 too, where operands are
potentially bigger, but that is mainly for the more and more
scarce quadratic algorithms in GMP.)
A different example is mpn_add_n. It is commonly used for
all sizes of operands, from small to really huge. In the
latter case, it will suffer a 100% L1 cache miss rate, which
it is not written to handle well.
I believe it would be a good idea to provide separate loops
for some of these functions: one minimal-overhead loop for
small operands, and one for the cache-miss case. For the
latter case, loop startup overhead is negligible. The code
for huge operands should basically just load operands very
early, and perhaps issue prefetch instructions. This adds
overhead, since such a loop will need many values to be
preloaded into registers before it is entered, and many
values to be stored after the loop exits.
What should the cutoff point be? Perhaps that is not
terribly important, as long as it is chosen to be over, say,
100 limbs. A few cycles of extra overhead for such large
operands will not matter much.
The potential of a change like this is that we get
considerably better speed both for small operands and for
operands in the FFT range. Experiments on Alpha ev6
indicate that we can double the mpn_add_n speed for huge
operands. Likewise, ev6 mpn_addmul_1 can become 20% faster
for the basecase multiply range.
--
Torbjörn