[PATCH] Optimize 32-bit sparc T1 multiply routines.

Fri Jan 4 14:12:23 CET 2013

Ciao,

Il Ven, 4 Gennaio 2013 10:07 am, David Miller ha scritto:
> 	mpmul	3			! The immediate field is "N - 1"

Does the immediate means that, to write e.g. sqr_basecase (it should be
far simpler than writing mul_basecase), you need a branch for each
different N?

> The circuit does scale very well, for example here are cycle counts
> for just the 'mpmul' instruction itself for N from 1 to 16:
> 79
> 84
[...]
> 318
> 350

(Almost) exactly 78 + N + N*N, a big latency plus a quadratic algorithm.
When 2n > 26, we have (78 + 2n + 2n*2n) > 3*(78 + n + n*n)... I'm curious
to see where the Karatsuba threshold will be.

> Anyways I obviously have a lot of experimenting and tinkering to do,
> so I'll come back here once I have a better idea of how we might use
> 'mpmul' most effectively.

We hope to see you soon ;-)

Il Ven, 4 Gennaio 2013 3:54 am, David Miller ha scritto:
> Yes, the mpmul instruction is limited to balanced NxN multiplies.
>
> Well, actually, we could use this mpmul instruction for NxM cases by
> padding the unused parameters with zeros.  That way we could support
> any case where N <= 32 and M <= 32.

It seems better to perform a single 8x4 product (8x8, 150 cycles) than two
4x4 (98+98 cycles); also a 12x4 (12x12, 234 cycles) seems better than
three 4x4 (98+98+98= 294)... maybe mul_4 is not worth writing, but mul_6
is...

> Making this for crypto would be of no value for T4, because as
> mentioned the chip has other instructions that more directly support
> modular arithmetic in the form of 'montmul' and 'montsqr'
> instructions.

Cryptography beyond 2048-bits is possible :-)

-- 
http://bodrato.it/