divrem_1 and mod_1 (Was: Re: gmp-devel list)
E. Mayer
ewmayer at aol.com
Fri Nov 22 02:34:58 UTC 2013
Back in March, Paul posted some Core2 cycle counts for a data-parallel right-to-left divrem algorithm I posted to ArXiv. Moving from prehistoric to newer CPU architectures, here are average cycle-per-limb counts for the DIV_NBY1 functionality at issue on a single core of an Intel Haswell system running Linux/GCC.
First, the unmodified 4-way-split loop inline-asm code I got the best results with on Core2 - numbers ruounded to nearest 0.1 cycle/limb:
Rem-only: 3.0 cycle/limb
Div+Rem: 6.8 cycle/limb
This compares to 6 and 12.5 cycle/limb, resp., on Core2. I do not have ready access to a Sandy/Ivy-Bridge system but expect similar cycle counts there, as the main boost is due to the significantly improved throughput of the integer MUL instructions in Intel's post-Core2 chip families, that is, already present in *y Bridge.
The x86 implementation of the algorithm has no interactions between the MUL instructions of the loop body and the add/sub carry flags, thus modifying the code to use the new-in-Haswell MULX instruction is only of modest benefit, in that it obviates some register-data moves related to the old x86 MUL ax:dx register-pair constraints. Here are numbers using MULX:
Rem-only: 2.8 cycle/limb
Div+Rem: 6.8 cycle/limb
Best regards,
-Ernst
More information about the gmp-devel
mailing list