divrem_1 and mod_1 (Was: Re: gmp-devel list)

Fri Nov 22 02:34:58 UTC 2013

Back in March, Paul posted some Core2 cycle counts for a data-parallel right-to-left divrem algorithm I posted to ArXiv.  Moving from prehistoric to newer CPU architectures, here are average cycle-per-limb counts for the DIV_NBY1 functionality at issue on a single core of an Intel Haswell system running Linux/GCC.

First, the unmodified 4-way-split loop inline-asm code I got the best results with on Core2 - numbers ruounded to nearest 0.1 cycle/limb:

Rem-only: 3.0 cycle/limb
Div+Rem: 6.8 cycle/limb

This compares to 6 and 12.5 cycle/limb, resp., on Core2. I do not have ready access to a Sandy/Ivy-Bridge system but expect similar cycle counts there, as the main boost is due to the significantly improved throughput of the integer MUL instructions in Intel's post-Core2 chip families, that is, already present in *y Bridge.

The x86 implementation of the algorithm has no interactions between the MUL instructions of the loop body and the add/sub carry flags, thus modifying the code to use the new-in-Haswell MULX instruction is only of modest benefit, in that it obviates some register-data moves related to the old x86 MUL ax:dx register-pair constraints. Here are numbers using MULX:

Rem-only: 2.8 cycle/limb
Div+Rem: 6.8 cycle/limb

Best regards,
-Ernst