caching of transforms used for large multiplications

Wed Jun 12 15:47:44 CEST 2013

Daniel Lichtblau <danl at wolfram.com> writes:

> My question(s): Are these NTT intermediate results cached?

No. I have a tentative interface for this in my small-prime fft code
which haven't yet been integrated with gmp. I think Paul Zimmermann may
also have some patches along those lines for the Schönhage-Strassen fft
which is used in GMP now.

> If not, is there a way to "tell" GMP to keep them around for
> subsequent reuse? I suspect such caching might be a generally useful
> thing to have if it is not already present.

Interfaces for operations on invariant numbers have been discussed for a
long time, but no public interfaces have been implemented (and few
internal). Caching transforms may be useful also for smaller
multiplications (I think I saw a speedup of a few % already with
Karatsuba, when I played with that). And for division for all sizes.

> For fast GCD it could amount to a notable speed gain, I think. (I
> believe Niels remarked specifically on this but do not have the
> reference handy. Sorry.)

In GCD, there are certain large unbalanced multiplication. Splitting the
large factor and reusing the transform of the small factor is one
possible approach to optimization. Another is transforming the 2x2
matrix elements once (since they're used twice for a matrix x vector
multiplication), that's done in the small-prime fft code. And yet
another is using wraparound arithmetic. (And there are surely other
tricks as well).

Hmm, looking at the GCD code now, it makes some limited use of wraparound
arithmetic. Only for the matrix x vector multiplication for the
first of the two recursive reductions in hgcd. The gmp function to look
for is mpn_mulmod_bnm1.

I also dug up some old slides. Below are some estimates extracted from a
presentation I made at KTH, May 15, 2008. I don't remember much of the
details, but for the largest savings, you'd want to not only cache
transforms, but also do some linear operations in the transform domain.

E.g,, to multiply two 2x2 matrices, first FFT-transform the 4
elements of each matrix, *then* expand them linearly to two 7-element
vectors for Strassen multiplication. Do the 7 pointwise multiplications,
linearly transform them back to four elements, and finally, four inverse
FFTs, one for each element of the result matrix.

Regards,
/Niels

\section{Further work}
\titleslide{Further work}

\begin{frame}
  \frametitle{Matrix multiplication}
  \begin{equation*}
    M_1 \cdot M_2 \quad \text{$2\times2$ matrices}
  \end{equation*}
  \bigskip

  Assume \FFT and sizes such that transforms and pointwise
  multiplication take equal time.

  \bigskip
  \begin{tabular}{l|rrrr}
    & \FFT & \IFFT & Pointwise & Saving \\
    \hline
    Naive & 16 & 8 & 8 & 0\% \\
    Schönhage-Strassen & 14 & 7 & 7 & 12\% \\
    Invariance & 8 & 4 & 8 & 37\% \\
    S.-S. + invariance & 8 & 4 & 7 & 40\% \\
  \end{tabular}
\end{frame}

\begin{frame}
  \frametitle{Matrix-vector multiplication}

  \begin{itemize}
  \item If $\alpha, \beta$ are returned: $M$ of size $n/4$, $A', B'$ of size
    $n/2$.
    \begin{equation*}
      M^{-1} \cdot \M{A \\ B} = 2^p \M{\alpha \\ \beta} + M^{-1} \cdot \M{A' \\ B'}
    \end{equation*}
    \begin{tabular}{l|rrl}
      & \#Mults. & Prod. size &\\
      \hline
      Naive & 4 & $3n/4$ & Wins in \FFT range \\
      Block & 8 & $n/2$ & Can use invariance \\
      S.-S. & 7 & $n/2$ & Wins in Karatsuba range \\
    \end{tabular}
    \pause
  \item If only matrix is returned: $M$ of size $n/4$, $A, B$ of size
    $n$.
    \begin{equation*}
      \M{\alpha \\ \beta} = M^{-1} \cdot \M{A \\ B}
    \end{equation*}
    $\alpha, \beta$ are of size $3n/4$ (cancellation!). Compute $\bmod (2^k \pm 1)$,
    with transform size $\approx 3n/4$.
    \pause
  \item 
    \alert{Same transform size}, $3n/4$, no matter if reduced
    numbers are available or not!
  \end{itemize}
\end{frame}

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.