Writing mpz_powm for GPU/CUDA

Tue Jun 11 01:32:17 UTC 2019

I have an application that requires modular squaring of thousands of
4096-bit numbers at once, and I am planning to write CUDA code to do
all of the squarings in parallel. How much work would it be to port
just the mpz_powm function to CUDA? I know that the latency of each
individual multiplication won't be as good as on the CPU, but I'm
looking for throughput here, not latency.
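
To make the question concrete, here is a rough sketch of the per-number
work, written as plain C so it can be checked on the CPU first. It
assumes Montgomery multiplication (the CIOS variant) with 32-bit limbs
and an odd modulus, which is only an assumption about what a
GPU-friendly port might look like and is not taken from GMP's sources.
The names mont_n0inv, mont_mul and batch_mont_square are made up for
illustration. Inputs and outputs are in Montgomery form, so a real port
would still need the conversions in and out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_LIMBS 128 /* 128 x 32 bits = 4096 bits (assumed limb size) */

/* Returns -n0^{-1} mod 2^32 for odd n0 (hypothetical helper). */
static uint32_t mont_n0inv(uint32_t n0) {
    uint32_t x = n0; /* n0*n0 == 1 mod 8, so x is correct to 3 bits */
    for (int i = 0; i < 5; i++)
        x *= 2u - n0 * x; /* Newton step: doubles the correct bits */
    return 0u - x;
}

/* r = a * b * R^{-1} mod n, with R = 2^(32*s), a, b < n, n odd,
 * s <= MAX_LIMBS. Limbs are little-endian. r may alias a or b. */
static void mont_mul(uint32_t *r, const uint32_t *a, const uint32_t *b,
                     const uint32_t *n, uint32_t n0inv, int s) {
    uint32_t t[MAX_LIMBS + 2];
    memset(t, 0, sizeof t);
    for (int i = 0; i < s; i++) {
        /* t += a * b[i] */
        uint64_t c = 0, v;
        for (int j = 0; j < s; j++) {
            v = (uint64_t)a[j] * b[i] + t[j] + c;
            t[j] = (uint32_t)v;
            c = v >> 32;
        }
        v = (uint64_t)t[s] + c;
        t[s] = (uint32_t)v;
        t[s + 1] = (uint32_t)(v >> 32);
        /* t = (t + m * n) / 2^32, with m chosen so the low limb cancels */
        uint32_t m = t[0] * n0inv;
        v = (uint64_t)m * n[0] + t[0];
        c = v >> 32;
        for (int j = 1; j < s; j++) {
            v = (uint64_t)m * n[j] + t[j] + c;
            t[j - 1] = (uint32_t)v;
            c = v >> 32;
        }
        v = (uint64_t)t[s] + c;
        t[s - 1] = (uint32_t)v;
        t[s] = t[s + 1] + (uint32_t)(v >> 32);
    }
    /* Here t < 2n; subtract n once if t >= n. */
    int ge = (t[s] != 0);
    if (!ge) {
        ge = 1;
        for (int j = s - 1; j >= 0; j--) {
            if (t[j] != n[j]) { ge = t[j] > n[j]; break; }
        }
    }
    uint64_t borrow = 0;
    for (int j = 0; j < s; j++) {
        uint64_t d = (uint64_t)t[j] - (ge ? n[j] : 0u) - borrow;
        r[j] = (uint32_t)d;
        borrow = d >> 63; /* 1 if the subtraction wrapped */
    }
}

/* Squares `count` numbers in place; x holds count * s limbs.
 * On a GPU, each iteration of this loop would be one thread. */
static void batch_mont_square(uint32_t *x, size_t count,
                              const uint32_t *n, uint32_t n0inv, int s) {
    for (size_t k = 0; k < count; k++)
        mont_mul(x + k * (size_t)s, x + k * (size_t)s, x + k * (size_t)s,
                 n, n0inv, s);
}
```

The loop in batch_mont_square is the part I would expect to map onto
CUDA threads, one number per thread. Whether that layout or a
several-threads-per-number layout gives better throughput at 4096 bits
is exactly the kind of thing I am unsure about.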

Thanks,
Dan Cline