Performance on riscv32

Mon Mar 30 16:47:45 CEST 2026

Hi,

I'm trying to do ed25519 operations on a slow riscv32 system. I'm now
using nettle + mini-gmp, with an umul_ppmm patched to use (uint64_t) u *
v, which results in reasonable code using mul and mulhu instructions.
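
For reference, a patch of this kind could look roughly like the sketch
below. It assumes 32-bit limbs; the typedef and the temporary name are
illustrative, not the exact mini-gmp change.

```c
#include <stdint.h>

typedef uint32_t mp_limb_t;   /* assumption: 32-bit limbs on riscv32 */

/* Sketch: compute the double-limb product u * v through a 64-bit
   multiply, then split it into high (w1) and low (w0) limbs. */
#define umul_ppmm(w1, w0, u, v)                     \
  do {                                              \
    uint64_t p_ = (uint64_t) (u) * (v);             \
    (w1) = (mp_limb_t) (p_ >> 32);                  \
    (w0) = (mp_limb_t) p_;                          \
  } while (0)
```

With this form a compiler can typically emit one mulhu for the high
limb and one mul for the low limb.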

I get this mpn_addmul_1 inner loop:

      28: 4010          lw      a2, 0x0(s0)
      2a: 032636b3      mulhu   a3, a2, s2
      2e: 03260633      mul     a2, a2, s2
      32: 962a          add     a2, a2, a0
      34: 4098          lw      a4, 0x0(s1)
      36: 00a63533      sltu    a0, a2, a0
      3a: 9536          add     a0, a0, a3
      3c: 0411          addi    s0, s0, 0x4
      3e: 9732          add     a4, a4, a2
      40: 00c73633      sltu    a2, a4, a2
      44: 9532          add     a0, a0, a2
      46: 00448613      addi    a2, s1, 0x4
      4a: c098          sw      a4, 0x0(s1)
      4c: 84b2          mv      s1, a2
      4e: fcb61de3      bne     a2, a1, 0x28 <mpn_addmul_1+0x28>

As I understand it, sltu + add is needed for each carry propagation.
For comparison, plain mpn_add_n gets compiled to an inner loop

       c: 419c          lw      a5, 0x0(a1)
       e: 4214          lw      a3, 0x0(a2)
      10: 973e          add     a4, a4, a5
      12: 00f737b3      sltu    a5, a4, a5
      16: 96ba          add     a3, a3, a4
      18: 00e6b733      sltu    a4, a3, a4
      1c: 973e          add     a4, a4, a5
      1e: c114          sw      a3, 0x0(a0)
      20: 0511          addi    a0, a0, 0x4
      22: 0611          addi    a2, a2, 0x4
      24: 0591          addi    a1, a1, 0x4
      26: ff0513e3      bne     a0, a6, 0xc <mpn_add_n+0xc>
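
In C terms, the loop above corresponds roughly to the following sketch
(the names limb_t and add_n are illustrative, and 32-bit limbs are
assumed). Each of the two additions needs an unsigned compare to
recover its carry, which is where the sltu instructions come from.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs */

/* Sketch: rp[] = ap[] + bp[] over n limbs, returning the carry out. */
static limb_t add_n(limb_t *rp, const limb_t *ap, const limb_t *bp, size_t n)
{
  limb_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      limb_t s = ap[i] + cy;   /* add: fold in the carry-in */
      cy = s < cy;             /* sltu: did that addition wrap? */
      limb_t r = s + bp[i];    /* add: the actual limb sum */
      cy += r < s;             /* sltu + add: combine both carries */
      rp[i] = r;
    }
  return cy;
}
```

At most one of the two compares can be true for a given limb, so the
combined carry is always 0 or 1.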

Besides the 4(!) instructions for carry propagation, the lack of
indexed addressing also looks somewhat costly.

I'm not that familiar with riscv, but to me the generated code looks
pretty good under the architectural limitations, and I see no obvious
micro-optimizations (only a single move instruction that appears a bit
redundant). But I may be missing something.

When benchmarking, my ed25519 code is about 50% slower than the
monocypher C library, for the ed25519 signing operation (10 million
cycles vs 7 million). That library appears to use arithmetic based on
nail bits (in GMP terminology), to avoid dealing with low-level carry
propagation (and it also has the advantage of specialized code for the
size of interest). So I wonder, is it possible to get reasonable speed
with fullsize limbs (no nails) on this platform? If I could switch from
mini-gmp to full gmp (a bit challenging due to the rather limited
environment with no normal libc), and revive GMP nails code, would that
make sense for performance?
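
To illustrate what nails buy, here is a minimal sketch assuming 28
number bits and 4 nail bits per 32-bit limb. The limb split and all
names are hypothetical; this shows neither monocypher's actual
representation nor GMP's nails code. Additions become carry-free, and
a separate pass propagates carries only when the nail bits are about
to run out of headroom.

```c
#include <stddef.h>
#include <stdint.h>

#define NUMB_BITS 28   /* assumption: 4 nail bits per 32-bit limb */
#define NUMB_MASK ((UINT32_C(1) << NUMB_BITS) - 1)

/* Limb-wise add with no carry handling at all. Valid as long as the
   nail bits still have room for the accumulated excess. */
static void nail_add_n(uint32_t *rp, const uint32_t *ap,
                       const uint32_t *bp, size_t n)
{
  for (size_t i = 0; i < n; i++)
    rp[i] = ap[i] + bp[i];
}

/* Deferred carry pass: move everything above NUMB_BITS into the next
   limb. Returns the carry out of the top limb. */
static uint32_t nail_normalize(uint32_t *rp, size_t n)
{
  uint32_t cy = 0;
  for (size_t i = 0; i < n; i++)
    {
      uint32_t t = rp[i] + cy;
      rp[i] = t & NUMB_MASK;
      cy = t >> NUMB_BITS;
    }
  return cy;
}
```

With 4 nail bits, up to 16 normalized values can be summed this way
before a normalization pass is required.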

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.