#StackBounty: #algorithms #cryptography #number-theory #modular-arithmetic Optimal parallel-time repeated modular squaring circuit

Bounty: 50

Given a 4096-bit integer $x$ and a 4096-bit RSA modulus $N$ (of unknown factorisation) what is the fastest circuit to compute $x^{2^T} mod N$ where $T=2^{40}$. That is, what is the fastest parallel-time (i.e. lowest latency) algorithm for repeatedly squaring $x$ modulo $N$ a trillion times?

For context, I am looking to build a Verifiable Delay Function ASIC to squeeze out opportunities for parallelisation that CPUs/GPUs cannot capture.

I have found hundreds of papers in the literature that cover modular multiplication. Families of algorithms include Montgomery, Barrett, Residue Number System (RNS), sum of residues, Kochanski, Brickell. Often the optimised parameter is not latency—e.g. it can be throughput, power consumption, die area, simplicity, side channel leak resistance, etc. The fastest implementation I found does a 4096-bit modular multiplication in 607ns.


  1. What is the optimal parallel-time algorithm for squaring a 4096-bit integer? One brute-force approach is to implement a 4096-by-4096 multiplier with $4096^2$ gates and a tree of 3-to-2 adders of depth 4096. Can we do better than that?
  2. The algorithms discussed in the literature often do modular multiplication (as opposed to modular squaring). What latency gains can be had from designing a circuit specific to modular squaring?

Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.