*Bounty: 50*

*Bounty: 50*

Given a 4096-bit integer $x$ and a 4096-bit RSA modulus $N$ (of unknown factorisation) what is the fastest circuit to compute $x^{2^T} mod N$ where $T=2^{40}$. That is, what is the fastest parallel-time (i.e. lowest latency) algorithm for repeatedly squaring $x$ modulo $N$ a trillion times?

For context, I am looking to build a Verifiable Delay Function ASIC to squeeze out opportunities for parallelisation that CPUs/GPUs cannot capture.

I have found hundreds of papers in the literature that cover modular multiplication. Families of algorithms include Montgomery, Barrett, Residue Number System (RNS), sum of residues, Kochanski, Brickell. Often the optimised parameter is *not* latency—e.g. it can be throughput, power consumption, die area, simplicity, side channel leak resistance, etc. The fastest implementation I found does a 4096-bit modular multiplication in 607ns.

Sub-questions:

- What is the optimal parallel-time algorithm for squaring a 4096-bit integer? One brute-force approach is to implement a 4096-by-4096 multiplier with $4096^2$ gates and a tree of 3-to-2 adders of depth 4096. Can we do better than that?
- The algorithms discussed in the literature often do modular
*multiplication*(as opposed to modular squaring). What latency gains can be had from designing a circuit specific to modular squaring?