Area-Time Efficient Architecture of FFT-Based Montgomery Multiplication

Research output: Journal Publications and Reviews (RGC: 21, 22, 62), 21_Publication in refereed journal, peer-review

13 Scopus Citations



Original language: English
Article number: 7547917
Pages (from-to): 375-388
Journal / Publication: IEEE Transactions on Computers
Issue number: 3
Publication status: Published - 1 Mar 2017


Modular multiplication is the most time-consuming operation in number-theoretic cryptographic algorithms involving large integers, such as RSA and Diffie-Hellman. Implementations reveal that more than 75 percent of the RSA run time is spent in the modular multiplication function for moduli larger than 1,024 bits. Fast multiplier architectures exist that minimize delay and increase throughput using parallelism and pipelining. However, such designs are large in area and low in efficiency. In this paper, we integrate the fast Fourier transform (FFT) method into McLaughlin's framework and present an improved FFT-based Montgomery modular multiplication (MMM) algorithm that achieves high area-time efficiency. Compared to previous FFT-based designs, we avoid the zero-padding operation by computing the modular multiplication steps directly using cyclic and nega-cyclic convolutions, thereby halving the convolution length. Furthermore, supported by the number-theoretic weighted transform, the FFT algorithm is used to provide fast convolution computation. We also introduce a general method for efficient parameter selection for the proposed algorithm. Architectures with single and double butterfly structures are designed to obtain low area-latency solutions, and are implemented on Xilinx Virtex-6 FPGAs. The results show that our work offers better area-latency efficiency than state-of-the-art FFT-based MMM architectures for operand sizes of 1,024 bits and above. We obtain area-latency efficiency improvements of up to 50.9 percent for 1,024-bit, 41.9 percent for 2,048-bit, 37.8 percent for 4,096-bit, and 103.2 percent for 7,680-bit operands. Furthermore, our design also achieves lower operating latency, at a high clock frequency, for transforms of length 64 and above.
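The role of the weighted transform in avoiding zero-padding can be illustrated with a small software sketch. The Python below is not the paper's algorithm or hardware architecture. It is a minimal illustration, under assumed parameters, of how a length-n number-theoretic transform yields a cyclic convolution directly, and how pre-weighting the inputs by powers of a 2n-th root of unity yields a nega-cyclic convolution of the same length. The prime `P = 65537`, the generator `G = 3`, the length-8 inputs, and all function names are assumptions chosen for the example.

```python
# Illustrative sketch only: cyclic and nega-cyclic convolution of length n
# computed with a length-n transform (no zero-padding to length 2n).
# P, G and all function names are assumptions made for this example.

P = 65537  # Fermat prime 2^16 + 1
G = 3      # a primitive root modulo P


def ntt(a, omega):
    """Recursive radix-2 transform of a (length a power of two) modulo P.

    omega must be a primitive len(a)-th root of unity modulo P.
    """
    n = len(a)
    if n == 1:
        return list(a)
    omega_sq = omega * omega % P
    even = ntt(a[0::2], omega_sq)
    odd = ntt(a[1::2], omega_sq)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % P           # butterfly: twiddle times odd half
        out[k] = (even[k] + t) % P
        out[k + n // 2] = (even[k] - t) % P
        w = w * omega % P
    return out


def convolve(a, b, negacyclic):
    """Length-n cyclic (or nega-cyclic) convolution of a and b modulo P."""
    n = len(a)
    psi = pow(G, (P - 1) // (2 * n), P)  # primitive 2n-th root of unity
    omega = psi * psi % P                # primitive n-th root of unity
    weight = psi if negacyclic else 1    # weights psi^i turn cyclic into nega-cyclic
    wa = [x * pow(weight, i, P) % P for i, x in enumerate(a)]
    wb = [x * pow(weight, i, P) % P for i, x in enumerate(b)]
    fa, fb = ntt(wa, omega), ntt(wb, omega)
    fc = [x * y % P for x, y in zip(fa, fb)]   # pointwise product
    wc = ntt(fc, pow(omega, P - 2, P))         # inverse transform (unscaled)
    n_inv = pow(n, P - 2, P)
    w_inv = pow(weight, P - 2, P)
    return [x * n_inv % P * pow(w_inv, i, P) % P for i, x in enumerate(wc)]


def schoolbook(a, b, negacyclic):
    """Reference O(n^2) convolution used only to check the transform result."""
    n = len(a)
    c = [0] * n
    sign = -1 if negacyclic else 1
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - n] += sign * a[i] * b[j]  # wrapped-around terms
    return [x % P for x in c]


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
cyclic = convolve(a, b, negacyclic=False)
nega = convolve(a, b, negacyclic=True)
```

Both results come from transforms of length 8, whereas a zero-padded linear convolution of the same inputs would need length 16. Results are residues modulo `P`, so negative nega-cyclic coefficients appear as their representatives in the range 0 to `P - 1`.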

Research Area(s)

  • fast Fourier transform (FFT), field-programmable gate array (FPGA), Montgomery modular multiplication, number-theoretic weighted transform

Citation Format(s)

Area-Time Efficient Architecture of FFT-Based Montgomery Multiplication. / Dai, Wangchen; Chen, Donald Donglong; Cheung, Ray C.C.; Koç, Çetin Kaya.

In: IEEE Transactions on Computers, Vol. 66, No. 3, 7547917, 01.03.2017, p. 375-388.
