Pipelined VLSI Architecture for RSA Based on Montgomery Modular Multiplication

GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) - 2016 | July 2016

e-ISSN: 2455-5703

Vinodhini N. (1), Suganya C. (2)

(1,2) Research Scholar, Department of Electronics and Communication Engineering, Dr. Mahalingam College of Engineering and Technology, Pollachi-642002, India

Abstract
Modular multiplication is a key operation in many public-key cryptosystems, and Montgomery multiplication is one of the well-known algorithms for carrying it out quickly. Carry-save adders are employed to avoid carry propagation at each addition. To reduce the extra clock cycles, a configurable carry-save adder (CCSA) built from either one full adder or two half adders can be employed. In addition, a mechanism has been developed that skips the unnecessary carry-save additions in the one-level CCSA while maintaining a short critical path delay. In the proposed architecture, the path with the maximum worst-case delay is identified, and additional buffers are introduced in that path so that it is synchronized to the clock, which reduces the worst-case delay. The resulting pipelining increases the speed and achieves a high throughput. The pipelined architecture is applied to the RSA public-key algorithm to increase the throughput of the RSA cryptosystem.

Keywords- Carry save addition, Montgomery modular multiplier, Pipelining, RSA

I. INTRODUCTION
With the growth of data communication and internet services such as electronic commerce, security plays an important role across networks. Public-key cryptosystems, introduced by Rivest, R.L. et al., provide data security for such networks. In these cryptosystems, modular multiplication (MM) plays an important role among the arithmetic functions, and MM with large integers is preferred to enhance security. Montgomery multiplication, proposed by Montgomery, P.L., is one of the fast algorithms for carrying out MM. This algorithm determines the quotient from the least significant digit of the operands only and replaces the complicated division with a series of shifting modular additions. The Montgomery MM is given by $S = A \cdot B \cdot R^{-1} \bmod N$, where $N$ is the $k$-bit modulus, $R^{-1}$ is the inverse of $R$ modulo $N$, i.e. $R \cdot R^{-1} \equiv 1 \pmod N$, and $R = 2^k \bmod N$. Hence it can be easily implemented in VLSI circuits to speed up the encryption and decryption process.

Long carry propagation is a major problem when adding large operands in binary representation. To solve this problem, several approaches based on carry-save addition were proposed to achieve a significant speedup of the Montgomery MM. These approaches can be divided into the semi carry save (SCS) and full carry save (FCS) strategies. In the SCS format used in the works by Kim, Y.S. et al., Bunimov, V. et al., and Zhengbing, H. et al., the inputs and outputs of the Montgomery multiplication are represented in binary form, but the intermediate results of the modular multiplication are kept in carry-save format to avoid carry propagation. However, the final product must be converted from its carry-save representation into its binary representation at the end of each modular multiplication. This conversion can be accomplished simply by adding the carry and sum terms of the carry-save representation. But that addition still suffers from long carry propagation, and extra circuitry and time are probably needed for these conversions.

The FCS format used by Walter, C.D. and Zhengbing, H. et al. maintains all the inputs and outputs of the Montgomery modular multiplication in carry-save form, except in the final step for obtaining the result of the modular exponentiation. However, this implies that the number of operands in the modular multiplication increases, so that additional registers are required to store them. Therefore, FCS-based Montgomery multipliers possibly have a higher hardware complexity and a longer critical path than SCS-based multipliers.

A. Montgomery multiplication
Modular multiplication of two integers $A$ and $B$ simply computes $S = A \cdot B \bmod N$. Given an integer $a < N$, where $N$ is the $k$-bit modulus, $A$ is said to be its $N$-residue with respect to $r$ if $A = a \cdot r \bmod N$, where $r = 2^k$. Likewise, given an integer $b < N$, $B$ is said to be its $N$-residue with respect to $r$ if $B = b \cdot r \bmod N$.


Pipelined VLSI Architecture for RSA Based on Montgomery Modular Multiplication (GRDJE / CONFERENCE / ICIET - 2016 / 076)

The Montgomery product of $A$ and $B$ can then be defined as $S = A \cdot B \cdot r^{-1} \bmod N$, where $r^{-1}$ is the inverse of $r$ modulo $N$. The radix-2 version of Montgomery's multiplication algorithm is shown in Fig. 1.

Fig. 1: MM algorithm
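Since the figure itself is not reproduced in this text, the following is a minimal Python sketch of the standard textbook radix-2 Montgomery multiplication, not necessarily the exact listing of Fig. 1. The function name `montgomery_multiply` and its parameter names are illustrative assumptions. It assumes an odd modulus `n` smaller than `2**k` and operands smaller than `n`.

```python
def montgomery_multiply(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n for an odd modulus n < 2^k (illustrative sketch)."""
    if n % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1               # i-th bit of the multiplier, LSB first
        q = (s + a_i * b) & 1            # quotient bit: chosen so the sum below is even
        s = (s + a_i * b + q * n) >> 1   # exact division by 2, no trial division
    if s >= n:                           # single final correction, since s < 2n here
        s -= n
    return s


# 16^(-1) mod 13 = 9, and 35 * 9 mod 13 = 3
print(montgomery_multiply(5, 7, 13, 4))  # 3
```

Each iteration inspects only the least significant bit of the running sum to pick the quotient bit, which is why the division by the modulus is replaced by additions and one-bit right shifts.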

B. RSA
RSA is one of the most widely used public-key algorithms at present. The RSA encryption and decryption functions are given by $C = M^e \bmod n$ and $D = C^d \bmod n$, respectively, where $M$ is a plaintext message block, $C$ is a ciphertext block, $D$ is the decrypted block, $n$ is the $k$-bit modulus, and $e$ and $d$ are the public and private exponents, respectively. The equation $e \cdot d \equiv 1 \pmod{(p-1)(q-1)}$ must also hold, where $p$ and $q$ are two large prime numbers and $n = pq$. Thus, an RSA operation is a modular exponentiation with operands satisfying the conditions stated above. RSA requires repeated modular multiplications to compute the modular exponentiation, and the size of the modulus is generally at least 1024 bits for long-term security. This paper aims at enhancing the performance of the CSA-based SCS Montgomery multiplier while reducing its delay through pipelining. The proposed method is implemented in the RSA public-key algorithm to increase the speed and throughput of RSA cryptosystems.
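To illustrate how the RSA exponentiation reduces to repeated Montgomery multiplications, here is a small Python sketch using left-to-right square-and-multiply in the Montgomery domain. The helper names `mont_mul` and `mont_exp` and the toy key (p = 61, q = 53, e = 17, d = 2753) are assumptions chosen for illustration. A real key would be at least 1024 bits, and a hardware design would not compute the domain conversions with `%` as done here.

```python
def mont_mul(a: int, b: int, n: int, k: int) -> int:
    """Radix-2 Montgomery product a * b * 2^(-k) mod n (odd n < 2^k)."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q = (s + a_i * b) & 1
        s = (s + a_i * b + q * n) >> 1
    return s - n if s >= n else s


def mont_exp(m: int, e: int, n: int) -> int:
    """Compute m^e mod n using only Montgomery multiplications in the loop."""
    k = n.bit_length()
    r = 1 << k
    m_bar = (m * r) % n            # N-residue of the message
    x_bar = r % n                  # N-residue of 1
    for bit in bin(e)[2:]:         # scan exponent bits, most significant first
        x_bar = mont_mul(x_bar, x_bar, n, k)        # square
        if bit == "1":
            x_bar = mont_mul(x_bar, m_bar, n, k)    # multiply
    return mont_mul(x_bar, 1, n, k)  # multiply by 1 to leave the Montgomery domain


# Toy parameters (hypothetical, far too small for real use): n = 61 * 53
n, e, d = 3233, 17, 2753
cipher = mont_exp(65, e, n)
plain = mont_exp(cipher, d, n)
print(cipher == pow(65, e, n), plain == 65)
```

The number of Montgomery multiplications grows with the bit length of the exponent, so any reduction in the delay of one multiplication carries over to the whole exponentiation.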

II. EXISTING MONTGOMERY MULTIPLICATION
Several SCS- and FCS-based Montgomery multipliers have been proposed. Among them, the new SCS-based Montgomery MM has the minimum delay and achieves a high throughput compared with the other existing multipliers described in [10]. The new SCS-based Montgomery MM algorithm aims to reduce both the critical path delay and the number of clock cycles needed to complete one modular multiplication.

A. Architecture
On the basis of the critical path delay reduction, clock cycle number reduction, and quotient pre-computation described by Kuang, S.R. et al., a new SCS-based Montgomery MM algorithm (i.e., the SCS-MM-New algorithm shown in Fig. 4) using the one-level CCSA architecture shown in Fig. 2 is proposed to significantly reduce the clock cycles required for completing one MM. This CCSA architecture uses either one full adder or two half adders to perform carry-save additions. The select signal $\alpha$ decides which of the two is used; if $\alpha = 1$, the full adder is selected. The following equations [10] are used to derive the new SCS algorithm:

$q_{i+1} = (SS[i]_1 \oplus SC[i]_1) \oplus (SS[i]_0 \wedge SC[i]_0)$   (1)
$q_{i+2} = (SS[i]_2 \oplus SC[i]_2) \oplus (q_i \wedge \hat{N}_2) \oplus (SS[i]_1 \wedge SC[i]_1)$   (2)
$skip_{i+1} = \sim(A_{i+1} \vee (SS[i]_1 \oplus SC[i]_1) \vee (SS[i]_0 \wedge SC[i]_0))$   (3)


Fig. 2: Proposed CCSA circuit.
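The two kinds of carry-save addition that the CCSA selects between can be modelled on plain integers. This is a behavioural Python sketch only: the function names are hypothetical, and the way the hardware wires the select signal to the cells is not modelled. It shows that a full-adder step compresses three operands into a sum word and a carry word without propagating carries, and that repeated half-adder steps perform the carry-save to binary format conversion.

```python
def full_adder_csa(x: int, y: int, z: int) -> tuple[int, int]:
    """3-to-2 carry-save addition: one full adder per bit position."""
    s = x ^ y ^ z                              # bitwise sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1     # carries, moved one position left
    return s, c


def half_adder_csa(x: int, y: int) -> tuple[int, int]:
    """2-to-2 carry-save addition: one half adder per bit position."""
    return x ^ y, (x & y) << 1


def to_binary(s: int, c: int) -> tuple[int, int]:
    """Format conversion: apply half-adder steps until the carry word is zero."""
    steps = 0
    while c != 0:
        s, c = half_adder_csa(s, c)
        steps += 1
    return s, steps


s, c = full_adder_csa(13, 7, 9)
print(s + c == 13 + 7 + 9)      # the pair (s, c) represents the true sum
print(to_binary(s, c)[0])       # binary result after conversion
```

The invariant in both cases is that the sum word plus the carry word equals the sum of the inputs, so intermediate results can stay in carry-save form until a binary value is actually needed.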

Fig. 3: SCS-based MM-New architecture.

As shown in the SCS-MM-New algorithm, steps 1-5 for producing $\hat{B}$ and $\hat{D}$ are performed first. Note that because $q_{i+1}$ and $q_{i+2}$ must be generated in the $i$th iteration, the iterative index $i$ of the Montgomery MM starts from -1 instead of 0, and the corresponding initial values of $\hat{q}$ and $\hat{A}$ must be set to 0. Furthermore, the original for loop is replaced with a while loop in the SCS-MM-New algorithm to skip some unnecessary iterations when $skip_{i+1} = 1$. In addition, the ending number of iterations in the SCS-MM-New algorithm is changed to $k + 4$ instead of $k + 1$.

The hardware architecture of the SCS-MM-New algorithm, denoted as the SCS-MM-New multiplier, is shown in Fig. 3. It consists of one one-level CCSA architecture, two 4-to-1 multiplexers (i.e., M1 and M2), one simplified multiplier SM3, one skip detector Skip_D, one zero detector Zero_D, and six registers. Skip_D is developed to generate $skip_{i+1}$, $\hat{q}$, and $\hat{A}$ in the $i$th iteration. Both M4 and M5 are 3-bit 2-to-1 multiplexers, and they are much smaller than the $k$-bit M1, M2, and SM3. In addition, the area of Skip_D is negligible when compared with that of the $k$-bit one-level CCSA architecture. The select signals of multiplexers M1 and M2 are generated by the control part, which is not depicted for the sake of simplicity.

At the beginning of the Montgomery multiplication, the FFs storing $skip_{i+1}$, $\hat{q}$, and $\hat{A}$ are first reset to 0, as shown in step 1 of Fig. 4, so that $\hat{D} = \hat{B} + \hat{N}$ can be computed via the one-level CCSA architecture. When performing the while loop, the skip detector Skip_D shown in Fig. 5 is used to produce $skip_{i+1}$, $\hat{q}$, and $\hat{A}$. Skip_D is composed of four XOR gates, one NOR gate, and two 2-to-1 multiplexers. It first generates the $q_{i+1}$, $q_{i+2}$, and $skip_{i+1}$ signals in the $i$th iteration according to equations (1), (2), and (3), and then selects the correct $\hat{q}$ and $\hat{A}$ according to $skip_{i+1}$. At the end of the $i$th iteration, $\hat{q}$, $\hat{A}$, and $skip_{i+1}$ must be stored to FFs.


B. SCS-based MM-New algorithm

Fig. 4. SCS-MM-New algorithm.

In the next clock cycle of the $i$th iteration, SM3 outputs a proper $x$ according to $\hat{q}$ and $\hat{A}$ generated in the $i$th iteration, as shown in steps 9-12, and M1 and M2 output the correct SC and SS according to $skip_{i+1}$ generated in the $i$th iteration. If $skip_{i+1} = 0$, $SC \gg 1$ and $SS \gg 1$ are selected. Otherwise, $SC \gg 2$ and $SS \gg 2$ are selected, so that the right-shift 1-bit operations in steps 13 and 17 of the SCS-MM-New algorithm are performed together in the next cycle of iteration $i$.

Fig. 5: Skip Detector.


In addition, M4 and M5 also select and output the correct $SC[i]_{2:0}$ and $SS[i]_{2:0}$ according to $skip_{i+1}$ generated in the $i$th iteration. Note that $SC[i]_{2:0}$ and $SS[i]_{2:0}$ can also be obtained from M1 and M2, but a longer delay is required because they are 4-to-1 multiplexers. After the while loop in steps 7-24 is completed, $\hat{q}$ and $\hat{A}$ stored in FFs are reset to 0. Then, the format conversion in steps 26 and 27 can be performed by the SCS-MM-New multiplier, similar to the computation of $\hat{D} = \hat{B} + \hat{N}$ in steps 3 and 4. Finally, SS[k + 5] equals 0.
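The quotient and skip signals can be sketched at bit level as follows. This Python fragment transcribes only equations (1) and (3) as they are printed in this paper; equation (2) and the two output multiplexers of Skip_D are left out. The names `bit` and `skip_detector` are hypothetical, and the mapping onto the actual gate netlist of Fig. 5 is an assumption.

```python
def bit(x: int, i: int) -> int:
    """Return bit i of x."""
    return (x >> i) & 1


def skip_detector(ss: int, sc: int, a_next: int) -> tuple[int, int]:
    """Return (q_next, skip_next) from the low bits of SS[i], SC[i] and A_{i+1}."""
    x = bit(ss, 1) ^ bit(sc, 1)          # SS[i]_1 xor SC[i]_1
    y = bit(ss, 0) & bit(sc, 0)          # SS[i]_0 and SC[i]_0
    q_next = x ^ y                       # equation (1)
    skip_next = 1 - (a_next | x | y)     # equation (3): NOR of the three terms
    return q_next, skip_next


# When skip_next is 1, both A_{i+1} and q_{i+1} are 0, so the next iteration
# would add nothing and the operands can be shifted right by two bits at once.
print(skip_detector(0b000, 0b000, 0))
```

Only the two lowest bits of SS and SC enter these signals, which is consistent with the text's remark that the area of Skip_D is negligible next to the $k$-bit CCSA.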

III. PROPOSED ARCHITECTURE
Every path in the existing SCS-MM-New architecture (shown in Fig. 3) is analysed at the register-transfer level (RTL). Through simulation of the code or inspection of the RTL view, the path with the maximum worst-case delay is identified. Additional buffers such as registers or flip-flops are introduced in that path and synchronized to the clock to reduce the worst-case delay. This introduces pipelining and hence increases efficiency. Because the operating frequency is inversely proportional to the critical path delay, shortening that path allows a higher clock frequency. Optimization is therefore carried out on area, power, and speed. Hence the proposed multiplier shown in Fig. 6 increases the speed and reduces the delay compared with the existing SCS-MM-New multiplier.

Fig. 6: Pipelined Montgomery Multiplier
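The effect of inserting a register into the slowest path can be shown with a small numerical sketch. All delay values below are hypothetical and are not taken from the synthesis report; the function name `max_clock_mhz` and the register overhead parameter are assumptions. The model is the usual one: the clock period must cover the slowest stage plus the register overhead.

```python
def max_clock_mhz(stage_delays_ns: list[float], register_overhead_ns: float = 0.0) -> float:
    """Maximum clock frequency in MHz when the slowest stage sets the period."""
    period_ns = max(stage_delays_ns) + register_overhead_ns
    return 1000.0 / period_ns


# Hypothetical numbers: one 6.0 ns combinational path, then the same logic
# split into two stages of 3.5 ns and 2.5 ns by a pipeline register
# that costs 0.3 ns of setup and clock-to-output time.
unpipelined = max_clock_mhz([6.0])
pipelined = max_clock_mhz([3.5, 2.5], register_overhead_ns=0.3)
print(f"{unpipelined:.1f} MHz -> {pipelined:.1f} MHz")
```

The gain depends on how evenly the register splits the path: an unbalanced split leaves the longer stage as the new critical path, and the added register costs area.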

The proposed pipelined architecture is implemented in the Rivest-Shamir-Adleman (RSA) algorithm, one of the public-key cryptosystems. Modular exponentiation is the main operation performed in RSA, and it is achieved through repeated modular multiplication. Hence, the modular multiplications of the RSA exponentiation are carried out by the proposed pipelined Montgomery modular multiplier to increase the speed and throughput of RSA cryptosystems. The simulation result is shown in Fig. 7.


Fig. 7: RSA simulation result

IV. EXPERIMENTAL RESULTS
The design is coded in VHDL and synthesized and simulated using Xilinx ISE 14.2. The worst-case delay path is analysed with the help of the synthesis report, and a register is included in that path to reduce the critical path delay. It is difficult to compare the proposed multiplier directly with previous designs because they adopt different technologies. Hence the delay, area, and power of the existing and the proposed pipelined architectures are compared for two key sizes, 1024 and 2048 bits. The results are given in Table 1.

Key size   Multiplier   Delay (ns)   Area*   Power (W)
1024       Existing     5.379        33      3.237
1024       Proposed     5.136        36      3.171
2048       Existing     7.381        38      3.171
2048       Proposed     7.138        43      3.122

Table 1: Comparison of existing and pipelined Montgomery multipliers with 1024 and 2048 bit key sizes. * - number of slices occupied
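The relative changes implied by Table 1 can be worked out with a few lines of Python. The figures are copied from the table; the dictionary layout and the function names are illustrative assumptions.

```python
# (delay in ns, area in slices, power in W), copied from Table 1
TABLE_1 = {
    1024: {"existing": (5.379, 33, 3.237), "proposed": (5.136, 36, 3.171)},
    2048: {"existing": (7.381, 38, 3.171), "proposed": (7.138, 43, 3.122)},
}


def percent_change(old: float, new: float) -> float:
    """Signed change from old to new, in percent of old."""
    return 100.0 * (new - old) / old


def summarize(table: dict) -> dict:
    """Per key size, the percent change in (delay, area, power)."""
    return {
        key_size: tuple(
            percent_change(old, new)
            for old, new in zip(entry["existing"], entry["proposed"])
        )
        for key_size, entry in table.items()
    }


for key_size, (delay, area, power) in summarize(TABLE_1).items():
    print(f"{key_size}-bit: delay {delay:+.2f}%, area {area:+.2f}%, power {power:+.2f}%")
```

With these values the delay drops by roughly 4.5% (1024-bit) and 3.3% (2048-bit) and the power by about 2.0% and 1.5%, while the slice count rises by about 9% and 13%, which is the expected area cost of the added pipeline register.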

V. CONCLUSION
Among the existing designs, the SCS-MM-New multiplier has the shortest critical path delay and needs fewer clock cycles to complete one Montgomery MM, and thus spends the least execution time and achieves the highest throughput rate. This paper presented a pipelined Montgomery modular multiplier architecture that reduces the delay and power of that design. When the proposed multiplier architecture is used in present-day RSA algorithms, their computational speed increases, which is a major advantage in cryptosystems.

REFERENCES
[1] Bunimov, V., Schimmler, M. and Tolg, B. (2002). A complexity-effective version of Montgomery's algorithm. Proc. Workshop Complex Effective Designs.
[2] Kim, Y.S., Kang, W.S. and Choi, J.R. (2000). Asynchronous implementation of 1024-bit modular processor for RSA cryptosystem. Proc. 2nd IEEE Asia-Pacific Conf. ASIC, pp. 187-190.
[3] Kuang, S.R., Wang, J.P., Chang, K.C. and Hsu, H.W. (2013). Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems. IEEE Trans. VLSI Syst., Vol. 21, no. 11, pp. 1999-2009.
[4] Kuang, S.R., Wu, K.Y. and Lu, R.Y. (2015). Low-cost high-performance VLSI architecture for Montgomery modular multiplication. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. PP, Issue 99.


[5] McIvor, C., McLoone, M. and McCanny, J.V. (2004). Modified Montgomery modular multiplication and RSA exponentiation techniques. IEE Proc.-Comput. Digit. Techn., Vol. 151, no. 6, pp. 402-408.
[6] Montgomery, P.L. (1985). Modular multiplication without trial division. Math. Comput., Vol. 44, no. 170, pp. 519-521.
[7] Rivest, R.L., Shamir, A. and Adleman, L. (1978). A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, Vol. 21, no. 2, pp. 120-126.
[8] Walter, C.D. (1999). Montgomery exponentiation needs no final subtractions. Electron. Lett., Vol. 35, no. 21, pp. 1831-1832.
[9] Zhengbing, H., Al Shboul, R.M. and Shirochin, V.P. (2007). An efficient architecture of 1024-bits cryptoprocessor for RSA cryptosystem based on modified Montgomery's algorithm. Proc. 4th IEEE Int. Workshop Intell. Data Acquisition Adv. Comput. Syst., pp. 643-646.
