
Sumanth Kumar Reddy S et al. / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES Vol No. 6, Issue No. 1, 022 - 026

VLSI Implementation of AES Crypto Processor for High Throughput

R. Sakthivel
SENSE, VIT University, Vellore, India
rsakthivel@vit.ac.in

Sumanth Kumar Reddy S
SENSE, VIT University, Vellore, India
sumanthsannala@gmail.com

P Praneeth
SENSE, VIT University, Vellore, India
praneeth.mvd@gmail.com

The rest of the paper is organized as follows. Section II describes the basic AES algorithm. Section III describes the novel on-the-fly key expansion module. Section IV describes the pipeline design. Section V presents the comparison of results. Finally, Section VI concludes the paper.

Abstract—The Advanced Encryption Standard (AES) has received significant interest over the past decade due to its performance and security level. Many hardware implementations have been proposed. In most previous works, SubBytes and inverse SubBytes are implemented using the lookup table method. In this paper we use combinational logic, which allows efficient inner-round pipelining. Furthermore, composite field arithmetic helps to reduce the area. Using the proposed architecture, a fully sub-pipelined encryptor/decryptor with 3 sub-stage pipelining in each round achieves a throughput of 25.89 Gbps on a Xilinx xc5vlx110t-1 device, which is faster and 48.78% more effective than the fastest previous FPGA implementations known to date. Our ASIC implementation achieves a maximum throughput of 58.18 Gbps, which is faster than previous ASIC implementations. The AES design was implemented in Verilog HDL and synthesized with RTL Compiler using TSMC's 90 nm standard cell library; physical design was done using SOC Encounter.

Keywords—AES, pipelined AES, sub-pipelined design, ASIC, FPGA, VLSI.

II. AES ALGORITHM

The AES algorithm is a symmetric block cipher that processes data blocks of 128 bits using a cipher key of length 128, 192, or 256 bits. The algorithm is iterative: each iteration is called a round, and the total number of rounds, Nr, is 10, 12, or 14 when the key length is 128, 192, or 256 bits, respectively. Table I shows the number of rounds as a function of key length.

TABLE I. Different AES specifications

                          AES-128   AES-192   AES-256
Key length Nk (words)        4         6         8
Block size Nb (words)        4         4         4
Number of rounds (Nr)       10        12        14

The 128-bit data block is divided into 16 bytes. These bytes are mapped to a 4x4 array called the State, and the State undergoes all the internal operations of the AES algorithm. Every byte in the State is denoted by $S_{i,j}$ ($0 \le i, j < 4$) and is considered an element of $GF(2^8)$. Although different irreducible polynomials can be used to construct $GF(2^8)$, the irreducible polynomial used in the AES algorithm is $p(x) = x^8 + x^4 + x^3 + x + 1$. Block diagrams of the AES encryption and the equivalent decryption structures are shown in Fig. 1.
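The mapping of a 16-byte block onto the State can be sketched as follows. This is a minimal illustration, not the hardware data path of the proposed design. The column-by-column fill order is taken from the AES standard (FIPS-197) and is an assumption here, since the text above does not state the order; the function names are hypothetical.

```python
def bytes_to_state(block: bytes) -> list[list[int]]:
    """Map 16 input bytes to a 4x4 State, filling column by column."""
    if len(block) != 16:
        raise ValueError("AES operates on 16-byte (128-bit) blocks")
    # Assumed FIPS-197 ordering: state[i][j] holds input byte i + 4*j
    # (row i, column j).
    return [[block[i + 4 * j] for j in range(4)] for i in range(4)]


def state_to_bytes(state: list[list[int]]) -> bytes:
    """Read the State back out column by column."""
    return bytes(state[i][j] for j in range(4) for i in range(4))
```

With this layout, each row of the State collects every fourth input byte, which is why ShiftRows and MixColumns act on different groupings of the same 16 bytes.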

I. INTRODUCTION

The large and growing number of internet and wireless communication users has led to an increasing demand for security measures and devices that protect user data transmitted over open channels. Two types of cryptographic systems are mainly used for security purposes: symmetric-key and asymmetric-key cryptosystems. Symmetric-key cryptography (DES, 3DES, and AES) uses the same key for both encryption and decryption. Asymmetric-key cryptography (RSA and elliptic curve cryptography) uses different keys for encryption and decryption. The major disadvantage of DES is its short key length. In November 2001, the National Institute of Standards and Technology (NIST) of the United States chose the Rijndael algorithm as the Advanced Encryption Standard (AES) to replace previous algorithms such as DES.

After an initial round key addition, a round function consisting of four transformations (SubBytes, ShiftRows, MixColumns, and AddRoundKey) is applied to the data block in the encryption procedure. The decryption procedure applies the inverse transformations in reverse order. The last round of encryption contains only SubBytes, ShiftRows, and AddRoundKey; the last round of decryption contains only InvSubBytes, InvShiftRows, and AddRoundKey. The four transformations of the round function are examined and designed for efficient implementation.

AES encryption is considered efficient for both hardware and software implementations. Compared to software, hardware implementation is more reliable. Some works have been presented on hardware implementations of the AES algorithm using ASIC [6], [7], [8] and FPGA [9], [10].


A. SubBytes/InvSubBytes transformations

The SubBytes transformation is a nonlinear byte substitution. It can be implemented in two ways: with lookup tables (LUT) or with combinational logic.


Figure 3. ShiftRows transformation

C. MixColumns/InvMixColumns transformation

The MixColumns() transformation operates on the State column-by-column, treating each column as a four-term polynomial. The columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $x^4 + 1$ with a fixed polynomial $a(x)$, given by $a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$.


The function xtime() represents multiplication by {02} modulo the irreducible polynomial $m(x) = x^8 + x^4 + x^3 + x + 1$. The implementation of xtime() consists of a left shift and a conditional XOR with {1B}. Fig. 4 shows the MixColumns module. In matrix form, the MixColumns transformation can be expressed as

$$
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix},
\qquad 0 \le c < 4.
$$

Figure 1. AES encryption and decryption algorithm: (a) encryption, (b) decryption
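The xtime() function and the MixColumns matrix product described above can be sketched in software as follows. This is an illustrative model only, assuming byte values are held as Python integers; it is not the Verilog module of Fig. 4, and the helper names mul3 and mix_single_column are hypothetical.

```python
def xtime(b: int) -> int:
    """Multiply a GF(2^8) element by {02}: shift left, then reduce."""
    b <<= 1
    if b & 0x100:      # a bit overflowed past bit 7
        b ^= 0x11B     # subtract m(x); its low byte is the {1B} of the text
    return b & 0xFF


def mul3(b: int) -> int:
    """Multiply by {03} = {02} + {01}."""
    return xtime(b) ^ b


def mix_single_column(col: list[int]) -> list[int]:
    """Apply the MixColumns matrix to one State column [s0, s1, s2, s3]."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,   # row 02 03 01 01
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,   # row 01 02 03 01
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),   # row 01 01 02 03
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),   # row 03 01 01 02
    ]
```

Because every coefficient is {01}, {02}, or {03}, the whole transformation reduces to shifts and XORs, which is what makes it inexpensive in combinational logic.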

In the LUT-based approach, the unbreakable delay of the lookup tables is greater than that of the other logic. With the LUT method it is difficult to build a sub-pipelined structure with two pipeline stages, which prevents further speedup. An alternative is combinational logic, which is faster than the LUT and can also be divided into two pipeline stages, allowing further speedup. In the non-LUT method, SubBytes is implemented by finding the multiplicative inverse and then applying the affine transform. Similarly, InvSubBytes is implemented by applying the inverse affine transform and then finding the multiplicative inverse. The multiplicative inverse is common to both, so a single structure can implement both SubBytes and InvSubBytes, as shown in Fig. 2.
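The non-LUT construction described above (inverse followed by affine transform, and inverse affine followed by inverse) can be checked with a short software model. This is a sketch under stated assumptions: the inverse is computed directly in $GF(2^8)$ by exponentiation rather than with the composite field arithmetic used in the proposed hardware, and the affine constants are those of the AES standard. All function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; zero maps to zero by convention."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def inv_affine(y: int) -> int:
    return rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05


def sub_byte(x: int) -> int:
    """SubBytes: multiplicative inverse, then affine transform."""
    return affine(gf_inv(x))


def inv_sub_byte(y: int) -> int:
    """InvSubBytes: inverse affine transform, then multiplicative inverse."""
    return gf_inv(inv_affine(y))
```

Both sub_byte and inv_sub_byte call the same gf_inv, which mirrors the sharing of the inverter between the two directions in the combined structure.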

Figure 4. MixColumns module

Figure 2. SubBytes/InvSubBytes implementation

InvMixColumns multiplies the polynomial formed by each column of the State by $a^{-1}(x)$ modulo $x^4 + 1$, where $a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}$. In matrix form, InvMixColumns multiplies each column by the circulant matrix whose first row is {0e} {0b} {0d} {09}.

B. ShiftRows/InvShiftRows

ShiftRows is a simple shifting transformation. The first row of the State is left unchanged, while the second, third, and fourth rows are cyclically shifted to the left by one, two, and three bytes, respectively. In InvShiftRows, the first row of the State does not change, while the other rows are cyclically shifted to the right by the same offsets as in ShiftRows.
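The two row shifts can be illustrated with a minimal sketch, assuming the State is held as a list of four rows of four bytes each (a hypothetical software layout; in hardware the operation is pure rewiring).

```python
def shift_rows(state: list[list[int]]) -> list[list[int]]:
    """Rotate row i of the State left by i bytes (row 0 is unchanged)."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]


def inv_shift_rows(state: list[list[int]]) -> list[list[int]]:
    """Rotate row i of the State right by i bytes, undoing shift_rows."""
    return [row[4 - i:] + row[:4 - i] for i, row in enumerate(state)]
```

Applying inv_shift_rows to the result of shift_rows returns the original State, since a left rotation by i and a right rotation by i cancel.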

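Written out in full, the InvMixColumns matrix equation is

$$
\begin{bmatrix} S'_{0,c} \\ S'_{1,c} \\ S'_{2,c} \\ S'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{bmatrix}
\begin{bmatrix} S_{0,c} \\ S_{1,c} \\ S_{2,c} \\ S_{3,c} \end{bmatrix},
\qquad 0 \le c < 4.
$$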
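That the InvMixColumns matrix undoes MixColumns can be checked numerically. The sketch below uses a generic $GF(2^8)$ multiplier rather than the xtime-based structure of the hardware; the names gf_mul, MIX, INV_MIX, and apply_matrix are hypothetical and chosen for illustration.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


MIX = [
    [0x02, 0x03, 0x01, 0x01],
    [0x01, 0x02, 0x03, 0x01],
    [0x01, 0x01, 0x02, 0x03],
    [0x03, 0x01, 0x01, 0x02],
]

INV_MIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def apply_matrix(matrix: list[list[int]], col: list[int]) -> list[int]:
    """Multiply a 4x4 coefficient matrix by one State column over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)   # addition in GF(2^8) is XOR
        out.append(acc)
    return out
```

The inverse coefficients {09}, {0b}, {0d}, and {0e} need more xtime stages per byte than the forward coefficients, which is one reason the decryption data path tends to be the slower one in combinational implementations.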


D. AddRoundKey

AddRoundKey involves only a bitwise XOR operation. In each round, the output of MixColumns (or of ShiftRows in the last round, which has no MixColumns) is XORed with the round key.

The decryption structure can be derived by inverting the encryption structure. However, the sequence of transformations then differs from that in encryption, which prevents resource sharing between the encryptor and the decryptor. The equivalent decryption structure is shown in Fig. 1(b).

III. KEY EXPANSION

In the AES algorithm, the key expansion module generates the round keys for every round. There are two approaches to providing round keys: pre-computing and storing all of them, or producing them on-the-fly. The first approach consumes more area. In the second approach, the initial key is divided into Nk words (key[0], key[1], ..., key[Nk-1]), which are used as the initial words. The remaining words are generated iteratively from these initial words. Nk is 4, 6, or 8 when the key length is 128, 192, or 256 bits, respectively. Each round key has 128 bits and is formed by concatenating four words: Roundkey(i) = {w[4i], w[4i+1], w[4i+2], w[4i+3]}.

Figure 5. Data path for key generator

The key expansion procedure can be described by the pseudocode listed below. The else-if branch applies only to 256-bit keys (Nk = 8), as specified in the AES standard.

    for i = 0 to Nk-1
        w[i] = key[i]
    end
    for i = Nk to 4(Nr + 1)-1
        temp = w[i-1]
        if (i mod Nk = 0)
            temp = SubWord(RotWord(w[i-1])) XOR Rcon(i/Nk)
        else if (Nk > 6 and i mod Nk = 4)
            temp = SubWord(w[i-1])
        end if
        w[i] = w[i-Nk] XOR temp
    end

IV. PIPELINING AND SUBPIPELINING

To speed up the AES algorithm, three architectural optimization techniques can be used: pipelining, sub-pipelining, and loop unrolling. The pipelined AES encryption design is shown in Fig. 6. Pipeline registers are inserted between every round to increase the throughput.

Figure 6. AES encryption with pipelining

Like pipelining, sub-pipelining inserts registers into combinational logic, but the registers are placed both between and inside the rounds. With pipelining and sub-pipelining, multiple blocks of data can be processed simultaneously. Among these architectural optimizations, sub-pipelining gives the maximum speed and the best throughput/area ratio. Fig. 7 shows the sub-pipelined architecture with r sub-stages. Each round unit is divided into r sub-stages with equal delays.

With the LUT method, sub-pipelining is limited to two sub-stages, whereas combinational logic can be divided into more sub-stages with equal delays. In these pipelined and sub-pipelined architectures, a plaintext block is received at each clock cycle through the input register. The number of clock cycles needed to complete a single round depends on the number of sub-stages. Round keys are generated by the key expansion module and supplied to each round. At each clock cycle the data shifts to the next stage, and the first output appears only at the end of the ((10*r)+10)th clock cycle, where r is the number of sub-pipeline stages. The advantage of this structure is that the second output is available in the very next clock cycle after the first. Internally, each round contains SubBytes, ShiftRows, MixColumns, and AddRoundKey, which are explained in the previous sections.
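The latency expression can be worked through with a small sketch. It takes the expression (10*r)+10 at face value and assumes, as stated above, that one further block completes on every clock cycle after the first; the function names are hypothetical.

```python
ROUNDS = 10  # round units in the 128-bit design


def first_output_cycle(r: int) -> int:
    """Clock cycle at which the first ciphertext block appears."""
    return ROUNDS * r + ROUNDS


def cycles_for_blocks(r: int, n_blocks: int) -> int:
    """Cycles to finish n_blocks when one new block enters per clock."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    # After the pipeline fills, each extra block costs one more cycle.
    return first_output_cycle(r) + (n_blocks - 1)
```

For r = 3 the first output appears at cycle 40, which agrees with the 40-cycle initial latency quoted for the sub-pipelined design. Setting r = 0 in the same expression gives 10, the latency quoted for the design with registers only between rounds. For long input streams the fill latency is amortized, so the cost approaches one clock cycle per block.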

V. RESULTS COMPARISON

The AES architecture was implemented in Verilog HDL and simulated using Cadence ncsim. Two designs were implemented. AES(LUT) is a pipelined implementation using the lookup table method, with an initial latency of 10 clock cycles. AES(SP) is a fully sub-pipelined implementation using the non-LUT method, with 3 sub-stages in each round and an initial latency of 40 clock cycles. Compared with the LUT implementation, the non-LUT implementation occupies less area. The FPGA implementation targets the Xilinx XC5VLX110T-1, and the corresponding results are tabulated in Table 2. The fully sub-pipelined 128-bit architecture with 10 round units was synthesized in RTL Compiler using TSMC's 90 nm standard cells, and the corresponding results are tabulated in Table 3. This fully sub-pipelined design achieves a throughput of 58.18 Gbps, which is higher than that of previous ASIC implementations. The back end of the design was done in SOC Encounter, and the final chip layout is shown in Fig. 8.
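The reported throughput figures are consistent with a fully pipelined design that accepts one 128-bit block per clock cycle. The sketch below assumes that relation (throughput = 128 bits x Fmax), which is not stated explicitly in the text, and uses the Fmax and slice values listed in Tables 2 and 3; the function names are hypothetical.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_gbps(fmax_mhz: float) -> float:
    """Throughput in Gbps, assuming one block completes per clock cycle."""
    return BLOCK_BITS * fmax_mhz / 1000.0


def mbps_per_slice(fmax_mhz: float, slices: int) -> float:
    """Throughput per FPGA slice in Mbps/slice (no BRAMs counted)."""
    return BLOCK_BITS * fmax_mhz / slices
```

With the sub-pipelined FPGA design at 202.26 MHz on 8896 slices this gives about 25.89 Gbps and 2.91 Mbps/slice, and with the ASIC design at 454.5 MHz it gives about 58.18 Gbps, matching the tabulated values.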

Figure 7. Sub pipelining architecture

TABLE 2. FPGA comparison results

Design                  Device         Fmax (MHz)   Throughput (Gbps)   Slices   BRAMs   Mbps/slice
Elbirt et al.*          Xcv1000-4        31.8            1.938          10992      -       0.176
McLoone et al.*         Xcv812e-8        93.9           12.02            2000     244      0.362
Jarvinen*               Xcv1000e-8      129.2           16.5            11719      -       1.4
Saggese*                Xcv2000e-8      158             20.3             5810     100      1.09
Standaert*              Xcv3200e-8      145             18.5            15112      -       1.28
Parhi (r = 3)*          Xcv812e-8        93.5           11.965           9406      -       1.272
Parhi (r = 7)*          Xcv1000e-8      168.4           21.556          11022      -       1.956
Ours (LUT-pipelined)    Xc5vlx110t-1    103.4           13.238           4611      -       1.077
Ours (sub-pipelined)    Xc5vlx110t-1    202.26          25.89            8896      -       2.91

*Results are estimated from [12].
-: no value listed.

TABLE 3. Synthesis results (ASIC)

Design                AES(LUT)    AES(SP)    (unlabeled)
Technology            90 nm       90 nm      180 nm
Area (µm²)            740870      564036     2258469
Power (mW)            136.995     147.78     655.5
Critical path (ns)    3.9         2.2        4.2
Fmax (MHz)            256.4       454.5      238
Throughput (Gbps)     32.82       58.18      30.47

VI. CONCLUSION

In this paper, we presented a hardware implementation of an efficient pipelined AES architecture that includes both encryption and decryption. The sub-pipelined architecture yields higher throughput than earlier implementations. The design was modeled in Verilog HDL and simulated with Cadence NCsim. Synthesis was done with RTL Compiler v9.10, and physical design with SOC Encounter. With the proposed sub-pipelined architecture, the throughput reaches 58.18 Gbps.


REFERENCES

[1] J. Daemen and V. Rijmen, "AES Proposal: Rijndael, AES algorithm submission," September 3, 1999, available: http://www.nist.gov/CryptoToolkit.
[2] "Draft FIPS for the AES," available from: http://csrc.nist.gov/encryption.aes, February 2001.
[3] E. J. Swankoski, R. R. Brooks, V. Narayanan, M. Kandemir, and M. J. Irwin, "A parallel architecture for secure FPGA symmetric encryption," in Proc. 18th Int. Parallel Distrib. Process. Symp., Santa Fe, NM, Apr. 2004, p. 132.
[4] A. Hodjat and I. Verbauwhede, "Minimum area cost for a 30 to 70 Gb/s AES processor," in Proc. IEEE Comput. Soc. Annu. Symp., Lafayette, LA, Feb. 2004, pp. 83–88.
[5] C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, "A high-throughput low-cost AES processor," IEEE Commun. Mag., vol. 41, no. 12, pp. 86–91, Dec. 2003.
[6] I. Verbauwhede, P. Schaumont, and H. Kuo, "Design and Performance Testing of a 2.29-GB/s Rijndael Processor," IEEE Journal of Solid State Circuits, vol. 38, no. 3, pp. 569–572, March 2003.
[7] T. Ichikawa, T. Kasuya, and M. Matsui, "Hardware Evaluation of the AES Finalists," in Proc. 3rd AES Candidate Conference, New York, April 2000, pp. 279–285.
[8] L. Deng and H. Chen, "A new VLSI implementation of the AES algorithm," in Communications, Circuits and Systems and West Sino Expositions, IEEE 2002 International Conference on, June 2002, pp. 1500–1504.
[9] N. Sklavos and O. Koufopavlou, "Architectures and VLSI implementations of the AES-proposal Rijndael," IEEE Transactions on Computers, vol. 51, no. 12, pp. 1454–1459, 2002.
[10] J. H. Shim, D. W. Kim, Y. K. Kang, T. W. Kwon, and J. R. Choi, "A Rijndael cryptoprocessor using shared on-the-fly key scheduler," in Proc. 3rd IEEE Asia-Pacific Conf. ASIC, Taipei, Taiwan, Aug. 2002, pp. 89–92.
[11] P. Chodowiec, P. Khuon, and K. Gaj, "Fast Implementations of Secret-Key Block Ciphers Using Mixed Inner- and Outer-Round Pipelining," in Proc. ACM/SIGDA Int. Symposium on Field Programmable Gate Arrays, FPGA'01, Monterey, CA, Feb. 2001.
[12] X. Zhang and K. Parhi, "High-speed VLSI architectures for the AES algorithm," IEEE Transactions on VLSI Systems, vol. 12, Sep. 2004.
[13] N. Sklavos and O. Koufopavlou, "Architectures and VLSI Implementations of the AES-Proposal Rijndael," IEEE Trans. on Computers, vol. 51, no. 12, pp. 1454–1459, 2002.
[14] R. Karri, K. Wu, P. Mishra, and Y. Kim, "Concurrent Error Detection Schemes for Fault-Based Side-Channel Cryptanalysis of Symmetric Block Ciphers," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 21, no. 12, Dec. 2002.
[15] C.-H. Yen, T.-Y. Pai, and B.-F. Wu, "The implementations of the reconfigurable Rijndael algorithm with throughput of 4.9 Gbps," in Proc. 16th VLSI Des./CAD Symp., Hualien, Taiwan, Aug. 2005.
[16] M. Alam, W. Badawy, and G. Jullien, "A novel pipelined threads architecture for AES encryption algorithm," in Proc. IEEE Int. Conf. Appl.-Specific Syst., Architectures, Process., San Jose, CA, Jul. 2002, pp. 296–302.

Figure 8. Final chip layout
