Sateesh Reddy* et al. / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES Vol No. 7, Issue No. 2, 271 - 275
Unified Reconfigurable Floating-Point Pipelined Architecture

Sateesh Reddy
Senior Member Technical Staff, Poseidon Design Systems, Bangalore, India
jsateeshreddy@gmail.com

Vinit T Kanojia
Department of Electronics & Communication, RV College of Engineering, Bangalore, India
vinitkanojia@gmail.com

ISSN: 2230-7818
@ 2011 http://www.ijaest.iserp.org. All rights Reserved.

Abstract: In this paper, a reconfigurable, pipelined floating-point hardware architecture is designed by exploiting the similarities among individual floating-point operations. The architecture handles floating-point addition, subtraction, multiplication and comparison in a pipelined manner, which improves area and latency. The individual components and reconfiguration techniques are explored at every stage. The proposed architecture is designed in Verilog HDL, verified through behavioural, post-translate, post-map and post-place-and-route simulation, and implemented on a Xilinx field-programmable gate array (FPGA).

Keywords: Floating-point, reconfigurable, pipelining

I. INTRODUCTION

Because of their high precision, wide dynamic range and simple operating rules, floating-point operations are used extensively in fields that require high precision. In modern computers, floating-point arithmetic is mainly performed by coprocessors. In systems without floating-point hardware, the CPU emulates it with a series of simpler fixed-point operations that run on the integer arithmetic and logic unit. This saves the cost of a floating-point unit (FPU) but is significantly slower. Coprocessors cannot fetch instructions from main memory, perform I/O or manage memory. They require the host processor to fetch the coprocessor instructions from memory and to handle all operations other than the coprocessor functions. High processor speeds therefore demand high coprocessor speeds.

High-Level Synthesis (HLS) is an emerging technology that synthesizes algorithms written in high-level languages (ANSI C, MATLAB, etc.) into efficient hardware (RTL). An HLS compiler analyses high-level language features such as arithmetic resources, loops and branches, and maps them onto optimized hardware for data-path and control elements. Arithmetic resources operate on both integer and real data, and most communication and DSP algorithms use one or the other. These algorithms are easily expressed in high-level languages, which offer a rich set of data types. Converting them into optimum hardware requires mapping each operator onto an efficient hardware resource. Because arithmetic on real operands is more complex than on integer operands, designing optimum hardware resources for real arithmetic is challenging. Integer arithmetic resources generally consume negligible hardware and can be used freely in the HLS process.

Real arithmetic can be implemented using either fixed-point or floating-point methods. Fixed-point arithmetic gives better hardware performance, but the precision of operations and the range of representable numbers are limited. Floating-point arithmetic offers high precision and a wide dynamic range at the expense of hardware and latency. Floating-point arithmetic resources alone consume 50 to 70 percent of the total hardware.

FPGAs are increasingly being used to design high-end, computationally intensive microprocessors capable of handling both fixed-point and floating-point operations. Floating-point representation covers a wider range of real numbers than fixed-point representation but, owing to its complexity, floating-point units implemented on an FPGA consume a large amount of resources. This makes FPGAs less attractive for floating-point applications. The problem can be addressed by embedding FPUs in FPGAs; however, if left unused, embedded FPUs waste space on the FPGA die. To overcome this, a flexible multimode embedded FPU that can be configured to perform a wide range of operations is proposed in [1]. Although that architecture achieved an area improvement of 3.8 and a delay improvement of 4.2, it did not improve latency. Units that perform integer arithmetic and logical operations along with a floating-point operation are also available; for example, a unit that combines an integer ALU with floating-point addition is reported in [2]. This paper deals with a unit that can be reconfigured to perform different floating-point operations.

The resource requirement of a floating-point arithmetic or logical operation is high because each operation is divided into three stages, namely pre-normalization, arithmetic operation and post-normalization, and each stage in turn involves a number of arithmetic and logical operations. Floating-point arithmetic is complex because each operand is first divided into three parts, namely the sign, the exponent and the fraction, which are then operated on separately in the above stages. Operations on integer operands, in contrast, require neither grouping nor normalization and are therefore less complex.

There are many similarities in the resources used by the architectures for floating-point addition, subtraction, multiplication and comparison [3-4]. For example, concatenation of the hidden bits with the fraction bits, addition or subtraction of exponents in the pre-normalization stage, and shifting in the post-normalization stage are common operations. In this paper, these similarities are exploited to implement a unified floating-point hardware architecture that reduces the amount of hardware (number of LUTs). In other words, the unified architecture uses less hardware than the sum of the hardware required by the individual floating-point units. The architecture can be reconfigured to perform floating-point addition, subtraction, multiplication or comparison in a pipelined manner.

Any floating-point unit requires a large number of arithmetic and logical components, and the same holds for the unified floating-point architecture. To ensure that the most suitable architecture is used for each component, the architectures of all individual arithmetic and logic units are explored and their performance in terms of area and frequency is measured. The explorations include the Kogge-Stone adder, carry-skip adder, carry-ripple adder, carry-look-ahead adder, Wallace-tree multiplier, serial multiplier, parallel multiplier, Booth's multiplier, array multiplier, barrel shifter, zero-detector using XOR gates and cascaded comparator [6-10].

The Unified Reconfigurable Floating-Point Architecture is found to improve latency by a factor of 3 and area (number of LUTs) by a factor of 1.5 compared to the summed area of the individual floating-point units.

II. FLOATING-POINT ARITHMETIC ALGORITHMS

This section describes the functionality of the floating-point adder, subtractor, multiplier and comparator at each stage.

A. Pre-normalization
1) FP Adder: The input operands are compared and the difference of their exponents is determined. The hidden bits of the input operands are found and concatenated with the respective fractions. The fraction of the smaller operand is shifted right by the difference of the exponents so that both operands share the same exponent. Guard, round and sticky bits are found in parallel with the shift operation. The sign bit of the result is found.
2) FP Subtractor: The input operands are compared and the difference of their exponents is determined. The hidden bits of the input operands are found and concatenated with the respective fractions. The fraction of the smaller operand is shifted right by the difference of the exponents so that both operands share the same exponent. Guard, round and sticky bits are found in parallel with the shift operation. The sign bit of the result is found.
3) FP Multiplier: The hidden bits are determined and concatenated with the respective fractions. Zeros are appended to the exponents for the subsequent arithmetic operations. The bias of 127 is subtracted from the sum of the exponents. The sign bit of the result is found.
4) FP Comparison: The magnitudes of the input operands are compared. Depending on the output of the magnitude comparator and the sign bits of the input operands, it is determined whether one operand is greater than, less than or equal to the other.

B. Core-arithmetic
1) FP Adder: The fraction of the larger operand and the shifted fraction are added, resulting in a 24-bit fraction. The resultant fraction bits are rounded using the guard, round and sticky bits calculated in the pre-normalization stage.
2) FP Subtractor: The shifted fraction is subtracted from the fraction of the larger operand. The number of leading zeros is found in parallel. The guard, round and sticky bits found in the pre-normalization stage are used for rounding.
3) FP Multiplier: The 24-bit fractions calculated in the pre-normalization stage are multiplied using a binary multiplier, resulting in a 48-bit product.

C. Post-Normalization
1) FP Adder: If a carry bit is generated during the addition, the fraction is right-shifted by one bit and the exponent is incremented by one; otherwise, if there are leading zeros, the fraction is left-shifted by the number of leading zeros and the exponent is decremented by the same amount. The resultant fraction and exponent, together with the input fractions and exponents, are used to check for exceptions such as not-a-number, infinity and overflow.
2) FP Subtractor: Depending on the number of leading zeros, the fraction is left-shifted and the resultant exponent is decremented by the number of left shifts. The resultant fraction and exponent, together with the input fractions and exponents, are used to check for exceptions such as not-a-number, infinity and overflow.
3) FP Multiplier: If the MSB of the resultant fraction is logic-1, the fraction is right-shifted by one bit; otherwise, if the 46th bit is logic-0, the fraction is left-shifted by the number of leading zeros. The exponent is incremented by the number of right shifts or decremented by the number of left shifts. The fraction is rounded to 23 bits. The resultant fraction and exponent, together with the input fractions and exponents, are used to check for exceptions such as not-a-number, infinity and overflow.
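As a rough illustration of the three multiplier stages described in Section II, the following Python sketch models the data flow in software. It is not the authors' Verilog design. The helper names (f32_bits, bits_f32, fp_mul) are hypothetical. The sketch assumes normal (non-denormal) operands, so the left-shift case does not arise; it assumes round-to-nearest-even, which the paper does not specify; and it omits exception handling.

```python
import struct

def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b):
    """Return the float value of a 32-bit IEEE-754 bit pattern."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def fp_mul(a_bits, b_bits):
    """Software model of the multiplier stages (normal operands only)."""
    # Pre-normalization: sign, hidden-bit concatenation, exponent sum minus bias.
    sign = (a_bits >> 31) ^ (b_bits >> 31)
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    fa = (1 << 23) | (a_bits & 0x7FFFFF)  # hidden bit assumed to be 1
    fb = (1 << 23) | (b_bits & 0x7FFFFF)
    exp = ea + eb - 127

    # Core arithmetic: 24-bit x 24-bit gives a 48-bit product.
    prod = fa * fb

    # Post-normalization: if bit 47 (the MSB) is set, shift right by one more bit.
    if prod >> 47:
        exp += 1
        shift = 24
    else:
        shift = 23
    frac = prod >> shift                 # 24 bits including the hidden bit
    rem = prod & ((1 << shift) - 1)      # discarded low-order bits
    half = 1 << (shift - 1)

    # Rounding (assumption: round to nearest, ties to even).
    if rem > half or (rem == half and (frac & 1)):
        frac += 1
        if frac >> 24:                   # rounding carried out of 24 bits
            frac >>= 1
            exp += 1

    # Pack sign, exponent and 23 fraction bits (no overflow/NaN checks here).
    return (sign << 31) | ((exp & 0xFF) << 23) | (frac & 0x7FFFFF)
```

For example, fp_mul(f32_bits(1.5), f32_bits(2.5)) returns the bit pattern of 3.75.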
III. SIMILARITIES EXPLORED

The algorithms discussed in the previous section show that many arithmetic and logical operations are common to floating-point addition, subtraction, multiplication and comparison at every stage. This section lists the common arithmetic and logical operations and units at each stage.

A. Pre-normalization:
- Exponent addition/subtraction
- Determination of hidden bits
- Concatenation of hidden bits with the respective fractions
- Right-shifting the fraction of the smaller operand (in case of FP addition and subtraction)
- Determination of the sign bit

B. Core Arithmetic:
- 24-bit reconfigurable adder/subtractor
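The paper does not describe how the 24-bit reconfigurable adder/subtractor is built. One common way to share a single adder between both operations is to invert the second operand and set the carry-in when subtracting (two's complement). The Python sketch below models that idea; the function name add_sub and the mode flag sub are hypothetical.

```python
WIDTH = 24
MASK = (1 << WIDTH) - 1

def add_sub(a, b, sub):
    """Model a shared 24-bit adder: a + b when sub is 0, a - b when sub is 1.

    Returns (result, carry_out). For subtraction, carry_out == 1 means
    no borrow occurred, i.e. a >= b.
    """
    # Conditionally invert b (an XOR with the mode bit on every bit line).
    b_eff = (b ^ (MASK if sub else 0)) & MASK
    # The mode bit doubles as carry-in, completing the two's complement of b.
    total = (a & MASK) + b_eff + (1 if sub else 0)
    return total & MASK, (total >> WIDTH) & 1
```

With aligned fractions 0xC00000 and 0x400000, add_sub(0xC00000, 0x400000, 1) gives (0x800000, 1).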
C. Post-normalization:
- Determination of leading zeros
- Shifting the resultant fraction (left/right)
- Incrementing/decrementing the exponent based on the shift operation
- Exception handling

IV. PROPOSED ARCHITECTURE

The inputs are in IEEE-754 single-precision format [5]. The unified reconfigurable floating-point architecture is designed using Verilog HDL and implemented on a Virtex-4 FPGA, keeping in view the area and frequency of operation. The proposed architecture has a latency of 3 clock cycles and is shown in Figure 1.
A. Pre-Normalization
In this stage, the hidden bit is found and concatenated with the fraction bits. A comparator (used for floating-point addition/subtraction) compares the 31-bit inputs (sign bit excluded); its output decides which fraction is shifted right (to equate the exponents) and also selects the output exponent. The floating-point multiplier uses two adder modules to add the exponents and subtract the bias (127), whereas the floating-point adder/subtractor uses one adder/subtractor module to find the difference of the exponents. The shifter shifts the fraction of the smaller number by the difference of the exponents. The sign bit of the result is calculated from the sign bits of the inputs, the requested operation and the comparator output.
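The alignment step for addition and subtraction can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the authors' hardware. The names f32_bits, unpack and prenormalize_add are hypothetical. Operands are assumed to be normal numbers. The guard and round bits are taken as the first two bits shifted out, and the sticky bit as the OR of all remaining shifted-out bits.

```python
import struct

def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def unpack(bits):
    """Split a 32-bit pattern into (sign, exponent, 24-bit fraction with hidden bit)."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    hidden = 1 if exp != 0 else 0        # hidden bit is 0 only for a zero exponent
    return sign, exp, (hidden << 23) | (bits & 0x7FFFFF)

def prenormalize_add(a_bits, b_bits):
    """Return (LE, LF, aligned SF, guard, round, sticky) for add/subtract."""
    # 31-bit magnitude comparison (sign bit masked off).
    if (a_bits & 0x7FFFFFFF) >= (b_bits & 0x7FFFFFFF):
        large, small = a_bits, b_bits
    else:
        large, small = b_bits, a_bits
    _, le, lf = unpack(large)
    _, se, sf = unpack(small)
    diff = le - se                        # exponent difference = shift amount

    ext = sf << 2                         # make room for guard and round bits
    shifted = ext >> diff
    sticky = 1 if ext & ((1 << diff) - 1) else 0  # OR of bits shifted past round
    guard = (shifted >> 1) & 1
    rnd = shifted & 1
    return le, lf, shifted >> 2, guard, rnd, sticky
```

For inputs 1.0 and 0.75 the model returns LE = 127, LF = 0x800000 and an aligned SF of 0x600000 with all three rounding bits clear.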
Figure 1. Unified Reconfigurable Floating-Point Architecture

Table 1. Signal description of the Unified Architecture
Signal   Number of Bits   Description
A, B     32               Single-precision inputs
LF       24               Larger fraction
SF       24               Smaller fraction
LE       10               Larger exponent
ID       2                00-Addition; 01-Subtraction; 10-Multiplication; 11-Comparison
M1       48               Required arithmetic output
M2       10               Required exponent
Ae       1                A=B
Ag       1                A>B
Al       1                A<B
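To make the comparison outputs in Table 1 concrete, the sketch below derives Ae, Ag and Al from the sign bits and a magnitude comparison, as the comparison algorithm in Section II describes. The function names are hypothetical. Treating +0 and -0 as equal is an assumption taken from IEEE-754 rather than from the paper, and NaN inputs are not handled.

```python
import struct

def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def fp_compare(a_bits, b_bits):
    """Return the flags (Ae, Ag, Al) for A=B, A>B and A<B."""
    sa, sb = a_bits >> 31, b_bits >> 31
    ma, mb = a_bits & 0x7FFFFFFF, b_bits & 0x7FFFFFFF  # 31-bit magnitudes

    if ma == 0 and mb == 0:          # assumption: +0 and -0 compare equal
        return 1, 0, 0
    if sa != sb:                     # signs differ: the non-negative one is greater
        return (0, 1, 0) if sa == 0 else (0, 0, 1)
    if ma == mb:
        return 1, 0, 0
    # Same sign: for negative operands the larger magnitude is the smaller value.
    a_greater = (ma > mb) if sa == 0 else (ma < mb)
    return (0, 1, 0) if a_greater else (0, 0, 1)
```

For example, fp_compare(f32_bits(-2.5), f32_bits(-1.0)) yields (0, 0, 1), since -2.5 is less than -1.0.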
B. Core-Arithmetic
This stage has the maximum combinational path delay because of the binary multiplication and addition/subtraction operations. Depending on the operation requested by the user, only one arithmetic module is active at a time while the other modules remain idle.

C. Post-Normalization
A leading-zero detector (active only for floating-point subtraction and multiplication) detects the number of leading zeros, which specifies the number of left shifts required for post-normalization. The output of the arithmetic unit is shifted to normalize the result (the number of shifts is determined by the leading-zero detector), followed by rounding. An adder/subtractor module adjusts the output exponent according to the number of shifts. The exponents and fractions are passed to the exception module, which checks for exceptions such as infinity, not-a-number and overflow. The sign bit, exponent bits and fraction bits are concatenated to obtain the single-precision result.
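A minimal software model of this post-normalization step for the adder/subtractor path is given below. It is a sketch under assumptions: the names leading_zeros24, post_normalize and bits_f32 are hypothetical, rounding and the exception module are left out, and the exponent is assumed to stay within the normal range.

```python
import struct

def bits_f32(b):
    """Return the float value of a 32-bit IEEE-754 bit pattern."""
    return struct.unpack(">f", struct.pack(">I", b))[0]

def leading_zeros24(frac):
    """Count leading zeros of a 24-bit value (24 when the value is zero)."""
    return 24 - frac.bit_length()

def post_normalize(sign, exp, frac, carry=0):
    """Normalize a 24-bit fraction and pack the single-precision result."""
    if carry:
        # Carry out of the adder: shift right by one, increment the exponent.
        frac = ((carry << 24) | frac) >> 1
        exp += 1
    elif frac == 0:
        return sign << 31                 # exact zero result
    else:
        # Left-shift by the leading-zero count, decrement the exponent to match.
        lz = leading_zeros24(frac)
        frac = (frac << lz) & 0xFFFFFF
        exp -= lz
    # Concatenate sign, exponent and the 23 stored fraction bits.
    return (sign << 31) | ((exp & 0xFF) << 23) | (frac & 0x7FFFFF)
```

For 1.0 - 0.75 the aligned fractions differ by 0x200000 at exponent 127; post_normalize(0, 127, 0x200000) shifts left by two and returns the bit pattern of 0.25.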
V. RESULTS

Figures 2, 3, 4, 5 and 6 give the synthesis results of various components on a Xilinx XC4VFX60-11FF1152 device. CPD stands for combinational path delay.

Figure 2. Synthesis results of various Multipliers
Figure 3. Synthesis results of various Adders
Figure 4. Synthesis results of various Comparators
Figure 5. Synthesis results of various Zero-detectors
Figure 6. Synthesis results of various Shifters

Tables 2 and 3 list the types of components used depending on whether area or speed is the priority.

Minimum Area: Number of LUTs = 1452; Frequency = 88.23 MHz

Table 2. Types of components for minimum area
Component       Type
Multiplier      Wallace-Tree
Adder           Carry-Ripple
Comparator      PG-Logic
Zero-Detector   XOR
Shifter         Logical

Maximum Frequency: Number of LUTs = 2056; Frequency = 88.645 MHz

Table 3. Types of components for maximum frequency
Component       Type
Multiplier      Wallace-Tree
Adder           Kogge-Stone
Comparator      PG-Logic
Zero-Detector   XOR
Shifter         Barrel
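Table 3 selects a Kogge-Stone adder for the maximum-frequency configuration. The paper does not give its implementation. As a rough illustration of the parallel-prefix idea, the Python sketch below computes all carries in log2(width) combining steps using bit-parallel generate and propagate words. The function name and its parameters are hypothetical.

```python
def kogge_stone_add(a, b, width=24, carry_in=0):
    """Add two width-bit values with a Kogge-Stone style prefix network.

    Returns (sum, carry_out).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b                 # per-bit generate
    p = a ^ b                 # per-bit propagate
    half_sum = p              # kept for the final sum
    g |= p & carry_in         # fold the carry-in into bit 0's generate

    d = 1
    while d < width:
        # Combine each position with the group d bits below it.
        g = (g | (p & (g << d))) & mask
        p = (p & (p << d)) & mask
        d <<= 1

    # Carry into bit i is the group generate of bits i-1..0.
    carries = ((g << 1) | carry_in) & mask
    return half_sum ^ carries, (g >> (width - 1)) & 1
```

Each pass of the loop corresponds to one level of the prefix tree, so a 24-bit addition needs five levels, compared with a 24-stage carry chain in a carry-ripple adder.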
Table 4. Performance Analysis of various floating-point architectures
Architecture     LUTs    Frequency (MHz)   Clock Cycles
FP Adder         614     145.964           3
FP Subtractor    614     145.964           3
FP Multiplier    1569    67.363            3
FP Comparator    46      46.15             3
Unified          2056    88.645            3

Table 5. Performance Analysis of Xilinx IPs
Xilinx IPs       LUTs    Frequency (MHz)   Clock Cycles
FP Multiplier    641     274               8
FP Adder         578     368               13
FP Subtractor    578     368               13

Figure 7. Waveform

In the waveform, a and b are the single-precision inputs and o is the single-precision output. The signal id selects the operation: 0 for floating-point addition, 1 for floating-point subtraction, 2 for floating-point multiplication and 3 for floating-point comparison. The signals nan, overflow and inf are the not-a-number, overflow and infinity flags respectively.

VI. CONCLUSION

The unified reconfigurable floating-point pipelined architecture is designed using Verilog HDL and implemented on a Xilinx Virtex-4 FPGA. The results show a decrease in area (number of LUTs) by a factor of 1.5 compared to the sum of the areas of the floating-point adder, subtractor, multiplier and comparator. The architecture also improves latency by a factor of 3 compared to the Xilinx floating-point IPs. The unified architecture can be extended further to handle floating-point division and double-precision (64-bit) floating-point numbers.

REFERENCES

[1] Yee Jern Chong and Sri Parameswaran, “Configurable Multimode Embedded Floating-Point Units for FPGAs”, IEEE Transactions, 2010.
[2] Shamsiah Suhaili and Othman Sidek, “Design and Implementation of Reconfigurable ALU on FPGA”, School of Electrical and Electronics Engineering, University Sains Malaysia.
[3] Gong Renxi, Zhang Shangjun, Zhang Hainan, Meng Xiaobi, Gong Wenying, Xie Lingling, Huang Yang, “Hardware Implementation of High-Speed Floating-Point Multiplier Based on FPGA”, Proceedings of 2009 4th International Conference on Computer Science & Education, IEEE.
[4] Ali Malik, Seok-Bum Ko, “A Study on the Floating-Point Adders in FPGAs”, Electrical and Computer Engineering, Canadian Conference CCECE '06, 10.1109/CCECE.2006.277498.
[5] ANSI/IEEE Std. 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, IEEE, Inc., New York, 1985.
[6] Olivieri, M.; Pappalardo, F.; Smorfa, S.; Visalli, G., “Analysis and Implementation of a Novel Leading Zero Anticipation Algorithm for Floating-Point Arithmetic Units”, IEEE Transactions, Vol. 54, 2007.
[7] R. Gnanasekaran, “A Fast Serial-Parallel Binary Multiplier”, IEEE Transactions on Computers, Vol. C-34, No. 8, Aug. 1985.
[8] Mounir Bohsali, Michael Doan, “Rectangular Styled Wallace Tree Multipliers”, Berkeley University.
[9] H. E. Weste, “CMOS VLSI Design”, Pearson Education, Third Edition, 2005.
[10] Jan M. Rabaey, “Digital Integrated Circuits: A Design Perspective”, Pearson Education, 2003.

AUTHORS

Sateesh Reddy obtained his B.Tech degree in Electronics and Communication from Jawaharlal Nehru Technological University, Kakinada, in 2002 and his MS degree in VLSI Design from the Indian Institute of Technology, Madras, in 2006. He is currently working with Poseidon Design Systems, Bangalore, as a senior member of technical staff in frontend digital system design. His research interests include frontend design of digital systems.

Vinit T Kanojia obtained his B.E. degree in Electrical and Electronics from Global Academy of Technology, Bangalore, Visvesvaraya Technological University (VTU), Belgaum, in 2008. He is pursuing his M.Tech. in VLSI Design and Embedded Systems in the Electronics and Communication Engineering department at RV College of Engineering, Visvesvaraya Technological University (VTU), Belgaum. His research interests include frontend design of digital systems and low-power VLSI.