Poster Paper Proc. of Int. Conf. on Advances in Computer Engineering 2011
Asynchronous Implementation of Split-radix FFT Algorithm Using FPGA
Shobith Thomas Jacob1, R. Ramesh2 and S. Malarvizhi3
1 Department of Electronics and Communication Engineering, S.R.M University, Tamil Nadu, 603 203, India. Email: stjcec@yahoo.com
2 Department of Electronics and Communication Engineering, S.R.M University, Tamil Nadu, 603 203, India. Email: raammesh1976@yahoo.co.in
3 Department of Electronics and Communication Engineering, S.R.M University, Tamil Nadu, 603 203, India. Email: hod.ece@ktr.srmuniv.ac.in

Abstract—The Fast Fourier Transform (FFT) has a wide variety of applications, particularly in communication engineering. The split-radix FFT (SRFFT) is one of the many algorithms for computing the FFT. In this paper, we propose a completely asynchronous design of a split-radix FFT processor core with no glitches at the output. The design drastically reduces the system latency. Because the design is asynchronous, the processor needs no separate data load stage: the output can be read while the new inputs are being loaded, which improves the speed of the device. Further, we present an algorithm to reduce the number of ROMs needed to store the twiddle factors.

Index Terms— FFT, SRFFT, Asynchronous system, CORDIC Algorithm.

I. INTRODUCTION
The invention of the Fast Fourier Transform (FFT) produced a giant leap in the performance of modern communication systems. The FFT is one of the most important algorithms in signal processing and communications and is used in orthogonal frequency division multiplexing (OFDM) systems (Ref. [1]). As the fast version of the DFT, the FFT and its inverse transform, the IFFT, are important analysis methods in digital signal spectrum analysis (Ref. [2]). Most existing designs are synchronous, so the speed of the processor core is limited by the clock input. The speed of the input and output sections of the device also has a major effect on the clock rate. In this paper, we present an asynchronous split-radix FFT core whose speed is limited only by the intrinsic delay of the device, so that it can be used in real-time applications without any glitches in the output section. This design significantly reduces the system latency. The paper also uses an algorithm that reduces the number of ROMs for storing the twiddle factors by exploiting certain properties of complex numbers. The algorithm also holds for larger, more complex systems.

II. COMMONLY USED ALGORITHMS
Since the Cooley-Tukey algorithm was proposed (Ref. [3]), many new algorithms have been put forward. Radix-2, radix-4, radix-8, split-radix and mixed-radix are among the most common. The first three use a single radix throughout the calculation of the FFT. The split-radix algorithm uses multiple radices within a single stage, while the mixed-radix algorithm uses different radices in different stages of the computation. The split-radix algorithm can be used to advantage to compute multiple internal stages in parallel. Research shows that it comes close to the theoretical minimum number of multiplications (Ref. [4]). The Winograd algorithm is yet another class of algorithm. However, FFTs are more modular, which is an advantage in hardware implementations, especially in very large scale integration (Ref. [5]). No major work has been reported on the asynchronous FPGA implementation of the split-radix algorithm.
© 2011 ACEEE DOI: 02.ACE.2011.02. 79
III. THE SPLIT-RADIX FFT
This algorithm for computing the FFT separates the input sequence into odd- and even-indexed samples. For a DFT of length $N = 2^m$ (where m is any natural number), the even- and odd-indexed output frequencies are given in Ref. [6], which specifies the general equations in terms of X(2r), X(4r + 1) and X(4r + 3). For a 16-point FFT, more specific equations are used in Ref. [2]. The three main equations used in Ref. [2] for the calculations of subsequent stages are as below:
The radix-4 algorithm is used to compute the odd-indexed outputs. It is given by the following equations:
where r = 0, 1, 2, ..., (N/4) - 1. This algorithm uses an L-section butterfly structure as given in Ref. [2]. A detailed description of how this equation is derived and transformed in subsequent stages is presented in Ref. [2]. The general signal flow graph is shown in Fig. 1.
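The decomposition into X(2r), X(4r + 1) and X(4r + 3) can be checked numerically. The Python sketch below uses the common textbook decimation-in-frequency form of the split-radix equations, which is assumed here to correspond to the general equations of Ref. [6]. The function names `dft` and `split_radix_stage` are hypothetical, and a direct DFT stands in for the smaller sub-transforms, so this is an illustration of one stage rather than the processor's datapath.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference and for the sub-transforms."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def split_radix_stage(x):
    """One split-radix stage (textbook form) for an input length divisible by 4."""
    n_pts = len(x)
    half, quarter = n_pts // 2, n_pts // 4

    def w(k):
        # Twiddle factor W_N^k = exp(-j*2*pi*k/N)
        return cmath.exp(-2j * cmath.pi * k / n_pts)

    # Even-indexed outputs X(2r): radix-2 butterfly feeding an N/2-point DFT.
    even_in = [x[n] + x[n + half] for n in range(half)]

    # Odd-indexed outputs X(4r+1) and X(4r+3): radix-4 style butterflies,
    # each followed by a twiddle multiplication and an N/4-point DFT.
    odd1_in, odd3_in = [], []
    for n in range(quarter):
        diff_a = x[n] - x[n + half]
        diff_b = x[n + quarter] - x[n + 3 * quarter]
        odd1_in.append((diff_a - 1j * diff_b) * w(n))
        odd3_in.append((diff_a + 1j * diff_b) * w(3 * n))

    out = [0j] * n_pts
    out[0::2] = dft(even_in)   # X(0), X(2), X(4), ...
    out[1::4] = dft(odd1_in)   # X(1), X(5), X(9), ...
    out[3::4] = dft(odd3_in)   # X(3), X(7), X(11), ...
    return out

# Compare against the direct DFT for an arbitrary 16-point input.
samples = [complex(n, (n * n) % 5) for n in range(16)]
max_error = max(abs(a - b) for a, b in zip(dft(samples), split_radix_stage(samples)))
print(f"largest deviation from the direct DFT: {max_error:.2e}")
```

In a full split-radix FFT the three sub-transforms would themselves be decomposed in the same way until only trivial butterflies remain.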
Figure 1. Signal flow graph of 16-point Split radix algorithm

From Ref. [2], some important points can be deduced. To compute the values of a(n), b(n) and c(n), all sixteen input values are needed. Hence the FFT computation cannot start until all the input values are available at the same time. Next, the values of d(n) and e(n) are computed, which completes the calculation of the first stage. At the same time, the values of f(n), g(n) and h(n) are computed; this is possible because their computation requires only the values of a(n). The same applies to all the other stages, which allows the computation of different stages to proceed in parallel. Parallel computation of the stages is the inherent advantage of the split-radix algorithm and makes it very attractive as far as the speed of the system is concerned: the system latency is greatly reduced. FFT algorithms are normally implemented synchronously. Even though this may seem easier, it causes considerable latency because the clock has to synchronize every stage.

In this paper, an asynchronous implementation is used. There is no global clock, and each part of the algorithm works independently of the others. Thus f(n), g(n), h(n), d(n) and e(n) are computed in parallel without a common clock, and the same holds for the other terms. Careful design of the circuit ensures that no glitch occurs in the processor.

IV. REDUCTION IN THE NUMBER OF ROMS
In Ref. [2], only five of the six twiddle factors were used, because the first twiddle factor is unity: the corresponding elements only need to be added and subtracted and do not need a multiplier. In this paper, we further reduce the number of ROMs for storing the twiddle factors by exploiting the properties of complex numbers. The symmetry in the magnitude and angle of the complex numbers is used to generate the required complex numbers from fewer ROMs than in Ref. [2]. The values of the complex twiddle factors are shown below:
$W_{16}^{0} = 1$ (4)
$W_{16}^{1} = 0.9238 - 0.3826i$ (5)
$W_{16}^{2} = 0.7071 - 0.7071i$ (6)
$W_{16}^{3} = 0.3826 - 0.9238i$ (7)
$W_{16}^{6} = -0.7071 - 0.7071i$ (8)
$W_{16}^{9} = -0.9238 + 0.3826i$ (9)
It is interesting to note that the magnitudes of the real and imaginary parts of many twiddle factors are the same. This is because the twiddle factors are placed symmetrically on the unit circle. This positional symmetry helps reduce the number of ROMs needed to store the values. Suppose that registers ‘a’, ‘b’ and ‘c’ store the values 0.9238, 0.3826 and 0.7071 respectively; then all the other twiddle factors can be generated very easily. In terms of the values stored in these registers, the twiddle factors are:
$W_{16}^{1} = a - bi$ (10)
$W_{16}^{2} = c - ci$ (11)
$W_{16}^{3} = b - ai$ (12)
$W_{16}^{6} = -c - ci$ (13)
$W_{16}^{9} = -a + bi$ (14)
Thus only three ROMs are needed to store all five twiddle factors. The percentage reduction in the number of ROMs can be computed as follows. Each twiddle factor has a real and an imaginary part, so ten ROMs are needed to store their values in the design of Ref. [2]. In this paper, we use only three ROMs, a reduction of 70% in the number of ROMs used for storing the twiddle factors. This concept can easily be extended to 64-point, 128-point or even higher-order FFTs. The trick lies in the fact that the twiddle factors, which are roots of unity, have positional symmetry. For a 64- or 128-point FFT, it is not easy to compute the value of every twiddle factor manually and then reduce the number of ROMs as described above, so we propose an algorithm to reduce the number of ROMs, as described below.

V. ESTIMATION OF THE NUMBER OF ROMS FOR STORING TWIDDLE FACTORS
Since many of the twiddle factors have identical real and imaginary parts, a systematic approach is needed to pre-compute exactly the number of ROMs that store them. This is especially useful for FFTs with a very large number of inputs. The prerequisite is a thorough knowledge of the angular position (θ) of the twiddle factors in the complex plane. The steps are as follows:
Step 1: Find the twiddle factors whose angular positions are 0°, 90°, 180°, 270°, 360° or their multiples. All such twiddle factors need just one ROM.
Step 2: Find the twiddle factors whose angular positions are odd multiples of 45°. They need just one ROM.
Step 3: From the remaining values, find the sets of twiddle factors whose angular positions are related to each other by (90° ± θ). The twiddle factors in each such set need just two ROMs to store their real and imaginary values.
Applying this algorithm also gives the number of twiddle factor values that actually have to be computed using the CORDIC algorithm. A simple C program or even a MATLAB program can automate this algorithm.
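The ROM count can also be estimated programmatically. The following Python sketch is one possible way to do so and is not the authors' program. It assumes that one ROM is needed per distinct nonzero magnitude among the real and imaginary parts of the twiddle factors, with signs and real/imaginary swaps handled by wiring. The helper names `twiddle`, `rom_constants` and `encode` are hypothetical. Constants are rounded to four decimals, so they appear as 0.9239 and 0.3827 rather than the truncated 0.9238 and 0.3826 quoted in the text.

```python
import cmath

def twiddle(n_points, k):
    """Twiddle factor W_N^k = exp(-j*2*pi*k/N)."""
    return cmath.exp(-2j * cmath.pi * k / n_points)

def rom_constants(n_points, exponents, digits=4):
    """Distinct nonzero magnitudes of the real and imaginary parts.

    Assumption: one ROM per distinct magnitude; zero needs no storage.
    """
    magnitudes = set()
    for k in exponents:
        w = twiddle(n_points, k)
        for part in (w.real, w.imag):
            mag = round(abs(part), digits)
            if mag != 0:
                magnitudes.add(mag)
    return sorted(magnitudes, reverse=True)

def encode(n_points, k, roms, digits=4):
    """Describe W_N^k as (sign, ROM index) pairs for its real and imaginary parts."""
    w = twiddle(n_points, k)

    def lookup(part):
        mag = round(abs(part), digits)
        if mag == 0:
            return (0, None)          # a zero part needs no ROM access
        return (1 if part > 0 else -1, roms.index(mag))

    return lookup(w.real), lookup(w.imag)

# The five nontrivial twiddle factors of the 16-point design in the text.
exponents_16 = [1, 2, 3, 6, 9]
roms_16 = rom_constants(16, exponents_16)
separate = 2 * len(exponents_16)      # one ROM per real part and per imaginary part
reduction = 100 * (1 - len(roms_16) / separate)

print("ROM contents:", roms_16)
print(f"{len(roms_16)} ROMs instead of {separate} ({reduction:.0f}% fewer)")
for k in exponents_16:
    print(f"W_16^{k} ->", encode(16, k, roms_16))
```

For a larger transform, the same functions can be called with the corresponding `n_points` and the list of exponents that the design actually uses.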
VI. STRUCTURE OF THE PROPOSED FFT PROCESSOR
The MATLAB code was written in MATLAB v.7.9.0. Verilog code was then written and simulated in ModelSim SE 6.2e, and the ModelSim and MATLAB outputs agreed with each other. The Verilog code was then synthesized using Xilinx 8.2i for the Virtex-4 family; the target device was the XC4VLX200. The maximum output required time after a positive edge on ‘en’ is 6.711 ns. The system latency is thus much lower than the 13 clock pulses (with a 60 MHz clock) reported in Ref. [2]. The maximum frequency of operation achieved is 85.436 MHz. Figure 3 shows the simulation waveform. The device utilization is 3%. This is expected, as the processor uses 20-bit buses, internal registers and arithmetic units, compared with the 16-bit registers and arithmetic units used in Ref. [2].
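Taking the figures quoted above at face value, the latency comparison can be reproduced with a few lines of arithmetic. This Python sketch is only a back-of-the-envelope check: the helper name `cycles_to_ns` is hypothetical, and the comparison assumes that the 13 clock pulses of Ref. [2] and the 6.711 ns figure measure comparable input-to-output delays.

```python
def cycles_to_ns(cycles, clock_hz):
    """Time taken by a number of clock cycles, in nanoseconds."""
    return cycles / clock_hz * 1e9

sync_latency_ns = cycles_to_ns(13, 60e6)    # 13 clock pulses at 60 MHz (Ref. [2])
async_latency_ns = 6.711                    # delay after the 'en' edge, as quoted
min_period_ns = cycles_to_ns(1, 85.436e6)   # period at the maximum operating frequency
ratio = sync_latency_ns / async_latency_ns

print(f"synchronous latency : {sync_latency_ns:.1f} ns")
print(f"asynchronous latency: {async_latency_ns:.3f} ns")
print(f"ratio               : {ratio:.1f}x")
print(f"minimum period      : {min_period_ns:.2f} ns")
```

Under these assumptions the synchronous figure is roughly 217 ns, about 32 times the quoted asynchronous delay.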
The block diagram of the proposed processor is shown in Fig. 2. The processor takes xn_r and xn_i as the real and imaginary parts of the input and produces XNout_r and XNout_i as the real and imaginary parts of the output. The input and output buses are each 20 bits wide. The processor has a single-bit enable signal (‘en’) and a four-bit selection input (‘sel’) for selecting the required output.
Figure 2. Block diagram of the proposed processor
To start with, ‘sel’ is stepped from 0 to 15 while the corresponding data values are applied to the processor. To initiate the computation, the ‘en’ signal is taken from logic 0 to logic 1. All sixteen inputs must be present before the positive edge appears on ‘en’. The processor then performs the computations and presents the results at the output buffer. ‘sel’ is then stepped again from 0 to 15, and the corresponding output values appear on the output buses XNout_r and XNout_i. At the same time, the new input values can be applied via the input buses xn_r and xn_i. This eliminates the separate ‘data load’ stage that is present in a conventional processor. The processor is designed so that tri-stating any of the input signals returns it to its reset state, which eliminates the need for a separate reset signal. The data are given as signed numbers: of the 20 bits on a bus, the most significant bit is the sign bit. To represent fractional values, the fixed-point ‘11.8’ representation is used, in which inputs and outputs are scaled up by 256 (a left shift by eight bits). Thus the decimal value 1 is represented as 256, 15 as 3840, and 1.5 as 384. In other words, bits 18 to 8 represent the integer part, while the last eight bits represent the fractional part. To recover the actual decimal value, the value on a bus is divided by 256 (for example, an output bus value of 832 corresponds to 832/256 = 3.25). Thus the smallest nonzero magnitude that can be represented on the bus is 1/256. Multiplication and division by 256 are not costly in hardware, as they require only shifters. The FFT processor is completely asynchronous, yet has no glitches at the output.
The processor uses self-generated pulses to synchronize the various stages of calculation. The asynchronous design ensures that the speed of the system is not limited by a global clock: the delay from the input to the output section equals the intrinsic delay of the processor, which improves the speed of the processor.
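The ‘11.8’ scaling described above can be illustrated with a short Python sketch. It is not the processor's Verilog: the helper names `to_fixed`, `from_fixed` and `fixed_mul` are hypothetical, and the sketch assumes a two's-complement 20-bit word and truncation after multiplication, neither of which is stated in the text.

```python
WORD_BITS = 20               # bus width used by the processor
FRACTION_BITS = 8            # '11.8' format: 1 sign bit, 11 integer bits, 8 fraction bits
SCALE = 1 << FRACTION_BITS   # 256

def to_fixed(value):
    """Scale a real value by 256 and round it to an integer bus value."""
    raw = round(value * SCALE)
    limit = 1 << (WORD_BITS - 1)   # assumed two's-complement range
    if not -limit <= raw < limit:
        raise OverflowError(f"{value} does not fit in {WORD_BITS} bits")
    return raw

def from_fixed(raw):
    """Recover the real value by dividing the bus value by 256."""
    return raw / SCALE

def fixed_mul(a, b):
    """Multiply two bus values; the product has 16 fraction bits, so shift right by 8."""
    return (a * b) >> FRACTION_BITS

# Values quoted in the text.
print(to_fixed(1), to_fixed(15), to_fixed(1.5))
print(from_fixed(832))

# A twiddle-style multiplication carried out entirely on bus values.
product = fixed_mul(to_fixed(1.5), to_fixed(0.7071))
print(product, from_fixed(product))
```

The last line shows the small error introduced by the eight-bit fraction: the fixed-point product differs slightly from 1.5 × 0.7071.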
VII. SIMULATION AND ANALYSIS
The proposed algorithm was tested first by writing MATLAB code.

Figure 3. Simulation waveform of FFT

VIII. CONCLUSION
This paper presents a glitch-free asynchronous FFT processor. The high clock latency problem is entirely eliminated in the proposed processor, and the output delay is very small. The processor does not need an independent ‘data load’ stage. The paper also demonstrates that the number of ROMs storing twiddle factors can be reduced by 70% for a 16-point FFT by following a sequence of well-defined steps. The paper details the pilot work carried out to demonstrate the practical use of the asynchronous processor. The work can be extended so that the input and output sections, the arithmetic units and the registers operate in the standard 32-bit IEEE floating-point format. All the concepts used here are also applicable to 64-point, 128-point and larger FFT processors. The processor can easily be modified for IFFT calculations.

ACKNOWLEDGEMENT
We would like to thank all the staff of the Department of Electronics and Communication Engineering, S.R.M University, who helped us a great deal in the completion of this paper. We are grateful to all our peer reviewers for their inspiration and insightful comments on our work.
REFERENCES
[1] Weste N. and Skellern D. J., “VLSI for OFDM”, IEEE Communications Magazine, 36(10), pages 127-131, October 1998.
[2] Xu Peng and Chen Jin Shu, “FPGA Implementation of High Speed FFT algorithm Based on split-radix”, ITAW, December 2008, pages 781-784.
[3] J. W. Cooley and J. W. Tukey, “An Algorithm for the Machine Calculation of Complex Fourier Series”, Mathematics of Computation, volume 19, pages 297-301, 1965.
[4] Guangshu Hu, “Digital Signal Processing”, Tsinghua University Press, pages 133-149, 1997.
[5] Paulo S. R. Diniz et al., “Digital Signal Processing System Analysis & Design”, page 129, Cambridge University Press, 2002.
[6] John G. Proakis and Dimitris G. Manolakis, “Digital Signal Processing: Principles, Algorithms and Applications”, pages 532-536, fourth edition, Pearson Education, 2007.