Speed Optimised CORDIC Based Fast Algorithm for DCT


GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) - 2016 | July 2016

e-ISSN: 2455-5703

1 V.K. Vidhysankari, 2 B. Pradeep Kumar

1,2 Department of Electronics and Communication Engineering
1,2 Dr. Mahalingam College of Engineering and Technology, Pollachi-642003, INDIA

Abstract

The Discrete Cosine Transform (DCT) is one of the most widely used transforms, especially for image and video compression. The DCT can be carried out efficiently using CORDIC, a well-known iterative algorithm, to perform vector rotations. An efficient CORDIC based fast algorithm for the DCT is presented with some notable advantages: a data flow similar to the Cooley-Tukey FFT, an identical post-scaling factor and rotation angles in arithmetic sequence. The number of CORDIC types is reduced to one by choosing a suitable trigonometric formula, which overcomes the problem of non-synchronization among the CORDIC rotation angles. Carry Save Adders (CSA) are used in the Processing Elements (PE) in place of full adders, and two different types of PE are reused to realize the four PEs. The delay reduces drastically when a modified 4:2 Carry Save Adder architecture is used.

Keywords- Compression, CORDIC, DCT, Carry Save Adder, Carry Look-ahead Adder, Processing Elements
__________________________________________________________________________________________________

I. INTRODUCTION

Compression reduces the number of bits needed to represent data by removing redundancy. Image compression reduces the size of an image by discarding carefully chosen redundant data, so that the necessary information is retained. Digital image processing plays an important role in many fields. The Discrete Cosine Transform (DCT), proposed by Ahmed, N., et al, has gained importance in digital signal and image processing, and various fast algorithms have been developed to meet its requirements.

The existing fast algorithms for the DCT can be classified into two groups: fixed length and variable length DCT algorithms. Fixed length algorithms are usually used for the 8-point DCT and mostly aim at reducing the computational complexity and increasing efficiency. Some such algorithms, like the matrix factorization proposed by Fanucci, L., et al and Pan, S.B., et al and the direct signal flow graph derivations given by Kaddachi, M.L., et al, take advantage of fast algorithms, but they suffer from control complexity and irregular signal flow graphs, and they can hardly be extended to higher orders.

Variable length DCT algorithms for faster computation have been used to meet market requirements and to overcome the problems faced in fixed length algorithms. Many fast algorithms for the DCT have been proposed, such as matrix factorization with simple structure and multiplier based algorithms by Chen, C.T., et al, Chen, C.H., et al and Narasimha, M.J., et al, and the recursive algorithms of Hou, H.S., which generate a higher order DCT from lower order DCTs. The development of such algorithms leads to the unfolded pipelined CORDIC technique, which uses a linear array architecture and can be used for efficient VLSI implementation. CORDIC based fast algorithms have low hardware complexity, high throughput and better synchronization.

II. CORDIC BASED FAST ALGORITHM FOR DCT/IDCT

The DCT and IDCT of an N-point signal are defined as,

DCT:

$$C[k] = \alpha[k] \sum_{n=0}^{N-1} x[n] \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad (k = 0, 1, \ldots, N-1)$$

IDCT:

$$x[n] = \sum_{k=0}^{N-1} \alpha[k]\, C[k] \cos\left[\frac{(2n+1)k\pi}{2N}\right], \quad (n = 0, 1, \ldots, N-1)$$
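The two definitions can be checked against each other with a direct (non-fast) evaluation. The following is a minimal sketch, not the fast algorithm of this paper. The scaling factor α[k] is not defined in the text, so the orthonormal choice α[0] = √(1/N), α[k] = √(2/N) is an assumption here, and the function names `alpha`, `dct` and `idct` are illustrative.

```python
import math


def alpha(k, n_points):
    # ASSUMPTION: orthonormal scaling; the paper does not define alpha[k].
    return math.sqrt((1.0 if k == 0 else 2.0) / n_points)


def dct(x):
    """Direct O(N^2) evaluation of C[k] from the DCT definition."""
    n_points = len(x)
    return [
        alpha(k, n_points)
        * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


def idct(c):
    """Direct O(N^2) evaluation of x[n] from the IDCT definition."""
    n_points = len(c)
    return [
        sum(
            alpha(k, n_points)
            * c[k]
            * math.cos((2 * n + 1) * k * math.pi / (2 * n_points))
            for k in range(n_points)
        )
        for n in range(n_points)
    ]


# Round trip on an 8-point signal: idct(dct(x)) should reproduce x.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
recovered = idct(dct(signal))
```

With this choice of α[k] the transform matrix is orthogonal, which is why the same cosine kernel serves both directions.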

The decomposition of the N-point DCT into two (N/2)-point DCTs, based on the CORDIC algorithm as derived by H. Huang, can be given as,

All rights reserved by www.grdjournals.com



Speed Optimised CORDIC Based Fast Algorithm for DCT (GRDJE / CONFERENCE / ICIET - 2016 / 073)

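$$\begin{bmatrix} \tilde{C}[k] \\ \tilde{C}[N-k] \end{bmatrix} = 2 \begin{bmatrix} \cos\left(\frac{k\pi}{2N}\right) & \sin\left(\frac{k\pi}{2N}\right) \\ -\sin\left(\frac{k\pi}{2N}\right) & \cos\left(\frac{k\pi}{2N}\right) \end{bmatrix} \cdot \begin{bmatrix} \tilde{C}_{L}[k] \\ \tilde{C}_{\hat{H}}[N/2-k] \end{bmatrix}, \quad k = 1, \ldots, N/2-1$$

Where,

$$\mathrm{CORDIC}\left(-\frac{k\pi}{2N}\right) = \begin{bmatrix} \cos\left(\frac{k\pi}{2N}\right) & \sin\left(\frac{k\pi}{2N}\right) \\ -\sin\left(\frac{k\pi}{2N}\right) & \cos\left(\frac{k\pi}{2N}\right) \end{bmatrix}$$

Replacing kπ/2N by θ, we get,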

$$\mathrm{CORDIC}(-\theta) = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$$

$$\begin{bmatrix} X_{out} \\ Y_{out} \end{bmatrix} = \mathrm{CORDIC}\left(-\frac{\pi}{8}\right) \cdot \begin{bmatrix} X_{in} \\ Y_{in} \end{bmatrix} \qquad (1)$$

$$\begin{bmatrix} X_{out} \\ Y_{out} \end{bmatrix} = \mathrm{CORDIC}\left(-\frac{3\pi}{16}\right) \cdot \begin{bmatrix} X_{in} \\ Y_{in} \end{bmatrix} = \frac{\sqrt{2}}{2}\, \mathrm{CORDIC}\left(\frac{\pi}{16}\right) \cdot \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} X_{in} \\ Y_{in} \end{bmatrix} \qquad (2)$$

Equations (1) and (2) lead to an efficient method to overcome the problem of lack of synchronization among the various rotation angles, as only one type of CORDIC is required.
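The angle splitting behind Eq. (2) can be checked numerically. The sketch below assumes the convention CORDIC(−θ) = [[cos θ, sin θ], [−sin θ, cos θ]], under which (√2/2)·[[1, 1], [−1, 1]] is a clockwise rotation by π/4, so that −3π/16 = −π/4 + π/16. Ideal floating-point rotations stand in for the hardware CORDIC, and the helper names are illustrative.

```python
import math


def cordic_matrix(angle):
    """Ideal 2x2 rotation matrix; positive angles rotate anti-clockwise."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def identity_error():
    """Largest entry-wise gap between both sides of the splitting identity."""
    lhs = cordic_matrix(-3.0 * math.pi / 16.0)
    butterfly = [[1.0, 1.0], [-1.0, 1.0]]
    product = matmul(cordic_matrix(math.pi / 16.0), butterfly)
    factor = math.sqrt(2.0) / 2.0
    rhs = [[factor * value for value in row] for row in product]
    return max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))


max_error = identity_error()
```

The practical point is that the 3π/16 rotation needs only a π/16 rotator plus an add/subtract butterfly and a constant scale, so no second rotator angle has to be built.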

III. SIGNAL FLOW GRAPH

The Coordinate Rotation Digital Computer (CORDIC) array performs the fixed angle rotations in the DCT algorithm. For a DCT of a given length, the CORDIC rotation angles are fixed. The general signal flow diagram of the CORDIC based DCT is shown in Fig. 1.

Fig. 1: General signal flow diagram of CORDIC based DCT

A. Fast DCT Algorithm

The signal flow can be separated into two major components: the butterfly operators, similar to those of the DFT, which are shown within the dashed borders, and the remaining fixed angle rotations, which belong to the CORDIC array. The signal flow graph of the 8-point DCT is given in Fig. 2.


Fig. 2: The signal flow graph of 8-point CORDIC based DCT.

B. Fast IDCT Algorithm

The IDCT is the Inverse Discrete Cosine Transform. Since the DCT and IDCT are orthogonal transforms, the signal flow of the N-point IDCT is easily obtained by inverting the function of each DCT block. The inverted function blocks and the direction of signal flow are shown in Table 1.

SYMBOL              DCT                    IDCT
BUTTERFLY           xout = xin + yin       xout = (xin + yin)/2
                    yout = xin - yin       yout = (xin - yin)/2
MULTIPLY CONSTANT   Kx                     (1/K)x
CORDIC              CLOCKWISE (-θ)         ANTI-CLOCKWISE (θ)

Table1: Transfer functions of DCT and IDCT.
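The block pairs in Table 1 can be checked by applying each DCT block and then its IDCT counterpart. This is a minimal sketch with illustrative function names; the constant K = 1.5 and the input values are arbitrary test data, not values from the paper.

```python
import math


def dct_butterfly(x_in, y_in):
    """DCT butterfly: sum and difference."""
    return x_in + y_in, x_in - y_in


def idct_butterfly(x_in, y_in):
    """IDCT butterfly: halved sum and difference, undoing dct_butterfly."""
    return (x_in + y_in) / 2.0, (x_in - y_in) / 2.0


def rotate(x_in, y_in, theta):
    """Ideal rotation; positive theta is anti-clockwise, negative is clockwise."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x_in - s * y_in, s * x_in + c * y_in


# Butterfly pair: the forward block followed by the inverse block.
butterfly_round_trip = idct_butterfly(*dct_butterfly(3.0, 5.0))

# Multiply-constant pair: K*x in the DCT, (1/K)*x in the IDCT.
K = 1.5  # arbitrary illustrative constant
constant_round_trip = (1.0 / K) * (K * 4.0)

# CORDIC pair: clockwise (-theta) in the DCT, anti-clockwise (theta) in the IDCT.
theta = math.pi / 16.0
rotation_round_trip = rotate(*rotate(3.0, 5.0, -theta), theta)
```

Each round trip returns the original operands (up to floating-point rounding), which is why the IDCT graph can be read off the DCT graph by reversing the flow and swapping each block for its inverse.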


Fig. 3: The signal flow graph of 8-point CORDIC based IDCT.

The signal flow graph of the 8-point CORDIC based IDCT is given in Fig. 3. Rotating the input vector [Xin, Yin] in the anti-clockwise direction by an angle θ generates the output vector [Xout, Yout] given by,

$$\begin{bmatrix} X_{out} \\ Y_{out} \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \cdot \begin{bmatrix} X_{in} \\ Y_{in} \end{bmatrix}$$

and rotating it in the clockwise direction by θ gives,

$$\begin{bmatrix} X_{out} \\ Y_{out} \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \cdot \begin{bmatrix} X_{in} \\ Y_{in} \end{bmatrix}$$

The unfolded CORDIC technique is used for fixed angle rotation because of its lower hardware complexity and increased computation speed. The large number of iterations required by conventional CORDIC is overcome by using the CSA based modified unfolded CORDIC, in which the number of iterations is reduced to 50%. The CSA based modified unfolded CORDIC is shown in Fig. 4.
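For context, the conventional iterative CORDIC that the modified design improves upon can be sketched as follows. This is the textbook rotation-mode algorithm in floating point, not the CSA based modified unfolded architecture of Fig. 4. The function name and the default of 16 iterations are illustrative assumptions.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) anti-clockwise by `angle` radians using shift-add steps.

    Conventional rotation-mode CORDIC; converges for |angle| below
    roughly 1.74 rad. A negative angle gives a clockwise rotation.
    """
    gain = 1.0
    residual = angle
    for i in range(iterations):
        direction = 1.0 if residual >= 0.0 else -1.0
        step = 2.0 ** -i  # a right shift by i bits in hardware
        # Micro-rotation by atan(2^-i): only shifts and add/subtract.
        x, y = x - direction * y * step, y + direction * x * step
        residual -= direction * math.atan(step)
        gain *= math.sqrt(1.0 + step * step)
    # Each micro-rotation stretches the vector; compensate once at the end.
    return x / gain, y / gain


# Clockwise rotation of (1, 0) by pi/8, as used for a fixed DCT angle.
x_out, y_out = cordic_rotate(1.0, 0.0, -math.pi / 8.0)
```

Because the rotation angles of a fixed-length DCT are known in advance, the sequence of directions is constant, which is what allows the loop to be unrolled into a short chain of hardwired stages.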

Fig. 4: Modified signal flow graph of unfolded pipelined CORDIC.

IV. EXISTING ARCHITECTURE

The existing architecture requires two N/2-point DCT/IDCTs, an additional N/2-1 CORDICs and an additional N/2 fundamental butterfly operators to construct the N-point DCT/IDCT, as shown in Fig. 5. It is based on the concept of a parallel-in serial-out linear array architecture for the 8-point DCT, shown in Fig. 6, which requires four PEs of two different types, PE_1ST and PE_2ND, obtained by reusing the uniform PE.
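The recursive construction implies a simple count of operators for any power-of-two length. The sketch below unrolls that recurrence. The base case is an assumption not stated in the text: a 2-point DCT is taken to be a single butterfly with no CORDIC. The function name `resource_count` is illustrative.

```python
def resource_count(n_points):
    """Return (cordics, butterflies) for an N-point DCT built recursively.

    Uses the recurrence from the text: two N/2-point DCTs plus
    N/2 - 1 CORDICs and N/2 butterflies.
    """
    if n_points < 2 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two, at least 2")
    if n_points == 2:
        # ASSUMPTION: a 2-point DCT is one butterfly and no CORDIC.
        return 0, 1
    cordics, butterflies = resource_count(n_points // 2)
    return 2 * cordics + n_points // 2 - 1, 2 * butterflies + n_points // 2


counts = {n: resource_count(n) for n in (4, 8, 16)}
```

Under this assumption the 8-point DCT uses 5 CORDICs and 12 butterflies, and the butterfly count follows the same (N/2)·log2(N) growth as a Cooley-Tukey FFT.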


Fig. 5: Generalized structure for the N-point DCT/IDCT.

PE_1ST carries out the operation of the butterfly operator and PE_2ND performs the CORDIC array operation. The results are then scaled using a uniform scaling factor.

Fig. 6: Architecture for 8-point DCT.

A. Processing Element-1 (PE_1ST)

The eight inputs are stored in a temporary memory and are fed in bit reversed order. The control signal (Cr) controls the MUX based on the decomposition matrix MCr_8, which is given in Eq. (3).

(3)


The inputs are grouped into four combinations as (M1,M5), (M2,M4), (M3,M7) and (M6,M8) based on the decomposition matrix and are separated in pairs into two groups. One of the two groups has four different numbers and the other group has four same numbers.

Fig. 7: Architecture of PE_1ST for the 8-point CORDIC-based DCT.


Fig. 8: Architecture of PE_2ND for 8-point CORDIC-based DCT.

The outputs of the first stage 4:2 Carry Save Adders (CSA), which are 13 bits wide due to shifting in the CSAs, are stored temporarily in the pipelining registers. A pair of 14 bit inputs from each second stage CSA is given as input to a Carry Look-ahead Adder (CLA), and each CLA generates a single output.

B. Processing Element-2 (PE_2ND)

According to Equations (1) and (2), only one type of CORDIC is required to realize the CORDIC arrays. Based on Fig. 4, a computationally efficient CSA-based modified unfolded CORDIC architecture (PE_2ND) containing three micro rotation stages is shown in Fig. 8. Each micro rotation stage contains a pair of hardwired shifters, two's complementers and 3:2 CSAs, followed by CLAs and registers. The two's complementer is used to realize the subtract operation. The 15 bit outputs of PE_1ST, Out_1 and Out_2, are fed as inputs In_1 and In_2 to the micro rotation stages of PE_2ND. Each stage performs add and subtract operations on its corresponding inputs. The single output obtained from each CLA is stored in its corresponding register. The first, second and third micro rotation stages use the shift pairs (3 and 1), (7 and 2) and (6 and 3), respectively. The final outputs are named Output_1 and Output_2. The outputs obtained from PE_2ND are similar to the outputs of the CORDIC array.

Example: To obtain the outputs C2 and C8, the following three steps are needed.
• Rotating the input values M2 and M4 by −π/8 (six rotation stages)
• Rotating the input values M6 and M8 by −π/8
• Interweaving the intermediate results and rotating them by −π/16
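The role of the 3:2 CSA followed by a CLA in each micro rotation stage can be illustrated at word level. This is a minimal sketch on non-negative Python integers, with an illustrative function name. It models only the arithmetic, not the bit widths, two's complement handling or timing of PE_2ND.

```python
def carry_save_add(a, b, c):
    """3:2 carry-save step: reduce three operands to a sum word and a carry word.

    No carry propagates between bit positions here. Each bit position is an
    independent full adder, which is why the step is fast in hardware.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return partial_sum, carry


# Three operands are reduced to two words by the CSA ...
sum_word, carry_word = carry_save_add(13, 7, 9)
# ... and a single carry-propagating addition (the CLA's job) finishes the sum.
total = sum_word + carry_word
```

Only the final addition pays the cost of carry propagation, so inserting CSAs ahead of the CLA shortens the critical path when several terms have to be combined.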

V. PROPOSED ARCHITECTURE

The 4:2 CSA consists of six modules: four XOR gates and two 2:1 MUX circuits. The proposed architecture replaces the 4:2 CSA in the first stage of PE_1ST with a design that uses two XNOR gates in cascade and implements the carry output using an XNOR gate and a multiplexer, as proposed by Radhakrishnan, D. and Riya Garg in their corresponding papers. The proposed 4:2 CSA architecture is shown in Fig. 9.
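The six-module structure (four XOR gates and two 2:1 MUXes) can be modelled at bit level and checked exhaustively. The sketch below shows the conventional XOR-MUX form of a 4:2 compressor with a carry input. It is not the XNOR-based circuit of Fig. 9, and the signal and function names are illustrative.

```python
from itertools import product


def mux(select, when_one, when_zero):
    """2:1 multiplexer on single bits."""
    return when_one if select else when_zero


def compressor_4_2(x1, x2, x3, x4, cin):
    """Bit-level 4:2 compressor built from four XORs and two MUXes."""
    x12 = x1 ^ x2            # XOR 1
    x123 = x12 ^ x3          # XOR 2
    x1234 = x123 ^ x4        # XOR 3
    sum_bit = x1234 ^ cin    # XOR 4
    # MUX 1: majority of (x1, x2, x3); if x1 and x2 differ, x3 decides.
    cout = mux(x12, x3, x1)
    # MUX 2: majority of (x123, x4, cin); if x123 and x4 differ, cin decides.
    carry = mux(x1234, cin, x4)
    return sum_bit, carry, cout


# Exhaustive check over all 32 input combinations: the five input bits
# must equal sum + 2 * (carry + cout).
all_correct = all(
    sum(bits) == s + 2 * (c + co)
    for bits in product((0, 1), repeat=5)
    for s, c, co in [compressor_4_2(*bits)]
)
```

Because `cout` depends only on x1, x2 and x3 and not on `cin`, a row of such cells has no rippling carry chain, which is the property that makes the 4:2 structure attractive for reducing delay.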


Fig. 9: 4:2 Carry Save Adder architecture.

The existing and proposed methods are compared on delay and power using the Xilinx 14.2 ISE Simulator, and the results are shown in Table 2.

              EXISTING   PROPOSED
Delay (ns)    6.993      4.307
Power (W)     0.534      0.534

Table 2: Analysis of power and delay

From Table 2, it is clear that in the proposed architecture the delay reduces from 6.993 ns to 4.307 ns, a reduction of about 38%, with the same power consumption of 0.534 W.

VI. CONCLUSION AND FUTURE WORK

A CORDIC based fast algorithm for the DCT is presented with fewer iterations than its predecessors. The existing and proposed architectures are coded and analysed using the Xilinx 14.2 ISE Simulator. The proposed architecture has less delay with the same power consumption when compared to the existing architecture of the 8-point CORDIC based DCT. Future work will address CORDIC based fast algorithms for other orthogonal transforms like the DFT and KLT.

REFERENCES

[1] Ahmed, N., Natarajan, T. and Rao, K.R. (1974). Discrete Cosine Transform. IEEE Trans. Comput. C-23, 90-94.
[2] Chen, C.T., Chen, L.G., Chiueh, T.D. and Hsiao, J.H. (1995). High throughput CORDIC-based systolic array design for the discrete cosine transform. IEEE Trans. Circuits Syst. Video Technol. 5 (3), 218-225.
[3] Chen, C.H., Liu, B.D. and Yang, J.F. (2004). Direct recursive structures for computing radix-r two dimensional DCT/IDCT/DST/IDST. IEEE Trans. Circuits Syst.–I: Regul. Pap. 51, 10.
[4] Fanucci, L., Saletti, R. and Saponara, S. (2001). Parameterized and reusable VLSI macro cells for low-power of 2-D Discrete-Cosine-Transform. Microelectron. J. 32, 1035-1045.
[5] Huang, H. and Xiao, L. (2014). CORDIC based fast algorithm for power-of-two point DCT and its efficient VLSI implementation. Microelectronics J., Vol. 45, Issue 11, 1480-1488.
[6] Hou, H.S. (1997). A fast recursive algorithm for computing the discrete cosine transform. IEEE Trans. Acoust. Speech Signal Process. ASSP-35, 1445-1461.
[7] Huang, H. and Xiao, L. (2013). Variable length reconfigurable algorithms and architectures for DCT/IDCT based modified unfolded CORDIC. The Open Electrical & Electronic Engineering Journal 7, (Supple 1: M8), 71-81.
[8] Kaddachi, M.L., Soudani, A., Lecuire, V., Makkaoui, L., Moureaux, J.M. and Torki, K. (2012). Design and performance analysis of a zonal DCT-based image encoder for wireless camera sensor networks. Microelectron. J. 43, 809-817.
[9] Narasimha, M.J. and Peterson, A.M. (1978). On the computation of the discrete cosine transform. IEEE Trans. Commun. 26 (6), 934-936.
[10] Pan, S.B. and Park, R.H. (1997). Unified systolic arrays for computation of DCT/DST/DHT. IEEE Trans. Circuit Syst. Video Technol. 7 (2), 413-419.
[11] Riya Garg, Suman Nehra and Singh, B.P. (Mar. 2013). Low power 4-2 Compressor for Arithmetic Circuits. IJRTE, Vol. 2, Issue 1, ISSN: 2277-3878.
[12] Radhakrishnan, D. and Preethy, A.P. (2000). Low Power CMOS Pass Logic 4-2 Compressor for High-Speed Multiplication. Proc. IEEE Midwest Symp. on Circuits and Systems, pp. 1-3.
[13] Riya Garg, Suman Nehra and Singh, B.P. (2013). Low power Full Adder using 9T Structure. International Journal on Recent Trends in Engineering and Technology, Vol. 8, No. 2, pp. 7-10.
[14] Riya Garg, Suman Nehra and Singh, B.P. (2013). A New Design of Full Adder based on XNOR-XOR Circuit. International Journal of Computer Application, Vol. 8, No. 2, pp. 7-10.
