Multicore Parallel Implementation of 2D-FFT Based on TMS320C6678 DSP


Scientific Journal of Information Engineering June 2015, Volume 5, Issue 3, PP.61-66

Wende Wu 1,2,#, Zhiyong Xu 1
1. Institute of Optics and Electronics of Chinese Academy of Sciences, Chengdu 610209, China
2. University of Chinese Academy of Sciences, Beijing 100039, China
# Email: wuwende2008@126.com

Abstract
After studying the characteristics of different multicore DSP programming models and of the two-dimensional FFT (2D-FFT), we propose a multicore parallel plan for the 2D-FFT and implement it on the TMS320C6678 DSP. The plan exploits the parallel computing capability of the multicore DSP and improves the efficiency of the 2D-FFT. It is a useful reference for implementing image processing algorithms based on the 2D-FFT.
Keywords: Multicore DSP; Parallel Programming; 2D-FFT; Inter-Processor Communication

1. INTRODUCTION
The 2D-FFT is a basic algorithm that is widely used in image processing. Because images are large and multidimensional, the 2D-FFT is computationally complex and time-consuming, and it severely limits the efficiency of image processing algorithms. Platforms built from multiple DSPs and FPGAs have been used to meet the real-time requirements of image processing algorithms based on the 2D-FFT[1],[2]. However, multiple DSPs increase the power consumption and volume of a platform, both of which are tightly constrained in embedded systems. Since Texas Instruments (TI) introduced the high-performance multicore DSP TMS320C6678 in 2010, applying it to image processing platforms has become a trend, but exploiting the parallel computing capability of the multicore DSP and improving the efficiency of the 2D-FFT remain challenges. After studying the characteristics of the 2D-FFT and the C6678, we propose a multicore parallel algorithm for the 2D-FFT based on data division. Experimental results show that when the image size is appropriate, the multicore parallel algorithm achieves a good speed ratio and parallel efficiency on the C6678 DSP. This paper provides a reference for the multicore parallel implementation of image processing algorithms based on the 2D-FFT.

2. MULTICORE PARALLEL PROCESSING MODEL
Multicore DSPs mainly support three parallel programming models: the Master/Slave Processing Model, the Data Flow Processing Model and the OpenMP Fork-Join Model.

The Master/Slave Processing Model, shown in Fig.1, represents centralized control with distributed execution. A master core is responsible for scheduling various threads of execution that can be allocated to any available core for processing. It also must deliver any data required by the thread to the slave core. Applications that fit this model inherently consist of many small independent threads that fit easily within the processing resources of a single core[3].

The Data Flow model, shown in Fig.2, represents distributed control and execution. Each core processes a block of data using various algorithms and then the data is passed to another core for further processing. The initial core is often connected to an input interface supplying the initial data for processing from either a sensor or FPGA. Scheduling is triggered upon data availability. Applications that fit the Data Flow model often contain large and computationally complex components that are dependent on each other and may not fit on a single core[3].

OpenMP is an Application Programming Interface (API) for developing multi-threaded applications in C/C++ or Fortran for shared-memory parallel (SMP) architectures. Once the programmer identifies parallel regions and inserts the relevant OpenMP constructs, the compiler and runtime system figure out the rest of the details[3]. The OpenMP Fork-Join Model is shown in Fig.3.

FIG. 1 MASTER/SLAVE PROCESSING MODEL

FIG. 2 DATA FLOW PROCESSING MODEL

FIG. 3 OPENMP FORK-JOIN MODEL

3. MULTICORE PARALLEL COMPUTING PLAN FOR 2D-FFT
3.1 Decomposing of 2D-FFT
The 2D-FFT of an image $f(x, y)$ of size $M \times N$ is given by Eq.1.

$$
\begin{aligned}
F(u, v) &= \frac{1}{MN} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi (ux/M + vy/N)} \\
&= \frac{1}{M} \sum_{x=0}^{M-1} e^{-j 2\pi ux/M} \left[ \frac{1}{N} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi vy/N} \right] \\
&= \frac{1}{M} \sum_{x=0}^{M-1} F(x, v)\, e^{-j 2\pi ux/M}
\end{aligned}
\tag{1}
$$

where

$$
F(x, v) = \frac{1}{N} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi vy/N}
\tag{2}
$$

For each $x$, evaluating Eq.2 for $v = 0, 1, 2, \dots, N-1$ is a complete one-dimensional FFT (1D-FFT). In other words, $F(x, v)$ is the 1D-FFT of one row of $f(x, y)$. As $x$ ranges from 0 to $M-1$, $F(x, v)$ gives the 1D-FFT of every row of $f(x, y)$, while the frequency $u$ stays constant. To finish the 2D-FFT, $u$ must range from 0 to $M-1$ in the expression $\frac{1}{M} \sum_{x=0}^{M-1} F(x, v)\, e^{-j 2\pi ux/M}$, which amounts to computing the 1D-FFT of every column of $F(x, v)$[4].

In conclusion, to finish the 2D-FFT, we first compute the 1D-FFT of every row of the image and then compute the 1D-FFT of every column of the intermediate result. The steps are shown in Fig.4.
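The row-then-column procedure can be checked numerically on a tiny image. The sketch below is only an illustration in Python, not the paper's DSP code: it uses a naive O(N^2) transform in place of a real FFT, and the function names (`dft_1d`, `dft_2d_row_column`, `dft_2d_direct`) are hypothetical. It compares the row-column result against a direct evaluation of the double sum in Eq.1, with the same 1/M and 1/N scaling.

```python
import cmath


def dft_1d(seq):
    """Naive 1D DFT with 1/N scaling (illustration only, not a fast algorithm)."""
    n = len(seq)
    return [
        sum(seq[k] * cmath.exp(-2j * cmath.pi * v * k / n) for k in range(n)) / n
        for v in range(n)
    ]


def transpose(mat):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


def dft_2d_row_column(image):
    """2D DFT computed as row transforms followed by column transforms."""
    rows_done = [dft_1d(row) for row in image]                  # intermediate result, indexed [x][v]
    cols_done = [dft_1d(col) for col in transpose(rows_done)]   # indexed [v][u]
    return transpose(cols_done)                                 # final result, indexed [u][v]


def dft_2d_direct(image):
    """Direct evaluation of the double sum over x and y."""
    m, n = len(image), len(image[0])
    return [
        [
            sum(
                image[x][y] * cmath.exp(-2j * cmath.pi * (u * x / m + v * y / n))
                for x in range(m)
                for y in range(n)
            ) / (m * n)
            for v in range(n)
        ]
        for u in range(m)
    ]
```

Calling both functions on a small image such as `[[1, 2, 3], [4, 5, 6]]` should give results that agree to within floating-point rounding, which is the property the decomposition relies on.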


3.2 Parallel Computing Plan
Because the 1D-FFTs of different rows or columns are independent of one another and together form a large computation that takes too long on a single core, the Master/Slave Processing Model is selected: the 1D-FFTs of different data blocks are treated as separate tasks and allocated to different cores. Since all cores in the C6678 have equal computing capability[5], the work of both 1D-FFT passes is divided equally among the cores. As shown in Fig.5, the master core (Core0 in Fig.5) is responsible for synchronization and communication among the cores. Before the first 1D-FFT pass, it fetches the original image data, assigns data to the slave cores (CoreX in Fig.5) and schedules them to execute the 1D-FFT. Between the two passes, it combines the intermediate results, transposes the data, assigns data to the slave cores and schedules them to execute the 1D-FFT again. Finally, it combines the final results and outputs them. These tasks can only be executed serially by the master core.

FIG. 4 DECOMPOSING OF 2D-FFT: f(x,y) -> 1D-FFT of row -> F(x,v) -> 1D-FFT of column -> F(u,v)

FIG. 5 MULTICORE PARALLEL COMPUTING PLAN (blocks labeled Core0 and CoreX)
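The data-division plan of Section 3.2 can be sketched in a few lines. This is an illustrative Python model under stated assumptions, not the C6678 implementation: a thread pool stands in for the slave cores, a naive DFT stands in for the optimized 1D-FFT, and the names `split_rows`, `worker_fft_rows`, `parallel_pass` and `parallel_dft_2d` are hypothetical. Python threads do not give a real speedup for this kind of computation; the sketch only shows who does what.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft_1d(seq):
    """Naive 1D DFT with 1/N scaling (stand-in for an optimized 1D-FFT)."""
    n = len(seq)
    return [
        sum(seq[k] * cmath.exp(-2j * cmath.pi * v * k / n) for k in range(n)) / n
        for v in range(n)
    ]


def split_rows(rows, num_cores):
    """Divide rows into num_cores contiguous blocks of near-equal size."""
    base, extra = divmod(len(rows), num_cores)
    blocks, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        blocks.append(rows[start:start + size])
        start += size
    return blocks


def worker_fft_rows(block):
    """Work done by one worker: a 1D transform of every row in its block."""
    return [dft_1d(row) for row in block]


def parallel_pass(rows, pool, num_cores):
    """Master side of one pass: divide the data, dispatch blocks, combine results in order."""
    blocks = split_rows(rows, num_cores)
    results = pool.map(worker_fft_rows, blocks)  # map preserves block order
    return [row for block in results for row in block]


def parallel_dft_2d(image, num_cores=4):
    """Row pass, transpose on the master, column pass, transpose back."""
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        row_pass = parallel_pass(image, pool, num_cores)
        transposed = [list(col) for col in zip(*row_pass)]
        col_pass = parallel_pass(transposed, pool, num_cores)
    return [list(col) for col in zip(*col_pass)]
```

The dividing, combining and transposing steps run only on the master, which matches the serial tasks the plan assigns to Core0; only `worker_fft_rows` runs on the workers.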

FIG. 6 DATA TRANSFER BETWEEN PC AND DSP: the PC browser sends a request and image data to C6678 Core0; Core0 responds to the request and sends back the results.

4. MULTICORE PARALLEL ALGORITHM OF 2D-FFT
4.1 Data Transfer between PC and DSP
The original image data and the processed results are transferred between a personal computer (PC) and the C6678 through the network interface. To avoid developing a user interface program on the PC, the browser/server (B/S) network application model is adopted: the PC uploads the original image and downloads the processed results through a web browser. Core0 of the C6678 is responsible for the network service. Because this paper focuses on the multicore parallel implementation of the 2D-FFT, the input and output of data through the network are not discussed in detail and are simply shown in Fig.6.

4.2 Data Storage Plan and Inter-processor Communication Plan
In the whole application, the data transferred through the network is large and is stored in the DDR3 external shared memory. Each core uses the enhanced direct memory access (EDMA) controllers to move data between DDR3 and its own L2SRAM[6] in order to divide and combine the data. During execution of the algorithm, the cores must synchronize and exchange data several times. We use Notify together with a semaphore or event to synchronize the cores, and MessageQ to transfer global data among them.

Here, MessageQ and Notify are inter-thread communication methods provided by IPC[7]. IPC is the inter-processor communication component of the SYS/BIOS real-time kernel[8]. Notify provides notification functionality and can be combined with a semaphore or event to synchronize multiple threads[7]. MessageQ provides message transfer functionality and can be used to transfer a small amount of global data and to synchronize multiple threads[7].
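The communication pattern described above can be modelled on a desktop machine. The sketch below is an analogy only and does not use the TI IPC API: Python's `queue.Queue` stands in for a MessageQ, `threading.Barrier` stands in for Notify-based synchronization, threads stand in for cores, and the function name `run_demo` and the doubling step are hypothetical placeholders.

```python
import queue
import threading


def run_demo(blocks):
    """Master sends one block per worker, workers synchronize, results come back."""
    num_workers = len(blocks)
    inboxes = [queue.Queue() for _ in range(num_workers)]  # one message queue per worker
    results = queue.Queue()                                # the master's own message queue
    barrier = threading.Barrier(num_workers)               # stand-in for Notify-based sync

    def worker(worker_id):
        block = inboxes[worker_id].get()            # blocking receive of the assigned data
        partial = [2 * value for value in block]    # placeholder for the real 1D-FFT work
        barrier.wait()                              # wait until every worker has finished
        results.put((worker_id, partial))           # send the result back to the master

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for thread in threads:
        thread.start()
    for worker_id, block in enumerate(blocks):
        inboxes[worker_id].put(block)               # master distributes the data
    gathered = dict(results.get() for _ in range(num_workers))
    for thread in threads:
        thread.join()
    return [gathered[i] for i in range(num_workers)]  # combine in worker order
```

Results arrive on the master queue in whatever order the workers finish, so each message carries the worker index and the master reorders them before combining.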

4.3 Parallel Algorithm of 2D-FFT
The flow chart of the algorithm is shown in Fig.7. Core0 has two working threads. The master thread fetches the original image data from the PC browser through the network interface and converts the image data to floating-point complex numbers. The master thread then assigns data to the slave cores and to Core0's slave thread for processing. These processing threads get the data, cooperate to process a complete image and send the results back to Core0's master thread. The master thread combines the results, creates RAM-based files[9] and sends them back to the PC browser.

FIG. 7 MULTICORE PARALLEL ALGORITHM OF 2D-FFT (flow chart). All cores load and execute the program, initialize the platform and are synchronized through IPC. Core0 master thread: initialize and configure the network interface and create the HTTP server; wait for an HTTP request; receive the HTTP request, get and divide the image data, and send the data via MessageQ; synchronize all cores via Notify; receive the processed data via MessageQ; encapsulate the data and send it back to the browser. Core0 slave thread and Core1~7: initialize EDMA; receive data via MessageQ; move data, FFT, transpose, move and combine data; after synchronization, again move data, FFT, transpose, move and combine data; send the result data back via MessageQ.

5. EXPERIMENTAL RESULTS AND ANALYSIS
We use 8-bit grayscale images of 128*128, 256*256, 512*512 and 1024*1024 pixels to test the multicore parallel algorithm. We first implement the serial 2D-FFT on one core, and then code and optimize the multicore parallel algorithm. On the C6678 platform, we measure the execution time of the 2D-FFT and calculate the speed ratio and parallel efficiency as the number of cores and the image size vary. As shown in Table 1, the parallel efficiency for a given image decreases slowly as the number of cores increases. The main reason is that the time spent on communication and on moving data among cores grows with the number of cores.


With the number of cores held constant, the parallel efficiency first rises and then falls slightly as the image becomes larger. It rises because the proportion of time spent on communication and on moving data among cores decreases as the image grows, so the parallel computing capability of the cores is used more fully. It falls slightly afterwards because the local level-2 static memory (L2SRAM) has a fixed size: once the image grows beyond a certain size, the proportion of time spent on communication and data movement no longer decreases; on the contrary, dividing and moving more data increases that proportion. When the image size fits the size of the L2SRAM, the speed ratio reaches 1.98 with two cores, 3.60 with four cores and 6.15 with eight cores. The speed ratio is thus roughly linear in the number of cores, which indicates that the parallel plan in this paper basically accords with Gustafson's law, that is, the speed ratio increases linearly as the number of cores increases[10].

TABLE 1 TESTING RESULT OF PARALLEL 2D-FFT

| image size | performance | 1 core | 2 cores | 4 cores | 8 cores |
|---|---|---|---|---|---|
| 128*128 | average time (us) | 556 | 295 | 178 | 148 |
| 128*128 | speed ratio | 1 | 1.88 | 3.12 | 3.76 |
| 128*128 | parallel efficiency | 100% | 94% | 78% | 47% |
| 256*256 | average time (us) | 2165 | 1100 | 623 | 430 |
| 256*256 | speed ratio | 1 | 1.97 | 3.48 | 5.03 |
| 256*256 | parallel efficiency | 100% | 98.5% | 87% | 62.9% |
| 512*512 | average time (us) | 9369 | 4722 | 2603 | 1524 |
| 512*512 | speed ratio | 1 | 1.98 | 3.60 | 6.15 |
| 512*512 | parallel efficiency | 100% | 99% | 90% | 76.9% |
| 1024*1024 | average time (us) | 34150 | 18510 | 9815 | 5626 |
| 1024*1024 | speed ratio | 1 | 1.84 | 3.48 | 6.07 |
| 1024*1024 | parallel efficiency | 100% | 92% | 87% | 75.9% |
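The speed ratio and parallel efficiency in Table 1 follow from the measured times by the usual definitions, which we assume here: speed ratio = (time on one core) / (time on p cores), and parallel efficiency = speed ratio / p. The short Python sketch below recomputes them from the average times in Table 1; the names `TIMES_US`, `speed_ratio`, `parallel_efficiency` and `summarize` are hypothetical helpers. Recomputed efficiencies may differ from the table in the last digit, depending on where rounding is applied.

```python
# Average 2D-FFT times in microseconds from Table 1, keyed by image size and core count.
TIMES_US = {
    "128*128": {1: 556, 2: 295, 4: 178, 8: 148},
    "256*256": {1: 2165, 2: 1100, 4: 623, 8: 430},
    "512*512": {1: 9369, 2: 4722, 4: 2603, 8: 1524},
    "1024*1024": {1: 34150, 2: 18510, 4: 9815, 8: 5626},
}


def speed_ratio(t_one_core, t_p_cores):
    """Single-core time divided by the time on p cores."""
    return t_one_core / t_p_cores


def parallel_efficiency(t_one_core, t_p_cores, cores):
    """Speed ratio divided by the number of cores (1.0 means ideal scaling)."""
    return speed_ratio(t_one_core, t_p_cores) / cores


def summarize(times):
    """Return {image size: {cores: (speed ratio, efficiency)}} for every table entry."""
    summary = {}
    for size, by_cores in times.items():
        t1 = by_cores[1]
        summary[size] = {
            cores: (speed_ratio(t1, tp), parallel_efficiency(t1, tp, cores))
            for cores, tp in by_cores.items()
        }
    return summary
```

For example, `summarize(TIMES_US)["512*512"][8]` gives the eight-core speed ratio and efficiency for the 512*512 image.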

6. CONCLUSIONS
In this paper, we propose a multicore parallel computing plan based on the Master/Slave Processing Model and implement the multicore parallel 2D-FFT algorithm on the C6678 DSP platform. The algorithm is based on the division of image data and makes full use of the computing capability of the C6678 DSP. The experimental results indicate that when the size of the image to be processed is appropriate, the algorithm achieves a good speed ratio and parallel efficiency, which lays the foundation for implementing complex image processing algorithms based on the 2D-FFT on the C6678 DSP platform.

REFERENCES
[1] Bo Wen, Qiheng Zhang, Jianlin Zhang. Real-time processing method of 2D-FFT/IFFT for high-resolution image and hardware implementation [J]. Research of computer application, 2011, 28(11): 4376-4379
[2] Hui Dong, Qiuxi Jiang, Daping Bi. Implementation of Two-dimensional FFT in TMS320 DSP [J]. Radar and counterwork, 2002, 1: 34-38
[3] Texas Instruments. Multicore Programming Guide [R]. Texas: Texas Instruments, 2012
[4] Jun Yang, Hongwei Ding. Research and application of FFT processing system based on FPGA [M]. Science Press, 2012: 1-34
[5] Texas Instruments. TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor Data Manual [R]. Texas: Texas Instruments, 2012
[6] Texas Instruments. KeyStone Enhanced Direct Memory Access (EDMA3) Controller User Guide [R]. Texas: Texas Instruments, 2011
[7] Texas Instruments. SYS/BIOS Inter-Processor Communication (IPC) 1.25 User's Guide [R]. Texas: Texas Instruments, 2012
[8] Texas Instruments. TI SYS/BIOS v6.35 Real-time Operating System User's Guide [R]. Texas: Texas Instruments, 2013
[9] TI Network Developer's Kit (NDK) v2.21 User's Guide [R]. Texas: Texas Instruments, 2012
[10] Weiming Zhou. Multicore computing and programming [M]. Huazhong University of Science & Technology Press, 2009: 19-22

AUTHORS
Wende Wu, male, was born on February 4, 1990. He is pursuing a master's degree in electronic and communication engineering at the Institute of Optics and Electronics of Chinese Academy of Sciences, Chengdu, Sichuan Province, China.

- 66 http://www.sjie.org

