Transactions on Computer Science and Technology June 2015, Volume 4, Issue 2, PP.35-39
Performance Analysis of Code Optimization Based on TMS320C6678 Multi-core DSP

Li Zhou#, Yuxing Wei, Yun Xu
Institute of Optics and Electronics, Chinese Academy of Sciences, Sichuan Province, China
#Email: stevechaw@126.com
Abstract: In modern DSP development, the use of C/C++ as the development language has become a trend, and optimizing C/C++ programs is now an important part of DSP software development. This article describes the architectural features of the TMS320C6678 processor, explains the principles behind efficient C/C++ optimization methods, and analyzes the results.

Keywords: TMS320C6678, Program Optimization, Software Pipelining, Parallel Execution
1 INTRODUCTION

As the complexity of DSP task systems increases, making full use of the DSP's resources has become one of the key concerns of software development, and program optimization is an effective way to address it. In traditional embedded development, hardware-level optimization relies mainly on hand-written assembly instructions. Because of its special hardware structure, the TI C6000 series of DSPs can be optimized directly at the C language level, and speedups of tens of times are achievable. This allows C6000 development to meet real-time requirements during the C-level optimization phase. This paper starts from the C6000 hardware structure and discusses the principles, methods, and strategies of optimization.
2 PRINCIPLE OF OPTIMIZATION

2.1 Hardware Structure

As shown in Figure 1, each CPU core of the TMS320C6678 carries two data paths for data processing, DATAPATH A and DATAPATH B. Each data path has four functional units (.L, .M, .S, .D) and a register file of sixteen 32-bit registers. Each unit performs arithmetic or logic operations. In theory, eight instructions can run simultaneously on the CPU thanks to these independent parallel processing units, which is equivalent to eight traditional CPUs working in parallel [1][2].
FIG. 1 THE STRUCTURE OF DATAPATH
2.2 Code Development Flow to Increase Performance

As shown in Figure 2, code optimization normally proceeds in three stages. The first, the development stage, is where we write the C/C++ code without any particular knowledge of optimization. To improve performance we enter stage two, the refining stage, where intrinsics and compiler options are used to improve the function and performance of the code [3]. If performance still needs to improve, we enter the third stage, writing linear assembly: the time-critical sections are extracted from the C code and rewritten in linear assembly [4].
FIG. 2 CODE DEVELOPMENT FLOW
2.3 Memory Dependencies

To maximize the efficiency of C/C++ code, the C6000 compiler schedules as many instructions as possible in parallel. To schedule instructions in parallel, the compiler must determine the dependencies between instructions. A dependency means that one instruction must occur before another; for example, a variable must be loaded from memory before it can be used. Because only independent instructions can execute in parallel, dependencies inhibit parallelism. If the compiler cannot determine that two instructions are independent, it assumes a dependency and schedules the two instructions sequentially. It is often difficult for the compiler to determine whether instructions that access memory are independent. We can use the restrict keyword to indicate that a pointer is the only one that can point to a particular object in the scope in which it is declared, and the -pm and -mt compiler options to give the compiler further aliasing information.
2.4 Using Intrinsics

The intrinsic functions provided for the TMS320C6678 are special functions that map directly to inline assembly instructions and can quickly optimize C/C++ code. An intrinsic is called in the same way as an ordinary function, and its name begins with an underscore [5].
2.5 Software Pipelining

Software pipelining is a technique for scheduling the instructions of a loop so that multiple iterations of the loop execute in parallel. Loops account for most of the execution time of a typical program, so reducing the execution time of the loop body is extremely important for optimization. If the eight functional units of a C6678 core process instructions simultaneously, the running time of the entire loop body can be shortened in a manner similar to a hardware pipeline; this is known as software pipelining. As shown in Figure 3, suppose there are five instructions in one loop: A, B, C, D, and E. The shaded area is the kernel of the loop, where all five instructions execute at the same time. The area before the kernel is called the pipelined-loop prolog, and the area after it the pipelined-loop epilog [6].
FIG. 3 SOFTWARE PIPELINE
The degree of instruction parallelism reflects the quality of the pipeline: the higher the parallelism, the shorter the program's execution time. The highest degree of parallelism on the C6678 is eight instructions executing simultaneously. With the compiler option -O2 or -O3, the compiler applies software pipelining and collects the relevant information from the program [7]. Because loops are the critical performance areas in C/C++ code, we usually consider the following aspects to improve performance.

1) Trip Count. The trip count is the number of loop iterations executed; the trip counter is the variable that counts them, and when it reaches the trip count the loop terminates. If the compiler knows the number of iterations, it can produce faster, more compact code. When the compiler cannot determine whether the trip count exceeds the minimum required for pipelining, it produces two versions of the loop: ① if the trip count is less than the minimum, the non-pipelined version executes; ② if the trip count is equal to or greater than the minimum, the software-pipelined version executes. To help the compiler generate only the software-pipelined version, we can declare the minimum number of iterations with the MUST_ITERATE directive.

2) Loop Unrolling. Loop unrolling expands a small loop so that several iterations appear explicitly in the code. This increases the number of instructions available to execute in parallel, and is useful when a single iteration does not use all of the resources of the C6000 architecture. Unrolling can be performed in two ways: ① the compiler unrolls the loop automatically; ② we rewrite the code to unroll the loop by hand.
3 TEST AND DATA ANALYSIS We use an example to analyze the efficiency improvement of optimization. As shown in Figure 4, we use two for loops to add two arrays.
FIG. 4 ORIGINAL CODE
Without any optimization, completing this operation requires 517119 clock cycles.

TABLE I  CLOCK CYCLES UNDER DIFFERENT OPTIMIZATIONS

Origin    O3        Restrict    Loop Unroll    Intrinsics
517119    136231    42607       13860          8583
As shown in Table I, the different optimization methods give the following results.

After enabling the -O3 compiler optimization option, the runtime drops to 136231 clock cycles, about 3.8 times faster. With -O3 the compiler performs a series of transformations: it removes unused code and simplifies expressions and statements, and, most importantly, it carries out software pipelining.

After using the restrict keyword to eliminate the memory dependencies, the runtime drops to 42607 clock cycles, a further 3.20 times faster. With restrict, the memory accesses are known to be independent, which lets the compiler schedule as many instructions as possible in parallel and enhances the software pipelining.

After unrolling the loop, the runtime drops to 13860 clock cycles, 3.07 times faster again. The project contains two nested for loops, but software pipelining is applied only to the inner loop; unrolling the inner loop into a bigger loop improves the efficiency of the software pipelining.

As shown in Figure 5, after rewriting the loop body using intrinsics, the runtime drops to 8583 clock cycles, 1.61 times faster. When compiling C/C++ code, the compiler translates it into assembly language, and this translation brings redundancy; working at the assembly level ourselves reduces that redundancy considerably. In addition, the rewritten loop body uses intrinsics whose word-wide accesses make the memory traffic more efficient.
FIG. 5 REWRITE THE LOOP BODY
The data above show that software pipelining and the elimination of memory dependencies improve efficiency dramatically, while manually rewriting the C/C++ code in linear assembly brings a less pronounced improvement. A reasonable combination of the optimization methods above can greatly improve the efficiency of the code.
4 CONCLUSIONS

Optimizing code on the C6000 series DSP is clearly more convenient than on a traditional DSP, but fully exploiting the chip still requires experience and skill: developers must be familiar with the hardware architecture and also have some understanding of compiler theory. Moreover, reaching the peak of DSP performance (eight instructions running in parallel) is very difficult; under most circumstances only six or seven parallel instructions are reached. In actual development, if the optimization has already reached six or seven parallel instructions but the result is still far from the real-time requirement, spending a great deal of manpower to achieve eight parallel instructions is not economical. In this situation we should consider other technical improvements or adjustments in strategy in order to meet the requirements.
REFERENCES

[1] TMS320C66x DSP Cache User Guide, Literature Number: SPRUGY8, November 2010
[2] TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor, Literature Number: SPRS691C, February
[3] L. Karam, I. AlKamal, A. Gatherer, G. Frantz, D. Anderson, and B. Evans, "Trends in Multicore DSP
[4] Chengfei Gu, Xiangyang Li, Wenge Chang, Gaowei Jia and Haishan Tian, "Matrix Transposition Based on TMS320C6678," in GSMM, pp. 29-32, May 2012
[5] Enhanced Direct Memory Access Controller User Guide, Literature Number: SPRUGS5A, December 2011
[6] Guolong Zhang and Xiaosu Xu, "High-speed and Real-time Communication Controller for Embedded Integrated Navigation System," in IHMSC, pp. 331-334, August 2009
[7] Bernhard H.C. Sputh, Andrew Lukin and Eric Verhulst, "Transparent Programming of Many/multi Cores with OpenComRTOS: Comparing Inter 48-core SCC and TI 8-core TMS320C6678," in The 6th MARC Symposium, pp. 52-58, July 2012
AUTHORS

Li Zhou was born on December 2, 1991. He is pursuing a master's degree in Computer Applications Technology at the Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu, Sichuan Province, China.