
Algorithms for String Comparison on GPUs

Kenneth Skovhus Andersen, s062390
Lasse Bach Nielsen, s062377


Technical University of Denmark
Informatics and Mathematical Modelling

Supervisors: Inge Li Gørtz & Philip Bille

August, 2012


DTU Informatics Department of Informatics and Mathematical Modeling Technical University of Denmark Asmussens Alle, Building 305, DK-2800 Lyngby, Denmark Phone +45 4525 3351, Fax +45 4588 2673 reception@imm.dtu.dk www.imm.dtu.dk


Abstract

We consider parallelization of string comparison algorithms, including sequence alignment, edit distance and longest common subsequence. These problems are all solvable using essentially the same dynamic programming scheme over a two-dimensional matrix, where an entry locally depends on neighboring entries. We generalize this set of problems as local dependency dynamic programming (LDDP). We present a novel approach for solving any large pairwise LDDP problem using graphics processing units (GPUs). Our results include a new superior layout for utilizing the coarse-grained parallelism of the many-core GPU. The layout performs up to 18% better than the most widely used layout. To analyze layouts, we have devised theoretical descriptions, which accurately predict the relative speedup between different layouts on the coarse-grained parallel level of GPUs. To evaluate the potential of solving LDDP problems on GPU hardware, we implement an algorithm for solving longest common subsequence. In our experiments we compare large biological sequences, each consisting of two million symbols, and show a 40X speedup compared to a state-of-the-art sequential CPU solution by Driga et al. Our results can be generalized to several levels of parallel computation, e.g., using multiple GPUs.




Resumé

We consider parallelization of algorithms for string comparison, including sequence alignment, edit distance and longest common subsequence. These problems can all be solved with a two-dimensional dynamic programming matrix with local dependencies. We generalize these problems as local dependency dynamic programming (LDDP). We present a new approach for solving large pairwise LDDP problems using graphics processing units (GPUs). Furthermore, we have developed a new layout for utilizing the GPU's multiprocessors. Our new layout improves the running time by up to 18% compared to previous layouts. To analyze the properties of a layout, we have developed theoretical descriptions that accurately predict the relative running time improvement between different layouts. To assess the GPU's potential for solving LDDP problems, we have implemented an algorithm that solves longest common subsequence. In our experiments we compare long biological sequences, each consisting of two million symbols. We show more than a 40X speedup compared to a state-of-the-art sequential CPU solution by Driga et al. Our results can be generalized to several levels of parallelism using multiple GPUs.




Preface

This master's thesis has been prepared at DTU Informatics at the Technical University of Denmark from February to August 2012 under the supervision of associate professors Inge Li Gørtz and Philip Bille. It has an assigned workload of 30 ECTS credits for each of the two authors. The thesis deals with the subject of local dependency dynamic programming algorithms for solving large-scale string comparison problems on modern graphics processing units (GPUs). The focus is to investigate, combine and further develop existing state-of-the-art algorithms.

Acknowledgments

We would like to thank our supervisors for their guidance during the project. A special thanks to PhD student Morten Stöckel at the IT University of Copenhagen for providing the source code for sequential string comparison algorithms [1] and PhD student Hjalte Wedel Vildhøj at DTU Informatics for his valuable feedback.

Lasse Bach Nielsen

Kenneth Skovhus Andersen

August, 2012




Contents

Abstract
Resumé
Preface

1 Introduction
  1.1 This Report
  1.2 Previous Work
  1.3 Our Results

2 Local Dependency Dynamic Programming
  2.1 Definitions
  2.2 Previous Results
  2.3 Our Approach Based on Previous Results

3 Graphics Processing Units
  3.1 GPU Architecture
  3.2 Memory
  3.3 Best Practices

4 Parallel Layouts for LDDP
  4.1 Diagonal Wavefront
  4.2 Column-cyclic Wavefront
  4.3 Diagonal-cyclic Wavefront
  4.4 Applying Layouts to the GPU Architecture
  4.5 Summary and Discussion

5 Implementing LDDP on GPUs
  5.1 Grid-level
  5.2 Thread-level
  5.3 Space Constraints

6 Experimental Results for Grid-level
  6.1 Setup
  6.2 Results

7 Experimental Results for Thread-level
  7.1 Setup
  7.2 Results for Forward Pass Kernels
  7.3 Results for Backward Pass Kernel
  7.4 Part Conclusion

8 Performance Evaluation
  8.1 The Potential of Solving LDDP Problems on GPUs
  8.2 Comparing to Similar GPU Solutions

9 Conclusion
  9.1 Future Work

Bibliography

Appendices

A NVIDIA GPU Data Sheets
  A.1 NVIDIA Tesla C2070
  A.2 NVIDIA GeForce GTX 590

B Kernel Source Code
  B.1 Forward pass kernels


1 Introduction

We revisit the classic algorithmic problem of comparing strings, including solving sequence alignment, edit distance and finding the longest common subsequence. In many textual information retrieval systems, the exact comparison of large-scale strings is an important, but very time consuming task. As an example, the exact alignment of huge biological sequences, such as genes and genomes, has previously been infeasible due to computing and memory requirements. Consequently, much research effort has been invested in faster heuristic algorithms¹ for sequence alignment. Although these methods are faster than exact methods, they come at the cost of sensitivity. However, the rise of new parallel computing platforms such as graphics processing units is changing this scenario.

Graphics processing units (GPUs) are designed for graphics applications, have a large degree of data parallelism using hundreds of cores, and are designed to solve multiple independent parallel tasks. Previous results for accelerating sequence alignment using GPUs show a significant speedup, but are currently focused on aligning many independent short sequences, a problem at which the GPU architecture excels. Our focus is instead the need to solve large-scale exact pairwise string comparison of biological sequences containing millions of symbols. Our work is motivated by the increasing power of GPUs, and the challenge of making exact comparison of large strings feasible.

We consider parallelization of a general set of pairwise string comparison algorithms, all solvable using essentially the same dynamic programming scheme over a two-dimensional matrix. Taking two input strings X and Y of the same length n, these problems can be solved by computing all entries in an n × n matrix using a specific cost function. Computation of an entry in the matrix depends on the neighboring entries. We generalize all these problems as local dependency dynamic programming (LDDP). In general, LDDP problems are not trivially solved in parallel, as the local dependencies give a varying degree of parallelism across the entire problem space.

The parallelism of a GPU is exposed as a coarse-grained grid of blocks, where each block consists of finer-grained threads. We call these levels of parallelism the grid- and thread-level. We focus on layouts as a means to describe how LDDP problems can be mapped to the different levels on the GPU.

¹ One of the first heuristic algorithms for sequence alignment was FASTA, presented by Pearson and Lipman in 1988 [2].




1.1 This Report

We start by presenting a short description of our work. The following chapter gives a theoretical introduction to LDDP problems, including a survey of previous sequential and parallel solutions. Based on this, we select a set of state-of-the-art algorithms as a basis for our new GPU solution. We then introduce the GPU architecture and its programming model. This is followed by a chapter describing layouts for distributing LDDP problems to parallel compute units. The chapter also introduces our main result: a new layout, targeted at the GPU grid-level, for improving the performance of solving LDDP problems on GPUs. We then describe our implementation on the NVIDIA GPU architecture and the design considerations that have been made. Finally, the experimental results examine the practical performance of our LDDP implementation on GPUs.

1.2 Previous Work

Currently there are several GPU solutions for solving LDDP problems. Many of these implement the Smith-Waterman algorithm for the local alignment problem [3, 4, 5, 6, 7, 8, 9, 10]. These solutions achieve very high utilization of GPU multiprocessors by comparing a large number of short sequences, thereby solving multiple independent LDDP problems. Existing GPU solutions for longest common subsequence [11, 12] are able to compare rather large sequences by decomposing the LDDP problem into smaller subproblems, called tiles, and processing these in an anti-diagonal manner on the GPU multiprocessors. This widely used layout for mapping tiles onto compute units is referred to as Diagonal Wavefront. It assigns all tiles in an anti-diagonal for computation and continues to the next anti-diagonal once all tiles have been computed. Galper and Brutlag [13] presented another layout, Column-cyclic Wavefront,² which improves resource utilization compared to the widely used Diagonal Wavefront. Krusche and Tiskin [14] used a similar layout on the Bulk-Synchronous Parallelism model [15] for solving large-scale LDDP, mapping each column in the tiled LDDP problem to a processor.

1.3 Our Results

We present an algorithm for solving large pairwise LDDP problems using GPUs. Our main result is a new layout, Diagonal-cyclic Wavefront, for distributing a decomposed LDDP problem onto the GPU. Let an n × n matrix be decomposed into k × k equally sized tiles, each computable on a GPU multiprocessor. The mapping of tiles onto multiprocessors takes place on the grid-level, and the computation of entries inside a tile is done on the thread-level.

² Originally called Row Wavefront Approach (RWF).


1.3. Our Results To theoretically evaluate our new layout Diagonal-cyclic Wavefront with the widely used Diagonal Wavefront and Column-cyclic Wavefront, we examine the utilization of each. The utilization of a layout is the fraction of tiles computed where all multiprocessors are fully utilized to the total number of tiles. Theoretically, we show that our new layout Diagonal-cyclic Wavefront, in general, achieves the best utilization—depicted in Figure 1.1. Our experiments confirm the theoretical analysis; our new layout is superior to the widely used Diagonal Wavefront, and achieves a performance speedup up to 18% for any LDDP problem. Furthermore, for various input sizes, our new layout generally outperforms Column-cyclic Wavefront. We show, that our theoretical descriptions of the layouts give very accurate predictions on how the architecture behaves. In general, the new layout Diagonalcyclic Wavefront should always be used for distributing large LDDP problems onto the GPU grid-level. To explore the potential of solving LDDP problems on GPUs, we implement a set of kernels for solving longest common subsequence. A kernel defines the computation of individual tiles. For simplicity, we focus on Diagonal Wavefront for distributing computable entries to threads. To determine the best kernel parameters, e.g., tile size and number of threads, we conduct automatic performance tuning [16]. We present a scaling technique for spacereduction of cost values inside a tile, giving a better kernel performance. Scaling can be applied to a subset of LDDP problems. Besides, we have investigated a rather undocumented feature of the CUDA compiler, the volatile keyword. Depending on placement, we have observed up to 10% performance increase. To evaluate whether the problem is viable to solve on a GPU, we compare our results to state-of-the-art sequential CPU solutions. The experiments shows an average of 40X performance advantage over the sequential CPU solution by Driga et al. [17] for comparing strings larger than 219 using the NVIDIA Tesla C2070 GPU and a Intel i7 2.66GHz CPU. To the best of our knowledge, our LDDP implementation supports the largest input size n for GPUs in literature, up to 221 . Utilization of k ⇼ k tiles using p = 14 compute units

100

Utilization U (%)

90

80

70

60

50 28

Diagonal-cyclic Wavefront Column-cyclic Wavefront Diagonal Wavefront 42

56

70

84

98

112 126 140 154 168 182 196 210 224 238 252 266 280

Number of tiles (k )

Figure 1.1: Utilization of the three layouts.




2 Local Dependency Dynamic Programming

A large group of string comparison problems can be solved using essentially the same scheme over a two-dimensional dynamic programming matrix (DPM), where an entry (i, j) in the matrix depends on at most three neighboring entries. These include widely-used string problems in bioinformatics such as edit distance, sequence alignment and longest common subsequence. We refer to all these problems as local dependency dynamic programming (LDDP) problems.

2.1 Definitions

Let X and Y be the input strings with characters from a finite alphabet Σ. For simplicity we assume equal string lengths, i.e., |X| = |Y| = n. The character at position i in X is denoted X[i].

2.1.1 Local Dependency Dynamic Programming

Given the input strings X and Y, an LDDP problem can be solved by filling an (n + 1) × (n + 1) DPM, denoted c. The entry c[i, j] depends on at most three neighboring entries, c[i − 1, j − 1], c[i, j − 1] and c[i − 1, j], and the characters X[i] and Y[j]. We let parent(i, j) denote the neighboring entries that determine c[i, j]. In general, the recurrence for solving an LDDP problem is:

c[i, j] =
  \begin{cases}
    b(i, j) & \text{if } i = 0 \lor j = 0, \\
    f(X[i], Y[j], \mathrm{parent}(i, j)) & \text{if } i, j > 0
  \end{cases}
  \qquad (2.1)

The function b initializes the north and west border of the DPM c in time O(1) for each entry. The function f(X[i], Y[j], parent(i, j)) computes the solution to the subproblem c[i, j] in time O(1), as it depends on three neighboring entries and the input characters X[i] and Y[j]. The forward pass computes the length of the optimal path by filling the DPM, and the backward pass finds the optimal path by backtracking through the DPM.
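As an illustration of how the recurrence instantiates for a concrete problem, the following sketch gives b and f for unit-cost edit distance (Levenshtein distance). The function names and the unit costs are our own illustrative assumptions, not part of the thesis implementation.

```cpp
#include <algorithm>

// Border initialization b(i, j): the edit distance between a prefix and the
// empty string is the length of that prefix.
int b_edit(int i, int j) { return (i == 0) ? j : i; }

// Cost function f: parent(i, j) supplies nw = c[i-1, j-1], w = c[i, j-1]
// and n = c[i-1, j]. A mismatch costs 1 (substitution), as do insert/delete.
int f_edit(char xi, char yj, int nw, int w, int n)
{
    int substitution = nw + (xi == yj ? 0 : 1);
    return std::min(substitution, std::min(w, n) + 1);
}
```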

2.1.2 Longest Common Subsequence

For simplicity, we use the longest common subsequence (LCS) problem as a case study, although all techniques and results presented generalize to any LDDP problem. We define the problem as follows: Let X[i, j] denote the substring of X from position i to j. A subsequence of X is any string Z obtained by leaving zero or more elements out of X. We say that Z is a common subsequence of X and Y if Z is a subsequence of both X and Y. The longest common subsequence problem for two input strings X and Y is to find a maximum-length common subsequence of X and Y.

Given two strings X and Y where |X| = |Y| = n, the standard dynamic programming solution to LCS fills an (n + 1) × (n + 1) dynamic programming matrix c using the following recurrence [18]:

c[i, j] =
  \begin{cases}
    0 & \text{if } i = 0 \lor j = 0, \\
    c[i-1, j-1] + 1 & \text{if } i, j > 0 \land X[i] = Y[j], \\
    \max(c[i, j-1], c[i-1, j]) & \text{if } i, j > 0 \land X[i] \neq Y[j]
  \end{cases}
  \qquad (2.2)

The length of the LCS between X[1, i] and Y[1, j] is c[i, j]; therefore the length of the LCS of X and Y is c[n, n]. To compute the forward pass, the algorithm uses O(n²) time and space. The solution path, and thus the LCS, is deduced by backtracking from c[n, n] to some c[i₀, j₀] where i₀ = 0 ∨ j₀ = 0. For a given entry c[i, j], the backward pass determines in O(1) time which of the three values in parent(i, j) was used to compute c[i, j]. The complete LCS is reconstructed in O(n) time when all cost values in the DPM are available.
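As a concrete reference point, the following is a minimal sequential sketch of the LCS forward pass (Recurrence 2.2) in C++. The function name and the full-matrix representation are illustrative only; the GPU kernels described in Chapter 5 use tiled, space-reduced variants.

```cpp
#include <string>
#include <vector>
#include <algorithm>

// Sketch of the LCS forward pass: fills the (n+1) x (n+1) DPM c and returns
// c[n][n], the length of an LCS of X and Y. Uses O(n^2) time and space; the
// backward pass (not shown) would backtrack from c[n][n] to recover an LCS.
int lcs_forward_pass(const std::string& X, const std::string& Y)
{
    const std::size_t n = X.size();            // we assume |X| = |Y| = n
    std::vector<std::vector<int>> c(n + 1, std::vector<int>(n + 1, 0));

    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            if (X[i - 1] == Y[j - 1])          // X[i] = Y[j] (1-indexed in the text)
                c[i][j] = c[i - 1][j - 1] + 1;
            else
                c[i][j] = std::max(c[i][j - 1], c[i - 1][j]);
        }
    }
    return c[n][n];
}
```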

2.2 Previous Results

We start by presenting an overview of previous results for solving LDDP problems in general. We divide the findings into sequential and parallel solutions.

2.2.1 Sequential Solutions

Wagner and Fischer 1974 [19] presented one of the first dynamic programming solutions to the Levenshtein distance problem using O(n²) time and space. We call this a full-matrix algorithm, as it stores the complete DPM. Needleman-Wunsch 1970 [20] and Smith-Waterman 1981 [21] presented other examples of full-matrix algorithms for LDDP problems.

Hirschberg 1975 [22] improved the space usage at the cost of increased time for the backward pass, using a divide-and-conquer approach that combines a standard and a reverse application of the linear-space cost-only variation to find a partitioning midpoint. Although the original solution was presented for LCS, Myers and Miller [23] generalized it to sequence alignment in 1988. The algorithm uses O(n²) time, O(n) space and O(n²) recomputations.

Driga et al. 2006 [17] presented their cache-aware Fast Linear-Space Alignment (FLSA). It divides the DPM into k² equally sized tiles, as shown in Figure 2.1. All tiles share boundaries of intersecting cost values. The time-space tradeoff parameter k is selected so that the problem space in a tile can be computed using full-matrix. The forward pass fills the boundaries. The backward pass uses the boundaries to compute the optimal path by processing only tiles that intersect the optimal path. The algorithm implements a backward pass optimization which reduces the size of the tiles according to the entry point of the optimal path. FLSA uses O(n²) time, O(nk) space and O(n²/k) recomputations.

Chowdhury and Ramachandran 2006 [24] also tiled the DPM, but reduced the I/O bound by splitting the DPM into four tiles and recursively computing each tile. Unlike Driga et al. [17], the algorithm is cache-oblivious [25]. The algorithm uses O(n²) time and O(n) space. As the backward pass intersects at most 3/4 of the tiles, it performs O(n²) recomputations.

Bille and Stöckel 2012 [1] combined the k² tiles from Driga et al. [17] with the recursive and cache-oblivious approach from Chowdhury and Ramachandran [24]. Experiments showed superior performance over Chowdhury and comparable performance to Driga. The algorithm uses O(n²) time, O(nk) space and O(n²/k) recomputations.

All presented algorithms solve LDDP string comparison problems in general. For specific LDDP problems, specialized solutions exist that improve the time or space bounds by restricting the problem in terms of alphabet size or cost function, or by exploiting properties of a specific LDDP problem, see e.g., [26, 27, 28, 29] and the surveys [30, 31].

Figure 2.1: Decomposition of LDDP problems using tiled linear-space reduction as presented by Driga et al. [17]. The DPM is divided into k² equally sized tiles of size t = n/k, sharing boundaries of intersecting cost values. The forward pass of a tile receives as input the boundaries on the north and west, and outputs the south and east boundaries. The backward pass uses the stored boundaries to compute the optimal path by processing only tiles that intersect the optimal path. In a parallel context, the numbers inside each tile refer to the order in which the tiles can be calculated during a forward pass.




2.2.2 Parallel Solutions

Several theoretical results for LDDP are based on the Parallel Random Access Machine (PRAM) model [32], which ignores memory hierarchy, memory latency and the cost of synchronization. As an example, Mathies [33] shows an algorithm for determining edit distances for two strings of size m and n in O(log m log n) time using mn processors. Although these results show the extent of parallelism, their assumption that the number of processors is in the order of the problem size makes the algorithms impractical.

In general, to solve an LDDP problem in parallel, the problem space must be decomposed and distributed to compute units. Let an n × n DPM be decomposed into k² equally sized square tiles of size t = n/k. A layout defines the order in which the computation of tiles in the DPM is performed.

Parallel Solutions for CPU

Galper and Brutlag 1990 [13] presented the layout Column-cyclic Wavefront (originally called Row Wavefront Approach) for efficiently solving LDDP problems on a shared-memory multiprocessor. The layout is examined and analyzed in Chapter 4.

Krusche and Tiskin 2006 [14] used a similar layout as Galper and Brutlag to find the length of the longest common subsequence using the Bulk Synchronous Parallelism model (BSP) [15]. Their algorithm decomposes the DPM into rectangular tiles similar to Driga et al. [17], and sequentially computes the values inside a tile.

Driga et al. 2006 [17] presented a parallel version of their linear-space FLSA algorithm for CPU multicore systems. The algorithm computes the tiled DPM by advancing in a diagonal wavefront pattern, called the Diagonal Wavefront layout. The computation flow is shown in Figure 2.1. Their experiments showed a linear speedup up to 8 processors for sequences where n < 2^19.

Chowdhury and Ramachandran 2008 [34] showed a general cache-efficient recursive multicore algorithm for solving LDDP problems. They considered three types of caching models for chip multiprocessors (CMP), including private, shared and multicore caches. Performance tests for two LDDP problems, pairwise sequence alignment with affine gap cost and median of three sequences (also with affine gap penalty), solved using their CMP algorithm, showed a 5 times speedup on an 8-core multiprocessor.

Diaz et al. 2011 [35] implemented Smith-Waterman and Needleman-Wunsch on the Tilera Tile64 processor, which has 64 cores. They based their parallel algorithm on FLSA by Driga et al. [17]. Their implementation achieved up to 15 times performance increase compared to the same algorithm on an x86 multicore architecture.



Parallel Solutions for GPU

Currently there are several GPU solutions to LDDP problems, but we found the Smith-Waterman algorithm for local alignment to be the most explored. The most important are listed here:

Liu, W. et al. 2006 [3, 4] presented the first solution to Smith-Waterman on a GPU, and achieved a very high utilization of GPU multiprocessors by comparing a large number of independent short sequences. This means that they solve multiple independent LDDP problems with no dependencies on the GPU grid-level. To reduce space when computing the optimal length of an n² cost matrix, they only store three separate buffers of length n holding cost values for the most recently calculated diagonals; we call this linear space reduction three cost diagonals. Similar solutions were presented in 2008–2009 [5, 6, 7].

Liu, Y. et al. 2010 [8, 9] presented CUDASW++, reported to perform up to 17 billion cell updates per second (GCUPS) on a single GPU (GeForce GTX 280) for solving Smith-Waterman. We note that CUDASW++ uses the Column-cyclic Wavefront layout on the thread-level. No backtracking is made, and their algorithm does not generalize to large LDDP.

Although many GPU solutions for Smith-Waterman were found, they are only able to compare strings of size n < 2^16. As a result, they are not applicable for comparing the large biological sequences considered in this report. For solving large LDDP problems, we found two interesting GPU implementations of longest common subsequence (LCS):

Kloetzli et al. 2008 [11] presented a combined CPU/GPU solution for solving LCS of large sequences (up to n ≤ 2^20). They showed a five-fold speedup over the cache-oblivious single processor algorithm presented by Chowdhury and Ramachandran [24]. The experiments were performed on an AMD Athlon 64 and an NVIDIA G80 family GTX GPU.

Deorowicz 2010 [12] calculates the length of the LCS of large sequences. The algorithm decomposes the problem space into tiles like Driga et al. [17], and calculates the tiles using the Diagonal Wavefront layout. The experiments show significant speedups over their own serial CPU implementation of the same algorithm for n = 2^16. Unfortunately no comparison is made with any known CPU solutions, and, despite having tried, we have not been able to obtain the source code.

2.3 Our Approach Based on Previous Results

We now select relevant results for our further investigations. As a basis for our LDDP solution, we use the tiling approach by Driga et al. [17] to achieve a linear-space reduction and decomposition of the problem space. Furthermore, we wish to investigate the properties and efficiencies of the layouts Diagonal Wavefront [12, 17] and Column-cyclic Wavefront [13, 14]. The three cost diagonals presented by Liu, W. et al. [3, 4] are explored for space reduction.



3 Graphics Processing Units

In this chapter we introduce relevant aspects of graphics processing unit architectures and the programming model exposing the hardware. Where central processing units (CPUs) are highly optimized for solving a wide range of single-threaded applications, GPUs are built for graphics applications having a large degree of data parallelism. Graphics applications are also latency tolerant, as the processing of each pixel can be delayed as long as frames are processed at acceptable rates. As a result, GPUs can trade off single-thread performance for increased parallel processing. As a consequence, each processing element on the GPU is relatively simple and hundreds of cores can be packed per die [36]. There are currently several frameworks exposing the computation power of GPUs, including ATI Stream, Open Computing Language (OpenCL) and NVIDIA's Compute Unified Device Architecture (CUDA) [37]. For our implementation we chose to work with NVIDIA CUDA.

3.1 GPU Architecture

A GPU is composed of a number of streaming multiprocessors (SMs), each having a number of compute units called streaming processors (SPs) running in lockstep. The number of streaming multiprocessors differs between GPU models, but as an example the NVIDIA Tesla GPUs have 14 SMs, each with 32 SPs, totaling 448 SPs. See hardware specifications in Appendix A. The architecture of a GPU is akin to Single Instruction Multiple Data (SIMD); however, a GPU refines the SIMD architecture into Single Instruction Multiple Thread (SIMT). Instructions are issued to a collection of threads called a warp. SIMT allows individual execution paths of threads to diverge as a result of branching. If threads within a warp diverge, the warp will serialize each path taken by the threads [37]. Compared to CPU threads, threads on a GPU are lightweight and handled in hardware. Register memory for individual threads is kept in the SM register memory, making hardware-based context switching possible at no cost. Warps are not scheduled to run until data is available to all threads within the warp, making it possible to hide memory latency.




3.1.1 CUDA Programming Model

The programming model provides two levels of parallelism, coarse- and fine-grained. On the coarse-grained grid-level, partitioning of work is done by dividing the problem space into a grid consisting of a number of blocks. A block is mapped to a streaming multiprocessor and represents a task which can be solved independently. On the fine-grained thread-level, concurrent threads are assigned to a block and provide data and instruction parallelism. The levels of parallelism are depicted in Figure 3.1.

Figure 3.1: Taxonomy of the CUDA work partitioning hierarchy. A kernel of size (gridDim, blockDim) and a set of instructions is executed by a grid of gridDim blocks; each block consists of blockDim threads, grouped into warps of 32 threads that execute the instructions.

A kernel function sets the partitioning parameters and defines the instructions to be executed. If the available resources of an SM allow it, multiple blocks can be allocated on an SM. This way, the hardware resources are better utilized.

Synchronization Primitives

Each level has different means of synchronizing.

Grid-level: No specific synchronization primitive is available to handle synchronization between blocks, as concurrent blocks represent independent tasks. Implicit synchronization can, however, be achieved by a number of serialized kernel calls.

Thread-level: CUDA only supports barrier synchronization for all threads within a block.
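To make the partitioning and synchronization primitives concrete, the following minimal CUDA sketch (kernel and buffer names are our own, not from the thesis code) launches a grid of blocks and uses the two synchronization mechanisms just described.

```cuda
#include <cuda_runtime.h>

// Each block is an independent task; each thread handles one element.
// blockIdx selects the block within the grid, threadIdx the thread within
// the block, i.e., the coarse- and fine-grained levels described above.
__global__ void exampleKernel(int* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1;
    __syncthreads();   // thread-level barrier for all threads in this block
}

int main()
{
    const int numBlocks = 14, threadsPerBlock = 32;
    int* d_data;
    cudaMalloc(&d_data, numBlocks * threadsPerBlock * sizeof(int));
    cudaMemset(d_data, 0, numBlocks * threadsPerBlock * sizeof(int));

    // Two serialized kernel calls: the second only starts when the first has
    // finished, which gives the implicit grid-level synchronization.
    exampleKernel<<<numBlocks, threadsPerBlock>>>(d_data);
    exampleKernel<<<numBlocks, threadsPerBlock>>>(d_data);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```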




3.2 Memory

GPU memories are shown in Figure 3.2. Registers are used as private memory for threads, while all threads within a block have access to shared memory. All threads across blocks have access to global memory and the read-only texture and constant memory.

Two levels of caches, L1 and L2, exist. Both caches are remarkably smaller than typical caches on CPUs. Each SM is equipped with its own L1 cache that resides in shared memory. The L2 cache is shared between all SMs as a fully coherent unified cache, with cache lines of 128 bytes. As shown in Figure 3.2, the L1 and L2 caches are used for global memory. The special caches, texture and constant memory, can be mapped to specific parts of global memory and provide specialized cached access patterns to these parts of global memory. The CUDA memory types and their traits are shown in Table 3.1.

For accessing global memory, the number of memory transactions performed will be equal to the number of L2 cache lines needed to completely satisfy the request. Shared memory is divided into equally sized memory banks which can be accessed simultaneously. Concurrent memory accesses which fall into distinct banks can be handled simultaneously, whereas concurrent accesses to the same bank will cause serialized access, referred to as bank conflicts. Access time for the texture and constant caches depends on access patterns, but the constant cache is stated to be as fast as reading from a register, as long as all threads read the same address [38].

Figure 3.2: CUDA memory spaces accessible from a streaming processor (SP): registers, shared memory with the L1 cache, the shared L2 cache, and global, texture and constant memory with their caches. Please note that for simplicity only a single SM and a single SP are shown.

Type     | Location on SM | Cached | Access | Scope              | Access latency (non-cached)
---------|----------------|--------|--------|--------------------|----------------------------
Register | yes            | n/a    | R/W    | 1 thread           | 0-1
Shared   | yes            | n/a    | R/W    | threads in block   | 1
Local    | no             | yes    | R/W    | 1 thread           | 400-600
Global   | no             | yes    | R/W    | all threads + host | 400-600
Constant | no             | yes    | R      | all threads + host | 400-600
Texture  | no             | yes    | R      | all threads + host | 400-600

Table 3.1: Memory types in CUDA. n/a stands for "not applicable", R for read and W for write. The documented access latencies are given in cycles. [38]




3.3 Best Practices

A number of best practices to effectively exploit the GPU architecture are described by NVIDIA [38]. The most important are presented here:

Shared memory should be used when possible, as shared memory is faster than global memory. Values which are accessed often should be placed in shared memory.

Global memory to compute ratio should be maximized, as global memory access is slow, while parallel computation is fast.

Minimize kernel branch divergence, because divergent branches mean serialized execution for each divergent branch.



4 Parallel Layouts for LDDP

We will now present how LDDP problems can be computed in parallel. From Recurrence 2.1, an entry (i, j) in the DPM is computable if the neighboring entries to the west, north-west and north have been computed. Thus, for any entry (i, j) to be computable, the data dependencies (i_d, j_d) are those where 0 ≤ i_d < i ∧ 0 ≤ j_d < j. Due to these data dependencies, the order in which the DPM can be computed in parallel follows anti-diagonal lines, a pattern known as wavefront parallelism [17]. A wavefront W_d consists of the set of entries (i, j) where d = i + j + 1, and the number of entries in the set is denoted |W_d|. The dependencies give a varying degree of parallelism across the DPM, an inherent property of all LDDP problems.

In our context and modeling of the problem, parallel computations are executed as a sequence of steps, each consisting of computations and communications followed by a barrier synchronization. Once all compute units reach the barrier in step s_i they proceed to the next step s_{i+1}. The total number of steps to complete a computation is denoted S. This corresponds to the term supersteps in Valiant's BSP model [15].

To partition the LDDP problem for parallel computation and reduce the space usage, we use Driga's tiling algorithm [17]. The n × n DPM is decomposed into k² equally sized tiles of size t = n/k. The data dependency for entries applies to tiles as well. After decomposing the LDDP problem, computable tiles are mapped onto compute units. A scheme for this mapping is called a layout. A layout defines a series of steps, each computing a set of tiles. As the degree of parallelism varies, so will resource utilization across the series of steps a layout defines.

When applying a layout, we distinguish between tiles computed in a step where all compute units are fully utilized, and tiles computed in a step where compute units are under-utilized. These notions are denoted T_{U=1} and T_{U<1} respectively, and called utilization parameters. The resource utilization U of a layout is the fraction of tiles computed where all compute units are fully utilized to the total number of tiles: U = T_{U=1}/k².

From a utilization perspective, the best possible layout using p processors is only under-utilizing the processors in the wavefronts in the north-west and south-east corners where |W_d| < p, i.e., the wavefronts with 1, 2, ..., p − 1 tiles. Each corner consists of \sum_{d=1}^{p-1} |W_d| = (p² − p)/2 tiles. For both corners, the best-case number of tiles under-utilizing resources is T_{U<1} = p² − p. As our focus is comparison of large strings on the GPU architecture, we assume k is an order of magnitude larger than the number of compute units p, so k ≫ p, where p is constant.

We now describe three layouts with a different flow of computations: the widely used Diagonal Wavefront, Column-cyclic Wavefront [13, 14] and our new Diagonal-cyclic Wavefront.



4.1 Diagonal Wavefront

The widely used Diagonal Wavefront (DiaWave) layout processes the tiled DPM in an anti-diagonal manner, where a wavefront W_d consists of all tiles (i, j) on the anti-diagonal d. W_d is computable when W_{d−1} has been processed, and a barrier synchronization prevents W_{d+1} from being processed before W_d is finished. Since all entries in W_{d−1} have been computed, entries in W_d can be computed independently. This layout is depicted in Figure 4.1 for k = 6 and a number of compute units p.

Figure 4.1: Diagonal Wavefront layout. A DPM subdivided into k × k tiles, here k = 6. The number inside each tile denotes its parallel computation step. For clarity, the figure shows an ordered execution of tiles in a W_d, although they could be computed in any order. The gray color shows tiles that are computed in a step where some compute units are under-utilized. (a) p = 3: 16 steps are used to compute the DPM, with utilization U = 24/36 = 67%. (b) p = 4: 14 steps and U = 56%. (c) p = n = 6: 11 steps and U = 6/36 = 17%; only step 6 is not under-utilizing compute units.

A complete wavefront W_d of length |W_d| can be computed in ⌈|W_d|/p⌉ parallel computation steps. As there are 2k − 1 diagonals in a k × k matrix, the number of steps to compute the entire DPM is:

S = \sum_{d=1}^{2k-1} \left\lceil \frac{|W_d|}{p} \right\rceil

As |W_d| = d for 1 ≤ d ≤ k, then by symmetry the number of steps for all W_d where d ≠ k is 2 \sum_{d=1}^{k-1} \lceil d/p \rceil. The complete number of steps is:

S = 2 \sum_{d=1}^{k-1} \left\lceil \frac{d}{p} \right\rceil + \left\lceil \frac{k}{p} \right\rceil \qquad (4.1)

Resource Utilization

When the length of a wavefront is not divisible by p, under-utilization will occur. As each wavefront is divided into ⌈|W_d|/p⌉ steps, the utilization parameter T_{U<1} for each wavefront W_d is |W_d| mod p. Starting from the first wavefront and advancing p wavefronts, the number of under-utilized tiles¹ is \sum_{d=1}^{p} (|W_d| \bmod p) = (p² − p)/2. This number of under-utilized tiles will periodically continue to occur when advancing p wavefronts further, see Figure 4.1 (a). This gives a total of 2⌊k/p⌋ full periods, where (p² − p)/2 tiles are computed while under-utilizing. The remaining k mod p wavefronts will by symmetry give (k mod p)² tiles, see Figure 4.1 (b). For a full computation of k² tiles, the total number of tiles where DiaWave under-utilizes the available resources is:

T_{U<1} = \lfloor k/p \rfloor (p^2 - p) + (k \bmod p)^2 \qquad (4.2)

¹ For simplicity, the phrase "under-utilized tiles" is used for the number of tiles computed in a step where under-utilization occurs.

4.2 Column-cyclic Wavefront

Galper and Brutlag [13] presented the Column-cyclic Wavefront (ColCyclic) layout, where the columns in the tiled DPM are divided into column groups spanning p columns. The layout is shown in Figure 4.2. Each column group G is calculated as a wavefront W_d where |W_d| ≤ p. When column i in the group is completed, column i + p in the next group will start. In cases where a compute unit can be explicitly mapped to a specific column and values can be stored locally between steps, ColCyclic minimizes memory transfers by keeping the values from tile (i, j) when advancing to tile (i, j + 1). Krusche and Tiskin [14] exploit this potential by mapping the layout to the BSP model [15].

Figure 4.2: Column-cyclic Wavefront layout, where the vertical line shows the separation of column groups spanning p columns. p0–p3 indicate the column mapping to compute units as presented by Krusche and Tiskin [14]. Tiles computed while under-utilizing the compute units are shown in gray. (a) p = 3, using 14 steps and U = 83%. (b) p = 4, using 13 steps and U = 56%; note the suboptimal utilization for the last column group G_sub. (c) Illustrates how to calculate the number of steps when k is not a multiple of p.

When k is not a multiple of p, as in the example in Figure 4.2 (b), the utilization is suboptimal in the last column group G_sub, since the number of columns |G_sub| = (k mod p) < p. The number of steps needed by ColCyclic is described by two cases: When k is a multiple of p, we have k/p column groups and k steps are taken for each, giving k²/p = k⌈k/p⌉ steps. Due to the in-column wavefront, an additional p − 1 steps are needed to get the total number of steps. When k is not a multiple of p, we treat the last column group G_sub as if the number of columns |G_sub| = p, and get k⌈k/p⌉ + p − 1 steps. We now need to adjust the steps to reflect the fact that |G_sub| < p. The number of superfluous steps taken by the initial assumption is p − (k mod p). This gives k⌈k/p⌉ + p − 1 − (p − (k mod p)) steps. From the above two cases, the total number of steps for ColCyclic is:

S = k \left\lceil \frac{k}{p} \right\rceil + p - 1 -
  \begin{cases}
    0 & \text{if } k \bmod p = 0, \\
    p - (k \bmod p) & \text{if } k \bmod p \neq 0
  \end{cases}
  \qquad (4.3)

Resource Utilization

When k is a multiple of p, ColCyclic achieves the optimal utilization, T_{U<1} = p² − p. However, when that is not the case, G_sub has a negative impact on U. The number of tiles computed in steps where the compute units are not fully utilized in G_sub is (k − p)(k mod p). Thus, the total number of tiles computed where ColCyclic under-utilizes the resources is:

T_{U<1} = p^2 - p + (k - p)(k \bmod p) \qquad (4.4)

4.3 Diagonal-cyclic Wavefront

We present a new Diagonal-cyclic Wavefront² (DiaCyclic) layout which, in the general case, improves utilization. The layout allows compute units to cyclically continue to W_{d+1} even though W_d is not completely finished. The layout imposes the constraint that the series of steps follows a consecutive order for wavefronts where |W_d| ≥ p. When the remaining tiles on W_d become fewer than p, the superfluous compute units will continue on the next wavefront W_{d+1}, where the same consecutive ordering is honored. It follows trivially that the local dependencies hold for this layout under the given constraints. The DiaCyclic layout is shown in Figure 4.3.

DiaCyclic will take 2(p − 1) steps, computing p² − p tiles, for the wavefronts where |W_d| < p, the north-west and the south-east corners.

Figure 4.3: Diagonal-cyclic Wavefront layout. Tiles computed while under-utilizing are shown with a gray color. (a) p = 3, using 14 steps and U = 83%. The arrows indicate the continuation of steps 5 and 10 on the next diagonal. (b) p = 4, using 12 steps and U = 61%.

² A.k.a. Snake.


The remaining part is where the cyclic approach is used; it has k² − (p² − p) tiles, computed by p processing units, giving (k² − p² + p)/p steps. To cope with the case where k² − p² + p is not a multiple of p, we ceil the expression. This gives the total number of steps for Diagonal-cyclic Wavefront:

S = 2(p - 1) + \left\lceil \frac{k^2 - p^2 + p}{p} \right\rceil = \left\lceil \frac{k^2}{p} \right\rceil + p - 1 \qquad (4.5)

Resource Utilization

For wavefronts where |W_d| ≥ p, the number of tiles computed while the compute units are fully utilized is T_{U=1} = ⌊(k² − p² + p)/p⌋ · p. When k² is not a multiple of p, the last step in the cyclic approach will under-utilize the compute units, and the number of tiles in this last step is k² mod p. The total number of tiles computed where Diagonal-cyclic Wavefront under-utilizes resources is:

T_{U<1} = p^2 - p + (k^2 \bmod p) \qquad (4.6)

4.4 Applying Layouts to the GPU Architecture

When applying the layouts to the GPU, we map tiles to a grid of blocks, and the entries in a tile are computed by a thread block. The two levels of parallelism differ architecturally, and we consider these differences in the context of our layout descriptions.

Grid-level

Depending on the resource usage of a block, multiple blocks might be executed concurrently on a streaming multiprocessor (SM). We let B denote the number of concurrent blocks per multiprocessor. To reflect this fact, we define the number of compute units p that are used in our layout descriptions as:

p = B · number of SMs

As an example, if we have 14 multiprocessors available, each capable of computing 2 concurrent blocks, under-utilization will occur when fewer than 28 tiles are computed in a step. Hence, this definition of p makes the expressions for under-utilization applicable for all values of B. On the GPU grid-level there is currently no possibility of mapping a block to a specific processor, and between steps nothing can be kept in shared memory, thus all blocks will have the exact same starting point. This means the calculation of a tile can be assumed to take constant time, making it possible to use steps to predict running time. It also means the utilization U gives a way of comparing differences in running times for the layouts.




4.5

Summary and Discussion

The layouts Column-cyclic Wavefront (ColCyclic) and Diagonal-cyclic Wavefront (DiaCyclic) take steps using a cyclic distribution of tiles, which improves utilization compared to Diagonal Wavefront (DiaWave). ColCyclic and DiaCyclic differ when it comes to data locality, as ColCyclic attains a higher degree of data locality than DiaCyclic. DiaCyclic, on the other hand, will in most cases take fewer parallel computation steps to complete than ColCyclic. Table 4.1 summarizes the number of parallel computation steps S each layout takes. Steps are also closely related to the utilization U, as fewer steps for a given k and p give higher utilization.

Layout                    | Steps S (forward pass)
--------------------------|----------------------------------------------------------------------
Diagonal Wavefront        | 2 \sum_{d=1}^{k-1} \lceil d/p \rceil + \lceil k/p \rceil
Column-cyclic Wavefront   | k \lceil k/p \rceil + p - 1 - (0 if k mod p = 0, else p - (k mod p))
Diagonal-cyclic Wavefront | \lceil k^2/p \rceil + p - 1

Table 4.1: Comparison of different layouts for decomposing k × k tiles in the forward pass of LDDP problems. Steps S correspond to the term supersteps in Valiant's BSP model [15].

Table 4.2 gives an overview of the number of tiles where the three layouts under-utilize the available resources, including best-case and worst-case for constant p. In general, the under-utilized corners |W_d| < p will have less impact on utilization when k grows. The expressions show that the relative T_{U<1}-utilization is periodic in k mod p. For the layout DiaWave, T_{U<1} grows as k grows. Within the period k mod p, DiaCyclic will in the worst case have T_{U<1} = p² − 1. The utilization parameter T_{U<1} for ColCyclic spans DiaWave and DiaCyclic in the period, where in the worst case T_{U<1} equals that of DiaWave and in the best case equals that of DiaCyclic. In general, the following is maintained for all k and p:

U_DiaWave ≤ U_ColCyclic ≤ U_DiaCyclic

Layout                    | T_{U<1}                                    | Best-case | Worst-case
--------------------------|--------------------------------------------|-----------|-----------
Diagonal Wavefront        | (p^2 - p)\lfloor k/p \rfloor + (k mod p)^2 | Θ(k)      | Θ(k)
Column-cyclic Wavefront   | p^2 - p + (k - p)(k mod p)                 | Θ(1)      | Θ(k)
Diagonal-cyclic Wavefront | p^2 - p + (k^2 mod p)                      | Θ(1)      | Θ(1)

Table 4.2: Expressions showing the number of tiles where the resources are under-utilized, expressed by the constant number of processors p and varying k × k tiles for the three layouts.

Furthermore, when k > 2p, then U_DiaWave < U_DiaCyclic. The utilization U for the three layouts, for p = 14 and varying values of k, is shown in Figure 4.4. The period k mod p is clear from the figure, and as expected we see that the upper limit of ColCyclic is DiaCyclic and its lower limit is DiaWave. From a theoretical point of view, we have shown that our Diagonal-cyclic Wavefront layout gives the overall best resource utilization, which on the grid-level means the best running time, as explained in Section 4.4. On this level of parallelism, Column-cyclic Wavefront only gives as good utilization as Diagonal-cyclic Wavefront when p divides k. The widely used layout Diagonal Wavefront gives the overall worst utilization.

Figure 4.4: Utilization U (%) of the three layouts for k × k tiles, 28 ≤ k ≤ 280, using p = 14 compute units. For other values of p, the plot will show the same tendency, however the period will change.
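The closed-form expressions above are straightforward to evaluate numerically. The following small C++ sketch (our own illustration based on Table 4.1 and Table 4.2, not part of the thesis code) computes the steps S and the utilization U = (k² − T_{U<1})/k² for the three layouts, and can be used to reproduce curves like the one in Figure 4.4.

```cpp
#include <cstdio>

// Ceiling division for non-negative integers.
static long ceilDiv(long a, long b) { return (a + b - 1) / b; }

// Steps S for the forward pass (Table 4.1).
long stepsDiaWave(long k, long p) {
    long s = ceilDiv(k, p);
    for (long d = 1; d < k; ++d) s += 2 * ceilDiv(d, p);
    return s;
}
long stepsColCyclic(long k, long p) {
    long s = k * ceilDiv(k, p) + p - 1;
    if (k % p != 0) s -= p - (k % p);
    return s;
}
long stepsDiaCyclic(long k, long p) { return ceilDiv(k * k, p) + p - 1; }

// Tiles computed in under-utilized steps, T_{U<1} (Table 4.2).
long underDiaWave(long k, long p)   { return (p * p - p) * (k / p) + (k % p) * (k % p); }
long underColCyclic(long k, long p) { return p * p - p + (k - p) * (k % p); }
long underDiaCyclic(long k, long p) { return p * p - p + (k * k) % p; }

int main() {
    const long p = 14;                      // compute units, as in Figure 4.4
    for (long k = 28; k <= 280; ++k) {
        double total = static_cast<double>(k) * k;
        std::printf("k=%3ld  S: %6ld %6ld %6ld   U: %5.1f%% %5.1f%% %5.1f%%\n",
                    k, stepsDiaWave(k, p), stepsColCyclic(k, p), stepsDiaCyclic(k, p),
                    100.0 * (total - underDiaWave(k, p))   / total,
                    100.0 * (total - underColCyclic(k, p)) / total,
                    100.0 * (total - underDiaCyclic(k, p)) / total);
    }
    return 0;
}
```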




5 Implementing LDDP on GPUs

This chapter describes relevant aspects of our LDDP implementation on the GPU grid- and thread-level. To achieve a linear-space reduction, we use the tiling algorithm presented by Driga et al. [17]. Here the problem space is decomposed into tiles, and only the intersecting boundaries between the tiles are saved, as depicted in Figure 2.1. In our algorithm, the boundaries are allocated in global memory and are available to all threads during the runtime of the algorithm.

5.1 Grid-level

The three layouts Diagonal Wavefront, Column-cyclic Wavefront and our new Diagonal-cyclic Wavefront have all been implemented on the grid-level for distributing tiles to multiprocessors. On the grid-level, tiles are represented by CUDA blocks. A CUDA kernel call executes a set of blocks on the multiprocessors, and in our context it is followed by a barrier synchronization. In relation to the theoretical description of a layout, a step is executed by making a kernel call with the blocks needed to compute that step. The implementations of Column-cyclic Wavefront and our Diagonal-cyclic Wavefront make kernel calls with at most p blocks. For Diagonal Wavefront, the implementation executes kernel calls containing all blocks in a diagonal. Relating to the description of the layout in Section 4.1, the steps needed to complete a full diagonal are given by the length of the diagonal divided by p. This is an approximation, as small variations in the execution time of kernels make it possible to schedule blocks better when a kernel is called with more blocks. This has the effect that our Diagonal Wavefront implementation will have a slightly better running time than predicted by the layout steps.
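A minimal sketch of such a grid-level driver loop for the Diagonal Wavefront layout is shown below, with one kernel call per anti-diagonal. The kernel body, boundary buffers and substring arguments are omitted, and all names are our own illustrative choices rather than the thesis implementation.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Tile { int i, j; };

// Hypothetical forward-pass kernel: one CUDA block computes one t x t tile.
// The real kernels also take boundary buffers and the substrings; here the
// body is a stub so the driver loop structure stands on its own.
__global__ void forwardPassTile(const Tile* tiles)
{
    Tile tile = tiles[blockIdx.x];   // which tile this block is responsible for
    (void)tile;                      // ... compute the t x t entries here ...
}

// Diagonal Wavefront: step s schedules all tiles on anti-diagonal s,
// i.e. tiles (i, j) with i + j == s (0-indexed).
static std::vector<Tile> tilesInStep(int s, int k)
{
    std::vector<Tile> tiles;
    for (int i = 0; i < k; ++i) {
        int j = s - i;
        if (j >= 0 && j < k) tiles.push_back({i, j});
    }
    return tiles;
}

void runForwardPass(int k, int threadsPerBlock)
{
    for (int s = 0; s <= 2 * (k - 1); ++s) {           // 2k - 1 anti-diagonals
        std::vector<Tile> tiles = tilesInStep(s, k);

        // A real implementation would allocate the tile list once; this keeps
        // the sketch short.
        Tile* d_tiles;
        cudaMalloc(&d_tiles, tiles.size() * sizeof(Tile));
        cudaMemcpy(d_tiles, tiles.data(), tiles.size() * sizeof(Tile),
                   cudaMemcpyHostToDevice);

        // One block per computable tile; the serialized kernel calls give the
        // implicit grid-level barrier between steps.
        forwardPassTile<<<static_cast<unsigned>(tiles.size()), threadsPerBlock>>>(d_tiles);
        cudaDeviceSynchronize();

        cudaFree(d_tiles);
    }
}
```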

5.2 Thread-level

On the thread-level we solve part of the full DPM by computing a t × t tile, represented by a CUDA block. A block is organized as a set of data-parallel threads, and the implementation of a block is called a kernel. The cost function used on our thread-level computes longest common subsequence (LCS), but can be extended to solve any LDDP problem. In the following we present our forward and backward pass kernels, and evaluate optimization strategies.



5.2.1 Forward Pass Kernels

The forward pass kernels define the calculation of a t × t tile using a block of threads. The input is the substrings X′ and Y′ of size t, and input boundaries of previously computed DPM values. The organization of boundaries is shown in Figure 5.1. When threads compute entries in the local DPM c[i, j] where i = 0 ∨ j = 0, the neighboring values are read from the input boundaries in global memory. When an entry has been computed on the output border, i = t − 1 ∨ j = t − 1, the cost value is saved in the output boundary in global memory.

Figure 5.1: Computation of a forward pass tile having t × t entries. A block of threads receives computed cost values from the north and west input boundaries, computes result values in a local DPM c, and saves the result values in the south and east output boundaries. The temporary result values are stored locally in three diagonals.

Selecting a Layout

The previously described layouts are all candidates for distributing work in the forward pass kernels. We believe the most appropriate, due to data locality, is Column-cyclic Wavefront. Despite that, we have decided to explore the thread-level using the simplest layout, Diagonal Wavefront. As the task of implementing efficient CUDA kernels is very time consuming, we chose to focus on a simple layout. Optimization of the kernel implementation has been explored using the best practice recommendations from NVIDIA [38]. Furthermore, we have investigated optimal performance in the kernel parameter space.

Space Reduction using Three Cost Diagonals

In the algorithm by Driga et al. [17], the problem space is divided down to a level where a complete full-matrix using O(n²) space can be stored in nearby memory. Due to the small amount of shared memory on a GPU, this approach would reduce the tile size to t ≤ 128. A small tile size in that order would mean that boundaries take too much global memory space, as the size of the boundaries in global memory is directly dependent on the tile size. Another drawback of a small tile size is a small compute-to-memory ratio. Since the Diagonal Wavefront layout computes a wavefront W_d concurrently, the dependencies for W_d are the cost values on the previous two diagonals W_{d−1} and W_{d−2}, as shown in Figure 5.1. For space reduction, we only allocate memory for these three diagonals, like Liu et al. [3, 4]. This gives a linear space usage, resulting in a larger maximum tile size. The three cost diagonals are placed in shared memory because they are frequently accessed and entries need to be accessible to all threads.
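The following simplified CUDA sketch illustrates the idea; it is our illustration, not the thesis kernel. One thread per tile row sweeps the 2t − 1 anti-diagonals of a tile, keeping only three diagonals in shared memory. For brevity the tile is computed as if its north and west input boundaries were all zero; the real kernels read these from, and write the output boundaries to, global memory as described above.

```cuda
#define TILE 128   // tile size t; one thread per tile row is assumed here

// Simplified LCS tile kernel using the three-cost-diagonal space reduction.
__global__ void lcsTileThreeDiagonals(const char* X, const char* Y, int* outLength)
{
    __shared__ int diag[3][TILE];        // W_{d-2}, W_{d-1}, W_d (rotating)

    const int i = threadIdx.x;           // row handled by this thread

    // Sweep the 2t - 1 anti-diagonals of the tile.
    for (int d = 0; d < 2 * TILE - 1; ++d) {
        const int cur = d % 3, prev = (d + 2) % 3, prev2 = (d + 1) % 3;
        const int j = d - i;             // column of this thread's entry on W_d

        if (j >= 0 && j < TILE) {
            const int west  = (j > 0)          ? diag[prev][i]      : 0;
            const int north = (i > 0)          ? diag[prev][i - 1]  : 0;
            const int nw    = (i > 0 && j > 0) ? diag[prev2][i - 1] : 0;

            diag[cur][i] = (X[i] == Y[j]) ? nw + 1 : max(west, north);
        }
        __syncthreads();                 // wavefront W_d done before W_{d+1}
    }

    if (i == TILE - 1)
        *outLength = diag[(2 * TILE - 2) % 3][TILE - 1];   // c[t-1, t-1]
}
```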



Boundaries

When values are accessed on the boundaries, only a single entry is read from or written to global memory at a time. This memory access pattern is suboptimal. Following best practices, we ought to first transfer the global input boundaries to shared memory, compute entries, store temporary output boundaries in shared memory, and finally flush the temporary boundaries to global memory. However, by experiments we have found that reading the input boundaries into shared memory does not give a significant speedup of the compute time compared to the additional 2t shared memory it would use; the same goes for storing all output boundaries temporarily in shared memory before flushing them to global memory.

Substrings

The substrings X′ and Y′ can be read directly from global memory, or they can be transferred to shared memory for better memory access time. For an alphabet limited to 256 characters, shared memory substrings can be stored using 2t bytes. When the substrings are in global memory, more shared memory is available for the cost diagonals, resulting in a larger maximum tile size at the cost of higher access times when reading characters from the substrings.

General and Specialized Kernels

Two sets of kernels have been developed, general kernels and specialized kernels. The difference between them is the data type used for representing the DPM cost values in a tile. The general kernels represent inner cost values as 4-byte integers, whereas the specialized kernels use a smaller datatype. The general kernels solve all LDDP problems, as long as the cost values can be represented by 4-byte integers. In the specialized kernels we apply the observation that, for a full tile computation, we are only interested in how much the cost values change from the input boundary to the output boundary. If all possible cost changes between neighboring entries can be represented in a smaller datatype, we can reduce the amount of shared memory needed.

The boundaries in global memory are represented as 4-byte integers. To represent cost values as a smaller data type, scaling is applied between boundaries and inner cost values. Scaling is applied by subtraction and addition of a tile scaling constant q. When reading a value b from the input boundaries, the inner cost value will be computed using the scaled value b − q. Equivalently, scaling by addition is used when writing to the output boundaries in global memory. For an LDDP problem to be solvable using the specialized kernels, we need to investigate how much the cost values can at most vary between entries, since this value must be representable by the smaller data type. The scaling constant q is determined from an appropriate value on the input boundaries. For LCS, the cost for a match is 1, hence the maximum difference in cost values occurs when X′ = Y′. In this case the maximum difference equals the tile size t. The scaling value q is assigned the input boundary value north-west of c[0, 0]. Using a 2-byte integer datatype to represent the inner cost values, we are able to halve the amount of shared memory used for the cost values compared to the general kernels. This comes at the expense of extra subtraction and addition operations when reading from and writing to the global memory boundaries.
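A sketch of the scaling idea for the specialized kernels is shown below; the helper names and the 2-byte short type are our illustrative assumptions.

```cuda
// Illustrative scaling between 4-byte boundary values in global memory and
// 2-byte inner cost values. q is the tile scaling constant (for LCS: the
// input boundary value north-west of c[0,0]); the maximum cost change inside
// a t x t LCS tile is t, so the scaled values fit in a short for t < 32768.
__device__ short readBoundaryScaled(const int* boundary, int idx, int q)
{
    return static_cast<short>(boundary[idx] - q);   // scale down on read
}

__device__ void writeBoundaryScaled(int* boundary, int idx, short value, int q)
{
    boundary[idx] = static_cast<int>(value) + q;    // scale back up on write
}
```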


Overview of Our Forward Pass Kernels

The following forward pass kernels have been implemented:

GKernelShared   general kernel, input strings in shared memory
GKernelGlobal   general kernel, input strings in global memory
SKernelShared   specialized kernel, input strings in shared memory
SKernelGlobal   specialized kernel, input strings in global memory

5.2.2 Backward Pass Kernel

Although our focus is on the forward pass, we have implemented a simple LCS backward pass kernel BPKernel. Like Driga et al. [17], our algorithm uses the boundaries saved in the forward pass to compute one optimal path by processing only the tiles that the path intersects. Due to shared memory limitations when using full-matrix, our BPKernel can compute the backward pass for a tile of up to 128 × 128. For k × k tiles, we start by computing the optimal path for the last tile (k, k). If the tile size t used in the forward pass is greater than 128, we need to decompose the problem space down to a level where full-matrix can be computed in shared memory. In this case we decompose the tile into sub-tiles of size 128 × 128 and run a forward pass kernel to compute sub-boundaries. When the problem space fits in shared memory, we run our BPKernel to obtain an optimal path through all sub-tiles. Once the kernel reaches the border of the current tile, the computation moves to the next tile that the optimal path intersects. The algorithm continues in this way until the computation reaches the border of the DPM. The sub-tiling is shown in Figure 5.2. In our implementation, the solution path is saved directly in host memory.

Figure 5.2: Computation of a backward pass using sub-tiling.

For space reduction, the cost matrix c used in BPKernel could be replaced by a matrix b containing the directions taken while computing the forward pass: an entry b[i, j] points to the neighboring entry corresponding to the optimal choice made when computing c[i, j]. This approach is described by Cormen et al. [18]. As we still need cost values in the computation of the forward pass, we could use the three cost diagonal approach for temporary cost values. Although this idea would increase the backward pass tile size, it introduces bank conflicts and makes the backward pass kernel rather complex.
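As an illustration of the traceback itself, the following sequential sketch (our own code, not BPKernel) walks back through one fully materialized tile, assuming the textbook (|x|+1) × (|y|+1) cost table c with a zero border:

#include <string>
#include <vector>

// Walk back from the south-east corner of a tile and recover one optimal path.
// c[i][j] is the LCS length of x[0..i) and y[0..j), with c[0][*] = c[*][0] = 0.
std::string backtrack_tile(const std::vector<std::vector<int>>& c,
                           const std::string& x, const std::string& y) {
    std::string lcs;
    int i = static_cast<int>(x.size());
    int j = static_cast<int>(y.size());
    while (i > 0 && j > 0) {
        if (x[i - 1] == y[j - 1]) {                 // match: step diagonally
            lcs.push_back(x[i - 1]);
            --i; --j;
        } else if (c[i - 1][j] >= c[i][j - 1]) {
            --i;                                    // follow the larger neighbor
        } else {
            --j;
        }
    }
    return std::string(lcs.rbegin(), lcs.rend());   // path was collected backwards
}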




5.2.3 Best Practice Recommendations

The NVIDIA Best Practices Guide [38] has been used as a guideline during implementation of the kernels. An excerpt of what we have found especially useful is listed below; a small illustration follows the list.

Minimize Branching  In the forward pass kernels, two CUDA max() functions replace the branching needed when testing for a character match and finding the maximum of neighboring values. This reduces branch divergence, which might otherwise occur for each entry. The reduced branching increased performance by up to 8%.

Shared Memory  When accessing shared memory, we have organized the access pattern so that no bank conflicts occur.

Loop Unrolling  Using #pragma unroll on loops helps the compiler predict branches in the code. Experiments have shown up to 8% speedup for our kernels. As the right level of loop unrolling is very time consuming to find and depends on other kernel parameters, the directive is not used in the presented results.
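As a small illustration of the first point, the cell update can be written branch-free in the style of the kernels in Appendix B (the fragment below is ours, not a verbatim quote):

// Branch-free LCS cell update: the character comparison contributes 0 or 1,
// and nested max() calls replace the if/else cascade of the textbook recurrence.
__device__ int cell_update(int north, int west, int northWest, char xc, char yc) {
    int result = max(west, northWest + (xc == yc));
    return max(north, result);
}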

5.2.4 Undocumented CUDA Features

Due to findings on the NVIDIA Developer Forum,1 we have investigated the rather undocumented behavior of the volatile keyword. The CUDA compiler often inlines the operations needed to compute the value of a variable used in the code, instead of keeping the value in a register. Declaring a variable volatile forces the compiler to keep the value in a register, i.e., the value will not be recomputed. Besides changing the flow of computations in the kernels, this changes the register usage. Manually exploring all relevant placements of the volatile keyword gave up to 10% speedup for some of our kernels.

1 http://blog.icare3d.org/2010/04/cuda-volatile-trick.html
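A minimal sketch of the trick (our own example kernel, not taken from the thesis code):

__global__ void volatile_sketch(const int* __restrict__ in, int* __restrict__ out, int width) {
    // Declaring the precomputed offset volatile keeps the value in a register
    // instead of letting the compiler inline and recompute the expression.
    volatile int rowOffset = blockIdx.x * width;
    out[rowOffset + threadIdx.x] = in[rowOffset + threadIdx.x];
}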




5.3 Space Constraints

We look into GPU memory constraints on the thread- and grid-level. The datatypes used impact the amount of memory required. D_Var denotes the size in bytes of one variable Var. The datatypes are described in Table 5.1 below.

Variable        Description                                       Size in bytes
D_InputString   Size of the input string type                     1
D_DPM-value     Size of the DPM value representation in a block   2 or 4
D_Boundary      Size of the global memory boundary type           4

Table 5.1: Sizes of variables.

5.3.1 Thread-level

On the thread-level we are limited by the amount of shared memory S_mem per multiprocessor. Each block requires an amount of shared memory depending on the tile size t. The upper limits on t are:

X′ and Y′ in shared memory:   2t · D_InputString + 3t · D_DPM-value ≤ S_mem
(the first term covers the input strings X′ + Y′, the second the three diagonals of DPM cost values)

X′ and Y′ in global memory:   3t · D_DPM-value ≤ S_mem
(only the three diagonals of DPM cost values reside in shared memory)

Notice that D_DPM-value can take two values, depending on whether the kernel applies scaling. Table 5.2 shows the maximum value of the tile size t for the different kernels on NVIDIA GPUs with S_mem = 49152 bytes.

Forward Pass Kernel   Maximum tile size t
GKernelShared         3510
GKernelGlobal         4096
SKernelShared         6144
SKernelGlobal         8192

Table 5.2: Comparison of tile size limitations for different forward pass kernels. Values are for NVIDIA GPUs with compute capability 2.0; see the data sheet in section A.1.
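The limits in Table 5.2 follow directly from these inequalities. A small helper (ours) reproduces the numbers, taking D_DPM-value as 4 bytes for the general and 2 bytes for the specialized kernels:

// Largest tile size t that satisfies the shared-memory constraint above.
// bytes_per_cost_value is D_DPM-value; strings_in_shared adds the 2t bytes for X' and Y'.
int max_tile_size(int smem_bytes, int bytes_per_cost_value, bool strings_in_shared) {
    int bytes_per_t = 3 * bytes_per_cost_value + (strings_in_shared ? 2 : 0);
    return smem_bytes / bytes_per_t;   // e.g. 49152 / 14 = 3510 for GKernelShared
}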




5.3.2 Grid-level Memory Usage

On the grid-level we are limited by the amount of global memory G_mem on the GPU. The constraint, given by n and k where k = n/t, is:

2n · D_InputString + 2n(k + 1) · D_Boundary + (n/k) · (n/(64k)) · 2 · D_Boundary ≤ G_mem

where the first term accounts for the input strings X + Y, the second for the forward pass boundaries, and the third for the backward pass boundaries. The solution path is saved directly in host memory. The relation shows that the maximum solvable problem size n becomes larger as k decreases, so to maximize n the largest available tile size t is selected. When selecting the forward pass kernel SKernelGlobal, the largest tile size is t = 8192. Using this kernel on an NVIDIA Tesla C2070 with 5375 MByte of global memory, our implementation is able to compare two strings of up to 2.39 · 10⁶ symbols each, including finding the optimal path. For the GeForce GTX 590 with 1536 MByte of global memory, we can compare two strings of length up to 1.2 · 10⁶.
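As an illustration of how the reported maxima arise, the following sketch (our own, assuming D_InputString = 1 and D_Boundary = 4 bytes as in Table 5.1) evaluates the inequality numerically and searches for the largest feasible n at a given tile size:

// Largest n satisfying the global-memory constraint above, with k = n / t.
long long max_problem_size(double gmem_bytes, double t) {
    auto bytes_needed = [t](double n) {
        double k = n / t;
        return 2.0 * n                                  // input strings X and Y
             + 2.0 * n * (k + 1.0) * 4.0                // forward pass boundaries
             + (n / k) * (n / (64.0 * k)) * 2.0 * 4.0;  // backward pass boundaries
    };
    double lo = t, hi = 1e9;                 // assume n = t is feasible, 1e9 is not
    while (hi - lo > 1.0) {
        double mid = 0.5 * (lo + hi);
        if (bytes_needed(mid) <= gmem_bytes) lo = mid; else hi = mid;
    }
    return static_cast<long long>(lo);       // about 2.39e6 for the Tesla C2070 with t = 8192
}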




6 Experimental Results for Grid-level

In this chapter we present our results for mapping the three LDDP layouts Diagonal Wavefront (DiaWave), Column-cyclic Wavefront (ColCyclic) and Diagonal-cyclic Wavefront (DiaCyclic) onto the GPU grid-level. The widely used Diagonal Wavefront layout is selected as the baseline for the relative speedup of the Column-cyclic Wavefront and Diagonal-cyclic Wavefront. The results are related to the theoretical descriptions of the layouts presented in chapter 4.

6.1 Setup

The experiments are performed on an NVIDIA Tesla C2070 GPU with 14 multiprocessors and ECC memory error protection.1 To explore the portability of our solution we conduct the same tests on the NVIDIA GeForce GTX 590, which has 16 multiprocessors. Unlike the Tesla, which is targeted at scientific computing, the GeForce series is intended for the PC gaming market, so ECC is not available and global memory is limited. A data sheet for each GPU is listed in Appendix A. Unless otherwise stated, the Tesla GPU is used for the presented results. All algorithms are implemented in CUDA/C++ and compiled using nvcc 4.1 and gcc 4.4.3. The input data is selected from the DNA collection of the Pizza & Chili Corpus [39], with the alphabet Σ = {A, C, G, T}. The data is selected to be representative of real-life samples used in bioinformatics. To perform accurate timings we use CUDA event timers, which are based on on-board counters on the GPU with sub-microsecond resolution [40]. Additional information has been obtained using the CUDA Profiler. Although the independent timings are almost identical, all running times presented are the median of 5 independent runs.

1 Error-Correcting Code memory is used where data corruption cannot be tolerated.




6.2 Results

Figure 6.1 shows the experimental speedup for ColCyclic and DiaCyclic relative to DiaWave for k = [28; 280] in the two cases where the maximum number of blocks per multiprocessor B is (a) 1 and (b) 2. As predicted by the utilization analysis in chapter 4, DiaCyclic always outperforms DiaWave. Also, as anticipated, ColCyclic varies between DiaWave and DiaCyclic. In the cases where the number of steps S is equal for DiaCyclic and ColCyclic, we observe a slightly better speedup for ColCyclic, the difference being less than 0.2 percentage points. This can be explained by ColCyclic having a higher degree of data locality: since the L2 cache is not purged between kernel calls, ColCyclic obtains a higher hit rate in the L2 cache. This has been confirmed by profiling the two layouts. The results also show that ColCyclic performs up to 1% worse than DiaWave when the number of steps is the same. This is due to differences in the placement of the step synchronization barriers; refer to section 5.1.

Figure 6.1: Relative layout-performance of Column-cyclic Wavefront and Diagonal-cyclic Wavefront compared to Diagonal Wavefront with a fixed tile size of 1024 and a varying number of tiles k = [28; 280]. The y-axis shows speedup compared to Diagonal Wavefront (%). The result is independent of the actual forward pass kernel used. (a) Speedup with one block per multiprocessor. (b) Two concurrent blocks per multiprocessor.



6.2.1 Comparing Theoretical and Experimental Speedup

We will now investigate how well the theoretical descriptions predict the actual speedup of the grid-level layouts. Depending on how many resources a block requires, the GPU may be able to schedule multiple concurrent blocks on a multiprocessor. We denote the number of concurrent blocks per multiprocessor B and the number of multiprocessors P. To predict speedup theoretically, we need to take the scheduling of multiple blocks on the multiprocessors into account. As an example, when B = 1, the time to compute a number of blocks grows as a linear step function with steps of length P. When B = 2, the time cannot be predicted by linear steps, as two concurrent blocks on a multiprocessor take less than twice the time to finish. We define a function, virtual time, which simulates the concurrent execution of blocks. It maps the characteristics of each step of a layout to a relative time unit, see Table 6.1.

Number of tiles in a step   ]0; P]   ]P; 2P]   ]2P; 3P]   ]3P; 4P]   ]4P; 5P]    ]5P; 6P]   ...
B = 1                       t1       2t1       3t1        4t1        5t1         6t1        ...
B = 2                       t1       t2        t2 + t1    2t2        2t2 + t1    3t2        ...

Table 6.1: Virtual time function simulating the GPU when B = 1 and B = 2. It shows the approximate timing of a step containing a given number of tiles. Notice that the number of tiles is given in intervals of the number of multiprocessors P. For the timings t1 and t2 the following applies: t1 < t2 < 2t1. From empirical profiling of the architecture we have found that t2 ≈ 1.7t1.
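The table can be expressed compactly in code (our own formulation of Table 6.1, with t1 and t2 as relative time units):

// Virtual time of one step that computes tiles_in_step tiles on P multiprocessors
// with B concurrent blocks per multiprocessor (B = 1 or 2).
double virtual_time(int tiles_in_step, int P, int B, double t1 = 1.0, double t2 = 1.7) {
    int rounds = (tiles_in_step + P - 1) / P;      // how many "waves" of P tiles
    if (B == 1)
        return rounds * t1;                        // row B = 1 of Table 6.1
    return (rounds / 2) * t2 + (rounds % 2) * t1;  // row B = 2: pairs of waves share a t2 slot
}
// A layout's total virtual time is the sum over its steps; the predicted relative
// speedup of two layouts is the ratio of these sums.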

As an example, when B = 2 and the number of tiles in a step lies in the interval ]P; 2P], then due to better utilization of the multiprocessor the computation time is t2, which is less than 2t1. Our empirical studies have shown that t2 ≈ 1.7t1. The tendency seen here for B = 2 is similar in cases where B > 2.

Speedup Predictability

By applying the virtual time function to the concept of steps, we are able to theoretically determine the speedup of ColCyclic and DiaCyclic relative to DiaWave. By comparing the theoretical speedup with the speedup measured in our experiments, we will now investigate how well the theoretical descriptions predict the actual speedup on the GPU architecture. Figure 6.2 shows the difference between theoretical and experimental speedup with one block per multiprocessor. Plot (a) compares predictability for a kernel with substrings in shared memory, and (b) shows predictability when the substrings are in global memory. When the substrings are in shared memory, the access time is close to constant, and there is less pressure on the caches where the boundaries reside.



When the substrings are in global memory, the caches are used for both substrings and boundaries, giving more pressure on the cache, which results in larger variations in memory access time. This explains why plot (b) contains more noise than (a). In general, our theoretical descriptions predict at most a 0.30 percentage point better speedup than what is observed in experiments. For the kernels where the substrings are in shared memory, the theory predicts at most a 0.10 percentage point better speedup than observed.

Figure 6.2: Comparing theoretical layout-performance with experiments, using a fixed tile size of 1024, a varying number of tiles k = [28; 280] and one block per multiprocessor. The y-axis shows the difference between theoretical and experimental speedup in percentage points. The average deviations from theory are µ = 0.08 (Diagonal-cyclic Wavefront) and µ = 0.06 (Column-cyclic Wavefront) in (a), and µ = 0.23 and µ = 0.19 in (b). The plot shows a slightly worse speedup in practice than predicted by our theory, due to the implementation of Diagonal Wavefront. (a) LCS kernel GKernelShared with input strings in shared memory. (b) GKernelGlobal with input strings in global memory.

Figure 6.3 shows how well we can predict the speedup theoretically when up to two blocks can execute concurrently on a multiprocessor. The data is from the same experiments as shown in Figure 6.1 (b). A lower accuracy is observed compared to the case where at most one block executes concurrently on a multiprocessor. This is expected, since virtual time becomes more approximative as B gets larger, due to the t2 term. In this case, the theory predicts at most a 1.2 percentage point better speedup than what is observed in experiments. For most results, the theory predicts a better speedup than what we see in our experiments. This degradation of the experimental speedup is due to the implementation of DiaWave: as explained in section 5.1, we reduce the amount of grid-level synchronization by executing only a single kernel call per wavefront, which gives this implementation a slightly better running time than predicted by the step-to-virtual-time mapping.


Figure 6.3: Comparing theoretical layout-performance for the experiments shown in Figure 6.1 (b), where two concurrent blocks can execute per multiprocessor. The average deviations from theory are µ = 0.66 (Diagonal-cyclic Wavefront) and µ = 0.58 (Column-cyclic Wavefront). The plot shows a slightly worse speedup in practice than predicted by our theory. The LCS kernel used is GKernelShared with 512 threads per block.

6.2.2 Part Conclusion

On the grid-level, the theoretical description of the three layouts captures what we see in our experimental results with a very small margin of error. This has been shown for one and for multiple concurrent blocks per multiprocessor. In the cases where our theoretical predictions are inaccurate, we are able to pinpoint the cause, and the inaccuracies are small enough to be considered negligible. Thus, our theoretical approach of describing the layouts as steps and mapping steps to virtual time gives very accurate predictions of how the architecture behaves. It also shows that the utilization metric gives a good theoretical way of comparing layouts on the grid-level. Overall, this shows that our new Diagonal-cyclic Wavefront layout gives the best performance, both experimentally and theoretically.




7 Experimental Results for Thread-level

This chapter documents the structured approach and experiments carried out to optimize kernel performance. A number of best practice optimizations have been evaluated during implementation, and a selection of these is presented in subsection 5.2.3.

7.1 Setup

On the thread-level we have four independent kernels with different characteristics for computing the forward pass of longest common subsequence. All kernels use the Diagonal Wavefront layout for mapping the problem onto CUDA threads. A simple backward pass kernel has also been designed. All kernels are implemented in NVIDIA CUDA. We use the same setup as for the grid-level experiments, described in section 6.1. The performance of the kernels is tested on DNA sequences [39] and on randomly generated symbols from an alphabet of 256 symbols. To compare the results we use cell updates per second (CUPS), a commonly used performance measure in the bioinformatics literature [4, 6, 9, 10]. CUPS represents the time for a complete computation of one entry in the DPM, including memory operations and communication. Given a DPM with n × n entries, the GCUPS (billion cell updates per second) measure is n²/(T · 10⁹), where T is the total computation time in seconds.
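For reference, the measure is trivial to compute (our one-line helper):

// Billions of cell updates per second for an n x n DPM computed in T seconds.
double gcups(double n, double T) { return (n * n) / (T * 1e9); }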

7.2 Results for Forward Pass Kernels

We present the results for our four forward pass kernels for solving longest common subsequence. For all of them, each thread block computes a tile of t × t entries by comparing the substrings X′ and Y′ of length t, using two tile boundaries as input, and outputs the boundaries to the south and east. The four forward pass kernels are:

GKernelShared   general kernel, input strings in shared memory
GKernelGlobal   general kernel, input strings in global memory
SKernelShared   specialized kernel, input strings in shared memory
SKernelGlobal   specialized kernel, input strings in global memory




7.2.1 Automatic Performance Tuning

Selecting optimal kernel parameters is a key task in optimizing the performance of GPU applications. As the performance of kernels is almost impossible to analyze theoretically, performance optimization is driven by empirical studies. The strategy is to exhaustively search the parameter space by executing several runs with different tuning parameters, a technique known as automatic performance tuning, or auto-tuning. The effectiveness of this technique depends on the chosen tuning parameters. Auto-tuning of GPU kernel parameters can include relevant parameters such as block size, threads per block, loop unrolling level and internal algorithm trade-offs. For more information on the topic in a GPU context, see [16, 41, 42].

Our automatic performance tuning is conducted by selecting a set of kernel parameters and, for each of these, automatically measuring the running time of our four kernels. To make the results independent of the grid-level layouts, we occupy all multiprocessors with the maximum number of concurrent blocks for the given parameters, so the performance measured is for the best case, where maximum utilization on the grid-level is achieved. All tests presented are for the targeted NVIDIA Tesla C2070 GPU.

Tile Size and Threads Ratio

We start by examining the optimal relationship between the kernel parameters tile size t and number of threads per block. The tile size is selected as a multiple of the architecture's 128 byte memory transaction size, up to the maximum size supported by the kernels. The number of threads is selected as a multiple of the warp size (32 threads), up to the maximum of 1024 threads. Our tests show a tendency for the optimal number of threads per block to be 1/4 of the tile size for the general kernels and 1/8 of the tile size for the specialized kernels. Figure 7.1 visualizes the result of an auto-tuning test; a sketch of the search loop is shown after the figure caption.

Figure 7.1: Example of auto-tuning two kernel parameters for GKernelShared. The performance in GCUPS is shown for each supported configuration of tile size t and number of threads per block. The black cells denote unsupported configurations, where the number of threads does not divide t. The best performance for each tile size t is underlined.
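A sketch of such a search loop (our own illustration; measure() stands in for launching and timing a kernel and is not a function from the thesis code):

#include <functional>
#include <vector>

struct Config { int tile_size = 0, threads = 0; double gcups = 0.0; };

// Exhaustively try every supported (tile size, threads per block) pair and keep
// the best-performing configuration. The caller supplies measure(), assumed to
// launch the kernel with those parameters and return the measured GCUPS.
Config auto_tune(const std::vector<int>& tile_sizes,
                 const std::vector<int>& thread_counts,
                 const std::function<double(int, int)>& measure) {
    Config best;
    for (int t : tile_sizes)
        for (int threads : thread_counts) {
            if (t % threads != 0) continue;          // unsupported: threads must divide t
            double g = measure(t, threads);
            if (g > best.gcups) best = Config{t, threads, g};
        }
    return best;
}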



Optimal Tile Size

To find the optimal value of the tile size t, we tested all kernels by varying t using the best ratio between t and the number of threads per block. Notice that with different tile sizes, the number of concurrent blocks per multiprocessor will vary. The results for the general kernels are shown in Figure 7.2, and Figure 7.3 shows the results for the specialized kernels. The results clearly show that selecting the tile size as a multiple of the architecture's 128 byte memory transaction size generally yields the best performance. Also, having multiple blocks per multiprocessor gives better performance for all kernels.

Figure 7.2: Experimental tile-performance for the general kernels, (a) GKernelShared and (b) GKernelGlobal, using a varying tile size with the maximum number of blocks for each configuration (threads per block = 1/4 tile size). The vertical lines and numbers inside the plots denote the number of blocks per multiprocessor. The circles show where t is a multiple of the 128 byte memory transaction size. Tests were conducted on the NVIDIA Tesla C2070 GPU.



Figure 7.3: Experimental tile-performance for the specialized kernels, (a) SKernelShared and (b) SKernelGlobal, using a varying tile size with the maximum number of blocks for each configuration (threads per block = 1/8 tile size). The vertical lines and numbers inside the plots denote the number of blocks per multiprocessor. The circles show where t is a multiple of the 128 byte memory transaction size. Tests were conducted on the NVIDIA Tesla C2070 GPU.

When comparing the overall performance of the kernels in Figures 7.2 and 7.3, the specialized kernels surprisingly outperform the general kernels for most configurations. The specialized kernels do more computation when scaling cost values, but the smaller datatype reduces shared memory traffic. This does not fully explain why the performance is better; another explanation could be different optimizations applied by the CUDA compiler.



Summary and Comparison

To give an overview of kernel performance on the NVIDIA Tesla GPU using the auto-tuned parameters and ratios, Table 7.1 presents combined results for tile sizes t = 1024, 2048, 4096, 8192 using 256, 512 and 1024 threads per block. The results confirm the same ratio between t and the number of threads per block as found in our auto-tuning test.

Performance in Giga Cell Updates per Second (GCUPS)

Tile size  Threads  SKernelShared  SKernelGlobal  GKernelShared  GKernelGlobal
1024       256      5.63 (4)       5.39 (5)       5.36 (3)       5.45 (4)
1024       512      4.87 (2)       4.70 (2)       4.71 (2)       4.81 (2)
1024       1024     3.62 (1)       3.58 (1)       3.53 (1)       3.62 (1)
2048       256      6.18 (3)       5.71 (4)       3.86 (1)       5.09 (2)
2048       512      5.79 (2)       5.36 (2)       4.91 (1)       5.51 (2)
2048       1024     4.81 (1)       4.59 (1)       4.64 (1)       4.70 (1)
4096       256      4.18 (1)       5.12 (2)       -              3.51 (1)
4096       512      5.73 (1)       5.77 (2)       -              5.32 (1)
4096       1024     5.74 (1)       5.32 (1)       -              5.49 (1)
8192       256      -              3.26 (1)       -              -
8192       512      -              5.24 (1)       -              -
8192       1024     -              5.67 (1)       -              -

Table 7.1: Kernel performance in GCUPS with the maximum number of blocks for different configurations of tile size t and threads per block. The number in parentheses denotes the number of concurrent blocks per SM, B. Tests were conducted on the NVIDIA Tesla C2070 GPU.

We conducted the same tests on the NVIDIA GeForce GTX 590, showing a performance increase of around 9% for all kernels compared to the Tesla GPU. The speedup can be explained by the higher clock frequency and the faster memory access, due to the lack of ECC, on the GeForce.

7.2.2 Effect of Alphabet Size and String Similarity

Varying the alphabet size up to 256 symbols for random sequences has no impact on the running time. The similarity of the given strings X and Y also has no influence on the running time. This is due to the branch reduction achieved by the CUDA max() functions in the implementation, which leave no branching within the kernel when matching character pairs from the strings.

7.3 Results for Backward Pass Kernel

We have implemented a simple backward pass kernel BPKernel. Although the implementation does not utilize the available resources efficiently, the time spent on the backward pass is less than 2% of the total running time for n = 2²¹.




7.4 Part Conclusion

Currently there is no practical way of modeling the behavior at the GPU thread-level, as small changes can have cascading effects that are hard to predict. Some metrics, like utilization and data locality, can help determine what is viable to implement, but experimental results are the only way to get solid answers. The kernels presented are all implemented with NVIDIA's best practices in mind [38]. As examples, we have no bank conflicts in shared memory access, we have minimized branch divergence where possible, and we have investigated the effects of loop unrolling. We have also found a rather undocumented feature, the volatile keyword, which can have an impact on performance. By doing this we have made sure that the implemented kernels achieve high performance. Furthermore, we have experimentally auto-tuned the kernel parameters by performing an exhaustive search of the parameter space. This provides general pointers on the optimal relationship between kernel parameters. The auto-tuning results are also valuable in determining the optimal tile size for a given input size.



8 Performance Evaluation

8.1 The Potential of Solving LDDP Problems on GPUs

To evaluate the potential of solving LDDP problems on GPU hardware, we compare our results for finding the longest common subsequence (LCS) to sequential CPU solutions. For comparison we have used the CPU solutions provided by Stöckel and Bille [1], including LCS implementations of Hirschberg [22], FLSA by Driga et al. [17], Chowdhury and Ramachandran's cache-oblivious algorithm [24] and their own FCO [1]. As the fastest known general sequential CPU solution, we use Stöckel's optimized implementation of FLSA by Driga et al. [17]. Table 8.1 shows a performance comparison between our GPU solution on an NVIDIA Tesla C2070 and FLSA [17] running on an Intel i7 2.66 GHz with 4 GB memory. Both architectures were released in 2010. The data shows that our GPU solution has an average of over 40X performance advantage over FLSA when comparing two strings of length larger than 2¹⁹. Our experiments show that our inefficient backward pass has a negative impact on the GPU speedup for smaller strings.

Input size n   FLSA CPU running time   GPU running time   GPU speed-up
2^18           0.132 h                 0.0035 h           38X
2^19           0.529 h                 0.0125 h           42X
2^20           2.127 h                 0.0473 h           45X
2^21           8.741 h                 0.2077 h           42X

Table 8.1: Performance comparison of the state-of-the-art single-threaded CPU solution FLSA [17] (on an Intel i7 2.66 GHz with 4 GB memory) and our GPU solution (NVIDIA Tesla C2070) for solving longest common subsequence. The timings include the complete computation time, memory transfer and traceback of the solution path. All GPU tests use our new Diagonal-cyclic Wavefront layout. For n = 2²¹, SKernelGlobal was used with tile size t = 8192 and 1024 threads per block. All other tests used SKernelShared with t = 2048 and 256 threads.

Driga et al. [17] presented a parallel FLSA with an almost linear speedup for up to eight processors when comparing two strings of length just above 2¹⁸. For 32 processors the speedup is halved. Comparing this with the running times presented in Table 8.1, the GPU is still an order of magnitude faster than the parallel FLSA CPU solution running on a 32-core CPU. We believe this shows the potential of solving LDDP problems on GPU hardware.




8.2 Comparing to Similar GPU Solutions

As stated in section 2.2.2, there are currently many GPU solutions for LDDP problems, especially for Smith-Waterman. All of them are able to compare a large number of independent short sequences with lengths up to 2¹⁶. As a result, they do not target the same problem size as we consider, making a direct comparison impossible. The GPU+CPU solution by Kloetzli et al. [11] is able to solve longest common subsequence (LCS) for string lengths up to 2²⁰. They showed a five-fold speedup over the single processor algorithm presented by Chowdhury and Ramachandran [24]. Since we achieve a much higher speedup, we conclude that our solution is superior. We found Deorowicz' [12] solution to be the most closely resembling GPU implementation, although it only computes the length of the LCS. The implementation is based on the widely used Diagonal Wavefront layout. Their experiments showed a significant speedup over their own serial CPU implementation of the same anti-diagonal algorithm for n = 2¹⁶. Unfortunately, no comparison is made with any known CPU solutions. Although we have tried to obtain the source code for their implementation, it has not been possible. To the best of our knowledge, all existing GPU solutions for solving large-scale LDDP problems use the Diagonal Wavefront layout on the grid-level, where it would be an advantage to use our Diagonal-cyclic Wavefront layout instead.



9 Conclusion

Based on an analysis of state-of-the-art algorithms for solving local dependency dynamic programming (LDDP) problems and a thorough investigation of GPU architectures, we have combined and further developed existing LDDP solutions. The result is a novel approach for solving any large pairwise LDDP problem, supporting the largest input size for GPUs in the literature. Our results include a new superior layout, Diagonal-cyclic Wavefront, for utilizing the coarse-grained parallelism of the many-core GPU. For various input sizes, the Diagonal-cyclic Wavefront always outperforms the widely used Diagonal Wavefront, and in most cases also the Column-cyclic Wavefront layout. In general, any GPU algorithm for large LDDP problems can adopt our new layout with advantage. Theoretically, we have been able to analyze the efficiency of the layouts and accurately predict the relative speedup. Our results can be generalized to several levels of parallel computation using multiple GPUs.

As a case study, we have implemented GPU kernels for finding the longest common subsequence. We present ways of optimizing kernel performance by minimizing branch divergence, scaling inner cost values, evaluating NVIDIA best practices, performing automatic tuning of kernel parameters and exploiting the compiler keyword volatile with a previously undocumented speedup effect.

Compared with the fastest known sequential CPU algorithm by Driga et al. [17], our GPU solution obtains a 40X speedup. Comparing two sequences of 2 million symbols each, the CPU running time was close to 9 hours; our implementation solves the same problem in 12 minutes. This shows that exact comparison of large biological sequences is now feasible.

9.1 Future Work

In this report we have focused on efficient layouts for the GPU grid-level. To further optimize the performance, experiments should be made with different layouts on the fine-grained thread-level. We believe the most appropriate is the Column-cyclic Wavefront, due to its degree of data locality. The potential of using multiple GPUs should also be examined. We have focused on implementing longest common subsequence within our parallel LDDP solution. It would be interesting to extend this implementation with other LDDP algorithms with more advanced cost functions.




Bibliography

[1] P. Bille and M. Stöckel. Fast and cache-oblivious dynamic programming with local dependencies. Language and Automata Theory and Applications, pages 131–142, 2012.
[2] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85(8):2444, 1988.
[3] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Müller-Wittig. Bio-sequence database scanning on a GPU. In Parallel and Distributed Processing Symposium, 2006 (IPDPS 2006), 20th International, pages 8 pp. IEEE, 2006.
[4] W. Liu, B. Schmidt, G. Voss, and W. Müller-Wittig. Streaming algorithms for biological sequence alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems, 18(9):1270–1281, 2007.
[5] S. A. Manavski and G. Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(Suppl. 2), 2008.
[6] L. Ligowski and W. Rudnicki. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. 2009.
[7] G. M. Striemer and A. Akoglu. Sequence alignment with GPU: Performance and design challenges. In Parallel & Distributed Processing, 2009 (IPDPS 2009), IEEE International Symposium on, pages 1–10. IEEE, 2009.
[8] Y. Liu, D. L. Maskell, and B. Schmidt. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes, 2, 2009.
[9] Y. Liu, B. Schmidt, and D. L. Maskell. CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Research Notes, 3(1):93, 2010.
[10] J. Blazewicz, W. Frohmberg, M. Kierzynka, E. Pesch, and P. Wojciechowski. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinformatics, 12, 2011.
[11] J. Kloetzli, B. Strege, J. Decker, and M. Olano. Parallel longest common subsequence using graphics hardware. In Proceedings of the Eurographics Symposium on Parallel Graphics and Visualization. Eurographics Association, 2008.
[12] S. Deorowicz. Solving longest common subsequence and related problems on graphical processing units. Software: Practice and Experience, 40(8):673–700, 2010.
[13] A. R. Galper and D. L. Brutlag. Parallel similarity search and alignment with the dynamic programming method. Knowledge Systems Laboratory, Medical Computer Science, Stanford University, 1990.
[14] P. Krusche and A. Tiskin. Efficient longest common subsequence computation using bulk-synchronous parallelism. Computational Science and Its Applications – ICCSA 2006, pages 165–174, 2006.
[15] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.
[16] H. H. B. Sørensen. Auto-tuning of level 1 and level 2 BLAS for GPUs. 2012.
[17] A. Driga, P. Lu, J. Schaeffer, D. Szafron, K. Charter, and I. Parsons. FastLSA: a fast, linear-space, parallel and sequential algorithm for sequence alignment. Algorithmica, 45(3):337–375, 2006.
[18] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[19] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173, 1974.
[20] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[21] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[22] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, 1975.
[23] E. W. Myers and W. Miller. Optimal alignments in linear space. Computer Applications in the Biosciences: CABIOS, 4(1):11–17, 1988.
[24] R. A. Chowdhury and V. Ramachandran. Cache-oblivious dynamic programming. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 591–600. ACM, 2006.
[25] E. D. Demaine. Cache-oblivious algorithms and data structures. Lecture Notes from the EEF Summer School on Massive Data Sets, pages 1–29, 2002.
[26] P. Bille. Faster approximate string matching for short patterns. Theory of Computing Systems, pages 1–24, 2008.
[27] J. W. Hunt and T. G. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM, 20(5):350–353, 1977.
[28] G. M. Landau and U. Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157–169, 1989.
[29] W. J. Masek and M. S. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20(1):18–31, 1980.
[30] L. Bergroth, H. Hakonen, and T. Raita. A survey of longest common subsequence algorithms. In String Processing and Information Retrieval, 2000 (SPIRE 2000), Proceedings, Seventh International Symposium on, pages 39–48. IEEE, 2000.
[31] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys (CSUR), 33(1):31–88, 2001.
[32] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the Tenth Annual ACM Symposium on Theory of Computing, pages 114–118. ACM, 1978.
[33] T. R. Mathies. A fast parallel algorithm to determine edit distance. 1988.
[34] R. A. Chowdhury and V. Ramachandran. Cache-efficient dynamic programming algorithms for multicores. In Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, pages 207–216. ACM, 2008.
[35] D. Diaz, F. J. Esteban, P. Hernandez, J. A. Caballero, G. Dorado, and S. Galvez. Parallelizing and optimizing a bioinformatics pairwise sequence alignment algorithm for many-core architecture. Parallel Computing, 37(4-5):244–259, 2011.
[36] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, et al. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In ACM SIGARCH Computer Architecture News, volume 38, pages 451–460. ACM, 2010.
[37] NVIDIA. CUDA C Programming Guide, Version 4.1. November 2011.
[38] NVIDIA. CUDA C Best Practices Guide, Version 4.1. January 2012.
[39] P. Ferragina and G. Navarro. Pizza & Chili Corpus. University of Pisa and University of Chile. http://pizzachili.di.unipi.it/. 2012.
[40] NVIDIA. CUDA C Toolkit Reference Manual, Version 4.2. April 2012.
[41] Y. Li, J. Dongarra, and S. Tomov. A note on auto-tuning GEMM for GPUs. Computational Science – ICCS 2009, pages 884–892, 2009.
[42] P. Micikevicius. Analysis-driven optimization. In GPU Technology Conference. NVIDIA, 2010.



Appendices




A NVIDIA GPU Data Sheets

A.1 NVIDIA Tesla C2070

The Tesla C2070 is developed for high performance scientific computing.

Release year:                                   2010
CUDA Compute and Graphics Architecture:         Fermi
CUDA Driver Version:                            4.1
CUDA Compute Capability:                        2.0
Total amount of global memory:                  5375 MBytes (5636554752 bytes)
Symmetric Multiprocessors (SM):                 14
(14) Multiprocessors x (32) CUDA Cores/MP:      448 CUDA Cores (SPs)
GPU Clock Speed:                                1.15 GHz
Memory Clock rate:                              1494.00 MHz
Memory Bus Width:                               384-bit
L2 Cache Size:                                  786432 bytes
Total amount of constant memory:                65536 bytes
Total amount of shared memory per SM:           49152 bytes
Total number of registers available per SM:     32768
Warp size:                                      32
Maximum number of threads per block:            1024
Maximum sizes of each dimension of a block:     1024 x 1024 x 64
Maximum sizes of each dimension of a grid:      65535 x 65535 x 65535
Maximum number of resident blocks per SM:       8
Concurrent copy and execution:                  Yes, with 2 copy engines
Device has ECC support enabled:                 Yes



A.2 NVIDIA GeForce GTX 590

The GeForce GTX 590 is intended for the PC gaming market. The specifications that differ from the NVIDIA Tesla C2070 are marked with an asterisk.

Release year:                                   2011 *
CUDA Compute and Graphics Architecture:         Fermi
CUDA Driver Version:                            4.1
CUDA Compute Capability:                        2.0
Total amount of global memory:                  1536 MBytes (1610285056 bytes) *
Symmetric Multiprocessors (SM):                 16 *
(16) Multiprocessors x (32) CUDA Cores/MP:      512 CUDA Cores (SPs) *
GPU Clock Speed:                                1.22 GHz *
Memory Clock rate:                              1707.00 MHz *
Memory Bus Width:                               384-bit
L2 Cache Size:                                  786432 bytes
Total amount of constant memory:                65536 bytes
Total amount of shared memory per block:        49152 bytes
Total number of registers available per block:  32768
Warp size:                                      32
Maximum number of threads per block:            1024
Maximum sizes of each dimension of a block:     1024 x 1024 x 64
Maximum sizes of each dimension of a grid:      65535 x 65535 x 65535
Maximum number of resident blocks per SM:       8
Concurrent copy and execution:                  Yes, with 1 copy engine *
Device has ECC support enabled:                 No *



B Kernel Source Code

B.1 Forward pass kernels

B.1.1 GKernelShared

/* * * Computes the LCS cost - boundaries for the tiles defined in boxArray . * * @param boxArray list of tiles to compute * @param xStr X’ * @param yStr Y’ * @param boundaryX boundaries in X direction * @param boundaryY boundaries in Y direction * @param n input size * @param boxDim a multiple of # threads per block */ __global__ __launch_bounds__ ( MAXTHREADSPERBLOCK ) void k e r n e l _ l c s _ f p _ w a v e _ b o u n d a r y _ s t r i v i n g ( const int2 * boxArray , char const * const xStr , char const * const yStr , int * const boundaryX , int * const boundaryY , int n , int boxDim ) {

15 16 17

int boxI = boxArray [ blockIdx . x ]. x ; int boxJ = boxArray [ blockIdx . x ]. y ;

18 19

extern __shared__ char d y n S m e m W s t r i n g s G e n e r a l [];

20 21 22 23

char * const xStr_shared = & d y n S m e m W s t r i n g s G e n e r a l [0]; char * const yStr_shared = & d y n S m e m W s t r i n g s G e n e r a l [ boxDim ]; int * diag_base = ( int *) & d y n S m e m W s t r i n g s G e n e r a l [ boxDim *2];

24 25 26 27 28 29 30 31 32

// inner cost diagonals int * subMatrix [3] = { & diag_base [0] , & diag_base [ boxDim ] , & diag_base [2* boxDim ] }; // a pointer for wrapping around the diagonals int * tmpPointer ;

33 34 35 36

// stride variables based on the problem size and number of threads int totalstrides = ( boxDim / blockDim . x ) ; int strideWidth = boxDim / totalstrides ;

37 38 39 40 41 42 43 44

// Copy local X - string and Y - string needed for the current tile for ( int stride = 0; stride < totalstrides ; ++ stride ) { xStr_shared [ threadIdx . x + stride * strideWidth ] = xStr [ boxI * boxDim + threadIdx . x + stride * strideWidth ]; yStr_shared [ threadIdx . x + stride * strideWidth ] = yStr [ boxJ * boxDim + threadIdx . x + stride * strideWidth ]; }

45 46

int totalD iagonals = boxDim *2 - 1;

47 48 49 50

// pre - calculate index values for y / boundaryX int b o u n d a r y X O f f s e t R e a d = n *( boxJ -1) + boxDim * boxI ; int b o u n d a r y X O f f s e t W r i t e = b o u n d a r y X O f f s e t R e a d + n ;

51

55


B. K ERNEL S OURCE C ODE int b o u n d a r y Y O f f s e t R e a d int b o u n d a r y Y O f f s e t W r i t e

52 53

= n *( boxI -1) + boxDim * boxJ ; = boundaryYOffsetRead + n;

54

// sync all threads in block __syncthreads () ;

55 56 57

// calculate the cost values for ( int slice = 0; slice < totalD iagonal s ; ++ slice ) {

58 59 60

// for each stride for ( int stride = 0; stride < totalstrides ; ++ stride ) { // update i , j int i = threadIdx . x + ( stride * strideWidth ) ; int j = slice - i ;

61 62 63 64 65 66

if (!( j < 0 || j >= boxDim ) ) { // calculate int northWestValue , result ; int northValue = j == 0 ? boundaryX [ b o u n d a r y X O f f s e t R e a d + i ] : subMatrix [ NOMATCH ][ i ]; int westValue = i == 0 ? boundaryY [ b o u n d a r y Y O f f s e t R e a d + j ] : subMatrix [ NOMATCH ][ i - 1];

67 68 69 70 71 72

if ( j == 0) { // border to the north north WestVal ue = boundaryX [ b o u n d a r y X O f f s e t R e a d + i - 1]; } else if ( i == 0) { // border to the west north WestVal ue = boundaryY [ b o u n d a r y Y O f f s e t R e a d + j - 1]; } else { // not on border , read from own cost values north WestVal ue = subMatrix [ MATCH ][ i - 1]; }

73 74 75 76 77 78 79 80 81 82 83

result = max ( westValue , north WestVal ue + ( yStr_shared [ j ] == xStr_shared [ i ]) ); result = max ( northValue , result ) ;

84 85 86 87 88 89

subMatrix [ RESULT ][ i ] = result ; // end of calculation

90 91 92

// on south / east edge ? Save in output if ( j == boxDim - 1) { // south edge . Note : corner is only boundaryY boundaryX [ b o u n d a r y X O f f s e t W r i t e + i ] } else if ( i == boxDim - 1) { // east edge boundaryY [ b o u n d a r y Y O f f s e t W r i t e + j ] }

93 94 95 96 97 98 99 100

}

101

}

102 103

// memory wrap around tmpPointer = subMatrix [2]; subMatrix [2] = subMatrix [0]; subMatrix [0] = subMatrix [1]; subMatrix [1] = tmpPointer ;

104 105 106 107 108 109

__syncthreads () ;

110

}

111 112

}

56

boundary saved here and not in = subMatrix [ RESULT ][ i ];

= subMatrix [ RESULT ][ i ];


B.1. Forward pass kernels

B.1.2 GKernelGlobal

/* * * Computes the LCS cost - boundaries for the tiles defined in boxArray . * * @param boxArray list of tiles to compute * @param xStr X’ * @param yStr Y’ * @param boundaryX boundaries in X direction * @param boundaryY boundaries in Y direction * @param n input size * @param boxDim a multiple of # threads per block */ __global__ __launch_bounds__ ( MAXTHREADSPERBLOCK ) void k e r n e l _ l c s _ f p _ w a v e _ b o u n d a r y _ s t r i v i n g _ g l o b a l S t r i n g s ( const int2 * boxArray , char const * const xStr , char const * const yStr , int * const boundaryX , int * const boundaryY , int n , int boxDim ) {

15 16 17

int boxI = boxArray [ blockIdx . x ]. x ; int boxJ = boxArray [ blockIdx . x ]. y ;

18 19

extern __shared__ int d y n S m e m W O s t r i n g s G e n e r a l [];

20 21 22 23 24 25 26

int * subMatrix [3] = { & d y n S m e m W O s t r i n g s G e n e r a l [0] , & d y n S m e m W O s t r i n g s G e n e r a l [ boxDim ] , & d y n S m e m W O s t r i n g s G e n e r a l [ boxDim *2] }; int * tmpPointer ;

27 28 29 30

int totalstrides = ( boxDim / blockDim . x ) ; volatile int strideWidth = boxDim / totalstrides ; volatile int tot alDiago nals = boxDim *2 - 1;

31 32 33 34 35 36

// pre - calculate index values for volatile int b o u n d a r y X O f f s e t R e a d int b o u n d a r y X O f f s e t W r i t e volatile int b o u n d a r y Y O f f s e t R e a d int b o u n d a r y Y O f f s e t W r i t e

y / boundaryX = n *( boxJ -1) + boxDim = boundaryXOffsetRead = n *( boxI -1) + boxDim = boundaryYOffsetRead

* + * +

boxI ; n; boxJ ; n;

37 38 39 40

// no sync , needed // calculate the cost matrix for ( int slice = 0; slice < totalD iagonal s ; ++ slice ) {

41 42 43 44 45 46

// for each stride for ( int stride = 0; stride < totalstrides ; ++ stride ) { // update i , j int i = threadIdx . x + ( stride * strideWidth ) ; int j = slice - i ;

47 48 49 50 51 52 53

// from here the kernel is the same as k e r n e l _ l c s _ f p _ 0 1 _ w a v e _ b o u n d a r y if (!( j < 0 || j >= boxDim ) ) { // calculate int northWestValue , result ; int northValue = j == 0 ? boundaryX [ b o u n d a r y X O f f s e t R e a d + i ] : subMatrix [ NOMATCH ][ i ]; int westValue = i == 0 ? boundaryY [ b o u n d a r y Y O f f s e t R e a d + j ] : subMatrix [ NOMATCH ][ i - 1];

54 55 56 57 58 59 60 61 62 63 64

if ( j == 0) { // border to the north north WestVal ue = boundaryX [ b o u n d a r y X O f f s e t R e a d + i - 1]; } else if ( i == 0) { // border to the west north WestVal ue = boundaryY [ b o u n d a r y Y O f f s e t R e a d + j - 1]; } else { // not on border , read from own cost values north WestVal ue = subMatrix [ MATCH ][ i - 1]; }

57


B. K ERNEL S OURCE C ODE 65

result = max ( westValue , nort hWestVa lue + ( yStr [ boxJ * boxDim + j ] == xStr [ boxI * boxDim + i ]) ); result = max ( northValue , result ) ;

66 67 68 69 70 71

subMatrix [ RESULT ][ i ] = result ; // end of calculation

72 73 74

// on south / east edge ? Save in output if ( j == boxDim - 1) { // south edge . Corner is only saved boundaryX [ b o u n d a r y X O f f s e t W r i t e + i ] } else if ( i == boxDim - 1) { // east edge boundaryY [ b o u n d a r y Y O f f s e t W r i t e + j ] }

75 76 77 78 79 80 81 82

}

83

}

84 85

// memory wrap around tmpPointer = subMatrix [2]; subMatrix [2] = subMatrix [0]; subMatrix [0] = subMatrix [1]; subMatrix [1] = tmpPointer ;

86 87 88 89 90 91

__syncthreads () ;

92

}

93 94

}

58

boundary here and not in boundaryY = subMatrix [ RESULT ][ i ];

= subMatrix [ RESULT ][ i ];


B.1. Forward pass kernels

B.1.3 1 2 3 4 5 6 7 8 9 10 11 12 13 14

SK ERNEL SHARED

/* * * Computes the LCS cost - boundaries for the tiles defined in boxArray . * * @param boxArray list of tiles to compute * @param xStr X’ * @param yStr Y’ * @param boundaryX boundaries in X direction * @param boundaryY boundaries in Y direction * @param n input size * @param boxDim a multiple of # threads per block */ __global__ __launch_bounds__ ( MAXTHREADSPERBLOCK ) void k e r n e l _ l c s _ f p _ w a v e _ b o u n d a r y _ s t r i v i n g _ s c a l i n g ( const int2 * boxArray , char const * const xStr , char const * const yStr , int * const boundaryX , int * const boundaryY , int n , int boxDim ) {

15 16 17

int boxI = boxArray [ blockIdx . x ]. x ; int boxJ = boxArray [ blockIdx . x ]. y ;

18 19

extern __shared__ char d y n S m e m W s t r i n g s S p e c i a l i z e d [];

20 21 22 23

char * const xStr_shared = & d y n S m e m W s t r i n g s S p e c i a l i z e d [0]; char * const yStr_shared = & d y n S m e m W s t r i n g s S p e c i a l i z e d [ boxDim ]; diag_t * diag_base = ( diag_t *) & d y n S m e m W s t r i n g s S p e c i a l i z e d [ boxDim *2];

24 25 26 27 28 29 30

diag_t * subMatrix [3] = { & diag_base [0] , & diag_base [ boxDim ] , & diag_base [2* boxDim ] }; diag_t * tmpPointer ;

31 32 33 34

int totalStrides = boxDim / blockDim . x ; int strideWidth = boxDim / totalStrides ; volatile int tot alDiago nals = boxDim *2 - 1;

35 36 37 38 39 40 41 42

// Copy X - string and Y - string needed for the current tile for ( int stride = 0; stride < totalStrides ; ++ stride ) { xStr_shared [ threadIdx . x + stride * strideWidth ] = xStr [ boxI * boxDim + threadIdx . x + stride * strideWidth ]; yStr_shared [ threadIdx . x + stride * strideWidth ] = yStr [ boxJ * boxDim + threadIdx . x + stride * strideWidth ]; }

43 44 45 46 47 48

// pre - calculate index values for volatile int b o u n d a r y X O f f s e t R e a d int b o u n d a r y X O f f s e t W r i t e volatile int b o u n d a r y Y O f f s e t R e a d int b o u n d a r y Y O f f s e t W r i t e

y / boundaryX = n *( boxJ -1) + boxDim = boundaryXOffsetRead = n *( boxI -1) + boxDim = boundaryYOffsetRead

* + * +

boxI ; n; boxJ ; n;

49 50 51

// scaling int resultScaling = ( boxJ ==0) ? 0 : boundaryX [ b o u n d a r y X O f f s e t R e a d ];

52 53 54

// sync all threads in block __syncthreads () ;

55 56 57

// calculate the cost matrix for ( int slice = 0; slice < totalD iagonal s ; ++ slice ) {

58 59 60 61 62 63 64

// for each stride // # pragma unroll 8 for ( int stride = 0; stride < totalStrides ; ++ stride ) { // update i , j int i = threadIdx . x + ( stride * strideWidth ) ; int j = slice - i ;

65 66

if (!( j <0 || j >= boxDim ) ) {

59


B. K ERNEL S OURCE C ODE int northWestValue , result ; int northValue = j ==0 ? boundaryX [ b o u n d a r y X O f f s e t R e a d + i ] resultScaling : subMatrix [ NOMATCH ][ i ]; // scaling int westValue = i ==0 ? boundaryY [ b o u n d a r y Y O f f s e t R e a d + j ] resultScaling : subMatrix [ NOMATCH ][ i -1]; // scaling

67 68 69 70

if ( j ==0) { // border to the north nort hWestVa lue = boundaryX [ b o u n d a r y X O f f s e t R e a d +i -1] resultScaling ; // scaling } else if ( i ==0) { // border to the west nort hWestVa lue = boundaryY [ b o u n d a r y Y O f f s e t R e a d +j -1] resultScaling ; // scaling } else { // not on border , read from own cost values nort hWestVa lue = subMatrix [ MATCH ][ i -1]; }

71 72 73 74 75 76 77 78 79 80 81

result = max ( westValue , nort hWestVal ue + ( yStr_shared [ j ] == xStr_shared [ i ]) ); result = max ( northValue , result ) ;

82 83 84 85 86

subMatrix [ RESULT ][ i ] = result ; // end of calculation

87 88 89

// on south / east edge ? Save in output boundary if ( j == boxDim -1) { // south edge , corner is only saved here and not in boundaryY boundaryX [ b o u n d a r y X O f f s e t W r i t e + i ] = subMatrix [ RESULT ][ i ] + resultScaling ; // scaling } else if ( i == boxDim -1) { // east edge boundaryY [ b o u n d a r y Y O f f s e t W r i t e + j ] = subMatrix [ RESULT ][ i ] + resultScaling ; // scaling }

90 91 92 93 94 95 96 97 98 99

}

100

}

101 102

// memory wrap around tmpPointer = subMatrix [2]; subMatrix [2] = subMatrix [0]; subMatrix [0] = subMatrix [1]; subMatrix [1] = tmpPointer ;

103 104 105 106 107 108

__syncthreads () ;

109

}

110 111

}

60


B.1. Forward pass kernels

B.1.4 1 2 3 4 5 6 7 8 9 10 11 12 13 14

SK ERNEL GLOBAL

/**
 * Computes the LCS cost-boundaries for the tiles defined in boxArray.
 *
 * @param boxArray   list of tiles to compute
 * @param xStr       X'
 * @param yStr       Y'
 * @param boundaryX  boundaries in X direction
 * @param boundaryY  boundaries in Y direction
 * @param n          input size
 * @param boxDim     a multiple of #threads per block
 */
__global__ __launch_bounds__(MAXTHREADSPERBLOCK)
void kernel_lcs_fp_wave_boundary_striving_scaling_globalStrings(
        const int2 *boxArray, char const *const xStr, char const *const yStr,
        int *const boundaryX, int *const boundaryY, int n, int boxDim) {

    int boxI = boxArray[blockIdx.x].x;
    int boxJ = boxArray[blockIdx.x].y;

    extern __shared__ diag_t dynSmemWOstringsSpecialized[];

    diag_t *subMatrix[3] = { &dynSmemWOstringsSpecialized[0],
                             &dynSmemWOstringsSpecialized[boxDim],
                             &dynSmemWOstringsSpecialized[boxDim*2] };
    diag_t *tmpPointer;

    int totalStrides            = boxDim / blockDim.x;
    volatile int strideWidth    = boxDim / totalStrides;
    volatile int totalDiagonals = boxDim*2 - 1;

    // pre-calculate index values for boundaryY / boundaryX
    volatile int boundaryXOffsetRead = n*(boxJ-1) + boxDim * boxI;
    int boundaryXOffsetWrite         = boundaryXOffsetRead + n;
    volatile int boundaryYOffsetRead = n*(boxI-1) + boxDim * boxJ;
    int boundaryYOffsetWrite         = boundaryYOffsetRead + n;

    // scaling
    int resultScaling = (boxJ==0) ? 0 : boundaryX[boundaryXOffsetRead];

    // no sync needed
    // calculate the cost matrix
    for (int slice = 0; slice < totalDiagonals; ++slice) {

        // for each stride
        // stride counter is char, fix for register usage
        // #pragma unroll 8
        for (char stride = 0; stride < totalStrides; ++stride) {

            // update i, j
            int i = threadIdx.x + (stride * strideWidth);
            int j = slice - i;

            if (!(j < 0 || j >= boxDim)) {
                // calculate
                int northWestValue, result;
                int northValue = j==0 ? boundaryX[boundaryXOffsetRead + i] - resultScaling
                                      : subMatrix[NOMATCH][i];                 // scaling
                int westValue  = i==0 ? boundaryY[boundaryYOffsetRead + j] - resultScaling
                                      : subMatrix[NOMATCH][i-1];               // scaling

                if (j==0) {
                    // border to the north
                    northWestValue = boundaryX[boundaryXOffsetRead + i-1] - resultScaling; // scaling
                } else if (i==0) {
                    // border to the west
                    northWestValue = boundaryY[boundaryYOffsetRead + j-1] - resultScaling; // scaling
                } else {
                    // not on border, read from own cost values
                    northWestValue = subMatrix[MATCH][i-1];
                }

                result = max(westValue, northWestValue + (yStr[boxJ*boxDim + j] == xStr[boxI*boxDim + i]));
                result = max(northValue, result);

                subMatrix[RESULT][i] = result;
                // end of calculation

                // on south/east edge? Save in output boundary
                if (j == boxDim-1) {
                    // south edge, corner is only saved here and not in boundaryY
                    boundaryX[boundaryXOffsetWrite + i] = subMatrix[RESULT][i] + resultScaling; // scaling
                } else if (i == boxDim-1) {
                    // east edge
                    boundaryY[boundaryYOffsetWrite + j] = subMatrix[RESULT][i] + resultScaling; // scaling
                }
            }
        }

        // memory wrap around
        tmpPointer   = subMatrix[2];
        subMatrix[2] = subMatrix[0];
        subMatrix[0] = subMatrix[1];
        subMatrix[1] = tmpPointer;

        __syncthreads();
    }
}
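The listings do not include the host-side driver for this kernel. The sketch below is a hypothetical example of how the kernel's arguments could fit together when tiles are scheduled along anti-diagonals of the tile grid, the simplest schedule that respects the boundary dependencies; it is not the layout proposed in this thesis. The buffer names, the tile schedule and the choice of threads per block are assumptions, and it is assumed that d_xStr, d_yStr, d_boundaryX and d_boundaryY are already allocated and filled, and that diag_t, MAXTHREADSPERBLOCK and the kernel are declared in the same compilation unit.

// Hypothetical host-side driver (not part of the thesis listings).
#include <cuda_runtime.h>
#include <vector>
#include <algorithm>

void forward_pass_wavefront(const char *d_xStr, const char *d_yStr,
                            int *d_boundaryX, int *d_boundaryY,
                            int n, int boxDim)
{
    int numBoxes = n / boxDim;                               // tiles per row/column of the tile grid
    int threadsPerBlock = std::min(boxDim, MAXTHREADSPERBLOCK);
    size_t smemBytes = 3 * (size_t)boxDim * sizeof(diag_t);  // three diagonals of one tile

    int2 *d_boxArray;
    cudaMalloc((void **)&d_boxArray, (size_t)numBoxes * sizeof(int2));

    // Tiles on the same anti-diagonal of the tile grid (boxI + boxJ == d) only
    // depend on boundaries written by earlier launches, so they can run concurrently.
    for (int d = 0; d <= 2 * (numBoxes - 1); ++d) {
        std::vector<int2> tiles;
        for (int boxI = 0; boxI < numBoxes; ++boxI) {
            int boxJ = d - boxI;
            if (boxJ >= 0 && boxJ < numBoxes)
                tiles.push_back(make_int2(boxI, boxJ));
        }
        cudaMemcpy(d_boxArray, tiles.data(), tiles.size() * sizeof(int2),
                   cudaMemcpyHostToDevice);

        kernel_lcs_fp_wave_boundary_striving_scaling_globalStrings
            <<<(unsigned)tiles.size(), threadsPerBlock, smemBytes>>>(
                d_boxArray, d_xStr, d_yStr, d_boundaryX, d_boundaryY, n, boxDim);
    }
    cudaDeviceSynchronize();
    cudaFree(d_boxArray);
}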


B.1.5 BP KERNEL

/**
 * Computes the backward pass using boundaries and pinned memory.
 *
 * @param xStr            X'
 * @param yStr            Y'
 * @param subBoundaryX    boundaries in X direction
 * @param subBoundaryY    boundaries in Y direction
 * @param boxDim          a multiple of #threads per block
 * @param subBoxDim       equal to #threads per block
 * @param hostLcs         pinned memory for the LCS string
 * @param globalLcsIndex  index of current LCS char
 * @param current_trace   where are we in the global cost matrix
 * @param current_box     where are we in the current tile
 */
// Defines for bank conflict fix
#define JUMPCNT 4      // number of entries within one bank
#define WARPLENGTH 32  // number of threads in a warp
__global__ void kernel_lcs_bp(char *xStr, char *yStr,
        int *subBoundaryX, int *subBoundaryY, int boxDim, int subBoxDim,
        char *hostLcs, int *globalLcsIndex,
        int2 *current_trace, int2 *current_box) {

    // Bank conflict mapping, mod operations can be optimized by compiler.

    /* index of a warp within a warp collection */
    int subWarpIndex = (threadIdx.x / WARPLENGTH) % JUMPCNT;

    // collection index. one collection calcs i = 0..warpsize*entriesPerBank
    int warpCollectionIdx = threadIdx.x / (WARPLENGTH * JUMPCNT);

    // all thread IDs are relative to a warp
    int warpIndex = threadIdx.x % WARPLENGTH;
    int i = (warpIndex * JUMPCNT) + subWarpIndex + warpCollectionIdx*128;

    /* bank conflicts fixed */

    // shared y/x strings with the width of a subBoxDim
    __shared__ char xStr_shared[MAX_BACKWARD_PASS_N];
    __shared__ char yStr_shared[MAX_BACKWARD_PASS_N];

    // DPM for the current subBox
    __shared__ char subMatrix[MAX_BACKWARD_PASS_N * MAX_BACKWARD_PASS_N];

    int totalDiagonals = subBoxDim*2 - 1;
    int subBoundaryXOffsetRead;
    int subBoundaryYOffsetRead;

    // index to current subBox
    __shared__ int2 subBox;
    volatile int2 subTrace;

    int lcsIndex = *globalLcsIndex;

    // initialize thread 0 variables
    if (i == 0) {
        subBox.x = floor((float)(current_trace->x % boxDim) / subBoxDim);
        subBox.y = floor((float)(current_trace->y % boxDim) / subBoxDim);

        subTrace.x = current_trace->x % subBoxDim;
        subTrace.y = current_trace->y % subBoxDim;
    }

    __syncthreads(); // make sure all is in sync before the while loop

    // loop until we are outside the matrix
    while (subBox.x >= 0 && subBox.y >= 0) {
        // Copy X-string and Y-string needed for the current box
        xStr_shared[i] = xStr[subBox.x * subBoxDim + i];
        yStr_shared[i] = yStr[subBox.y * subBoxDim + i];

        // index to sub boundary saved in the forward pass for the current subBox
        subBoundaryXOffsetRead = boxDim*(subBox.y-1) + subBoxDim * subBox.x;
        subBoundaryYOffsetRead = boxDim*(subBox.x-1) + subBoxDim * subBox.y;

        int resultScaling = subBoundaryX[subBoundaryXOffsetRead];

        // sync all threads in tile
        __syncthreads();

        // calculate the sub-matrix for the current subBox
        for (int slice = 0; slice < totalDiagonals; ++slice) {
            int j = slice - i;
            int resultOffset = j * subBoxDim + i;

            if (!(j < 0 || j >= subBoxDim)) {
                int northWestValue, result;
                int northValue = j == 0 ? subBoundaryX[subBoundaryXOffsetRead + i] - resultScaling
                                        : subMatrix[resultOffset - subBoxDim];

                int westValue = i == 0 ? subBoundaryY[subBoundaryYOffsetRead + j] - resultScaling
                                       : subMatrix[resultOffset - 1];

                if (j==0) {
                    // border to the north
                    northWestValue = subBoundaryX[subBoundaryXOffsetRead + i-1] - resultScaling;

                    // boxJ>0 could be removed by adding one extra value to boundaryY
                    northWestValue = (i == 0 && subBox.x==0 && subBox.y>0)
                            ? subBoundaryY[subBoundaryYOffsetRead + j-1] - resultScaling
                            : northWestValue;

                } else if (i==0) {
                    // border to the west
                    northWestValue = subBoundaryY[subBoundaryYOffsetRead + j-1] - resultScaling;
                } else {
                    // not on border, read from own sub-matrix values
                    northWestValue = subMatrix[resultOffset - subBoxDim - 1];
                }

                result = max(westValue, northWestValue + (yStr_shared[j] == xStr_shared[i]));
                result = max(northValue, result);

                subMatrix[resultOffset] = result;
            }

            __syncthreads(); // sync all threads after each diagonal
        }

        // sub-matrix is done, do backward pass using one thread

        // one thread backtrace
        if (i == 0) {
            // backtrace
            while (subTrace.x >= 0 && subTrace.y >= 0) { // inside tile

                if (yStr_shared[subTrace.y] == xStr_shared[subTrace.x]) {
                    // match, save result, local and global
                    hostLcs[lcsIndex] = yStr_shared[subTrace.y];
                    lcsIndex--;
                    subTrace.x--;
                    subTrace.y--;
                } else {
                    // no match, go north or west?
                    int northValue = subTrace.y == 0
                            ? subBoundaryX[subBoundaryXOffsetRead + subTrace.x] - resultScaling
                            : subMatrix[(subTrace.y-1) * subBoxDim + subTrace.x];   // scaling
                    int westValue = subTrace.x == 0
                            ? subBoundaryY[subBoundaryYOffsetRead + subTrace.y] - resultScaling
                            : subMatrix[subTrace.y * subBoxDim + subTrace.x-1];     // scaling

                    if (northValue >= westValue) {
                        subTrace.y--; // go north
                    } else {
                        subTrace.x--; // go west
                    }
                } // end if
            } // while end (outside tile)

            // done, on border - update boxI/J and subTrace.x/y
            if (subTrace.x == -1) {
                // go west
                subBox.x--;
                // flip trace I
                subTrace.x = subBoxDim-1;
            }

            if (subTrace.y == -1) {
                // go north
                subBox.y--;
                // flip trace J
                subTrace.y = subBoxDim-1;
            }

        } // one thread is done with the trace for current subTile

        __syncthreads(); // backward trace done for current subTile, sync
    } // all is done

    // thread 0 updates global lcs index and pinned traces
    if (i==0) {
        *globalLcsIndex = lcsIndex;

        // update current trace based on subTrace and box ...
        current_trace->x = (current_box->x * boxDim + subBox.x * subBoxDim + subTrace.x);
        current_trace->y = (current_box->y * boxDim + subBox.y * subBoxDim + subTrace.y);
    }
}
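The doc comment above states that hostLcs, globalLcsIndex, current_trace and current_box live in pinned host memory. The sketch below shows one hypothetical way of setting this up with mapped (zero-copy) pinned allocations and a single-block launch of subBoxDim threads; it is not taken from the thesis, and the buffer names, sizes and the mapped-memory choice are assumptions.

// Hypothetical backward-pass invocation (not part of the thesis listings).
#include <cuda_runtime.h>

void backward_pass_pinned(char *d_xStr, char *d_yStr,
                          int *d_subBoundaryX, int *d_subBoundaryY,
                          int boxDim, int subBoxDim, int maxLcsLength,
                          int2 startTrace, int2 startBox)
{
    // Mapped pinned allocations let the kernel write the LCS characters and the
    // trace state straight into host memory. cudaSetDeviceFlags must run before
    // the CUDA context is created, i.e. before any other CUDA call in the program.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    char *hostLcs;  int *globalLcsIndex;  int2 *current_trace, *current_box;
    cudaHostAlloc((void **)&hostLcs,        (size_t)maxLcsLength, cudaHostAllocMapped);
    cudaHostAlloc((void **)&globalLcsIndex, sizeof(int),          cudaHostAllocMapped);
    cudaHostAlloc((void **)&current_trace,  sizeof(int2),         cudaHostAllocMapped);
    cudaHostAlloc((void **)&current_box,    sizeof(int2),         cudaHostAllocMapped);

    *globalLcsIndex = maxLcsLength - 1;   // the LCS is written back to front
    *current_trace  = startTrace;         // position in the global cost matrix
    *current_box    = startBox;           // tile in which the backtrace starts

    char *dLcs;  int *dIdx;  int2 *dTrace, *dBox;
    cudaHostGetDevicePointer((void **)&dLcs,   hostLcs,        0);
    cudaHostGetDevicePointer((void **)&dIdx,   globalLcsIndex, 0);
    cudaHostGetDevicePointer((void **)&dTrace, current_trace,  0);
    cudaHostGetDevicePointer((void **)&dBox,   current_box,    0);

    // One block of subBoxDim threads (subBoxDim <= MAX_BACKWARD_PASS_N) traces
    // back through the sub-boxes of one tile; the kernel leaves the trace
    // position for the next tile in current_trace and globalLcsIndex.
    kernel_lcs_bp<<<1, subBoxDim>>>(d_xStr, d_yStr, d_subBoundaryX, d_subBoundaryY,
                                    boxDim, subBoxDim, dLcs, dIdx, dTrace, dBox);
    cudaDeviceSynchronize();

    cudaFreeHost(hostLcs);
    cudaFreeHost(globalLcsIndex);
    cudaFreeHost(current_trace);
    cudaFreeHost(current_box);
}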


