I. Statement of the problem

Finding all the occurrences of a substring in a string is a common operation with many uses in computer science, such as:
- text search within a document
- search for patterns in DNA sequences
- plagiarism detection
- normal usage within a programming framework, including detection of regular expressions
Formally, the string-matching problem takes as input a text T[1…n], an array of length n, and a pattern P[1…m] of length m ≤ n. The elements of T and P are characters drawn from a finite alphabet Σ. We say that pattern P occurs with shift s in text T if 0 ≤ s ≤ n - m and T[s+1…s+m] = P[1…m]. The string-matching problem asks for all valid shifts at which pattern P occurs in text T.

II. Proposed algorithms to solve the problem

Programmers rarely need to write code for detecting substrings in a string, since this operation is already built into many applications. However, a closer look at how it is implemented reveals the ingenuity of the algorithms and the relatively good performance they achieve. In this survey, I analyze just a few of the best-known algorithms for finding substrings in a string, which are used today in programming languages such as C++ and Java, or implemented in Unix utilities such as grep. As we will see later, these algorithms rely on special data structures to achieve their results, and we will focus especially on:
- Knuth-Morris-Pratt algorithm (aka KMP), which first appeared in June 1977
- Boyer-Moore algorithm, which appeared in October 1977
- Bitap (aka Shift-Add) algorithm, which appeared in 1992
- Rabin-Karp algorithm, enhanced with hashing as a string-matching tool, published in March 2014
III. Naïve method (brute force)

The naïve algorithm slides a window of size m over the text T and, at each position, compares the current substring of T against the pattern P. The pseudo-code [1] looks as follows:

    NAÏVE-STRING-MATCHER(T, P)
        n ← length(T)
        m ← length(P)
        for s ← 0 to n - m do
            if P[1…m] == T[s+1…s+m] then
                print "Pattern occurs with shift", s
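
For readers who prefer runnable code, the pseudo-code above can be sketched in Python as follows. This is only an illustrative translation, not the reference implementation from [1]: the function name naive_string_matcher and the choice to return the shifts as a list (rather than print them) are assumptions made for this example. Python strings are 0-indexed, so the slice text[s:s+m] plays the role of T[s+1…s+m].

```python
from __future__ import annotations


def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s at which `pattern` occurs in `text`.

    Hypothetical helper written for illustration; shifts are 0-based,
    matching the definition 0 <= s <= n - m.
    """
    n, m = len(text), len(pattern)
    shifts = []
    # Slide a window of width m over the text: s = 0, 1, ..., n - m.
    # If m > n the range is empty and no shift is reported.
    for s in range(n - m + 1):
        # Compare the current window of the text against the whole pattern.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


if __name__ == "__main__":
    # Overlapping occurrences are reported as separate shifts.
    print(naive_string_matcher("abracadabra", "abra"))  # [0, 7]
    print(naive_string_matcher("aaaa", "aa"))           # [0, 1, 2]
```

Each of the n - m + 1 windows may require up to m character comparisons, so in the worst case this approach performs on the order of (n - m + 1)·m comparisons.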