I. Statement of the problem

Finding all the occurrences of a substring in a string is a common operation with a number of uses in computer science, such as:

- text search within a document
- search for patterns in DNA sequences
- plagiarism detection
- everyday use within a programming framework, including the detection of regular expressions
To put it formally, the string-matching problem takes as input a text T[1…n], an array of length n, and a pattern P[1…m] of length m ≤ n. The elements of T and P are characters drawn from a finite alphabet ∑. We say that pattern P occurs with shift s in text T if 0 ≤ s ≤ n – m and T[s+1…s+m] = P[1..m]. The string-matching problem asks for all valid shifts at which the pattern P appears in text T.

II. Proposed algorithms to solve the problem

Programmers are rarely required to write code for detecting substrings in a string, since this operation is already built into many applications. However, if we take a closer look at the way it has been implemented, we would be surprised by the ingenuity of the algorithms and the relatively good performance they achieve. In this survey, I propose for analysis a few of the most famous algorithms for finding substrings in a string, which are used today in programming languages such as C++ and Java, or implemented in Unix utilities such as grep. As we will see later, these algorithms make use of special data structures to accomplish their results, and we will focus especially on:
- the Knuth-Morris-Pratt algorithm (aka KMP), which first appeared in June 1977
- the Boyer-Moore algorithm, which appeared in October 1977
- the Bitap (aka Shift-Add) algorithm, which appeared in 1992
- the Rabin-Karp algorithm, based on hashing as a string matching tool, published in 1987
III. Naïve method (brute force)

The naïve algorithm acts as a sliding window over the text T, at each position comparing the current substring of T of size m against the pattern P. The pseudo-code [1] looks as follows:

NAÏVE-STRING-MATCHER(T, P)
  n ← length(T)
  m ← length(P)
  for s ← 0 to n-m do
    if P[1..m] == T[s+1 … s+m] then
      print "Pattern occurs with shift", s
This simple algorithm has a time complexity of O((n-m+1)·m), which can be greatly improved: the naïve string-matcher involves no preprocessing and is considered inefficient because "the information gained about the text for one value of s is entirely ignored in considering other values of s" [1].
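For reference, a minimal C++ sketch of this brute-force matcher might look as follows (the function name naive_match and the use of std::string are illustrative choices, not part of [1]):

#include <iostream>
#include <string>
#include <vector>

// Naive matcher: report every shift s at which P occurs in T.
std::vector<int> naive_match(const std::string& T, const std::string& P) {
    std::vector<int> shifts;
    int n = (int)T.size(), m = (int)P.size();
    for (int s = 0; s + m <= n; ++s)          // try every candidate shift
        if (T.compare(s, m, P) == 0)          // compare the m-character window against P
            shifts.push_back(s);
    return shifts;
}

int main() {
    for (int s : naive_match("abbababaaa", "ababaa"))
        std::cout << "Pattern occurs with shift " << s << '\n';   // prints: shift 3
}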
IV. The Knuth-Morris-Pratt algorithm

KMP was the first algorithm to achieve linear time complexity, and it can be seen as an improvement of the naïve string-matching algorithm. In contrast to the naïve method, KMP solves the inefficiency issue by keeping the information that would otherwise be wasted and using it to advance through the string. It also has a preprocessing part that indicates how much of a previous failed comparison can be reused as the algorithm continues. The algorithm has two parts:
1. A preprocessing part, which occurs at initialization and whose result is the prefix function π. The pattern is matched against shifts of itself, in order to find the size of the largest prefix of P[1…j] that is also a suffix of P[2..j+1]. This information helps avoid testing useless shifts, as previously happened in the naïve algorithm. The pseudo-code for this part can be written as follows:

Compute-Prefix-Function(P)
  m ← length(P)
  π[1] ← 0
  k ← 0
  for q ← 2 to m do
    while k > 0 and P[k+1] ≠ P[q] do
      k ← π[k]
    if P[k+1] == P[q] then
      k ← k + 1
    π[q] ← k
  return π

To illustrate this function, I will take the example P = "ababaa" and compute the prefix array:

P:  a  b  a  b  a  a
π:  0
Initially, m = 6, k = 0 and π[1] = 0.
For q = 2 and k = 0: k == 0 (no while); P[k+1] == P[q], i.e. P[1] == P[2] (false), so π[2] = 0.
For q = 3 and k = 0: k == 0 (no while); P[1] == P[3] (true) => k = 1 and π[3] = 1.
For q = 4 and k = 1: k > 0 but P[2] == P[4] (no while); P[2] == P[4] (true) => k = 2 and π[4] = 2.
For q = 5 and k = 2: k > 0 but P[3] == P[5] (no while); P[3] == P[5] (true) => k = 3 and π[5] = 3.
For q = 6 and k = 3: k > 0 and P[4] ≠ P[6] => k = π[3] = 1; k > 0 and P[2] ≠ P[6] => k = π[1] = 0 (end while); P[1] == P[6] => k = 1 and π[6] = 1.
The final π will look like:
P:  a  b  a  b  a  a
π:  0  0  1  2  3  1
The meaning of π in position 4, for example, is that the size of the largest prefix of P[1..3] which is also a suffix of P[2..4] is 2. Indeed, the largest such prefix for this case is “ab”: at position 4, the last 2 characters processed represent the largest prefix for the current sub-sequence. The prefix function has a running time of O(m).
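The pseudo-code above can be translated almost directly into C++; the short sketch below uses 0-based indices (so pi[q] here corresponds to π[q+1] in the pseudo-code) and reproduces the array computed in the example:

#include <iostream>
#include <string>
#include <vector>

// pi[q] = length of the longest proper prefix of P[0..q] that is also a suffix of it.
std::vector<int> compute_prefix_function(const std::string& P) {
    int m = (int)P.size();
    std::vector<int> pi(m, 0);
    int k = 0;                       // length of the currently matched prefix
    for (int q = 1; q < m; ++q) {
        while (k > 0 && P[k] != P[q])
            k = pi[k - 1];           // fall back to the next shorter border
        if (P[k] == P[q])
            ++k;
        pi[q] = k;
    }
    return pi;
}

int main() {
    for (int v : compute_prefix_function("ababaa"))
        std::cout << v << ' ';       // prints: 0 0 1 2 3 1
}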
2. The string-matching computation part, which scans the text from left to right in search of all possible matches:

KMP-Match(T, P)
  n ← length(T)
  m ← length(P)
  π ← Compute-Prefix-Function(P)
  q ← 0
  for i ← 1 to n do                       // scan the text left to right
    while q > 0 and P[q+1] ≠ T[i] do      // next char does not match
      q ← π[q]
    if P[q+1] == T[i] then                // next char matches
      q ← q + 1
    if q == m then                        // if all of P is matched
      print "Pattern occurs with shift", i-m
      q ← π[q]                            // look for the next match

To illustrate on an example, we use the same pattern P = "ababaa" and the text to search in T = "abbababaaa":
T:  a  b  b  a  b  a  b  a  a  a
n = 10, m = 6, π = [0,0,1,2,3,1], q = 0, P = "ababaa"
i=1: P[1] == T[1] (true) => q = 1
i=2: q > 0 and P[2] ≠ T[2] (false, no while); P[2] == T[2] => q = 2 (last two characters matched)
i=3: q > 0 and P[3] ≠ T[3] (true) => q = π[2] = 0 (reset the matching); P[1] == T[3] (false)
i=4: P[1] == T[4] => q = 1
i=5: q > 0 and P[2] ≠ T[5] (false, no while); P[2] == T[5] => q = 2
i=6: q > 0 and P[3] ≠ T[6] (false, no while); P[3] == T[6] => q = 3 (advance)
i=7: P[4] == T[7] => q = 4
i=8: P[5] == T[8] => q = 5
i=9: P[6] == T[9] => q = 6 and q == m => "Pattern occurs with shift 3" (the match starts at position 4). Look for the next match: q ← π[6] = 1
i=10: q > 0 and P[2] ≠ T[10] (true) => q = π[1] = 0; P[1] == T[10] (true) => q = 1. End.

The matching part of the algorithm has a complexity of O(n), therefore the total complexity of the two parts adds up to O(m+n), which is better than that of the naïve algorithm. The secret of the KMP algorithm is that whenever it needs to "back up" in the pattern string, it does so by taking into account what has already been matched from the current sub-pattern against the text.

Advantages:
- Optimal running time O(n+m), which is very fast
- No need to back up to the first element of the pattern when a mismatch occurs
Disadvantages:
- It does not run as well if the size of the alphabet ∑ increases
As for the data structures and variables it uses, the KMP algorithm only makes use of an additional array π of size m, and two state variables q and k.
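Putting the two parts together, a compact C++ sketch of the whole KMP search (again with 0-based indices; function names are my own) could look like this:

#include <iostream>
#include <string>
#include <vector>

// Report every 0-based shift of P in T using the KMP algorithm.
std::vector<int> kmp_match(const std::string& T, const std::string& P) {
    int n = (int)T.size(), m = (int)P.size();
    std::vector<int> shifts;
    if (m == 0 || m > n) return shifts;

    std::vector<int> pi(m, 0);
    for (int q = 1, k = 0; q < m; ++q) {              // preprocessing: prefix function of P
        while (k > 0 && P[k] != P[q]) k = pi[k - 1];
        if (P[k] == P[q]) ++k;
        pi[q] = k;
    }
    for (int i = 0, q = 0; i < n; ++i) {              // scan T left to right
        while (q > 0 && P[q] != T[i]) q = pi[q - 1];  // reuse the partial match on failure
        if (P[q] == T[i]) ++q;
        if (q == m) {                                 // all of P matched
            shifts.push_back(i - m + 1);
            q = pi[q - 1];                            // continue looking for further matches
        }
    }
    return shifts;
}

int main() {
    for (int s : kmp_match("abbababaaa", "ababaa"))
        std::cout << "Pattern occurs with shift " << s << '\n';   // prints: shift 3
}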
V. The Boyer-Moore algorithm

"A fast string searching algorithm", published in the same year as the KMP algorithm, takes a different approach: the text is matched against the last characters of the pattern first, working back towards the first characters [3]. This algorithm is suitable when either the alphabet is reasonably large or the pattern is very long (as happens in bioinformatics applications). Along with the right-to-left approach, Boyer-Moore has two other specific rules, which can be used either alone or, for better performance, together: the "bad character shift rule" and the "good suffix shift rule". The running time is usually sub-linear, because the algorithm generally looks at fewer characters than it passes. It has been shown that the longer the pattern is, the faster the Boyer-Moore algorithm runs.
- The bad character shift rule: as we start matching at the end of pattern P and find a mismatch after a series of k matches in the text, we can increase the shift by k+1 without worrying about missing a potential match.
- The good suffix shift rule: if t is the longest suffix of P that matches T at the current position, then P can be shifted so that the previous occurrence of t in P is aligned with T.
1. Pseudo-code for the bad character shift rule:

Compute-Bad-Char-Shift-Rule(P)
  for k ← 1 to length(last) do
    last[k] ← -1
  for j ← length(P) downto 1 do
    if last[P[j]] < 0 then
      last[P[j]] ← j

This function computes the last occurrence of the character P[j] in the pattern P, where the array last has the size of the alphabet. I will illustrate this function on our given pattern:

P:        a  b  a  b  a  a
position: 1  2  3  4  5  6
last has the size of the alphabet, here 2 since ∑ = {a, b}, and it is initialized with -1. As we scan the pattern backwards, we find last['a'] = 6 and last['b'] = 4. This function has a running time of O(m) in the worst case.
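A C++ sketch of this preprocessing step might look as follows; indexing the last array by all 256 byte values instead of only ∑ = {a, b} is my simplification:

#include <array>
#include <string>

// last[c] = rightmost (1-based) position of character c in P, or -1 if c does not occur.
std::array<int, 256> compute_last_occurrence(const std::string& P) {
    std::array<int, 256> last;
    last.fill(-1);
    for (int j = (int)P.size(); j >= 1; --j)        // scan the pattern backwards
        if (last[(unsigned char)P[j - 1]] < 0)      // first time seen = last occurrence
            last[(unsigned char)P[j - 1]] = j;
    return last;
}
// For P = "ababaa" this gives last['a'] = 6 and last['b'] = 4, as in the example above.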
2. Pseudo-code for computing the good suffix rule:

Compute-Suffix(P)
  suffix[length(suffix)] ← length(suffix)
  j ← length(suffix)
  for i ← length(suffix) – 1 downto 1 do
    while j < length(suffix) and P[j] ≠ P[i] do
      j ← suffix[j+1] – 1
    if P[j] == P[i] then
      j ← j – 1
    suffix[i] ← j + 1

Here suffix is an auxiliary array of size m, and the function that computes it is similar to the KMP steps taken on failure, except that it works backwards. suffix[i] is the smallest j > i such that P[j…m-1] is a prefix of P[i..m-1]; if there is no such j, then suffix[i] = m. The complexity of this operation is O(m).

3. Pseudo-code for computing the matching. It stores the results in a new array match, such that the following property holds:
match[j] = min { s | 0 < s ≤ j, P[j-s] ≠ P[j] and P[j-s+1..m-s-1] is a suffix of P[j+1..m-1] }, if such an s exists;
otherwise, match[j] = min { s | j+1 ≤ s ≤ m and P[0..m-s-1] is a suffix of P[j+1..m-1] }, if such an s exists;
otherwise, match[j] = m.
Compute-Match(P)
  initialize match as an array with length(match) as elements
  Compute-Suffix(P)
  // try to compute match using the first criterion
  for i ← 1 to length(match) do
    j ← suffix[i+1] – 1
    if suffix[i] > j then
      match[j] ← j – i
    else
      match[j] ← min(j – i + match[i], match[j])
  // compute the remaining positions in match using the second criterion
  if suffix[1] < length(P) then
    for j ← suffix[1] downto 1 do
      if suffix[0] < match[j] then
        match[j] ← suffix[0]
    j ← suffix[1]
    k ← suffix[j]
    while k ≤ length(P) do
      while j < k do
        if match[j] > k then
          match[j] ← k
        j ← j + 1
      k ← suffix[k]

Finally, having processed the match array, we can start the main searching part of the algorithm:

  i ← j ← length(P)
  while i ≤ length(T) do
    if P[j] == T[i] then
      if j == 1 then
        return i
      j ← j – 1
      i ← i – 1
    else
      i ← i + length(P) – j + max(j – last[T[i]], match[j])
      j ← length(P)

In the worst case the Boyer-Moore algorithm has a complexity of O(n+m), but only if the pattern does not appear in the text. When the pattern occurs, the running time is O(nm).
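The full good-suffix machinery is easy to get wrong, so as an illustration here is a C++ sketch of the simplified Boyer-Moore-Horspool variant, which keeps the right-to-left comparison and the bad-character idea but drops the good-suffix rule entirely (it is not the original Boyer-Moore, only its most common simplification):

#include <array>
#include <iostream>
#include <string>
#include <vector>

// Boyer-Moore-Horspool: right-to-left comparison plus a bad-character shift table.
std::vector<int> horspool_match(const std::string& T, const std::string& P) {
    std::vector<int> shifts;
    int n = (int)T.size(), m = (int)P.size();
    if (m == 0 || m > n) return shifts;

    std::array<int, 256> shift;
    shift.fill(m);                                   // characters absent from P allow a full shift
    for (int j = 0; j < m - 1; ++j)
        shift[(unsigned char)P[j]] = m - 1 - j;      // distance to the last pattern position

    int s = 0;
    while (s <= n - m) {
        int j = m - 1;
        while (j >= 0 && P[j] == T[s + j]) --j;      // compare right to left
        if (j < 0) shifts.push_back(s);              // full match at shift s
        s += shift[(unsigned char)T[s + m - 1]];     // slide by the bad-character distance
    }
    return shifts;
}

int main() {
    for (int s : horspool_match("abbababaaa", "ababaa"))
        std::cout << "Pattern occurs with shift " << s << '\n';   // prints: shift 3
}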
VI. Rabin-Karp algorithm

In 1987, M. Rabin and R. Karp came up with the idea of hashing the pattern and checking it against a hashed substring of the text. We denote by ts the hash value of the length-m substring T[s+1…s+m] and by p the hash value of the pattern; therefore ts = p whenever pattern P is a substring of T starting from position s+1. A popular and efficient hash function treats each substring as a number in a certain radix. For example, if the substring is "hi" and the radix is 101, the hash value is 104·101 + 105 = 10609 (if we consider the ASCII values of each letter). In the pseudo-code below, we make the following additional notations:
- q is a large prime number, used to take the modulus whenever the value would exceed the allowed upper bound
- d is the radix to use, typically taken as the size of the alphabet if we take ∑ = {0, 1, …, d-1}; a string x is then read as a number written in radix d
RABIN-KARP-MATCHER(T, P, d, q)
  n ← length(T)
  m ← length(P)
  h ← d^(m-1) mod q
  p ← 0
  t0 ← 0
  for i ← 1 to m do                 // preprocessing
    p ← (d·p + P[i]) mod q
    t0 ← (d·t0 + T[i]) mod q
  for s ← 0 to n-m do
    if p == ts then
      if P[1..m] == T[s+1…s+m] then
        print "Pattern occurs with shift", s
    if s < n – m then
      ts+1 ← (d·(ts – T[s+1]·h) + T[s+m+1]) mod q

To provide an example, we can take P = "cab" and T = "aabbcaba", with the letters valued a = 1, b = 2, c = 3. The radix considered is d = 26 (for the 26 letters of the alphabet) and the prime number is q = 3.
m = 3, n = 8, h = 26^2 mod 3 = 1
p = Hash("cab") = ((3·26 + 1)·26 + 2) mod 3 = 2056 mod 3 = 1
s = 0:  T = [a a b] b c a b a
Hash("aab") = ((1·26 + 1)·26 + 2) mod 3 = 704 mod 3 = 2 ≠ p, so we shift right to the next position and compute its hash value with the rolling formula.
s = 1:  T = a [a b b] c a b a
Hash("abb") = (Hash("aab") – Hash("a")·h)·d + Hash("b") mod 3 = (2 – 1·1)·26 + 2 mod 3 = 28 mod 3 = 1 = p, so we have a potential match; comparing "abb" against "cab" character by character shows it is a spurious hit, so we shift right to the next position.
s = 2:  T = a a [b b c] a b a
Hash("bbc") = (Hash("abb") – Hash("a")·h)·d + Hash("c") mod 3 = (1 – 1·1)·26 + 3 mod 3 = 3 mod 3 = 0 ≠ p, so we shift right again.
s = 3:  T = a a b [b c a] b a
Hash("bca") = (Hash("bbc") – Hash("b")·h)·d + Hash("a") mod 3 = (0 – 2·1)·26 + 1 mod 3 = –51 mod 3 = 0 ≠ p, so we shift right once more.
s = 4:  T = a a b b [c a b] a
Hash("cab") = (Hash("bca") – Hash("b")·h)·d + Hash("b") mod 3 = (0 – 2·1)·26 + 2 mod 3 = –50 mod 3 = 1 = p, so we have found a potential string match: we verify it by a linear comparison and see that it is true, so "Pattern occurs with shift 4". As happened above at shift 1, it is possible to obtain a spurious hit, in which case the hash values coincide but the substring differs from the pattern. The Rabin-Karp algorithm has a running time of O((n-m+1)·m) in the worst case, when it obtains many candidate shifts that need to be verified. However, it will usually not perform as many character comparisons as the naïve algorithm does. In practice, the prime number q is taken large enough (q ≥ m), and the expected matching time is only O(n+m). Since m ≤ n, we can expect O(n) in the best cases. As for the data structures employed, the algorithm is not very demanding: it is enough to keep the running hash values ts and a couple of other variables whose meaning I denoted above.
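A C++ sketch of the algorithm, with the modular arithmetic spelled out, is given below; the radix d = 256 (raw byte values) and the prime q = 1,000,000,007 are illustrative choices rather than the tiny values of the worked example:

#include <iostream>
#include <string>
#include <vector>

// Rabin-Karp with a rolling hash; every hash hit is verified by a direct comparison.
std::vector<int> rabin_karp_match(const std::string& T, const std::string& P,
                                  long long d = 256, long long q = 1000000007LL) {
    std::vector<int> shifts;
    long long n = (long long)T.size(), m = (long long)P.size();
    if (m == 0 || m > n) return shifts;

    long long h = 1;                                  // h = d^(m-1) mod q
    for (long long i = 0; i < m - 1; ++i) h = (h * d) % q;

    long long p = 0, t = 0;                           // hash of P and of the current window
    for (long long i = 0; i < m; ++i) {
        p = (d * p + (unsigned char)P[i]) % q;
        t = (d * t + (unsigned char)T[i]) % q;
    }
    for (long long s = 0; s <= n - m; ++s) {
        if (p == t && T.compare(s, m, P) == 0)        // verify to rule out spurious hits
            shifts.push_back((int)s);
        if (s < n - m)                                // roll the hash to the next window
            t = (d * (t - (unsigned char)T[s] * h % q + q) + (unsigned char)T[s + m]) % q;
    }
    return shifts;
}

int main() {
    for (int s : rabin_karp_match("aabbcaba", "cab"))
        std::cout << "Pattern occurs with shift " << s << '\n';   // prints: shift 4
}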
VII. BITAP

"A new approach to text searching" [5], which appeared in 1992, describes an approximate string matching algorithm that is comparable with KMP for any pattern length and with Boyer-Moore when applied to short patterns. Moreover, for patterns with don't-care symbols, it was the first suitable algorithm, and it can also be implemented at the hardware level. The algorithm tells whether a given text contains a substring "approximately equal" to a given pattern, where the approximation is measured using the Levenshtein distance. All the computation is done in terms of bitmasks and bitwise operations, which give the algorithm its speed. Perhaps the most famous application of BITAP is found in the agrep (approximate grep) Unix utility, licensed by the University of Arizona. The short C implementation given in the publication uses register unsigned integer variables and char arrays for defining the pattern. The bitwise operations used are complements, left and right shifts, as well as AND and OR operations.

In the following lines I will examine exact matching. Supposing we have a pattern and a text to search in, we can determine the size S of the alphabet and create a matrix of S rows and 31 columns, where the 31 positions hold bit values {0, 1} that together form an integer, and each row is dedicated to a symbol of the alphabet.
We represent the occurrences of each symbol in our pattern by assigning M[s][pos] = 0, where s is the symbol and pos is the position in the pattern where s occurs; the rest of the matrix values are assigned 1. We use a new variable named state, of 31 bits, initialized with ~1 = 111…10. The process iterates through the characters of the text and performs the two operations below:

1) select the current character of the text and bitwise-OR its pattern row (from the matrix) with the state
2) left-shift the state

For example, taking the pattern string "aba", the matrix will look like:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 | Pattern(a)
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 | Pattern(b)

The initial state will be:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

The text we want to search in is "baba". Iterating through the first character "b", we perform Pattern(b) | State, then State << 1, updating the State after each operation, which results in the following new state:

1111111111111111111111111111110

Repeating the process and going through the successive states, we can check whether the pattern was found by testing the (m+1)-th lowest bit of the state against 0.

This algorithm has a runtime of O(⌈mb/w⌉·(n + S)) for the preprocessing, where ⌈mb/w⌉ denotes the ceiling of mb/w and represents the time needed to compute a constant number of operations on integers of mb bits using a word of size w. As for the search time, the complexity in the worst and average case is O(⌈mb/w⌉·n).
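For exact matching, the bit-parallel idea can be sketched in a few lines of C++. The version below uses the equivalent "shift-and" formulation (1-bits where the description above uses 0-bits, and AND instead of OR), limited to patterns that fit in one 64-bit word; it returns the position of the first occurrence:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Bitap (shift-and) exact matching: returns the index of the first occurrence, or -1.
long bitap_search(const std::string& text, const std::string& pattern) {
    size_t m = pattern.size();
    if (m == 0) return 0;
    if (m > 63) return -1;                       // the pattern must fit in a machine word here

    std::vector<uint64_t> mask(256, 0);          // mask[c]: bit i set iff pattern[i] == c
    for (size_t i = 0; i < m; ++i)
        mask[(unsigned char)pattern[i]] |= (uint64_t)1 << i;

    uint64_t state = 0;                          // bit i set iff pattern[0..i] matches text ending here
    for (size_t j = 0; j < text.size(); ++j) {
        state = ((state << 1) | 1) & mask[(unsigned char)text[j]];
        if (state & ((uint64_t)1 << (m - 1)))
            return (long)(j - m + 1);            // first occurrence starts at this position
    }
    return -1;
}

int main() {
    std::cout << bitap_search("baba", "aba") << '\n';   // prints: 1
}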
VIII. Comparison and contrast; advantages and disadvantages

To summarize and quickly compare the approaches offered by the five algorithms discussed, I created the following table:
Naïve
  Complexity: O((n-m+1)·m)
  Additional data structures / techniques: none
  Most suitable for: very short text and pattern
  Disadvantages: slow and inefficient for large strings
  Applications: educational purposes

KMP
  Complexity: O(n+m)
  Additional data structures / techniques: array π, state variables k and q
  Most suitable for: binary strings
  Disadvantages: does not run as fast if ∑ increases
  Applications: in C++, the Boost library

BM
  Complexity: O(n+m), but sub-linear performance in practice
  Additional data structures / techniques: 3 new arrays of size m
  Most suitable for: ∑ moderately sized and the pattern relatively long
  Disadvantages: does not work well with binary strings or very short patterns
  Applications: text editors, commands

Rabin-Karp
  Complexity: worst case O((n-m+1)·m), but O(n) in the best case
  Additional data structures / techniques: hashing with the use of radix d and modulo q
  Most suitable for: finding multiple pattern matches
  Disadvantages: as slow as the naïve algorithm but requires more space
  Applications: text processing, bioinformatics, compression, detection of plagiarism

BITAP
  Complexity: O(⌈mb/w⌉·n), usually sub-linear
  Additional data structures / techniques: matrix of occurrences of the alphabet symbols in the pattern; bitwise operations
  Most suitable for: long patterns (it speeds up)
  Disadvantages: does not perform well for a large alphabet
  Applications: ugrep
IX. References

1. T. Cormen et al., Introduction to Algorithms, 2nd edition, Chapter 32, "String Matching".
2. D. Knuth, J. Morris, V. Pratt, "Fast pattern matching in strings", SIAM Journal on Computing (SICOMP), June 1977.
3. R. Boyer, J. Moore, "A fast string searching algorithm", Communications of the ACM, 1977.
4. R. Karp, M. Rabin, "Efficient randomized pattern-matching algorithms", 1987.
5. R. Baeza-Yates, G. Gonnet, "A new approach to text searching", 1992.