
Computing Patterns in Strings

Bill Smyth
McMaster University / Curtin University of Technology

PEARSON Education
We work with leading authors to develop the strongest educational materials in computer science, bringing cutting-edge thinking and best learning practice to a global market. Under a range of well-known imprints, including Addison-Wesley, we craft high-quality print and electronic publications which help readers to understand and apply their content, whether studying or at work. To find out more about the complete range of our publishing, please visit us on the World Wide Web at: www.pearsoneduc.com

Pearson Education Limited, Edinburgh Gate, Harlow, Essex CM20 2JE, England, and Associated Companies throughout the world.
Visit us on the World Wide Web at: www.pearsoneduc.com

First published 2003
© Pearson Education Limited 2003

The right of William F. Smyth to be identified as author of this work has been asserted by him in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP. All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

ISBN 0 201 39839 7

British Library Cataloguing-in-Publication Data: a catalogue record for this book is available from the British Library. Library of Congress Cataloging-in-Publication Data: a catalog record for this book is available from the Library of Congress.

10 9 8 7 6 5 4 3 2 1
07 06 05 04 03

Printed and bound in Great Britain by Biddles Ltd, Guildford and King's Lynn
Contents

Preface ix

Part I  Strings and Algorithms  1

Chapter 1  Properties of Strings  3
  1.1  Strings of Pearls  3
  1.2  Linear Strings  5
  1.3  Periodicity  14
  1.4  Necklaces  25
Chapter 2  Patterns? What Patterns?  35
  2.1  Intrinsic Patterns (Part II)  35
  2.2  Specific Patterns (Part III)  41
  2.3  Generic Patterns (Part IV)  51
Chapter 3  Strings Famous and Infamous  61
  3.1  Avoidance Problems and Morphisms  61
  3.2  Thue Strings (2,3)  65
  3.3  Thue Strings (3,2)  72
  3.4  Fibostrings (2,4)  76
Chapter 4  Good Algorithms and Good Test Data  89
  4.1  Good Algorithms  89
  4.2  Distinct Patterns  94
  4.3  Distinct Borders  100

Part II  Computing Intrinsic Patterns  109

Chapter 5  Trees Derived from Strings  111
  5.1  Border Trees  111
  5.2  Suffix Trees  113
    5.2.1  Preliminaries  114
    5.2.2  McCreight's Algorithm  117
    5.2.3  Ukkonen's Algorithm  121
    5.2.4  Farach's Algorithm  126
    5.2.5  Application and Implementation  137
  5.3  Alternative Suffix-Based Structures  140
    5.3.1  Directed Acyclic Word Graphs  140
    5.3.2  Suffix Arrays — Saving the Best till Last?  149
Chapter 6  Decomposing a String  157
  6.1  Lyndon Decomposition: Duval's Algorithm  158
  6.2  Lyndon Applications  167
  6.3  s-Factorization: Lempel-Ziv  175

Part III  Computing Specific Patterns  179

Chapter 7  Basic Algorithms  181
  7.1  Knuth-Morris-Pratt  181
  7.2  Boyer-Moore  187
  7.3  Karp-Rabin  198
  7.4  Dömölki-(Baeza-Yates)-Gonnet  202
  7.5  Summary  206
Chapter 8  Son of BM Rides Again!  207
  8.1  The BM Skip Loop  208
  8.2  BM-Horspool  210
  8.3  Frequency Considerations and BM-Sunday  212
  8.4  BM-Galil  219
  8.5  Turbo-BM  223
  8.6  Daughter of KMP Rides Too!  226
  8.7  Mix Your Own Algorithm  231
  8.8  The Exact Complexity of Exact Pattern-Matching  233
Chapter 9  String Distance Algorithms  237
  9.1  The Basic Recurrence  238
  9.2  Wagner-Fischer et al.  241
  9.3  Hirschberg  244
  9.4  Hunt-Szymanski  250
  9.5  Ukkonen-Myers  256
  9.6  Summary  263
Chapter 10  Approximate Pattern-Matching  265
  10.1  A General Distance-Based Algorithm  266
  10.2  An Algorithm for k-Mismatches  269
  10.3  Algorithms for k-Differences  274
    10.3.1  Ukkonen's Algorithm  276
    10.3.2  Myers' Algorithm  279
  10.4  A Fast and Flexible Algorithm — Wu and Manber  286
  10.5  The Complexity of Approximate Pattern-Matching  292
Chapter 11  Regular Expressions and Multiple Patterns  295
  11.1  Regular Expression Algorithms  297
    11.1.1  Non-Deterministic FA  297
    11.1.2  Deterministic FA  302
    11.1.3  Algorithm WM Revisited  305
  11.2  Multiple Pattern Algorithms  309
    11.2.1  Aho-Corasick FA: KMP Revisited  309
    11.2.2  Commentz-Walter FA: BM Revisited  313
    11.2.3  Approximate Patterns: WM Revisited Again!  315
    11.2.4  Approximate Patterns: (Baeza-Yates)-Navarro  317

Part IV  Computing Generic Patterns  327

Chapter 12  Repetitions (Periodicity)  329
  12.1  All Repetitions  331
    12.1.1  Crochemore  331
    12.1.2  Main and Lorentz  340
  12.2  Runs  349
    12.2.1  Leftmost Runs — Main  350
    12.2.2  All Runs — Kolpakov and Kucherov  356
Chapter 13  Extensions of Periodicity  359
  13.1  All Covers of a String — Algorithm LS  360
  13.2  All Repeats — Algorithm FST  370
    13.2.1  Computing the NE Tree  372
    13.2.2  Computing the NE Array  375
  13.3  k-Approximate Repeats — Schmidt  380
  13.4  k-Approximate Periods — SIPS  396

Bibliography  403
Index  415

Preface
In the beginning was the Word, and the Word was with God, and the Word was God. —John 1:1 The computation of patterns in strings is a fundamental requirement in many areas of science and information processing. The operation of a text editor, the lexical analysis of a computer program, the functioning of a finite automaton, the retrieval of information from a database — these are all activities which may require that patterns be located and/or computed. In other areas of science, the algorithms that compute patterns have applications in such diverse fields as data compression, cryptology, speech recognition, computer vision, computational geometry, and molecular biology. And computing patterns in strings is not a topic whose importance lies only in its current practical applications: it is a branch of combinatorics that includes many simply-stated problems which often turn out to have solutions — and often more solutions than one — of great subtlety and elegance. It is surprising therefore that academic Departments of Mathematics or Computer Science do
not generally include in their undergraduate or graduate curricula courses which provide an introduction to this interesting, important, and heavily researched topic. It is perhaps even more surprising that so few texts have been written with the purpose of putting together in a uniform way some of the basic results and algorithms that have appeared over the past quarter-century. I know of five books [St94, CR94, G97, SM97, CHL01] and three fairly long survey articles [BY89a, A90, Nav01] whose subject matter overlaps significantly with that of this volume. Of these, the survey articles, [St94], and [CR94] are written more as summaries of research than as texts for students; while [G97] and [SM97] focus heavily on the (important) applications in molecular biology. The final monograph [CHL01] does indeed function as both a monograph and a textbook on string algorithms, and is moreover both clearly and elegantly written; unfortunately, it is currently available only in French. The purpose of this book then is to begin to fill a gap: to provide a general introduction to algorithms for computing patterns in strings that is useful to experienced researchers
in this and other fields, but that also is accessible to senior undergraduate and graduate students. Let us linger a moment over three of the words used in the preceding sentence: "accessible", "algorithms", and "patterns". An overriding objective in this book is to make the material accessible to students who have completed or nearly completed a mathematics or computer science undergraduate curriculum that has included some emphasis on discrete structures and the algorithms that operate upon them. A first consequence of this objective is that the mathematical background required to read this book is general rather than specific to strings. It would certainly be provided by the standard IEEE/ACM courses in Discrete Mathematics, Data Structures, and Analysis of Algorithms. The reader will know what stacks, queues, linked lists and arrays are, for example, and will have some familiarity with the analysis of algorithms and the "asymptotic complexity" notation used for this purpose, some experience with mathematical assertions and the methods used to establish their correctness, and some knowledge of
important algorithms on graphs and trees. In addition, the assumption is made that the reader is familiar with some computer programming language, and has the ability to read and understand algorithms expressed in such a language. A second consequence is that no claim is made to completeness: my objective is to lure the student and the reader into a fascinating field, not to write an encyclopaedia of algorithms that compute patterns in strings. In particular, I have been selective in two main ways: I focus on results that are (I believe) important and that moreover can be explained with reasonable economy and simplicity. Inevitably, this means that the exposition of some interesting and valuable material is omitted. However, I hope that, both by providing references to much of this material and by stimulating interest, I will encourage readers to investigate the literature for themselves. The underlying subject matter of this book is a mathematical object called by most computer scientists a "string" (or, in Europe and by most mathematicians, a "word"). But the focus of this book is on algorithms — that is, on precise methods or procedures for doing something — and it is thus more properly thought of as a text in computer science rather than in mathematics. This book will therefore take quite a different approach from that of the classic monograph [L83], and its descendants [L97, L02], that elucidates mathematical properties above all. We shall rather be interested primarily in algorithms that find various kinds of patterns in strings; for the most part, only as a byproduct of that focus will interest be displayed in the mathematical properties of the strings themselves. This does not mean that results will not be proved rigorously, only that the selection of those results will generally depend on their relevance to the behaviour of some algorithm. A final remark here: in the exposition I confine myself to sequential algorithms on strings in one dimension, making no reference to the extensive literature on the corresponding parallel (especially PRAM) algorithms or to the growing literature on multi-dimensional (especially two-dimensional) strings. Another focus of this book will be on patterns. That is, the algorithms we discuss will virtually all be devoted to finding some sort of a pattern in a string. I say "some sort" of pattern, because
three main kinds will be distinguished — specific, generic, and intrinsic — that provide this book with three of its four main divisions. A specific pattern is one that can be specified by listing characters in their required order; for example, if we were searching for the pattern u = abaab in the string x = abaababaabaab we would find it (three times), but we would not find the pattern u = ababab.
(Sometimes the pattern that we are looking for can contain "don't-care" symbols, and sometimes the match that we seek need only be "approximate" in some well-defined sense, but these are refinements to be dealt with later.) By contrast, a generic pattern is one that is described only by structural information of some kind, not by a specific statement of the characters in it. For example, we might ask for all the "repetitions" in x — that is, for all cases in which two or more adjacent substrings of x are identical. (The response in this case would be a list of repetitions including (aba)(aba), (abaab)(abaab), aa (three separate times), and several others — see if you can find them all.) I call the final kind of pattern that we search for an intrinsic pattern — one that requires no characterization, one that is inherent in the string itself. Here I discuss various patterns that in one way or another expose the periodic structure of a given string x; for example, normal form, suffix tree, Lyndon decomposition, s-factorization. These intrinsic patterns are used so frequently in algorithms that compute specific or generic patterns as to be almost ubiquitous. Collectively, they form the basis for the efficient processing of strings. The variety of these intrinsic patterns is remarkable: the normal form of our example string x is (abaababa)(abaab) while its Lyndon decomposition is (ab)(aabab)(aab)(aab) and its s-factorization is (a)(b)(a)(aba)(baaba)(ab) — all of these patterns are computationally useful. The organization of this book is as follows. Part I gives basic information about both strings and algorithms on strings. It provides the terminology, notation and essential properties of strings that will be used throughout; in addition, it describes the main kinds of algorithms to be presented and illustrates some of them on
certain famous strings; these strings are also "infamous" as examples of worst case behaviour for many string algorithms. It is particularly Chapter 2 that provides a kind of key to the rest of the book: it explains precisely which problems are to be solved and directs the reader to the later sections that present the algorithms that solve them. Thus the book may be used fairly easily by the reader who has selective interests. Part I also discusses qualities of "good" algorithms on strings, and raises the interesting question of how implementation of these algorithms should in practice be validated. Parts II-IV deal with algorithms for computing intrinsic, specific and generic patterns, respectively, as described above.
Altogether there are 13 chapters distributed over the four parts. As indicated in the table of contents, these are broken down into sections, each of which ends with a collection of exercises. Where appropriate, chapters include a summary of the topics/algorithms covered and a discussion of related results, additional topics and open problems. A note about the exercises, of which there are some 500 or so: they are an integral part of the book, used for four main purposes:

- to make sure the reader has understood;
- to clarify, or to put into a different context, principles or ideas that have arisen in the text — in a phrase, to make connections;
- to handle extensions or modifications of algorithms or mathematical results that the reader should be aware of, but should not need to have explained to him in detail;
- to deal with details (of algorithms, of proofs) that would otherwise unnecessarily clutter the presentation.
Wherever possible, by explaining in the text only what really needs to be explained, I have tried through the exercises to involve the reader in the development or analysis of the algorithms presented. What I myself have discovered by taking this approach is that for the most part the algorithms, and the improvements to algorithms, depend on a very simple new idea, an insight that is not complicated but that has somehow previously escaped other researchers. Once that idea is captured, what remains consists mainly of technicalities — tedious and convoluted perhaps, but still a direct consequence of the main idea. This observation
seems to be true of string algorithms: I wonder what fields of study it is not true of? A note also about the dreaded index: if on page p of the text I have cited a work authored or co-authored by person P, then the index entry for P should include p. I am very sensible of the fact that sometimes I can be hasty (as Treebeard [T55] would say), consequently error-prone. I have therefore spent a great deal of time reworking this book in order to correct errors, rectify oversights, or smooth over inelegancies; nevertheless I cannot imagine that there will not be many defects to be found. I will be maintaining a website

http://www.cs.curtin.edu.au/~smyth/patterns.shtml

to record corrections and suggestions for improvement, and I would be grateful if readers would contact me at smyth@computing.edu.au or smyth@mcmaster.ca with their comments. The material in this book is at least sufficient to cover two one-semester (12-14 week) courses for graduate or advanced undergraduate students. Indeed, I hope that it is more than sufficient: I hope that it is also suitable. The initial chapters have already been presented several times to graduate computer science students in the Departments of Computer Science & Systems and of Computing & Software at McMaster University, Hamilton, Ontario, Canada; also to graduate students in the Department of Computer Science, University of Debrecen, Hungary. These students have contributed materially to the book's development. I wish particularly to express my deep gratitude to the School of Computing, Curtin University, Perth, Western Australia, and to its past and present Heads of School, Dennis Moore, Terry Caelli, Svetha Venkatesh and Geoff West, for generous support and encouragement, both intellectual and practical, over a period of several years. Most of this book has been written during my sheltered visits to Curtin. I am grateful also to Professor Petho Attila, Chair of the Department of Computer Science at Debrecen, for his interest and support: it was in Debrecen late in 2001 that the last bits of LaTeX were finally keyed in. It is a pleasure to express my debt to my friends and colleagues, Leila Baghdadi, Jerry Chappie, Franya Franek, Costas Iliopoulos, Thierry Lecroq, Yin Li, Dennis Moore, Pat Ryan, Jamie Simpson and Xiangdong Xiao for their valuable contributions. And
many thanks also to two anonymous referees whose constructive comments have contributed materially to the final form of the book. Finally, kudos to Jocelyn Smyth for her entertaining selection and careful verification of "string" and "word" quotations!

W. F. S.
To my parents.
Part I

Strings and Algorithms
Chapter 1

Properties of Strings

A word in time saves nine. — Anonymous

1.1 Strings of Pearls

Words form the thread on which we string our experiences. — Aldous Huxley (1894-1963), Brave New World

Consider a string of pearls. Imagine that the string is laid out on the table before you, so that one end is on the left and the other on the right. Suppose that there are n pearls in the string, and that each pearl has a tiny label pasted on it. Suppose further that the labels are integers in the range 1..n and that they satisfy the following rules:

- the label on the leftmost pearl is 1;
- for every integer i = 1, 2, ..., n - 1, the pearl to the right of the pearl labelled i has label i + 1.

These rules seem to satisfy our intuitive idea of what makes a string of pearls a "string": the pearls all lie on a single well-defined path, and the path can be traversed from one end to the other by moving from the current pearl to an adjacent one. Reflecting on the rules, however, we realize that we need not be so specific. First of all, of course, we do not really need to speak of "pearls": we can speak more generally of (undefined) elements. But a second, more fundamental, observation is that the labels do not need to be chosen in any specific order, and they do not need to be integers: they could be colours, for example, or letters of the alphabet. What really matters is that
(0) every element has a label that is unique;
(1) every element with some label x (except at most one, called the leftmost) has a unique determinable predecessor labelled p(x);
(2) every element with some label x (except at most one, called the rightmost) has a unique determinable successor labelled s(x);
(3) whenever an element with label x is not leftmost, x = s(p(x));
(4) whenever an element with label x is not rightmost, x = p(s(x));
(5) for any two distinct elements with labels x and y, there exists a positive integer k such that either x = s^k(y) or x = p^k(y).

These rules capture the essential idea of concatenation: each element has either a unique predecessor or a unique successor, and, except at the extremes, actually has both. Furthermore, by following a finite sequence of either successors or predecessors, we can reach any element with label y from any other element with label x.
Fortified with these observations, then, we boldly state: Definition 1.1.1 A string is a collection of elements that satisfies rules 0-5. A critical feature of this definition, not as yet discussed, is the condition, included in rules 1 and 2, that there be at most one leftmost or rightmost element. For consider what happens when the clasp is fastened on the original string of pearls, forming a necklace. Now there is no longer either a "leftmost" or a "rightmost" element — but that turns out not to be a problem, since rules 0-5 continue to apply. According to our definition, a necklace is also a string! Furthermore, suppose that the number of pearls in the original string were infinite: beginning at the lefthand edge of the table but stretching away without end toward a forever unseen edge at the right. We see that this infinite string also is covered by the definition: it has a leftmost element, but no rightmost one. And, perhaps most surprising of all, we see that even a string which extends to infinity in both directions is covered by rules 0-5: in this case there is again neither a leftmost nor a rightmost element. In this book we will at various times become interested in all of these different kinds of strings. To prevent confusion, therefore, we adopt the following conventions. A string with a finite number of elements including both a leftmost and a rightmost element will be called a linear string. A string with a finite but nonzero number of elements and neither a leftmost nor a rightmost element will be called a necklace (in the literature also called a circular string). A string with an infinite number of elements, of which one is leftmost, will be called an infinite string; while a string with an infinite number of elements, of
which none is either leftmost or rightmost, will be called an infinite necklace. When the meaning is clear from the context, we will just use the word "string" to refer to any object satisfying Definition 1.1.1. In practice it will be easy to distinguish between linear strings and necklaces, because necklaces will normally be defined in terms of a corresponding linear string x and written
C(x) — we think of C(x) as being the necklace formed from x by making its leftmost element the successor of its rightmost element.

Exercises 1.1

1. Explain why concatenation rule 3 is required. Give an example of a mathematical object which satisfies rules 0-2 but not rule 3.
2. Can rule 4 for concatenation be derived from rules 0-3? Explain your answer.
3. Explain why concatenation rule 5 is required. Characterize the mathematical objects that satisfy rules 0-4 but not rule 5.
4. Does the infinite set {a, a^2, a^3, ...} of strings contain an infinite string?
5. According to Definition 1.1.1, can a string consist of a single element a? Could such a string be a necklace?
6. Is Definition 1.1.1 satisfied by a string with no elements in it (a so-called empty string)? Does the above definition of a linear string include the empty string? What about the definition of a necklace?
7. In view of the preceding exercise, how many different kinds of string are included in the classification of strings given in this section?
8. Our classification of strings omits the following cases: (a) a string with an infinity of elements including a rightmost one but no leftmost one; (b) a string with a finite number of elements including either a leftmost one or a rightmost one, but not both. Comment on these omissions.
9. Is there any way to prove that Definition 1.1.1 defines a string?

1.2 Linear Strings

He who has been bitten by a snake fears a piece of string. — Persian proverb

From the discussion of the preceding section, it becomes clear that the idea of a string, though a simple one, is also very general. A string might be
- a word in the English language, whose elements are the upper and lower case English letters together with apostrophe (') and hyphen (-);
- a text file, whose elements are the ASCII characters;
- a book written in Chinese, whose elements are Chinese ideographs;
- a computer program, whose elements are certain "separators" (space, semicolon, colon, and so on) together with the "words" between separators;
- a DNA sequence, perhaps three billion elements long, containing only the letters C, G, A and T, standing for the nucleotides cytosine, guanine, adenine and thymine, respectively;
- a stream of bits beamed from a space vehicle;
- a list of the lengths of the sides of a convex polygon, whose values are drawn from the real numbers.

All of these examples are instances of what we have called in Section 1.1 a "linear string". Indeed, most of this book will deal with linear strings, and so in this section we introduce notation and terminology useful for talking about them. Much, but not all, of this terminology will also apply to necklaces, infinite strings, and the empty string.

The examples make clear that an important feature of any string is the nature of its elements: bits, members of {C, G, A, T}, real numbers, as the case may be. It is in fact customary to describe a string by identifying a set of which every element in the string is a member. This set is called an alphabet, and so naturally its members are referred to as letters — though, as we have seen, the term "letter" must be interpreted much more broadly than is usual in English. We say then that a string is defined on its alphabet. Of course, if a string x is defined on an alphabet A, then x is also defined on any superset of A, so an alphabet for x is not unique. A minimum alphabet for x is just the set of all the distinct elements that actually occur in x. Sometimes it is convenient to define the alphabet of a string as the minimum one ("bits"), sometimes as a set that is far from minimum ("real numbers"). Throughout this book A will denote an alphabet and α = |A| its cardinality. In the common cases that α is 2, 3 or 4, we say that the alphabet A is binary, ternary or quaternary, respectively; as we shall see, there are many interesting strings on a binary alphabet, and a quaternary alphabet is of particular importance because of applications to the analysis of DNA sequences. In general, apart from the distinctness of the elements of A that follows from the set property, we place no other restriction upon them: the elements of the alphabet may be finite (even zero!) in number, countably infinite (for example, the integers), or uncountably infinite (for example, the reals). And the elements of the alphabet may be totally ordered (so that a comparison of any distinct pair of them yields the result "less" or "greater"), unordered, or somewhere in between ("partially ordered"). For many of the algorithms discussed in this book, it will be sufficient to use an unordered alphabet; as discussed in Chapter 4, the nature of the alphabet on which an algorithm operates is very important to the selection of test data for the algorithm as well as to its computational efficiency.

For a given alphabet A, let A+ be the set of all possible nonempty finite concatenations of the letters of A. Thus, for example, if A = {a}, then A+ = {a, a^2, a^3, ...}, where we write a^2 for aa, a^3 for aaa, etc.; and if A = {0, 1}, then A+ consists of all distinct nonempty finite sequences of bits, and so may be thought of as including all the nonnegative integers. As suggested in Exercise 1.1.6, it is convenient also to introduce the idea of the empty string, usually written ε, which we use to define the sets A′ = A ∪ {ε} and A* = A+ ∪ {ε}.
This terminology allows us to express another definition of linear strings equivalent to that given in the previous section:

Definition 1.2.1 An element of A+ is called a linear string on alphabet A. An element of A* is called a finite string on A.

Thus A+ is the set of all linear strings on a given alphabet A. Note that the empty string is not a linear string: after all, it has neither a leftmost nor a rightmost element! Throughout this book strings will consistently be denoted by boldface letters, almost always lower case: p, t, x, and so on. We will implement the rules 0-5 of Section 1.1 by treating strings as one-dimensional arrays; alternatively, we might have used a linked-list or some other representation, but arrays are a simple and natural model, as we have seen with the string of pearls. Thus for any string, say x, containing n > 0 letters drawn from an alphabet A, the implicit declaration will be

x : array [1..n] of A.

In this case we will say that the string has length n = |x| and positions 1, 2, ..., n. For any integer i ∈ 1..n, the letter in position i is x[i], so that we may write x = x[1]x[2]···x[n], which we recognize from Definition 1.1.1 as a concatenation of n strings of length 1. In fact, in this formulation the position i plays the role of the label introduced in Definition 1.1.1. Note that the array model works also for the empty string x = ε, which corresponds to an empty array and has length 0.

Digression 1.2.1 We have said that arrays are a "simple and natural" representation of strings, a statement that obscures a significant computational issue. Certainly an array is a simple data structure, but whether or not it is natural depends on assumptions about the mechanisms by which the elements of the array are accessed. As we have seen, strings are defined in terms of concatenation, and so it would seem to be "natural" to access elements using the successor (next) and predecessor (previous) operations introduced in Section 1.1. These are mechanisms compatible with a linked-list representation, where access to an element at position i from the "current" position j would at least require time proportional to |j - i|. However, an arbitrary element in a computer array can normally be accessed in constant time simply by specifying its location i, quite independent of the array position j most recently visited. An array representation of strings is thus more powerful than a linked-list representation, and so arguably not suitable, since it could justify execution time estimates for algorithms lower than those attainable using list processing. In practice, algorithms on strings almost always begin either at the left (position 1) or at the right (position n), and inspect adjacent (successor or predecessor) positions one by one. On the other hand, the output of string algorithms often specifies string positions, the implicit assumption being that the user can access these positions in constant time. One way
to reconcile these different models of string access is to suppose that strings are initially available as linked-lists — so that their elements are accessible only one-by-one, either left-to-right or right-to-left — but copied as they are input into an array. This copying, if it were necessary, would require only Θ(n) time and Θ(n) additional space, and so would not affect the asymptotic complexity of any of the algorithms considered in this book. We therefore adopt the rather odd convention that strings are processed as linked-lists on input, but may in some cases be regarded as arrays on output. We promise to alert the reader if ever we deviate from this convention. □

Equality between strings is defined in an obvious way. A string x of length n and a string y of length m are said to be equal (written x = y) if and only if n = m and x[i] = y[i] for every i = 1, ..., n. Thus the empty string ε is equal only to itself. Note further that, by this definition, prepending or appending ε to a given string x does not change its value; thus we may if we please write x = εxε. Corresponding to any pair of integers i and j that satisfy 1 ≤ i ≤ j ≤ n, we may define a substring x[i..j] = x[i]x[i+1]···x[j] of x. We say that x[i..j] occurs at position i of x and that it has length j - i + 1. If j - i + 1 < n, then x[i..j] is called a proper substring of x. Two noteworthy kinds of substring are x = x[1..n] of length n, and x[i] = x[i..i] of length 1. For every pair of integers i and j such that i > j, we adopt the convention that x[i..j] = ε, a substring of length zero. As we have already seen, since x = εxε, ε may be regarded as a substring of any element of A*. Let k ∈ 1..n be an integer, and consider positions i_k = 1, 2, ..., n - k + 1 of x. Each of these n - k + 1 positions represents the starting position of a substring x[i_k..i_k + k - 1] of length k. Thus every string of length n has n - k + 1 (not necessarily distinct) substrings of length k. If u is a substring (respectively, proper substring) of x, then x is said to be a superstring (respectively, proper superstring) of u. Of course substrings and superstrings are also strings.
book. We therefore adopt the rather odd convention that strings are processed as linked-lists on input, but may in some cases be regarded as arrays on output. We promise to alert the reader if ever we deviate from this convention. ? Equality between strings is defined in an obvious way. A string x of length n and a string y of length m are said to be equal (written x = y) if and only if n = m and x[i] = y[i] for every i = 1,..., n. Thus the empty string e is equal only to itself. Note further that, by this definition, prepending or appending e to a given string x does not change its value; thus we may if we please write x = exe. Corresponding to any pair of integers i and j that satisfy 1 < % < j < n, we may define a substring x[i..j] of x as follows: We say that x [i. .j] occurs at position i of x and that it has length j—i+l.lfj—i+l < n, then x [i. .j] is called a proper substring of x. Two noteworthy kinds of substring are x = x [1. .n] of length n, andx[i] = x[i..i] of length 1. For every pair of integers i and j such that % > j, we adopt the convention that ccfi.j] = e, a substring of length zero. As we have already seen, since x = exe, e may be regarded as a substring of any element of A*. Let k ? 1. .n be an integer, and consider positions ik = 1,2,..., n — k +1 of x. Each of these n — k + 1 positions represents the starting position of a substring x[ik-.ik + k — 1] of length k. Thus every string of length n has n — k + 1 (not necessarily distinct) substrings of length k. If u is a substring (respectively, proper substring) of x, then x is said to be a superstring (respectively, proper superstring) of u. Of course substrings and superstrings are also strings. If A is ordered, we may use the substring notation to define a corresponding induced order on the elements of A* called lexicographic order — that is, dictionary order. More precisely, suppose we are given two strings x = x[l..n] and y = y[l..m], where n > 0, m > 0. We say that x < y (x is lexicographically less than y) if and only if one of the following (mutually exclusive) conditions holds: H n < m and cc[l..n] = j/[l..n] (as we shall see shortly, this is the case in which x is a "proper prefix" of y); m x[l..i — 1] = y[l..i1] and x[i] < y[i] for some integer % G 1.. min{n, m} (this is the case in which there is a first position i in which x and y differ). Then, for example, using the order of the English alphabet: ¦ ab < abc
(because i = 2 = n<3 = m); ¦ e < a (because i = 0 = n<l = m); ¦ ab < aca (because i = 2 and* < c).
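As an aside (an editor's sketch, not part of the original text), the two-case definition above transliterates directly into Python; the built-in comparison on Python strings implements exactly this order, so the function below is redundant in practice and serves only to make the two cases explicit:

    def lex_less(x: str, y: str) -> bool:
        # case (ii): a first position where x and y differ decides the order
        for xi, yi in zip(x, y):
            if xi != yi:
                return xi < yi
        # case (i): no differing position, so x < y iff x is a proper prefix of y
        return len(x) < len(y)

    assert lex_less("ab", "abc")   # proper-prefix case: 2 = n < 3 = m
    assert lex_less("", "a")       # the empty string is lexicographically least
    assert lex_less("ab", "aca")   # first difference at position 2: b < c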
Observe that this definition is valid also in cases where one or both of n, m is infinite; that is, also for infinite strings. Based on this definition, the other order relations (≤, >, ≥) are defined in the usual way: x ≤ y if and only if x = y or x < y; x > y if and only if y < x; x ≥ y if and only if y ≤ x. Writing x = u_1 u_2 ··· u_k, where the u_i are nonempty substrings, i ∈ 1..k, is called a factorization or decomposition of x into factors u_i (see Section 1.4 and Chapter 6). Thus a factor is just a nonempty substring.

There are two special kinds of substring x[i..j] which are of particular importance, and to which we give special names. For any integer j ∈ 0..n, we say that x[1..j] is a prefix of x, sometimes written pref(x); if in fact j < n, we say that x[1..j] is a proper prefix of x. Similarly, for any integer i ∈ 1..n+1, we say that x[i..n] is a suffix of x, written suff(x), and a proper suffix if i > 1. Note that, in accordance with the identity x = εxε, these definitions allow us to include the empty string ε as both a proper prefix and a proper suffix. Thus, for example, the string

f = abaababaabaab

has prefixes ε, a, ab, aba, ..., f = abaababaabaab and suffixes ε, b, ab, aab, ..., f. The proper prefixes and suffixes of f are obtained simply by omitting f itself from these lists. A concept whose value is not immediately apparent, but which we will find to be useful in many different contexts, is that of a "border".

Definition 1.2.2 A border b of x is any proper prefix of x that equals a suffix of x.

We see that, according to this definition, x always has an empty border b = ε of length β = 0, but that x itself is not a border of x. In general, we use the symbol β to denote the length |b| of b. Often we will be particularly interested in the longest border, denoted by b*, with length β* = |b*|, where 0 ≤ β* ≤ n - 1. The string f introduced above has two nonempty borders: ab and abaab. The string g = abaabaab has exactly the same two borders, but observe that in this case the longest one, abaab, overlaps with itself. Similarly, the string a^n has borders a, a^2, ..., a^(n-1), of which, for i = ⌈(n+1)/2⌉, the borders a^i, a^(i+1), ..., a^(n-1) overlap. As we shall discover presently, overlapping borders are in fact characteristic of strings that contain repetitive substrings: in the above example, observe that g can be written in the form

(aba)^2 ab = (aba)(aba)ab,
thus representing it as a string in which two occurrences of aba are followed by a prefix of aba. We now apply the idea of a border to generalize this observation and to derive what we call a "normal form" for a given nonempty string x of length n. Suppose that a border b and its length β have been computed. (We shall see in Section 1.3 how to compute every border of x in Θ(n) time.) By Definition 1.2.2 it must be true that

x[1..β] = x[n - β + 1..n],   (1.1)

from which we see that the quantity p = n - β ≥ 1 measures the displacement between positions of x that are required to be equal. (Observe that the larger the value of β, the smaller the value of p.) Thus, for every integer i ∈ 1..β, it must be true that

x[i] = x[i + p].   (1.2)

In particular, if β = 0 (p = n), we see that (1.1) and (1.2) are trivially true; while if 2β ≥ n (2p ≤ n), then x must contain at least two equal adjacent substrings x[1..p] and x[p+1..2p]. More precisely, we see that x consists of ⌊n/p⌋ identical substrings, each of length p, followed by a possibly empty suffix of length n mod p. Setting r = n/p and letting u = x[1..p], we see that the values p and r which we have derived from β permit us to express any string x in the form

x = u^⌊r⌋ u′,   (1.3)

where u′ = x[1..n - ⌊r⌋p] is a proper prefix (possibly empty) of u. Alternatively, we can separate r into its integral and fractional parts by writing r = ⌊r⌋ + k/p for some integer k ∈ 0..p-1. Then, interpreting u^(k/p) = u[1..p]^(k/p) to mean simply u[1..k], we find that (1.3) can be rewritten in the compact form

x = u^r.   (1.4)

We call p a period and r an exponent of x. The prefix u = x[1..p] we call a generator of x. Note that since every string x has an empty border b = ε, it therefore has a trivial period p = n, a trivial exponent r = 1, and a trivial generator x. Looking over the previous paragraph, we see that what has essentially been done is to compute a period p = p(β) and a corresponding exponent r = r(β) as functions of β. It is clear that p is monotone decreasing and r monotone increasing in β. Therefore with the choice β = β*, p achieves its minimum value p* and r its maximum r*, the minimum period and the maximum exponent respectively. Generally, the values p* and r* will be the ones we are most interested in, and so, when there is no ambiguity, we will simply refer to p* as the period and r* as the exponent. Similarly, we refer to u = x[1..p*] as the generator.
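Before the formal definition, it may help to see the derivation of p*, r* and the generator carried out mechanically. The following Python sketch is an editor's illustration, not the author's code (the function name normal_form and the use of Fraction are choices made here); it finds the longest border by brute force, since the Θ(n) method is the subject of Section 1.3, and then applies p* = n - β* and r* = n/p*:

    from fractions import Fraction

    def normal_form(x: str):
        n = len(x)
        # longest border beta* by brute force; Section 1.3 gives a Theta(n) method
        beta_star = max(b for b in range(n) if x[:b] == x[n - b:])
        p_star = n - beta_star            # minimum period p* = n - beta*
        r_star = Fraction(n, p_star)      # maximum exponent r* = n / p*
        return x[:p_star], r_star         # generator u and exponent r*

    print(normal_form("abaababaabaab"))   # generator abaababa, exponent 13/8
    print(normal_form("abaabaab"))        # generator aba, exponent 8/3
    print(normal_form("abababab"))        # generator ab, exponent 4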
Definition 1.2.3 Let p* be the minimum period of x = x[1..n], and let r* = n/p*, u = x[1..p*]. Then the decomposition

x = u^(r*)   (1.5)

is called the normal form of x.

The normal form (1.5) leads to a useful and important taxonomy, or classification system, of strings:

- if r* = 1, we say that x is primitive; otherwise, x is periodic;
- if r* ≥ 2, we say that x is strongly periodic; if 1 < r* < 2, x is said to be weakly periodic;
- if r* ≥ 2 is an integer, we say that x is a repetition (or, equivalently, that x is repetitive); in the special cases that r* = 2 or 3, x is called a square or a cube, respectively.

Thus we see that x must be either primitive or periodic, and, if it is periodic, then one of strongly periodic or weakly periodic; further, if x is repetitive, then it must also be strongly periodic. Observe that r* ≥ 2 if and only if x has a border of length β ≥ n/2. Here are some examples of these definitions:

- x = aaabaabab is primitive (p* = n);
- f = abaababaabaab = (abaababa)(abaab) is weakly periodic with period p* = 8, exponent r* = 13/8 and generator abaababa;
- g = abaabaab = (aba)^2 ab is strongly periodic with period p* = 3, exponent r* = 8/3 and generator aba;
- x = (ab)^4 is repetitive with period p* = 2, exponent r* = 4 and generator ab;
- x = (abcabd)^2 is a square with period p* = 6 and generator abcabd.

We remark that the normal form (1.5) is actually a kind of "intrinsic" pattern, in the sense of Part II of this book: every string x has the pattern called a "normal form" that can be used to assign it its place in a taxonomy of strings. We remark further that the simple ideas introduced in this section (e.g. border, primitive, period) will recur again and again in our discussions of various string algorithms.

Exercises 1.2

1. Try to reconcile the definitions of linear string given in Sections 1.1 and 1.2; that is, to prove that a linear string according to one definition is necessarily a linear string according to the other.
2. Suppose that an alphabet A contains exactly α letters. Given some nonnegative integer n, how many elements of A* have length n? How many have length at most n?
3. It was remarked above that for A = {0, 1}, A+ may be thought of as including all the nonnegative integers. How many times is each integer included?
4. Based on the definition of equality in strings, is it true that ε = ε^2? Justify your answer.
5. Using the definitions of equality and lexicographic order, prove that for arbitrary strings x and y on an ordered alphabet, x = y if and only if x ≮ y and y ≮ x. In particular, demonstrate that this result holds in the case x = y = ε, and thus show that ε is the unique lexicographically least element of A*.
6. Based on the usual ordering of the English lower-case letters, arrange the following strings in increasing lexicographical order: abbaε, abbba, abb, abc, a, ε^2, ab, aba, εb.
7. Prove that the operator < as we have defined it satisfies transitivity; that is, that x < y and y < z ⇒ x < z.
8. Give an independent definition of the order relation >, then use it together with the definition of < given in the text to show that x > y if and only if y < x.
9. Given a string x of length n, find the length of the following substrings of x, and state the conditions on i and k for which your answer is valid: (a) x[i..i+k-1]; (b) x[i-k+1..i]; (c) x[i+1..k-1]; (d) εx[1..k].
10. What is the maximum number of distinct substrings that there can possibly be in a string x of length n? Give an example of a string that attains this maximum. Then characterize the set of strings of length n that attains it.
11. In the preceding exercise there is an unstated assumption that the alphabet size should be regarded as unbounded. Vaiyi Sandor suggests that the question becomes more interesting (and much more difficult) if the size α of the alphabet is finite and fixed. What progress can you make with this apparently unsolved research problem? Hint: Perhaps a good place to start is with the following question: for given positive integers α = |A| and k, what is the longest string on alphabet A that contains no substring of length k more than once?
12. Describe an algorithm that computes all the distinct substrings in x. Establish your algorithm's correctness and asymptotic complexity (try to achieve O(n^2)).
13. If y is a nonempty string of length m and k is a positive integer, determine |x| as a function of m and k in each of the following cases: (a) x = y^k; (b) x = y^M; (c) x = y^M'.
14. What is the length of the string x = (ab)^n a (ab)^(n-1) a ··· (ab)^2 a (ab)^1 a?
15. Are the ideas of prefix, suffix and border defined for the empty string ε?
16. Determine the longest border, the period, and the exponent of each of the following strings, and hence classify each one as primitive, strongly periodic, weakly periodic, or repetitive: (a) abaababaabaababaababa; (b) abcabacabcbacbcacb; (c) abcabdabcabdabcabd; (d)
17. A string x of length n is said to be a palindrome if it reads backwards the same as forwards; more precisely, if for every i = 1, 2, ..., ⌊n/2⌋, x[i] = x[n - i + 1]. Jamie Simpson believes every border of a palindrome is itself a palindrome. Prove him right or wrong. Remark: To inspect some nontrivial palindromes, you could consult the following URLs: www.growndodo.com/wordplay/palindromes/dogseesada.html complex.gmu.edu/neural/personnel/ernie/witty/palindromes.html
18. Show that period and exponent are "well-defined"; in other words, whenever x = u^k u′ for some string u, some positive integer k, and a proper prefix u′ of u, that there exists a unique corresponding border.
19. Show that if u^(r*) is the normal form of x, then u is not a repetition. (This fact becomes important in Section 2.3 when we consider the encoding of the repetitions in a string.)
20. Consider (ab)^(3/2) = aba. What is ((ab)^(3/2))^2? Is it (ab)^3 = ababab? Or is it (aba)^2 = abaaba?
21. Show that no string has two distinct primitive generators.
22. Can you find a way to compute the number of strings on {a, b} that are of length n and primitive?
Hint: Observe that for odd n, an arbitrary string x[1..n] can be formed from strings x[1..(n-1)/2] and x[(n+1)/2..n], and for even n from strings x[1..n/2] and x[n/2+1..n]. If on due reflection you still have trouble with this exercise, consult [GO81a, GO81b].
23. The taxonomy of strings given above is expressed in terms of values of r*. Re-express it in terms of values of the longest border β*.
24. Observe that what we have called the "normal form" of x should really more properly be called the "left"
normal form. That is, instead of writing x in the form u^⌊r⌋ u′, where u′ is a prefix of u, we might just as well have written x = v′ v^⌊s⌋, where v′ is a suffix of v. (a) Derive the right normal form of x. (b) Show that, in the taxonomy of strings, the classification of x according to right normal form is the same as it is according to left normal form.
25. Can a necklace be a substring of a linear string? Can a linear string be a substring of a necklace?
26. Draw a labelled tree which represents the taxonomy of strings introduced in this section.

1.3 Periodicity

Be not careless in deeds, nor confused in words, nor rambling in thought. — Marcus Aurelius Antoninus (121-180), Meditations VIII

As we shall see in subsequent chapters, the maximum border of a string x turns out to be a very useful quantity in numerous contexts, not only as a means of classifying strings. Therefore, in this section we take the time to show in detail how to compute the maximum border; indeed, we show how to compute the maximum border of every nonempty prefix of x, and to do so in Θ(n) time. We then go on to prove an important lemma that describes a fundamental periodicity property of strings. Of course there is an obvious way to compute the length β* of the maximum border b* of x. Begin by setting β* ← 0. Then compare x[1] with x[n]; if equal, set β* ← 1. Next compare x[1..2] with x[n-1..n]; if equal, set β* ← 2. Proceed in this way, adding one to the length of the prefixes and suffixes compared at each stage, until finally x[1..n-1] is compared with x[2..n], with β* set to n - 1 in case of equality. The final value of β* is then the length of the longest border b* of x. We do not present this algorithm formally because it is inefficient: we leave as Exercise 1.3.1 the exact determination of its asymptotic complexity.
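For concreteness, the "obvious" method just described can be written out as follows (an editor's sketch in Python, not the book's own pseudocode; the name longest_border_naive is invented here). Exercise 1.3.1 asks for the exact asymptotic complexity of this procedure:

    def longest_border_naive(x: str) -> int:
        n = len(x)
        beta_star = 0
        for b in range(1, n):          # candidate border lengths 1, 2, ..., n-1
            if x[:b] == x[n - b:]:     # compare x[1..b] with x[n-b+1..n]
                beta_star = b          # record the longest match so far
        return beta_star

    assert longest_border_naive("abaababaabaab") == 5   # longest border abaab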
Imagine that successive values of β* are stored in an array (or string!) β[1..n]; then for i = 1, 2, ..., n, β[i] gives the length of the longest border of x[1..i]. We call β[1..n] the border array of x[1..n]. Then β[n] is the length of the longest border of x, the quantity required to compute the normal form of x. We make the following observations:

- β[1] = 0 (since ε is the longest border of x[1..1]);
- for 2 ≤ i ≤ n, if x[1..i] has a border of length k > 0, then x[1..i-1] has a border of length k - 1; thus, in particular, for 1 ≤ i ≤ n-1, β[i+1] ≤ β[i] + 1;
- for 1 ≤ i ≤ n-1, β[i+1] = β[i] + 1 if and only if x[i+1] = x[β[i]+1] (since β[i]+1 is the position of x immediately to the right of the prefix x[1..β[i]], which is the longest border of x[1..i]);
- if b is a border of x, and b′ is a border of b, then b′ is a border of x.

These observations, particularly the third and fourth, suggest that it may be possible to compute β[i+1] from β[1], β[2], ..., β[i]. The chief difficulty evidently arises when β[i] > 0 but x[i+1] ≠ x[β[i]+1]: in this case, it is necessary to look at the second-longest border of x[1..i]. If this border is empty, then β[i+1] = 1 (if x[1] = x[i+1]) or 0 (otherwise). If this border is not empty, then, denoting its length by β^2[i] ≥ 1, we will need to compare x[i+1] with x[β^2[i]+1]: if equal, then β[i+1] ← β^2[i]+1; if not, then we go on to consider the third-longest border, of length β^3[i], of x[1..i]. And so on, until finally for some k, β^k[i] = 0.

The argument made here will be easier to follow in terms of an example. Suppose that x = abaababa*, where the * indicates that the 9th letter is not as yet known. The border array of x is then 00112323?, where the ? indicates that the 9th entry remains to be computed. Observe that, for i = 8, β[i] = 3. If in fact it turns out that x[9] = x[i+1] = x[β[i]+1] = x[4] = a, then we immediately have β[9] = 4. If not, however, then we must consider the second border of x[1..8]; that is,

β^2[8] = β[β[8]] = β[3] = 1.

If x[8+1] = x[β^2[8]+1] = x[2] = b, we have β[9] = 2. But if x[9] is neither a nor b, and instead turns out to be a new letter c, we conclude that β[9] = 0.

In general, consider the quantities β^j[i], j = 1, 2, ..., k, representing the jth-longest borders of x[1..i]. (Here we take β^1[i] = β[i] and β^k[i] = 0.) Since for every j = 2, 3, ..., k, β^j[i] < β^(j-1)[i], it follows that x[1..β^j[i]] is a border of x[1..β^(j-1)[i]]. That is, the jth-longest border of x[1..i] is a border of the (j-1)th-longest border of x[1..i], and, in particular, it is the longest border of the (j-1)th-longest border of x[1..i]! In symbols,

β^j[i] = β[β^(j-1)[i]].   (1.6)
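The chain of second-longest, third-longest, ... borders given by (1.6) can be checked against the worked example above (again an editor's Python sketch, not from the text; the helper name extend is invented):

    def extend(beta, x, lam):
        # Given the border array beta of x (0-indexed: beta[i-1] is the book's
        # beta[i]), return the border-array value for x extended by letter lam,
        # following the chain beta[i], beta^2[i], ... of equation (1.6).
        b = beta[-1]
        while b > 0 and x[b] != lam:   # 0-indexed x[b] is the book's x[b+1]
            b = beta[b - 1]
        return b + 1 if x[b] == lam else 0

    x = "abaababa"
    beta = [0, 0, 1, 1, 2, 3, 2, 3]    # border array 00112323 from the text
    print(extend(beta, x, "a"))        # 4: x[9] = a extends the border aba
    print(extend(beta, x, "b"))        # 2: x[9] = b extends the second border a
    print(extend(beta, x, "c"))        # 0: a letter new to x leaves only the empty border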
If x[i+1] ≠ x[β^(j-1)[i]+1], then we can determine the next position to be tested against x[i+1] by computing β^j[i] according to (1.6), so that the next test would compare x[i+1] against x[β^j[i]+1]. It will be useful to summarize these observations in a lemma:

Lemma 1.3.1 For some integer n ≥ 1, let x = x[1..n] be a string with border array β = β[1..n]. Let k be the least integer such that β^k[n] = 0. Then
(a) for every integer j ∈ 1..k, x[1..n] has a border x[1..β^j[n]];
(b) for any choice of letter λ, the only possible borders of x[1..n+1] = x[1..n]λ are those whose lengths are members of the descending sequence

β^1[n] + 1 > β^2[n] + 1 > ··· > β^k[n] + 1 = 1,   (1.7)

together with the empty border of length zero. □
Equation (1.6) leads to an efficient algorithm for the computation of the array β[1..n], which contains the lengths of the longest borders of each of the prefixes x[1..i], i = 1, 2, ..., n, of the given string x. This is presented as Algorithm 1.3.1. As we shall see in Chapter 7, all of these lengths, not only β[n], are potentially useful when x is a specific pattern to be matched against some other string: they are used to determine how far to shift the pattern along the other string in case of a mismatch or "failure". For this reason the computation of β[1..n] has sometimes been called the failure function algorithm [AHU74].

Algorithm 1.3.1 (Border Array)
— Compute the border array β of x[1..n]
β[1] ← 0
for i ← 1 to n - 1 do
    b ← β[i]
    while b > 0 and x[i+1] ≠ x[b+1] do
        b ← β[b]
    if x[i+1] = x[b+1] then β[i+1] ← b + 1
    else β[i+1] ← 0

Theorem 1.3.2 Algorithm 1.3.1 correctly computes β[1..n].

Proof First consider the while loop: this loop handles the computation (1.6), replacing b = β^(j-1)[i] by b = β^j[i]. Thus b is monotone decreasing and so the while loop must terminate. Since
the for loop terminates when i = n, it follows that Algorithm 1.3.1 terminates after n - 1 steps. Let us consider the while loop further. Exit from this loop occurs if either b = 0 or x[i+1] = x[b+1], or both. If x[i+1] = x[b+1], we should set β[i+1] ← b + 1 regardless of whether or not b = 0; otherwise, it is clear that we must set β[i+1] ← 0. Thus the if structure in the algorithm deals correctly with the result of the while loop: it selects the greatest possible length from the sequence (1.7). We conclude that Algorithm 1.3.1 does indeed correctly compute β[i+1] for every i ∈ 0..n-1. □

Now we consider the time required by the algorithm. All the steps within the for loop, except possibly the while loop, require only constant time. Then Algorithm 1.3.1 requires Θ(n) time plus the total time used within the while loop. To estimate this total time, consider the values that b assumes during the execution of the algorithm: b is initially zero, and can be increased in value only by one, and only by the assignment β[i+1] ← b + 1 at the end of an iteration for i, followed by the assignment b ← β[i+1] at the beginning of the iteration for i + 1. Thus b can be incremented by one at most n - 2 times, and each such incrementation uses up one iteration of the for loop. At the same time, the only way that b can be decreased is in the while loop: since the number of decrements (always by at least one) cannot be greater than the number of increments (always by exactly one), it follows that the assignment statement b ← β[b] within this loop can be executed no more than n - 2 times in total. Hence

Theorem 1.3.3 Algorithm 1.3.1 requires Θ(n) time and constant additional space for its execution. □
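A direct transliteration of Algorithm 1.3.1 into runnable Python may be useful for experiment (an editor's sketch, not the author's implementation; 0-based indexing is used internally, so beta[i] below holds the book's β[i+1]). A second function illustrates the remark made at the end of this section: every border of x can be read off from β by following the chain β[n], β[β[n]], ... down to the empty border.

    def border_array(x: str) -> list[int]:
        n = len(x)
        beta = [0] * n                      # beta[0] = 0, the book's beta[1]
        for i in range(1, n):               # book's "for i <- 1 to n-1"
            b = beta[i - 1]                 # book's "b <- beta[i]"
            while b > 0 and x[i] != x[b]:   # book's "x[i+1] != x[b+1]"
                b = beta[b - 1]             # book's "b <- beta[b]", equation (1.6)
            beta[i] = b + 1 if x[i] == x[b] else 0
        return beta

    def all_borders(x: str) -> list[int]:
        # lengths of every border of x, longest first, ending with the empty border
        beta, lengths, b = border_array(x), [], len(x)
        while b > 0:
            b = beta[b - 1]
            lengths.append(b)
        return lengths

    f = "abaababaabaab"
    print(border_array(f))   # [0, 0, 1, 1, 2, 3, 2, 3, 4, 5, 6, 4, 5]
    print(all_borders(f))    # [5, 2, 0]: borders abaab, ab and the empty border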
Digression 1.3.1 This is the first time of many in this book that we have occasion to discuss the asymptotic space and time complexity of an algorithm. There are a couple of important points to be made about the assumptions that underlie the analysis leading to Theorem 1.3.3.

Consider the integer values i and b of Algorithm 1.3.1, as well as those stored in the border array β: each of these values may be as large as n - 1, and so for each one ⌈log₂ n⌉ bits need to be reserved for storage. Thus, strictly speaking, the storage requirement, therefore the processing requirement, for an element of β is actually log₂ n/w ∈ O(log n), where w is the word length of the computer. This is not constant at all! Based on this analysis, the space required for i and b is O(log n), while that for β is O(n log n), and the overall processing time becomes O(n log n) rather than Θ(n).

We resolve this difficulty here, and throughout this book, by making the assumption that the size of the problem considered is not gigantic, that it is subject to some reasonable bound. Thus, if n is the problem size, we are supposing that there always exists a (small) constant k such that log₂ n/w ≤ k. Indeed, in any practical context, we will not go far wrong if we assume that k = 1: in a modern computer, w ≥ 32, and so log₂ n/w ≤ 1 provided n ≤ 2^32 ≈ 4.3 × 10^9. If we are dissatisfied with this limitation, we may instead luxuriate in the supposition that k = 2, hence that n ≤ 2^64 ≈ 1.8 × 10^19, certainly large enough to deal with any normal person's string processing requirements. In fact, my pocket calculator tells me that just to scan 1.8 × 10^19 computer words (string elements) at a rate of one billion per second will require 570 years — it is not we, but rather our distant descendants, who will finally get to the end of such a string! That being said, I am still not unsympathetic to the purist who multiplies occurrences of n in the time and space complexities quoted in this book by a log n factor. In some sense the factor should "really" be there: it is only in deference to ordinary practical considerations that it is omitted.

The second point to be made relates to the sets O, Ω, and Θ used throughout this book to characterize the asymptotic time and space requirements of algorithms. We suppose that the reader is familiar with the definitions of these sets; if not, the seminal reference is [K76], and [RS92] may also be found useful. But the story does not end there. When O(n), for example, is used to describe a property of an algorithm, an intellectual leap is made that requires justification. If we say execution time ∈ O(n), we treat the execution time of the algorithm as if it were a function of the size n of the problem instance, when of course it is no such thing — it is instead a function of the problem instance itself. As a rule there will be a large number, in fact usually an infinite number, of problem instances corresponding to any given problem size n. Nevertheless, the approach adopted in this book will be to treat both execution space and execution time as if they were functions of problem size. Thus, for example, to say that

execution time ∈ Θ(n)   (1.8)
will mean that, over all problem instances whose size exceeds some fixed value n₀, the algorithm requires time at least k₁n and at most k₂n, where k₁ and k₂ are also fixed. By contrast, the statement

execution space ∈ O(n)

makes the considerably weaker assertion that, again over all problem instances of size greater than n₀, the algorithm requires space at most k₂n. Thus, in this case, it is possible that there may exist problem instances whose space requirement is proportional to a quantity, perhaps log n or √n or n/log n or even 1, that is asymptotically less than n; in fact, it is even possible that all problem instances have space requirement proportional to such a quantity. Throughout this book, we will use Θ to describe algorithmic properties only when the strong condition illustrated in (1.8) is satisfied. Usually, when we use O(f(n)) (respectively, Ω(f(n))), we will mean that there do exist collections of problem instances for which the property is in fact exactly of order f(n), but that also there exist other collections of problem instances for which the property is asymptotically less than (respectively, greater than) order f(n). □

We have seen that Algorithm 1.3.1 requires Θ(n) time for its execution over all problem instances; since any algorithm that computed the border array would need to access each of n positions at least once, we are therefore justified in describing Algorithm 1.3.1 as asymptotically optimal: no algorithm could compute β[1..n] in less than Θ(n) time and constant space for any problem instance of size n. As discussed in Section 4.1, Algorithm 1.3.1 is in a certain sense an on-line algorithm, yielding at each step a result for the current position i as the given string is processed from left to right. Thus, without backtracking, the algorithm yields a result, not only for the given string, but also for every prefix of it. However, in a stricter sense, Algorithm 1.3.1 would not be described as on-line: since any position i′ < i in x may need to be visited in order to assist in the calculation for i, the entire string needs to be available for processing at every step of the calculation. As an example of a border calculation, suppose that the string

    1 2 3 4 5 6 7 8 9 10 11 12 13
f = a b a a b a b a a b  a  a  b,

introduced in Section 1.2, is given. Then the border array β[1..13] of f is

0 0 1 1 2 3 2 3 4 5 6 4 5.

Observe how, in a string such as f with many repetitive substrings, the values in the border array may fall back and then rise again. For example, β[6..8] = 323 and β[11..13] = 645. The way in which the border array β of a given string x is calculated makes it clear that all of the borders of x (indeed, of every prefix of x) can be computed from β. This follows from the observation stated in (1.6) that the second-longest border of x[1..i] is