
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014

Effective Method for Searching Substrings in Large Databases Using a Query-Based Approach

J. RamNaresh Yadav

M.Tech 2nd year, Dept. of CSE, ASCET, Gudur, India

Email: ramnaresh167@gmail.com

Abstract – This paper deals with direct substring search in a long string, given a query point in the string, as a way of searching large databases. Traditional approaches focus on effective search using approximate string matching, but they do not consider how to answer an exact substring query on a large database. We address this problem and develop an effective algorithm for query answering. First, we develop an algorithm that answers a smallest unique substring query in O(n) time using a suffix tree index. Second, we compute a smallest unique substring for every position of a given string. Once the smallest unique substrings are pre-computed, smallest unique substring queries can be answered online in constant time.

Index terms – substring queries, query answering, suffix tree.

I. INTRODUCTION

Suppose you are searching the Complete Works of William Shakespeare using the query term “king”. The term “king” occurs 1,546 times in 1,392 speeches within 40 works, even without counting related words such as “king’s” and “kings”. Using modern information retrieval techniques, such as an inverted index, one can easily find all occurrence positions of a query word. How can a search engine, however, present an informative list of the search results? Showing all occurrence positions with a snippet around each is one option, but if the snippet length is short, some snippets may be identical, and those occurrences still cannot be distinguished.

A smarter way is to present, for each occurrence, a smallest snippet that contains the query term and is different from all other snippets of the query term. This simple yet effective application in document search introduces the novel problem tackled in this paper. Given a (long) string S and a query point q in S, we want to conduct a smallest unique substring query that finds a smallest unique substring containing q.

Shortest unique substring queries have many potential applications. In addition to the document search example above, they can be used in bioinformatics: finding shortest unique substrings on DNA sequences can help polymerase chain reaction (PCR) primer design in molecular biology, and it can help to identify unique DNA signatures of closely related species or organisms. The shortest unique substring of the event under

INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 108

www.iaetsd.in


investigation may serve as a concrete working base of the event context.

Answering shortest unique substring queries efficiently is far from trivial. A brute-force search may easily cost time quadratic in the length of the string, which is unacceptable in practice when the string is long and queries are expected to be answered online.

In this paper, we address the problem of answering shortest unique substring queries from the algorithmic point of view and make several contributions. First, we model shortest unique substring queries and explore their properties thoroughly. The properties clearly distinguish shortest unique substring queries from existing related problems, such as computing global minimal substrings. Second, we present an algorithm to answer a shortest unique substring query in O(n) time using a suffix tree index, which can be constructed in O(n) time and space, where n is the length of string S. Third, we show that, using O(n · h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is a variable that is theoretically in O(n) but on real data sets is often much smaller than n and can be treated as a constant.

II. SMALLEST UNIQUE SUBSTRING QUERIES

In this section, we formulate shortest unique substring queries and discuss the properties of several critical concepts.

A. Smallest Unique Substring Queries

Let S be a string of length n. Denote by S[i] the value at the i-th position of S, and by S[i, j] = S[i] … S[j] the substring starting at position i and ending at position j. We say that S[p, j] is a substring starting at position p. Two strings X and Y are identical, denoted by X = Y, if |X| = |Y| and, for every 1 ≤ i ≤ |X|, X[i] = Y[i]. If X = Y[i, j] for some positions i ≤ j, X is called a substring of Y, denoted by X ⊆ Y. We call X a proper substring of Y, denoted by X ⊂ Y, if X ⊆ Y and X ≠ Y.

Definition 1 (Minimal unique substring (MUS)): A substring S[i, j] is unique in S if there does not exist another substring S[i′, j′] such that S[i, j] = S[i′, j′]. S[i, j] is called a minimal unique substring if S[i, j] is unique and no proper substring of S[i, j] is also unique.

Definition 2 (Smallest unique substring (SUS)): Given a string S and a position p in S, a substring S[i, j] is a smallest unique substring at position p if S[i, j] is unique and contains p, and there does not exist another unique substring S[i′, j′] such that S[i′, j′] also contains p and j′ − i′ < j − i.

Definition 3 (Problem definition): Given a string S and a position p, the smallest unique substring query (SUSQ) is to find a SUS at position p. Any member of SUS(p), the set of SUSs at position p, is a valid answer.

In our algorithm design, we often consider two types of unique substrings that may be candidates of SUSs. We give the definitions here and will pursue further discussion later.

Definition 4 (LSUS and RSUS): Given a string S and a position p in S, a substring S[p, j] is called the left-bound SUS (LSUS) for position p, denoted by LSUS(p), if S[p, j] is unique and no other substring S[p, j′] is also


unique for p ≤ j′ < j. Symmetrically, a substring S[i, p] is called the right-bound SUS (RSUS) for position p, denoted by RSUS(p), if S[i, p] is unique and no other substring S[i′, p] is also unique for i < i′ ≤ p. Moreover, we define the leftmost SUS to be the SUS whose starting point is smallest, denoted by leftmost-SUS(p).

Figure 1 shows the relationship among LSUS(p), RSUS(p) and the MUSs that contain p. Figure 2 illustrates the concept of the leftmost SUS.

Figure 1. The relationship among the three cases.

Figure 2. The leftmost SUS at a position p.

It is easy to see the following property.

Property 1 (LSUS and RSUS): Given a string S, for every position p in S, LSUS(p) and RSUS(p), if they exist, are unique, respectively. In some cases, LSUSs or RSUSs may not exist. Moreover, for a position p in S, LSUS(p) or RSUS(p) may not be a SUS.

III. QUERY ANSWERING USING SUFFIX TREES

In this section, we first review the suffix tree and its construction. Then we discuss how to use a suffix tree as an index to answer smallest unique substring queries.

A. Suffix Trees and Construction

A suffix tree is a data structure that concisely records all suffixes of a given string and allows fast string search operations. A string S of length n has n suffixes S[i, n], i = 1, …, n. In the suffix tree of S, each edge represents a substring of S, and a path from the root to a leaf node represents exactly one suffix of S.

Ukkonen proposed a well-known suffix tree construction method that requires only linear time and space when the alphabet size of the string is a constant. Taking S = 11011001 as an example, we briefly show how to construct its suffix tree, shown in Figure 3, using Ukkonen’s algorithm. The construction procedure is illustrated in Figure 4.

Figure 3. The suffix tree of S = 11011001.

Figure 4. The construction of the suffix tree of S = 11011001.
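As a rough illustration of the property that every root-to-leaf path spells exactly one suffix, the following Python sketch builds an uncompressed suffix trie for S = 11011001. It is not Ukkonen’s algorithm and not a compressed suffix tree: it takes quadratic time and space and is meant only for small examples. The function names and the "leaf" marker key are assumptions made for this illustration.

```python
def build_suffix_trie(s, end="$"):
    """Insert every suffix of s (padded with an end symbol) into a nested-dict trie."""
    text = s + end
    root = {}
    for start in range(len(s)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # Mark the end of the path with the 1-indexed start of the suffix.
        node["leaf"] = start + 1
    return root


def leaf_positions(node):
    """Collect the suffix start positions stored at the leaves below node."""
    found = []
    for key, child in node.items():
        if key == "leaf":
            found.append(child)
        else:
            found.extend(leaf_positions(child))
    return sorted(found)


def contains(root, pattern):
    """A pattern is a substring of s iff it is a prefix of some suffix."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("11011001")
print(leaf_positions(trie))    # one leaf per suffix
print(contains(trie, "1100"))  # substring search by walking from the root
```

A real suffix tree merges each chain of single-child nodes into one edge, which is what brings the space down to O(n).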


To extend the suffix tree of S[1, i] to that of S[1, i+1], we need to extend each S[j, i], for 1 ≤ j ≤ i, with S[i+1]. There are three possible cases.

1) S[j, i] ends at a leaf node. Then, we append S[i+1] to the corresponding leaf edge.
2) S[j, i] does not end at a leaf node and is not followed by S[i+1]. Then, we split the edge and create a new node.
3) S[j, i] does not end at a leaf node but is followed by S[i+1]. In this case, we do not need to do anything.

When we extend the tree from j = 1 to j = i during phase i+1, the occurrences of these three cases follow some properties. First, after case 2 or case 3 happens, case 1 never happens again in that phase. With these properties, once we meet case 3 at step j of phase i, we can immediately finish the current phase and start phase i+1 at step j.

To ensure O(n) construction time, Ukkonen’s algorithm uses suffix links and the skip/count technique during the tree construction. A suffix link is a directed link from an internal node associated with substring S[i, j] to another internal node associated with substring S[i+1, j], which allows a fast jump to the next extension point in the tree. The skip/count technique enables us to add the new character S[i+1] at phase i+1 quickly. To save more space, instead of storing copies of substrings, we label edges using start and end indexes. The end index of a leaf edge is omitted and denoted by −. Finally, an end symbol $ is padded at each path as a leaf node. As a result, the space used for a suffix tree is reduced to O(n).

B. Query Answering Using Suffix Trees

Given a string S, we first build its suffix tree in O(n) space and O(n) time using Ukkonen’s algorithm. We further store all leaf nodes in an array so that we can access a specific leaf node Leaf(i), its edge Edge(Leaf(i)) and its associated string Sedge(Leaf(i)) in constant time. We can use the suffix tree to get LSUS(p) in constant time, as shown in Algorithm 1.

Algorithm 1 The LSUS finding algorithm
Input: string S[1, n], a position p, and the suffix tree T of S
Output: LSUS(p)
1. find the leaf node of S[p, n] in T; // the leaf node can be indexed during the construction of the suffix tree, so access to the leaf node costs O(1) time
2. if the label of the leaf edge is $ then return null;
3. end if
4. l ← the length of the label of the leaf edge; // the padded terminal character is not counted in the length of the leaf edge
5. return S[p, n − l + 1]

We first locate the leaf node corresponding to p in the suffix tree. Backtracking from the leaf node, we meet an internal node. Based on the property of the suffix tree, the string represented by the path from the root to this internal node is a common prefix of different suffixes, so it is not unique. Extending it by the first character of the leaf edge gives a string that occurs only at position p, which is LSUS(p).

With LSUS(p), we can now find a SUS (smallest unique substring) containing position p, as shown in Algorithm 2.
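The output of Algorithm 1 can be cross-checked without a suffix tree. The Python sketch below computes LSUS(p) and RSUS(p) directly from Definition 4 by counting occurrences. It is a brute-force check under the assumption that the string is small, not the constant-time lookup described above, and the function names are hypothetical.

```python
def count_occurrences(s, sub):
    """Number of (possibly overlapping) occurrences of sub in s."""
    return sum(1 for k in range(len(s) - len(sub) + 1) if s.startswith(sub, k))


def is_unique(s, i, j):
    """True if S[i, j] (1-indexed, inclusive) occurs exactly once in s."""
    return count_occurrences(s, s[i - 1:j]) == 1


def lsus(s, p):
    """Shortest unique substring starting at p, or None if there is none."""
    for j in range(p, len(s) + 1):
        if is_unique(s, p, j):
            return (p, j)
    return None


def rsus(s, p):
    """Shortest unique substring ending at p, or None if there is none."""
    for i in range(p, 0, -1):
        if is_unique(s, i, p):
            return (i, p)
    return None


s = "11011001"
for p in range(1, len(s) + 1):
    print(p, lsus(s, p), rsus(s, p))
```

For S = 11011001 and p = 6, the leaf edge of suffix S[6, 8] carries the label 01 followed by $, so l = 2 and Algorithm 1 returns S[6, 8 − 2 + 1] = S[6, 7], which agrees with the brute-force result.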


Algorithm 2 The baseline SUS finding algorithm
Input: string S, a position p, and the suffix tree T of S
Output: the leftmost SUS containing position p
1. find LSUS(p);
2. if LSUS(p) ≠ null then
3.   let S[i, j] be LSUS(p);
4. else
5.   let S[i, j] be S[1, n];
6. end if
7. for k ← p − 1; k > 0 and p − k ≤ j − i; k ← k − 1 do
8.   if LSUS(k) is null then continue;
9.   end if
10.  let S[k, l] be LSUS(k);
11.  if l < p then l ← p;
12.  end if
13.  if l − k ≤ j − i then i = k; j = l;
14.  end if
15. end for
16. return S[i, j];

IV. A CONSTANT TIME ONLINE QUERY ANSWERING ALGORITHM

In this section, we develop a method that pre-computes the leftmost SUS for every position using linear space. Then, online query answering can be conducted in constant time.

A. Ideas

We first observe that a smallest unique substring must fall into one of three cases: a MUS, an LSUS, or an RSUS. This can be established using several theorems. In this paper, we briefly state the theorems that are useful.

Theorem 1: Given a string S and a position p in S, if S[i, j] is a SUS at position p but not a MUS, then either i = p or j = p.

Theorem 2 indicates that, by scanning S once and obtaining the LSUS at each position, we can find all MUSs and all LSUSs that are not MUSs. To determine whether a SUS is an RSUS, we have the following result.

Theorem 3: Given a string S and a position p, a substring S[i, p] = RSUS(p) if and only if S[i, p] contains only one MUS S[i, j] (j ≤ p).

B. The Framework

Given a string S, our algorithm first constructs a suffix tree. This takes O(|S|) time and O(|S|) space. For each position p, the algorithm maintains the currently shortest MUS that contains position p, denoted by p.cand. It also takes O(|S|) space to store the MUSs recorded for the n positions. Therefore, our algorithm needs only O(|S|) space overall.

Algorithm 3 shows the pseudo-code of our method.

Algorithm 3 The pre-computation algorithm
Input: string S
Output: a SUS for each position 1 ≤ p ≤ |S|
1. build a suffix tree for string S;
2. initialize p.cand ← null for 1 ≤ p ≤ |S|;
3. output LSUS(1), denoted by S[1, j], as the SUS at position 1;
4. i_prev ← 1, j_prev ← j; // use LSUS(1) to initialize the SUS at position p − 1
5. for p = 2 to |S| do


6.   let S[p, l] be LSUS(p) obtained from the suffix tree;
7.   let S[i, j] be the shortest substring among the following strings: (1) p.cand, if p.cand ≠ null; (2) S[i_prev, p], the SUS output at position p − 1 extended to p, if j_prev < p; (3) S[i_prev, j_prev], the SUS output at position p − 1, if j_prev ≥ p; and (4) S[p, l], if LSUS(p) exists. If more than one substring has the smallest length, pick the leftmost one. Output S[i, j] as a SUS at position p;
8.   suppose p.cand = S[i′, j′] when p.cand ≠ null;
9.   if p.cand ≠ null and j′ > p then
10.    call PROPAGATE(S[i′, j′], p + 1);
11.  end if
12.  if LSUS(p) = S[p, l] is a MUS and is not p.cand, and l > p then call PROPAGATE(S[p, l], p + 1);
13.  end if
14.  i_prev ← i, j_prev ← j;
15. end for

At the beginning, we initialize p.cand to null for all positions p. Our algorithm scans string S from the beginning to the end. At position 1, LSUS(1) is the only SUS containing position 1. At each position p (p > 1), we compute LSUS(p) using the suffix tree in constant time.

C. MUS Propagation

Although we reserve space to record the smallest MUS for each position, we do not need to explicitly store one MUS at each p.cand. Instead, we only need to store a MUS at those positions p where the smallest MUS may not be obtained from LSUS(p) and RSUS(p). Algorithm 4 gives the pseudo-code of the propagation procedure.

Algorithm 4 The propagation procedure
1: procedure PROPAGATE(MUS S[i, j], the position k to propagate)
2:   if k > j then
3:     // k is not in the range of [i, j]
4:     return
5:   else if k.cand is null then
6:     k.cand ← S[i, j]; return
7:   end if
8:   suppose k.cand = S[i′, j′];
9:   if j′ − i′ > j − i then // k.cand is longer than S[i, j]
10:    k.cand ← S[i, j];
11:    if j′ > j then // S[i, j] ends before j′
12:      call PROPAGATE(S[i′, j′], j + 1);
13:    end if
14:  else if j′ − i′ < j − i then // k.cand is shorter than S[i, j]
15:    if j′ < j then // k.cand ends before j
16:      call PROPAGATE(S[i, j], j′ + 1);
17:    end if
18:  else // S[i, j] and k.cand have the same length
19:    if i < i′ then
20:      k.cand ← S[i, j];
21:      call PROPAGATE(S[i′, j′], j + 1);
22:    else if i > i′ then
23:      call PROPAGATE(S[i, j], j′ + 1);
24:    end if
25:  end if
26:  return
27: end procedure
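To make the query-answering idea concrete, here is a minimal Python sketch that follows the steps of the baseline search (Algorithm 2) and stores the answer for every position in a table, so that an online query becomes a single list lookup. Two simplifying assumptions are made: LSUS(p) is found by brute force rather than from a suffix tree, and the table is filled by repeated baseline queries rather than by Algorithm 3 with MUS propagation. Only the lookup step is therefore constant time, and the function names are hypothetical.

```python
def lsus(s, p):
    """Brute-force LSUS(p): shortest unique substring starting at p (1-indexed)."""
    for j in range(p, len(s) + 1):
        sub = s[p - 1:j]
        # sub is unique when it first occurs at p and never occurs again later.
        if s.find(sub) == p - 1 and s.find(sub, p) == -1:
            return (p, j)
    return None


def baseline_sus(s, p):
    """Leftmost SUS containing position p, following the steps of Algorithm 2."""
    n = len(s)
    first = lsus(s, p)
    # The whole string is always unique, so it is a safe initial candidate.
    i, j = first if first is not None else (1, n)
    k = p - 1
    while k > 0 and p - k <= j - i:
        cand = lsus(s, k)
        if cand is not None:
            # Extend LSUS(k) to the right if it does not reach p yet.
            l = max(cand[1], p)
            if l - k <= j - i:  # "<=" keeps the leftmost among equal lengths
                i, j = k, l
        k -= 1
    return (i, j)


def precompute_sus(s):
    """Table of leftmost SUSs; entry p - 1 holds the answer for position p."""
    return [baseline_sus(s, p) for p in range(1, len(s) + 1)]


def query(table, p):
    """Online query: a constant-time lookup in the precomputed table."""
    return table[p - 1]


s = "11011001"
table = precompute_sus(s)
i, j = query(table, 8)
print((i, j), s[i - 1:j])
```

The early exit in the loop is safe because any substring that starts at k and contains p has length at least p − k + 1, so once p − k exceeds j − i no position further to the left can improve on the current candidate.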


V. EXPECTED RESULTS

We plan to conduct extensive experiments on three real data sets and a group of synthetic data sets to evaluate our methods. The first real data set is an introduction to the R language, which appears in the FAQs on the R project website (http://www.r-project.org/). The second real data set is the genome sequence of Mycoplasma genitalium, the pathogenic bacterium that has one of the smallest genomes known for any free-living organism. The third data set is the Bible.

For the R-sequence, we chose the positions where the language name R appeared as the query positions. For a given sequence, we can compute one SUS at each position. We want to observe the distribution of the SUS counts over different SUS lengths on the R-sequence. In addition to the original R-sequence itself, we further generated three mutations with the same string length using the same alphabet. For the original R-sequence, most SUSs are of length 3. As the length increases, the corresponding count decreases.

VI. CONCLUSION

In this paper, we formulated a novel type of query, the smallest unique substring query, which has many applications. We developed efficient algorithms for answering it. Furthermore, our study leads to a new direction in string queries. As future work, it is interesting to extend and generalize smallest unique substring queries.

REFERENCES

[1] B. Haubold, N. Pierstorff, F. Möller, and T. Wiehe, “Genome comparison without alignment using shortest unique substrings,” BMC Bioinformatics, vol. 6, no. 123, May 2005.

[2] Weiner, “Linear pattern matching algorithms,” in Proc. of the 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), 1973, pp. 1–11.

[3] U. Manber and G. Myers, “Suffix arrays: a new method for on-line string searches,” in Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, 1990, pp. 319–327.

[4] E. Ukkonen, “On-line construction of suffix trees,” Algorithmica, vol. 14, pp. 249–260, 1995.

[5] M. Farach, “Optimal suffix tree construction with large alphabets,” in Proc. of the 38th Annual Symposium on Foundations of Computer Science (FOCS’97), 1997.

[6] S. J. Puglisi, W. F. Smyth, and A. H. Turpin, “A taxonomy of suffix array construction algorithms,” ACM Computing Surveys, vol. 39, no. 2, July 2007.

[7] G. Nong, S. Zhang, and W. H. Chan, “Linear time suffix array construction using d-critical substrings,” in Proc. 20th Annual Symp. Combinatorial Pattern Matching, 2009, pp. 54–67.

[8] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.


[9] L. Ilie and W. F. Smyth, “Minimum unique substrings and maximum repeats,” Fundamenta Informaticae, vol. 110, no. 1-4, pp. 183–195, 2011.

[10] K. Ye, Z. Jia, Y. Wang, P. Flicek, and R. Apweiler, “Mining unique-m substrings from genomes,” Journal of Proteomics and Bioinformatics, vol. 3, no. 3, pp. 99–100, 2010.
