INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
Effective Method for Searching Substrings in Large Databases Using a Query-Based Approach
J. RamNaresh Yadav
M.Tech 2nd year, Dept. of CSE, ASCET, Gudur, India. Email: ramnaresh167@gmail.com
Abstract – This paper deals with exact substring search in a long string, given a query point in that string, as a way of searching large databases. Traditional approaches focus on effective search using approximate string matching, but they do not consider how to find an exact substring for a query in a large database. We address this problem and develop an effective algorithm for query answering. First, we develop an algorithm that answers a smallest unique substring query in O(n) time using a suffix tree index. Second, we compute a smallest unique substring for every position of a given string. Once the smallest unique substrings are pre-computed, smallest unique substring queries can be answered online in constant time.

Index terms – substring queries, query answering, suffix tree.

I. INTRODUCTION

Suppose you are searching the Complete Works of William Shakespeare using the query term "king". The term "king" occurs 1,546 times in 1,392 speeches within 40 works, even without counting related words such as "king's" and "kings". Using modern information retrieval techniques, such as an inverted index, one can easily find all occurrence positions of a query word. How can a search engine, however, present an informative list of the search results? Simply showing all occurrence positions is not informative. If each occurrence is instead shown with a snippet of text and the snippet length is short, some snippets may be identical, and those occurrences still cannot be distinguished. A smarter way is to present, for each occurrence, a smallest snippet that contains the query term and is different from all other snippets of the query term.

This simple yet effective application in document search introduces an interesting novel problem, which we tackle in this paper. Given a (long) string S and a query point q in S, we want to conduct a smallest unique substring query that finds a smallest unique substring containing q.

Shortest unique substring queries have many potential applications. In addition to the document search example above, shortest unique substring queries can be used in bioinformatics. Finding shortest unique substrings on DNA sequences can help polymerase chain reaction (PCR) primer design in molecular biology. It can also help to identify unique DNA signatures of closely related species or organisms. The shortest unique substring of the event under
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 108
www.iaetsd.in
investigation may serve as a concrete working base of the event context.

Answering shortest unique substring queries efficiently is far from trivial. A brute-force (heuristic) search may easily lead to a time cost quadratic in the length of the string, which is unacceptable in practice when the string is long and queries are expected to be answered online.

In this paper, we address the problem of answering shortest unique substring queries from the algorithmic point of view and make several contributions. First, we model shortest unique substring queries and explore their properties thoroughly. The properties clearly distinguish shortest unique substring queries from existing related problems, such as computing global minimal substrings. Second, we present an algorithm to answer a shortest unique substring query in O(n) time using a suffix tree index, which can be constructed in O(n) time and space, where n is the length of string S. Third, we show that, using O(n · h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is a variable that is theoretically in O(n) but on real data sets is often much smaller than n and can be treated as a constant.

II. SMALLEST UNIQUE SUBSTRING QUERIES

In this section, we formulate shortest unique substring queries and discuss the properties of several critical concepts.

A. Smallest Unique Substring Queries

Let S be a string of length n. Denote by S[i] the value at the i-th position of S, and by S[i, j] = S[i]…S[j] the substring starting at position i and ending at position j. We also refer to S[p, j] as a substring starting at position p. Two strings X and Y are identical, denoted by X = Y, if |X| = |Y| and X[i] = Y[i] for every 1 ≤ i ≤ |X|. If X = Y[i, j] for some 1 ≤ i ≤ j ≤ |Y|, X is called a substring of Y, denoted by X ⊆ Y. We call X a proper substring of Y, denoted by X ⊂ Y, if X ⊆ Y and X ≠ Y.

Definition 1 (Minimal unique substring (MUS)): A substring S[i, j] is unique in S if there does not exist another substring S[i′, j′] such that S[i, j] = S[i′, j′]. S[i, j] is called a minimal unique substring if S[i, j] is unique and no proper substring of S[i, j] is also unique.

Definition 2 (Smallest unique substring (SUS)): Given a string S and a position p in S, substring S[i, j] is a smallest unique substring at position p if S[i, j] is unique and contains p, and there does not exist another unique substring S[i′, j′] such that S[i′, j′] also contains p and j′ − i′ < j − i.

Definition 3 (Problem definition): Given a string S and a position p, the smallest unique substring query (SUSQ) is to find a SUS at position p. Any member of SUS(p), the set of SUSs at position p, is a valid answer.

In our algorithm design, we often consider two types of unique substrings that may be candidates for SUSs. We give the definitions here and pursue further discussion later.

Definition 4: Given a string S and a position p in S, a substring S[p, j] is called the left-bound SUS (LSUS) for position p, denoted by LSUS(p), if S[p, j] is unique and no other substring S[p, j′] is also unique for p ≤ j′ < j. Symmetrically, S[i, p] is called the right-bound SUS (RSUS) for position p, denoted by RSUS(p), if S[i, p] is unique and no other substring S[i′, p] is also unique for i < i′ ≤ p. Moreover, we define the leftmost SUS to be the SUS whose starting point is smallest, denoted by leftmost-SUS(p).

Figure 1 shows the relationship among LSUS(p), RSUS(p) and the MUSs containing p. Figure 2 illustrates the concept of the leftmost SUS.

Figure 1. The relationship among three cases.
Figure 2. The leftmost SUS at a position p.

It is easy to see the following property.

Property 1 (LSUS and RSUS): Given a string S, for every position p in S, LSUS(p) and RSUS(p), if they exist, are each uniquely determined. In some cases, LSUSs or RSUSs may not exist. Moreover, for a position p in S, LSUS(p) or RSUS(p) may not be a SUS.

III. QUERY ANSWERING USING SUFFIX TREES

In this section, we first review suffix trees and their construction. Then we discuss how to use a suffix tree as an index to answer smallest unique substring queries.

A. Suffix Trees and Construction

A suffix tree is a data structure that concisely records all suffixes of a given string and allows fast string search operations. A string S of length n has n suffixes S[i, n], i = 1, …, n. In the suffix tree of S, each edge represents a substring of S, and a path from the root to a leaf node represents exactly one suffix of S.

Ukkonen proposed a well-known suffix tree construction method that requires only linear time and space when the alphabet size of the string is a constant. Taking S = 11011001 as an example, we briefly show how to construct its suffix tree, shown in Figure 3, using Ukkonen's algorithm. The construction procedure is illustrated in Figure 4.

Figure 3. The suffix tree of S = 11011001.
Figure 4. The construction of the suffix tree of S = 11011001.
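As a sanity check on Definitions 1 and 2 of Section II, the following Python sketch enumerates unique substrings by brute force on the running example S = 11011001. It is not the suffix-tree method of this paper; the function names are illustrative assumptions, and positions are 1-based and inclusive as in the text.

```python
def count_occurrences(s, sub):
    """Count (possibly overlapping) occurrences of sub in s."""
    return sum(1 for k in range(len(s) - len(sub) + 1) if s.startswith(sub, k))


def is_unique(s, i, j):
    """True if S[i, j] (1-based, inclusive) occurs exactly once in s."""
    return count_occurrences(s, s[i - 1:j]) == 1


def minimal_unique_substrings(s):
    """All MUSs of s as (i, j) pairs (Definition 1)."""
    n = len(s)
    result = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if not is_unique(s, i, j):
                continue
            # Any superstring of a unique substring is itself unique, so it is
            # enough to test the two substrings that drop one end character.
            shorter = [(i + 1, j), (i, j - 1)] if j > i else []
            if not any(is_unique(s, a, b) for a, b in shorter):
                result.append((i, j))
    return result


def smallest_unique_substrings(s, p):
    """All SUSs at position p as (i, j) pairs, leftmost first (Definition 2)."""
    n = len(s)
    cands = [(i, j)
             for i in range(1, p + 1)
             for j in range(p, n + 1)
             if is_unique(s, i, j)]
    best = min(j - i for i, j in cands)  # the whole string is always a candidate
    return [(i, j) for i, j in cands if j - i == best]


if __name__ == "__main__":
    S = "11011001"
    print(minimal_unique_substrings(S))
    for p in range(1, len(S) + 1):
        print(p, smallest_unique_substrings(S, p))
```

On S = 11011001 the sketch reports the MUSs S[2, 4] = 101, S[3, 5] = 011 and S[6, 7] = 00, and position 5 has two SUSs, S[3, 5] and S[5, 7], which is why Definition 3 accepts any member of SUS(p).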
To extend the suffix tree of S[1, i] to that of S[1, i+1], we need to extend each suffix S[j, i], 1 ≤ j ≤ i, by the character S[i+1]. There are three possible cases.

1) S[j, i] ends at a leaf node. Then, we append S[i+1] to the corresponding leaf edge.
2) S[j, i] does not end at a leaf node and is not followed by S[i+1]. Then, we split the edge and create a new node.
3) S[j, i] does not end at a leaf node but is followed by S[i+1]. In this case, we do not need to do anything.

When we expand the tree from j = 1 to j = i during phase i+1, the occurrences of these three cases follow some properties. First, after case 2 or case 3 happens, case 1 never happens again. With these properties, once we meet case 3 at step j of phase i, we can immediately finish the current phase and start phase i+1 at step j.

To ensure O(n) construction time, Ukkonen's algorithm uses suffix links and the skip/count technique during the tree construction. A suffix link is a directed link from an internal node associated with substring S[i, j] to another internal node associated with substring S[i+1, j], which allows a fast jump to the next extension point in the tree. The skip/count technique enables us to add the new character S[i+1] at phase i+1 quickly. To save space, instead of storing copies of substrings, we label edges using start and end indexes. The end index of a leaf edge is omitted and denoted by −. Finally, an end symbol $ is padded at each path as a leaf node. As a result, the space used for a suffix tree is reduced to O(n).

B. Query Answering Using Suffix Trees

Given a string S, we first build its suffix tree in O(n) space and O(n) time using Ukkonen's algorithm. We further store all leaf nodes in an array so that we can access a specific leaf node Leaf(i), its edge Edge(Leaf(i)), and its associated string S_edge(Leaf(i)) in constant time. We can use the suffix tree to get LSUS(p) in constant time, as shown in Algorithm 1.

Algorithm 1 The LSUS finding algorithm
Input: string S[1, n], a position p, and the suffix tree T of S
Output: LSUS(p)
1. find the leaf node of S[p, n] in T; // the leaf node can be indexed during the construction of the suffix tree, so the access to the leaf node costs O(1) time
2. if the label of the leaf edge is $ then return null;
3. end if
4. l ← the length of the label of the leaf edge; // the padded terminal character is not counted into the length of the leaf edge
5. return S[p, n − l + 1]

We first locate the leaf node corresponding to p in the suffix tree. Backtracking from the leaf node, we meet an internal node. Based on the property of the suffix tree, the string represented by the path from the root to this internal node is a common prefix of different suffixes. Extending this prefix by the first character of the leaf edge therefore yields the shortest prefix of S[p, n] that occurs only once, which is LSUS(p).

With LSUS(p), we can now find a SUS (smallest unique substring) containing position p, as shown in Algorithm 2.
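Algorithm 1 and the baseline scan of Algorithm 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `lsus` uses a naive substring search as a stand-in for the constant-time suffix tree lookup, so it does not achieve the complexity stated in the text, and both function names are hypothetical. Positions are 1-based and inclusive.

```python
def lsus(s, p):
    """Return (p, j) such that S[p, j] is LSUS(p), or None if it does not exist.

    Naive stand-in for Algorithm 1: tries prefixes of the suffix S[p, n]
    from shortest to longest and returns the first one that occurs once.
    """
    n = len(s)
    for j in range(p, n + 1):
        sub = s[p - 1:j]
        # Unique: the first occurrence is the one at p and no later one exists.
        if s.find(sub) == p - 1 and s.find(sub, p) == -1:
            return (p, j)
    return None


def leftmost_sus(s, p):
    """Baseline scan: leftmost SUS containing position p, as an (i, j) pair."""
    n = len(s)
    start = lsus(s, p)
    # If LSUS(p) does not exist, fall back to the whole string, which is unique.
    i, j = start if start is not None else (1, n)
    k = p - 1
    # A substring starting at k and covering p spans at least p - k,
    # so the scan can stop once that already exceeds the best span found.
    while k > 0 and p - k <= j - i:
        cand = lsus(s, k)
        if cand is not None:
            l = max(cand[1], p)  # stretch LSUS(k) to the right until it covers p
            if l - k <= j - i:   # "<=" keeps the leftmost among equal lengths
                i, j = k, l
        k -= 1
    return (i, j)


if __name__ == "__main__":
    S = "11011001"
    for p in range(1, len(S) + 1):
        print(p, lsus(S, p), leftmost_sus(S, p))
```

For S = 11011001, LSUS(7) and LSUS(8) do not exist because neither 01 nor 1 is unique, yet the scan still finds S[6, 7] and S[6, 8] for those positions by stretching LSUS(6).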
Algorithm 2 The baseline SUS finding algorithm
Input: string S, a position p, and the suffix tree T of S
Output: the leftmost SUS containing position p
1. find LSUS(p);
2. if LSUS(p) ≠ null then
3. let S[i, j] be LSUS(p);
4. else
5. let S[i, j] be S[1, n];
6. end if
7. for k ← p − 1; k > 0 and p − k ≤ j − i; k ← k − 1 do
8. if LSUS(k) is null then continue;
9. end if
10. let S[k, l] be LSUS(k);
11. if l < p then l ← p;
12. end if
13. if l − k ≤ j − i then i = k, j = l;
14. end if
15. end for
16. return S[i, j];

IV. A CONSTANT TIME ONLINE QUERY ANSWERING ALGORITHM

In this section, we develop a method that pre-computes the leftmost SUS for every position using linear space. Then, online query answering can be conducted in constant time.

A. Ideas

We first observe that a smallest unique substring must fall into one of three cases: a MUS, an LSUS, or an RSUS. This can be established using several theorems. In this paper, we briefly state the theorems that are useful.

Theorem 1: Given a string S and a position p in S, if S[i, j] is a SUS at position p but not a MUS, then either i = p or j = p.

Theorem 2 indicates that, by one scan of S that obtains the LSUS at each position, we can find all MUSs and LSUSs that are not MUSs. To determine whether a SUS is an RSUS, we have the following result.

Theorem 3: Given a string S and a position p, a substring S[i, p] is RSUS(p) if and only if S[i, p] contains only one MUS S[i, j] (j ≤ p).

B. The Framework

Given a string S, our algorithm first constructs a suffix tree. This takes O(|S|) time and O(|S|) space. For each position p, the algorithm maintains a currently shortest MUS that contains position p, denoted by p.cand. It also takes O(|S|) space to store the MUS obtained at the last position. Therefore, our algorithm needs only O(|S|) space overall. Algorithm 3 shows the pseudo-code of our method.

Algorithm 3 The pre-computation algorithm
Input: string S
Output: a SUS for each position 1 ≤ p ≤ |S|
1. build a suffix tree for string S;
2. initialize p.cand ← null for 1 ≤ p ≤ |S|;
3. output LSUS(1), denoted by S[1, e], as the SUS at position 1;
4. i_{p−1} ← 1, j_{p−1} ← e; // use LSUS(1) to initialize the SUS at position p − 1
5. for p = 2 to |S| do
6. let S[p, e] be LSUS(p) obtained from the suffix tree;
7. let S[i, j] be the shortest substring among the following 4 strings: (1) p.cand, if p.cand ≠ null; (2) S[p, e]; (3) S[i_{p−1}, j_{p−1}], if j_{p−1} ≥ p; and (4) S[i_{p−1}, p], if j_{p−1} < p. If there is more than one substring having the smallest length, pick the leftmost one. Output S[i, j] as a SUS at position p;
8. suppose p.cand = S[i_c, j_c];
9. if p.cand ≠ null, p.cand ≠ S[i, j], and j_c > p then
10. call PROPAGATE(S[i_c, j_c], p + 1);
11. end if
12. if LSUS(p) = S[p, e] is a MUS and is not S[i, j], and e > p then call PROPAGATE(S[p, e], p + 1);
13. end if
14. i_{p−1} ← i, j_{p−1} ← j;
15. end for

At the beginning, we initialize p.cand to null for all positions p. Our algorithm scans string S from the beginning to the end. At position 1, LSUS(1) is the only SUS containing position 1. At each position p (p > 1), we compute LSUS(p) using the suffix tree in constant time.

C. MUS Propagation

Although we reserve space to record the smallest MUS for each position, we do not need to explicitly store one MUS at each p.cand. Instead, we only need to store a MUS at those positions p where the smallest MUS may not be obtained by LSUS(p) and RSUS(p). Algorithm 4 gives the pseudo-code of the propagation procedure.

Algorithm 4 The propagation procedure
1: procedure PROPAGATE(MUS S[i, j], the position k to propagate)
2: if k < i or k > j then
3: // k is not in the range of [i, j]
4: return
5: else if k.cand is null then
6: k.cand ← S[i, j]; return
7: end if
8: suppose k.cand = S[i′, j′]
9: if j′ − i′ > j − i then // k.cand is longer than S[i, j]
10: k.cand ← S[i, j]
11: if j′ > j then
12: call PROPAGATE(S[i′, j′], j + 1);
13: end if
14: else if j′ − i′ < j − i then
15: // k.cand is shorter than S[i, j] and ends before j
16: if j′ < j then call PROPAGATE(S[i, j], j′ + 1);
17: else // S[i, j] and S[i′, j′] have the same length
18: if i < i′ then
19: k.cand ← S[i, j]
20: call PROPAGATE(S[i′, j′], j + 1)
21: else // j > j′
22: call PROPAGATE(S[i, j], j′ + 1)
23: end if
24: end if
25: return
26: end procedure
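The idea of this section, pre-computing one answer per position so that a query becomes a table lookup, can be illustrated with a brute-force Python sketch. It is not Algorithms 3 and 4: instead of propagating candidates in linear space, it relies on the observation that every unique substring contains a MUS and simply stretches each MUS until it covers p. The function names are hypothetical, and the naive uniqueness test makes the pre-computation far slower than the method described above.

```python
def mus_by_lsus_scan(s):
    """MUSs of s as 1-based inclusive (i, j) pairs, found by scanning LSUS(i)."""
    n = len(s)

    def unique(i, j):
        sub = s[i - 1:j]
        return s.find(sub) == i - 1 and s.find(sub, i) == -1

    mus = []
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            if unique(i, j):
                # S[i, j] is LSUS(i), so S[i, j-1] is not unique. It is a MUS
                # exactly when dropping its first character also breaks uniqueness.
                if i == j or not unique(i + 1, j):
                    mus.append((i, j))
                break
    return mus


def precompute_leftmost_sus(s):
    """Table t with t[p] = leftmost SUS at position p; t[0] is unused."""
    n = len(s)
    mus = mus_by_lsus_scan(s)
    table = [None] * (n + 1)
    for p in range(1, n + 1):
        # The shortest unique substring through p is some MUS stretched
        # just far enough (left or right) to cover p.
        cands = [(min(a, p), max(b, p)) for a, b in mus]
        # Shortest first; ties are broken by the smallest starting point.
        table[p] = min(cands, key=lambda ij: (ij[1] - ij[0], ij[0]))
    return table


def query(table, p):
    """Online query: a single list lookup."""
    return table[p]


if __name__ == "__main__":
    S = "11011001"
    t = precompute_leftmost_sus(S)
    for p in range(1, len(S) + 1):
        i, j = query(t, p)
        print(p, (i, j), S[i - 1:j])
```

The table uses one entry per position, so the space after pre-computation is linear in |S|, and each online query costs a single lookup, which matches the goal stated at the start of this section.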
V. EXPECTED RESULTS

We plan to conduct extensive experiments on three real data sets and a group of synthetic data sets to evaluate our methods. The first real data set is an introduction to the R language, which appears in the FAQs on the R project website (http://www.r-project.org/). The second real data set is the genome sequence of Mycoplasma genitalium, the pathogenic bacterium that has one of the smallest genomes known for any free-living organism. The third data set is the Bible.

For the R-sequence, we chose the positions where the language name R appears as the query positions. For a given sequence, we can compute one SUS at each position. We want to observe the distribution of the SUS counts over different SUS lengths on the R-sequence. In addition to the original R-sequence itself, we further generated three mutations with the same string length using the same alphabet set. For the original R-sequence, most SUSs are of length 3. As the length increases, the corresponding counts decrease.

VI. CONCLUSION

In this paper, we formulated a novel type of interesting queries, smallest unique substring queries, which have many applications. We developed efficient algorithms for answering them. Furthermore, our study leads to a new direction in string queries. As future work, it would be interesting to extend and generalize smallest unique substring queries.