International Journal of Computer & Organization Trends – Volume 4 – January 2014
Keyword Based Searching Techniques over XML Data – Survey Paper
Jeetendra Kapase#1, Prof. Sharmila Shinde*2
#1 Student of Computer Engineering Department, JSCOE, Pune University, India.
*2 Head of Computer Engineering Department, JSCOE, Pune University, India.
Abstract: The XML data format is widely used to exchange data across devices because it is platform independent. In existing systems, a user who wants to find a keyword in XML data must know an XML query language in order to compose a query, submit it for processing, and retrieve the results. Users with limited knowledge of the XML data often find this approach complex and difficult, and sometimes have to resort to a try-and-see approach to reach relevant information. In this paper we study different keyword-based searching methods, focusing on the fuzzy search approach to finding keywords in XML data. These techniques allow users to explore data as they type, even when their keywords contain minor mistakes. Our survey helps in understanding keyword-based search as an alternative method for searching XML data. Using these techniques, a user can find the required information in an XML file without knowing an XML query language or the content of the file. The survey focuses on keyword-search techniques for retrieving top-k results, top-k algorithms that achieve high result quality and search efficiency, effective ranking methods, early-termination techniques for identifying top-k results, and effective index structures.
Keywords: fuzzy search, index structures, keyword search, type-ahead search, top-k algorithm, XML data
I. INTRODUCTION
Extensible Markup Language (XML) is now one of the most widely used formats for information exchange. Because XML data is platform independent, it is popular across application areas such as online banking transactions and information exchange applications like traffic, stock, and weather systems. XML is also used for database storage, so searching XML data quickly and efficiently has become a wide research area. Existing systems use XML query languages such as XQuery and XPath to search XML data. These languages are powerful but complex and unfriendly to non-database users. For example, the XQuery expression doc(“books.xml”)/bookstore/book[price<30] is difficult to write without good knowledge of XML query languages. In a traditional keyword-search system over XML data, a user composes a query, submits it to the system, and retrieves relevant answers from the XML data. This approach requires knowledge of the XML data structure and data
ISSN: 2249-2593
content. Users with limited knowledge find this approach difficult to adopt. Sometimes a user has to try several possible keywords, which can be tedious and time consuming. This survey paper introduces an approach to searching XML data that is based on keyword search. Keyword search is a widely accepted way to access information; it is an alternative to structured queries and is the technique most familiar to internet users. The fuzzy search approach is simple and does not require users to be database experts. One commonly used method is autocomplete, which predicts a word or phrase from the partial string entered by the user. More and more websites support this feature. For example, almost all major search engines now automatically suggest possible keyword queries as a user types partial keywords. Both YouTube (http://www.youtube.com/) and Netflix (http://www.netflix.com/) support searching for videos interactively as users type keywords. One limitation of autocomplete is that the system treats a query with multiple keywords as a single string; thus, it does not allow the keywords to appear at different places. To address this problem, Bast and Weber [1], [2] proposed CompleteSearch for textual documents, which finds relevant answers by allowing query keywords to appear at any place in the answer. However, CompleteSearch does not support approximate search; that is, it does not allow minor errors between query keywords and answers. Fuzzy type-ahead search in textual documents was studied in [3]. It allows users to explore data as they type, even in the presence of minor errors in their input keywords.
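The idea of tolerating minor errors while the user is still typing can be illustrated with a small sketch. It is not the index-based algorithm of [3]; it is a brute-force illustration that assumes a tiny in-memory vocabulary, and the function names `prefix_edit_distance` and `fuzzy_complete` as well as the parameter `max_errors` are hypothetical. A word is suggested when some prefix of it lies within a given edit distance of the partial keyword.

```python
def prefix_edit_distance(query: str, word: str) -> int:
    """Smallest edit distance between `query` and any prefix of `word`."""
    # prev[j] holds the edit distance between query[:i] and word[:j].
    prev = list(range(len(word) + 1))
    for i, qc in enumerate(query, start=1):
        curr = [i]
        for j, wc in enumerate(word, start=1):
            cost = 0 if qc == wc else 1
            curr.append(min(prev[j] + 1,          # drop a typed character
                            curr[j - 1] + 1,      # insert a missing character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    # The best prefix of `word` is the one with the smallest distance.
    return min(prev)


def fuzzy_complete(query, vocabulary, max_errors=1):
    """Words having a prefix within `max_errors` edits of the partial keyword."""
    return sorted(w for w in vocabulary
                  if prefix_edit_distance(query, w) <= max_errors)


# Hypothetical vocabulary; the misspelled partial keyword "serc" still
# reaches "search" because it is one edit away from the prefix "searc".
VOCABULARY = ["saad", "search", "segi", "sigmod", "database"]
print(fuzzy_complete("serc", VOCABULARY))
```

Scanning the whole vocabulary on every keystroke is only workable for small word lists; the index structures discussed in this survey exist precisely to avoid that scan.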
Type-ahead search can provide users instant feedback as users type in keywords, and it does not require users to type in complete keywords. However, existing methods cannot search XML data in a type-ahead search manner, and it is not trivial to extend existing techniques to support fuzzy type-ahead search
http://www.ijcotjournal.org
in XML data. This is because XML contains parent-child relationships, and to answer keyword queries we need to identify relevant XML subtrees that capture such structural relationships, instead of single documents.

Fig. 1 XML document tree structure.

In this paper, we discuss the fuzzy type-ahead search method for XML data. It searches the XML data on the fly as users type query keywords, even in the presence of minor errors in their keywords. It provides a friendly interface for exploring XML data and can significantly reduce typing effort. We also study the research challenges that arise naturally in this computing paradigm. The main focus of this survey is on effective index structures and efficient algorithms for answering keyword queries over XML data, ranking functions, early-termination techniques for top-k algorithms, and the forward index structure for improving search efficiency.

II. TRADITIONAL XML QUERY TECHNIQUES
XQuery, XPath: There are two main query languages for searching XML data, XPath and XQuery. XPath is a query language for selecting nodes from an XML document; it can also compute string and Boolean values. It provides a simple syntax for addressing parts of an XML document. In XPath, a collection of elements can be retrieved by specifying a directory-like path with zero or more conditions placed on the path. XPath treats an XML document as a logical tree with nodes for each element, attribute, text, processing instruction, comment, namespace, and root [4]. The basis of the addressing mechanism is the context node (start node) and the location path, which describes a path from one point in an XML document to
another. XPointer can be used to specify an absolute or a relative location. A location path is composed of a series of steps joined with “/”, each step moving down from the preceding one. XQuery is a query and functional programming language that incorporates features from query languages for relational systems (SQL) and object-oriented systems (OQL). XQuery supports query operations on document order and can transform structured and unstructured data, and extract and restructure documents. The W3C query working group proposed XQuery as a query language for XML. Values are always expressed as sequences; a node can be a document, element, attribute, text, or namespace node. The results of top-level path expressions are ordered according to their position in the original hierarchy, in top-down, left-to-right order [5].

XML documents are commonly divided into data-centric and document-centric documents. XPath expressions over data-centric documents can be complex to understand. Data-centric documents can originate both inside and outside a database and are used for communicating data between companies. They are processed primarily by machines and have a fairly regular structure, fine-grained data, and no mixed content. Document-centric documents are usually designed for human consumption; they are usually composed directly in XML or in some other format (RTF, PDF, SGML) that is then converted to XML. Document-centric documents need not have a regular structure; they have larger-grained data and a lot of mixed content [6].

III. TECHNIQUES FOR KEYWORD SEARCH OVER XML DATA
In this section we present various proposed techniques for keyword search over XML data, motivated by the complexity and difficulty associated with the traditional XML query languages and processing tools (XPath, XQuery). The survey focuses on the LCA-based fuzzy type-ahead search method and the ELCA-based search method [7], [8]. Minimal-cost tree (MCT)
references are better and more efficient [9], and effective index structures and top-k algorithms are used to achieve high interaction speed [9].

A. Fuzzy Type-ahead Search LCA-Based Method
The lowest common ancestor (LCA) is a concept from graph theory and computer science. Let T be a rooted tree with n nodes. The lowest common ancestor of two nodes x and y is the lowest node in T that has both x and y as descendants; that is, the LCA of x and y is their shared ancestor located farthest from the root. There are multiple ways to answer a user query on an XML document; one commonly used method is the LCA-based method [10], which many algorithms use for keyword search in XML documents. A content node is the parent node of a keyword, that is, a node that directly contains it. For example, for the keyword db in Fig. 1, the content nodes are nodes 13 and 16. The server maintains a trie index over the XML document, in which each trie node corresponds to a letter of a keyword and each leaf node holds the list of all XML nodes that contain the keyword; this list is called an inverted list.

Procedure
For a keyword query, retrieve the content nodes from the inverted lists of the query keywords.
Identify the LCAs of the content nodes in the inverted lists.
Take the subtrees rooted at the LCAs as the answers to the query.

For example, suppose the user types the query “www db”. The content nodes of db are {13, 16} and the content node of www is {3}; the LCAs of these content nodes are nodes 12, 15, 2, and 1. Here nodes 3, 13, 12, and 15 are the more relevant answers, whereas nodes 2 and 1 are not relevant.

Limitations
The method gives irrelevant answers to some user queries.
The results are not always of good quality.
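The procedure above can be sketched in a few lines. The tree below is hypothetical and is not the tree of Fig. 1; it is stored as a child-to-parent map, and the names `PARENT`, `lca`, and `lca_answers` are introduced only for illustration. The sketch enumerates every combination of content nodes, one per keyword, which is a naive strategy rather than the optimized algorithms of [7] and [10].

```python
from itertools import product

# Hypothetical XML tree as a child -> parent map; node 1 is the root.
PARENT = {2: 1, 3: 2, 4: 2, 5: 4, 6: 4, 7: 1, 8: 7}


def ancestors_or_self(node, parent):
    """Path from `node` up to the root, starting with the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(x, y, parent):
    """Lowest common ancestor of x and y (a node counts as its own ancestor)."""
    seen = set(ancestors_or_self(x, parent))
    for node in ancestors_or_self(y, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def lca_answers(inverted_lists, parent):
    """Roots of candidate answers: one LCA per combination of content nodes."""
    roots = set()
    for combo in product(*inverted_lists):
        root = combo[0]
        for node in combo[1:]:
            root = lca(root, node, parent)
        roots.add(root)
    return roots


# One keyword occurs in node 3, the other in nodes 5 and 8.
print(lca_answers([[3], [5, 8]], PARENT))
```

In this hypothetical tree the combination (3, 8) is only connected through the root, which mirrors the limitation noted above: an LCA close to the root is a formally valid answer but rarely a useful one.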
B. ELCA Based Method
To address the limitations of the LCA-based method, the exclusive lowest common ancestor (ELCA) method [7] was proposed. An LCA is an ELCA if it is still an LCA after its LCA descendants are excluded. For example, suppose the user types the query “db tom”. The content nodes of db are {13, 16} and those of tom are {14, 17}; the LCAs of these content nodes are nodes 2, 12, 15, and 1. The ELCAs are nodes 12 and 15, and the subtrees rooted at these nodes are returned as the relevant answers. Node 2 is not an ELCA because it is no longer an LCA after nodes 12 and 15 are excluded. Xu and Papakonstantinou [7] proposed a binary-search-based method to identify ELCAs efficiently.

C. MCT (Minimal cost tree)
We studied the framework proposed in [9] for finding relevant answers to a keyword query over an XML document. In this framework, each node of the XML data tree is potentially relevant to the query, with a different score or rank value. For each node, its answer to the query is defined as its subtree containing the paths to the nodes that include the query keywords given by the user. This subtree is called the “minimal-cost tree” for the node. Different nodes correspond to different answers to the keyword query. Here a quasi-content node of a keyword is a node that does not contain the keyword itself but has a descendant containing it, a pivotal node for a node n and a keyword is a content node of that keyword in the subtree of n that is closest to n, and the path from n to a pivotal node is a pivotal path.

Definition (MINIMAL-COST TREES): Given an XML document D, a node n in D, and a keyword query Q = {k1, k2, …, kl}, the minimal-cost tree of Q and n is the subtree rooted at n such that, for each keyword ki ∈ Q, if n is a quasi-content node of ki, the subtree includes the pivotal paths for ki and n.

The parameterized top-k algorithm executes in two phases. The first is a structure algorithm that, for a problem instance, constructs a trie structure of feasible size; the second is an enumerating algorithm that produces the k best solutions for the keyword query based on that structure. We studied the techniques proposed to support an efficient enumerating algorithm.

For example, consider the XML document tree in Fig. 1 and the keyword query Q = {“db”, “tom”, “www”}. Nodes {3, 13, 14, 16, 17} are the content nodes of the three keywords, and nodes {1, 2, 5, 8, 9, 12, 15} are their quasi-content nodes. Node 3 is the pivotal node for node 2 and keyword “www”, node 16 is the pivotal node for node 2 and keyword “db”, and node 17 is the pivotal node for node 2 and keyword “tom”. The MCT of node 2 is the subtree rooted at node 2 that contains the paths n2 → n3, n2 → n15 → n16, and n2 → n15 → n17. The main advantage of this approach is that a node can still be considered an appropriate answer even if its descendants do not include all the keywords of the query.

D. Compute Ranking of Sub Tree
There are two main ranking functions for computing the score between a node n and a keyword ki [11]. Case I applies when n contains ki; Case II applies when n does not contain ki but has a descendant containing ki.

Case I: n contains keyword ki. The relevance score of node n and keyword ki is calculated by

$$\mathrm{SCORE}_1(n, k_i) = \frac{\ln(1 + tf(k_i, n)) \cdot \ln(idf(k_i))}{(1 - s) + s \cdot ntl(n)}$$

where
tf(ki, n) is the number of occurrences of ki in the subtree rooted at n;
idf(ki) is the ratio of the number of nodes in the XML document to the number of nodes that contain keyword ki;
ntl(n) is the length of n divided by the length of nmax, where nmax is the node with the maximum number of terms;
s is a constant set to 0.2 (assumption).

Assume the user composes a query containing the keyword “db”. Then

SCORE1(13, db) = ln(1 + 1) * ln(27/2) / ((1 - 0.2) + (0.2 * 1)) = 1.52

Case II: node n does not contain keyword ki, but one of its descendants does. The second ranking function, which calculates the score between n and ki, is given by

$$\mathrm{SCORE}_2(n, k_i) = \sum_{p \in P} \alpha^{\delta(n, p)} \cdot \mathrm{SCORE}_1(p, k_i)$$

where
P is the set of pivotal nodes;
α is a constant set to 0.8 (assumption);
δ(n, p) is the distance between n and p.

Assume the user enters the query “db”. Then

SCORE2(12, db) = 0.8 * SCORE1(13, db) = 0.8 * 1.52 = 1.21

Fig. 2 Extended trie structure

E. Ranking for Fuzzy Search Algorithm
Given a keyword query Q = {k1, k2, …, kl}, under fuzzy search an MCT may not contain the exact input keywords but may instead contain predicted words for each keyword. Let the predicted words be {w1, w2, …, wl}. The best similar prefix of wi can be considered the one most similar to ki. The function that quantifies the similarity between ki and wi is expressed in terms of the following quantities: ed, the edit distance; ai, a prefix of wi; wi, the predicted word; and γ, a constant.

F. Continuously Compute Top-k Results
The index structure is used to compute the answers. The inverted list at each leaf node contains the content nodes and quasi-content nodes of the keyword together with their scores. To compute the top-k results, the heap-based method [9] is used. It works on partial virtual inverted lists, which contain only the higher-scoring nodes, and so avoids the expensive union of the full lists.

Procedure
Sort the scores in the inverted lists.
If an inverted list is long, generate a partial virtual inverted list.
Construct a max-heap in which each entry holds a <node, score> pair.
Delete the top element of the max-heap, which is the node with the highest score, and adjust the heap.
Add each deleted node whose score satisfies the threshold T to the result set.
Return the result set once the top-k answers have been retrieved.
Fig. 3 Top-k result architecture

For example, assume the user composes the query “db”. The inverted list of db contains the nodes 13, 16, 12, 15, 9, 2, 8, 1, and 5. The scores of these nodes, computed by the two ranking functions, are 1.52, 1.52, 1.21, 1.21, 0.9728, 0.495, 0.77, 0.396, and 0.6225 respectively. These scores are sorted, a max-heap is constructed, and a threshold is fixed at 10, so the top elements <13, 1.52>, <16, 1.52>, <12, 1.21>, and <15, 1.21> are retrieved as the top-k results. This technique is more effective than taking the union of the full inverted lists.
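The heap-based retrieval in this example can be sketched as follows. The sketch rests on two assumptions that are not taken from the text: a node is emitted when its score is at least the threshold, and the threshold is set to an illustrative value of 1.0, which happens to reproduce the four answers listed above. The scores of nodes 2 and 1 are read as 0.495 and 0.396. The function name `top_k_by_threshold` is hypothetical, and Python's `heapq` module is a min-heap, so scores are negated to obtain max-heap behaviour.

```python
import heapq

# <node, score> pairs for the keyword "db", as listed in the example.
INVERTED_LIST_DB = [(13, 1.52), (16, 1.52), (12, 1.21), (15, 1.21),
                    (9, 0.9728), (2, 0.495), (8, 0.77), (1, 0.396),
                    (5, 0.6225)]


def top_k_by_threshold(scored_nodes, k, threshold):
    """Pop nodes in descending score order until k are found or the
    best remaining score falls below the threshold (assumed direction)."""
    # Negate scores because heapq always pops the smallest entry first.
    heap = [(-score, node) for node, score in scored_nodes]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        neg_score, node = heapq.heappop(heap)
        if -neg_score < threshold:
            break  # every remaining entry scores lower still, so stop early
        results.append((node, -neg_score))
    return results


print(top_k_by_threshold(INVERTED_LIST_DB, k=10, threshold=1.0))
```

Stopping at the first score below the threshold is what makes early termination possible: the heap order guarantees that no later entry can qualify, so the rest of the list is never examined.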
G. Improvement Using Forward Index Structure Approach
In this section, we describe the forward index, which improves search performance. “Random access” based on the forward index makes early termination possible in the algorithms. That is, given an XML element and an input keyword, the score of the keyword for the element can be obtained from the forward index without accessing the inverted lists. Fagin et al. proved that the threshold-based algorithm using random access is optimal over all algorithms that correctly find the top-k answers [12]. A forward index is therefore used to implement random access.

Procedure
Construct a trie structure that maintains the keywords contained in the element.
Each leaf node in the forward index keeps the score of element e for the keyword of that leaf node.
Given a partial keyword, efficiently check whether element e contains a word with a similar prefix.
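A minimal sketch of such a forward index is given below, under simplifying assumptions: it matches exact prefixes only, whereas the method described above also tolerates similar prefixes, and the class names `TrieNode` and `ForwardIndex`, the element identifier 7, and all scores are hypothetical. Each element owns a small trie of its keywords, and the node that ends a keyword stores the score of the element for that keyword.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.score = None  # set only on nodes that end a keyword


class ForwardIndex:
    """Maps an element id to a trie of the keywords that element contains."""

    def __init__(self):
        self._tries = {}

    def add(self, element, keyword, score):
        node = self._tries.setdefault(element, TrieNode())
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score

    def best_score(self, element, prefix):
        """Random access: best score among the keywords of `element`
        that start with `prefix`, or None if there is no such keyword."""
        node = self._tries.get(element)
        for ch in prefix:
            if node is None:
                return None
            node = node.children.get(ch)
        if node is None:
            return None
        # Walk the sub-trie below the prefix and keep the highest score.
        best, stack = None, [node]
        while stack:
            current = stack.pop()
            if current.score is not None and (best is None or current.score > best):
                best = current.score
            stack.extend(current.children.values())
        return best


# Hypothetical element 7 holding four keywords with made-up scores.
index = ForwardIndex()
for word, score in [("saad", 0.5), ("search", 0.9), ("segi", 0.4), ("sigmod", 0.7)]:
    index.add(7, word, score)
print(index.best_score(7, "se"))
```

The lookup touches only the trie of the one element being checked, which is why the score can be obtained without scanning any inverted list.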
Fig. 4 The forward index approach for an element that contains the keywords {“saad”, “search”, “segi”, “sigmod”}

The time complexity of sorted access is O(1) and that of random access is O(ed * AN), where ed is the edit-distance threshold and AN is the number of active nodes [27]. If ed * AN > I, where I is the average inverted-list length, the forward index is not maintained. The main advantage of the forward index is that it avoids unnecessary element accesses compared with the extended trie structure. The forward index can thus be used to improve performance.

IV. CONCLUSION
This paper presents various methods for keyword-oriented search over XML data. These methods help in building user-friendly systems in which the user does not need to study the XML data. This paradigm gives the relevant results the user wants. Fuzzy search over XML data, which gives approximate results, is studied. Various methods for querying XML data are presented, namely the LCA-based method, the ELCA-based method, and the heap-based method; of these, the heap-based method gives high-quality results. To further improve search performance, forward indexes can be used, which eliminate the need to visit the inverted lists and give the score of an element directly.

ACKNOWLEDGMENT
I express my sincere gratitude to my project guide Prof. S.M. Shinde, Head of the Computer Department, for her invaluable co-operation and guidance throughout my project study. I would especially like to thank our P.G. coordinator Prof. M.D. Ingle for inspiring me and providing all the lab facilities, which made this survey work convenient and easy. I would also like to express my appreciation and thanks to the JSCOE principal Dr. M.G. Jadhav and to all my friends who, knowingly or unknowingly, assisted me throughout this work.

REFERENCES
[1] H. Bast and I. Weber, “Type Less, Find More: Fast Autocompletion Search with a Succinct Index,” Proc. Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 364-371, 2006.
[2] H. Bast and I. Weber, “The CompleteSearch Engine: Interactive, Efficient, and towards IR & DB Integration,” Proc. Biennial Conf. Innovative Data Systems Research (CIDR), pp. 88-95, 2007.
[3] S. Ji, G. Li, C. Li, and J. Feng, “Efficient Interactive Fuzzy Keyword Search,” Proc. Int’l Conf. World Wide Web (WWW), pp. 371-380, 2009.
[4] H. Williamson, “The Complete Reference,” The McGraw-Hill Companies, Inc., New York, 2009.
[5] Bolin Ding, Jeffrey Xu Yu, Shan Wang, Lu Qin, Xiao Zhang, and Xuemin Lin, “Finding Top-k Min-Cost Connected Trees in Databases,” The Chinese University of Hong Kong, China.
[6] G. Li, Jianhua Feng, and Lizhu Zhou, “Interactive Search in XML Data,” Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China.
[7] Y. Xu and Y. Papakonstantinou, “Efficient LCA Based Keyword Search in XML Data,” Proc. Int’l Conf. Extending Database Technology: Advances in Database Technology (EDBT), pp. 535-546, 2008.
[8] L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram, “XRANK: Ranked Keyword Search over XML Documents,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 16-27, 2003.
[9] Jianhua Feng and Guoliang Li, “Efficient Fuzzy Type-Ahead Search in XML Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 5, May 2012.
[10] Y. Xu and Y. Papakonstantinou, “Efficient Keyword Search for Smallest LCA in XML Databases,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 537-538, 2005.
[11] Z. Liu and Y. Chen, “Identifying Meaningful Return Information for XML Keyword Search,” Proc. ACM SIGMOD Int’l Conf. Management of Data, pp. 329-340, 2007.
[12] R. Fagin, A. Lotem, and M. Naor, “Optimal Aggregation Algorithms for Middleware,” Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2001.