International Journal of Computer Trends and Technology (IJCTT) – volume 7 number 2– Jan 2014
Deriving the Probability with Machine Learning and Efficient Duplicate Detection in Hierarchical Objects
D. Nithya, M.Phil.¹, K. Karthickeyan, M.C.A., M.Phil., Ph.D.²
¹ M.Phil. Scholar, Department of Computer Science, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore, Tamilnadu, India
² Associate Professor, Department of Computer Science, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore
Abstract— Duplicate detection is an important task in data mining: it identifies whether given data items are duplicates, that is, multiple representations of the same real-world object. Duplicates can arise anywhere, but detecting them matters most in databases. To this end, duplicate detection developed for a single relation is applied to the hierarchical structure of XML data. An existing system, XMLDup, uses a Bayesian network to establish the conditional probability of two XML elements being duplicates, taking into consideration not only the information within the elements but also how that information is structured. In the Bayesian-network-based system, however, the conditional probability values are derived manually, which makes it less efficient than an approach in which the values are learned. The proposed system detects duplicates among XML data and XML objects whose input files use different structural representations. It derives the conditional probabilities by applying support vector machine (SVM) models, with their associated learning algorithms, to XML duplicate data objects. The method takes a set of XML data as input and predicts a conditional probability value for each data item in the hierarchical structure. The proposed SVM-based classification performs better, giving more efficient and more effective duplicate detection.
Keywords— Duplicate detection, XML data, SVM classification, Bayesian networks, entity resolution.
I. INTRODUCTION
Duplicate detection is the task of identifying different representations of the same real-world object, whether in XML data or in other data objects. It is an essential step in data cleaning [1,2] and is significant for data integration [3], personal information management [4], and several other areas. The problem has been studied extensively for data stored in a single relational table with sufficient attributes to make sensible comparisons. However, much data comes in more complex structures, so conventional approaches cannot be applied. For example, in XML data, XML elements may have no text of their own. However, the
ISSN: 2231-2803
hierarchical relationships to other elements potentially provide sufficient information for meaningful comparisons. The problem of XML duplicate detection is particularly relevant in applications such as catalog integration or online data processing.

This work presents duplicate detection for XML data based on a hierarchical representation of data objects. Rather than following the general top-down or bottom-up approaches, it considers all types of relationships between XML data objects in the hierarchy: one-to-one, one-to-many, and many-to-one. The basic idea was previously outlined in a poster [5]. Essentially, an object is considered to depend on another object if the latter helps to find duplicates of the former. For example, actors help to find duplicates among movies, based on the actors and the relationships among them. Because such dependencies can be mutual, detecting duplicate XML data can in turn help to find duplicate XML data objects in the database. Consequently, algorithms such as [4] use dependencies to improve detection by performing pairwise comparisons more than once.

Methods devised for duplicate detection in a single relation do not directly apply to XML data, owing to the differences between the data models [5]. For example, XML elements of the same object type may each have a different structure at the instance level, whereas tuples within a relation always have the same structure. More importantly, however, the hierarchical relationships in XML provide useful additional information that helps improve both the runtime and the quality of duplicate detection. Figure 1 illustrates this with two XML elements. Both represent person objects and are labeled prs.
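The idea that nested elements carry the comparison can be sketched in a few lines of Python. This is a minimal illustration only: it is neither XMLDup's Bayesian network nor the proposed SVM model, but a plain average of string similarities over the leaf paths that two elements share. The tag names (prs, name, dob, pob, cnt, add, eml) are assumed from the description of Figure 1, and all values, as well as the helper names `leaf_values` and `similarity`, are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Two hypothetical person records. Tag names are assumed from the
# description of Figure 1; the values are invented for illustration.
PERSON_A = """<prs>
  <name>John Smith</name><dob>1980-05-12</dob><pob>Coimbatore</pob>
  <cnt><add>12 Main Road</add><eml>jsmith@example.org</eml></cnt>
</prs>"""
PERSON_B = """<prs>
  <name>Jon Smith</name><dob>1980-05-12</dob>
  <cnt><eml>jsmith@example.org</eml></cnt>
</prs>"""


def leaf_values(elem, path=""):
    """Map each leaf path (e.g. 'prs/cnt/eml') to its text value.

    Simplification: if a tag repeats under one parent, the last
    occurrence wins. A real system would compare sets of values.
    """
    here = f"{path}/{elem.tag}" if path else elem.tag
    children = list(elem)
    if not children:
        return {here: (elem.text or "").strip()}
    values = {}
    for child in children:
        values.update(leaf_values(child, here))
    return values


def similarity(xml_a, xml_b):
    """Average string similarity over leaf paths present in both elements."""
    a = leaf_values(ET.fromstring(xml_a))
    b = leaf_values(ET.fromstring(xml_b))
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    scores = [SequenceMatcher(None, a[p], b[p]).ratio() for p in shared]
    return sum(scores) / len(scores)


score = similarity(PERSON_A, PERSON_B)
print(f"similarity = {score:.2f}")
```

The two records differ in structure (the second has no pob or add element) and in the spelling of the name, yet the shared leaves under the nested cnt element still contribute to the score. Ignoring missing paths is a design choice of this sketch; a probabilistic model would instead assign such cases a default or learned probability.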
These two XML elements each have a name and a date of birth (dob). They nest further elements representing place of birth (pob) and contacts (cnt). The contact details are represented as address (add) and email (eml) elements, and the children of these are represented as
http://www.ijcttjournal.org