International Journal of Computer Trends and Technology (IJCTT) – volume 7 number 2– Jan 2014
Deriving the Probability with Machine Learning and Efficient Duplicate Detection in Hierarchical Objects
D. Nithya, M.Phil.(1), K. Karthickeyan, M.C.A., M.Phil., Ph.D.(2)
(1) M.Phil. Scholar, Department of Computer Science, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore, Tamil Nadu, India
(2) Associate Professor, Department of Computer Science, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore, Tamil Nadu, India
Abstract— Duplicate detection is an important task in data mining: it determines whether several representations in a dataset refer to the same real-world object. Although duplicates can occur anywhere, detecting them is especially important in databases. In this work, duplicate detection, traditionally applied to a single relation, is extended to hierarchically structured XML data. The existing system, XMLDup, uses a Bayesian network to compute the probability that two XML elements are duplicates, taking into account both the information the elements contain and how that information is structured. Because the conditional probability values in the Bayesian network are derived manually, XMLDup is less efficient than machine-learning-based approaches. The proposed system detects duplicates among XML data and XML objects with differing structural representations of the input files, and derives the conditional probabilities by applying support vector machines (SVMs), supervised learning models that analyze the XML duplicate data objects. The method takes a collection of XML data as input and predicts a conditional probability value for each item in the hierarchical structure. Experiments show that the proposed SVM-based classification performs duplicate detection more efficiently and effectively.
Keywords— Duplicate detection, XML data, SVM classification, Bayesian networks, entity resolution.
I. INTRODUCTION
Duplicate detection is the task of identifying different representations of the same real-world object, here applied to XML data and XML data objects. It is an essential step in data cleaning [1, 2] and is important for data integration [3], personal information management [4], and several other areas. The problem has been studied extensively for data stored in a single relational table with sufficient attributes to make meaningful comparisons. However, a great deal of data comes in more complex structures, so conventional approaches cannot be applied directly. For example, within XML data, XML elements may contain no text at all. Nevertheless, the
ISSN: 2231-2803
hierarchical relationships with other elements often provide enough information for meaningful comparisons. XML duplicate detection is particularly relevant in applications such as catalog integration and online data processing. This research presents duplicate detection over XML data and data objects in their hierarchical representation, rather than using general top-down or bottom-up approaches, and considers all relationship types (one-to-one, one-to-many, and many-to-one) between XML data objects. The essential scheme was previously outlined in a poster [5]. Essentially, one object is considered to depend on another if the latter helps detect duplicates of the former. For example, actors help discover duplicate movies, based on the actors and the relationships among them. Because mutual dependencies can occur, detected duplicate XML elements are in turn used to find further duplicate XML objects in the database. Consequently, algorithms such as [4] exploit such dependencies by performing pairwise comparisons more than once. Methods for duplicate detection in a single relation are not directly applicable to XML data because of the differences between the data models [5]. For example, XML elements of the same object type may each have a different structure at the instance level, whereas tuples within a relation always share the same structure. More importantly, the hierarchical relationships in XML provide useful additional information that improves both the runtime and the quality of duplicate detection. Figure 1 shows two XML elements; both represent person objects and are labeled pro.
Both elements have children named name and date of birth (dob). They further nest elements representing place of birth (pob) and contacts (cnt). The contact details are represented as address (add) and email (eml) elements, which are children of
http://www.ijcttjournal.org
Page75
cnt. The leaf nodes of this hierarchical structure contain the actual text data. In this example, the objective of duplicate detection is to discover that the two persons are duplicates, despite the differences in their data. This is done by comparing the leaf nodes and their corresponding parent nodes across both objects. This work argues that the hierarchical organization of XML data helps in detecting duplicate pro elements: descendant elements found to be equivalent increase the similarity of their ancestors, and so on up the hierarchy.
Figure 1: XML data with tree structure (two person objects whose leaf values, such as the street addresses "2/231 raja street, CBE" and "2/123 raju street, CBE", differ slightly)
Our contributions, particularly compared to our earlier work, can be summarized as follows:
1. We first take into consideration the differing structure of XML objects and introduce a machine learning algorithm to derive the conditional probability (CP) values prior to running the network pruning algorithm.
2. We then present a novel pruning algorithm and study how the order in which nodes are processed affects runtime.
3. We obtain the conditional probabilities using SVM classification and analyze the XML data objects.
4. Finally, we illustrate how to increase efficiency when a small drop in recall, i.e., in the number of detected duplicates, is acceptable. This procedure can be tuned manually or performed automatically, using known duplicate objects from other databases.
II. LITERATURE REVIEW
Early work in XML duplicate detection was concerned with the efficient execution of XML join operations. Guha et al. [6] proposed an algorithm to perform joins of related elements in XML databases efficiently. They focused on computing a tree edit distance [6], which can be used within an XML join algorithm. Carvalho and Silva proposed a solution for Web data extracted into a hierarchical tree representation. Two hierarchical representations of person elements are compared by transforming the original data into vectors and measuring the similarity between elements with a cosine similarity function. Their method takes linear combinations of weighted similarities and does not take advantage of the useful features present in XML databases. The problem of identifying multiple representations of the same real-world object has been addressed in a large body of work. Ironically, it appears under many names, such as record linkage [7], object matching [1], object consolidation [8], and reference reconciliation [4], to name just a few. Broadly speaking, research in duplicate detection can be categorized in two ways: methods to improve effectiveness and methods to improve efficiency. Research on the former is concerned with improving precision and recall, for example by devising sophisticated similarity measures. Examples are [4] and [9], where the relationships among objects are used to improve the quality of the result. Research on the latter assumes a given similarity measure and devises algorithms that try to avoid applying the measure to all pairs of objects. An example is the sorted neighborhood method, which trades effectiveness for better efficiency by comparing only objects within a certain window.
In a first variant, this order is used to obtain the same effectiveness as before, but faster; a second variant further improves efficiency, at the price of missing some duplicates. Several works target hierarchical representations for finding duplicate XML data objects in XML databases [12], [6], [13]. These works differ from earlier approaches because they were specifically designed to exploit the characteristics of XML object representations and the semantics inherent in the XML labels.
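The vector-space comparison attributed above to Carvalho and Silva can be illustrated with a small sketch; the token-frequency weighting is a simplifying assumption (the original work uses weighted combinations of similarities).

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    # Build token-frequency vectors and compute their cosine similarity.
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Two address strings sharing two of four tokens -> similarity 0.5.
print(cosine_sim("2/231 raja street CBE", "2/123 raju street CBE"))
```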
The framework presented in [14], DogmatiX, aims at both effectiveness and efficiency in duplicate detection. It consists of three major steps: candidate definition, duplicate definition, and duplicate detection. While the first two provide the definitions necessary for duplicate detection, the third component contains the actual algorithm, an extension to XML data of the work of Ananthakrishna et al. [15]. Hierarchical XML representations of personalized data, stored as trees, are referred to as personal information management (PIM) data [7]. A further dimension distinguishes three methods used to perform duplicate detection: machine learning with similarity measures to learn which data objects are duplicates; clustering algorithms; and iterative algorithms, which repeat the process until duplicates are detected and then aggregate them into clusters using transitive closure. Earlier surveys show that data mining techniques are well suited to detecting duplicates and mining data efficiently in large databases or data warehouses, but the data cleaning and preprocessing steps require considerable time to find irrelevant data and to detect duplicates in structured form. The data cleaning problem arises because information from a variety of heterogeneous sources is combined to create a single database [10]. Several distinct data cleaning issues have been identified, such as dealing with missing data, managing incorrect data, and record linkage. Here we deal with one particular challenge, reference disambiguation, also known as "fuzzy match" or "fuzzy lookup". This problem arises when entities in a database contain references to other entities. If entities were referred to by unique identifiers, disambiguating these references would be straightforward. III.
SVM-BASED XML DUPLICATE DETECTION WITH BN
The proposed approach detects duplicate XML data objects, as well as duplicate data within the structure. The representation of XML data uses a Bayesian network to model the data and to detect duplicates [12]. This Bayesian network is a directed acyclic graph. a.
Bayesian Network Construction
In the Bayesian network model, a node denotes a variable and an edge denotes a relationship between nodes of the data objects. This work first defines the Bayesian network representation for duplicate detection and then computes the similarity between XML data objects via the nodes of the Bayesian network. Based on this similarity measure, it classifies duplicate data within the structure. A mapping schema maps the relationships among the XML data objects in the network; its result must first be validated to ensure a high-quality mapping. In the plain Bayesian network model, the conditional probability values are otherwise set by assumption. To overcome this problem, we propose machine learning methods that derive the conditional probability values for every node in the Bayesian network automatically. This yields exact probability values for each attribute of the XML data and efficiently detects the duplicate files. b.
Support vector machines (SVM)
SVMs are supervised learning models that analyze data and recognize patterns. A basic SVM takes a set of inputs and, after an initial learning phase, predicts for each input one of two outcomes, a positive or a negative class. The SVM model thus works in two phases, training and classification, assigning each input to one category or the other. An SVM represents the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. Intuitively, a good separation is achieved by the hyperplane with the largest distance to the nearest training data point of any class, since in general the larger the margin, the lower the generalization error of the classifier. Although the original problem may be stated in a finite-dimensional space, it often happens that the sets to discriminate are not linearly separable in that space. In our method, the XML data from the Bayesian network structure are taken as input to the SVM; the conditional probability values are then derived by the learned model, after which the duplicate data are detected. c.
Deriving the conditional probabilities
Conditional probability 1 (CP1) denotes the probability concerning duplicate node values: if two nodes hold duplicate values, each corresponding attribute is considered a duplicate; otherwise the nodes hold distinct XML data. Conditional probability 2 (CP2) denotes the probability that nodes are duplicates given their descendant nodes, with duplicates detected from those nodes according to the corresponding attributes; when many descendant nodes in the tree have high probability values, their parent nodes are also considered duplicates. Conditional probability 3 (CP3) denotes the probability involving both parent and child nodes; in this step, duplicates between the attributes of two nodes are detected.
Conditional probability 4 (CP4) denotes the probability that a set of nodes of the same type are duplicates, given that every pair of individual nodes in the set is a duplicate.
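The idea of predicting such conditional probability values with an SVM can be sketched as follows. This is a hypothetical illustration using scikit-learn: the two similarity features, the synthetic training data, and the use of Platt-scaled `predict_proba` output as a CP value are all assumptions for demonstration, not the paper's exact setup.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic training data: each row holds similarity features for a node pair
# (e.g., name similarity, dob similarity); label 1 = duplicate, 0 = distinct.
rng = np.random.default_rng(0)
dup = rng.uniform(0.7, 1.0, size=(50, 2))   # pairs known to be duplicates
non = rng.uniform(0.0, 0.4, size=(50, 2))   # pairs known to be distinct
X = np.vstack([dup, non])
y = np.array([1] * 50 + [0] * 50)

# Train an SVM with probability estimates enabled.
clf = SVC(kernel="rbf", probability=True).fit(X, y)

# Probability that a new pair with feature vector [0.9, 0.85] is a duplicate;
# this value would feed the Bayesian network as a conditional probability.
cp = clf.predict_proba([[0.9, 0.85]])[0, 1]
print(round(cp, 3))
```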
Algorithm 1: Deriving the conditional probability values
Input: training samples with XML data structures constructed from the BN, X = {x1, x2, ..., xn}, as the input for SVM classification
Output: classification result and predicted conditional probability values CP1, CP2, CP3, CP4
1. Procedure SVM(X) // the training data are the results from the BN construction
2. Begin
3. Initialize C = 0 // C is the set of conditional probability values; the (positive or negative) class labels start at zero
4. Get the input file for training // the BN results serve as examples for training and for predicting the CP values
5. Read the XML data structures constructed from the BN result
6. x . w + b = 0 // each XML data structure from the BN is represented as a vector x; w is the weight vector, whose product with x is summed with the bias b to give the separating hyperplane
7. x . w + b = +/-1 // these equations mark the margin of the classifier; it can be bounded by a soft margin on either side
8. Decision function f(x) = x . w - b // f(x) decides the class label for each SVM training example x_i
9. If f(x_i) >= 1, then x_i belongs to the first class // values greater than or equal to +1 are labeled as the first class
10. Else if f(x_i) <= -1, then x_i belongs to the second class // values less than or equal to -1 are labeled as the second class
11. Compute the prediction result for each input i = 1, ..., n // after classification, the result is checked in the testing phase using the condition below
12. y_i (x_i . w - b) >= 1 // if this holds, the classified conditional probability is accepted as the prediction
13. Display the result // finally, display the conditional probability values and the classification result
14. End
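The decision rule in steps 8-10 of Algorithm 1 can be sketched directly. The weight vector and bias below are illustrative values, not learned parameters:

```python
import numpy as np

def svm_label(x, w, b):
    # Decision function f(x) = x . w - b, as in Algorithm 1:
    #   f(x) >= +1  -> first class (duplicate)
    #   f(x) <= -1  -> second class (non-duplicate)
    #   otherwise   -> the point falls inside the margin.
    f = float(np.dot(x, w) - b)
    if f >= 1:
        return "duplicate"
    if f <= -1:
        return "non-duplicate"
    return "within margin"

w, b = np.array([2.0, 2.0]), 1.0  # illustrative weights and bias
print(svm_label(np.array([0.9, 0.8]), w, b))  # f = 2.4  -> duplicate
print(svm_label(np.array([0.1, 0.1]), w, b))  # f = -0.6 -> within margin
```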
d. Network Pruning for BN
To reduce the BN evaluation time, we propose a lossless pruning approach: object pairs incapable of reaching a given duplicate probability threshold are simply discarded. As stated earlier, network evaluation is performed by propagating the prior probabilities, in a bottom-up fashion, until reaching the topmost node. The prior probabilities are obtained by applying a similarity measure to the content of the leaf nodes. Computing these similarities is the most expensive operation in the network evaluation, and in the duplicate detection procedure in general. Consequently, the idea behind our pruning proposal is to avoid computing prior probabilities unless they are strictly required. Before the pruning step, all unknown similarity values are assumed to be one. At each step of the procedure, a running probability value is maintained: whenever a new similarity is evaluated, the score is recomputed taking into consideration the already known similarities, with the still unknown similarities set to one. Two challenges remain when applying a pruning factor smaller than one. First, different attributes in an XML object have different characteristics and hence different pruning factors. Second, fine-tuning the factors manually can be a complex task, especially if the user has little knowledge of the database, so we should be able to compute all pruning factors automatically. This optimizes efficiency while minimizing the loss in effectiveness.
Algorithm 2: NetworkPruning(N, T)
Require: the node N for which to compute the duplicate probability score from the SVM classification; a threshold T (pairs scoring below T are considered non-duplicates among the XML nodes); the current score is derived from the SVM.
Ensure: the duplicate probability of the XML nodes represented by N, also derived from the SVM.
1: L <- getParents(N) {get the ordered list of parent nodes}
2: parentScore[n] <- 1 for all n in L {maximum probability of each parent node}
3: currentScore <- 0
4: for each node n in L do {compute the duplicate probability}
5:   if n is a value node then
6:     score <- getSimilarityScore(n) {for value nodes, compute the similarities}
7:   else
8:     newThreshold <- getNewThreshold(T, parentScore)
9:     score <- NetworkPruning(n, newThreshold)
10:  end if
11:   parentScore[n] <- score
12:   currentScore <- computeProbability(parentScore)
13:   if currentScore < T then
14:     end network evaluation
15:   end if
16: end for
17: return currentScore
IV. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed SVM-based machine learning method for XML duplicate detection, together with the proposed pruning optimization, with the pruning threshold adjusted according to the SVM results. The experiments compare the duplicate detection results and show that the proposed approach yields lower error values than XMLDup detection over XML data objects, with no considerable degradation of the outcome, and that it performs effectively even when dealing with reasonable amounts of missing data. Using attributes that carry little relevant information can decrease the system's effectiveness. The efficiency tests showed that our network pruning approach benefits from considering nodes in a particular order, given by a predefined ordering heuristic. The results were evaluated by measuring precision and recall over several domains. The proposed SVM-based pruning with a Bayesian network achieves higher precision and recall in XML duplicate detection, regarding both efficiency and effectiveness. The efficiency results are shown in Figure 2: the enhanced approach (blue line) achieves a better trade-off between precision and recall than the baseline (red line).
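The recursive evaluation of Algorithm 2 above can be sketched in Python. This is a simplified illustration: the node representation, the product used in place of computeProbability, and getSimilarityScore as a stand-in for the leaf similarity computation are all assumptions.

```python
def get_similarity_score(node):
    # Stand-in for the (expensive) leaf-node similarity computation.
    return node.get("sim", 1.0)

def network_pruning(node, threshold):
    """Evaluate a node's duplicate probability bottom-up, aborting as soon
    as the running score (with still-unseen parents optimistically set to 1)
    drops below the threshold. Returns None when the pair is pruned."""
    parents = node.get("parents", [])
    parent_score = {id(n): 1.0 for n in parents}  # unknown similarities = 1
    for n in parents:
        if not n.get("parents"):                  # value node: compute similarity
            score = get_similarity_score(n)
        else:                                     # inner node: recurse
            score = network_pruning(n, threshold)
            if score is None:
                return None
        parent_score[id(n)] = score
        # Simple product combination standing in for computeProbability.
        current = 1.0
        for s in parent_score.values():
            current *= s
        if current < threshold:
            return None                           # prune: threshold unreachable
    return current

tree = {"parents": [{"sim": 0.9}, {"sim": 0.8}]}
print(network_pruning(tree, 0.5))  # evaluated fully (score ~ 0.72)
print(network_pruning(tree, 0.9))  # pruned after the second similarity -> None
```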
Figure 2: Performance comparison vs parameters
V. CONCLUSION AND FUTURE WORK
In this work, the conditional probabilities used with the network pruning technique are derived using machine learning methods, which gives more accurate XMLDup detection results than general methods. The SVM algorithm derives the conditional probability values automatically instead of relying on generic probabilities. BN pruning is then performed to eliminate duplicate XML data and XML data objects. The model is highly flexible, and different probability values can be derived from the SVM. To improve the runtime efficiency of XMLDup, a network pruning strategy combined with the SVM is also presented. Additionally, this approach can be performed automatically, without user involvement, and produces efficient results for duplicate detection. Future work will extend the BN model construction algorithm to other types of machine learning and optimization algorithms, such as bee colony, artificial immune system, and BAT algorithms, to derive the conditional probability values.
References
1. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita, "Declarative data cleaning: Language, model, and algorithms", Proc. Int'l Conf. Very Large Databases (VLDB), pp. 371-380, Rome, Italy, 2001.
2. E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches", IEEE Data Engineering Bulletin, vol. 23, pp. 3-13, 2000.
3. Doan, Y. Lu, Y. Lee, and J. Han, "Object matching for information integration: A profiler-based approach", IEEE Intelligent Systems, pp. 54-59, 2003.
4. X. Dong, A. Halevy, and J. Madhavan, "Reference reconciliation in complex information spaces", Proc. Int'l Conf. Management of Data (SIGMOD), Baltimore, MD, 2005.
5. M. Weis and F. Naumann, "Detecting duplicates in complex XML data", Proc. Int'l Conf. Data Engineering (ICDE), Atlanta, Georgia, 2006.
6. D. Milano, M. Scannapieco, and T. Catarci, "Structure Aware XML Object Identification", Proc. VLDB Workshop on Clean Databases (CleanDB), 2006.
7. W. E. Winkler, "Advanced methods for record linkage", Technical report, Statistical Research Division, U.S. Census Bureau, Washington, DC, 1994.
8. Z. Chen, D. V. Kalashnikov, and S. Mehrotra, "Exploiting relationships for object consolidation", Proc. SIGMOD Workshop on Information Quality in Information Systems, Baltimore, MD, 2005.
9. M. Weis and F. Naumann, "Duplicate detection in XML", Proc. SIGMOD Workshop on Information Quality in Information Systems, pp. 10-19, Paris, France, 2004.
10. D. V. Kalashnikov and S. Mehrotra, "Domain-Independent Data Cleaning via Analysis of Entity-Relationship Graph", ACM Trans. Database Systems, vol. 31, no. 2, pp. 716-767, 2006.
11. L. Leitão, P. Calado, and M. Herschel, "Efficient and Effective Duplicate Detection in Hierarchical Data", IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 5, pp. 1028-1041, May 2013.
12. L. Leitão, P. Calado, and M. Weis, "Structure-Based Inference of XML Similarity for Fuzzy Duplicate Detection", Proc. 16th ACM Int'l Conf. Information and Knowledge Management (CIKM), pp. 293-302, 2007.
13. S. Puhlmann, M. Weis, and F. Naumann, "XML Duplicate Detection Using Sorted Neighborhoods", Proc. Int'l Conf. Extending Database Technology (EDBT), pp. 773-791, 2006.
14. M. Weis and F. Naumann, "DogmatiX Tracks Down Duplicates in XML", Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 431-442, 2005.
15. R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses", Proc. Int'l Conf. Very Large Databases (VLDB), pp. 586-597, 2002.