Proceedings of International Conference on Advancements in Engineering and Technology
www.iaetsd.in
A SURVEY ON ONE CLASS CLUSTERING HIERARCHY FOR PERFORMING DATA LINKAGE
S.Rajalakshmi, Assistant Professor, Department of CSE, Velammal Engineering College, Anna University, Chennai, India. raji780@yahoo.co.in
Abstract— Data linkage is the process of matching records from several databases that refer to entities of the same type. Data linkage is also possible for entities that do not share a common identifier. With the growing size of today's databases, the complexity of the matching process has become a major challenge for data linkage. Many indexing techniques have been developed for data linkage; however, those techniques are not efficient. In this paper, a new data linkage method called the One Class Clustering Tree (OCCT) is presented to overcome the existing challenges and to perform the data linkage process for entities that do not share a common identifier. The technique builds the tree in such a way that the inner nodes of the tree represent the features of the first set of entities and the leaves of the tree represent the features of the second set that are similar. The one class clustering tree uses certain splitting criteria and pruning methods for the data linkage.
Keywords--Linkage, classification, clustering, splitting, decision tree induction, index techniques.
I. INTRODUCTION
Data linkage is the process of identifying different entries that refer to the same entity across different data sources [1]. The main aim of data linkage is to join datasets that do not share a common identifier or foreign key. Data linkage is usually performed to reduce large data into smaller data. It also helps in removing duplicate data from the datasets; this technique is called deduplication [19]. Data linkage can be classified into two types, namely one-to-one data linkage and one-to-many data linkage [15]. In one-to-one data linkage, the aim is to link an entity from one dataset with the matching entity from the other dataset. In one-to-many data linkage, the aim is to link an entity from the first dataset with the group of matching entities from the other dataset. In this paper, a data linkage approach called the One Class Clustering Tree (OCCT) is used, which is aimed at performing one-to-many data linkage. OCCT is preferable to the indexing techniques because it can easily be translated into linkage rules.
ISBN NO : 978 - 1502893314
A.Jayanthi, M.E (CSE), Department of CSE, Velammal Engineering College, Anna University, Chennai, India. jayanthiarumugamk@gmail.com
The paper is structured as follows: Section II reviews indexing techniques, Section III deals with data linkage using OCCT, and Section IV concludes the paper.
II. INDEXING TECHNIQUES
In this section, the various indexing techniques and the differences among them are discussed. The indexing process of data linkage can be divided into two phases:
1) Build: All the records in the database are read and their Blocking Key Values (BKVs) are generated. Most indexing techniques use an inverted index approach [6], where the record identifiers that have the same BKV are inserted into the same inverted index list.
2) Retrieve: For every block, the list of record identifiers is retrieved from the inverted index and the candidate record pairs are generated from the list.
A. TRADITIONAL BLOCKING
Traditional blocking is one of the techniques used in data linkage [1]. In traditional blocking, all the records that have the same BKV are inserted into the same block, and the records within that block are compared with each other. This technique can be implemented using an inverted index [6]. The main disadvantage of traditional blocking is that errors and variations in the record fields used to generate the BKVs lead to records being inserted into the wrong block. The second disadvantage is that the sizes of the generated blocks depend on the frequency distribution of the BKVs, so it is difficult to predict the total number of candidate record pairs that will be generated.
B. SORTED NEIGHBORHOOD INDEXING
Sorted neighborhood indexing sorts the database according to the BKVs and then moves a window of a fixed number of records over the sorted values; candidate record pairs are generated only from the records within the current window. It uses three approaches, namely the sorted array based approach [4], the inverted index based approach [14] and the adaptive sorted neighborhood approach [16]. The sorted array based approach is not applicable when the window size is small. The inverted index based approach has the same drawback as traditional blocking, and it is inefficient because it takes a lot of time to split the entities. The adaptive sorted neighborhood approach is not suitable when the window size is too large.
C. Q-GRAM BASED INDEXING
Q-gram based indexing overcomes the drawbacks of traditional blocking and sorted neighborhood indexing. The main aim of this technique is to index the database such that records that have a similar, and not just the same, BKV are inserted into the same block [8]. However, a much larger number of candidate record pairs is generated, leading to a more time-consuming process.
D. SUFFIX ARRAY-BASED INDEXING
Suffix array-based indexing is one of the most efficient approaches compared to the previous techniques. The basic idea is to insert the BKVs and their suffixes into a suffix array based inverted index [11]. A variant called robust suffix array based indexing merges the inverted index lists of suffix values that are similar to each other in the sorted suffix array [13]. This technique also takes a lot of time to merge the values.
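As a minimal sketch of the build and retrieve phases described in Section II, the following Python example builds an inverted index for traditional blocking and, for comparison, generates candidate pairs with a sliding window over the sorted BKVs. The record fields, the `blocking_key` function and all function names are assumptions chosen for illustration; they are not taken from the surveyed papers.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical BKV: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def build_inverted_index(records, key_fn):
    """Build phase: map each BKV to the list of record identifiers sharing it."""
    index = defaultdict(list)
    for rec_id, record in records.items():
        index[key_fn(record)].append(rec_id)
    return index


def blocking_pairs(index):
    """Retrieve phase: generate candidate pairs only within each block."""
    pairs = set()
    for rec_ids in index.values():
        pairs.update(combinations(sorted(rec_ids), 2))
    return pairs


def sorted_neighborhood_pairs(records, key_fn, window):
    """Sort identifiers by BKV and pair the records inside a sliding window."""
    ordered = sorted(records, key=lambda rec_id: key_fn(records[rec_id]))
    pairs = set()
    # max(1, ...) ensures one window even if there are fewer records than `window`.
    for start in range(max(1, len(ordered) - window + 1)):
        pairs.update(combinations(sorted(ordered[start:start + window]), 2))
    return pairs


# Illustrative data only.
records = {
    "r1": {"surname": "Smith", "postcode": "2600"},
    "r2": {"surname": "Smithe", "postcode": "2600"},
    "r3": {"surname": "Smyth", "postcode": "2600"},
    "r4": {"surname": "Jones", "postcode": "2913"},
}

index = build_inverted_index(records, blocking_key)
block_pairs = blocking_pairs(index)
window_pairs = sorted_neighborhood_pairs(records, blocking_key, window=2)
```

With these four records, blocking pairs only r1 with r2, because the variant spelling "Smyth" produces a different BKV. The window of size 2 additionally pairs r2 with r3, at the cost of the unrelated pair r1 and r4, which happen to be adjacent in the sorted order.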
E. CANOPY CLUSTERING
Canopy clustering [14] is built by converting BKVs into lists of tokens, with each unique token becoming a key in the inverted index. It uses two approaches, the threshold-based approach and the nearest-neighbor-based approach. The drawback of canopy clustering is similar to that of the sorted neighborhood technique based on the sorted array.
F. STRING-MAP-BASED INDEXING
String-map-based indexing [9] maps BKVs to objects in a multidimensional Euclidean space such that the distances between pairs of strings are preserved. Groups of similar strings are then generated by extracting the objects that are similar to each other. However, this technique fails when the size of the database is too large or too small.
Hence, all the indexing techniques discussed above have some drawbacks in the data linkage process. In order to overcome these problems, a new approach called the One Class Clustering Tree is proposed, which uses four splitting criteria, namely the coarse-grained Jaccard coefficient, the fine-grained Jaccard coefficient, Least Probable Intersection (LPI) and Maximum Likelihood Estimation (MLE), together with pruning techniques.
III. DATA LINKAGE USING OCCT
[Figure: work flow diagram with the stages DATABASE A, DATABASE B, CONSTRUCT OCCT USING ALL ENTITIES, PREPRUNING TECHNIQUE, COMPARE ENTITIES, MATCHING ENTITY, NON-MATCHING ENTITY]
OCCT is induced using one of the splitting criteria. The splitting criterion determines which attribute should be used in each step of building the tree. OCCT uses a prepruning process to decide which branches should be trimmed.
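As a rough illustration of how a Jaccard-based splitting criterion could select the attribute to split on, the sketch below groups the linked items by each candidate attribute and prefers the attribute whose subsets overlap least. The plain Jaccard coefficient, the averaging over subset pairs, the sample data and all names are assumptions made for this example; the exact coarse-grained and fine-grained definitions are given in [20].

```python
from collections import defaultdict
from itertools import combinations


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def split_by_attribute(linked_pairs, attribute):
    """Group the linked second-table items by the first-table attribute value."""
    groups = defaultdict(set)
    for a_record, b_item in linked_pairs:
        groups[a_record[attribute]].add(b_item)
    return groups


def split_score(linked_pairs, attribute):
    """Mean pairwise Jaccard coefficient between the subsets a split produces."""
    subsets = list(split_by_attribute(linked_pairs, attribute).values())
    if len(subsets) < 2:
        return 1.0  # a single subset is not a real split: treat as maximally similar
    scores = [jaccard(x, y) for x, y in combinations(subsets, 2)]
    return sum(scores) / len(scores)


def choose_split_attribute(linked_pairs, attributes):
    """Assumed rule: prefer the attribute whose subsets are least similar."""
    return min(attributes, key=lambda attr: split_score(linked_pairs, attr))


# Hypothetical matching pairs: (record from the first table, linked item from the second).
linked_pairs = [
    ({"city": "Chennai", "gender": "F"}, "books"),
    ({"city": "Chennai", "gender": "M"}, "books"),
    ({"city": "Chennai", "gender": "F"}, "music"),
    ({"city": "Delhi", "gender": "M"}, "tools"),
    ({"city": "Delhi", "gender": "F"}, "garden"),
]

best = choose_split_attribute(linked_pairs, ["city", "gender"])
```

In this toy data, splitting on `city` gives two subsets with no items in common (score 0.0), while splitting on `gender` leaves "books" in both subsets (score 0.25), so `city` is selected.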
The work flow, from the two databases to the final result, is shown in Fig 1. Initially, the tree is constructed such that the inner nodes of the tree consist of attributes and the leaves represent the clusters of matching entities. Secondly, the prepruning technique is used, which means that the algorithm stops expanding a branch whenever the sub-branch does not improve the accuracy of the model. OCCT uses a probabilistic model to find the similar entities that are to be matched. This probabilistic approach helps to avoid overfitting. OCCT is chosen as the best approach for data linkage compared to the indexing techniques.
IV. CONCLUSION
In this paper, the OCCT approach is used, which performs one-to-many data linkage. This method is based on a one-class decision tree model, which sums up the knowledge of which records are to be linked together. The one-class approach gives more accurate results. The OCCT model has also proved successful in three different domains, namely data leakage prevention, recommender systems and fraud detection.
International Association of Engineering and Technology for Skill Development 52
REFERENCES
1. I.P. Fellegi and A.B. Sunter, “A Theory for Record Linkage,” J. Am. Statistical Soc., vol. 64, no. 328, pp. 1183-1210, Dec. 1969.
2. D.D. Dorfman and E. Alf, “Maximum-Likelihood Estimation of Parameters of Signal-Detection Theory and Determination of Confidence Intervals—Rating-Method Data,” J. Math. Psychology, vol. 6, no. 3, pp. 487-496, 1969.
3. J.R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, no. 1, pp. 81-106, March 1986.
4. M.A. Hernandez and S.J. Stolfo, “The Merge/Purge Problem for Large Databases,” Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’95), 1995.
5. P. Langley, Elements of Machine Learning, San Francisco, Morgan Kaufmann, 1996.
6. I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes, second ed. Morgan Kaufmann, 1999.
7. S. Guha, R. Rastogi and K. Shim, “Rock: A Robust Clustering Algorithm for Categorical Attributes,” Information Systems, vol. 25, no. 5, pp. 345-366, July 2000.
8. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, “Approximate String Joins in a Database (Almost) for Free,” Proc. 27th Int’l Conf. Very Large Data Bases (VLDB ’01), pp. 491-500, 2001.
9. L. Jin, C. Li, and S. Mehrotra, “Efficient Record Linkage in Large Data Sets,” Proc. Eighth Int’l Conf. Database Systems for Advanced Applications (DASFAA ’03), pp. 137-146, 2003.
10. I.S. Dhillon, S. Mallela, and D.S. Modha, “Information-Theoretic Co-Clustering,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 89-98, 2003.
11. A. Aizawa and K. Oyama, “A Fast Linkage Detection Scheme for Multi-Source Information Integration,” Proc. Int’l Workshop Challenges in Web Information Retrieval and Integration (WIRI ’05), 2005.
12. A.J. Storkey, C.K.I. Williams, E. Taylor and R.G. Mann, “An Expectation Maximisation Algorithm for One-to-Many Record Linkage,” University of Edinburgh Informatics Research Report, 2005.
13. P. Christen, “A Comparison of Personal Name Matching: Techniques and Practical Issues,” Proc. IEEE Sixth Data Mining Workshop (ICDM ’06), 2006.
14. P. Christen, “Towards Parameter-Free Blocking for Scalable Record Linkage,” Technical Report TR-CS07-03, Dept. of Computer Science, The Australian Nat’l Univ., 2007.
15. P. Christen and K. Goiser, “Quality and Complexity Measures for Data Linkage and Deduplication,” Quality Measures in Data Mining, vol. 43, pp. 127-151, 2007.
16. S. Yan, D. Lee, M.Y. Kan, and L.C. Giles, “Adaptive Sorted Neighborhood Methods for Efficient Record Linkage,” Proc. Seventh ACM/IEEE-CS Joint Conf. Digital Libraries (JCDL ’07), 2007.
17. A. Gershman et al., “A Decision Tree Based Recommender System,” in Proc. the 10th Int. Conf. on Innovative Internet Community Services, pp. 170-179, 2010.
18. M. Yakout, A.K. Elmagarmid, H. Elmeleegy, M. Quzzani and A. Qi, “Behavior Based Record Linkage,” in Proc. of the VLDB Endowment, vol. 3, no. 1-2, pp. 439-448, 2010.
19. P. Christen, “A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 9, pp. 1537-1555, Sept. 2012, doi:10.1109/TKDE.2011.127.
20. M. Dror, A. Shabtai, L. Rokach, Y. Elovici, “OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage,” IEEE Trans. on Knowledge and Data Engineering, TKDE-2011-09-0577, 2013.