Metric Similarity Joins Using MapReduce

Abstract: Given two object sets Q and O, a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications, including data cleaning and data mining. However, the rapidly growing volume of data challenges traditional metric similarity join methods, and thus a distributed method is required. In this paper, we adopt a popular distributed framework, namely MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods. One utilizes the pivot and space-filling curve mappings to cluster the data in a one-dimensional space, and then selects high-quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to divide the data equally after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, the double-pivot filtering, the pivot filtering, and the plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions significantly outperform existing state-of-the-art competitors.
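
To make the pivot mapping and pivot filtering ideas concrete, the following is a minimal single-machine sketch, not the MapReduce framework described in the paper. It assumes the join criterion is a distance threshold `eps`, that objects are points under the Euclidean metric, and that pivots are given in advance. The names `pivot_map`, `can_prune`, and `similarity_join` are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+); any metric would do


def pivot_map(obj, pivots, metric=dist):
    """Map an object to the vector of its distances to each pivot."""
    return tuple(metric(obj, p) for p in pivots)


def can_prune(mapped_q, mapped_o, eps):
    """Pivot filtering via the triangle inequality.

    For every pivot p, |d(q, p) - d(o, p)| <= d(q, o). If this lower bound
    exceeds eps for any pivot, the pair (q, o) cannot satisfy d(q, o) <= eps.
    """
    return any(abs(a - b) > eps for a, b in zip(mapped_q, mapped_o))


def similarity_join(Q, O, pivots, eps, metric=dist):
    """Return all pairs (q, o) with metric(q, o) <= eps, plus the number of
    pairs whose exact distance had to be computed after filtering."""
    mapped_Q = [pivot_map(q, pivots, metric) for q in Q]
    mapped_O = [pivot_map(o, pivots, metric) for o in O]
    result, verified = [], 0
    for (q, mq), (o, mo) in product(zip(Q, mapped_Q), zip(O, mapped_O)):
        if can_prune(mq, mo, eps):
            continue  # discarded without computing the exact distance
        verified += 1
        if metric(q, o) <= eps:
            result.append((q, o))
    return result, verified


if __name__ == "__main__":
    # Illustrative data only (assumed, not from the paper).
    Q = [(0, 0), (5, 5)]
    O = [(0, 1), (9, 9)]
    pivots = [(0, 0), (10, 10)]
    pairs, verified = similarity_join(Q, O, pivots, eps=1.5)
    print(pairs, verified)
```

The filter is safe because it only discards pairs whose lower bound already exceeds the threshold, so the surviving pairs still need an exact distance check. In a distributed setting the same mapped vectors could also drive partitioning, which is the role the abstract assigns to the space-filling curve and KD-tree methods.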
