C2Net: A Network-Efficient Approach to Collision Counting LSH Similarity Join


Abstract: Similarity join of two datasets P and Q is a primitive operation that is useful in many application domains. The operation identifies pairs (p, q) in the Cartesian product of P and Q such that (p, q) satisfies a stipulated similarity condition. In a high-dimensional space, an approximate similarity join based on locality-sensitive hashing (LSH) provides a good solution, reducing the processing cost at a predictable loss of accuracy. A distributed processing framework such as MapReduce allows the handling of large and high-dimensional datasets. However, network cost frequently becomes a bottleneck in a distributed processing environment, making faster and more efficient similarity join a challenge. This paper focuses on collision counting LSH-based similarity join in MapReduce and proposes a network-efficient solution called C2Net to improve the utilization of MapReduce combiners. The solution uses two graph partitioning schemes: (i) a minimum spanning tree for organizing LSH bucket replication; and (ii) spectral clustering for runtime collision counting task scheduling. Experiments have shown that, in comparison to the state of the art, the proposed solution is able to achieve 20% data reduction and a 50% reduction in shuffle time.

Existing system: Locality-sensitive hashing (LSH) is a well-known approximation technique used for similarity search in a high-dimensional space. In this method, a family of hash functions is used to allocate similar data points to the same buckets with high probability, which transforms proximity search into an exact-match lookup problem. Datar et al. define an LSH function based on the Gaussian distribution and propose a popular method named E2LSH for Euclidean space. However, when a new search range is specified, rehashing cannot be avoided. To enable the reuse of existing hash values and hash tables for a varied search range, Gan et al. use the collision frequency to evaluate the similarity between two points and propose the C2LSH method.
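
As an illustration of the collision counting idea, the sketch below hashes points with a family of E2LSH-style Gaussian projection functions and counts how many functions send two points to the same bucket. All names, parameter values, and the candidate threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def build_hash_functions(dim, m, w=4.0, seed=0):
    """Create m E2LSH-style functions h(x) = floor((a.x + b) / w)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(m, dim))    # Gaussian projection vectors
    b = rng.uniform(0.0, w, size=m)  # random offsets in [0, w)
    return a, b, w

def signatures(points, a, b, w):
    """Hash every point with all m functions: one bucket id per function."""
    return np.floor((points @ a.T + b) / w).astype(int)

def collision_count(sig_p, sig_q):
    """C2LSH-style evidence: number of functions placing both points in the same bucket."""
    return int(np.sum(sig_p == sig_q))

# Two points form a candidate pair when their collision count reaches a
# threshold (a fraction of m); the threshold of m/2 below is illustrative.
rng = np.random.default_rng(1)
P = rng.normal(size=(2, 16))
a, b, w = build_hash_functions(dim=16, m=64)
S = signatures(P, a, b, w)
print(collision_count(S[0], S[1]), "collisions; candidate:", collision_count(S[0], S[1]) >= 32)
```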

Proposed system: As for task scheduling during runtime, we need to identify the LSH blocks on each data node to be processed by a mapper. The main objective of this step is to improve the combination ratio by using local LSH blocks and local LSH block replicas. Specifically, we exploit the partitions generated by spectral clustering to determine an ideal schedule and then check whether this schedule is achievable using only the local data. If so, this schedule is adopted for the query processing. Otherwise, we propose a strategy to approximate the ideal schedule. Note that the spectral clustering algorithm is executed while the LSH blocks are being transferred to the nodes, so it should not incur extra time cost.
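
A minimal sketch of how spectral clustering over a block-similarity graph could yield per-node schedules, assuming scikit-learn's SpectralClustering with a precomputed affinity matrix; the toy matrix and cluster count are illustrative, not values from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Toy block-similarity matrix: entry (i, j) is the similarity between
# LSH blocks i and j (in C2Net this would come from the block graph).
affinity = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
])

# One cluster per data node: blocks in the same cluster form the
# "ideal schedule" for one mapper.
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(affinity)

for node in range(2):
    print(f"node {node} processes blocks", np.flatnonzero(labels == node).tolist())
```

Whether this ideal schedule is adopted would then depend on the local-data check described above.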

Advantages: In order to achieve this, we first group LSH buckets into blocks and then use a graph to capture the similarity relationships among blocks. Next, we generate a minimum spanning tree (MST) and use it to partition the LSH blocks. We then discuss recent developments on how LSH can be used to speed up similarity join processing in various application domains in Section 2.2, and the major MapReduce optimization techniques in Section 2.4. Finally, we highlight the differences between our proposed solution and other related techniques discussed in this section.

Disadvantages: Various LSH techniques have been proposed to turn a similarity search problem into a collection of exact-match lookup problems. In the context of similarity join query processing, LSH-based methods have been proposed to address similarity join problems in different domains, e.g., vector space, time series, and strings.

Modules:

LSH-based Similarity Join: A considerable amount of research attention has been devoted to reducing the similarity join processing cost using various LSH techniques. Yuan et al. propose a framework for approximate string similarity join based on MinHash LSH and trie-based index techniques under edit distance constraints. Xiong et al. introduce a bucket-pruning-based LSH indexing method to reduce the similarity computation cost for top-k similarity join in a heterogeneous information network. Chen et al. parallelize time-series similarity join using LSH and MapReduce. In particular, they perform similarity join in three steps: (i) map each time sequence into the frequency domain using the Discrete Fourier Transform to avoid the curse of dimensionality; (ii) use LSH to find candidate similar time-sequence pairs; and (iii) remove duplicated pairs to avoid repeated computation. Yu et al.
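
To make the MinHash-based approach concrete, the following sketch estimates the Jaccard similarity of two strings from n-gram MinHash signatures. The helper names and parameters are illustrative, and this covers only the signature step, not Yuan et al.'s full trie-indexed join.

```python
import random

def ngrams(s, n=3):
    """Character n-grams of a string, as a set."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(tokens, num_hashes=64, seed=0):
    """One min-hash per random hash function; matching entries occur
    with probability equal to the Jaccard similarity of the token sets."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(t) ^ m for t in tokens) for m in masks]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature entries estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

s1, s2 = "similarity join", "similarity joins"
sig1 = minhash_signature(ngrams(s1))
sig2 = minhash_signature(ngrams(s2))
print(round(estimated_jaccard(sig1, sig2), 2))
```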

MapReduce Optimization: Many researchers have devoted attention to optimizing the MapReduce framework to accelerate processing. He et al. propose an algorithm called LEEN for the MapReduce framework to solve the data locality and partition skew problems in Hadoop. The algorithm partitions all buffered intermediate keys according to their frequencies and the fairness of the expected data distribution after the shuffle phase, achieving higher locality and reducing the amount of shuffle data. Hammoud et al. propose a locality-aware, skew-aware reduce task scheduler called the center-of-gravity reduce scheduler (CoGRS) to solve the partitioning skew problem. CoGRS schedules each reduce task at its center-of-gravity node, which is computed with partitioning skew taken into account. Both LEEN and CoGRS have shown that network traffic during the shuffle stage can be reduced by resolving partitioning skew and increasing data locality.
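
The following is a minimal sketch of frequency-aware key partitioning in the spirit of LEEN, greedily assigning the heaviest keys to the lightest reducer to balance the expected post-shuffle load. LEEN itself also weighs data locality, which this toy version omits; all names and values are illustrative.

```python
from collections import Counter

def frequency_aware_partition(key_frequencies, num_reducers):
    """Assign each key (heaviest first) to the currently lightest reducer,
    balancing the expected data distribution after the shuffle phase."""
    load = [0] * num_reducers
    assignment = {}
    for key, freq in sorted(key_frequencies.items(), key=lambda kv: -kv[1]):
        target = load.index(min(load))  # lightest reducer so far
        assignment[key] = target
        load[target] += freq
    return assignment, load

freqs = Counter({"k1": 90, "k2": 60, "k3": 30, "k4": 30, "k5": 10})
assignment, load = frequency_aware_partition(freqs, num_reducers=2)
print(assignment, load)  # loads end up balanced: [110, 110]
```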

LSH Block and Replicates Assignment: As stated in Section 3, we formulate the LSH block assignment problem as a cluster analysis problem with two main concerns: (i) we want to assign similar blocks to the same node to help improve the combination ratio; and (ii) we also want each node to hold roughly the same number of LSH blocks for load balancing. To facilitate the partitioning of LSH blocks, we construct a similarity graph for LSH blocks as defined in Section 3 and generate a minimum spanning tree (MST). In addition, we propose a strategy to improve workload balance by utilizing the fault tolerance mechanism in Spark. Specifically, we make multiple replicas of LSH blocks and propose a method of assigning them to different data nodes. The replicated block assignment is based on the similarity between each block and the blocks on each node, so that the block replicas can help identify similar pairs (pi, qj) at each node.
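
A small sketch of the MST-based partitioning step, assuming networkx and a toy block-similarity graph: edges are weighted by distance (one minus similarity), an MST is built, and the heaviest MST edges are cut to obtain groups of similar blocks. The replica placement step is not shown, and the graph and part count are illustrative.

```python
import networkx as nx

# Toy similarity graph over LSH blocks; convert similarity to distance
# so the MST keeps the most similar blocks connected.
sims = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.3, (2, 3): 0.8, (1, 3): 0.1}
g = nx.Graph()
for (u, v), s in sims.items():
    g.add_edge(u, v, weight=1.0 - s)

mst = nx.minimum_spanning_tree(g)

# Partition into k parts by cutting the k-1 heaviest MST edges,
# i.e., the weakest similarity links.
k = 2
heaviest = sorted(mst.edges(data="weight"), key=lambda e: -e[2])
for u, v, _ in heaviest[: k - 1]:
    mst.remove_edge(u, v)

partitions = [sorted(c) for c in nx.connected_components(mst)]
print(partitions)  # e.g., [[0, 1], [2, 3]]: similar blocks stay together
```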

Scheduling of Mapper Tasks: As shown in the system overview (Figure 2), the runtime scheduler is responsible for determining a subset of LSH blocks for processing. A straightforward solution is to directly use the original MST partitioning as the schedule; that is, each mapper processes only its originally assigned LSH blocks. We call this scheduling method MST-default, or simply MST-D. A better approach is to make use of the replicas to improve data locality within each slave node. For each data node, we randomly choose one LSH block and then incrementally include LSH blocks according to their similarity values. However, this method is still based on the MST, which is not designed to minimize the distances between LSH blocks within the same data node. We call this method MST-G.
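
A sketch of the greedy, replica-aware scheduling idea described above (MST-G style): starting from one local block, a mapper keeps adding the local block most similar to what it has already chosen. The function names, the budget parameter, and the toy similarity table are illustrative assumptions.

```python
def greedy_schedule(local_blocks, sim, budget):
    """Pick one local block arbitrarily, then greedily add the local block
    most similar to the chosen set, so one mapper combines as many
    related collisions as possible."""
    remaining = set(local_blocks)
    chosen = [remaining.pop()]
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda b: max(sim(b, c) for c in chosen))
        remaining.remove(best)
        chosen.append(best)
    return chosen

# Toy pairwise similarities between block ids on one data node.
S = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.4, (0, 3): 0.2, (1, 3): 0.3, (2, 3): 0.8}
sim = lambda a, b: S.get((min(a, b), max(a, b)), 0.0)
print(greedy_schedule([0, 1, 2, 3], sim, budget=3))
```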

