PAPER TITLE: Efficient K-means++ approximation with MapReduce
APPEARED: IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 12
DATE: December 2014
SUMMARY by: GEORGIANA TACHE
1. Introduction

K-means is a widely applied clustering algorithm which, in its simple form, has some drawbacks, e.g. convergence to a local optimum and a certain inaccuracy; its results depend very much on the initialization of the centers. K-means++ is an improvement of K-means consisting of an initialization step followed by the usual K-means iterations, and its solution is a provable approximation of the optimal one. Remaining drawbacks: it is inefficient on large data, parallel processing is impeded by the sequential steps of the algorithm, and the I/O cost of creating the initial centers is high. Though MapReduce is a useful framework, it cannot properly solve the last two drawbacks (generating the k initial centers needs 2k MapReduce jobs).

This paper proposes a K-means++ initialization algorithm which takes only 1 MapReduce job instead of 2k: the standard K-means++ initialization takes place in the Map phase and a weighted K-means++ initialization in the Reduce phase. The algorithm is also shown to offer an O(α^2) approximation of the optimal K-means solution. A single MR job means lower I/O costs, and a further pruning strategy speeds up the computation.

2. Related work

MapReduce implementations already exist for K-means as well as for other clustering algorithms. Notable related work: Haloop, Lloyd's iteration, the K-means++ initialization (providing an O(α) approximation), and Scalable K-means++ (an improvement of K-means++, but it needs too many MR jobs and is therefore inefficient).

3. Problem and notation; K-means++; the MapReduce model of Hadoop

X = {x1, x2, ..., xn} is the data set, split into blocks of 64 MB each.
From K-means: Y = {Y1, Y2, ..., Yk}, where the Yi are exhaustive and pairwise disjoint clusters.
A centroid: centroid(Yi) = (1/|Yi|) ∑ y, where the sum is over y from Yi.
||xi − xj|| = the Euclidean distance between xi and xj.
SSE(C) = ∑ min ||x − c||^2, where the sum is over all x from X and the min is over c from C.
Goal: minimize SSE(C) and find the optimal C = {c1, ..., ck} → an NP-hard problem.
An α-approximation satisfies SSE(C) <= α · SSE-opt (the sum of squared errors should be small).

Intuition behind K-means++: the distance between any two cluster centers should be as large as possible; however, initializing the centers as far apart as possible does not mean that they will be the real centers. Therefore K-means++ randomizes: the first center is chosen uniformly at random, and every further center is chosen with probability proportional to the squared distance D^2(x) from the point x to its closest already-chosen center, i.e. with probability D^2(x) / ∑ D^2(x) (a runnable sketch follows below). In Hadoop, multiple rounds of MapReduce are required for K-means and K-means++, with significant overhead due to communication across machines, synchronization, and congestion.
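Below, a minimal runnable sketch (in Python/NumPy; my own illustration, not the paper's code, and kmeanspp_init is a name I made up) of the D^2-sampling loop just described:

    import numpy as np

    def kmeanspp_init(X, k, rng):
        # first center: chosen uniformly at random
        centers = [X[rng.integers(len(X))]]
        # D2[i] = squared distance from X[i] to its nearest chosen center
        D2 = ((X - centers[0]) ** 2).sum(axis=1)
        while len(centers) < k:
            # sample the next center with probability D^2(x) / sum of all D^2
            idx = rng.choice(len(X), p=D2 / D2.sum())
            centers.append(X[idx])
            # each point keeps the smaller of its old and new squared distance
            D2 = np.minimum(D2, ((X - X[idx]) ** 2).sum(axis=1))
        return np.array(centers)

    # usage:
    # rng = np.random.default_rng(0)
    # X = rng.normal(size=(1000, 2))
    # centers = kmeanspp_init(X, k=5, rng=rng)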
4. Theoretical analysis of MapReduce K-means++

The MapReduce K-means++ algorithm combines the K-means++ initialization with MapReduce K-means (Lloyd's iterations). A direct MapReduce implementation of the initialization alternates two steps: (1) choose one point as a center according to the sampling probability; (2) update the distances from all the points to their nearest center. This is done with 2 sequential MR jobs, one per step, so finding k centers takes 2k jobs, which is very costly for Hadoop.

The fast K-means++ initialization proposed here uses the standard K-means++ initialization in the Mapper and a weighted one in the Reducer. The Mapper also returns the number of points represented by each of its centers, which gives the weights for the Reducer; the Reducer then chooses the k initial centers according to these weights. The time complexity is O(knd), where d = the dimension of the space.

The Mapper phase in pseudo-code:
    C = ∅
    C = C ∪ {x}, where x is chosen at random from the split, as the first center
    num[i] = 0 for 1 <= i <= k (initially 0 points attached to each future center)
    while not all k centers have been chosen:
        compute the distance D(x) from every point x to its nearest center found so far
        choose a new center x with (the highest) probability D^2(x) / ∑ D^2(x) and add it to C
    for every point xi:
        find its nearest center c_j and increment num[j] (the number of points attached to that center)
    the Map function emits the pairs <num[i], c_i>
* Here there is a small mistake in the paper (Algorithm 2, lines 10-11): the same index i is used for both xi and ci, whereas the nearest center of xi may actually have another index (for example j).

The Reducer phase in pseudo-code:
    C = ∅
    C = C ∪ {x}, where x is chosen at random from the mappers' centers, as the first center
    while not all k centers have been chosen:
        compute the Euclidean distance D(x) from every candidate x to its nearest center found so far
        choose a new center x with the weighted probability num(x) · D^2(x) / ∑ num(x) · D^2(x) and add it to C
    the Reduce function returns C, the final list of centers

Each 64 MB split of the data X is sent to a Map task which chooses k centers on it (Y1, ..., Ym for the splits X1, ..., Xm respectively). The Reducer collects the results from all m splits and chooses the k final centers according to the weights provided by the Map tasks. This algorithm is simple and efficient and provides a good approximation, but it still contains unneeded distance computations, so the paper proposes an improved version (Section 5).

Lemma 1 states that for the standard K-means++ initialization with output C', SSE(C') <= α · SSE-opt, where α = O(log k) and SSE(C') = ∑ min ||x − c||^2, the min being over c from C' and the sum over x from the data set X. Lemma 2 states that the weighted K-means++ initialization produces a C with ∑ min ||φ(x) − c||^2 <= α ∑ min ||φ(x) − c||^2, where the left-side min is over C (the Reducer output), the right-side min is over C' (the output of standard K-means++ on the same weighted set), and φ(x) ranges over Y', the weighted multiset built from X. The paper then proves that the solution of MapReduce K-means++ is an O(α^2) approximation of the optimal K-means solution, using inequalities derived from the triangle inequality d(x,z) <= d(x,y) + d(y,z).
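To make Section 4 concrete, here is a hedged sketch in plain Python/NumPy that simulates the Map and Reduce phases locally rather than on Hadoop; mapper, reducer and all parameter names are mine. It reuses kmeanspp_init from the sketch above, i.e. it samples centers proportionally to D^2 as in standard K-means++ (the paper's algorithm may instead pick the candidate with the highest probability, as the parenthetical in the pseudo-code suggests):

    import numpy as np

    def mapper(split, k, rng):
        # standard K-means++ on one 64 MB split
        centers = kmeanspp_init(split, k, rng)
        # num[j] = number of points of the split attached to centers[j]
        d2 = ((split[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        num = np.bincount(d2.argmin(axis=1), minlength=k)
        return list(zip(num, centers))                # the <num[i], c_i> pairs

    def reducer(pairs, k, rng):
        # weighted K-means++ over the union of all mappers' centers
        w = np.array([p[0] for p in pairs], dtype=float)
        Y = np.array([p[1] for p in pairs])
        chosen = [rng.choice(len(Y), p=w / w.sum())]  # first center by weight (an assumption)
        D2 = ((Y - Y[chosen[0]]) ** 2).sum(axis=1)
        while len(chosen) < k:
            p = w * D2                                # num(x) * D^2(x)
            chosen.append(rng.choice(len(Y), p=p / p.sum()))
            D2 = np.minimum(D2, ((Y - Y[chosen[-1]]) ** 2).sum(axis=1))
        return Y[chosen]

    # usage (m splits of a data set X, as in the paper's setting):
    # pairs = [pair for split in np.array_split(X, m) for pair in mapper(split, k, rng)]
    # final_centers = reducer(pairs, k, rng)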
5. Improved MapReduce K-means++ initialization

Needed because the initialization algorithm runs many iterations, and in both Map and Reduce all data points are scanned at every iteration to check whether they can switch to a nearer center (O(nk^2) distance computations for finding k centers). Solution: prune the unneeded distance computations, such as those of points far away from a newly added center, which simply keep their old center. This uses the triangle inequality (a sketch of the test is given in the note at the end of this summary). Although this solution is good enough for the moment, it can be further improved.

6. Experimental results

Homogeneous Hadoop cluster (latest version at the time), 12 machines (1 master + 11 slaves). Several data sets are used. Four evaluations are made:
1 – Efficiency of the MapReduce K-means++ initialization vs. the Scalable K-means++ initialization. The MR K-means++ initialization proposed in this paper outperforms Scalable K-means++ by the simple fact of the number of jobs used (only 1), and it is effective in I/O and communication cost.
2 – Approximation of the MapReduce K-means++ initialization vs. MapReduce Random initialization. The MR K-means++ initialization of this paper obtains a better approximation than the Random one and shows a steadier approximation trend as the number of centers increases.
3 – Approximation of MapReduce K-means++ vs. MapReduce K-means. MapReduce K-means++ provides a better approximation than MapReduce K-means, and the difference in approximation value becomes smaller as the iteration number increases. The sum of squared errors (SSE) of MR K-means is 10 times larger than that of MR K-means++ on one of the data sets.
4 – Efficiency of the MapReduce K-means++ initialization vs. the Improved MapReduce K-means++ initialization. With increasing k, the Improved MapReduce K-means++ performs significantly fewer distance calculations than MapReduce K-means++ (99% and 18% fewer in the Map phase, 96% and 22% fewer in the Reduce phase, for the two data sets respectively).

7. Conclusion

The paper brought as novelties:
* an efficient K-means++ initialization with MapReduce, which uses 1 job for finding the k centers and achieves an O(α^2) approximation of the optimal K-means solution;
* an improved K-means++ initialization obtained by applying a pruning strategy, which boosts speed and efficiency.
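* Note on the pruning test of Section 5 (my reconstruction from the triangle inequality, not necessarily the paper's exact lemma): for a point x with current nearest center c_old, a newly added center c_new satisfies d(x, c_new) >= d(c_old, c_new) − d(x, c_old); hence if d(c_old, c_new) >= 2 · d(x, c_old), then d(x, c_new) >= d(x, c_old), so x keeps its old center and d(x, c_new) never has to be computed. A hypothetical Python/NumPy sketch of the resulting update step (all names mine):

    import numpy as np

    def update_distances_pruned(X, dist, owner, centers, c_new):
        # dist[i] = distance from X[i] to its nearest center, owner[i] = that center's index
        # distances from the new center to every existing center (at most k of them)
        cc = np.linalg.norm(centers - c_new, axis=1)
        # triangle-inequality test: if cc[owner[i]] >= 2 * dist[i], the new center
        # cannot be closer to X[i], so the distance d(X[i], c_new) is pruned
        cand = np.where(cc[owner] < 2 * dist)[0]
        d_new = np.linalg.norm(X[cand] - c_new, axis=1)
        closer = d_new < dist[cand]
        dist[cand[closer]] = d_new[closer]
        owner[cand[closer]] = len(centers)   # c_new's index once appended to centers
        return dist, owner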