PAPER TITLE: Efficient K-means++ approximation with MapReduce
APPEARED: IEEE Transactions on Parallel and Distributed Systems, Vol. 25, No. 12
DATE: December 2014
SUMMARY by: GEORGIANA TACHE
1. Introduction

K-means is a widely applied clustering algorithm which, in its simple form, has some drawbacks, e.g. convergence to a local optimum and a certain inaccuracy; its results depend very much on the initialization of the centers. K-means++ is an improvement of K-means consisting of an initialization step followed by the usual K-means iterations, and its solution is a provable approximation of the optimal one. Remaining drawbacks: it is inefficient on large data, parallel processing is impeded by the sequential steps of the algorithm, and the I/O cost of creating the initial centers is high. Though MapReduce is a useful framework, it cannot properly solve the last two drawbacks (generating the k initial centers needs 2k MapReduce jobs).

This paper proposes a K-means++ initialization algorithm which takes only 1 MapReduce job instead of 2k: the standard K-means++ initialization takes place in the Map phase and a weighted K-means++ initialization in the Reduce phase. The algorithm is also shown to offer an O(α^2) approximation of the optimal K-means solution. A single MR job means lower I/O costs, and a further pruning strategy speeds up the computation.

2. Related work

MapReduce implementations already exist for K-means as well as for other clustering algorithms. Notable related work: Haloop, Lloyd's iteration, the K-means++ initialization (providing an O(α) approximation), and Scalable K-means++ (an improvement of K-means++, but it needs too many MR jobs and is therefore inefficient).

3. Problem and notation; K-means++; the MapReduce model of Hadoop

X = {x1, x2, ..., xn} is the data set, split into blocks of 64 MB each.
From K-means: Y = {Y1, Y2, ..., Yk}, where the Yi are exhaustive and pairwise disjoint clusters.
A centroid: centroid(Yi) = (1/|Yi|) ∑ y, where the sum is over y from Yi.
||xi − xj|| = the Euclidean distance between xi and xj.
SSE(C) = ∑ min ||x − c||^2, where the sum is over all x from X and the min is over c from C.
Goal: minimize SSE(C) and find the optimal C = {c1, ..., ck} → an NP-hard problem.
An α-approximation satisfies SSE(C) <= α · SSE-opt (the sum of squared errors should be small).

Intuition behind K-means++: the distance between any two cluster centers should be as large as possible; however, initializing the centers as far apart as possible does not mean that they will be the real centers. Therefore K-means++ randomizes: the first center is chosen uniformly at random, and every further center is chosen with probability proportional to the squared distance D^2(x) from the point x to its closest already-chosen center, i.e. with probability D^2(x) / ∑ D^2(x) (a runnable sketch follows below). In Hadoop, multiple rounds of MapReduce are required for K-means and K-means++, with significant overhead due to communication across machines, synchronization, and congestion.
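Below, a minimal runnable sketch (in Python/NumPy; my own illustration, not the paper's code, and kmeanspp_init is a name I made up) of the D^2-sampling loop just described:

    import numpy as np

    def kmeanspp_init(X, k, rng):
        # first center: chosen uniformly at random
        centers = [X[rng.integers(len(X))]]
        # D2[i] = squared distance from X[i] to its nearest chosen center
        D2 = ((X - centers[0]) ** 2).sum(axis=1)
        while len(centers) < k:
            # sample the next center with probability D^2(x) / sum of all D^2
            idx = rng.choice(len(X), p=D2 / D2.sum())
            centers.append(X[idx])
            # each point keeps the smaller of its old and new squared distance
            D2 = np.minimum(D2, ((X - X[idx]) ** 2).sum(axis=1))
        return np.array(centers)

    # usage:
    # rng = np.random.default_rng(0)
    # X = rng.normal(size=(1000, 2))
    # centers = kmeanspp_init(X, k=5, rng=rng)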
4. Theoretical analysis of MapReduce K-means++

The MapReduce K-means++ algorithm combines the K-means++ initialization with MapReduce K-means (Lloyd's iterations). A direct MapReduce implementation of the initialization alternates two steps: (1) choose one point as a center according to the sampling probability; (2) update the distances from all the points to their nearest center. This is done with 2 sequential MR jobs, one per step, so finding k centers takes 2k jobs, which is very costly for Hadoop.

The fast K-means++ initialization proposed here uses the standard K-means++ initialization in the Mapper and a weighted one in the Reducer. The Mapper also returns the number of points represented by each of its centers, which gives the weights for the Reducer; the Reducer then chooses the k initial centers according to these weights. The time complexity is O(knd), where d = the dimension of the space.

The Mapper phase in pseudo-code:
    C = ∅
    C = C ∪ {x}, where x is chosen at random from the split, as the first center
    num[i] = 0 for 1 <= i <= k (initially 0 points attached to each future center)
    while not all k centers have been chosen:
        compute the distance D(x) from every point x to its nearest center found so far
        choose a new center x with (the highest) probability D^2(x) / ∑ D^2(x) and add it to C
    for every point xi:
        find its nearest center c_j and increment num[j] (the number of points attached to that center)
    the Map function emits the pairs <num[i], c_i>
* Here there is a small mistake in the paper (Algorithm 2, lines 10-11): the same index i is used for both xi and ci, whereas the nearest center of xi may actually have another index (for example j).

The Reducer phase in pseudo-code:
    C = ∅
    C = C ∪ {x}, where x is chosen at random from the mappers' centers, as the first center
    while not all k centers have been chosen:
        compute the Euclidean distance D(x) from every candidate x to its nearest center found so far
        choose a new center x with the weighted probability num(x) · D^2(x) / ∑ num(x) · D^2(x) and add it to C
    the Reduce function returns C, the final list of centers

Each 64 MB split of the data X is sent to a Map task which chooses k centers on it (Y1, ..., Ym for the splits X1, ..., Xm respectively). The Reducer collects the results from all m splits and chooses the k final centers according to the weights provided by the Map tasks. This algorithm is simple and efficient and provides a good approximation, but it still contains unneeded distance computations, so the paper proposes an improved version (Section 5).

Lemma 1 states that for the standard K-means++ initialization with output C', SSE(C') <= α · SSE-opt, where α = O(log k) and SSE(C') = ∑ min ||x − c||^2, the min being over c from C' and the sum over x from the data set X. Lemma 2 states that the weighted K-means++ initialization produces a C with ∑ min ||φ(x) − c||^2 <= α ∑ min ||φ(x) − c||^2, where the left-side min is over C (the Reducer output), the right-side min is over C' (the output of standard K-means++ on the same weighted set), and φ(x) ranges over Y', the weighted multiset built from X. The paper then proves that the solution of MapReduce K-means++ is an O(α^2) approximation of the optimal K-means solution, using inequalities derived from the triangle inequality d(x,z) <= d(x,y) + d(y,z).
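To make Section 4 concrete, here is a hedged sketch in plain Python/NumPy that simulates the Map and Reduce phases locally rather than on Hadoop; mapper, reducer and all parameter names are mine. It reuses kmeanspp_init from the sketch above, i.e. it samples centers proportionally to D^2 as in standard K-means++ (the paper's algorithm may instead pick the candidate with the highest probability, as the parenthetical in the pseudo-code suggests):

    import numpy as np

    def mapper(split, k, rng):
        # standard K-means++ on one 64 MB split
        centers = kmeanspp_init(split, k, rng)
        # num[j] = number of points of the split attached to centers[j]
        d2 = ((split[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        num = np.bincount(d2.argmin(axis=1), minlength=k)
        return list(zip(num, centers))                # the <num[i], c_i> pairs

    def reducer(pairs, k, rng):
        # weighted K-means++ over the union of all mappers' centers
        w = np.array([p[0] for p in pairs], dtype=float)
        Y = np.array([p[1] for p in pairs])
        chosen = [rng.choice(len(Y), p=w / w.sum())]  # first center by weight (an assumption)
        D2 = ((Y - Y[chosen[0]]) ** 2).sum(axis=1)
        while len(chosen) < k:
            p = w * D2                                # num(x) * D^2(x)
            chosen.append(rng.choice(len(Y), p=p / p.sum()))
            D2 = np.minimum(D2, ((Y - Y[chosen[-1]]) ** 2).sum(axis=1))
        return Y[chosen]

    # usage (m splits of a data set X, as in the paper's setting):
    # pairs = [pair for split in np.array_split(X, m) for pair in mapper(split, k, rng)]
    # final_centers = reducer(pairs, k, rng)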
5. Improved MapReduce K-means++ initialization

Needed because the initialization algorithm runs many iterations, and in both Map and Reduce all data points are scanned at every iteration to check whether they can switch to a nearer center (O(nk^2) distance computations for finding k centers). Solution: prune the unneeded distance computations, such as those of points far away from a newly added center, which simply keep their old center. This uses the triangle inequality (a sketch of the test is given in the note at the end of this summary). Although this solution is good enough for the moment, it can be further improved.

6. Experimental results

Homogeneous Hadoop cluster (latest version at the time), 12 machines (1 master + 11 slaves). Several data sets are used. Four evaluations are made:
1 – Efficiency of the MapReduce K-means++ initialization vs. the Scalable K-means++ initialization. The MR K-means++ initialization proposed in this paper outperforms Scalable K-means++ by the simple fact of the number of jobs used (only 1), and it is effective in I/O and communication cost.
2 – Approximation of the MapReduce K-means++ initialization vs. MapReduce Random initialization. The MR K-means++ initialization of this paper obtains a better approximation than the Random one and shows a steadier approximation trend as the number of centers increases.
3 – Approximation of MapReduce K-means++ vs. MapReduce K-means. MapReduce K-means++ provides a better approximation than MapReduce K-means, and the difference in approximation value becomes smaller as the iteration number increases. The sum of squared errors (SSE) of MR K-means is 10 times larger than that of MR K-means++ on one of the data sets.
4 – Efficiency of the MapReduce K-means++ initialization vs. the Improved MapReduce K-means++ initialization. With increasing k, the Improved MapReduce K-means++ performs significantly fewer distance calculations than MapReduce K-means++ (99% and 18% fewer in the Map phase, 96% and 22% fewer in the Reduce phase, for the two data sets respectively).

7. Conclusion

The paper brought as novelties:
* an efficient K-means++ initialization with MapReduce, which uses 1 job for finding the k centers and achieves an O(α^2) approximation of the optimal K-means solution;
* an improved K-means++ initialization obtained by applying a pruning strategy, which boosts speed and efficiency.
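* Note on the pruning test of Section 5 (my reconstruction from the triangle inequality, not necessarily the paper's exact lemma): for a point x with current nearest center c_old, a newly added center c_new satisfies d(x, c_new) >= d(c_old, c_new) − d(x, c_old); hence if d(c_old, c_new) >= 2 · d(x, c_old), then d(x, c_new) >= d(x, c_old), so x keeps its old center and d(x, c_new) never has to be computed. A hypothetical Python/NumPy sketch of the resulting update step (all names mine):

    import numpy as np

    def update_distances_pruned(X, dist, owner, centers, c_new):
        # dist[i] = distance from X[i] to its nearest center, owner[i] = that center's index
        # distances from the new center to every existing center (at most k of them)
        cc = np.linalg.norm(centers - c_new, axis=1)
        # triangle-inequality test: if cc[owner[i]] >= 2 * dist[i], the new center
        # cannot be closer to X[i], so the distance d(X[i], c_new) is pruned
        cand = np.where(cc[owner] < 2 * dist)[0]
        d_new = np.linalg.norm(X[cand] - c_new, axis=1)
        closer = d_new < dist[cand]
        dist[cand[closer]] = d_new[closer]
        owner[cand[closer]] = len(centers)   # c_new's index once appended to centers
        return dist, owner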