Integrated Intelligent Research (IIR)
International Journal of Data Mining Techniques and Applications Volume: 02 Issue: 02 December 2013 Page No.59-63 ISSN: 2278-2419
Improve the Performance of Clustering Using a Combination of Multiple Clustering Algorithms

Kommineni Jenni, Sabahath Khatoon, Sehrish Aqeel
Lecturers, Dept. of Computer Science, King Khalid University, Abha, Kingdom of Saudi Arabia
Email: Jenni.k.507@gmail.com, sabakhan_312@yahoo.com, sehrishaqeel.kku@gmail.com

Abstract - The ever-increasing availability of textual documents has led to a growing challenge for information systems to effectively manage and retrieve the information contained in large collections of texts according to the user's information needs. No single clustering method can adequately handle all sorts of cluster structures and properties (e.g. shape, size, overlapping, and density). Combining multiple clustering methods is an approach to overcoming the deficiencies of single algorithms and further enhancing their performance. A disadvantage of cluster ensembles is the high computational load of combining the clustering results, especially for large and high-dimensional datasets. In this paper we propose a multi-clustering algorithm that combines a Cooperative Hard-Fuzzy Clustering model, based on intermediate cooperation between hard k-means (KM) and fuzzy c-means (FCM) to produce better intermediate clusters, with an ant colony algorithm. The proposed method gives better results than the individual clusterings.

Keywords - Clustering; density; pheromone; ant colony algorithm; k-means; hard k-means; fuzzy c-means; Cooperative Hard-Fuzzy

I. INTRODUCTION

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A cluster is a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters. A large number of clustering methods [1]-[10] have been developed in many fields, with different definitions of clusters and similarity metrics. It is well known that no clustering method can adequately handle all sorts of cluster structures and properties (e.g. overlapping, shape, size and density). In fact, the cluster structure produced by a clustering method is sometimes an artifact of the method itself that is imposed on the data rather than discovered from its true structure. Combining multiple clusterings [11]-[15] is considered a way to further broaden and stimulate new progress in the area of data clustering. Clustering is an important technology in data mining, which can effectively discover useful information by analyzing the data. It groups data objects into classes or clusters so that objects within the same cluster have high similarity, while objects in different clusters vary widely [19]. Combined clustering can be classified into two categories based on the level of cooperation between the clustering algorithms: they cooperate either at the intermediate level or at the end-results level. Examples of end-results cooperation are the Ensemble Clustering and the Hybrid Clustering approaches [11]-[15]. Ensemble clustering is based on the idea of combining multiple clusterings of a given dataset X to produce a superior aggregated solution based on an aggregation function. Recent ensemble clustering techniques have been shown to be effective in improving the accuracy and stability of standard clustering algorithms. However, an inherent drawback of these techniques is the computational cost of generating and combining multiple clusterings of the data. In this paper, we propose a new Cooperative Hard-Fuzzy Clustering (CHFC) model based on obtaining the best clustering solutions from the hard k-means (KM) [5] and the fuzzy c-means (FCM) [6] at the intermediate steps, according to a cooperative criterion, to produce better solutions and to achieve faster convergence than both the KM and FCM over overlapped clusters. After obtaining these clusters, we apply a second clustering stage that simulates ants searching for food; thanks to the advantages of swarm intelligence, it can avoid falling into a locally optimal solution and obtain better clustering results. The rest of this paper is organized as follows: in Section II, related work on data clustering is given. The proposed hybrid model is presented in Section III. Section IV presents the multi-clustering algorithm. Experimental results are presented and discussed in Section V. Finally, we draw some conclusions and outline future work in Section VI.
II. RELATED WORK
Clustering algorithms have been developed and used in many fields. Hierarchical and partitioning methods are two major categories of clustering algorithms; [21], [22] provide an extensive survey of various clustering techniques. In this section, we highlight work done on document clustering. Many clustering techniques have been applied to clustering documents. For instance, [23] provided a survey on applying hierarchical clustering algorithms to clustering documents. [24] adapted various partition-based clustering algorithms to clustering documents. Another popular approach in document clustering is agglomerative hierarchical clustering [26]. Algorithms in this family follow a similar template: compute the similarity between all pairs of clusters and then merge the most similar pair. Different agglomerative algorithms may employ different similarity measuring schemes. K-means and its variants [25] represent the category of partitioning clustering algorithms. [25] illustrates that one of the variants, bisecting k-means, outperforms basic k-means as well as the agglomerative approach in terms of accuracy and efficiency. The bisecting k-means algorithm first selects a cluster to split, then utilizes basic k-means to form two sub-clusters, and repeats until the desired number of clusters is reached; a sketch is given below.
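As an illustrative sketch (not code from the cited papers), the following Python fragment shows one common way to realize bisecting k-means on top of a basic k-means routine; scikit-learn's KMeans is assumed, and the rule of always splitting the largest cluster is one typical choice among several:

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    """Repeatedly split the largest cluster with 2-means until k clusters exist."""
    clusters = [np.arange(len(X))]               # one cluster holding all point indices
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)              # select the cluster to split
        km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X[idx])
        clusters.append(idx[km.labels_ == 0])
        clusters.append(idx[km.labels_ == 1])
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels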
The k-means algorithm is simple, so it is widely used in text clustering. Due to the randomness of the initial center selection in the k-means algorithm, its results are unstable and easily fall into a local minimum point. Dunn then proposed the fuzzy c-means (FCM) algorithm, which was promoted by Bezdek [27]. It divides the n items of data into C fuzzy categories and minimizes an objective function. FCM is the most common fuzzy clustering algorithm [28]. However, the clustering results of the traditional fuzzy c-means algorithm lack stability, and its attribution of boundary values is deficient.

Combining clusterings invokes multiple clustering algorithms in the clustering process so that they benefit from each other to achieve a global benefit (i.e. they cooperate to attain better overall clustering quality). Current approaches to multiple clustering are ensemble clustering and hybrid clustering. Hybrid clustering assumes a cascaded model of multiple clustering algorithms, in which each level enhances the clustering of the previous one or reduces the size of the input representatives for the next level of the cascade. The hybrid PDDP-k-means algorithm [15] starts by running the PDDP algorithm [8] and enhances the resulting clustering solutions using the k-means algorithm. Hybrid clustering, however, violates the synchronous execution of the clustering algorithms, as one or more of them stay idle until a former algorithm finishes its clustering. Deneubourg et al. proposed the Basic Ant Colony Model (BM) for clustering: agents move randomly on a square grid of cells on which some objects are scattered; when a loaded agent comes to an empty cell, it performs a drop-down action if the estimated neighbor density is greater than a probability, otherwise it keeps moving to other empty cells.

III. PROPOSED METHOD

A. Cooperative Hard-Fuzzy Clustering

In the proposed CHFC [18] model, we aggregate both the KM and FCM algorithms so that they cooperate at each iteration synchronously (i.e. no idle time is wasted in waiting, as in the hybrid models), with the aim of reducing the total computational time compared to the FCM, achieving faster convergence to solutions, and producing an overall solution that is better than that of either clustering algorithm. Both KM and FCM attempt to reduce their objective function in every step and may terminate at a solution that is locally optimal. A bad choice of initial centroids can have a great impact on both the performance and the quality of the clustering; that is why many runs are executed and the overall best value is taken. A good choice of initial centroids also reduces the number of iterations required for the algorithm to converge. So when the KM algorithm is fed good cluster centroids from the neighboring FCM, it may produce a better clustering quality. In addition, when the KM produces good cluster centroids and sends them to the neighboring FCM, the FCM algorithm can generate appropriate memberships for each data object, yielding better cluster centroids. In the CHFC model, the KM and FCM exchange their centroids at each iteration, and each algorithm accepts or rejects the received set of centroids based on a cooperative quality criterion; the minimum objective function [18] is used as this criterion. Both the KM and FCM receive a solution (i.e. a set of k centroids) from the neighboring algorithm; each algorithm accepts the received solution and replaces its current centroids with the new set only if the new set minimizes its objective function J at that step. Catching a solution with a smaller objective function than the current one enables both the KM and the FCM to converge faster. The CHFC alternates between the KM in some iterations and the FCM in others, as the CHFC takes the best solution from either the KM or the FCM at each iteration (i.e. the solution with the minimum objective function value). This cooperative strategy is simple because the KM and FCM have the same structure of prototypes (centroids), which enables fast exchange of information between them. The Message Passing Interface (MPI) [16] is used to facilitate the communication between the KM and the FCM (e.g. send and receive operations). Assume that CKM and CFCM are the sets of k centroids generated by the KM and FCM algorithms, respectively, at any iteration. Both the KM and the FCM evaluate their current objective function J using their current centroids CKM and CFCM, respectively. Each algorithm accepts the set of centroids received from its neighbor only if it minimizes its current objective function J, and then updates its current set of centroids with the received set. These steps of exchanging and updating centroids are repeated until either the KM or the FCM converges to the desired clustering quality. A sketch of one such cooperation step is given below.
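The following Python sketch illustrates a single cooperation step under the minimum-objective-function criterion. It is a simplified, sequential approximation of the synchronous MPI-based exchange described above; the objective functions follow the standard k-means and FCM definitions, and the fuzzifier m = 2 is an assumed default:

import numpy as np

def j_km(X, C):
    # k-means objective: sum of squared distances to the nearest centroid
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return d2.min(axis=1).sum()

def j_fcm(X, C, m=2.0):
    # FCM objective: sum over i,j of u_ij^m * d_ij^2 with standard memberships
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1) + 1e-12
    u = 1.0 / (d2 ** (1.0 / (m - 1.0)))
    u /= u.sum(axis=1, keepdims=True)
    return ((u ** m) * d2).sum()

def exchange_centroids(X, c_km, c_fcm, m=2.0):
    # Each side accepts the neighbour's centroids only if they lower its own J.
    new_km = c_fcm if j_km(X, c_fcm) < j_km(X, c_km) else c_km
    new_fcm = c_km if j_fcm(X, c_km, m) < j_fcm(X, c_fcm, m) else c_fcm
    return new_km.copy(), new_fcm.copy()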
B. Quality measures

The clustering results of any clustering algorithm should be evaluated using informative quality measures that reflect the "goodness" of the resulting clusters. Two external quality measures (F-measure and entropy) [7]-[8] are used, which assume that prior knowledge about the data objects (e.g. class labels) is given. The Separation Index (SI) [17] is used as an internal quality measure, which does not require prior knowledge about the objects.

a) F-measure

The F-measure combines the precision and recall ideas from the information retrieval literature. The precision and recall of a cluster S_j with respect to a class R_i, i, j = 1, 2, ..., k, are defined as:

recall(R_i, S_j) = L_{ij} / |R_i|   (1)

precision(R_i, S_j) = L_{ij} / |S_j|   (2)

where L_{ij} is the number of objects of class R_i in cluster S_j, |R_i| is the number of objects in class R_i, and |S_j| is the number of objects in cluster S_j. The F-measure of a class R_i is defined as:

F(R_i) = \max_j \frac{2 \cdot precision(R_i, S_j) \cdot recall(R_i, S_j)}{precision(R_i, S_j) + recall(R_i, S_j)}   (3)

With respect to class R_i, we consider the cluster with the highest F-measure to be the cluster S_j that is mapped to class R_i, and that F-measure becomes the score for class R_i. The overall F-measure for a clustering result of k clusters is the weighted average of the F-measure of each class R_i:
overall F-measure = \frac{\sum_{i=1}^{k} |R_i| \cdot F(R_i)}{\sum_{i=1}^{k} |R_i|}   (4)

The higher the overall F-measure, the better the clustering, owing to the higher accuracy with which the resulting clusters map to the original classes.
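As a worked illustration of equations (1)-(4) (a sketch, not code from the paper), the overall F-measure can be computed from class and cluster label arrays as follows:

import numpy as np

def overall_f_measure(classes, clusters):
    # Weighted average over classes of the best F score of any cluster (eqs. 1-4)
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    total = 0.0
    for r in np.unique(classes):
        in_r = classes == r
        best_f = 0.0
        for s in np.unique(clusters):
            in_s = clusters == s
            l_ij = np.logical_and(in_r, in_s).sum()   # objects of class r in cluster s
            if l_ij == 0:
                continue
            recall = l_ij / in_r.sum()                # eq. (1)
            precision = l_ij / in_s.sum()             # eq. (2)
            best_f = max(best_f, 2 * precision * recall / (precision + recall))  # eq. (3)
        total += in_r.sum() * best_f
    return total / len(classes)                       # eq. (4): weights sum to n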
b) Entropy

Entropy tells us how homogeneous a cluster is. Assume a partitioning result of a clustering algorithm consisting of k clusters. For every cluster S_j we compute pr_{ij}, the probability that a member of cluster S_j belongs to class R_i. The entropy of each cluster S_j is calculated as:

E(S_j) = -\sum_{i=1}^{k} pr_{ij} \log(pr_{ij}), \quad j = 1, ..., k   (5)

The overall entropy for a set of k clusters is the sum of the entropies of the clusters, weighted by the size of each cluster:

overall Entropy = \sum_{j=1}^{k} \frac{|S_j|}{n} E(S_j)   (6)

The lower the overall entropy, the greater the homogeneity (or similarity) of objects within clusters.
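A corresponding sketch for equations (5) and (6), under the same array conventions as the F-measure example:

import numpy as np

def overall_entropy(classes, clusters):
    # Size-weighted sum of per-cluster class-distribution entropies (eqs. 5-6)
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    n, total = len(classes), 0.0
    for s in np.unique(clusters):
        members = classes[clusters == s]
        _, counts = np.unique(members, return_counts=True)
        pr = counts / len(members)                    # pr_ij
        e_sj = -(pr * np.log(pr)).sum()               # eq. (5)
        total += (len(members) / n) * e_sj            # eq. (6)
    return total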
C. Ant Clustering Algorithm

The ant colony algorithm [19] is a simulated evolutionary algorithm that imitates the collaborative behavior of real ants, with the work completed jointly by a number of ants. Each ant searches for solutions independently in the candidate solution space and leaves a certain amount of pheromone on the solutions it finds. The greater the amount of pheromone on a solution, the greater the likelihood of it being selected. The ant colony clustering method is inspired by this behavior. The ant colony algorithm is a heuristic global optimization algorithm: according to the amount of pheromone on the cluster centers, the surrounding data can be merged with them, thereby producing a clustering classification [20]. Here we use it to partition the input samples. The data to be clustered are regarded as ants with different attributes, and the cluster centers are regarded as the "food sources" that the ants need to find. Assume the input samples are X = {x_i | i = 1, 2, ..., n}. Determining the cluster centers is then the process of ants setting out from the colony to find food; during the search, different ants select data elements independently of each other. Suppose:

d_{ij} = ||m(x_i - x_j)||

where d_{ij} denotes the weighted Euclidean distance between x_i and x_j, and the weighting factor m can be set according to the contribution of the various components to the clustering. Let r be the clustering radius and ε the statistical error, and let τ_{ij}(t) denote the residual amount of pheromone on the path between x_i and x_j at time t; at the initial moment, the amount of pheromone is the same on all paths, τ_{ij}(0) = 0. The amount of pheromone on each path is given by equation (7):

\tau_{ij}(t) = \begin{cases} 1, & d_{ij} \le r \\ 0, & d_{ij} > r \end{cases}   (7)

The ant transition probability is given by equation (8), where η_{ij} denotes the expectation of the degree to which x_i will be integrated into X_j, generally taken as 1/d_{ij}:

P_{ij}(t) = \begin{cases} \frac{[\tau_{ij}(t)]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{s \in U} [\tau_{is}(t)]^{\alpha} [\eta_{is}]^{\beta}}, & j \in U \\ 0, & \text{otherwise} \end{cases}   (8)

If p_{ij}(t) ≥ p_0, x_i will be integrated into X_j. Suppose C_j = {x_k | d_{kj} ≤ r, k = 1, 2, ...}, where C_j denotes the set of all data integrated into the neighborhood of X_j. The ideal cluster center is then:

c_j = \frac{1}{J} \sum_{k=1}^{J} x_k, \quad x_k \in C_j   (9)

The specific steps of the ant colony clustering algorithm are as follows (a code sketch follows the list):

1. First, choose k representative points randomly, according to experience. In general, the initial choice of representative points tends to affect the result of the iteration; it is likely to yield a local optimal solution rather than the global one. To address this issue, different initial representative points can be tried in order to avoid falling into a local minimum.
2. Initialization: set N, m, r, ε_0, α, β, τ_{ij}(0) = 0, and p_0.
3. Compute d_{ij} = ||m(x_i - x_j)||.
4. Calculate the amount of pheromone on the various paths using equation (7).
5. Calculate the probability that x_i is integrated into X_j using equation (8).
6. Determine whether p_{ij}(t) ≥ p_0 holds; if it is satisfied, continue; otherwise set i = i + 1 and go to step 3.
7. Compute the cluster center c_j according to equation (9).
8. Compute the deviation error of the jth cluster and the overall error ε = \sum_{j=1}^{k} D_j, where D_j denotes the deviation error of the jth cluster.
9. Determine whether ε ≤ ε_0 holds; if it is satisfied, output the clustering results; if not, go to step 3 and continue the iteration.
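The following Python sketch implements one pass of steps 3-8 above. The merge rule (each point joins the neighborhood it is most likely to transit to, provided p_ij ≥ p_0) is one plausible reading of steps 5-6, and the parameter values are placeholders:

import numpy as np

def ant_clustering_pass(X, r=1.0, alpha=1.0, beta=2.0, p0=0.1, m=None):
    n, dim = X.shape
    w = np.ones(dim) if m is None else m
    # Step 3: weighted Euclidean distances d_ij = ||m(x_i - x_j)||
    diff = (X[:, None, :] - X[None, :, :]) * w
    d = np.sqrt((diff ** 2).sum(axis=-1))
    neighborhood = d <= r                       # C_j membership (includes x_j itself)
    dt = d.copy()
    np.fill_diagonal(dt, np.inf)                # exclude self-transitions
    # Step 4 (eq. 7): pheromone on each path
    tau = (dt <= r).astype(float)
    # Eq. (8): transition probabilities with eta_ij = 1/d_ij
    eta = 1.0 / (dt + 1e-12)
    num = (tau ** alpha) * (eta ** beta)
    denom = num.sum(axis=1, keepdims=True)
    p = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    # Steps 5-6: x_i joins the neighborhood of its most probable x_j if p_ij >= p0
    target = p.argmax(axis=1)
    merged = p[np.arange(n), target] >= p0
    # Step 7 (eq. 9): centers of the populated neighborhoods
    centers = {j: X[neighborhood[:, j]].mean(axis=0) for j in np.unique(target[merged])}
    # Step 8: overall squared-deviation error over the formed neighborhoods
    eps = sum(((X[neighborhood[:, j]] - c) ** 2).sum() for j, c in centers.items())
    return centers, eps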
IV. MULTI CLUSTERING ALGORITHM

The multi-clustering algorithm consists of the following steps.
Initially, we apply the hard-fuzzy clustering algorithm to the input dataset. Hard-fuzzy clustering consists of a combination of k-means and fuzzy c-means clustering; these two algorithms are applied to the input dataset and, based on a distance measure, yield a set of intermediate clusters. In the second stage, we take the intermediate clusters as input and apply the ant clustering algorithm to them, as sketched below.
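A minimal end-to-end sketch of the two-stage pipeline, with scikit-learn's KMeans standing in for the cooperative hard-fuzzy stage and a simple radius-based merge of the intermediate centers standing in for the ant stage (k_intermediate and r are assumed tuning parameters):

import numpy as np
from sklearn.cluster import KMeans

def multi_clustering(X, k_intermediate=20, r=1.0):
    # Stage 1: produce intermediate clusters (stand-in for CHFC)
    km = KMeans(n_clusters=k_intermediate, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_
    # Stage 2: merge intermediate centers that lie within radius r of each other
    group = -np.ones(k_intermediate, dtype=int)
    next_id = 0
    for i in range(k_intermediate):
        if group[i] == -1:
            close = np.linalg.norm(centers - centers[i], axis=1) <= r
            group[np.logical_and(close, group == -1)] = next_id
            next_id += 1
    return group[km.labels_]                    # final cluster label per data point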
Finally, we obtain the desired output clusters. These output clusters give better performance than those of the individual algorithms.

V. EXPERIMENT RESULTS
We use the Mini News Group collection as the input dataset and apply the multi-clustering algorithm to it. The multi-clustering gives about 10% better performance than the individual clusterings. From the experimental results, the k-means algorithm and pheromone-based ant colony algorithms that select random sample points as initial cluster centers yield unstable clustering accuracy, a low average accuracy rate, and a poor clustering effect on actual data. The hard-fuzzy clustering gives somewhat better results. Next, we apply the ant clustering algorithm to the output of the hard-fuzzy clustering, which generates the desired output clusters. The final result performs better than the individual clusterings. We use precision and recall to calculate the performance, and the entropy of the individual clusters and of the multi-cluster result can also be calculated. The improved algorithm has a high and stable accuracy rate and can be used for clustering actual data. The performance of the hard-fuzzy clustering algorithm on the Mini News Group dataset is shown in Table 1 and Figure 1; the proposed method's results are shown in Table 2 and Figure 2.
Table 1. Performance of Hard-Fuzzy Clustering on the Mini News Group Dataset

No. of Documents	Performance
100	91.34%
200	90.45%
300	88.40%
400	85.37%
500	82.03%
600	79.50%
700	75.21%
800	71.99%
900	68.45%
1000	65.35%
1500	61.23%
2000	55.45%
2500	48.23%
Average	74.08%

Figure 1. F-measure of Hard-Fuzzy clustering on the Mini News Group Dataset
Table 2. Performance of Hard-Fuzzy + Ant Clustering on the Mini News Group Dataset

No. of Documents	Performance
100	95.45%
200	94.02%
300	93.33%
400	91.09%
500	89.99%
600	86.54%
700	85.77%
800	82.43%
900	80.99%
1000	78.60%
1500	76.21%
2000	70.45%
2500	67.34%
Average	84.02%

Figure 2. F-measure of Hard-Fuzzy + Ant clustering on the Mini News Group Dataset

VI. CONCLUSION

In this work we presented the Cooperative Hard-Fuzzy Clustering (CHFC) model combined with an ant clustering algorithm, with the goal of achieving better clustering quality and time performance. The multi-clustering model is based on intermediate cooperation between the k-means and fuzzy c-means algorithms, which exchange their best intermediate solutions; the ant clustering algorithm is then applied to the intermediate clusters to obtain the desired output clusters with better performance. The proposed model achieves faster convergence to solutions and produces an overall solution that is better than that of either clustering algorithm alone. It also performs better than the hybrid cascaded models.
REFERENCES
[1] A. Jain, M. Murty, and P. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.
[2] R. Xu, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 645-678, 2005.
[3] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques", Proceedings of the KDD Workshop on Text Mining, pp. 109-110, 2000.
[4] R. Duda and P. Hart, "Pattern Classification and Scene Analysis", John Wiley and Sons, 1973.
[5] J. Hartigan and M. Wong, "A k-means Clustering Algorithm", Applied Statistics, Vol. 28, pp. 100-108, 1979.
[6] J. Bezdek, R. Ehrlich, and W. Full, "The Fuzzy C-Means Clustering Algorithm", Computers and Geosciences, Vol. 10, pp. 191-203, 1984.
[7] S. M. Savaresi and D. Boley, "On the Performance of Bisecting K-means and PDDP", Proceedings of the 1st SIAM International Conference on Data Mining, pp. 1-14, 2001.
[8] D. Boley, "Principal Direction Divisive Partitioning", Data Mining and Knowledge Discovery, Vol. 2, No. 4, pp. 325-344, 1998.
[9] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231, 1996.
[10] K. Hammouda and M. Kamel, "Collaborative Document Clustering", 2006 SIAM Conference on Data Mining (SDM06), pp. 453-463, 2006.
[11] A. Strehl and J. Ghosh, "Cluster Ensembles - A Knowledge Reuse Framework for Combining Partitionings", Conference on Artificial Intelligence (AAAI 2002), AAAI/MIT Press, pp. 93-98, 2002.
[12] Y. Qian and C. Suen, "Clustering Combination Method", International Conference on Pattern Recognition (ICPR 2000), Vol. 2, pp. 732-735, 2000.
[13] H. Ayad and M. Kamel, "Cumulative Voting Consensus Method for Partitions with Variable Number of Clusters", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[14] Y. Eng, C. Kwoh, and Z. Zhou, "On the Two-Level Hybrid Clustering Algorithm", AISAT04 - International Conference on Artificial Intelligence in Science and Technology, pp. 138-142, 2004.
[15] S. Xu and J. Zhang, "A Hybrid Parallel Web Document Clustering Algorithm and its Performance Study", Journal of Supercomputing, Vol. 30, No. 2, pp. 117-131, 2004.
[16] W. Gropp, E. Lusk, and A. Skjellum, "Using MPI: Portable Parallel Programming with the Message Passing Interface", The MIT Press, Cambridge, MA, 1996.
[17] U. Maulik and S. Bandyopadhyay, "Performance Evaluation of Some Clustering Algorithms and Validity Indices", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 12, pp. 1650-1654, 2002.
[18] R. Kashef and M. S. Kamel, "Hard-Fuzzy Clustering: A Cooperative Approach", IEEE Conference, 2007.
[19] Lan Li, Wan-chun Wu, and Qiao-mei Rong, "Research on Hybrid Clustering Based on Density and Ant Colony Algorithm", Second International Workshop on Education Technology and Computer Science, 2010.
[20] L. Jing, M. K. Ng, and J. Huang, "An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 8, pp. 1026-1041, 2007.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.
[22] R. Xu, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 645-678, 2005.
[23] P. Willett, "Recent Trends in Hierarchical Document Clustering: A Critical Review", Information Processing and Management, Vol. 24, pp. 577-597, 1988.
[24] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 318-329, 1992.
[25] R. C. Dubes and A. K. Jain, "Algorithms for Clustering Data", Prentice Hall, Englewood Cliffs, NJ, 1988.
[26] M. Steinbach, G. Karypis, and V. Kumar, "A Comparison of Document Clustering Techniques", KDD Workshop on Text Mining, 2000.
[27] N. R. Pal and J. C. Bezdek, "On Cluster Validity for the Fuzzy c-Means Model", IEEE Transactions on Fuzzy Systems, Vol. 3, pp. 370-379, 1995.
[28] J. C. Bezdek and N. R. Pal, "Some New Indexes of Cluster Validity", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 28, pp. 301-315, 1998.