
Representative Based Method of Categorical Data Clustering

G. Hanumantha Rao*, G. Narender+, T. Balaji*, Y. Anitha*

* Assistant Professor, Department of CSE, Vasavi College of Engineering, Hyderabad-500 031, INDIA
+ Associate Professor, Department of CSE, Murthy Institute of Tech. & Science, Ranga Reddy-501301, INDIA

hanu.abc@gmail.com, guggillanarender@gmail.com, balaji_075@yahoo.co.in, anitha_yella@yahoo.co.in

Abstract— Clustering of categorical (unordered) data is one of the data mining techniques that helps in identifying groups of similar objects within the domain space. The objective of clustering is to group the most similar objects; here objects are grouped on the basis of minimum dissimilarity, so that the most similar objects fall in the same cluster. In this paper we present a representative based clustering method and evaluate it on well known data sets from the UCI repository, taking the soybean dataset as an example. That data set consists of 47 records, each containing 35 attributes describing the features of plants, with four classes of disease. The method has three phases. In the first phase the dissimilarity matrix, the neighbor matrix and the initial clusters are formed. In the second phase clusters are merged by relocating objects using the neighborhood concept. In the third phase the mode of the attributes of each cluster is computed, and Phases I and II are applied again to the tuples formed from these representatives.

Keywords— Categorical data, Clustering, Data Mining, Dissimilarity, Mode.

I. INTRODUCTION

Data mining [1] involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets [6]; it is a technique to extract valid, novel, potentially useful information from complex and huge data sets [2]. The two fundamental goals of data mining are prediction and description, otherwise known as the verification model and the discovery model. Important data mining tasks include association rules, classification rules and clustering. There are two major types of prediction: one can either try to predict unavailable data values or pending trends, or predict a class label for some data. Prediction, however, more often refers to the forecast of missing numerical values, or of increase/decrease trends in time related data; the major idea is to use a large number of past values to estimate probable future values. Description summarizes data in terms of human-interpretable patterns. Association rule mining discovers relations between variables in large databases [7]. Data classification is the categorization of data for its most effective and efficient use [8]. A data set (or dataset) is a collection of


data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data set in question [13], giving its values for each of the variables, such as the height and weight of an object. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.

II. OVERVIEW OF CLUSTERING



A cluster is a collection of data objects that are more similar to one another within the same cluster and dissimilar to the objects in other clusters. Unsupervised learning deals with instances which have not been pre-classified in any way, and which therefore have no class attribute associated with them. Clustering is the process of grouping data into groups based on a similarity measure, and it is often used as a preprocessing step for other tasks such as characterization and classification.

Clustering in data mining is a discovery process that groups a set of data such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. In general, clustering is divided into two broad categories, hierarchical and partitional. Partitional clustering techniques partition the database into a pre-defined number of clusters based on some criterion. Hierarchical clustering techniques are divided into agglomerative and divisive: agglomerative clustering [3] follows a bottom-up strategy, whereas the divisive approach follows a top-down strategy. The basic principle of clustering hinges on the concept of a distance metric. Since the data are invariably real numbers in statistical and pattern recognition applications, a large class of metrics exists, and one can define one's own metric according to specific requirements. Data mining primarily works with large databases whose objects carry attributes of various data types, either numeric or non-numeric. Clustering can be performed on both numerical and categorical data. For clustering numerical data, geometric properties such as a distance function are used as the criterion. As




data clustering is mostly applied to real-time or transactional data sets, the attributes are often of both numerical and categorical type. Data objects are clustered based on a similarity measurement, most commonly a distance function: the similarity between two objects Oi and Oj is measured using the Euclidean or Manhattan distance. Clustering categorical data is very different from clustering numerical data in terms of the definition of the similarity measure, since distance-based metrics cannot be applied directly. Numerical clustering methods can be applied to categorical data through data preprocessing [4], but such preprocessing does not always produce quality clusters, so it is widely accepted to apply clustering to the raw categorical data. Here we use a similarity concept as the measurement: to maximize the intra-cluster similarity, the minimum-dissimilarity concept is used, as the sketch below illustrates.
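As a concrete anchor for what follows, here is a minimal sketch of the simple-matching dissimilarity commonly used for categorical records; the function name and sample records are ours, not the paper's:

```python
def dissimilarity(x, y):
    # Simple-matching dissimilarity: the number of attributes
    # on which two categorical records disagree.
    return sum(1 for a, b in zip(x, y) if a != b)

# Records over (graduation, branch, percentage); they differ only in branch.
print(dissimilarity(("b-tech", "cse", "70%"), ("b-tech", "ece", "70%")))  # -> 1
```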

III. RELATED WORKS

The basic idea of clustering is to group together similar objects [15]. A few existing categorical clustering algorithms are discussed here. The k-means problem [14] is based on a simple iterative scheme for finding a locally minimal solution: given a set of N points X(I) in M dimensions, the goal is to arrange the points into K clusters, each cluster having a representative point Z(J), usually chosen as the centroid of its points. The algorithm solving it is often called the k-means algorithm [9]. The K-prototypes algorithm is based on K-means [5] but removes the numeric-data limitation; it is applicable to both numeric and categorical data. The following are the steps of the K-modes algorithm [11]. Given a set X of categorical data vectors in P^m and a specified number k of desired clusters, do



Step 1: Begin with an initial partition X = X1(0) ∪ X2(0) ∪ ... ∪ Xk(0).

Step 2: Update the modes of each cluster according to (3) to obtain C1(t), C2(t), ..., Ck(t).

Step 3: Re-test the similarity of all data vectors, cluster by cluster, against each mode vector in the following way: if a vector from Xi(t) is found to be strictly nearer to Cj(t) than to the current Ci(t), reallocate that vector to cluster Xj(t+1), obtaining a new partition X = X1(t+1) ∪ X2(t+1) ∪ ... ∪ Xk(t+1).

Notice that ties here are biased so that the mode of a data vector's current cluster is preferred.

Step 4: Go back to Step 2 and repeat the iteration until no object has changed cluster assignment after a full cycle of the whole data set, that is, Xi(t+1) = Xi(t) for all i = 1, ..., k.
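The loop above can be sketched compactly as follows. This is our illustrative reading, not the paper's code; it reuses the dissimilarity function from Section II and assumes the data are equal-length tuples with len(data) >= k:

```python
import random
from collections import Counter

def mode_vector(cluster):
    # Column-wise mode: the most frequent value of each attribute.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(data, k, max_iter=100):
    # Step 1: begin with a random initial partition into k non-empty clusters
    # (assumes len(data) >= k).
    labels = [i % k for i in range(len(data))]
    random.shuffle(labels)
    modes = [None] * k
    for _ in range(max_iter):
        # Step 2: update the mode of each cluster.
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            if members:  # keep the old mode if a cluster empties out
                modes[j] = mode_vector(members)
        # Step 3: reallocate vectors that are strictly nearer to another mode;
        # the tie-break (j != l) prefers the vector's current cluster.
        new_labels = [min(range(k), key=lambda j: (dissimilarity(x, modes[j]), j != l))
                      for x, l in zip(data, labels)]
        # Step 4: stop once a full cycle changes no assignment.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, modes
```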


K-means is one of the simplest unsupervised learning algorithms [12]. K-means clustering is a method of cluster analysis which aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (say k) fixed a priori. The steps of the K-means algorithm are as follows [10]:

Step 1: Place k points into the space represented by the objects being clustered; these points represent the initial group centroids (equivalently, start with a random partition into k clusters).

Step 2: Generate a new partition by assigning each pattern to its closest cluster center.

Step 3: Compute new cluster centers as the centroids of the clusters.

Step 4: Steps 2 and 3 are repeated until there is no change in membership (the cluster centers also remain the same). A compact sketch follows.
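For comparison with the categorical variants, here is a bare-bones numeric K-means in the same spirit; this is a sketch under the assumption that no cluster becomes empty, not the paper's code:

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = np.argmin(((points[:, None, :] - centroids) ** 2).sum(axis=2), axis=1)
        # Step 3: recompute centroids (assumes every cluster stays non-empty).
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```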


The K-modes algorithm, an extension of K-means, clusters categorical data by replacing the means with modes. The K-representative algorithm is another extension of K-means, using relative frequency as the measure to cluster categorical data. We introduce a new representative based algorithm below.

IV. PROPOSED ALGORITHM



Let T be a set of n objects having categorical data with m attributes. The proposed algorithm is divided into three phases: the initial clustering is carried out in the first phase, merging of clusters is performed in the second phase, and the mode of the attributes of each cluster is generated in the third phase. Phases I and II are then applied to the dataset of representatives obtained in the third phase.

A. Proposed Representative Based Algorithm

Let the object list = [1, 2, 3, ..., n], where n is the number of tuples/records and m is the number of attributes.

Phase I

The steps involved in this phase are detailed below:

Step 1: Construct a dissimilarity matrix 'd' using the dissimilarity measurement described above.




Step 2: Compute the threshold value and the minimum dissimilarity of each object.

Step 3: Construct a neighbor matrix 'neigh'.

Step 4: Select the first member of the object list and form a new cluster with this object as a member. Group the neighbors of the object into this cluster based on the criteria, and remove the clustered objects from the object list.

Step 5: Repeat the above step until the object list becomes empty.


Phase II

The steps involved in merging of clusters are detailed below:

Step 1: Select the cluster with the least number of objects.

Step 2: The objects in the selected cluster are relocated based on the cluster merging criteria.



Step 3: The above steps are repeated until no more merging is possible.
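The cluster merging criteria are not defined precisely in the paper. One plausible reading, used in the sketch below and consistent with the illustration that follows, is that each object of the smallest cluster is relocated to a cluster that already contains one of its neighbors, and merging stops when no object can move:

```python
def phase_two(clusters, neigh):
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Step 1: select the cluster with the least number of objects.
        smallest = min(clusters, key=len)
        rest = [c for c in clusters if c is not smallest]
        stranded = []
        for o in smallest:
            # Step 2 (our reading of the unspecified criterion): o joins the
            # first cluster that already holds one of its neighbors.
            target = next((c for c in rest if any(j in c for j in neigh[o])), None)
            if target is not None:
                target.append(o)
            else:
                stranded.append(o)
        # Step 3: stop when no object of the smallest cluster could move.
        if len(stranded) == len(smallest):
            break
        clusters = rest + ([stranded] if stranded else [])
    return clusters
```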


Phase III


Compute the mode of each column (attribute) over all objects in each cluster, where the attribute value with maximum frequency is defined as the mode of that attribute. If the number of clusters produced in Phase II is K1, then this phase results in K1 tuples. Consider these K1 tuples with m attributes as a new dataset and repeat Phase I and Phase II on it.
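A sketch of Phase III and of the overall pipeline, reusing the helpers introduced above (phase_one, phase_two and mode_vector are our illustrative names, not the paper's):

```python
def phase_three(clusters, data):
    # One representative tuple per cluster: the column-wise mode.
    return [mode_vector([data[i] for i in c]) for c in clusters]

def representative_clustering(data):
    clusters, _, neigh = phase_one(data)          # Phase I
    clusters = phase_two(clusters, neigh)         # Phase II
    reps = phase_three(clusters, data)            # Phase III: K1 tuples
    rep_clusters, _, rep_neigh = phase_one(reps)  # Phases I and II again,
    return phase_two(rep_clusters, rep_neigh)     # now on the representatives
```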

B. Illustration

The proposed algorithm is illustrated with the sample dataset given in Table-1, which consists of 7 tuples/objects with 4 attributes. Here n = 7 and m = 3, the EMAIL-ID attribute being treated as the class label and excluded from the dissimilarity computation. The dissimilarity matrix is depicted in Table-2.

TABLE-1 SAMPLE DATA SET

OBJECT  GRADUATION  BRANCH  PERCENTAGE  EMAIL-ID
1       b-tech      cse     70%         yes
2       b-tech      ece     70%         no
3       m-tech      eee     60%         no
4       m-tech      eee     55%         no
5       b-tech      ece     65%         no
6       b-tech      cse     80%         yes
7       b-tech      cse     75%         yes

TABLE-2 DISSIMILARITY MATRIX

OBJECT  1  2  3  4  5  6  7
1       -  1  3  3  2  1  1
2       1  -  3  3  1  2  2
3       3  3  -  1  3  3  3
4       3  3  1  -  3  3  3
5       2  1  3  3  -  2  2
6       1  2  3  3  2  -  1
7       1  2  3  3  2  1  -




The minimum dissimilarity value of each object is m(.) = [1, 1, 1, 1, 1, 1, 1], and the neighbors of each object, 'neigh', are given in Table-3.

TABLE-3 NEIGHBOR MATRIX

OBJECT  NEIGHBORS
1       2,6,7
2       1,5
3       4
4       3
5       2
6       1,7
7       1,6

The object list is initialized with the values 1, 2, ..., n, where n is the number of objects. Consider the first object O1 = 1; neigh(O1) is {2, 6, 7}. Objects 2, 6 and 7 are grouped with object 1 into cluster C1 = {1, 2, 6, 7}. The object list now contains objects 3, 4 and 5. In the next iteration object 3 is selected and C2 is formed with object 3 as a member, and the process repeats. At the end of Phase I we get 3 clusters, which are given in Table-4.

TABLE-4 RESULTANT CLUSTERS AFTER PHASE-1

CLUSTER NUMBER  OBJECTS
1               1,2,6,7
2               3,4
3               5


In Phase II, the cluster with the least number of objects, C3, is selected. Using the cluster merging criteria, object 5 is placed in cluster C1. Thus the 3 clusters are reduced to 2 clusters, viz. C1 = {1, 2, 6, 7, 5} and C2 = {3, 4}. The modes of clusters C1 and C2 are {b-tech, cse, 70%} and {m-tech, eee, 60%}. Phase I and Phase II are then applied to these two tuples; for this sample data set, Phase III yields the same results.
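Running the Phase I and Phase II sketches on Table-1 (with EMAIL-ID held out as the class label, per the assumption above) reproduces this walkthrough; note that the code indexes objects from 0:

```python
data = [("b-tech", "cse", "70%"), ("b-tech", "ece", "70%"),
        ("m-tech", "eee", "60%"), ("m-tech", "eee", "55%"),
        ("b-tech", "ece", "65%"), ("b-tech", "cse", "80%"),
        ("b-tech", "cse", "75%")]

clusters, d, neigh = phase_one(data)
print(clusters)                    # [[0, 1, 5, 6], [2, 3], [4]] -> Table-4
print(phase_two(clusters, neigh))  # [[0, 1, 5, 6, 4], [2, 3]]  -> C1, C2
```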






C. Results Done Experimentally

The proposed method was experimented with a real data set, Soybean small. The Soybean small dataset consists of 47 records, each containing 35 attributes describing the features of plants, with four classes of disease. The method produces 10 clusters after Phase II (listed in Table-6) and 4 clusters after Phase III.


D. Measuring the Purity of Clusters



A cluster is called a pure cluster if all of its objects belong to a single class. To measure the efficiency of the proposed method, we use the clustering accuracy measure. The clustering accuracy r is defined as

r = (1/n) * Σ_{l=1}^{k} a_l

where a_l is the number of data objects that occur in both cluster C_l and its corresponding labeled class, and n is the number of objects in the data set.
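A small sketch of this measure, taking the majority class of each cluster as its "corresponding labeled class" (a common reading; the function name is ours):

```python
from collections import Counter

def clustering_accuracy(clusters, true_labels):
    # r = (1/n) * sum_{l=1..k} a_l, where a_l counts the objects of cluster
    # C_l that carry the cluster's majority class label.
    n = sum(len(c) for c in clusters)
    hits = 0
    for c in clusters:
        majority = Counter(true_labels[i] for i in c).most_common(1)[0][0]
        hits += sum(1 for i in c if true_labels[i] == majority)
    return hits / n

# Example: two clusters over six labeled objects.
print(clustering_accuracy([[0, 1, 2], [3, 4, 5]],
                          ["A", "A", "B", "B", "B", "B"]))  # -> 5/6 ≈ 0.83
```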


In our proposed algorithm the number of clusters K is not given as input; instead, merging via representatives reduces the number of clusters. It is evident that when the number of clusters is larger, the purity is also higher, so it is proposed here to select the clustering with high purity. The proposed method was compared with the K-modes algorithm, whose results depend on the mode selection and which must be re-executed to minimize the cost function; in the proposed method the results are the same whatever the order of the objects. The neighbor matrix computed for the soybean dataset is shown in Table-5.



TABLE-5 NEIGHBOR MATRIX OF SOYBEAN DATASET

OBJECT  NEIGHBORS   OBJECT  NEIGHBORS   OBJECT  NEIGHBORS
1       10          17      13,15,19    33      39,43
2       6,7         18      11,13       34      31,41
3       8           19      14          35      43
4       10          20      13,14       36      43
5       1,10        21      25          37      47
6       2,4,9       22      27          38      33,45
7       2,3,4       23      27          39      33,40
8       3           24      25,30       40      39
9       10          25      21,24       41      32,34,37
10      1           26      29          42      34
11      15,18       27      22,23       43      33,36
12      16          28      24,25       44      31,36
13      18,20       29      26          45      38
14      19,20       30      24          46      31
15      11          31      34          47      37
16      12          32      37









TABLE-6 RESULTANT CLUSTERS OF SOYBEAN DATASET AFTER PHASE-2

CLUSTER NUMBER  OBJECTS
1               1,5,10
2               2,3,4,6,7,8,9
11              11,13,14,15,17,18,19,20
12              12,16
21              21,24,25,28,30
22              22,23,27
26              26,29
31              31,32,34,37,47
33              33,36,39,40,43
42              44,45,46,47



V. CONCLUSION

In many existing clustering methods, the number of clusters is given as an input parameter. In the proposed method, a dissimilarity measurement is used to cluster the categorical data, and the objects are clustered without taking K or a threshold value as input. The experimental results show the efficiency of the proposed method. Because of the computational overhead of constructing the dissimilarity matrix, we have experimented only with the small soybean data set; we plan to extend this work to large data sets.

References

[1] Arun K. Pujari, Data Mining Techniques, Universities Press, 2001.
[2] UCI Machine Learning Repository, www.ics.uci.edu/~mlearn/MLRepository.html
[3] S. Aranganayagi and K. Thangavel, "A Novel Clustering Algorithm for Categorical Data," Computational Mathematics, Narosa Publishing House, New Delhi, India.
[4] Maria Halkidi, Yannis Batistakis and Michalis Vazirgiannis, "On Clustering Validation Techniques," Journal of Intelligent Information Systems.
[5] Ohn Mar San, Van-Nam Huynh and Yoshiteru Nakamori, "An Alternative Extension of the K-Means Algorithm for Clustering Categorical Data," J. Appl. Math. Comput. Sci.
[6] http://www.fas.org/irp/crs/RL31798.pdf
[7] http://en.wikipedia.org/wiki/Association_rule_learning
[8] http://searchdatamanagement.techtarget.com/definition/data-classification
[9] Tapas Kanungo et al., "An Efficient k-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence.
[10] Johan Everts, "Clustering Algorithms," Kunstmatige Intelligentie / RuG.
[11] N. Orlowski, D. Schlorff, J. Blevins, D. Cañas, M. T. Chu and R. E. Funderlic, "The Effects of Ties on Convergence in K-Modes Variants for Clustering Categorical Data," N.C. State University, Department of Computer Science.
[12] http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html
[13] http://en.wikipedia.org/wiki/Data_set
[14] http://people.sc.fsu.edu/~jburkardt/m_src/kmeans/kmeans.html
[15] http://www.astro.princeton.edu/~gk/A542/matt.pdf

