
International Journal of Computer & Organization Trends – Volume 5 Number 2 – February 2014

Modified Ant-Miner for Large Datasets

Mamta Yadav#1, Mohit Khandelwal*2
M.Tech Scholar#1, Associate Professor*2
IET, Alwar, India

Abstract— Data mining is the term used to describe the process of extracting value from a database. The data warehouse is the location where the information is stored, and the data stored depend largely on the type of industry and company. Ant-Miner has produced good results when compared with more conventional data mining algorithms, and it is still a relatively recent algorithm, which invites further research aimed at improving it. Ant-Miner, however, is not beneficial for very large datasets. This paper improves Ant-Miner by combining it with k-means clustering, so that it can be used on large datasets. The data set used, "Weight Lifting Exercises monitored with Inertial Measurement Units", is taken from the UCI repository. The simulation results show better performance of the modified Ant-Miner compared with the existing Ant-Miner, with a reduced tree size.

Keywords— Data mining, Ant-Miner, K-Means

I. INTRODUCTION

Data mining is the process of extracting useful knowledge from real-world data. There are several data mining tasks, such as clustering and classification. In the classification task, the aim is to discover, from training data consisting of records whose class is known, a classification model that can be used to predict the class of cases in test data containing unknown-class cases. A popular category of classification model consists of classification rules, and the aim of a classification algorithm is to discover a set of such rules [1]. One algorithm for solving this task is Ant-Miner, proposed by Parpinelli and colleagues [2], which employs ant colony optimization techniques to discover classification rules of the form

IF (term1 AND term2 AND ...) THEN (predicted class)

where each term is of the form <attribute = value>, and different rules can have different numbers of terms in their antecedent (IF part). The consequent of a rule is a predicted class, i.e., the value that the rule predicts for the class attribute when an example satisfies the conjunction of terms in the rule antecedent. Classification rules have the advantage of representing knowledge at a high level of abstraction, so that they are intuitively comprehensible to the user [3].

II. LITERATURE REVIEW

1. ANT-MINER

Ant-Miner has produced good results when compared with more conventional data mining algorithms, and it is still a relatively recent algorithm, which suggests further research trying to improve it. In the original Ant-Miner, the goal of the algorithm was to produce an ordered list of rules, which was then applied to test data in the order in which the rules were discovered. This makes the rules at the end of the list difficult to interpret, since their conditions make sense only in the context of all the previous rules in the ordered list [3]. Discovering an unordered rule set instead makes the discovered rules easier for the user to interpret, because the interpretation of each rule is then independent of all the other discovered rules.

1.1 A Brief Description of the Ant-Miner Algorithm

ISSN: 2249-2593

http://www.ijcotjournal.org

The original Ant-Miner algorithm, upon which the Unordered Rule Set Ant-Miner was proposed, is the following:

TrainingSet = {all training cases};
DiscoveredRuleList = [ ]; /* initialize rule list with empty list */
WHILE (|TrainingSet| > Max_uncovered_cases)
  t = 1; /* ant index, and also rule index */
  j = 1; /* convergence test index */
  Initialize all trails with the same amount of pheromone;
  REPEAT
    The ant starts with an empty rule and incrementally constructs a
    classification rule Rt by adding one term at a time to the current rule;
    Prune rule Rt; /* remove irrelevant terms from rule */
    Update the pheromone of all trails by increasing pheromone in the trail
    followed by the ant (proportional to the quality of Rt) and decreasing
    pheromone in the other trails (simulating pheromone evaporation);
    IF (Rt is equal to Rt-1) /* update convergence test */
    THEN j = j + 1;
    ELSE j = 1;
    END IF
    t = t + 1;
  UNTIL (t ≥ No_of_ants) OR (j ≥ No_rules_converg)
  Select the best rule Rbest among all rules Rt constructed by all the ants;
  Add rule Rbest to DiscoveredRuleList;
  TrainingSet = TrainingSet - {set of cases correctly covered by Rbest};
END WHILE

Ant-Miner discovers an ordered list of classification rules based on a heuristic function involving information gain (a popular heuristic function in data mining [1]) and positive feedback involving artificial pheromone. In each iteration of the Repeat-Until loop, an ant attempts to discover a rule by selecting terms in a probabilistic manner, until all the attributes have been used in the current rule or until adding any other available term would make the rule coverage less than min_cases_per_rule, a user-specified threshold. The discovered rule is then pruned in an attempt to reduce overfitting to the training data and increase rule quality. The pheromone values for the terms in the current rule are then increased, in order to raise the probability that other ants will select those terms, and the pheromone values for all terms are normalized. The While loop iterates until the number of training examples remaining in the dataset becomes less than or equal to Max_uncovered_cases, another user-specified threshold. The rule discovered in the Repeat-Until loop that has the highest quality is then added to the list of discovered rules, and the training examples correctly covered by that rule are removed from the training dataset [1].

2. K-MEAN CLUSTERING

The simple definition of k-means clustering is to classify data objects, based on their attributes or features, into K groups [4], where K is a positive integer. K-means is a prototype-based (centre-based) clustering technique, one of the algorithms that solve the well-known clustering problem; it creates a one-level partitioning of the data objects. K-means (KM) defines a prototype in terms of a centroid, which is the mean of a group of points, and is applied to continuous n-dimensional space. Another technique as prominent as K-means is K-medoid, which defines a prototype as the medoid, the most representative point for a group; it can be applied to a wide range of data since it needs only a proximity measure for a pair of objects. The difference from the centroid is that the medoid corresponds to an actual data point [5]. The process called "k-means" appears to give partitions that are reasonably efficient in the sense of within-class variance; this is corroborated to some extent by mathematical analysis and practical experience. Moreover, the k-means procedure is easily programmed and computationally economical, so that it is feasible to process very large samples on a digital computer [4].

2.1 The Basic K-means Algorithm


This technique is one of the simplest algorithms; we begin with a description of the basic algorithm. Suppose we have data points D = (X1, ..., Xn). First, choose from these data points K initial centroids, where K is a user parameter, the number of clusters desired. Each point is then assigned to its nearest centroid. In other words, we want to identify groups of data points and assign each point to one group: the idea is to choose random cluster centres, one for each cluster. The centroid of each cluster is then updated to the mean of the points assigned to it, which becomes the new centroid. We then repeat the assignment and update steps until no point changes cluster, that is, no point moves from one cluster to another, or, equivalently, each centroid remains the same [4].

Basic K-Mean Clustering:

a. Choose k points as initial centroids.
b. Repeat:
c. Assign each point to the closest cluster centre.
d. Re-compute the cluster centre of each cluster.
e. Until the convergence criterion is met.

III. PROPOSED ALGORITHM
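The K-means loop above (steps a-e) is one half of the proposed method, so it helps to see it as runnable code first. The sketch below is a minimal illustration only, assuming squared Euclidean distance as the proximity measure; it is not the implementation used in the paper's experiments.

```python
import random

def kmeans(points, k, max_iters=100):
    """Basic K-means on a list of equal-length tuples of floats."""
    # a. Choose k points as initial centroids
    centroids = random.sample(points, k)
    for _ in range(max_iters):                      # b. Repeat
        # c. Assign each point to the closest cluster centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # d. Re-compute each cluster centre as the mean of its points
        # (an empty cluster simply keeps its previous centroid in this sketch)
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # e. Until the convergence criterion is met (no centroid moved)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

With two well-separated groups of points and k = 2, the returned centroids typically settle near the two group means; production implementations usually re-seed empty clusters rather than freezing their centroids.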

The proposed algorithm performs clustering and then applies Ant-Miner on each cluster to classify the data. The process can be explained by the following algorithm.

1. Input: a data set having n instances.
2. Generate the classification algorithm.
3. Select K random points (data) that act as the centroids.
4. For each point in the dataset, repeat steps 5 and 6.
5. Calculate the similarity of the point in the dataset with the K selected points (centroids) by using the formula given below.





6. The point is clustered into the group having the highest similarity.
End for.
7. Calculate the new K centroids by taking the mean of all points (data) within each cluster.
8. If the new centroid points != the old centroid points, then go to step 4.
9. For l = 1 to K, apply steps 10 and 11.
10. Apply the classification rules on cluster l.
11. Get the generated classes.
End for.

Table 2: Performance Comparison of Existing and Proposed Algorithm

Parameters                 Existing Algorithm   Proposed Algorithm
TP rate                    0.969                0.971
FP rate                    0.011                0.01
Precision                  0.969                0.971
Recall                     0.969                0.971
ROC                        0.969                0.971
Classification accuracy    96.91                98.4
Size of tree               92                   63
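The parameters reported in Table 2 all derive from confusion-matrix counts. As a quick illustration of how they are computed (binary case with hypothetical counts, not taken from the paper's experiments; WEKA reports class-weighted averages for multi-class data):

```python
def binary_metrics(tp, fp, fn, tn):
    """Table-2 style parameters from binary confusion-matrix counts."""
    tp_rate = tp / (tp + fn)       # true-positive rate, also called recall
    fp_rate = fp / (fp + tn)       # false-positive rate
    precision = tp / (tp + fp)
    recall = tp_rate
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)  # percent, as in Table 2
    return tp_rate, fp_rate, precision, recall, accuracy

# Hypothetical counts for illustration only
tp_rate, fp_rate, precision, recall, accuracy = binary_metrics(tp=97, fp=3, fn=3, tn=97)
```

With these illustrative counts the routine yields a TP rate and precision of 0.97 and an accuracy of 97%, numbers of the same order as those in Table 2.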

The algorithm can provide efficient results on large datasets.

IV. DATA SET DESCRIPTION

In this paper, the "Weight Lifting Exercises monitored with Inertial Measurement Units" data set is used, taken from the UCI repository. For this data set, six young, healthy subjects were asked to perform five variations of the biceps curl weight-lifting exercise, one of which is the variation specified by the health professional [6].

Table 1: Various characteristics of data set

Data Set Characteristics     Multivariate
Number of instances          5090
Number of attributes         59
Attribute characteristics    Real
Missing Values?              Yes

Figure 1: Comparison of classification accuracy and size of tree

V. RESULT ANALYSIS

The implementation of the proposed algorithm is performed using the WEKA tool on the dataset described in the previous section. The new algorithm has to be added to the WEKA tool, which can be done using Eclipse. The results of the proposed and existing algorithms are shown in Table 2.

Figure 2: Comparison of existing and proposed algorithm on various parameters
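As a sketch of the cluster-then-classify design evaluated above (steps 1-11), the following routes each training instance to its nearest centroid and fits one model per cluster. The helper names and the `majority_classifier` placeholder are illustrative assumptions; the paper itself applies Ant-Miner rules inside WEKA, and empty clusters are not handled here.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centroids[j])))

def cluster_then_classify(train, centroids, train_classifier):
    """Partition (features, label) pairs by nearest centroid, then fit one
    classifier per cluster; train_classifier maps a list of pairs to a
    predict function. Returns a predict function that routes through the
    same centroids at prediction time."""
    buckets = {j: [] for j in range(len(centroids))}
    for features, label in train:
        buckets[nearest(features, centroids)].append((features, label))
    models = {j: train_classifier(rows) for j, rows in buckets.items() if rows}
    def predict(features):
        return models[nearest(features, centroids)](features)
    return predict

def majority_classifier(rows):
    """Placeholder stand-in for Ant-Miner: predict the cluster's majority label."""
    labels = [label for _, label in rows]
    winner = max(set(labels), key=labels.count)
    return lambda features: winner
```

For example, with two tight groups labelled 'A' and 'B' and one centroid per group, a query point near a group is classified with that group's label; swapping `majority_classifier` for a rule learner recovers the proposed design.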



VI. CONCLUSION

Data mining is a very supportive technique for analysing forthcoming predictions with the help of the historical behaviour of data and statistics; these features of data mining enable proactive, knowledge-driven business decisions. Machine learning (ML) can often be successfully applied to such problems to improve the efficiency of systems and the design of machines. Since Ant-Miner is efficient only for small data sets, we partitioned the data into a small set of clusters and then applied Ant-Miner for the classification. Specific parameters such as recall and precision are optimized. The results show that the modified Ant-Miner improves the accuracy by approximately 2% and also reduces the tree size. In future work, other swarm intelligence techniques can be applied for classification, and the proposed algorithm's results can be verified on other datasets.

ACKNOWLEDGMENT

The making of this paper needed the co-operation and guidance of a number of people; we therefore consider it our prime duty to thank all those who helped us in this venture. The authors are grateful to all the members of the Department of CSE and IT, Institute of Engineering and Technology, Alwar, India, for their conversations and valuable suggestions.

REFERENCES

[1] J. Smaldon et al., "A New Version of the Ant-Miner Algorithm Discovering Unordered Rule Sets," GECCO'06, Seattle, Washington, USA, July 8-12, 2006.
[2] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas, "Data mining with an ant colony optimization algorithm," IEEE Transactions on Evolutionary Computation, 6(4), 2002, pp. 321-332.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., Morgan Kaufmann, 2005.
[4] E. Karoussi et al., "Data Mining K-Clustering Problem," University of Agder, 2012.
[5] P.-N. Tan, M. Steinbach, and V. Kumar, "Data mining," in Introduction to Data Mining, Pearson International Edition, 2006, pp. 496-508.
[6] E. Velloso, A. Bulling, H. Gellersen, W. Ugulino, and H. Fuks, "Qualitative Activity Recognition of Weight Lifting Exercises," Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13), Stuttgart, Germany: ACM SIGCHI, 2013.


