GRD Journals- Global Research and Development Journal for Engineering | Volume 4 | Issue 7 | June 2019 ISSN: 2455-5703
Implementation of Data Mining Algorithms using R
V. Neethidevan
Department of MCA, Mepco Schlenk Engineering College, India
Abstract- Data mining is an interdisciplinary field with applications everywhere, and its algorithms can be applied to many day-to-day problems. Since R Studio is widely used by researchers across the globe, this paper implements several of the most common data mining algorithms on different case studies using the R programming language. Advanced sensing and computing technologies have enabled the collection of large amounts of complex data. Data mining techniques can discover useful patterns in such data, which in turn can be used to classify new data or for other purposes. An algorithm for processing large data sets must be scalable, and an algorithm for processing data with changing patterns must be capable of incrementally learning and updating those patterns as new data become available. Although data mining algorithms such as decision trees support incremental learning on mixed data types, users are often not satisfied with the scalability of these algorithms when handling large amounts of data. The following four algorithms were implemented in R Studio with complex data sets: 1) Clustering Algorithm, 2) Classification Algorithm, 3) Apriori Algorithm, 4) Decision Tree Algorithm. It is concluded that R Studio produced efficient results for implementing these algorithms.
Keywords- R, Data Mining, Clustering, Classification, Decision Tree, Apriori Algorithm, Data Sets
I. INTRODUCTION
R Studio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. R Studio is written in C++ and uses the Qt framework for its graphical user interface, which includes rich code editing, debugging, testing, and profiling tools.

A. Clustering Algorithm
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center.

B. Classification Algorithm
Classification is a data mining technique that analyzes a given data set and assigns each instance to a particular class, such that the classification error is minimized. It is used to extract models that define important data classes within the given data set. Classification is a two-step process. In the first step, a model is created by applying a classification algorithm to a training data set. In the second step, the extracted model is tested against a predefined test data set to measure the trained model's performance and accuracy. Classification is thus the process of assigning a class label to data whose class label is unknown.

C. Apriori Algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to derive association rules which highlight general trends in the database; this has applications in domains such as market basket analysis.

D. Decision Tree Algorithm
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable takes a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
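The k-means procedure described in Section A can be sketched in a few lines of base R. The synthetic two-column data and the choice of k = 3 below are illustrative assumptions, not part of the original study; `nstart` restarts the algorithm from several random center placements, addressing the sensitivity to initial centers noted above.

```r
# Illustrative k-means run on synthetic 2-D data (assumed, not from the paper)
set.seed(42)
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),   # points scattered around (0, 0)
  matrix(rnorm(100, mean = 4), ncol = 2),   # points scattered around (4, 4)
  matrix(rnorm(100, mean = 8), ncol = 2)    # points scattered around (8, 8)
)

# nstart = 10 tries ten random initial center placements and keeps the best fit
fit <- kmeans(pts, centers = 3, nstart = 10)

table(fit$cluster)   # size of each of the three discovered clusters
```

Each point ends up associated with its nearest of the k = 3 final centers, exactly the assignment step described above.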
All rights reserved by www.grdjournals.com
II. DATA SET CREATION
A. Steps to Create a Dataset
The Excel data set type supports one value per parameter; it does not support multiple selection for parameters. To create a data set using a Microsoft Excel file from a file directory data source:
– Click the New Data Set toolbar button and select Microsoft Excel File. The New Data Set - Microsoft Excel File dialog launches.
– Enter a name for this data set.
– Click Shared to enable the Data Source list.
– Select the data source where the Microsoft Excel file resides.
– To the right of the File Name field, click the browse icon to browse for the Microsoft Excel file in the data source directories, and select the file.
– If the Excel file contains multiple sheets or tables, select the appropriate Sheet Name and Table Name for this data set.
– If you added parameters for this data set, click Add Parameter, enter the Name, and select the Value. The Value list is populated by the parameter.
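A tabular data set of this kind can equally be loaded straight into R with `read.csv`. The column names below are illustrative, and the example writes a small CSV to a temporary file first so that the snippet is self-contained rather than depending on any particular data source.

```r
# Write a small example data set to a temporary CSV, then read it back;
# the columns (id, score) are illustrative, not from the paper's data source
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3.1, 4.0)),
          csv_path, row.names = FALSE)

dataset <- read.csv(csv_path)   # read.csv treats the first row as a header by default
str(dataset)                    # inspect column names and types before mining
```

Checking the structure with `str()` before running any algorithm helps catch columns that were read with the wrong type.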
III. IMPLEMENTATION DETAILS
The algorithms were implemented in R Studio; the code is attached in the Appendix. A different data set was used for each algorithm's implementation.
A. Clustering
B. Classification
C. Apriori Algorithm
D. Decision Tree Algorithm
IV. CONCLUSION
The data mining algorithms were implemented efficiently in the R environment, making full use of its features. Large sets of data can be processed and manipulated using R, which is widely used for statistical analysis. The system achieves reliability, reduces human involvement in manipulating the data, and thereby reduces the risk of the mistakes humans make when manipulating large data sets.
APPENDIX
A. Cluster
library(datasets)
data(iris)
summary(iris)
set.seed(8953)
iris1 <- iris
iris1$Species <- NULL
(kmeans.result <- kmeans(iris1, 3))
table(iris$Species, kmeans.result$cluster)
plot(iris1[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
library(fpc)
pamk.result <- pamk(iris1)
pamk.result$nc
table(pamk.result$pamobject$clustering, iris$Species)
layout(matrix(c(1, 2), 1, 2))
plot(pamk.result$pamobject)
library(fpc)
iris2 <- iris[-5] # remove class labels
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
table(ds$cluster, iris$Species)
plot(ds, iris2[c(1, 4)])
plotcluster(iris2, ds$cluster)

B. Classification Algorithm
str(iris)
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data <- iris[ind == 2, ]
library(party)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data = train.data)
table(predict(iris_ctree), train.data$Species)
# print and plot the fitted tree
print(iris_ctree)
plot(iris_ctree)

C. Apriori Algorithm
install.packages("caTools")
# Decision Tree Regression
# Importing the dataset
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Decision_Tree_Regression")
datasets = read.csv('Position_Salaries.csv')
dataset = datasets[2:3]
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Decision Tree Regression to the dataset
# install.packages('rpart')
# rpart: recursive partitioning is a statistical method for multivariable analysis.
# It creates a decision tree that strives to correctly classify members of the
# population by splitting it into sub-populations based on several dichotomous
# independent variables.
library(rpart)
# Salary ~ . regresses the dependent variable on all independent variables
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))
# rpart.control: various parameters that control aspects of the rpart fit
# minsplit: the minimum number of observations that must exist in a node
# in order for a split to be attempted
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
y_pred

# Apriori algorithm
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Apriori")
library(arules)
# dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE) # superseded by read.transactions below
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
summary(dataset)
itemFrequencyPlot(dataset, topN = 10)
rules = apriori(data = dataset, parameter = list(support = 0.04, confidence = 0.2))
# visualizing results
inspect(sort(rules, by = 'lift')[1:10])

D. Decision Tree Algorithm
# Visualising the Decision Tree Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
# seq: sequence generation; 0.01 is the increment of the sequence
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')
# Plotting the tree
plot(regressor)
text(regressor)
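The two-step classification process described in the introduction (train on one split, then measure accuracy on a held-out split) can be closed out with an explicit accuracy check. This sketch is self-contained and uses `rpart` in place of party's `ctree`, so it is a variant of, not identical to, the appendix code; the split matches the appendix's seed and 70/30 proportions.

```r
library(rpart)  # recursive-partitioning trees; a recommended package shipped with R

# Step one: split iris and fit the model on the training data only
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data  <- iris[ind == 2, ]
tree <- rpart(Species ~ ., data = train.data, method = "class")

# Step two: test the extracted model against the held-out test data
pred <- predict(tree, newdata = test.data, type = "class")
accuracy <- mean(pred == test.data$Species)
accuracy   # proportion of correctly classified test rows
```

Reporting accuracy on data the model has never seen, rather than on the training set, is what makes the second step a genuine measure of the trained model's performance.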
REFERENCES
Book
[1] Rakesh Agrawal and Ramakrishnan Srikant, "Fast algorithms for mining association rules," Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487-499, Santiago, Chile, September 1994.
[2] Rodriguez, J. J.; Kuncheva, L. I.; Alonso, C. J. (2006). "Rotation forest: A new classifier ensemble method". IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (10): 1619-1630. doi:10.1109/TPAMI.2006.211
Website
[3] https://docs.oracle.com/middleware/12211/bip/BIPDM/GUID-70F8A7D1-B206-434A-9B20-D2D7377AC0CB.htm#BIPDM179
[4] https://stackoverflow.com/questions/6771588/how-to-define-a-simple-dataset-in-r