GRD Journals- Global Research and Development Journal for Engineering | Volume 4 | Issue 7 | June 2019 ISSN: 2455-5703
Implementation of Data Mining Algorithms using R
V. Neethidevan
Department of MCA, Mepco Schlenk Engineering College, India
Abstract- Data mining is an interdisciplinary field with applications everywhere, and its algorithms can be applied to many day-to-day problems. Since R Studio is widely used by researchers across the globe, this paper implements several of the most common data mining algorithms on different case studies using the R programming language. Advanced sensing and computing technologies have enabled the collection of large amounts of complex data. Data mining techniques can discover useful patterns in such data, which in turn can be used to classify new data or for other purposes. An algorithm for processing large data sets must be scalable, and an algorithm for processing data with changing patterns must be capable of incrementally learning and updating those patterns as new data become available. Although data mining algorithms such as decision trees support incremental learning on mixed data types, users are often not satisfied with the scalability of these algorithms when handling large amounts of data. The following four algorithms were implemented in R Studio with complex data sets: 1) Clustering Algorithm, 2) Classification Algorithm, 3) Apriori Algorithm, 4) Decision Tree Algorithm. It is concluded that R Studio produced efficient results for implementing these algorithms.
Keywords- R, Data Mining, Clustering, Classification, Decision Tree, Apriori Algorithm, Data Sets
I. INTRODUCTION
R Studio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. R Studio is written in C++ and uses the Qt framework for its graphical user interface, which includes rich code editing, debugging, testing, and profiling tools.

A. Clustering Algorithm
K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest center.

B. Classification Algorithm
Classification is a data mining technique that analyzes a given data set and assigns each instance to a particular class, such that the classification error is minimized. It is used to extract models that define important data classes within the given data set. Classification is a two-step process. In the first step, a model is created by applying a classification algorithm to a training data set. In the second step, the extracted model is tested against a predefined test data set to measure the trained model's performance and accuracy. Classification is thus the process of assigning a class label to data whose class label is unknown.

C. Apriori Algorithm
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to derive association rules which highlight general trends in the database; this has applications in domains such as market basket analysis.

D. Decision Tree Algorithm
Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable takes a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
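The k-means procedure described in Section A can be sketched in a few lines of base R. The synthetic two-column data and the choice of k = 3 below are illustrative assumptions, not part of the original study; `nstart` restarts the algorithm from several random center placements, addressing the sensitivity to initial centers noted above.

```r
# Illustrative k-means run on synthetic 2-D data (assumed, not from the paper)
set.seed(42)
pts <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),   # points scattered around (0, 0)
  matrix(rnorm(100, mean = 4), ncol = 2),   # points scattered around (4, 4)
  matrix(rnorm(100, mean = 8), ncol = 2)    # points scattered around (8, 8)
)

# nstart = 10 tries ten random initial center placements and keeps the best fit
fit <- kmeans(pts, centers = 3, nstart = 10)

table(fit$cluster)   # size of each of the three discovered clusters
```

Each point ends up associated with its nearest of the k = 3 final centers, exactly the assignment step described above.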
All rights reserved by www.grdjournals.com
II. DATA SET CREATION
A. Steps to Create a Dataset
The Excel data set type supports one value per parameter; it does not support multiple selection for parameters. To create a data set using a Microsoft Excel file from a file directory data source:
– Click the New Data Set toolbar button and select Microsoft Excel File. The New Data Set - Microsoft Excel File dialog launches.
– Enter a name for this data set.
– Click Shared to enable the Data Source list.
– Select the data source where the Microsoft Excel file resides.
– To the right of the File Name field, click the browse icon to browse for the Microsoft Excel file in the data source directories, and select the file.
– If the Excel file contains multiple sheets or tables, select the appropriate Sheet Name and Table Name for this data set.
– If you added parameters for this data set, click Add Parameter, enter the Name, and select the Value. The Value list is populated by the parameter.
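A tabular data set of this kind can equally be loaded straight into R with `read.csv`. The column names below are illustrative, and the example writes a small CSV to a temporary file first so that the snippet is self-contained rather than depending on any particular data source.

```r
# Write a small example data set to a temporary CSV, then read it back;
# the columns (id, score) are illustrative, not from the paper's data source
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(2.5, 3.1, 4.0)),
          csv_path, row.names = FALSE)

dataset <- read.csv(csv_path)   # read.csv treats the first row as a header by default
str(dataset)                    # inspect column names and types before mining
```

Checking the structure with `str()` before running any algorithm helps catch columns that were read with the wrong type.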
III. IMPLEMENTATION DETAILS
The algorithms were implemented in R Studio; the code is attached in the Appendix. A different data set was used for each algorithm's implementation.
A. Clustering
B. Classification
C. Apriori Algorithm
D. Decision Tree Algorithm
IV. CONCLUSION
The data mining algorithms were implemented efficiently in the R environment, making full use of its features. Large sets of data can be processed and manipulated using R, which is widely used for statistical analysis. The system achieves reliability, reduces human involvement in manipulating the data, and thereby reduces the risk of the mistakes humans make when manipulating large data sets.
APPENDIX
A. Cluster
library(datasets)
data(iris)
summary(iris)
set.seed(8953)
iris1 <- iris
iris1$Species <- NULL
(kmeans.result <- kmeans(iris1, 3))
table(iris$Species, kmeans.result$cluster)
plot(iris1[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
library(fpc)
pamk.result <- pamk(iris1)
pamk.result$nc
table(pamk.result$pamobject$clustering, iris$Species)
layout(matrix(c(1, 2), 1, 2))
plot(pamk.result$pamobject)
library(fpc)
iris2 <- iris[-5] # remove class labels
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
table(ds$cluster, iris$Species)
plot(ds, iris2[c(1, 4)])
plotcluster(iris2, ds$cluster)

B. Classification Algorithm
str(iris)
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data <- iris[ind == 2, ]
library(party)
myFormula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris_ctree <- ctree(myFormula, data = train.data)
table(predict(iris_ctree), train.data$Species)
# print and plot the fitted tree
print(iris_ctree)
plot(iris_ctree)

C. Apriori Algorithm
install.packages("caTools")
# Decision Tree Regression
# Importing the dataset
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Decision_Tree_Regression")
datasets = read.csv('Position_Salaries.csv')
dataset = datasets[2:3]
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
# library(caTools)
# set.seed(123)
# split = sample.split(dataset$Salary, SplitRatio = 2/3)
# training_set = subset(dataset, split == TRUE)
# test_set = subset(dataset, split == FALSE)
# Feature Scaling
# training_set = scale(training_set)
# test_set = scale(test_set)
# Fitting Decision Tree Regression to the dataset
# install.packages('rpart')
# rpart: recursive partitioning is a statistical method for multivariable analysis.
# It creates a decision tree that strives to correctly classify members of the
# population by splitting it into sub-populations based on several dichotomous
# independent variables.
library(rpart)
# Salary ~ . regresses the dependent variable on all independent variables
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1))
# rpart.control: various parameters that control aspects of the rpart fit
# minsplit: the minimum number of observations that must exist in a node
# in order for a split to be attempted
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
y_pred

# Apriori algorithm
setwd("E:\\Research 2018\\Course VN\\Algorithm Datasets\\Apriori")
library(arules)
# dataset = read.csv('Market_Basket_Optimisation.csv', header = FALSE) # superseded by read.transactions below
dataset = read.transactions('Market_Basket_Optimisation.csv', sep = ',', rm.duplicates = TRUE)
summary(dataset)
itemFrequencyPlot(dataset, topN = 10)
rules = apriori(data = dataset, parameter = list(support = 0.04, confidence = 0.2))
# visualizing results
inspect(sort(rules, by = 'lift')[1:10])

D. Decision Tree Algorithm
# Visualising the Decision Tree Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
# seq: sequence generation; 0.01 is the increment of the sequence
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary), colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')
# Plotting the tree
plot(regressor)
text(regressor)
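The two-step classification process described in the introduction (train on one split, then measure accuracy on a held-out split) can be closed out with an explicit accuracy check. This sketch is self-contained and uses `rpart` in place of party's `ctree`, so it is a variant of, not identical to, the appendix code; the split matches the appendix's seed and 70/30 proportions.

```r
library(rpart)  # recursive-partitioning trees; a recommended package shipped with R

# Step one: split iris and fit the model on the training data only
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train.data <- iris[ind == 1, ]
test.data  <- iris[ind == 2, ]
tree <- rpart(Species ~ ., data = train.data, method = "class")

# Step two: test the extracted model against the held-out test data
pred <- predict(tree, newdata = test.data, type = "class")
accuracy <- mean(pred == test.data$Species)
accuracy   # proportion of correctly classified test rows
```

Reporting accuracy on data the model has never seen, rather than on the training set, is what makes the second step a genuine measure of the trained model's performance.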
REFERENCES
Book
[1] Rakesh Agrawal and Ramakrishnan Srikant, "Fast algorithms for mining association rules," Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487-499, Santiago, Chile, September 1994.
[2] Rodriguez, J. J.; Kuncheva, L. I.; Alonso, C. J. (2006). "Rotation forest: A new classifier ensemble method". IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (10): 1619-1630. doi:10.1109/TPAMI.2006.211
Website
[3] https://docs.oracle.com/middleware/12211/bip/BIPDM/GUID-70F8A7D1-B206-434A-9B20-D2D7377AC0CB.htm#BIPDM179
[4] https://stackoverflow.com/questions/6771588/how-to-define-a-simple-dataset-in-r