International Journal of Research in Advent Technology, Vol.2, No.6, June 2014 E-ISSN: 2321-9637
A Master Slave Cluster Based Approach for Selecting Frequent Item Sets
Arshdeep Kaur Kharoud (1), Dr. Navdeep Kaur (2)
(1) Student of Master of Technology and (2) Associate Professor and Head, Department of Computer Science and Engineering, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India
Email: arsh.kharoud@gmail.com (1) and drnavdeep.sggswu@gmail.com (2)
Abstract: With the growing use of technology across applications, the size of databases also increases. Data mining, or knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaningful data and data patterns for business as well as real-time applications. Various techniques are available for mining data from databases; clustering is one of them. Clustering is an unsupervised learning process that organizes data into clusters whose members have similar properties. In this paper a new approach is proposed which is based on the master-slave technique. The proposed algorithm works on distributed machines, in which each machine works as a slave or a cluster. The output of each slave is sent to the master, where the final output is obtained. The proposed approach is a bottom-up approach.
Index Terms: Data mining, Clustering, Maximal frequent patterns.
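The master-slave flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that each slave simply counts the itemsets found in its own partition of the transactions and that the master sums those counts and keeps the itemsets meeting a minimum support. The function names `slave_count` and `master_merge`, the `max_size` and `min_support` parameters, and the sample transactions are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def slave_count(transactions, max_size=2):
    """Hypothetical slave step: count itemsets of up to max_size items
    that occur in this slave's partition of the transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return counts


def master_merge(slave_outputs, min_support):
    """Hypothetical master step: sum the slaves' local counts and keep
    only the itemsets whose global count reaches min_support."""
    total = Counter()
    for counts in slave_outputs:
        total.update(counts)  # Counter.update adds counts together
    return {itemset: n for itemset, n in total.items() if n >= min_support}


# Two partitions stand in for two slave machines (made-up data).
partitions = [
    [["bread", "milk"], ["bread", "butter"]],
    [["bread", "milk", "butter"], ["milk"]],
]
frequent = master_merge([slave_count(p) for p in partitions], min_support=2)
```

Because the slaves send raw counts rather than only their locally frequent itemsets, the master's totals are exact; this costs more communication, which a real distributed implementation would likely need to reduce.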
1. INTRODUCTION
Data mining is the process of sifting through very large amounts of data for useful information. It is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the web and other information repositories [1]. While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data. Clustering, classification and association rules are among the techniques used in data mining. Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are "similar" to one another and "dissimilar" to the objects belonging to other clusters [2]. Clustering is needed in data mining because it gives a compact representation of the data, which increases the speed and efficiency of the data mining process and reduces the memory needed to store the data. It also helps in the fast identification of "outliers". Methods for clustering include hierarchical, partitioning, density-based, model-based and grid-based methods.
2. LITERATURE REVIEW
Raghuvira Pratap A et al. [3] proposed an efficient density based k-medoids clustering algorithm to overcome the
drawbacks of DBSCAN and k-medoids clustering algorithms. The result is an improved version of the k-medoids clustering algorithm. This algorithm performs better than DBSCAN when handling clusters of circularly distributed data points and slightly overlapped clusters.
Seema Maitrey et al. [4] presented Enhancement of CURE Clustering Technique in Data Mining. This paper considers the CURE method from hierarchical clustering. The CURE (Clustering Using REpresentatives) method finds clusters in a large database, is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE employs a combination of data collection and data reduction by using random sampling and partitioning. With the availability of large data sets in application areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, telecommunications, retailing, and marketing, it is becoming increasingly important to execute data mining tasks in parallel. At the same time, technological advances have made shared memory parallel machines commonly available to organizations and individuals. Although CURE provides high quality clustering, a parallel version was not available. The new algorithm outperforms existing algorithms and scales well to large databases without degrading clustering quality.
Tahereh Hassanzadeh and Mohammad Reza Meybodi [5] presented a new approach that uses the firefly algorithm to cluster data. It is shown how the firefly algorithm can be used to find the