International Journal of Research in Advent Technology, Vol.2, No.6, June 2014 E-ISSN: 2321-9637
A Master Slave Cluster Based Approach for Selecting Frequent Item Sets
Arshdeep Kaur Kharoud1, Dr. Navdeep Kaur2
1Student of Master of Technology and 2Associate Professor and Head, Department of Computer Science and Engineering, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India
Email: arsh.kharoud@gmail.com1 and drnavdeep.sggswu@gmail.com2
Abstract— As technology is used for more and more applications, the size of databases also increases. Data mining, or knowledge discovery, is the process of digging through and analyzing enormous sets of data and then extracting meaningful data and data patterns for business as well as real time applications. Various techniques are available for mining data from databases; clustering is one of them. Clustering is an unsupervised learning process which organizes the data into clusters having similar properties. In this paper a new approach is proposed which is based on the master slave technique. The proposed algorithm works on distributed machines, in which each machine works as a slave or a cluster. The output of each slave is sent to the master, where the final output is obtained. The proposed approach is a bottom-up approach.
Index Terms— Data mining, Clustering, Maximal frequent patterns.
1. INTRODUCTION
Data mining is the process of sifting through very large amounts of data for useful information, that is, of discovering interesting patterns and knowledge in large amounts of data. The data sources can include databases, data warehouses, the web and other information repositories [1]. While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data. Clustering, classification and association rules are among the techniques used in data mining.

Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters [2]. Clustering is needed in data mining because it gives a compact representation of the data, which increases the speed and efficiency of the data mining process and decreases the memory needed to store the data. It also helps in the fast identification of "outliers". The main clustering methods include hierarchical, partitioning, density-based, model-based and grid-based methods.

2. LITERATURE REVIEW
Raghuvira Pratap A et al. [3] proposed an efficient density based k-medoids clustering algorithm to overcome the drawbacks of the DBSCAN and k-medoids clustering algorithms. The result is an improved version of the k-medoids clustering algorithm. This algorithm performs better than DBSCAN when handling clusters of circularly distributed data points and slightly overlapped clusters.

Seema Maitrey et al. [4] presented "Enhancement of CURE Clustering Technique in Data Mining", which considers the CURE method from hierarchical clustering. The CURE (Clustering Using REpresentatives) method finds clusters in a large database, is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE employs a combination of data collection and data reduction by using random sampling and partitioning. With the availability of large data sets in application areas like bioinformatics, medical informatics, scientific data analysis, financial analysis, telecommunications, retailing, and marketing, it is becoming increasingly important to execute data mining tasks in parallel. At the same time, technological advances have made shared memory parallel machines commonly available to organizations and individuals. Although CURE provides high quality clustering, a parallel version was not available. The new algorithm was able to outperform existing algorithms and to scale well for large databases without a decline in clustering quality.

Tahereh Hassanzadeh and Mohammad Reza Meybodi [5] presented a new approach that uses the firefly algorithm to cluster data. It is shown how the firefly algorithm can be used to find the
centroid of the user specified number of clusters. The algorithm is then extended to use k-means clustering to refine the centroids and clusters. This new hybrid algorithm is called K-FA. The experimental results showed the accuracy and capability of the proposed algorithm for data clustering.

Dr. T. Velmurugan [6] presented "Evaluation of k-Medoids and Fuzzy C-Means Clustering Algorithms for Clustering Telecommunication Data". This research work deals with two of the most representative partition based clustering algorithms in data mining, namely k-Medoids and Fuzzy C-Means. The two algorithms are implemented and their performance is analyzed based on the quality of their clustering results. Connection oriented broadband data is the source of data for this analysis. To test the performance, the distances between the server locations and their connections are taken for clustering. The number of connections in the servers is changed after the clustering process. The run time of each algorithm is analyzed and the results are compared with one another. Finally, the best algorithm is suggested based on computational time for the chosen telecommunication data.

K. Kalaivani and A.P.V. Raghavendra [7] described an enhancement of the existing k-means clustering process that handles categorical and mixed data types in an efficient manner. The goal is to use an integrated clustering approach based on high dimensional categorical data that works well for data with mixed continuous and categorical features. The experimental results on several data sets suggest that the link based cluster ensemble algorithm, integrated with the proposed k-means algorithm, produces accurate clustering results. The proposed algorithm also proves the convergence property of the clustering process, thus improving the accuracy of the clustering results. The aim of the proposed work is to provide accurate and efficient results whenever the user wants to access data from the database.

Prajesh P Anchalia et al. [8] described the implementation of the K-Means Clustering Algorithm over a distributed environment using Apache Hadoop. The key to the implementation of the K-Means Algorithm is the design of the Mapper and Reducer routines, which is discussed in the later part of that paper. The steps involved in the execution of the K-Means Algorithm are also described there, based on a small scale implementation of the algorithm on an experimental setup, to serve as a guide for practical implementations.

Tayfun Servi and Hamza Erol [9] proposed a new data mining method for determining the number and structure of clusters, and for refining groups in multivariate heterogeneous data sets that include partly and completely overlapped group structures, by using dynamic model based clustering. It is called dynamic model based clustering since the structure of the model changes dynamically at each stage of the refinement process. The proposed data mining method works without data reduction for high dimensional data in which some of the variables include completely overlapped situations.
3. CHALLENGES OF CLUSTERING
Since various algorithms are available for clustering, it is difficult to choose a suitable algorithm for a particular dataset. The existing algorithms have many limitations: some have problems when the clusters are of differing sizes or densities, or have non-globular shapes, and some are sensitive to noise and outliers. Each algorithm has its own run time, complexity, error frequency etc. Another issue is that the outcome of a clustering algorithm mainly depends on the type of dataset used. As the size and dimensionality of the dataset increase, it becomes difficult for a particular clustering algorithm to handle. The complexity of the data set also increases with the inclusion of data like audio, video, pictures and other multimedia data, which form very heavy databases. There is a great need to choose an efficient clustering algorithm for the dataset. The selection of a clustering algorithm may be based on the type of dataset, time requirement, efficiency needed, accuracy required, error tolerance etc., so the main challenge is to choose the correct type of clustering algorithm to get the desired results [10].

4. PROPOSED ALGORITHM
The proposed algorithm uses the master slave technique. In this architecture one machine is declared as the master when the architecture is set up. The master is a more powerful machine than the slaves and interconnects with the slaves.

Algorithm: Proposed Algorithm
Input: Data sets
Output: Maximal frequent patterns
Step 1: Start
Step 2: Generate Frequent Pattern Tree (FP-Tree) at each slave
Step 3: Now starting from level 1 candidates to last level candidates
i. Send the level k candidates to the master from each slave, in sorted order.
ii. Then we can get a total support_count at the master level.
iii. Now we can get Maximal Frequent Patterns.
Step 4: End

This algorithm works on distributed machines, in which each machine works as a slave.
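The steps of the algorithm can be illustrated with a minimal single-process Python sketch. It is not the authors' implementation and rests on several assumptions: each slave is simulated by a list of transactions, candidates at level k are taken to be itemsets of size k, the slave counts them directly instead of mining an FP-Tree, and a threshold named `min_support` (not given in the text) decides which itemsets are frequent. The function names `slave_candidates` and `master_maximal_patterns` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def slave_candidates(transactions, level):
    """Count every itemset of size `level` in one slave's partition.

    Assumption: direct counting stands in for mining the slave's FP-Tree.
    Returns (itemset, local_count) pairs in sorted order, as in Step 3(i).
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for itemset in combinations(items, level):
            counts[itemset] += 1
    return sorted(counts.items())


def master_maximal_patterns(partitions, min_support):
    """Sum the slaves' counts level by level and keep the maximal frequent itemsets."""
    max_level = max((len(set(t)) for part in partitions for t in part), default=0)
    frequent = {}
    for level in range(1, max_level + 1):
        total = Counter()
        for part in partitions:  # each partition plays the role of one slave
            for itemset, count in slave_candidates(part, level):
                total[itemset] += count  # total support_count, Step 3(ii)
        level_frequent = {i: c for i, c in total.items() if c >= min_support}
        if not level_frequent:
            break  # support never grows with itemset size, so higher levels are empty too
        frequent.update(level_frequent)
    # Step 3(iii): an itemset is maximal if no frequent proper superset exists.
    return {
        itemset: count
        for itemset, count in frequent.items()
        if not any(set(itemset) < set(other) for other in frequent)
    }


# Hypothetical data: three slaves holding two transactions each.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "b", "c"], ["b", "d"]],
    [["a", "c"], ["a", "b", "c", "d"]],
]
result = master_maximal_patterns(partitions, min_support=3)
print(result)  # {('a', 'b', 'c'): 3}
```

In this sketch the slaves report counts for every itemset they see, not only locally frequent ones, so that the totals at the master are exact; a real deployment would run `slave_candidates` on separate machines and send the sorted lists over the network.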
Figure 1: Master Slave architectural design (one Master and three Slaves)

Step 1: Start
Step 2: Generate the Frequent Pattern Tree at each slave. After generating the FP Tree on each slave the situation is:

Figure 2: After generating FP Tree on each Slave (Master; FP Tree1 at Slave 1, FP Tree2 at Slave 2, FP Tree3 at Slave 3)

Candidates are generated as a result at each level, from level 1 to level n.
Step 3:
i. The generated candidates from level 1 to level n are sent to the master in sorted order.
ii. With the help of the generated candidates the total support_count is obtained at the master level.
iii. With the help of the total support_count the Maximal Frequent Patterns are obtained as the final result at the master level.
Step 4: End.

The proposed algorithm works efficiently for distributed data. It is based on the master slave technique and therefore decreases the time complexity by performing the task in parallel. The data is divided among the different slaves and the FP Tree is generated at each slave to generate candidates. These candidates are sent to the master, which in turn decreases the space required to store the candidates on the master.

5. CONCLUSION
In this paper, data mining and various clustering algorithms and techniques for data mining are discussed. A new approach is proposed for clustering in data mining which is more efficient than the traditional clustering algorithms. It works on distributed systems, in which each system works as a slave. First, the FP-Tree is generated at each slave and the result is produced in the form of candidates; then the result of each slave is sent to the master in sorted order. The total support_count is obtained at the master level, from which the maximal frequent patterns are generated. This new approach has better efficiency and reduces the time and space complexity.

ACKNOWLEDGEMENTS
I would like to place on record my deep sense of gratitude to Dr. Navdeep Kaur, Associate Professor and Head of Department of Computer Science and Engineering, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab, India, for her generous guidance, help and useful suggestions. I would like to thank my friends for their support. I want to express my appreciation to every person who contributed their inspiration. I am highly grateful to my parents and brother for the inspiration and ever encouraging moral support, which enabled me to pursue my studies.
Arshdeep Kaur Kharoud

REFERENCES
[1] Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”, 3rd ed., Morgan Kaufmann Publishers, Waltham, MA 02451 (U.S.A.), 2012.
[2] Tariq Rashid, “Clustering: An Introduction”, home.deib.polimi.it, [Online]. Available: http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/ [Accessed: 2013].
[3] Raghuvira Pratap A, K Suvarna Vani, J Rama Devi and Dr. K Nageswara Rao, "An Efficient Density Based Improved K-Medoids Clustering Algorithm", IJACSA, Vol. 2, No. 6, 2011, pp. 49-54.
[4] Seema Maitrey, C K Jha, Rajat Gupta and Jaiveer Singh, "Enhancement of CURE Clustering Technique in Data Mining", IJCA, April, 2012, New York, USA, pp. 7-11.
[5] Tahereh Hassanzadeh and Mohammad Reza Meybodi, "A New Hybrid Approach for Data Clustering Using Firefly Algorithm and K-means", IEEE, 2012, Shiraz, Fars, pp. 007-011.
[6] Dr. T. Velmurugan, "Evaluation of k-Medoids and Fuzzy C-Means Clustering Algorithms for Clustering Telecommunication Data", IEEE, 2012, Tiruchirappalli, Tamilnadu, India, pp. 115-120.
[7] K. Kalaivani and A.P.V. Raghavendra, "An Integrated Clustering Approach for High Dimensional Categorical Data", IEEE, March, 2013, Nagercoil, pp. 1-4.
[8] Prajesh P Anchalia, Anjan K Koundinya and Srinath N K, "MapReduce Design of K-Means Clustering Algorithm", IEEE, June, 2013, Suwon, pp. 1-5.
[9] Tayfun Servi and Hamza Erol, "A Data Mining Method For Refining Groups In Data Using Dynamic Model Based Clustering", IEEE, June, 2013, Albena, pp. 1-6.
[10] Ms. Asmita Yadav, "A Survey Of Issues And Challenges Associated With Clustering Algorithms", IJSETT, July, 2013, pp. 7-11.