Clustering of Streaming Data using STRAP Algorithm

IJSRD - International Journal for Scientific Research & Development | Vol. 4, Issue 05, 2016 | ISSN (online): 2321-0613

Madhuri V. Gohad1 Prof. P. M. Yawalkar2
1P.G. Student 2Professor
1,2Department of Computer Engineering
1,2MET's Institute of Engineering, Adgaon, Nashik, Savitribai Phule Pune University

Abstract: Data stream clustering is an active research area concerned with discovering useful information from huge amounts of continuously generated data. Various clustering algorithms have been developed and proposed for data streams. Clustering is the task of arranging objects into groups such that objects in the same group are more related to each other than to those in other groups (clusters). Clustering a data stream poses several challenges that need to be solved: dealing with dynamic data, processing fast incoming objects, processing data objects incrementally, and respecting time, memory and cost limitations. The proposed STRAP clustering algorithm extends Affinity Propagation (AP) to handle evolving data streams. It combines a statistical change point detection test with Affinity Propagation. It integrates a group of labeled data objects with a group of exemplars to detect changes in the generative process underlying the stream data. Experimental comparison with state-of-the-art data stream clustering methods demonstrates the effectiveness and efficiency of the proposed method. The proposed semi-supervised STRAP algorithm increases accuracy and decreases the percentage of outliers compared to the existing system.

Key words: Affinity propagation, Data Stream, STRAP Algorithm

I. INTRODUCTION

Clustering is a widely studied research problem in data mining. Clustering is the assignment of data objects to groups such that objects that are more related to each other are in one group, whereas unrelated objects are in other groups.
A cluster is formed on the basis of the similarity and dissimilarity among data objects. Cluster analysis can be done by finding similarities in the data and putting similar data objects into one cluster and dissimilar data objects into a reservoir or into other clusters. However, it is difficult to design data stream clustering algorithms that find clusters of arbitrary shape, because the data set can be scanned only once (the one-pass constraint). In many settings, data is more easily and better characterized by a measure of pairwise similarities than by the negative squared Euclidean distance. The aim of clustering is to achieve good clustering behavior, specifically low distortion at low computational complexity or cost. The task of an exemplar-based clustering algorithm is to identify a subset of the data points as exemplars and to assign every remaining data point to one of those exemplars. This work focuses on learning a generative model of the data stream, which has some important features:
• The generative model is represented through actual data items (i.e. a set of exemplars).
• It is available and usable at any time step, for monitoring applications.
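As a rough illustration of the exemplar-based stream model described above, the following Python sketch keeps a set of exemplars, sends items that fit no exemplar to a reservoir, and rebuilds the model when a change test fires. Several things here are assumptions not taken from the paper: the names `StreamModel`, `PageHinkley`, `radius`, `delta` and `threshold`; the use of a Page-Hinkley test as the change point detection test; the fixed-radius rule for deciding whether an item fits; and the rebuild step, which greedily promotes reservoir items to exemplars as a simple stand-in for re-running Affinity Propagation.

```python
def similarity(a, b):
    """Negative squared Euclidean distance (higher means more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a signal.

    Assumption: used here as one possible change point detection test.
    """

    def __init__(self, delta=0.05, threshold=2.0):
        self.delta = delta          # tolerated drift per observation
        self.threshold = threshold  # alarm level
        self.reset()

    def reset(self):
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0
        self.min_cum = 0.0

    def update(self, value):
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (value - self.mean) / self.n
        self.cum += value - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


class StreamModel:
    """Hypothetical exemplar model with a reservoir for non-fitting items."""

    def __init__(self, exemplars, radius, detector):
        self.exemplars = [tuple(e) for e in exemplars]
        self.counts = [0] * len(self.exemplars)  # items absorbed per exemplar
        self.radius = radius                      # assumed fit threshold
        self.detector = detector
        self.reservoir = []
        self.rebuilds = 0

    def _nearest(self, item):
        return max(range(len(self.exemplars)),
                   key=lambda i: similarity(item, self.exemplars[i]))

    def _fits(self, item, idx):
        return similarity(item, self.exemplars[idx]) >= -self.radius ** 2

    def process(self, item):
        """Single-pass update; returns the exemplar index, or None if the
        item was sent to the reservoir."""
        item = tuple(item)
        idx = self._nearest(item)
        fits = self._fits(item, idx)
        if fits:
            self.counts[idx] += 1
        else:
            self.reservoir.append(item)
        # The detector monitors the rate of items sent to the reservoir.
        if self.detector.update(0.0 if fits else 1.0):
            self._rebuild()
        return idx if fits else None

    def _rebuild(self):
        # Stand-in for Affinity Propagation on exemplars + reservoir:
        # a reservoir item far from every exemplar becomes a new exemplar.
        for item in self.reservoir:
            idx = self._nearest(item)
            if self._fits(item, idx):
                self.counts[idx] += 1
            else:
                self.exemplars.append(item)
                self.counts.append(1)
        self.reservoir.clear()
        self.detector.reset()
        self.rebuilds += 1


if __name__ == "__main__":
    model = StreamModel([(0.0, 0.0)], radius=1.0, detector=PageHinkley())
    stream = [(0.1, 0.0)] * 20 + [(10.0, 10.0)] * 5
    for point in stream:
        model.process(point)
    print(model.exemplars, model.counts, model.rebuilds)
```

The model stays available after every call to `process`, which matches the any-time property listed above. The memory used is bounded by the number of exemplars plus the reservoir, provided that the change test fires often enough to empty the reservoir.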



Statistical hypothesis testing is used to detect changes in the underlying distribution [12]. The affinity propagation algorithm is simple to define and customize; it is also computationally efficient, scaling linearly or quadratically in the number of similarities. Because of the huge size and evolving nature of data streams, a good stream clustering algorithm should meet the following specific requirements:
• Single scan of the data.
• Ability to filter out noise in continuously evolving streams.
• Ability to process very large or infinite streams in main memory of limited size.
Streaming data is data gathered continuously from sources such as telephone records, webcams, and online transactions. Because this kind of data is continuous, maintaining it requires selecting the best representatives from the clusters of streaming data.

II. LITERATURE SURVEY

X. Zhang et al. proposed the STRAP algorithm, which proceeds by incrementally updating the current model if the current data item fits the model; otherwise the item is put in a reservoir. A Change Point Detection (CPD) test detects changes of distribution by monitoring the data items sent to the reservoir. When the CPD test is triggered, a new model is rebuilt from the current model and the data items in the reservoir [8], [9], [12], [14].

The BIRCH method is suitable for very large databases and is used for data streaming. BIRCH makes use of all available memory to generate the finest possible sub-clusters. It generates the clusters with a Clustering Feature (CF) tree, which is a height-balanced tree [1], [13].

L. O'Callaghan et al. proposed the method called STREAM. In the first step, the objects are grouped, a median is computed for each group, and each median is assigned a weight that depends on the number of data objects in its cluster. In the second step, the weighted medians are clustered in turn, up to the top of the tree [2], [7], [13].

C. Aggarwal et al. implemented the method called CluStream. Rather than clustering the whole data stream at a single time step, the CluStream method views the stream as a process that changes over time. The idea of this technique is to split the clustering into an online component, which periodically stores detailed summary statistics, and an offline component, which uses these summary statistics [3], [5], [7], [13].

The Clustering On Demand (COD) framework was suggested by B.R. Dai et al. and is used to dynamically cluster multiple data streams. The COD framework has two main advantages: an online statistics collection that needs only one data scan, and a compact multi-resolution approximation, which are designed respectively for the time and space constraints in


Published on Mar 16, 2017

International Journal for Scientific Research and Development - IJSRD
