International Journal of Computer & Organization Trends – Volume 3 Issue 11 – Dec 2013
Improved Document Clustering Using Multi View Point Graph Index Model
Yasoda Khumbham, M.Tech student
D.N.V.S.L.S. Indira, Assistant Professor, CSE Dept.

Abstract – Clustering is one of the data mining and text mining techniques designed to analyze a dataset by dividing it into meaningful groups. The data points (objects) in the dataset can have certain relationships among them, and all clustering algorithms assume this when they are applied to various datasets. The existing algorithms for text mining use a single viewpoint for measuring similarity between objects. Their drawback is that the resulting clusters cannot exhibit the full variety of relationships among objects. To overcome this drawback, we propose a new similarity measure, known as the multi-viewpoint based similarity measure, to ensure that the clusters show all relationships among objects. Our main objective is to cluster web documents. In this work, “Improved multi-viewpoint based web and desktop based document clustering methods” are proposed for clustering high-dimensional data. The process uses different viewpoints taken from several objects of multiple clusters, so that a more informative assessment of similarity can be achieved. The proposed system uses a novel phrase-based document index model, the Document Index Graph, which allows incremental construction of a phrase-based index of a given document with an emphasis on efficiency, rather than depending on single-term indexes only. It provides efficient phrase matching, which is used to evaluate the similarity between documents. The system also includes an incremental document clustering algorithm based on maximizing the tightness of clusters by carefully watching the pair-wise document similarity distribution inside clusters.

Keywords – Clusters, Indexing, Keyword, Graph.
I. INTRODUCTION
Clustering is the process of grouping physical or abstract objects into classes of similar objects, and it is one of the most interesting topics in data mining. The purpose of clustering is to discover the underlying structure in data and to organize the data into meaningful subgroups for further analysis. Many clustering algorithms are published every year, and they are proposed for a range of research fields. They have been developed using various techniques and approaches. However, according to a recent study, k-means remains one of the top data mining algorithms. For many practitioners, k-means is the favorite algorithm to use in their fields. Although it is one of the top algorithms, it has a few basic drawbacks when clusters are of
ISSN: 2249-2593
differing sizes, densities, and non-globular shapes. Despite these drawbacks, its simplicity, understandability, and scalability are likely the main reasons for its popularity. Clustering still requires more robust dissimilarity or similarity measures; recent works, for instance [8], illustrate this need.

This paper is motivated by investigations arising from the above and similar research findings. It appears to us that the nature of the similarity measure plays an important role in the success or failure of a clustering method. Our first objective is to derive a new way of measuring similarity between data objects in sparse and high-dimensional domains, particularly text documents. From the proposed similarity measure, we formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable, yet also capable of providing high-quality and consistent performance [4].

Our investigation of similarity in clustering was initially motivated by research on automated text categorization of foreign-language texts. As the volume of digital documents has increased dramatically in recent years with the growth of the Internet, information management, search, and retrieval have become practically important problems. Developing techniques to organize large numbers of unstructured text documents into a smaller number of meaningful clusters would be very helpful, since document clustering is essential to tasks such as indexing, filtering, automated metadata generation, word sense disambiguation, population of hierarchical catalogues of web resources and, in general, any application requiring document organization.

Document clustering has been studied for many decades, but it is still far from a trivial and solved problem. The challenges are:
1. Selecting appropriate keywords of the documents to be used for clustering.
2. Selecting a fast similarity measure between documents.
3. Selecting a suitable clustering method that uses the above similarity measure.
4. Implementing the clustering algorithm efficiently, so that it is feasible in terms of required memory and CPU resources.
5. Finding methods for assessing the overall quality of the performed clustering.
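
To make challenges 1, 2, 3 and 5 concrete, the following minimal Python sketch strings together a conventional baseline: TF-IDF term weighting, the single-viewpoint cosine similarity, a small k-means variant on unit-length vectors, and purity as a quality score. It is not the multi-viewpoint measure or the Document Index Graph proposed in this paper. The function names (`tfidf_vectors`, `cosine`, `spherical_kmeans`, `purity`), the smoothed IDF formula, the fixed seed indices, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]  # naive tokenizer (assumption)
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps every weight positive (illustrative choice).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def centroid(vectors):
    """Sum of sparse vectors, rescaled to unit length."""
    total = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def spherical_kmeans(vectors, seeds, iterations=10):
    """Cluster by cosine similarity; `seeds` are indices of the initial centroids."""
    centroids = [vectors[i] for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each non-empty cluster's centroid.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels


def purity(labels, truth):
    """Fraction of documents that belong to the majority class of their cluster."""
    per_cluster = {}
    for lab, cls in zip(labels, truth):
        per_cluster.setdefault(lab, Counter())[cls] += 1
    return sum(c.most_common(1)[0][1] for c in per_cluster.values()) / len(labels)


# Toy corpus (hypothetical): two documents about clustering, two about football.
docs = [
    "clustering groups similar documents",
    "document clustering uses similarity measures",
    "football match ended in a draw",
    "the football team won the match",
]
vectors = tfidf_vectors(docs)
labels = spherical_kmeans(vectors, seeds=[0, 2])
print(labels, purity(labels, ["text", "text", "sport", "sport"]))
```

In this sketch every similarity is measured from the origin, which is the single-viewpoint assumption the paper sets out to relax. Challenge 4 (memory and CPU cost) is addressed only by storing the vectors as sparse dictionaries.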
http://www.ijcotjournal.org
Page 549