International Journal of Computer & Organization Trends –Volume 3 Issue 11 – Dec 2013
Improved Document Clustering Using Multi View Point Graph Index Model

Yasoda Khumbham, M.Tech Student
D.N.V.S.L.S. Indira, Assistant Professor, CSE Dept.

Abstract – Clustering is a data mining and text mining technique designed to analyze a dataset by dividing it into meaningful groups. The data points (objects) in a dataset can stand in various relationships to one another, and every clustering algorithm makes assumptions about these relationships when it is applied to a dataset. Existing text mining algorithms use a single viewpoint for measuring similarity between objects; their drawback is that the resulting clusters cannot exhibit the comprehensive variety of relationships among objects. To overcome this drawback, we propose a new similarity measure, the multi-viewpoint based similarity measure, to ensure that clusters reflect all relationships among objects. Our main objective is to cluster web documents, so this work proposes improved multi-viewpoint based clustering methods for web and desktop documents in high-dimensional data. Because this process takes different viewpoints from several objects of multiple clusters, a more informative assessment of similarity can be achieved. The proposed system uses a novel phrase-based document index model, the Document Index Graph, which allows incremental construction of a phrase-based index of a document collection with an emphasis on efficiency, rather than relying on single-term indexes alone. It provides efficient phrase matching, which is used to evaluate the similarity between documents. An incremental document clustering algorithm then maximizes the tightness of clusters by carefully monitoring the pair-wise document similarity distribution inside clusters.

Keywords – Clusters, Indexing, Keyword, Graph.
I. INTRODUCTION
Clustering is a strategy of grouping physical or abstract objects into classes of similar objects, and it is one of the most captivating concepts of data mining. The purpose of clustering is to uncover the fundamental structure in data and to classify the data into meaningful subgroups for further analysis. A great many clustering algorithms are published every year, developed using various techniques and approaches and proposed for several research fields. As documented in recent studies, k-means remains one of the top data mining algorithms, and for many practitioners it is the favorite algorithm in their fields. Even though it is a top algorithm, it possesses a few basic drawbacks when clusters are of
ISSN: 2249-2593
differing sizes, densities and non-globular shapes. Despite these drawbacks, its simplicity, understandability and scalability are likely the main reasons the algorithm remains popular. It still requires more robust similarity or dissimilarity measures; recent work such as [8] illustrates this need. This paper is motivated by investigations arising from the above and similar research findings. It appears to us that the nature of the similarity measure plays an important role in the success or failure of a clustering method. Our first objective is to derive a new way of measuring similarity between data objects in sparse, high-dimensional domains, particularly text documents. From the proposed similarity measure we formulate new clustering criterion functions and introduce their respective clustering algorithms, which are fast and scalable, yet capable of providing high-quality and consistent performance [4]. Our investigation of clustering similarity was initially motivated by research on automated text categorization of foreign-language texts. As the volume of digital documents has been increasing dramatically with the growth of the Internet, information management, search and retrieval have become practically important problems. Techniques that organize large numbers of unstructured text documents into a smaller number of meaningful clusters are therefore very valuable, as document clustering is essential to tasks such as indexing, filtering, automated metadata generation, word sense disambiguation, population of hierarchical catalogues of web resources and, in general, any application requiring document organization. Document clustering has been studied for decades, yet it is far from a trivial, solved problem. The challenges are:
1. Selecting appropriate keywords of the documents to be used for clustering.
2. Selecting a fast similarity measure between documents.
3. Selecting a suitable clustering method that utilizes the above similarity measure.
4. Implementing the clustering algorithm efficiently, so that it is feasible in terms of required memory and CPU resources.
5. Finding methods for assessing the overall quality of the performed clustering.
http://www.ijcotjournal.org
Page 549
Collection of data includes processes such as crawling, indexing and filtering, which are used to collect the documents to be clustered, index them for efficient storage and retrieval, and filter them to remove extraneous data such as stopwords. Preprocessing represents the data in a form that can be used for clustering; there are many ways of representing documents, such as the vector model and graphical models, and many measures are used for weighting documents and their similarities. Document clustering is the main focus of this work and is discussed in detail below. Postprocessing covers the major applications in which document clustering is used, for example a recommendation application that uses the clustering results to recommend news articles to users [2].
II. LITERATURE SURVEY
Document clustering is one of the text mining techniques and has been around since the inception of the text mining domain. It is a strategy of grouping objects into categories or groups in such a manner that intra-cluster similarity is maximized and inter-cluster similarity is minimized. Here an object means a document, and a term refers to a unit of information within a document. Each document considered for clustering is represented as an m-dimensional vector d, where m is the total number of terms that occur in the document collection. Document vectors are the result of some weighting scheme, such as Term Frequency - Inverse Document Frequency (TF-IDF). Many approaches have come into existence for document clustering, among them information-theoretic co-clustering, non-negative matrix factorization and probabilistic model-based methods. However, these approaches do not use a specific measure of document similarity. In this paper we consider methods that explicitly use such a measure. From the literature, one of the most popular measures is the Euclidean distance:

Dist(di, dj) = ||di − dj||
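As an illustration of this weighting scheme and distance, a minimal Python sketch (a toy example; the function names are our own, and production systems would use sparse vector representations):

```python
import math
from collections import Counter

def tf_idf_vectors(docs, vocab):
    """m-dimensional TF-IDF document vectors (one common weighting scheme)."""
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n / df[t]) if df[t] else 0.0
                        for t in vocab])
    return vectors

def euclidean(di, dj):
    """Dist(di, dj) = ||di - dj||"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(di, dj)))

docs = [["graph", "index", "model"], ["graph", "cluster", "model"]]
vocab = sorted({t for d in docs for t in d})
v = tf_idf_vectors(docs, vocab)
print(euclidean(v[0], v[1]))
```

Terms shared by every document receive an IDF of zero, so only the discriminating terms ("index" vs. "cluster" here) contribute to the distance.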
In the text mining domain, the cosine similarity measure is also often used to assess document similarity, mostly for high-dimensional and sparse document clustering. Cosine similarity is also applied in a variant of k-means known as spherical k-means, which maximizes the cosine similarity between each cluster's centroid and the documents in the cluster. The difference between k-means using Euclidean distance and k-means using cosine similarity is that the former focuses on vector magnitudes while the latter concentrates on vector directions. Another popular approach is graph partitioning, in which the document corpus is treated as a graph; the min-max cut algorithm follows this procedure and concentrates on minimizing a centroid-based objective function. In nearest-neighbor graph clustering methods, such as CLUTO's graph method, the concept of similarity is somewhat different from the previously discussed methods: two documents may have a certain value of cosine similarity, but if neither of them is in the other one's neighborhood, they have no connection between them. In existing work, real document datasets are used for the validity test. The first is the Reuters collection, a famous test collection for text classification. In our validity test, we selected around 2,000 documents from various categories: "abed", "basic", "interest", "make", "money-fix", "boat" and "operate" to form the Reuters subset; some documents may appear in more than one category. The second dataset is k1b, a collection of 2,000 web pages organized in a subject hierarchy with six topics: "strength", "entertainment", "game", "political affairs", "tech" and "business". It was created for a past study in information retrieval called WebACE.

EXISTING SYSTEM (Multiviewpoint-Based Clustering, MVS):
1. The cosine-based similarity measure yields high outliers for noisy data points.
2. The existing system is tested on offline corpus datasets and does not suit online/desktop document clustering.
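For reference, the cosine measure discussed above can be sketched as follows; unlike the Euclidean distance, it depends only on vector direction, not magnitude:

```python
import math

def cosine_similarity(di, dj):
    """cos(di, dj) = di . dj / (||di|| ||dj||); direction only, magnitude ignored."""
    dot = sum(a * b for a, b in zip(di, dj))
    ni = math.sqrt(sum(a * a for a in di))
    nj = math.sqrt(sum(b * b for b in dj))
    return dot / (ni * nj) if ni and nj else 0.0

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction, twice the magnitude
print(cosine_similarity(a, b))   # close to 1.0, although ||a - b|| is not 0
```

This is why spherical k-means, which works with unit-normalized vectors, behaves differently from Euclidean k-means on documents of very different lengths.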
PROPOSED SYSTEM IMPROVEMENTS:
An indexing-graph based clustering approach to remove noisy points and provide fast similarity matching.
The existing work is extended to desktop and web document clustering instead of offline datasets.
III. PROPOSED SYSTEM
The vector space model does not represent any relation among the words, so sentences are broken into their individual components with no representation of the sentence structure. In contrast, the proposed Document Index Graph (DIG for short) indexes the words while maintaining the sentence structure of the original documents. This allows us to use more informative phrase matching instead of individual-word matching. Moreover, the DIG also captures the different levels of significance of the original sentences, thus allowing us to make use of sentence significance as well [4]. Document clustering can loosely be defined as "clustering of documents". Clustering is a process of understanding the similarity and/or dissimilarity between the given objects and thus dividing them into meaningful subgroups sharing common characteristics. Good clusters are those in which the members share a great many characteristics. Tokenization is the task of chopping text up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
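A minimal tokenizer in this spirit (an illustrative sketch, not the system's actual implementation):

```python
import re

def tokenize(sentence):
    # chop into lowercase word tokens, throwing away punctuation
    return re.findall(r"[a-z0-9]+", sentence.lower())

print(tokenize("Graph-based indexing, for phrase matching!"))
# → ['graph', 'based', 'indexing', 'for', 'phrase', 'matching']
```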
Data preprocessing
In the data preprocessing stage, several steps are used to prepare the documents for the next stage. Preprocessing takes a plain text document as input and produces as output a set of tokens (single terms or n-grams) to be included in the vector model. These steps can typically be stated as follows:
1. Punctuation, special characters and numbers are filtered from the document.
2. Each document is partitioned into phrases, and each phrase is tokenized into its constituent words.
3. Stopwords detected in the provided stopword list are removed.
4. A POS (Part Of Speech) tag is obtained for each remaining word, and every word that is not a verb or a noun is eliminated.
5. Words with very low frequency, as well as words that occur too often, are removed.
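The five steps above can be sketched as follows. This is an illustrative outline: step 4 (the POS filter) is stubbed out because a real system would call an external tagger, and the stopword list and frequency thresholds are assumed values:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # sample list

def preprocess(document, min_freq=1, max_freq=50):
    """Steps 1-5 of the preprocessing stage (illustrative sketch)."""
    # 1. filter punctuation, special characters and numbers (keep '.' for step 2)
    text = re.sub(r"[^a-zA-Z\s.]", " ", document)
    # 2. partition into phrases (here: sentences) and tokenize each phrase
    phrases = [p.split() for p in text.lower().split(".") if p.strip()]
    # 3. remove stopwords found in the stopword list
    phrases = [[w for w in p if w not in STOPWORDS] for p in phrases]
    # 4. (POS filtering to keep only nouns/verbs is stubbed out here;
    #    a real system would call a tagger such as NLTK's pos_tag)
    # 5. drop words that are too rare or occur too often
    freq = Counter(w for p in phrases for w in p)
    phrases = [[w for w in p if min_freq <= freq[w] <= max_freq] for p in phrases]
    return [p for p in phrases if p]

print(preprocess("The graph index. Graph matching is fast!"))
```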
Document Representation
In this stage each document is represented in the form given in Fig. 3, by detecting each new phrase and assigning an id to each cleaned, unrepeated phrase (one that neither contains the same words nor carries the same meaning as an existing phrase). The main challenge is to detect the phrases that convey the same meaning. This is done by constructing a feature vector for each new phrase; a similarity measure is then calculated between each new phrase and the phrases that have already been added, and when the similarity exceeds a threshold (an assumed value), one of them is discarded.
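The paper does not name the phrase similarity measure it uses; as one concrete possibility, word-set (Jaccard) similarity with an assumed threshold of 0.8 could drive the id assignment:

```python
def phrase_vector(phrase):
    """Feature vector for a phrase: here simply its word set (illustrative)."""
    return set(phrase.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def assign_phrase_ids(phrases, threshold=0.8):
    """Assign an id to each new phrase; discard a phrase whose similarity to an
    already-indexed phrase exceeds the threshold (an assumed value)."""
    index = {}        # id -> representative phrase
    vectors = []      # (id, feature vector) of phrases kept so far
    next_id = 0
    for p in phrases:
        v = phrase_vector(p)
        if any(jaccard(v, u) > threshold for _, u in vectors):
            continue  # near-duplicate: keep the existing phrase
        index[next_id] = p
        vectors.append((next_id, v))
        next_id += 1
    return index

print(assign_phrase_ids(["graph index model", "graph index model", "phrase matching"]))
```

A repeated phrase has Jaccard similarity 1.0 with its earlier copy and is discarded, so only distinct phrases receive ids.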
Document Subgraph. Each document di is mapped to a subgraph gi that represents the document in a standalone manner.
Cumulative DIG. The cumulative DIG is the graph G representing all documents processed up to the current document.
Phrase Matching. A list of matching phrases between documents di and dj is computed by intersecting the subgraphs of both documents, gi and gj.

Document Index Graph construction and phrase matching:
1: D ← new document
2: for each sentence s in D do
3:   w1 ← first word in s
4:   if w1 is not in G then
5:     add w1 to G
6:   end if
7:   L ← empty list {L is a list of matching phrases}
8:   for each word wi ∈ {w2, w3, ..., wk} in s do
9:     if (wi−1, wi) is an edge E in G then
10:      extend phrase matches in L for sentences that continue along (wi−1, wi)
11:      add new phrase matches to L
12:    else
13:      add edge E(wi−1, wi) to G
14:      update the sentence path in nodes wi−1 and wi
15:    end if
16:  end for
17: end for
18: output matching phrases in L

The DIG is built incrementally by processing one document at a time. When a new document is introduced, it is scanned sequentially, and the graph is updated with the new sentence information as necessary. New words are added to the graph as necessary and connected with other nodes to reflect the sentence structure. The graph-building process becomes less memory-demanding when a new document introduces no new words (or very few). At this point the graph becomes more stable, and the only operation needed is to update the sentence structure in the graph to accommodate
for the new sentences introduced. It is important to note that introducing a new document requires inspecting (or adding) only the words that appear in that document, not every node in the graph. This is where the efficiency of the model comes from [3].

IV. RESULTS
Fig4: Selecting the directory level for graph indexing
All experiments were performed on an Intel(R) Core(TM)2 CPU at 2.13 GHz with 2 GB RAM, running Microsoft Windows XP Professional (SP2).
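For concreteness, the DIG construction and phrase-matching algorithm of Section III can be sketched in Python. This is an illustrative approximation, not the authors' code: the full DIG records sentence paths at each node so matches are confirmed per sentence, while this sketch uses only per-edge document sets and may therefore over-match across sentences of the same document:

```python
from collections import defaultdict

class DocumentIndexGraph:
    """Minimal sketch of the DIG. Nodes are words; each edge (w_prev, w)
    carries the set of documents whose sentences traverse it. Matching
    phrases with earlier documents are collected while the new document
    is being indexed, mirroring steps 1-18 of the algorithm."""

    def __init__(self):
        self.edges = defaultdict(set)   # (w_prev, w) -> {doc_id, ...}

    def add_document(self, doc_id, sentences):
        matches = []                    # (other_doc, "matched phrase") pairs

        def emit(active, surviving=()):
            # flush phrase matches that cannot be extended any further
            for d, phrase in active.items():
                if d not in surviving and len(phrase) >= 2:
                    matches.append((d, " ".join(phrase)))

        for s in sentences:
            words = s.split()
            active = {}                 # other_doc -> phrase words matched so far
            for prev, w in zip(words, words[1:]):
                sharing = self.edges[(prev, w)] - {doc_id}
                if sharing:             # edge already in G: extend/start matches
                    nxt = {d: active.get(d, [prev]) + [w] for d in sharing}
                    emit(active, surviving=nxt)
                    active = nxt
                else:                   # new edge: add it to G
                    emit(active)
                    active = {}
                self.edges[(prev, w)].add(doc_id)  # update sentence path
            emit(active)
        return matches
```

Indexing a second document that shares a sentence fragment with the first returns the shared phrase, while the graph grows only by the genuinely new words and edges.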
Fig1: Home page of MVClustering
Fig2: Creating index for graphing the pointers
Fig3: Selecting the directory for graph indexing
Fig5: Successfully completed graph indexing
Fig6: Searching the keyword-based MVClustering
Fig7: Results of MV Graph index clustering
Fig8: Results of each graph indexing directory.
Fig10: Phrase-based multi-view graph index searching.
Fig11: Results for the phrase "java program" in the graph indexing model.
Fig9: Results showing the multi-view search with respect to file types.

Performance of word and phrase queries in the graph indexing model
(average accuracy by file type):

               Word "java"    Phrase "java program"
    text            70                 100
    doc             97                  96
    html           100                  87
V. CONCLUSION AND FUTURE SCOPE
In this paper we studied the existing multiviewpoint-based similarity measuring method, named MVS. Both theoretical analysis and empirical examples show that MVS is likely more suitable for text documents than the well-known cosine similarity. After analyzing these results, we proposed a robust document clustering method using a graph-based index mechanism, which improved the time and accuracy of retrieving documents by keyword. The proposed system yields an accurate and higher cluster rate for each document compared to the existing approach.
REFERENCES
[1] Chun-ling Fan and Yuan-yuan Ren, "Study on the Edge Detection Algorithms of Road Image," Proc. ISIP '10, Third International Symposium on Information Processing, pp. 217-220, IEEE Computer Society, Washington, DC, USA, 2010.
[2] Pankaj Jajoo, "Document Clustering," IITR.
[3] Khaled M. Hammouda, "Phrase-based Document Similarity Based on an Index Graph Model."
[4] Duc Thang Nguyen and Lihui Chen, "Clustering with Multiviewpoint-Based Similarity Measure," IEEE Transactions on Knowledge and Data Engineering, vol. 24.
[5] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman Divergences," J. Machine Learning Research, vol. 6, pp. 1705-1749, Oct. 2005.
[6] E. Pekalska, A. Harol, R.P.W. Duin, B. Spillmann, and H. Bunke, "Non-Euclidean or Non-Metric Measures Can Be Informative," Structural, Syntactic, and Statistical Pattern Recognition, vol. 4109, pp. 871-880, 2006.
[7] M. Pelillo, "What Is a Cluster? Perspectives from Game Theory," Proc. NIPS Workshop on Clustering Theory, 2009.
[8] D. Lee and J. Lee, "Dynamic Dissimilarity Measure for Support-Based Clustering," IEEE Trans. Knowledge and Data Eng., vol. 22, no. 6, pp. 900-905, June 2010.