A Similarity Measure for Text Classification and Clustering

Page 1

IJSRD - International Journal for Scientific Research & Development| Vol. 3, Issue 08, 2015 | ISSN (online): 2321-0613

A Similarity Measure for Text Classification and Clustering P.Mayilsamy1 P.Elango (M.C.A., M.Phil)2 1 Research Scholar 2Assistant Professor 1,2 Department of Computer Science 1,2 Gobi Arts & Science College (Autonomous), Gobichettipalayam Abstract—This paper introduces a live of similarity between two clustering’s of constant dataset made by two completely different algorithms, or maybe constant algorithmic rule. Mensuration the similarity between documents is a crucial operation within the text process field. This paper projected a replacement similarity measure. To compute the similarity between two documents with relevance a feature, the proposed measure takes the subsequent three cases into account: a) The feature seems in each documents, b) The feature seems in just one document, and c) The feature seems in none of the documents. For the primary case, the similarity will increase because the distinction between the two concerned feature values decreases. Moreover, the contribution of the distinction is generally scaled. For the second case, a hard and fast worth is contributed to the similarity. For the last case, the feature has no contribution to the similarity. The effectiveness of the live is evaluated on many real-world knowledge sets for text classification and bunch issues. Key words: Clustering Algorithms, Clustering Applications, Similarity Measures, Text Clustering

similarities between the info points. Locality Preserving Indexing (LPI) methodology could be a completely different spectral bunch methodology supported graph partitioning theory. The LPI methodology applies a weighted operate to every pairwise distance trying to specialise in capturing the similarity structure, instead of the difference structure of the documents. However, it does not overcome the essential limitation of geometer distance. Moreover, the choice of the weighted functions is commonly a troublesome task. Simultaneously, bunch still needs additional sturdy unsimilarity or similarity measures. The thesis is motivated by investigations from the on top of and similar analysis findings. It seems that the character of similarity live plays a really necessary role within the success or failure of a bunch methodology. The first objective is to derive a completely unique methodology for activity similarity between knowledge objects in distributed and high-dimensional domain, notably text documents. From the projected similarity live new bunch criterion functions ar developed and individual bunch algorithms ar introduced, that ar quick and climbable like k-means, however also are capable of providing high-quality and consistent performance [2].

I. INTRODUCTION Study of similarity of agglomeration was ab initio intended by an exploration on automatic text categorization of foreign language texts, as explained below. Supported varied distance measures, variety of ways are planned to handle document agglomeration. A typical and wide used distance live is that the geometrician distance. The k-means methodology is one amongst the ways that use the geometrician distance, that minimizes the add of the square geometrician distance between the information points and their corresponding cluster centers. Since the document house is often of high spatiality, it is desirable to seek out a low-dimensional illustration of the documents to scale back computation complexness. Low computation value is achieved in spectral bunch ways, during which the documents are first projected into a low-dimensional linguistics area and so a standard bunch rule is applied to finding document clusters. Latent Semantic Indexing (LSI) is one amongst the effective spectral bunch ways, aimed toward finding the most effective approximation to the first document space by minimizing the world reconstruction error (Euclidean distance). However, due to the high spatiality of the document area, a definite illustration of documents typically reside on a nonlinear manifold embedded within the similarities between the info points. Unfortunately, the geometer distance could be a difference live that describes the dissimilarities instead of similarities between the documents. Thus, it is powerless to effectively capture the nonlinear manifold structure embedded within the similarities between them. A good document bunch methodology should be ready to realize a low-dimensional illustration of the documents that may best preserve the

II. MOTIVATION Developing strategies to prepare massive amounts of unstructured text documents into a smaller variety of meaty clusters would be terribly useful as document clump is important to such tasks as compartmentalization, filtering, automated meta knowledge generation, signified illumination, population of ranked catalogs of internet resources and application requiring document organization [1]. To spot an appropriate tool and algorithmic rule for clump that produces the most effective clump solutions, it becomes necessary to own a way for comparison the results of various clump algorithms. Though considerable work has been drained coming up with clustering algorithms, not abundant analysis has been done on formulating a live for the similarity of two completely different clustering algorithms [2]. Thus, the most goal of this paper is to: first, propose Associate in Nursing formula for activity similarity analysis among totally different cluster algorithms; second, apply the formula to calculate similarity of varied pairs of cluster strategies applied to a Portuguese corpus and therefore the Iris dataset; finally, to cross- validate the results of similarity analysis with the geometrician (centroids) distances and Pearson parametric statistic, exploitation a similar datasets. III. RELATED WORKS Gianni Amati et al the authors were explicit that the introduce and build a framework for derivation probabilistic models of information retrieval. The models square measure statistic models of IR obtained within the language model approach. It derive term-weighting models by mensuration

All rights reserved by www.ijsrd.com

563


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.