Document Categorization using Improved KNN Classification


IJSRD - International Journal for Scientific Research & Development| Vol. 4, Issue 05, 2016 | ISSN (online): 2321-0613

Neha1, Dr. RK Chauhan2
1M. Tech. Research Scholar, 2Senior Professor
1,2Department of Computer Science & Applications
1,2Kurukshetra University, Kurukshetra, India

Abstract— Document categorization is the task of sorting a mixed collection of documents into groups such that the documents in each group belong to the same class. Classification is a data mining technique used to predict group membership for data instances. The relevance of keywords in documents has become essential to text mining. Easy storage creates the need for convenient retrieval, since there is little use in storing documents if they cannot be found. Consequently, categorization of documents has been applied to make it easier to find relevant information, and working with classified documents is more convenient. Thus, the main aim of this research is to design an improved KNN classification technique that classifies large sets of documents in less time and with improved accuracy in terms of F-measure and G-measure.

Key words: Data Mining, Improved KNN, Centre Prediction, TF-IDF, Confusion Matrix, Euclidean Distance

I. INTRODUCTION

Data mining is the subfield of computer science that deals with the computational process of discovering patterns in large data sets. The main aim of the data mining process is to extract vital information from a data set and convert it into an understandable format for future use. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from various different perspectives, categorize it, and summarize the relationships that are identified. Technically, data mining is the process of finding patterns among a large number of fields in large relational databases.

Knowledge Discovery in Databases (KDD) is the process used to make sure that useful information is derived from the data [2]. It is basically used to retrieve useful information from the database. KDD has the following steps:
1) Data cleaning: The first phase, in which irrelevant data are removed from the collection of data.
2) Data integration: The stage in which multiple data sources, whether heterogeneous or similar, are combined into a common source.
3) Data selection: In this step the data relevant to the analysis is decided on and retrieved from the data collection.
4) Data transformation: Also called data consolidation, in which the selected data is transformed into forms that are appropriate for the mining process.
5) Data mining: The most important step, in which intelligent techniques are applied to extract potentially useful patterns.
6) Pattern evaluation: In this phase only the interesting patterns that represent knowledge are identified, based on the given measures.
7) Knowledge representation: The final step, in which the discovered knowledge is visually presented to the user.
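The KDD steps listed above can be illustrated with a small pipeline. This is only a minimal sketch under stated assumptions: the toy records, the function names, and the choice of "terms shared by several documents" as the mined pattern are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy records from two sources (invented for illustration).
SOURCE_A = [
    {"id": 1, "text": "India won the cricket match", "label": "sports"},
    {"id": 2, "text": "", "label": "sports"},  # empty text: irrelevant record
]
SOURCE_B = [
    {"id": 3, "text": "Cricket team wins the final match", "label": "sports"},
    {"id": 4, "text": "Election results announced today", "label": "politics"},
]

def clean(records):
    """Data cleaning: drop records that carry no text."""
    return [r for r in records if r["text"].strip()]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]

def select(records, label):
    """Data selection: keep only the records relevant to the analysis."""
    return [r for r in records if r["label"] == label]

def transform(records):
    """Data transformation: lower-case and tokenize each text."""
    return [r["text"].lower().split() for r in records]

def mine(token_lists):
    """Data mining: count in how many documents each term occurs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))
    return counts

def evaluate(counts, min_docs):
    """Pattern evaluation: keep terms found in at least min_docs documents."""
    return {term: n for term, n in counts.items() if n >= min_docs}

def run_pipeline(label="sports", min_docs=2):
    records = integrate(clean(SOURCE_A), clean(SOURCE_B))
    patterns = evaluate(mine(transform(select(records, label))), min_docs)
    # Knowledge representation: return the result in a readable, sorted form.
    return sorted(patterns.items())

print(run_pipeline())  # [('cricket', 2), ('match', 2), ('the', 2)]
```

Each function corresponds to one step, so a step can be replaced (for example, a different mining technique) without touching the others.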

The basic block diagram of the data mining process is shown in Fig. 1 below:

Fig. 1: Block Diagram of Data Mining Process

Document categorization refers to the process of assigning a large amount of text to one or more categories according to the contents or attributes of the text. Document categorization mainly includes two tasks:
1) Clustering: The task of discovering structures and groups in the data that are similar in some way or another.
2) Classification: The task of generalizing known structure to apply it to new data.

The document classification process is mainly divided into various steps.

Decision tree classification: A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. When a decision tree is built, many branches may reflect anomalies in the training data caused by noise or outliers. When the training data does not fit in memory, decision tree construction becomes inefficient because the training data has to be swapped in and out of memory.

Bayesian classification: Bayesian classification is a statistical classification method. Bayesian classifiers can predict class membership probabilities, such as the probability that a given data instance belongs to a particular class. A Bayesian network, or directed acyclic graphical model, represents a set of random variables and their conditional dependencies through a directed acyclic graph.

Association rule classification: Frequent patterns and their correlation rules characterize interesting relationships between attribute conditions and class labels, and are thus widely used for classification. An association rule shows a strong association between attribute-value pairs that occur frequently in the data set. Association rules are mainly used to analyze the purchases of customers in a store.
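To make the idea of predicting class membership probabilities concrete, the following is a minimal sketch of a multinomial naive Bayes text classifier. Naive Bayes is only one simple Bayesian classifier and is chosen here as an assumption; the paper does not specify it. The training documents, the function names, and the use of add-one (Laplace) smoothing are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training documents (invented for illustration).
TRAIN = [
    ("cricket match score", "sports"),
    ("football match goal", "sports"),
    ("election vote result", "politics"),
    ("minister election speech", "politics"),
]

def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def class_probabilities(text, model):
    """Return P(class | text) for every class, assuming word independence."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        log_scores[label] = score
    # Normalize the log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {c: v / norm for c, v in unnormalized.items()}

model = train(TRAIN)
probs = class_probabilities("cricket match result", model)
print(max(probs, key=probs.get), probs)
```

On this toy data the query "cricket match result" is assigned to the sports class with probability 0.75, because "cricket" and "match" occur only in sports documents while "result" occurs only in a politics document.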
Nearest neighbor classification uses an instance-based classifier that locates the nearest neighbor in instance space and labels the unknown instance with the same class label as that known neighbor, with the instances represented in a vector space model (VSM) [1]. The nearest neighbor classifier can be considered a special case of the more general K-nearest neighbor classifier, and is therefore referred to as a KNN classifier. The best

All rights reserved by www.ijsrd.com


