IDL - International Digital Library Of Technology & Research Volume 1, Issue 2, Mar 2017
Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Automatic Text Classification using Supervised Learning
Ms. NAYANA N MURTHY 1, Mrs. SHASHIREKHA H 2
Dept. of Computer Science
1 MTech Student, VTU PG Center, Mysuru, India
2 Guide, Assistant Professor, VTU PG Center, Mysuru, India
SURVEY PAPER

1. ABSTRACT - As time goes on, the digitization of text has increased remarkably, and the need to organize, categorize and classify text has become indispensable. Poorly organized text with little categorization may gradually slow the response time of text and information retrieval. It is therefore necessary to organize, categorize and classify texts and digitized documents according to the descriptions proposed by text mining experts and computer scientists. Automated text classification is considered an essential method for managing and processing the large and continuously growing volume of documents in digital form. In general, text classification plays a substantial role in information extraction, text retrieval, and question answering. This paper focuses on the text classification process using machine learning techniques.

2. INTRODUCTION

Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity because of the very large number of text documents that we have to deal with daily. In general, text classification includes topic-based text classification and genre-based text classification. Topic-based text categorization classifies documents according to their topics. Texts can also be written in many genres, for instance scientific articles, news reports, movie reviews, and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization. Typically, most data for genre classification are collected from the web, through
newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles, even for documents within one genre. In other words, the data are heterogeneous.

Intuitively, text classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$. As in every supervised machine learning task, an initial dataset is needed. A document may be assigned to more than one category (Ranking Classification), but this paper considers only research on Hard Categorization (assigning a single category to each document). Moreover, approaches that take into account information other than the pure text, such as the hierarchical structure of the texts or the date of publication, are not presented. This is because the main aim of this paper is to present techniques that make the most of the text of each document and perform best under this condition.

3. PLAN OF WORK FLOW

B2B marketplaces are an intermediate layer for business communications that provides one serious advantage to their clients: they can communicate with a large number of customers through a single communication channel to the marketplace. A successful marketplace has to deal with various aspects. It has to integrate with various hardware and software platforms and has to provide a common protocol for information exchange. However, the real problem is the heterogeneity and openness of the exchanged content. Therefore, content management is one of the real challenges in successful B2B electronic commerce. One serious problem is that document descriptions must be classified. Each document will have its own taxonomy which organizes document
Copyright@IDL-2017