IDL - International Digital Library Of Technology & Research Volume 1, Issue 2, Mar 2017
Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Automatic Text Classification using Supervised Learning
Ms. NAYANA N MURTHY 1, Mrs. SHASHIREKHA H 2
Dept. of Computer Science
1 MTech Student, VTU PG Center, Mysuru, India
2 Guide, Assistant Professor, VTU PG Center, Mysuru, India
SURVEY PAPER

1. ABSTRACT
As time goes on, the digitization of text has been increasing remarkably, and the need to organize, categorize and classify text has become indispensable. Disorganized text with little categorization or classification gradually slows the response time of text and information retrieval. It is therefore necessary to organize, categorize and classify texts and digitized documents according to the descriptions proposed by text mining experts and computer scientists. Automated text classification is considered an essential method for managing and processing the large and continuously growing number of documents in digital form. In general, text classification plays a substantial role in information extraction, text retrieval and question answering. This paper emphasizes the text classification process using machine learning techniques.

2. INTRODUCTION
Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity because of the very large number of text documents that we have to deal with daily. In general, text classification includes topic-based text classification and genre-based text classification. Topic-based text categorization classifies documents according to their topics. Texts can also be written in many genres, for instance scientific articles, news reports, movie reviews and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization. Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles, even for documents within one genre. In other words, the data are heterogeneous.

Intuitively, text classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$. As in every supervised machine learning task, an initial dataset is needed. A document may be assigned to more than one category (Ranking Classification), but this paper considers only research on Hard Categorization (assigning a single category to each document). Moreover, approaches that take into account information other than the pure text, such as the hierarchical structure of the texts or the date of publication, are not presented. This is because the main aim of this paper is to present techniques that make the most of the text of each document and perform best under this condition.

3. PLAN OF WORK FLOW
B2B market places are an intermediate layer for business communications that provides one serious advantage to their clients: they can communicate with a large number of customers through one communication channel to the market place. A successful market place has to deal with various aspects. It has to integrate with various hardware and software platforms and has to provide a common protocol for information exchange. However, the real problem is the heterogeneity and openness of the exchanged content. Content management is therefore one of the real challenges in successful B2B electronic commerce. One serious problem is that document descriptions must be classified. Each document has its own taxonomy, which organizes documents
into their respective categories. Each supplier uses different structures and vocabularies to describe its documents. This may not cause a problem in a 1-1 relationship, where the buyer may get used to the private terminology of the supplier. B2B market places that enable n-m commerce cannot rely on such an assumption. They must classify all documents according to a standard classification schema that helps buyers and suppliers communicate their document information. A widely used classification schema is UNSPSC (United Nations Standard Products and Services Code). Classifying documents according to a schema like UNSPSC is a difficult and mainly manual task. It requires domain expertise and knowledge about the document domain. Finding the right place for a document description in a standard classification system such as UNSPSC is not at all trivial. Each document must be mapped to the corresponding document category in UNSPSC to create the document catalog. Document classification schemes contain a huge number of categories with far from sufficient definitions (e.g. over 12,000 classes for UNSPSC), and millions of documents must be classified according to them. Manual document classification is expensive, complicated, time-consuming and error-prone, so content management needs support in automating the document classification process. Text mining and machine learning work together to classify documents automatically. The figure below shows the flow of the text classification process.
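As a rough illustration of how such an automated flow can begin, the sketch below turns free-text descriptions into numeric term-weight vectors that a classifier could consume. It assumes TF-IDF weighting, which this paper does not prescribe; the helper names `tokenize` and `tfidf_vectors` and the sample descriptions are hypothetical and were invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic word tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()


def tfidf_vectors(descriptions):
    """Return (dictionary, vectors) with one TF-IDF weight per dictionary term."""
    tokenized = [tokenize(d) for d in descriptions]
    # The dictionary is every term that occurs in at least one description.
    dictionary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of descriptions in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty descriptions
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in dictionary
        ])
    return dictionary, vectors


# Hypothetical product descriptions, invented for illustration only.
descriptions = [
    "black ink cartridge for laser printer",
    "laser printer with duplex unit",
    "steel office chair",
]
dictionary, vectors = tfidf_vectors(descriptions)
```

Every vector has one entry per dictionary term, so its length grows with the vocabulary rather than with the length of a single description. Terms shared by several descriptions (here "laser" and "printer") receive lower weights than terms that occur in only one.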
From the text mining perspective, the motivation is Information Extraction (IE): extracting specific information from a document description. Natural Language Processing (NLP) aims to achieve a better understanding of natural language by computers and to represent the description semantically in order to improve the classification process. Text representation, an important aspect of the classification process, denotes the mapping of a document description into a compact form of its contents. A description is typically represented as a vector of term weights (word features) over a set of terms (the dictionary), where each term occurs at least once in at least one document description. A major characteristic of the classification problem is the extremely high dimensionality of text data: the number of potential features often exceeds the number of training documents.

4 SURVEY

4.1 Text Classification Using Machine Learning Techniques
“This survey covers machine learning techniques. Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity because of the very large number of text documents that we have to deal with daily. In general, text classification includes topic-based text classification and genre-based text classification. Topic-based text categorization classifies documents according to their topics. Texts can also be written in many genres, for instance scientific articles, news reports, movie reviews and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization. Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles, even for documents within one genre. In other words, the data are heterogeneous. Intuitively, text classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$. As in every supervised machine learning task, an initial dataset is needed.
A document may be assigned to more than one category (Ranking Classification), but this paper considers only research on Hard Categorization (assigning a single category to each document). Moreover, approaches that take into account information other than the pure text, such as the hierarchical structure of the texts or the date of publication, are not presented. This is because the main aim of this paper is to present techniques that make the most of the text of each document and perform best under this condition.”

4.2 A Review of Machine Learning Algorithms for Text-Documents Classification
“Text mining studies have recently been gaining importance because of the increasing number of electronic documents available from a variety of sources. The sources of unstructured and semi-structured information include the World Wide Web, governmental electronic repositories, news articles, biological databases, chat rooms, digital libraries, online forums, electronic mail and blog repositories. Proper classification and knowledge discovery from these resources is therefore an important area for research. Natural Language Processing (NLP), data mining and machine learning techniques work together to automatically classify and discover patterns in electronic documents. The main goal of text mining is to enable users to extract information from textual resources; it deals with operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization. However, how these documents can be properly annotated, presented and classified remains an open question. It involves several challenges, such as proper annotation of the documents, appropriate document representation, dimensionality reduction to handle algorithmic issues, and an appropriate classifier function that generalizes well and avoids over-fitting. Extraction, integration and classification of electronic documents from different sources, and knowledge discovery from these documents, are important for the research communities. Today the web is the main source of text documents, the amount of textual data available to us is constantly increasing, and approximately 80% of the information of an organization is stored in unstructured textual format, in the form of reports, email, views, news, etc.
It has been reported that approximately 90% of the world’s data is held in unstructured formats, so information-intensive business processes demand that we move beyond simple document retrieval to knowledge discovery. The need to automatically retrieve useful knowledge from the huge amount of textual data in order to assist human analysis is fully apparent. Predicting market trends from the content of online news articles, sentiments and events is an emerging topic for research in the data mining and text mining community. For this purpose, state-of-the-art approaches to text classification have been presented in which three problems were discussed: document representation, classifier construction and classifier evaluation. Constructing a data structure that can represent the documents, and constructing a classifier that can predict the class label of a document with high accuracy, are the key points in text classification.”

4.3 A Concept of Text Classification Using Machine Learning
“The modern information age produces a vast amount of textual data, which can also be termed unstructured data. The Internet and corporations spread across the globe produce textual data at an exponential rate, and individuals need to share it as required. If the generated data is properly organized and classified, the needed data can be retrieved easily and with the least effort. Hence automatic methods to organize and classify documents have become inevitable given such exponential growth in documents, especially after the increase in Internet usage by individuals. Automatic classification refers to assigning documents to a set of pre-defined classes based on the textual content of the document. The classification can be flat or hierarchical. When the class categories grow very large in number, say into the thousands, searching across so many categories becomes very difficult. This difficulty leads to hierarchical classification, in which the thematic relationship between the classes is also used when searching for documents. Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. Consider the case of sorting and organizing emails and files into folder hierarchies so that topics can be identified and topic-specific operations supported. One such attempt is the Yahoo web directory. If such classification is done manually, it has several disadvantages:
i. It needs domain experts in the areas of the predefined categories.
ii. It is time-consuming and leads to frustration.
iii. It is error-prone and could be biased by the employee (subject bias).
iv. Two human experts may disagree in their decisions.
v. The process needs to be repeated for new documents (possibly of another domain).
Machine learning therefore needs to be employed to automate the classification. In machine learning, two types of learning algorithms are generally found in the literature: supervised learning algorithms and unsupervised learning algorithms. In this paper we restrict ourselves to supervised learning.”

4.4 A Study on Document Classification using Machine Learning Techniques
“Due to the fast
growth of digital information available electronically, text mining plays a key role in managing information and knowledge, and has therefore become an active research area. Text mining, also known as intelligent text analysis, is the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. Typical text mining tasks include information extraction, topic tracking, document summarization, classification, clustering and question answering. Automated text classification is the act of dividing a set of input documents into two or more classes, where each document can be said to belong to one or multiple classes. Text classification aims at assigning pre-defined classes to text documents. An example would be to automatically label each incoming news story with a topic like “sports”, “politics” or “art”. The classification task starts with a training set $D = (d_1, \ldots, d_n)$ of documents that are already labeled with a class $c \in C$ (e.g. sport, politics). The task is then to determine a classification model $f : D \to C$, $f(d) = c$, which is able to assign the correct class to a new document $d$ of the domain. Text classification is a challenging task, as it is difficult to capture the meaning and abstract concepts of natural language from just a few keywords. The high dimensionality of the feature space also makes the classification problem very difficult. Text classification is commonly used to handle spam emails, to classify large text collections into topical categories, to manage knowledge, and to help Internet search engines.”

4.5 Various Machine Learning Techniques for Text Classification
“In this survey, we examine and compare the effectiveness of applying machine learning techniques to the sentiment classification problem.
A challenging aspect of this problem that seems to distinguish it from traditional topic-based classification is that, while topics are often identifiable by keywords alone, sentiment can be expressed in a more subtle manner.

Sentiment Analysis Definition
Sentiment Analysis is a Natural Language Processing and Information Extraction task that aims to obtain the writer’s feelings, expressed in positive or negative comments, questions and requests, by analyzing a large number of documents. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall tonality of a document.

What are the challenges?
Sentiment Analysis approaches aim to extract positive and negative sentiment-bearing words from a text and classify the text as positive or negative, or else as objective if no sentiment-bearing words can be found. In this respect, it can be thought of as a text categorization task. In text classification there are many classes corresponding to different topics, whereas in Sentiment Analysis there are only three broad classes: positive, negative and neutral. It might therefore seem that Sentiment Analysis is easier than text classification, but this is not quite the case. The general challenges can be summarized as:
1. Implicit Sentiment and Sarcasm
2. Domain Dependency
3. Thwarted Expectations
4. Pragmatics
5. World Knowledge
6. Subjectivity Detection
7. Entity Identification
8. Negation
Hence, it is not easy to do text categorization and understand what the user intends to say (the sentiment) because of the above-mentioned problems. The complexity of the problems varies from high to low: some problems, like World Knowledge, are easily solvable, and some, like Negation, are difficult. For this purpose, various algorithms such as Naive Bayes, SVM and Decision Tree are available.

Steps for analyzing the sentiment in a sentence:
1. Decide on the classifier algorithm and obtain appropriate data for training.
2. Preprocess and label the data.
3. Prepare the data for training.
4. Train the classifier with the help of libraries such as NLTK, libsvm, etc.
5. Make predictions by giving new test data to the trained classifier.

Text categorization is the task of assigning a Boolean value to each pair $(d_j, c_i) \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories.
A value of T assigned to $(d_j, c_i)$ indicates a decision to file $d_j$ under $c_i$, while a value of F indicates a decision not to file $d_j$ under $c_i$.”

4.6 Types of Machine Learning Algorithms
“Machine learning algorithms are organized into a taxonomy based on the desired outcome of the algorithm. Common algorithm types include:
• Supervised learning: the algorithm generates a function that maps inputs to desired outputs. One standard formulation of the supervised learning task is the classification problem: the learner is required to learn a function which maps a vector into one of several classes by looking at several input-output examples of the function.
• Unsupervised learning: the algorithm models a set of inputs; labeled examples are not available.
• Semi-supervised learning: the algorithm combines both labeled and unlabeled examples to generate an appropriate function or classifier.
• Reinforcement learning: the algorithm learns a policy of how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm.
• Transduction: similar to supervised learning, but does not explicitly construct a function; instead, it tries to predict new outputs based on training inputs, training outputs and new inputs.
• Learning to learn: the algorithm learns its own inductive bias based on previous experience.
The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory. Machine learning is about designing algorithms that allow a computer to learn. Learning does not necessarily involve consciousness; rather, it is a matter of finding statistical regularities or other patterns in the data. Thus, many machine learning algorithms barely resemble how a human might approach a learning task. However, learning algorithms can give insight into the relative difficulty of learning in different environments.”
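The supervised formulation above, learning a function from labeled input-output examples and then applying it to new inputs, can be sketched with a small multinomial Naive Bayes classifier, one of the algorithms named in Section 4.5. This is a minimal illustration rather than any surveyed author's implementation: the function names `train_naive_bayes` and `classify`, the add-one (Laplace) smoothing, the whitespace tokenization and the toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect class counts and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(model, text):
    """Return the single most probable label (hard categorization)."""
    class_counts, word_counts, vocabulary = model
    n_examples = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):  # sorted for deterministic tie-breaking
        # Log prior of the class plus the log likelihood of each known word.
        score = math.log(class_counts[label] / n_examples)
        total = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy labeled examples, invented for illustration only.
training_examples = [
    ("the team won the match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("parliament passed the new budget", "politics"),
    ("the minister announced an election", "politics"),
]
model = train_naive_bayes(training_examples)
predicted = classify(model, "the team scored a goal")
```

Working in log space avoids numeric underflow when many word probabilities are multiplied. Because the classifier returns exactly one label, it corresponds to the hard categorization setting considered in this paper.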
5. CONCLUSION
This survey concludes that the text classification problem is an Artificial Intelligence research topic, especially given the vast number of documents available in the form of web pages and other electronic texts such as emails, discussion forum postings and other electronic documents. It has been observed that, even for a specified classification method, the classification performance of classifiers trained on different text corpora differs, and in some cases the differences are quite substantial. This observation implies that a) classifier performance depends to some degree on its training corpus, and b) good, high-quality training corpora may yield classifiers with good performance. Unfortunately, little research in the literature to date has addressed how to exploit training text corpora to improve classifier performance. Some important conclusions have not been reached yet, including:
• Which feature selection methods are both computationally scalable and high-performing across classifiers and collections? Given the high variability of text collections, do such methods even exist?
• Would combining uncorrelated but well-performing methods yield a performance increase?
• Changing the thinking from a word-frequency-based vector space to a concept-based vector space, and studying the methodology of feature selection under concepts to see whether this helps in text categorization.
• Making dimensionality reduction more efficient over large corpora.
Moreover, there are two other open problems in text mining: polysemy and synonymy. Polysemy refers to the fact that a word can have multiple meanings. Distinguishing between different meanings of a word (called word sense disambiguation) is not easy and often requires the context in which the word appears. Synonymy means that different words can have the same or a similar meaning.
Copyright@IDL-2017