International Journal of Research in Advent Technology, Vol.2, No.8, August 2014 E-ISSN: 2321-9637
Question Classification Using Support Vector Machines and Lexical, Semantic and Syntactic Features
Kiran Yadav, Megha Mishra
M.E. Scholar, SSCET Bhilai; Professor, SSCET Bhilai
Yadavkiran64@gmail.com
Abstract: Question classification plays an important role in question answering systems: the quality of the question classification largely determines the quality of the overall system. In this paper we present a question classification approach based on an SVM and a rich feature set. A Support Vector Machine model is trained as a classifier on the coarse categories, and the same features are also used to classify the fine category. SVMs have been used for question classification before, with good results, and we likewise use an SVM as our classifier. The experimental results show that the extracted features perform well with the SVM and that our approach reaches good classification accuracy.
Index Terms: Question answering, text classification, machine learning, support vector machine.
1. INTRODUCTION
In this work we take a machine learning approach to question classification, treating it as a supervised classification task. To prepare the learning model, we designed a rich set of features that are predictive of question categories. The classification serves two purposes. First, it constrains the answer types, which helps later processing to locate and verify the answer: for the question "Which city has the largest population?" we do not want to test every phrase in a document to see whether it provides an answer. Second, some characteristics of question classification distinguish it from ordinary text classification. On one hand, questions are relatively short and contain less word-based information than full documents; on the other hand, short questions are amenable to more accurate, deeper-level analysis. In this way, our work on question classification can also be seen as a case study in bringing semantic information to text classification.
As with syntactic information such as part-of-speech tags, the obvious way to use lexical semantic information is to replace or augment each word with its semantic class in the given context, then generate a feature-based representation and learn a mapping from this representation to the desired property. This general scheme leaves several issues open that make the analogy to syntactic categories nontrivial. First, it is not obvious which semantic categories to allow or how to derive them. Second, it is not obvious how to handle the harder problem of semantic disambiguation when deciding the representation of a sentence. We merge lexical, syntactic and semantic features to increase the accuracy of question classification. Question classification plays an important role in question answering, and features are the key to an accurate question classifier.
Question answering systems address this problem differently: they give users a natural language interface in which to express their information need in the form of a question, and they retrieve the exact answer to that question, in place of a set of documents, from a (typically large) collection such as the WWW. Because the development cycle of QA systems in different domains is long and reuse is low, we developed a state-of-the-art machine learning based question classifier that uses a rich set of lexical, syntactic and semantic features.
2. QUESTION CLASSIFICATION
Question classification helps determine the answer type of a given question, and it is used mainly in question answering systems. It works category-wise: when a question arrives, searching for the answer within its category yields faster results. When we search for something in a search engine such as Google, the engine returns everything related to the search words; with a category-wise classifier, by contrast, only the question's answer is presented.
Table 1. The coarse question categories
Coarse: ABBR, DESC, ENTY, HUM, LOC, NUM
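To make the coarse taxonomy of Table 1 concrete, the following minimal sketch pairs a few example questions with coarse labels; the `bag_of_words` helper and the example pairings are illustrative, not taken from our experiments.

```python
# The six coarse classes of Table 1, plus a minimal bag-of-words
# featurizer. The helper and the example question/label pairs are
# illustrative only.

COARSE_CLASSES = ["ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"]

def bag_of_words(question):
    """Represent a question as a set of lowercased word features."""
    return {w.lower() for w in question.replace("?", " ").split()}

labeled = [
    ("What does NASA stand for?",         "ABBR"),
    ("What is the capital of Australia?", "LOC"),
    ("Who wrote Hamlet?",                 "HUM"),
    ("How many moons does Mars have?",    "NUM"),
]

for q, c in labeled:
    assert c in COARSE_CLASSES
    print(c, sorted(bag_of_words(q)))
```

A classifier then learns a mapping from such feature sets to the coarse labels.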
To simplify the following experiments, we assume that each question resides in only one category; that is, an ambiguous question is labeled with its most probable category.
2.1 Question types
What is the fastest fish in the world?
What's the colored part of the eye called?
What color is Mr. Spock's blood?
Name a novel written by John Steinbeck.
What currency is used in Australia?
What is the fear of cockroaches called?
What are the historical trials following World War II called?
What is the world's best selling cookie?
What instrument is Ray Charles best known for playing?
What language is mostly spoken in Brazil?
What letter adorns the flag of Rwanda?
What's the highest hand in straight poker?
What is the state tree of Nebraska?
What is the best brand for a laptop computer?
What religion has the most members?
What game is Garry Kasparov really good at?
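The one-question-one-category assumption matches a common labeled-data layout in which each line carries exactly one `COARSE:fine` label followed by the question text. A small parsing sketch, assuming that layout (the format and helper name are our assumptions):

```python
# Sketch of parsing question data in the "COARSE:fine question text"
# layout, e.g. "ENTY:animal What is the fastest fish in the world ?".
# Each line carries exactly one label, matching the single-category
# assumption above. The format is an assumption, not from the paper.

def parse_line(line):
    label, question = line.split(" ", 1)
    coarse, _, fine = label.partition(":")
    return coarse, fine, question.strip()

line = "ENTY:animal What is the fastest fish in the world ?"
coarse, fine, q = parse_line(line)
print(coarse, fine)  # → ENTY animal
```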
3. RELATED WORK
Hand-made rule-based systems extract names using large sets of human-made rules. These systems basically consist of a set of patterns over grammatical (e.g. part-of-speech), syntactic (e.g. word precedence) and orthographic features (e.g. capitalization), in combination with dictionaries. An example for this type of system: in "President rao said bankers talks will make discussions on private, U.S. forces to leave Iraq", a proper noun that follows a person's title (president) is a person's name, and a capitalized proper noun (Iraq) after the verb (to leave) is a location name. In this family of approaches, Appelt et al. propose FASTUS, a name identification system based on carefully handcrafted regular expressions, which divides the task into three steps: recognizing phrases, recognizing patterns, and merging incidents. These approaches rely on manually coded rules and manually compiled corpora. Such models achieve better results in restricted domains and are capable of detecting complex entities that learning models have difficulty with. However, rule-based NE systems lack portability and robustness, and the high cost of rule maintenance increases even when the data changes only slightly. These approaches are often domain- and language-specific and do not necessarily adapt well to new domains and languages. In machine learning-based NER systems, the purpose of the named entity recognition approach is
to convert the identification problem into a classification problem and employ a statistical classification model to solve it. In this type of approach, the systems look for patterns and relationships in the text and build a model using statistical methods and machine learning algorithms. Based on this model, the systems identify and classify nouns into particular classes such as persons, locations, times, etc. There are two types of machine learning model used for NER: supervised and unsupervised. Supervised learning involves using a program that can learn to classify a given set of labeled examples that are made up of the same number of features; each example is thus represented with respect to the different feature spaces. The learning process is called supervised because the people who marked up the training examples are teaching the program the right distinctions. The supervised learning approach requires labeled training data to construct a statistical model, and it cannot achieve good performance without a large amount of training data because of the data sparseness problem. In recent years several statistical methods based on supervised learning have been proposed. Bikel et al. propose a learning name-finder based on a hidden Markov model [8] called Nymble, while Borthwick et al. investigate exploiting diverse knowledge sources via maximum entropy in named entity recognition [9,10]. A system for tagging unknown proper names with a decision tree model was proposed by Bechet et al. [5], while Wu et al. presented a named entity recognition system based on support vector machines [2]. The unsupervised learning method is another type of machine learning model, in which the model learns without any feedback; the goal of the program is to build representations from the data.
These representations can then be used for data compression, classification, decision making, and other purposes. Unsupervised learning is not a very popular approach for NER, and the systems that do use unsupervised learning are usually not completely unsupervised. Among these approaches, Collins et al. discuss an unsupervised model for named entity classification that uses unlabeled examples [7], and Kim et al. propose unsupervised named entity classification models and their ensembles, which use a small-scale named entity dictionary and an unlabeled corpus to classify named entities [4]. Unlike rule-based methods, these approaches can be easily ported to different domains or languages. In hybrid NER systems, the approach is to combine rule-based and machine learning-based methods, making new methods that use the strongest points of each.
4. QUESTION FEATURES
One of the main challenges in developing a supervised classifier for a particular domain is to identify and design a rich set of features, a process generally referred to as feature engineering. In the subsections that follow, we present the different types of features that were used in the question classifier, and how they are extracted from a given question.
4.1 Lexical features
Lexical features refer to word-related features that are extracted directly from the question. In this work, we use word-level n-grams as lexical features. We also include in this section the techniques of stemming and stop word removal, which can be used to reduce the dimensionality of the feature set.
4.1.1 Stemming and stop word removal
Stemming is a technique that reduces words to their grammatical roots or stems by removing their affixes. For instance, after applying stemming, the words inventing and invented both become invent. We exploit this technique in our question classifier in the following manner. First, we represent the question using the bag-of-words model as previously described. Second, we apply Porter's stemming algorithm (Porter, 1980) to transform each word into its stem. The following two examples depict a question before and after stemming is applied, respectively.
(1) Which countries are bordered by France?
(2) Which countri are border by Franc?
Another related technique is to remove stop words, which are frequently occurring words with no semantic value, such as the articles the and an. Both of these techniques are mainly used to reduce the feature space of the classifier – i.e., the number of total features that need to be considered.
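The stemming and stop-word removal just described can be sketched as follows. This is a crude suffix-stripping stand-in for Porter's algorithm (which applies a much longer cascade of context-sensitive rules), and the stop-word list is illustrative.

```python
# A crude suffix-stripping stemmer plus stop-word filter, sketched as
# a rough approximation of the Porter algorithm described in the text.
# The stop-word list and suffix table are illustrative only.

STOP_WORDS = {"the", "a", "an", "is", "are", "by", "which", "what"}
SUFFIXES = ("ing", "ed", "es", "s")  # try longest suffixes first

def crude_stem(word):
    for suf in SUFFIXES:
        # only strip when a reasonably long stem remains
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(question):
    words = question.lower().replace("?", "").split()
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

print(preprocess("Which countries are bordered by France?"))
# → ['countri', 'border', 'france']
```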
This is achieved by collapsing several different forms of the same word into one distinct term by applying stemming, or by eliminating words which are likely to be present in most questions (stop words) and which do not provide useful information for the classifier.
4.2 Syntactic features
In addition to the information that is readily available in the input instance, it is common in natural language processing tasks to augment the sentence representation with syntactic categories, under the assumption that the sought-after property, for which we seek the classifier, depends on the syntactic role of a word in the sentence rather than on the specific word.
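The syntactic augmentation just described can be sketched as follows. A real system would run a POS tagger or shallow parser; here a tiny hand-built tag lexicon stands in for one, and the helper name is our own.

```python
# Sketch of augmenting word features with syntactic categories. A real
# system would use a POS tagger / shallow parser; the small hand-built
# lexicon below (Penn-Treebank-style tags) is an illustrative stand-in.

POS_LEXICON = {
    "who": "WP", "was": "VBD", "the": "DT",
    "first": "JJ", "woman": "NN", "killed": "VBN",
}

def pos_augmented_features(question):
    feats = []
    for w in question.lower().replace("?", "").split():
        feats.append(w)                             # the word itself
        if w in POS_LEXICON:
            feats.append("POS=" + POS_LEXICON[w])   # its syntactic tag
    return feats

print(pos_augmented_features("Who was the first woman killed?"))
```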
4.2.1 Question headword
The question headword is a word in a given question that represents the information that is being sought. In the following examples, the headword is in bold face:
(1) What is Australia's national flower?
(2) Name an American made motorcycle.
(3) Which country are Godiva chocolates from?
(4) What is the name of the highest mountain in Africa?
In Example 1, the headword flower provides the classifier with an important clue to correctly classify the question to ENTITY:PLANT. By the same token, motorcycle in Example 2 provides a hint that helps classify the question to ENTITY:VEHICLE. Indeed, in examples like these the headword alone serves as an important feature for unveiling the question's category, which is why we dedicate great effort to its accurate extraction. Our baseline classifier makes use of standard POS information and phrase information extracted by a shallow parser. Specifically, we use chunks (non-overlapping phrases) and head chunks. The following example illustrates the information available when generating the syntax-augmented feature-based representation.
Question: Who was the first woman killed in the Vietnam War?
Chunking: [NP Who] [VP was] [NP the first woman] [VP killed] [PP in] [NP the Vietnam War] ?
The head chunks denote the first noun or verb chunk after the question word in a question. For example, in the above question, the first noun chunk after the question word who is 'the first woman'. The features are represented as abstract tags in each example.
4.3 Semantic features
Similar logic can be applied to semantic categories. In many cases, the sought-after property seems not to depend on the specific word used in the sentence, which could be replaced without affecting this property, but rather on its 'meaning'. For example, given the question What Cuban dictator did Fidel Castro force out of power in 1958?, we would like to determine that its answer should be the name of a person.
Knowing that dictator refers to a person is essential to correct classification. This work systematically studies four semantic information sources and their contribution to classification: (1) automatically acquired named entity categories (NE), (2) word senses in WordNet (SemWN), (3) manually constructed word lists related to specific categories of interest (SemCSR), and (4) automatically generated
semantically similar word lists (Zhang & Lee, 2003) (SemSWL). For the four external semantic information sources, we define semantic categories of words and incorporate the information into question classification in the same way: if a word w occurs in a question, the question representation is augmented with the semantic category (or categories) of the word. For example, in the question What is the state flower of California?, given that plant (for example) is the only semantic class of flower, the feature extractor adds plant, an abstract label, to the question representation.
4.3.1 Named entities
A named entity (NE) recognizer assigns a semantic category to some of the noun phrases in the question. The scope of the categories used here is broader than that of a common named entity recognizer. With additional categories that could help question answering, such as profession, event, holiday, plant, sport and medical terms, we redefine our task in the direction of semantic categorization. The named entity recognizer was built on the shallow parser described in (Voorhees, 2004), and was trained to categorize noun phrases into one of 34 different semantic categories of varying specificity. Its overall accuracy (Fβ=1) is above 90%. For the question Who was the first woman killed in the Vietnam War?, the named entity tagger will return:
NE: Who was the [Num first] woman killed in the [Event Vietnam War] ?
As described above, the identified named entities are added to the question representation.
4.3.2 WordNet senses
In WordNet (Peters, 2005), words are organized according to their 'senses' (meanings). Words of the same sense can, in principle, be exchanged in some contexts. The senses are organized in a hierarchy of hypernyms and hyponyms. Word senses provide another effective way to describe the semantic category of a word.
For example, in WordNet 1.7 the word water belongs to 5 senses. The first two senses are: Sense 1: binary compound that occurs at room temperature as a colorless odorless liquid; Sense 2: body of water. Sense 1 contains the words {H2O, water}, while Sense 2 contains {water, body of water}. Sense 1 has a hypernym (Sense 3: binary compound), and one hyponym of Sense 2 is (Sense 4: tap water). For each word in a question, all of its sense IDs and direct hypernym and hyponym IDs are extracted as features.
This approach possibly introduces significant noise to classification since only a small proportion of senses are really related.
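The semantic augmentation described in this section can be sketched as follows. A real implementation would query WordNet (or an NE recognizer) for each word's sense and hypernym IDs; here a tiny hand-built lookup stands in for those resources, and its entries are illustrative.

```python
# Sketch of augmenting a question with WordNet-style semantic-class
# features. A real system would query WordNet for sense/hypernym IDs;
# the small hand-built lookup below is an illustrative stand-in.

HYPERNYMS = {  # word -> illustrative semantic classes
    "water": ["liquid", "binary_compound"],
    "dictator": ["person", "ruler"],
    "flower": ["plant"],
}

def semantic_features(question):
    feats = []
    for w in question.lower().replace("?", "").split():
        feats.extend(HYPERNYMS.get(w, []))  # add each word's classes
    return feats

print(semantic_features(
    "What Cuban dictator did Fidel Castro force out of power in 1958?"))
# → ['person', 'ruler']
```

Seeing the abstract label person among the features is exactly the clue that lets the classifier map this question to a HUM category.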
5. SUPPORT VECTOR MACHINE
Machine learning tasks can take several forms. In supervised learning, the computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. Spam filtering is an example of supervised learning, in particular classification: the learning algorithm is presented with email (or other) messages labeled beforehand as "spam" or "not spam", and produces a program that labels unseen messages as either spam or not. In unsupervised learning, no labels are given to the learning algorithm, which is left on its own to find groups of similar inputs (clustering), density estimates, or projections of high-dimensional data that can be visualised effectively. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end. Topic modeling is an example of unsupervised learning, where a program is given a list of human language documents and is tasked with finding out which documents cover similar topics. Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples, each a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In an optimal scenario the algorithm correctly determines the class labels for unseen instances; this requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias). In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution.
This distinguishes unsupervised learning from supervised learning and reinforcement learning. Unsupervised learning is closely related to the problem of density estimation in statistics. However, unsupervised learning also encompasses many other techniques that seek to summarize and explain key features of the data; many methods employed in unsupervised learning are based on data-mining methods used to preprocess the data.
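As a concrete illustration of the supervised-learning setting above, the following sketch trains a linear classifier on toy labeled questions with bag-of-words features. A margin-perceptron update stands in for a real SVM solver (which maximizes the margin globally; in practice one would call an off-the-shelf SVM package); the toy data, labels and helper names are our own.

```python
# A margin-perceptron training loop as a stand-in for an SVM solver.
# Toy binary task: LOC questions (+1) vs HUM questions (-1), with
# bag-of-words features. Data and helper names are illustrative.

train = [
    ("what city has the largest population", +1),
    ("what country borders france",          +1),
    ("where is the eiffel tower",            +1),
    ("who wrote hamlet",                     -1),
    ("who was the first president",          -1),
    ("name the author of this novel",        -1),
]

vocab = sorted({w for q, _ in train for w in q.split()})
index = {w: i for i, w in enumerate(vocab)}

def features(q):
    x = [0.0] * len(vocab)
    for w in q.split():
        if w in index:
            x[index[w]] = 1.0
    return x

# Train until every example has functional margin >= 1
# (terminates because the toy data is linearly separable).
w = [0.0] * len(vocab)
changed = True
while changed:
    changed = False
    for q, y in train:
        x = features(q)
        if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            changed = True

def predict(q):
    score = sum(wi * xi for wi, xi in zip(w, features(q)))
    return "LOC" if score >= 0 else "HUM"

print(predict("what country borders france"))  # → LOC
print(predict("who wrote hamlet"))             # → HUM
```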
6. CONCLUSION
In this paper we presented a detailed overview of learning-based question classification approaches. Question classification is a hard problem: the machine needs to understand the question and classify it to the right category, which is done through a series of complicated steps. We reviewed different learning methods and feature extraction techniques for question classification. Deciding on the best model and the optimal set of features is not a simple problem, but enhancing the feature space with syntactic and semantic features can usually improve classification accuracy.
7. FUTURE WORK
In the question classification task, we have shown that a machine learning-based classifier can perform well using solely superficial features. Combining the three feature types with the SVM (support vector machine) method should further increase the accuracy of the question answering system.
8. RESULT
The approach increases the accuracy of answer detection, reaching 95.2% accuracy.
Acknowledgements
I thank Prof. Megha Mishra for several valuable suggestions, and the entire SSCET team for help with various components, feature suggestions and guidance.
REFERENCES
[1] Zhang, D., & Lee, W. S. (2003). Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 26–32).
[2] Voorhees, E. M. (2004). Overview of the TREC 2004 question answering track. In E. M. Voorhees & L. P. Buckland (Eds.), TREC (Vol. Special Publication 500-261). National Institute of Standards and Technology (NIST).
[3] Wang, Y.-C., Wu, J.-C., Liang, T., & Chang, J. S. (2005). Web-based unsupervised learning for query formulation in question answering. In IJCNLP (pp. 519–529).
[4] Vallin, A., Magnini, B., Giampiccolo, D., Aunimo, L., & Ayache, C. (2006). Overview of the multilingual question answering track. In C. Peters (Ed.), Accessing Multilingual Information Repositories. Berlin, Heidelberg: Springer-Verlag.
[5] Turmo, J., Ageno, A., & Català, N. (2006). Adaptive information extraction. ACM Computing Surveys, 38(2), 4.
[6] Petrov, S., & Klein, D. (2007, April). Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics.
[7] Pan, Y., Tang, Y., Lin, L., & Luo, Y. (2008). Question classification with semantic tree kernel. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 837–838). New York, NY, USA: ACM.
[8] Quarteroni, S., & Manandhar, S. (2009). Designing an interactive open-domain question answering system. Journal of Natural Language Engineering, 15(1).
[10] Chanlekha, H., & Collier, N. (2010). Journal of Biomedical Semantics, 1:3. http://www.jbiomedsem.com/content/1/1/3
[11] Mertsalov, K. (2009, January). Document classification with support vector machines. Rational Retention, LLC.
[12] Moschitti, A., & Quarteroni, S. (2011). Linguistic kernels for answer re-ranking in question answering systems. Information Processing and Management. University of Trento, Italy.