CLEAR June 2015
Editorial
News & Updates
CLEAR Dec 2014 Invitation
Last word

CLEAR Journal (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad-678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad-678633

Editors
Dr. Ajeesh Ramanujan, Raseek C, Pelja Paul N, Anagha M

Cover page and Layout
Sarath K S, Manu V Nair

CRF based Unknown Word Tag Prediction for Malayalam
Alen Jacob, Amal Babu, Naseer C

Malayalam Dialect Resolution using CRFs
Sarath K S, Manu V Nair, P. C. Reghu Raj

Character Relationship Extraction in Malayalam
Devisree V, Anjaly V

A Machine Learning Approach for Malayalam Root Word Identification
Nisha M, Reji Rahmath K, P. C. Reghu Raj

Malayalam SentiWordnet: A Lexical Resource for Cross Domain Sentiment Analysis and Opinion Mining
Anagha M, Raveena R Kumar, Sreetha K, Naseer C

A Simple Approach for Monolingual Event Tracking System in Malayalam
Rekha Raj C T, Sruthi Sankar K P, Raseek C



Dear Readers! Greetings!

This issue of CLEAR reaches you with many articles focused on Malayalam computing. This means that we are moving closer to our important objective of producing content based on research carried out on local language processing. The main issue we faced is the resource-poor nature of Malayalam, which actually strengthened our resolve to pursue the work more vigorously. Knowing fully that the poverty of resources relates only to computing, and not to content per se, we present before you the following articles: A Machine Learning Approach for Malayalam Root Word Identification, Monolingual Event Tracking System in Malayalam, Character Relationship Extraction in Malayalam, CRF based Unknown Word Tag Prediction for Malayalam, Malayalam Dialect Resolution using CRFs, and a Lexical Resource for Cross Domain Sentiment Analysis and Opinion Mining in Malayalam. Hope you will enjoy the variety of the articles. Do send your comments and suggestions!

Best Regards,

P.C. Reghu Raj (Chief Editor)



Placements

- Neethu Johnson of M. Tech Computational Linguistics, 2012-14 batch, was placed as NLP Engineer at Cognicor, Infopark, Ernakulam.
- Alen Jacob of M. Tech Computational Linguistics, 2013-15 batch, was placed as Data Scientist at Sporting Portal, Chennai.
- Lekshmi T S of M. Tech Computational Linguistics, 2012-14 batch, was placed as Associate System Engineer at IBM.
- Amal Babu of M. Tech Computational Linguistics, 2013-15 batch, was placed as Algorithm Designer at EMPOWER LABS Pvt. Ltd, Banjara Hills, Hyderabad.
- Manu V Nair of M. Tech Computational Linguistics, 2013-15 batch, was placed as Analyst (Data Science) at Algorithmic Insight, Gautam Nagar, New Delhi.
- Reshma O K of M. Tech Computational Linguistics, 2012-14 batch, was placed as Assistant Professor at Aryanet Institute of Technology, Palakkad.

Publications

- Paragraph Ranking Based on Eigen Analysis, by Reshma O K and P. C. Reghu Raj, published in Procedia Computer Science 46 (2015), pp. 532-539 (Elsevier), International Conference on Information and Communication Technologies (ICICT 2014).



CRF based Unknown Word Tag Prediction for Malayalam

Alen Jacob, Amal Babu
M. Tech Computational Linguistics, Govt. Engineering College, Sreekrishnapuram, Palakkad, Kerala
alenjacob@outlook.com, amalbabuputhanpurayil@gmail.com

Naseer C
Assistant Professor, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad, Kerala
naseercholakkalathu@gmail.com

ABSTRACT: In this paper we propose a method for unknown word tag prediction in Malayalam using Conditional Random Fields (CRFs). CRFs are among the statistical modelling methods popular in the fields of pattern recognition and machine learning. In Natural Language Processing, CRFs can be used to predict a sequence of Parts-Of-Speech (POS) tags for a given input word sequence. In a CRF, the input words in the given sequence are not tagged in isolation; rather, they are tagged by considering the context in which the words occur. In POS tagging, the performance of a tagger depends greatly on how well it tags unknown words. A word is said to be unknown if it is not present in the training data, and hence not in the lexicon. Each tagger employs a different method for predicting the POS tag of an unknown word. TnT (a second order Hidden Markov Model) makes use of the probability distribution of words with the same suffix as that of the given unknown word. In SVMs (Support Vector Machines), the feature values help in the prediction. In this paper we compare the efficiency of each of these models in predicting the tags of unknown words in Malayalam with the efficiency attained using CRFs.

Keywords: TnT Tagger, Hidden Markov Model, Viterbi Algorithm, EM Algorithm, Conditional Random Fields, Logistic Regression, Gradient ascent.

I. INTRODUCTION

Parts-of-speech (POS) tagging is the process of assigning grammatical categories to words. It is an essential component in various Natural Language Processing tasks, and the efficiency of a tagger plays an important role in the quality of the output produced by such systems. Natural Language Processing applications that require POS

taggers include Machine Translation, Question Answering (Q&A), and Named Entity Recognition (NER). The approaches proposed for POS tagging are rule based, memory-based, transformation-based, stochastic and statistical [3]. Among these, it is observed that the statistical approach is the best method for applications involving a small training corpus. Among the statistical approaches, the Maximum Entropy framework has a very strong position. Markov models combined with a good smoothing technique and with handling of unknown words provide a better tagger [10].



When the training corpus becomes small, the performance of TnT depends on the number of unknown words in the test document. A word is said to be unknown if it is not present in the training corpus, and hence not present in the lexicon. To measure the error caused by unknown words, we conducted an experiment using tagged words taken from the Brown Corpus. All the tagged words belonging to the News category were taken, constituting a total of 100,466 words (including punctuation). Five corpora of varying size (5,000, 10,000, 25,000, 50,000 and 100,466 words) were created from this data. Figure 1 shows the 5-fold cross-validation results of the error due to unknown words in the test data for each of the five corpora.
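This experiment can be reproduced in outline with standard tools; the sketch below assumes NLTK's Brown corpus and its TnT implementation, and the fold arithmetic and function names are ours:

```python
# One fold of the unknown-word error experiment, using NLTK's Brown
# corpus ("news" category) and its TnT tagger.
from nltk.corpus import brown
from nltk.tag import tnt

def unknown_error_share(train_sents, test_sents):
    tagger = tnt.TnT()
    tagger.train(train_sents)
    vocab = {w for sent in train_sents for (w, _) in sent}
    wrong = wrong_unknown = 0
    for sent in test_sents:
        words = [w for (w, _) in sent]
        for (w, predicted), (_, gold) in zip(tagger.tag(words), sent):
            if predicted != gold:
                wrong += 1
                if w not in vocab:
                    wrong_unknown += 1
    # percentage of unknown words among the wrongly tagged words
    return 100.0 * wrong_unknown / max(wrong, 1)

sents = list(brown.tagged_sents(categories='news'))
k = len(sents) // 5
print(unknown_error_share(sents[k:], sents[:k]))
```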

Fig. 1. The 5-Fold cross validation result of TnT error rates due to unknown words

In Figure 1 the y-axis shows the percentage of unknown words that created errors in the TnT output; i.e., error in TnT is calculated in terms of the percentage of words that were wrongly tagged, and the y-axis shows the share of unknown words among the wrongly tagged words. The x-axis shows the size of the training corpus in terms of

the number of words in the corpus. It is evident from the results that the error due to unknown words decreases as the size of the training corpus increases. In this paper we present a CRF based model for predicting the POS tags of unknown words. The corpus used consists of 9,010 Malayalam words taken from the tourism category. For training the model, around 7,200 words from this corpus are chosen at each phase; the remaining portion is used for testing and evaluating the model. For training the CRF, it is required that the corpus be represented as a sequence of sentences, but our model is trained over unknown words in isolation; i.e., each sentence has only one word, the unknown word. Models based on TnT and SVM also consider the unknown word in isolation. An SVM might require some additional information beyond the word alone, but compared to a CRF, an SVM does not require the complete sentence in which the word occurs. Evaluation: For evaluating the performance of the CRF against TnT, we make use of five-fold cross-validation. The experiment is carried out over training corpora of different sizes, and the results are compared and tabulated in this paper. Related works in this field are discussed in the following subsection.

1) Related works: A fuzzy network model is used by Jae-Hoon Kim and Gil Chang Kim for POS tagging under small training data [4]. In their work an Artificial Neural Network is used for learning the fuzzy contextual membership function. This model performs well on small training data; the unknown word problem is not pursued separately in this method. In Torsten Brants' TnT tagger, the unknown words are tagged



relying on the probability distribution of words having the same suffix [10]. Cucerzan and D. Yarowsky (2000) present a minimally supervised learning approach for unknown word tag prediction that uses a paradigmatic similarity measure learned from large training data [2]. Andrei Mikheev (1997) makes use of a rule based approach for predicting the tags of unknown words [1]. In his work, the tag prediction makes use of the starting and ending segments of the unknown word. He used a general purpose lexicon and word frequencies derived from raw data to train the model. Orphanos and D. Christodoulakis (1999) make use of a decision tree based approach for solving the unknown word problem for a highly inflectional language [8]. The following section presents the architecture of TnT and discusses how it handles the unknown word problem. This section is relevant as most of the comparison of the CRF model is made with TnT, which is based on an HMM, a model that closely resembles the CRF based model.

II. TnT ARCHITECTURE

A. The Underlying Model [10]

TnT uses second order Markov models for part-of-speech tagging. The states of the model represent tags, and the outputs represent the words. Transition probabilities depend on the states, thus on pairs of tags. Output probabilities depend only on the most recent category. To be explicit, we calculate

$$\underset{t_1 \ldots t_T}{\arg\max} \left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T) \quad (1)$$

for a given sequence of words $w_1 \ldots w_T$ of length T. Here $t_1 \ldots t_T$ are elements of the tagset; the additional tags $t_{-1}$, $t_0$ and $t_{T+1}$ are beginning-of-sequence and end-of-sequence markers. Using these additional tags, even if they stem from rudimentary processing of punctuation marks, slightly improves tagging results.

B. Handling of Unknown Words [10]

Currently, the method of handling unknown words that seems to work best for inflected languages is a suffix analysis. Tag probabilities are set according to the word's ending. The suffix is a strong predictor for word classes. The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix of some predefined maximum length. The term suffix as used here means "final sequence of characters of a word", which is not necessarily a linguistically meaningful suffix. Probabilities are smoothed by successive abstraction. This calculates the probability of a tag t given the last m letters $l_i$ of an n-letter word: $P(t \mid l_{n-m+1}, \ldots, l_n)$. The sequence of increasingly more general contexts omits more and more characters of the suffix, such that $P(t \mid l_{n-m+2}, \ldots, l_n)$, $P(t \mid l_{n-m+3}, \ldots, l_n)$, ..., $P(t)$ are used for smoothing. The recursion formula is

$$P(t \mid l_{n-i+1}, \ldots, l_n) = \frac{\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) + \theta_i\, P(t \mid l_{n-i+2}, \ldots, l_n)}{1 + \theta_i} \quad (2)$$

for i = m, m-1, ..., 0, using the maximum likelihood estimates $\hat{P}$ from frequencies in the lexicon, the weights $\theta_i$, and the initialization $P(t) = \hat{P}(t)$. The maximum likelihood estimate for a suffix of length i is derived from corpus frequencies by

$$\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) = \frac{f(t, l_{n-i+1}, \ldots, l_n)}{f(l_{n-i+1}, \ldots, l_n)} \quad (3)$$
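A minimal sketch of this suffix smoothing, with our own names for the frequency tables and, for simplicity, a single smoothing weight theta in place of the per-level weights:

```python
def smoothed_tag_prob(word, tag, suffix_tag_freq, suffix_freq,
                      tag_prob, theta, max_suffix_len=5):
    # Successive abstraction, eq. (2): start from the most general
    # estimate P(t) and mix in progressively longer suffixes.
    p = tag_prob[tag]  # initialization P(t) = P-hat(t)
    for i in range(1, min(max_suffix_len, len(word)) + 1):
        suffix = word[-i:]  # last i characters of the word
        if suffix_freq.get(suffix, 0) == 0:
            break
        mle = suffix_tag_freq.get((suffix, tag), 0) / suffix_freq[suffix]  # eq. (3)
        p = (mle + theta * p) / (1 + theta)
    return p
```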

For the Markov model, the inverse conditional probabilities $P(l_{n-i+1}, \ldots, l_n \mid t)$ are required; these are obtained by Bayesian inversion. In the following section we discuss the CRF based model in detail.

III. CONDITIONAL RANDOM FIELDS

A. Introduction

Conditional random fields form a framework for building probabilistic models to segment and label sequence data [6]. CRFs are nowadays used for a wide range of applications including computer vision, shallow parsing, named entity recognition and gene finding.

Definition [6]: Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v | X, Y_w, w ≠ v) = p(Y_v | X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G.

Here X is a random variable over data sequences to be labeled, and Y is a random variable over corresponding label sequences. All components Y_i of Y are assumed to range over a finite label alphabet T. For example, X might range over natural language sentences and Y over part-of-speech taggings of those sentences, with T the set of possible part-of-speech tags. The following section discusses the use of CRFs in POS tagging in detail.

B. CRF and POS tagging

In POS tagging based on CRFs, we calculate the probability p(Y|X) for a given X. Here, as discussed earlier, X represents the sentence to be tagged and Y represents one among the possible tag assignments for this sentence; i.e., X = (X_1, X_2, X_3, ...), where the X_i are the words of the sentence, and Y = (Y_1, Y_2, Y_3, ...), where the Y_i are the predicted tags for the corresponding words in X. The probability p(Y|X) is not calculated using relative frequency. In a CRF, a score score(Y|X) is associated with each assignment Y for X. This score is calculated in terms of feature functions and their weights. Equation (4) shows how to calculate this score:

$$score(Y \mid X) = \sum_{j=1}^{m} \sum_{i=1}^{n} w_j f_j(X, i, Y_i, Y_{i-1}) \quad (4)$$

Here m is the total number of feature functions learned by the model and n is the length of the sentence X. The variable i denotes the index of the word in the sentence currently being scanned. Associated with each feature function there exists a weight, denoted by w_j in equation (4); these weight



values are learned from the training data. In the following subsection we further describe the concept of feature functions.

1) Feature Function: Feature functions are used by a CRF to formally define the dependency of a particular class on the features selected for the model. A feature function has the format shown in equation (5):

$$f_j(X, i, Y_i, Y_{i-1}) = \begin{cases} 1 & \text{if } Y_i = \text{Tag1 and } Y_{i-1} = \text{Tag3} \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

Here $Y_i$ represents the tag of the current word being tagged and $Y_{i-1}$ the tag of the previous word. The feature function can be read as: if the predicted tag of the word is Tag1 and the tag of the previous word was Tag3, then return 1, else return 0. The length of the history considered can be extended. From equations (4) and (5) it is clear that, at a given time, only a handful of feature functions return 1, and all others become irrelevant for the calculation of the score value associated with a given tag sequence. The CRF calculates the probability p(Y|X) as shown in equation (6); this is nothing but the exponentiation and normalization of the score value, to represent it in the range [0, 1]. Here Y' ranges over the set of all possible tag sequences for the given sentence X.

$$p(Y \mid X) = \frac{\exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} w_j f_j(X, i, Y_i, Y_{i-1})\right]}{\sum_{Y'} \exp\left[\sum_{j=1}^{m} \sum_{i=1}^{n} w_j f_j(X, i, Y'_i, Y'_{i-1})\right]} \quad (6)$$
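A feature function of this form translates directly into code. The sketch below, with placeholder tag names and our own function names, also shows how the score of equation (4) is accumulated:

```python
# A binary feature function in the style of equation (5); the tag names
# are placeholders, not the paper's actual tagset.
def f_transition(X, i, y_i, y_prev):
    return 1 if y_i == 'NN' and y_prev == 'VB' else 0

def score(X, Y, feature_functions, weights):
    """Unnormalized score of tag sequence Y for sentence X, eq. (4)."""
    return sum(w * f(X, i, Y[i], Y[i - 1] if i > 0 else '<S>')
               for f, w in zip(feature_functions, weights)
               for i in range(len(X)))
```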

C. Learning the Weights

The CRF starts with random weight values for each feature function. To obtain the actual values, the CRF makes use of a gradient ascent approach. For this, the CRF first calculates the gradient of the log probability with respect to $w_i$, as shown in equation (7):

$$\frac{\partial}{\partial w_i} \log p(Y \mid X) = \sum_{j=1}^{n} f_i(X, j, Y_j, Y_{j-1}) - \sum_{Y'} p(Y' \mid X) \sum_{j=1}^{n} f_i(X, j, Y'_j, Y'_{j-1}) \quad (7)$$

The weights are updated using equation (8), where $\alpha$ is the learning rate:

$$w_i = w_i + \alpha \frac{\partial}{\partial w_i} \log p(Y \mid X) \quad (8)$$

The operation shown in equation (8) is performed until some stopping criterion is reached; an update falling below a predefined threshold may be chosen here.

D. Finding the Optimal Tag Sequence

As the probability suggests, we would select the tag sequence that maximizes the probability p(Y|X). For this, one would have to calculate this probability for all possible tag sequences and find the one with the maximum value. This is a tedious task, since there are $m^n$ possibilities, where n is the length of the sentence (i.e., the number of words) and m is the size of the tag set. The CRF instead makes use of the optimal substructure property



satisfied by linear-chain CRFs. This helps us to devise an algorithm, similar to the Viterbi algorithm for HMMs, to find the optimal tag sequence.
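A dynamic-programming decoder of this kind can be sketched as follows; the function names are ours, and local_score stands for the per-position sum of weighted feature functions from equation (4):

```python
def viterbi_decode(X, tags, local_score):
    """Best tag sequence under a linear-chain CRF score.

    local_score(X, i, y, y_prev) = sum_j w_j * f_j(X, i, y, y_prev).
    """
    best = [{y: (local_score(X, 0, y, '<S>'), ['<S>', y]) for y in tags}]
    for i in range(1, len(X)):
        col = {}
        for y in tags:
            # optimal substructure: extend the best path ending in y_prev
            s, path = max((best[i - 1][yp][0] + local_score(X, i, y, yp),
                           best[i - 1][yp][1]) for yp in tags)
            col[y] = (s, path + [y])
        best.append(col)
    score, path = max(best[-1].values())
    return path[1:]  # drop the start marker
```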

IV. CRF & MALAYALAM UNKNOWN WORD TAG PREDICTION

In Malayalam, due to its highly inflectional nature, unknown words are encountered more often. For TnT, every word is a sequence of characters; for two words to be similar, their character sequences should match exactly. Hence two different inflections of a given root are considered two entirely different words. In a given Malayalam document, it is common to find more than one inflection of a given word in its discourse, and a training corpus cannot be expected to contain all the inflections of a given word. Hence a known word with a previously unencountered inflection is considered unknown, and TnT comes across unknown words more often in Malayalam than in languages with comparatively fewer inflections. Another crucial problem associated with TnT is that, even if we have resources, like a lexicon, that could provide some information regarding a Malayalam word and its inflections, there is no provision to include this information in TnT's architecture. The only sources of information TnT is interested in are the ones provided by the relative frequency counts, i.e., the observation probabilities, transition probabilities and the initial probabilities.

On the other hand, a CRF could be trained to incorporate the same information a TnT has regarding observation, transition and initial probabilities. Further, we could specify a set of other features that could help in modeling the POS tag prediction problem; these may include the information made available by resources such as a lexicon. In this paper we present a CRF based unknown word tag predictor model. This model works in two phases: in the first phase we tag the test data using a primary tagger; in the second phase we extract the unknown words from the result of the primary tagger, along with the tags predicted for them, and supply this information to the CRF based model. The CRF based model then predicts the tags for these unknown words, considering as features the word, the suffix of the word and the tag predicted by the primary tagger. For the CRF to efficiently predict the correct tag, some training has to be done. The next section describes the training of the proposed CRF model.

1) Training the CRF: The CRF model, as described, corrects the mistakes made by the primary tagger. To equip the CRF for this purpose, it has to be trained with instances that describe the unknown word prediction pattern of the primary tagger. For this purpose, we train the CRF on a training data set that contains the unknown words tagged by the primary tagger, along with their predicted tags and actual tags. Training the CRF on such data results in a model that can identify and correct a mistake made by the primary tagger in tagging the unknown words.
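The two-phase flow can be summarized in code; this is only a structural sketch, with stand-ins for the real primary tagger, CRF corrector and lexicon:

```python
def tag_with_correction(sentence, primary_tagger, crf_corrector, lexicon):
    tagged = primary_tagger.tag(sentence)  # phase 1: tag the whole sentence
    corrected = []
    for word, tag in tagged:
        if word not in lexicon:  # unknown word: not seen in training
            # phase 2: re-predict from the word, its suffix and the
            # primary tagger's guess (suffix length 3 is illustrative)
            tag = crf_corrector.predict(word, word[-3:], tag)
        corrected.append((word, tag))
    return corrected
```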



Fig. 2. A sample of the training data used for training the CRF.

Figure 2 shows a sample of the training data. The first column represents the unknown word, the second its suffix, the third the tag predicted by the primary tagger, and the final column the correct tag for the given word. The following section deals with the experimental results and their analysis.
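Instances of this four-column form can be fed to an off-the-shelf CRF toolkit; the sketch below uses python-crfsuite as an illustrative choice (the paper does not name its implementation), with placeholder strings in place of the Malayalam data:

```python
import pycrfsuite  # illustrative toolkit choice

# One instance per unknown word: the word, its suffix and the primary
# tagger's prediction are features; the correct tag is the label.
xseq = [['word=w1', 'suffix=fx1', 'pred=NN']]  # placeholder values
yseq = ['VB']

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append(xseq, yseq)
trainer.train('unknown_word.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('unknown_word.crfsuite')
print(tagger.tag([['word=w1', 'suffix=fx1', 'pred=NN']]))
```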

V. EXPERIMENTAL ANALYSIS AND RESULTS

TnT is used as the primary tagger. From the training corpus, 5,002 words are taken. This data is then divided into five folds. For creating the training corpus for the CRF, the TnT is trained in five phases. At each phase, one of the five folds is selected as the test data and the TnT is trained over the other four folds. The TnT output after tagging the fold selected as test data is used for creating the entries in the corpus used to train the CRF; i.e., from the tagged output, all the unknown words along with their predicted tags are copied to the training corpus. We also add the actual tag for each unknown word into the newly created corpus, which enables supervised learning. This experiment is carried out five times with a different fold as testing data each time. The CRF is trained using the above training data. For testing the accuracy of such a model, we made use of a corpus consisting of 9,010 words. This corpus is again divided into five almost equal parts. The primary tagger, TnT, is trained in five phases; at each phase, one of the five folds is used to test the TnT and the rest are used for training. Figure 3 shows the performance of TnT over the unknown words in the test data at each phase. Fold 1 is used for testing in the first phase, fold 2 in the second, and so on. The unknown words in the test data at each phase are given as input to the CRF model. Figure 4 shows the results obtained at each phase. It is observed that the CRF shows an overall accuracy of 79.7% whereas TnT shows an overall accuracy of 75.4%. Figure 5 depicts these results as a chart.

Fig. 3. 5-fold cross-validation result of TnT over a corpus of size 9010.

Fig. 4. 5-fold cross-validation result of CRF over a corpus of size 9010.



Fig. 5. Comparison of the 5-fold cross-validation results of TnT and CRF at each fold.

VI. COMPARISON

Orphanos and D. Christodoulakis (1999), using a decision tree based approach for predicting unknown word tags, obtained an error rate of 16% for the Greek language [8]. They used a corpus of 7,624 sentences created from students' writing, literature, newspapers, technical papers, magazines, etc. Andrei Mikheev (1997) used rule based guessing for predicting the tags of unknown words [1]. He made use of three types of rules: prefix morphological rules, suffix morphological rules and ending-guessing rules. The tagging accuracy obtained by cascading these guessers was 87.75-88.7%. This provided an overall accuracy of 93.6-94.3% in tagging, which is a 6% improvement over the model without unknown word guessing. Cucerzan and D. Yarowsky (2000), using their method, achieved a 27% reduction in error rate for French and 12% for English [2]. This resulted in a 7.8% increase in tagger accuracy for French and a 7.6% improvement for English.

VII. CONCLUSION AND FUTURE WORK

We proposed a CRF based model to predict the tags of unknown words in Malayalam. The model performs well with small training data: the CRF model trained over a corpus of around 5,000 words was able to show a 4.3% improvement over the predictions made by a TnT trained over a corpus of around 9,000 words. Pursuing new features, such as the tag of the following word, might improve the model. Our model currently considers each training instance in isolation, without any contextual information. A model that incorporates context as well is suggested as a possible extension.

REFERENCES

[1] Andrei Mikheev, Automatic Rule Induction for Unknown-Word Guessing, Computational Linguistics, 23(3), pages 405-423, 1997.
[2] Cucerzan and D. Yarowsky, Language Independent, Minimally Supervised Induction of Lexical Probabilities, Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL-2000), pages 270-277, 2000.
[3] Daniel Jurafsky and James H. Martin, Speech and Language Processing: an introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall, 1999.
[4] Jae-Hoon Kim and Gil Chang Kim, Fuzzy Network Model for Part-of-Speech Tagging under Small Training Data, Natural Language Engineering, Cambridge University Press, 1995.
[5] Joao P. Carvalho, Fernando Batista and Luisa Coheur, A Critical Survey on the use of Fuzzy Sets in Speech and Natural Language Processing, WCCI 2012 IEEE World Congress on Computational Intelligence, Brisbane, Australia, June 2012.
[6] Lafferty J., McCallum A. and Pereira F., Conditional random fields: Probabilistic models for segmenting and labeling sequence data, Proc. 18th International Conf. on Machine Learning, pp. 282-289, 2001.
[7] Li-Xin Wang and Jerry M. Mendel, Generating fuzzy rules by learning from examples, IEEE Transl. J. Magn. Japan, vol. 2, pp. 740-741, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].
[8] Orphanos and D. Christodoulakis, POS Disambiguation and Unknown Word Guessing with Decision Trees, In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL-99), pages 134-141, 1999.
[9] Tatiane M. Nogueira, Heloisa A. Camargo and Solange O. Rezende, Fuzzy rules for document classification to improve information retrieval, International Journal for Computer Information Systems and Industrial Management Applications, Volume 3, pp. 210-217, 2011.
[10] Torsten Brants, TnT - A Statistical Part-of-Speech Tagger, Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), 2000.
'Thought vectors' could revolutionize artificial intelligence

The British high priest of artificial intelligence, Professor Geoffrey Hinton, who was snapped up by Google two years back during its massive acquisition of AI experts, revealed that his employer may have found a means of breaking the AI deadlock that has persisted in areas like natural language processing. The hope comes in the form of a concept called "thought vectors." The concept is both new and controversial. The underlying idea is that by ascribing every word a set of numbers (or vector), a computer can be trained to understand the actual meaning of these words. The current state of the art has taught computers to understand human language much the way a trained dog understands it when squatting down in response to the command "sit." The dog doesn't understand the actual meaning of the words, and has only been conditioned to give a response to a certain stimulus. If you were to ask the dog, "sit is to chair as blank is to bed," it would have no idea what you're getting at. Thought vectors provide a means to change that: actually teaching the computer to understand language much the way we do. The difference between thought vectors and the previous methods used in AI is in some ways merely one of degree. While a dog maps the word sit to a single behavior, using thought vectors, that word could be mapped to thousands of sentences containing "sit" in them. The result would be the computer arriving at a meaning for the word more closely resembling our own.

Read more: http://www.extremetech.com/extreme/206521-thought-vectors-couldrevolutionize-artificial-intelligence



Malayalam Dialect Resolution using CRFs

Sarath K S, Manu V Nair
Department of CSE, Govt. Engineering College, Sreekrishnapuram, Palakkad, Kerala, India 678633
sarathks333@gmail.com, manunair1990@gmail.com

P. C. Reghu Raj
Department of CSE, Govt. Engineering College, Sreekrishnapuram, Palakkad, Kerala, India 678633
pcreghu@gmail.com

ABSTRACT: A dialect is a regional variety of language, with differences in vocabulary, grammar and pronunciation. It is a recognized formal variant of the language spoken by a large group belonging to one region, class or profession. Dialect resolution is a localized approach through which a person can express ideas in his own style. The system converts a dialect from an informal text format to a formal format without losing contextual meaning. We present a Conditional Random Fields (CRF) approach for building a probabilistic model to label sequence data. A rule based method is also incorporated with the CRF. This work concentrates on the 'Thrissur dialect', which is a recognized regional dialect of the Malayalam language. This work is also an extension of our previous work on dialect resolution using TnT.

Keywords: Dialect, Slang, Rule based approach, Conditional Random Fields (CRFs).

I. INTRODUCTION

Language is a social art. It is an exact reflection of the character and growth of its speakers. Variations in language occur more often in speech than in text. These variations are distributed among groups of people in geographically separated areas. A variation is exchanged between the people of the same group over a long period of time, making them adapt to that style. Cultural, geographical and physical factors also play a role in the art of language. Such a form of a language which is particular to a specific region or social group can be termed a

'dialect', distinguished by its vocabulary, grammar, and pronunciation. 'Slang' consists of words, expressions and meanings that are informal. They are used either by people who know each other very well or who have the same interests. Slang includes mostly expressions that are not considered appropriate for formal occasions, often vituperative or vulgar. Slang words and phrases are highly colloquial and informal in type, consisting either of newly crafted words or of existing words employed in a special sense. Resolving these dialect and slang words has many applications in day-to-day life. Localization is the main application of the proposed system, in which local people can engage with the outer world, especially for government procedures, without language barriers. This system can be



embedded in speech-to-text application systems, to easily convert dialect words to formal words. It can also be used in the preprocessing stage of Malayalam-to-other-language machine translation systems. Section 2 discusses related works. Section 3 describes the common features and characteristics of the Thrissur dialect. Section 4 introduces the methods used in this dialect resolution system. Section 5 describes the system design and implementation of this model. Section 6 shows observations and experimental results of the proposed system. Section 7 discusses the issues related to this system.

II. RELATED WORKS

There are no previous works related to dialect resolution in Malayalam, other than our own earlier work on the same topic using TnT [4]. Some works on dialectal language exist for Arabic. Hassan Sawaf (2010) describes an extension to a hybrid machine translation system for handling dialectal Arabic, using a decoding algorithm to normalize non-standard, spontaneous and dialectal Arabic into Modern Standard Arabic [6]. Wael Salloum and Nizar Habash (2011) proposed a system for improving the quality of Arabic-English statistical machine translation (SMT) on dialectal Arabic text using morphological knowledge [5]. They use rule-based approaches in this system.

III. THRISSUR DIALECT

Thrissur is the cultural capital of Kerala. Its people keep a unique identity in their dialect and an abundant collection of slang words, which makes them immediately distinguishable in a group. Some slang words have endured and entered the general lexicon, and some are used across dialects, like the word 'porichu' (meaning 'fried') used in the sense of 'super' or 'good'. Many metaphorical expressions, like 'gadi padayi' in the sense of 'he died', are used frequently. There are many different inflections of dialect and slang words which are easily understood by people. Through generations, many words have changed, resulting in a large dictionary. For example, 'nthu', 'enthutta', 'enthootu', 'nthutta', 'nthootu', 'enthonnu' and 'nthonnu' are different Thrissur dialect forms of the single Malayalam word 'enthu', meaning 'what'. From a computational point of view, dealing with such properties is computationally intensive.

IV. DIALECT RESOLUTION

The dialect resolution system transforms informal dialect sentences into formal, readable, meaning-bearing sentences in the same language. In this system, slang words get replaced by their formal equivalents and dialect words are transformed into meaningful words. The system uses a rule based method, a machine learning method, and ideas from word sense disambiguation for resolving the Thrissur dialect. The rule based method is essentially a mapping: one word is mapped to another word, which is a memorization task. While dealing with ambiguous words, we need a better disambiguation method. The Lesk algorithm is a simple, dictionary based approach for word sense disambiguation (WSD) [2]. In this



approach, all the sense definitions of the word to be disambiguated are retrieved from the dictionary. Each of these senses is then compared to the dictionary definitions of all the remaining words in the context. The sense with the highest overlap with these context words is selected as the correct sense. The working principle of the Lesk algorithm is used for disambiguation in this system.

A. Machine Learning

Machine learning is the construction and study of systems that can learn from data, rather than follow only explicitly programmed instructions. It tries to find hidden patterns in the given data and predict future data. Machine learning is a branch of artificial intelligence (AI) that provides computers the ability to learn [1]. Machine learning is usually divided into two main types, supervised learning and unsupervised learning. In the predictive or supervised learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labelled set of input-output pairs. In unsupervised learning, a set of inputs is modelled without the help of labelled examples.

1) Conditional Random Fields (CRFs): A CRF is a probabilistic machine learning model for predicting the label sequence of sequence data. In our previous work, we used TnT (Trigrams 'n' Tags) as the machine learning method to resolve dialect words. TnT is an efficient statistical part-of-speech tagger, internally based on an HMM (Hidden Markov Model). A CRF [3] is a finite state model with unnormalized transition probabilities. It

assigns a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. CRFs offer several advantages over HMMs, including the ability to relax the strong independence assumptions made in that model. CRFs also avoid a fundamental limitation of Maximum Entropy Markov Models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. Let X be a random variable over data sequences to be labeled, and Y a random variable over corresponding label sequences. The random variables X and Y are jointly distributed, but in a discriminative framework we construct a conditional model p(Y|X) from paired observation and label sequences, and do not explicitly model the marginal p(X). Definition: Let G = (V, E) be a graph such that Y = (Y_v), v ∈ V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph: p(Y_v|X, Y_w, w ≠ v) = p(Y_v|X, Y_w, w ~ v), where w ~ v means that w and v are neighbors in G. Hence, a CRF is a random field conditioned on the observation X. In the simplest case, G is a chain or line: G = (V = {1, 2, ..., m}, E = {(i, i+1)}). X may have any graphical structure at all; it is not necessary to assume that X and Y have the same graphical structure. In this paper we are concerned with sequences X = (X1, X2, ..., Xn) and Y = (Y1, Y2, ..., Yn). If the graph G = (V, E) of Y is a tree, its cliques are the edges and



vertices. Then the joint distribution over the label sequence Y given X has the form

$$p_\theta(y \mid x) \propto \exp\left( \sum_{e \in E,\,k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y|_v, x) \right) \quad (1)$$

where x is a data sequence, y a label sequence, and $y|_S$ is the set of components of y associated with the vertices in subgraph S. Assume that the features $f_k$ and $g_k$ are given and fixed. CRFs use the observation-dependent normalization Z(x) for conditional distributions. They allow arbitrary dependencies on the observation sequence. In addition, the features do not need to completely specify a state or observation, so the model can be estimated from less training data. CRFs share all the convexity properties of general maximum entropy models. For each position i in the observation sequence x, we define a $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(x) = [M_i(y', y \mid x)]$, where $\mathcal{Y}$ is the set of possible tags or labels. Then

$$M_i(y', y \mid x) = \exp(\Lambda_i(y', y \mid x)) \quad (2)$$

$$\Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, y|_{v_i} = y, x) \quad (3)$$

where $e_i$ is the edge with labels $(y_{i-1}, y_i)$ and $v_i$ is the vertex with label $y_i$. Then the normalization is

$$Z_\theta(x) = \left( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \right)_{\text{start},\,\text{stop}} \quad (4)$$

Finally, the conditional probability of a label sequence y is written as

$$p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z_\theta(x)} \quad (5)$$

where $y_0$ = start and $y_{n+1}$ = stop.

V. SYSTEM DESIGN AND IMPLEMENTATION

The task is a learning process where each change to a formal word is learned from a large sample corpus. All the possible slang words are substituted first, because they give no information about the target word. More context words are required to replace a slang word, and they may not always be available. Handling more than one dialect word in a sentence is a crucial task: the accuracy of succeeding target words depends on the previously resolved words. In the same way, handling an unknown word is highly crucial. Words in a sentence are interdependent in the way they convey meaning; therefore, even a valid output word may deviate from the meaning of the sentence. Contextual validity should be ensured, and that is a complex task.

A. Design

The design phase is composed of three major levels. A source sentence is passed through the first level (rule-based level) to resolve all the slang words belonging to it. After that, all the untouched words are entered into the CRF, where they get modified into corresponding formal words as per what the system has learned. All the resolved known words are together used as context words to resolve unknown words in the third level (word disambiguation level).



B. Implementation

In this system, input sentences are treated separately. Sentences with any number of slang words are allowed. At most two dialect words per sentence are used in the testing phase.

Rule-based level: Slang words are almost fixed; they vary only across generations. Almost all current slang words and their inflections are used for this phase. All slang words in the source sentence are replaced by predefined formal words, one after another. This phase is almost error free. Only predefined words are substituted; their possible synonyms are not used.
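The rule-based level amounts to a dictionary lookup; a minimal sketch, with romanized placeholder entries standing in for the Malayalam slang dictionary:

```python
# Direct slang-to-formal substitution over a tokenized sentence.
slang_dict = {'porichu': 'nallathu'}  # hypothetical entry

def resolve_slang(tokens):
    return [slang_dict.get(word, word) for word in tokens]

print(resolve_slang(['ithu', 'porichu']))
```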

Fig. 1. Flow chart of the three phases

Machine learning level: A CRF, a probabilistic machine learning model, is used in this level for learning the morphological variations on words in the transformation of dialect words to formal words. The conditional model specifies the probabilities of possible label sequences given an observation sequence. Furthermore, the conditional model of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations, or aggregate properties of the observation sequence. The validity of a resolved word is checked using a Malayalam formal word dictionary. A sample of the training corpus is given in figure 2. All the valid words are taken as the list of context words for the next level.

Fig. 2. CRF Training corpus

Word disambiguation level: The concept of word sense disambiguation (the Lesk algorithm) is



used to handle unknown words. Each unknown word is treated one-by-one from left to right in the source sentence. For each word, a list of possible words is created from the formal word dictionary, under the assumption that the first characters of the unknown and target words are the same. Each word in the list is ranked by counting the number of context words that co-occur with it in sentences of the formal sentence corpus. The highest ranked word is used as the target word and appended to the context word list, so that subsequent unknown words get more context words. The total accuracy is therefore interdependent: the accuracy for succeeding words is proportional to that of the preceding resolved words. The order of context words is not considered, because Malayalam is a highly free word order language.
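A sketch of this ranking step, with our own names; the formal word dictionary and the formal sentence corpus are assumed data structures:

```python
# Rank candidates (sharing the unknown word's first character) by how
# often they co-occur with the current context words in the corpus.
def resolve_unknown(unknown, context_words, formal_dict, corpus_sents):
    candidates = [w for w in formal_dict if w[0] == unknown[0]]
    def rank(candidate):
        return sum(1 for sent in corpus_sents if candidate in sent
                   for ctx in context_words if ctx in sent)
    return max(candidates, key=rank) if candidates else unknown
```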

VI. TEST AND OBSERVATIONS

A Malayalam sentence with six words, given in figure 3, is used as the input sentence to resolve.

Fig. 3. Input sentence

Fig. 4. Slang word correction

Let w1, w2, w3, w4, w5 and w6 be the corresponding words in the input sentence. The words w3, w4, w5 and w6 are unknown. The word w6 is a slang word and is resolved in the rule based step using the slang dictionary (figure 4). The words w4 and w5 are resolved using the CRF. The tagged output of the CRF is given in figure 5.

Fig. 5. CRF output

After passing through the CRF, the words w4 and w5 are resolved, as given in figure 6.

Fig. 6. After output

Finally, only w3 is unknown; it is passed to the word disambiguation stage, which substitutes the word with the highest rank. A rank is assigned to each word in the possible word list according to the number of context words that co-occur with it, and the maximum ranked word is chosen as the target word. The output sentence is given in figure 8. The sentence corpus contains 400 sentences. The word dictionary contains 700 formal words. The rule based dictionary contains 160 slang words and their corresponding formal words. Experimental results are shown in Table 1, which also includes a comparison between our previous work and the current work.



Fig. 7. Word Disambiguation output

Fig. 8. Resolved sentence

Table 1. Results

According to the results, the proposed system resolved all formal and slang words in the input sentences correctly. The testing is carried out with the same training and testing corpus on which TnT was evaluated earlier. The overall accuracy of the system depends on its performance in resolving the dialect words. The system has shown a performance of 76.85% while keeping the semantic validity of words in the sentence within the context. Our previous work with TnT had a performance of 61.11%. From this observation, we can infer that the CRF has a performance improvement of 15.74% over TnT for the same testing samples.

VII. ISSUES IN DIALECT RESOLUTION

Dialect resolution has many issues and complexities at each stage. The Thrissur dialect is rich in compound and slang words. A slang word gives no clue about the corresponding formal word through its morphological information, so handling slang words is very difficult. A slang word with more than one sense in different contexts is again too complex to handle. At the machine learning level, named entities may be transformed wrongly by the CRF, generating morphologically or semantically different words. In this system, at least two strong context words are necessary for one dialect word to be resolved; otherwise the prediction will be wrong. Adding a wrong prediction to the context word list also reduces the contextual information for the following cases in the sentence. Resolving a dialect word 'B' after a dialect word 'A' is different from resolving 'A' after 'B'. Creating all the possible inflectional forms of Malayalam sentences used at the word disambiguation level is a tedious task. Sentences containing metaphors cannot be included, because their hidden meaning cannot be correctly mapped.

VIII. CONCLUSION AND FUTURE SCOPE

The proposed system showed a performance of 76.85%, which is about 15.74% better than our previous system using TnT. The system can also be used for other dialectal languages,



with sufficient corpus support. As an extension of the proposed system, we are trying to replace the CRF method with MBLP (Memory Based Language Processing). A large corpus of formal inflectional sentences, a dictionary with all slang words, and a large formal word corpus will be added to support and improve the system.

REFERENCES

[1] T.O. Ayodele, Types of Machine Learning Algorithms, INTECH Open Access Publisher, 2010.
[2] D. Jurafsky and J.H. Martin, Speech and Language Processing: an introduction to natural language processing, computational linguistics, and speech recognition, Prentice Hall series in artificial intelligence, Pearson Prentice Hall, 2009.
[3] John D. Lafferty, Andrew McCallum and Fernando C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, In Proceedings of the Eighteenth International Conference on Machine Learning (ICML '01), pages 282-289, San Francisco, CA, USA, 2001, Morgan Kaufmann Publishers Inc.
[4] Sarath K S and Manu V Nair, Dialect resolution: A hybrid approach, unpublished.
[5] Wael Salloum and Nizar Habash, Dialectal to standard Arabic paraphrasing to improve Arabic-English statistical machine translation, In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, pages 10-21, Edinburgh, Scotland, July 2011, Association for Computational Linguistics.
[6] Hassan Sawaf, Arabic dialect handling in hybrid machine translation, In Proceedings of the 9th Conference of the Association for Machine Translation (AMTA 2010), Denver, Colorado, 2010.

GPUs Speak Volumes in Semantic AI Platforms

DARPA is exploring how deep learning, particularly with a semantic and natural language processing angle, can find a fit for government use, including more recently with its Deep Exploration and Filtering of Text (DEFT) effort, which analyzes massive volumes of text to build information around concepts or topics, essentially putting together the pieces of a puzzle based on language, word frequency, and, for the most part, fully unsupervised collections that yield their own learned topics of interest. The company believes that semantic deep learning networks are the next unexplored frontier for large companies who need to collate and understand big semantic data, and that by solving the hardware piece of the puzzle they have an advantage over competitors and a way to carve out a slice of that potential market by making deep learning at scale within reach.

Read more: http://www.theplatform.net/2015/06/10/gpus-speak-volumes-forsemantic-ai-platforms/



Character Relationship Extraction in Malayalam

Devisree V, Anjali V
Dept. of Computer Science & Engineering, Govt. Engineering College, Sreekrishnapuram
devisreevvtl@gmail.com, anjalyv75@gmail.com

ABSTRACT: Machine understanding and appreciation of stories has been among the most cherished goals in Natural Language Processing research. A key step towards understanding a story is to understand the relations between the characters that occur in it. Relation extraction problems are solved through either supervised or unsupervised learning algorithms. The proposed method is based on an unsupervised learning method which identifies the main characters and extracts the relationships between them. The system first identifies the characters and collects the sentences regarding them, then analyzes these sentences and their context to extract the relationships. The corpus for the proposed system is a collection of Malayalam short stories.

I. INTRODUCTION

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). The present significance of IE pertains to the growing amount of information available in unstructured form. Natural language understanding is crucial for most information extraction tasks because the desired information can only be identified by recognizing conceptual roles. The term "conceptual role" refers to semantic relationships that are defined by the role that

an item plays in context. For example, extracting noun phrases that refer to people can be done without regard to context by searching for person names, titles, and personal pronouns, such as "Mary", "John", "Smith", "Mr.", "she" and "him". Contextual information may be necessary for tasks like word sense disambiguation, but that is a separate issue. Relation extraction is promoted by the ACE program; it is the task of finding predefined semantic relations between two entities in text [3]. The paper Extracting Character Relationships from Stories by Krishna Janakiraman uses the bag of words model and the correlation coefficient to group related character pairs across a corpus of stories. The relationship for a group of character pairs is then determined by computing the semantic similarity score



between the words associated with the pairs and terms from a relationship vocabulary. It is based on an unsupervised learning model. Eugene Agichtein and Luis Gravano introduce novel strategies for generating patterns and extracting tuples from plain-text documents in their Snowball system. Text documents often contain valuable structured data hidden in regular English sentences. This data is best exploited if available as a relational table that could be useful for answering precise queries or for running data mining tasks. Snowball explores a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns that in turn result in new tuples being extracted from the document collection.

II. INFORMATION EXTRACTION

The goal of IE is to extract interesting information from documents for automatic analysis by a computer. The extraction techniques have to deal with understanding the meaning of natural language. First, it is necessary to know what kind of semantic information is interesting, and then to identify a part of the text and assign particular attributes to it. Thus this process turns the unstructured information embedded in texts into structured data. The final output of the extraction process varies; in every case, however, it can be transformed so as to populate some type of database. IE is a domain-specific task; the important types of

objects and events for one domain can be quite different from those for another domain. IE systems are difficult and knowledge-intensive to build, and are to varying degrees tied to particular domains and scenarios. IE is more computationally intensive than IR. However, in applications where there are large text volumes, IE is potentially much more efficient than IR because of the possibility of dramatically reducing the amount of time people spend reading texts. Also, where results need to be presented in several languages, the fixed-format, unambiguous nature of IE results makes this relatively straightforward, in comparison with providing the full translation facilities needed for interpreting multilingual texts found by IR.

III. INFORMATION EXTRACTION AND STORY UNDERSTANDING

Information extraction (IE) is a form of natural language processing in which certain types of information must be recognized and extracted from text. On the surface, information extraction might appear to be fundamentally different from story understanding, but the challenges and difficulties underlying them are largely the same. They are both subject to the same problems of ambiguity, idiosyncrasy, and a strong dependence on world knowledge. An information extraction task, by definition, specifies the domain of interest and the types of information which must be extracted. The restricted domain and the focused nature of the task can greatly simplify dictionary construction, ambiguity resolution, and



discourse processing. Similarly, story understanding systems usually focus on a single domain or certain aspects of the world in order to minimize the knowledge-engineering effort. A key difference between information extraction and story understanding is that the latter strives to understand the entire story. Information extraction systems limit depth by focusing on certain types of information and a specific domain; in exchange for limited depth, IE systems can effectively process a wide variety of texts within the domain. Story understanding systems, on the other hand, aim for deep understanding of whole texts, but usually cannot effectively process arbitrary texts without additional knowledge engineering. Story understanding researchers have a long history of expertise with complex knowledge structures and inference generation. Ultimately, continued progress in information extraction will depend on richer semantics, knowledge structures, and inference mechanisms. Automated knowledge acquisition techniques developed for information extraction are likely to be applicable to story understanding as well.

IV. PROPOSED APPROACH

Many approaches have been proposed in the literature on relation extraction. The proposed system is based on an unsupervised learning method. The corpus consists of a set of Malayalam short stories. A fundamental step towards identifying relationships between character pairs is to identify the characters themselves. This task is accomplished using the Named Entity Recognizer toolkit for Malayalam.

A. Relationship Vocabulary

The actual relationship term is picked from a relationship vocabulary. The vocabulary contains different words which represent certain relationships. For example, a simple vocabulary is given below:

Now the features are computed using a 'bag of words' collected from a window defined around the sentences containing the character pair. The following sequence of steps explains the feature computation process for a given character pair; the input is a short story.

Step 1: Use the Malayalam NER tool to find all characters in the story. Then take character pairs.
Step 2: Find all sentences in the story in which the pair occurs.
Step 3: For each such sentence, collect k (let k = 2) sentences to its right and left (that is, collect the context of the selected sentences).
Step 4: Tokenize each sentence into words.
Step 5: Use the Malayalam WordNet Padam to find the synonyms of the words in the sentences. Compare the synonyms with the words in the vocabulary.
Step 6: Compute a distance score ds(pair, word) for each pair-word combination. If a word occurs in the same sentence, give it 1 point; if it occurs in the k-th sentence from the occurring sentence, give it 1/2k points.

From the 'bag of words' collected for each pair, a (Character Pair, Words) matrix M is formed. The Character Pair - Words matrix is similar to the Document-Term matrix typically computed for Information Retrieval tasks, and the term frequency tf(word, pair) and inverse document frequency idf(word) scores are computed in the standard way.
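The scoring of step 6 can be sketched as follows; the sentence-index representation is ours, and we read the "1/2k" of the original as 1/(2k):

```python
# Distance score ds(pair, word): 1 point when the word shares a sentence
# with the pair, 1/(2k) points when it is k sentences away (window k=2).
def distance_score(pair_sentence_ids, word_sentence_ids, k=2):
    score = 0.0
    for s in pair_sentence_ids:
        for t in word_sentence_ids:
            d = abs(s - t)
            if d == 0:
                score += 1.0
            elif d <= k:
                score += 1.0 / (2 * d)
    return score
```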

VI.

CONCLUSION

Analysis of a story have important applications in the area of Natural Language Processing. The Summarization of stories and analysis of Major Characters in the stories are some of the examples. The proposed system can extract the relationships between characters in short stories based on the relations defined in the vocabulary. REFERENCES [1] Ellen Riloff, Information Extraction as a Stepping Stone toward Story Understanding, edited by Ashwin Ram and Kenneth Moorman, The MIT Press 2000 [2] Krishna Janakiraman, Extracting Character Relationships from Stories, Proceedings of the

Each entry in the matrix was then computed using then following formula:

43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363370 [3] Wenjie Li, Peng Zhang, Furu Wei, Yuexian Hou

Now, this formula denotes that important terms that are closer to the pair are weighted higher. Then we get the corresponding vocabulary which represents the corresponding relationship.
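The following minimal sketch (in Python) illustrates the distance scoring in steps 2-6. It is an illustration under stated assumptions, not the authors' implementation: sentences are treated as plain strings, the NER and Padam WordNet synonym lookups are omitted, and the 1/2k score above is read as 1/(2k) (1/2^k would be an alternative reading).

# Sketch of ds(pair, word): +/- k sentences of context around every
# sentence in which both characters of the pair occur.
from collections import defaultdict

def distance_scores(sentences, pair, vocabulary, k=2):
    ds = defaultdict(float)
    a, b = pair
    for i, sent in enumerate(sentences):
        if a in sent and b in sent:                    # pair co-occurs here
            lo, hi = max(0, i - k), min(len(sentences), i + k + 1)
            for j in range(lo, hi):
                d = abs(j - i)                         # sentence distance
                weight = 1.0 if d == 0 else 1.0 / (2 * d)
                for word in sentences[j].split():      # synonym lookup omitted
                    if word in vocabulary:
                        ds[word] += weight             # accumulate points
    return ds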

V. EVALUATION

The Malayalam WordNet Padam is used for getting the synonyms of each word in the extracted sentences, which are then compared with the words in the given vocabulary. When an overlap occurs, the ds(pair, word), tf(word, pair) and idf(word) values of the word are found; the maximum value of the product of these measures gives the expected relationship.

VI. CONCLUSION

Analysis of a story has important applications in the area of Natural Language Processing; the summarization of stories and the analysis of major characters in them are two examples. The proposed system can extract the relationships between characters in short stories based on the relations defined in the vocabulary.

REFERENCES

[1] Ellen Riloff, "Information Extraction as a Stepping Stone toward Story Understanding", edited by Ashwin Ram and Kenneth Moorman, The MIT Press, 2000.
[2] Krishna Janakiraman, "Extracting Character Relationships from Stories", Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370.
[3] Wenjie Li, Peng Zhang, Furu Wei, Yuexian Hou and Qin Lu, "A Novel Feature-based Approach to Chinese Entity Relation Extraction", HK RGC (CERG PolyU5211/05E) and China NSF (60603027), 2007.
[4] Eugene Agichtein and Luis Gravano, "Snowball: Extracting Relations from Large Plain-Text Collections", ACM DL, 2000.
[5] Daniel Jurafsky and James H. Martin, Speech and Language Processing, Pearson Education, Inc. and Dorling Kindersley Publishing, Inc., 2000.



A Machine Learning Approach for Malayalam Root Word Identification Nisha M1, Reji Rahmath K2 M. Tech Computational Linguistics Department of CSE Govt.Engg College, Sreekrishnapuram, Kerala, India 678633 nisha.m407@gmail.com1 rahmathrejik@gmail.com2

P C ReghuRaj Department of CSE Govt.Engg College, Sreekrishnapuram, Kerala, India 678633 pcreghuraj@gmail.com

ABSTRACT: Words are the building blocks of every language. Morphological analysis and generation are necessary for building computational grammars as well as for Machine Translation, and the identification of root words makes morphological analysis and generation easier. This paper presents a method to identify the root words, or stems, of Malayalam words. The approach used is Memory Based Language Processing (MBLP), a Machine Learning technique based on exemplar storage during learning and analogical reasoning during processing. The system is trained using TiMBL (Tilburg Memory Based Learner); the training instances are created from words manually annotated for their segmentation. The system is tested with 1500 words and identifies root words with 92% accuracy.

Keywords: Natural Languages, Natural Language Processing, Root word, Morphology, Memory Based Language Processing, Malayalam Morphological Analysis, Machine Learning, Supervised Learning.

I. INTRODUCTION

Natural Language Processing (NLP) is concerned with the knowledge representation and algorithms involved in learning, producing, and understanding natural language. Language technology, or language engineering, uses the formalisms and theories developed within NLP in applications ranging from spelling error correction to machine translation and automatic extraction of knowledge from text [1], [9].

Identifying the root form of a word is very important in Natural Language Processing. The root word can be taken as a keyword in searching and indexing for retrieving more relevant pages, and root word identification is necessary for translating texts with better quality. Statistical tools like frequency counters, concordance, keyword-in-context, n-grams, etc. need the root form of a word to know more about the vocabulary. Spelling correction and spelling suggestion also require root word identification, and lexicographers assign the root word as the head word/lexical entry in the lexicon. Malayalam is a highly agglutinative language and most of its words occur in their inflected or derivative form. For obtaining the



root form of the words, the suffixes agglutinated with them are to be removed; also, the morphophonemic change (sandhi) that occurs when a root word concatenates with a suffix should be analysed and generalized. Morphological analysers and parts-of-speech taggers developed for Malayalam have attempted to find the root form of the word. Comparing the empirical methods with the knowledge based approach, it is clear that the former have a number of advantages [1]. In general, probabilistic approaches have a greater coverage of syntactic constructions and vocabulary, they are more robust (they exhibit graceful degradation in deteriorating circumstances), they are reusable for different languages and domains, and development times for making applications and systems are shorter. On the other hand, knowledge-based methods make the incorporation of linguistic knowledge and sophistication possible, sometimes allowing them to be more precise.

This paper describes the application of a machine learning method called Memory-Based Learning (MBL) to the root word identification of Malayalam words. The paper is organized as follows: Section II describes related work in natural language processing on root word identification and morphological analysis. Section III gives a brief introduction to Memory Based Language Processing (MBLP), an approach to NLP based on the symbolic machine learning method MBL. Section IV describes the functionalities of a root word identifier. Section V presents the system design and implementation, including the IB1 algorithm that is popularly used in MBLP. Section VI gives the

experimental setup and the results obtained for Malayalam root word identification. The final section concludes the paper with future scope.

II. LITERATURE REVIEW

All words in a language are unique, having their own function and meaning. The syntactic and semantic knowledge about individual words can be encapsulated in a highly structured repository known as a computational lexicon, which is essential for Machine Translation. For designing a computational lexicon, the first and foremost task is to identify the head words or root words of the language. The Root Word Identifier proposed in [8] is a rule based approach which automatically removes the inflected part and derives the root words using morphophonemic rules. The use of memory-based learning for morphological analysis and part-of-speech (POS) tagging of written Arabic is explored in [6]: memory-based morphological analysis of Arabic words is feasible, but its main limitation is its inability to recognize the stem of an unknown word, and consequently the appropriate vowel insertions. Also, its guess at the possible POS tags of an unknown word turned out to be less useful in the tagging approach than using the raw prefix and suffix letters of the words themselves, as witnessed by the scores on unknown words of the POS tagger specialized in unknown words. An MBLP approach to tagging Malayalam sentences is described in [5]. The proposed system is based on an empirical approach that models the human parts of speech (POS) tagging process more realistically than the existing systems, without compromising efficiency and accuracy. The idea here is to



use the Memory Based Language Processing (MBLP) algorithm. MBLP is based on the combination of two powerful techniques: the efficient storage of solved examples of the problem, and similarity based reasoning on the basis of these stored examples to solve new ones. This is implemented with the TiMBL (Tilburg Memory Based Learner) tagger tool and tested against the existing SVM-tagger for Malayalam POS tagging. In [3], a corpus of Malayalam words is analyzed and the rules that govern suffixes are manually derived; root words are identified by removing the suffixes as per the rules. In [10], a Malayalam morphological analyzer is developed using a hybrid approach combining methodologies of both the paradigm and suffix stripping approaches; Lttoolbox in the Apertium package is used for the morphological analysis. Probabilistic and rule based methods are used for morphological analysis in [7]; these approaches use inflection and suffix lists, created using lookup tables.

III. MEMORY BASED LEARNING

Memory-based learning, also known as instance-based, example-based, or lazy learning, and based on the k-nearest neighbor classifier, is a supervised inductive learning algorithm for classification tasks. Memory-based learning treats a set of labeled (pre-classified) training instances as points in a multi-dimensional feature space, and stores them as such in an instance base in memory [1], [9]. Thus, in contrast to most other machine learning algorithms, it performs no abstraction, which naturally allows it to deal with productive but low-frequency exceptions.

A. Memory Based Learning Algorithms

The Tilburg Memory-Based Learner (TiMBL) is free software, released under the GNU General Public License, which implements several memory based learning algorithms [2]. All implemented algorithms have in common that they store the representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases. The most basic metric that works for patterns with symbolic features is the Overlap metric: Δ(X, Y) = Σ δ(xi, yi) over the n features, where Δ(X, Y) is the distance between instances X and Y and δ(xi, yi) is the distance per feature (0 if the two feature values are equal, 1 otherwise). The distance between two patterns is thus simply the sum of the differences between the features. The k-NN algorithm with this metric is called IB1 [1]. The IB1 algorithm used in TiMBL differs from the original IB1 algorithm in that k refers to the k nearest distances rather than the k nearest examples, with k = 1: the TiMBL nearest neighbor set can contain several instances that are equally distant from the test instance, so the k-NN kernel could therefore be called k-nearest-distances classification.
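As a concrete illustration, here is a minimal Python sketch of the Overlap metric and IB1-style 1-nearest-neighbour classification. It is illustrative only; TiMBL itself adds feature weighting (e.g. gain ratio) and many optimizations.

# Overlap metric: the distance between two patterns is the number of
# mismatching features; IB1 returns the label of the nearest stored example.
def overlap_distance(x, y):
    return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

def ib1_classify(instance, training_set):
    # training_set: list of (features, label) pairs
    _, label = min(training_set,
                   key=lambda ex: overlap_distance(instance, ex[0]))
    return label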

IV. ROOT WORD IDENTIFIER

The process of separating the affixes from an inflected word can provide the root of the word and its grammatical information. Morphological variations of words occur in Malayalam due to inflection, derivation and word compounding. A root word is the most basic form of a word that is able to convey a particular description, thought or meaning [8]; the definition given for a root word is that it is a real word from which new words can be made by adding prefixes and suffixes. The proposed system uses a machine learning approach for Malayalam root word identification, and the Malayalam words used in paper [4] are used for training and testing. Common grammatical categories for Malayalam are noun, pronoun, verb, adverb, adjective, postpositions, indeclinables, clitics, etc.; in this work, the main grammatical category noun is considered. Root word identification consists of the identification of the root of words, or more technically, the constituents of the words, which constitutes a major part of a morphological analyzer and generator. The design and implementation of a morphological analyzer and generator for Malayalam is a promising research area for various applications in NLP [4]. There are different methods for the morphological analysis of natural language: the Brute Force method, the Root Driven method and the Affix Stripping method are some of the methods evolved in analysis [3]. In the root driven method, the root/stem is identified first and the affixes are then parsed; in the affix stripping method the process takes place in the reverse direction, where the affixes are identified first and the remaining part is assumed to be the stem or root. In the suffix stripping method the searching process is relatively fast, as the search is done on suffixes. A root word identifier is to be designed to analyze the main constituents of the words; it will help to segment the words into stems and inflectional markers. Using this root word identifier, a morphological

analyzer can return the root/stem word along with its grammatical information depending upon its word category: for nouns it will provide gender, number and case information, and for verbs, tense, aspect and modality.

V. SYSTEM DESIGN AND IMPLEMENTATION

The goal of the memory-based root word identifier system presented in this paper is to identify the root word of a word that has not occurred before in the training corpus. Malayalam words occur in inflected, derived and compound forms. The definition given for a root word is that it is a real word from which new words can be made by adding prefixes and suffixes [3].

A. Design

The key strength of the approach is the use of representations of parts of words to perform memory-based reasoning. The memory-based processing system will not find a reliable match for an unseen word as a whole, but it is quite likely to find good matches between the suffix part of unseen words and suffix parts of known words, and likewise for any other part. Root word identification then becomes a memory-based generalized lookup process on word parts, followed by assembling the part outcomes to produce the root and suffix as output. Fig. 1 shows the architecture of the memory-based root word identifier.

B. Implementation

1) Training: Malayalam is a highly agglutinative language. Any number of affixes can be combined to form a new word.



Fig. 1: Architecture of the memory-based root word identifier for Malayalam

A paradigm defines all the word forms that a given stem can take and also provides a feature structure associated with each word. Words from different paradigms were analyzed for inflections, and from each paradigm basic inflections were selected. Words in a paradigm whose inflections are covered by these generalized inflections are avoided; this helps to reduce the corpus size considerably. For instance, by training on the word aanakaLooToppaM, the training of the words aanakaL and aanakaLooT can be eliminated.

2) Windowed Example Generation: Words are converted to instances suitable for memory-based learning using windowed example generation. Instances represent the input (the orthographic word) and the corresponding output (the morphological analysis). Since instances need to be of a fixed length, and since they need to be general enough to generalize from known to unknown

words, instances do not map entire words to entire analyses (which would render them case specific), but rather represent partial fixed-width snapshots of words mapping to subsequences of the analysis. More specifically, the mapping is broken down into smaller letter-by-letter classification tasks. The input of each instance, consisting of a fixed number of features, is created by sliding a window over the input word, resulting in one instance for each letter. Using a 1-1-1-1-1-1-1 window yields 7 features: the input letter in focus, plus the preceding 3 letters and the following 3 letters. The hyphen mark (-) is used as a filler symbol for positions beyond the beginning or end of the word. The window is created by finding the trigrams of letters of words. Each focus letter is mapped to the corresponding letter in the morphologically analyzed output, and a plus sign (+) is used to split the stem and the affix. Once the training corpus is developed, the system is trained and tested using the tool TiMBL, which calculates the entropy and gain for each attribute and assigns the class label. Figure 2 illustrates the training of the word maramillaatat.

VI. EXPERIMENTAL RESULTS

The experiment was performed on a Linux machine running Ubuntu Linux 13.04, with TiMBL installed. As discussed above, TiMBL is software that implements the MBL algorithms. Though this approach was well studied for Dutch, in our experiment it was extended to the Malayalam language, and found to work well.



Fig. 2: Training instance for the word 'maramillaatat'

There are three necessary files: (1) the input word; (2) a Python program that converts the word into instances of the same form as the training corpus and invokes TiMBL; and (3) a training file from which TiMBL learns using the IB1 algorithm, with similarity computed using both weighted overlap and the Modified Value Difference Metric (MVDM), relevance weights computed with gain ratio, and the number of most similar memory items on which the output class is based set to 1. The implementation using the weighted overlap similarity metric outperforms MVDM. The efficiency of the system depends on how well class labels are identified based on the patterns of morphotactic changes in Malayalam words. The training corpus was created using 687 inflections in the paradigms 'amma', 'aana', 'maraM' and 'ati'. The system produced correct output for known words; when tested with 1500 unknown words, the system produced outputs with a root identification accuracy of 92%. A minimal sketch of the instance generation and a typical TiMBL invocation is given below.
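The instance generation and a typical TiMBL run can be sketched as follows. This is illustrative only: the class-label encoding is simplified, and the TiMBL option string shown is one plausible configuration (IB1, overlap metric, gain-ratio weighting, k = 1), not necessarily the exact command used in the experiments.

# Windowed example generation: a 1-1-1-1-1-1-1 window (7 features) with
# '-' padding beyond the word boundaries; labels come from the manually
# annotated segmentation (e.g. '+' at the stem/suffix boundary).
def make_instances(word, labels):
    padded = ['-'] * 3 + list(word) + ['-'] * 3
    instances = []
    for i in range(len(word)):
        features = padded[i:i + 7]        # 3 left + focus letter + 3 right
        instances.append(features + [labels[i]])
    return instances

# Typical (illustrative) TiMBL invocation:
#   timbl -f train.data -t test.data -a 0 -mO -w 1 -k 1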

A. Comparison with Paradigm Based, Suffix Stripping and Hybrid Approaches

The system performs better when compared with the paradigm based and suffix stripping approaches: the proposed system uses a machine learning approach, while the other two use rule based approaches. As Malayalam is highly agglutinative, even two or more inflected forms of words get glued into one word, and this makes the paradigm approach difficult to implement; the memory based root word identifier is found to work well in this case. The paradigm based approach requires grammatical rules and root and affix dictionaries, and the accuracy of the hybrid approach depends on the morphological dictionary and the suffix list used. The memory based approach does not use such dictionaries, and hence requires less memory. Table I shows the comparison of the different approaches.

TABLE I: Comparison of Paradigm based, Hybrid and Memory based approaches.

VII. CONCLUSION AND FUTURE SCOPE

This paper presents a new approach for root word identification in the Malayalam language. Memory based learning stores all examples in memory and settles for a state of maximal description length. This extreme bias makes memory-based learning an interesting comparative case against so-called eager learners, such as decision tree induction algorithms and rule learners. The IB1 algorithm has been used and performs well for the Malayalam language. There are still other



methods, like the IGTree algorithm and the IB1 algorithm with variable k (note: here, IB1 with k = 1 is used); performance could be measured for each value of k. More accuracy can be obtained by increasing the window size. Changes occurring at the morpheme boundaries when morphemes are glued together can be encoded in the class label to improve the accuracy of the system. Rules can be integrated with the system to obtain a full-fledged morphological analyzer, and memory based POS tagging can be used to identify the word classes. FAMBL (FAMily-Based Learning), a variant of IB1 that constitutes an alternative approach to careful abstraction over examples, can also be used.

REFERENCES

[1] Walter Daelemans and Antal van den Bosch. "Memory-Based Language Processing". Cambridge University Press, 2005.
[2] Walter Daelemans, Jakub Zavrel, Ko van der Sloot, and Antal van den Bosch. "TiMBL: Tilburg Memory Based Learner, version 6.3, Reference Guide". Technical Report ILK 10-01, ILK Research Group, Tilburg University, 2010.
[3] Jisha P. Jayan, R. R. Rajeev, and S. Rajendran. "Morphological analyzer and morphological generator for Malayalam-Tamil machine translation". International Journal of Computer Applications, 13(8), 2011.
[4] Jisha P. Jayan, R. R. Rajeev, and S. Rajendran. "Morphological Analyser for Malayalam - A Comparison of Different Approaches". IJCSIT, Vol. 2, No. 2, pp. 155-160, Dec 2009.
[5] Robert Jesuraj and P. C. Reghu Raj. "MBLP approach applied to POS tagging in Malayalam language". NCILC, 2013.
[6] Erwin Marsi, Antal van den Bosch, and Abdelhadi Soudi. "Memory based morphological analysis generation and part-of-speech tagging of Arabic". In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, pages 1-8. Association for Computational Linguistics, 2005.
[7] O. R. Rinju, R. R. Rajeev, P. C. Reghu Raj, and Elizabeth Sherly. "Morphological analyzer for Malayalam: Probabilistic method vs rule based method". IJCSIT, 2, October 2010.
[8] Meera Subhash, M. Wilscy, and S. A. Shanavas. "A rule based approach for root word identification in Malayalam language". International Journal of Computer Science & Information Technology, 4(3), 2012.
[9] Antal van den Bosch and Walter Daelemans. "Implicit schemata and categories in memory-based language processing". Language and Speech, 56(3):309-328, 2013.
[10] P. M. Vinod, V. Jayan, and V. K. Bhadran. "Implementation of Malayalam morphological analyzer based on hybrid approach". ROCLING XXIV (2012), page 307, 2012.


Malayalam SentiWordnet: A Lexical Resource for Cross Domain Sentiment Analysis and Opinion Mining Anagha M1, Raveena R Kumar2, Sreetha K3 Department of CSE Govt. Engineering College, Sreekrishnapuram, Palakkad, Kerala, India 678633 anaghamanoharan3@gmail.com1, veenakalathil@gmail.com2, sreetha227@gmail.com3

Naseer C Assistant Professor Department of CSE Govt. Engineering College, Sreekrishnapuram, Palakkad, Kerala, India 678633 naseercholakkalathu@gmail.com

ABSTRACT: Most of the work in Sentiment Analysis has been carried out in the English language, but these days people are expressing their opinions in Malayalam too, and there are many challenges in performing sentiment analysis in Malayalam. Sufficient resources for performing sentiment analysis in the Malayalam language are not available; for example, a SentiWordNet for determining the polarity of words does not exist for Malayalam. In this paper we try to develop a SentiWordNet for the Malayalam language, which will be helpful in different sentiment analysis tasks. The Malayalam WordNet developed by Amrita University was taken as the base, and the words in it were tagged manually as positive, negative or neutral. This training corpus was then used to train the system using three different tools: TnT, SVM and CRF. The output tags are compared to determine the scores of the words; each of the scores varies from 0.0 to 1.0. Keywords: SentiWordnet, Sentiment Analysis, Opinion Mining.

I. INTRODUCTION

Nowadays people regularly express their opinions through various websites, and most people rely on reviews about products before purchasing them. Sentiment Analysis is a Natural Language Processing (NLP) task that aims at obtaining the opinions of the writer by analysing various text forms such as reviews, news, and blogs. The sentiments are mainly classified into three classes: positive, negative and neutral.

Sentiment Analysis is performed at one of three levels:
• Document level: determines the polarity of the whole document.
• Sentence level: determines the polarity of sentences.
• Aspect level: determines the polarity of sentences/documents for each feature they contain.

Sentiment analysis contains several subtasks, like [6]:
1) Determining the positive and negative polarity of a text.



2) Determining the strength of the PN-polarity of a text.
3) Extracting opinions from a text.

To aid these tasks, determining the positive and negative words is very important. A SentiWordNet can help in determining the positive and negative terms, and also in assigning scores to these words, which helps in efficient score calculation for an opinion. Most of the work in Sentiment Analysis has been carried out in the English language, but these days people are expressing their opinions in Malayalam too, and sufficient resources for performing sentiment analysis in Malayalam, such as a SentiWordNet for determining the polarity of words, are not available. In this paper we try to develop a SentiWordNet for the Malayalam language, which will be helpful in different tasks such as sentiment analysis and opinion mining. Each of the word scores varies from 0.0 to 1.0. The rest of this paper is organized as follows: Section II describes related work. Section III presents the challenges in assigning scores explicitly, and Section IV presents the proposed methodology. Section V focuses on the experimental results, and Section VI describes the scope and applications of the proposed system. Finally, Section VII summarizes and concludes the work and briefs its future scope and ways to improve the efficiency of the system.

II. LITERATURE REVIEW

Sentiment Analysis is language specific and also depends on time. A SentiWordNet helps to effectively assign scores to the sentiment words, and a lot of work has been carried out in English and other languages on building SentiWordNets. In [4], SentiWordNet, a lexical resource, is developed to associate each synset s of WordNet 2.0 with positive, negative and objective scores; for building the SentiWordNet, eight individual synset classifiers were trained and then gathered into a synset classifier committee. In [6], SentiWordNet 3.0, an enhanced lexical resource for sentiment classification and opinion mining, is introduced; its generation includes two main steps for refining the scores: (1) a weak-supervision, semi-supervised learning step, and (2) a random-walk step. In [2], an interactive gaming technology (Dr Sentiment) to create and validate SentiWordNets is introduced for three Indian languages, Bengali, Hindi and Telugu, by involving the Internet population. The paper [1] developed a lexical resource for Hindi, called Hindi SentiWordNet, and implemented a majority-score-based strategy to classify a given document. In [5], a sentence level mood extraction technique for Malayalam is used, with a semantic orientation method for mood extraction. The paper [3] introduces a methodology for determining the polarity of text within a multilingual framework: a document in a different language is first translated to



English, and the translated document is classified into one of the classes positive or negative. For classification, the document is first searched for sentiment-bearing words, and then scores are assigned using SentiWordNet. Through the literature survey it is evident that a SentiWordNet for the Malayalam language does not exist. In this paper, we try to develop a SentiWordNet in Malayalam.

III. CHALLENGES IN ASSIGNING SCORES EXPLICITLY

It sometimes becomes difficult to assign positivity and negativity scores to different words explicitly, since the same word may contribute different meanings to a review according to the context. Some such examples are listed below.

1) Implicit Sentiment and Sarcasm: Sentences may carry implicit sentiments, which means the opinion can be expressed without any sentiment-bearing words in it [7]. For example: "Ee book engane vayikkum!" This sentence does not explicitly carry negative sentiment-bearing words although it is a negative sentence. Thus identifying semantics is more important in Sentiment Analysis than syntax detection.

2) Thwarted Expectations: Sometimes the author deliberately sets up context only to refute it at the end [7]. Consider the following example: "nalla kadhayaanu, ellavarum nannai abhinayichittund, pakshe ee padam odilla." In spite of the presence of words that are positive in orientation, the overall sentiment is negative because of the crucial

last sentence; in traditional text classification this would have been classified as positive, since term frequency matters more there than term presence.

3) Subjectivity Detection: This is to differentiate between opinionated and non-opinionated text. It can be used to enhance the performance of the system by including a subjectivity detection module to filter out objective facts, but this is often difficult to do [7]. Consider the following examples: "Enikk kathakal ishtamalla", "ishtamaayilla enna katha nannayittund." The first example presents an objective fact, whereas the second depicts an opinion about a particular story named ishtamaayilla.

4) Negation: Handling negation is a challenging task in Sentiment Analysis. Negation can be expressed in subtle ways even without the explicit use of any negative word [7]. Example: "Ee website upayogikkan eluppam alla". The word "alla" reverses the tag of "eluppam".

IV. SYSTEM DESIGN AND IMPLEMENTATION

In this work, a SentiWordNet was developed for Malayalam, which was not available before. The development cycle consists of six modules. The first module involves preparing a training corpus of Malayalam sentiment words. The next module involves manually tagging the corpus as positive, negative or neutral. This corpus is then used to train the system using three different tools: TnT, SVM and CRF. The third module involves preparing a test set which would serve as the content of the SentiWordNet.



The fourth module involves tagging the test set with the three tools. After that, in the next module, the output tags are compared to determine the scores of the words. Finally, the results are written to a file with ID, word, positive score, negative score and objective score.

A. Training the System

Multi-domain reviews from various web sites were collected and a training corpus was created. The next step was to manually tag the training data, which was a tough task, because at times the same word may give different moods in different situations. For example, the word maduthu usually gives a negative mood, but when a positive word like chirichu combines with it, the mood of the phrase chirichu maduthu becomes extremely positive. The system was then trained using the three tools TnT, SVM and CRF.

B. Preparation of the Test Set

The next major task was to build a large lexical resource file with about 30000 sentiment words. This was prepared by taking the Malayalam WordNet developed by Amrita University as base.

C. Calculating Scores

The words from the Malayalam WordNet were tagged using the three tools with which the system was trained. The output tags were then compared to determine the scores of the words. If all three taggers tagged a particular word as positive, its positive score was taken to be 1; similarly, if only one tagger tagged the word as positive, the positive score of that word was taken to be 0.33, and so on. The words without any amount of positivity or negativity were removed from the set; the remaining words are the entries of the proposed SentiWordNet. The results were written to a file with an ID (the same as the word's WordNet ID), the word, and its positive, negative and objective scores.

V. EXPERIMENTAL RESULTS

To test the accuracy of the SentiWordNet, experiments on sentiment analysis were conducted. User reviews collected from various online web sites were given as input, and the percentage of positivity and negativity in each review was obtained as output, using the prepared SentiWordNet as the resource providing the scores. The system-generated output was compared with manually tagged output, since no other work has been done in this particular area to date. The system gave a performance rate of 90.1%, which was the average of the judgements of ten human judges who compared the manual and system-generated outputs.
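The agreement-based scoring described in Section IV-C can be made concrete with a short Python sketch. It is an illustration, not the authors' code: taggers stands in for the three trained taggers (TnT, SVM and CRF), whose interfaces are not specified here, and treating the remainder as the objective score is an assumption.

# Each agreeing tagger vote contributes 1/3 (~0.33) to the polarity score.
def word_scores(word, taggers):
    tags = [tag(word) for tag in taggers]          # three tagger functions
    pos = round(tags.count('positive') / 3.0, 2)   # 0, 0.33, 0.67 or 1
    neg = round(tags.count('negative') / 3.0, 2)
    obj = round(1.0 - pos - neg, 2)                # assumed objective score
    return pos, neg, obj

# Words with pos == 0 and neg == 0 are dropped from the SentiWordNet.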

VI. SCOPE AND APPLICATIONS

A SentiWordNet can be considered a lexical resource that contains all the sentiment-bearing words with their corresponding polarities; the polarities include positive, negative and objective scores. A SentiWordNet is mainly used for sentiment classification and opinion mining, and each word in it can be taken as opinion information to review the sentiment in a sentence. Sentiment analysis may be helpful in fields like product reviews, business, blog posts, market research, etc. One of the most important approaches to finding the sentiment is to use a SentiWordNet, which can be considered a dictionary of opinionated terms; business analysts can extract subjective information about their products using sentiment analysis. Opinion mining can be divided into three tasks: the first is subjectivity-objectivity polarity detection, to determine whether a text is subjective or objective; the second is positivity-negativity polarity detection, to determine whether a text is positive or negative; and the third is detecting the strength of the positivity-negativity polarity, to determine how positive or negative a text is. The subjective and objective sentences can also be extracted using the polarities. The sentiment in a sentence depends upon the polarities of its words in the SentiWordNet. A SentiWordNet has various other applications, like word sense disambiguation, information retrieval, automatic text classification, automatic text summarization and machine translation. It can also be used to compare the similarity between given words.

VII. CONCLUSION

In this paper we try to develop a SentiWordNet for the Malayalam language using a statistical method. A SentiWordNet was previously unavailable for Malayalam, although SentiWordNets were already available for Bengali, Hindi and Telugu. This lexical resource would be very helpful for sentiment analysis and opinion mining: with the proposed SentiWordNet, the polarities of sentences can be easily calculated. As future work we would like to add synsets to each word, which would be helpful in tasks such as calculating the similarity between two words.

ACKNOWLEDGMENT

The authors would like to thank Dr. Pushpak Bhattacharyya and Aditya Joshi of IIT Bombay for clearing our doubts and helping us whenever we faced obstacles.

REFERENCES

[1] Aditya Joshi, Balamurali A R and Pushpak Bhattacharyya. "A Fall-back Strategy for Sentiment Analysis in Hindi: a Case Study". In Proceedings of ICON 2010: 8th International Conference on Natural Language Processing, 2010.
[2] Das A. and S. Bandyopadhyay. "SentiWordNet for Indian Languages". In the 8th Workshop on Asian Language Resources (ALR), August 21-22, Beijing, China, 2010.
[3] Denecke, K. "Using SentiWordNet for multilingual sentiment analysis". In Proceedings of ICDE-8, vol. 2.
[4] Esuli Andrea and Sebastiani Fabrizio. "SentiWordNet: A publicly available lexical resource for opinion mining". In Proceedings of Language Resources and Evaluation (LREC), 2006.
[5] Govindaru V., Neethu Mohandas, and Janardhanan P. S. Nair. "Domain specific sentence level mood extraction from Malayalam text". International Conference on Advances in Computing and Communications, pages 78-81, vol. 1, 2012.
[6] S. Baccianella, A. Esuli, and F. Sebastiani. "SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining". In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC '10), pages 2200-2204, 2010.
[7] Subhabrata Mukherjee and Pushpak Bhattacharyya. "Sentiment Analysis: A Literature Survey". Indian Institute of Technology Bombay, Department of Computer Science and Engineering, 2012.

Spark update brings R support and machine learning chops

One of the most popular big data processing platforms, Spark, now supports one of the premier statistical programming languages, R, which could pave the way for easier big data statistical analysis. "R is the lingua franca of data scientists and its adoption has exploded in the last two years," wrote Patrick Wendell, one of the chief contributors to Spark, in an email. Wendell is also a co-founder and software engineer at Databricks, which offers a commercial cloud-based version of Spark for enterprises. The new version "will let R users work directly on large datasets, scaling to hundreds or thousands of machines, well beyond the limits of a stand-alone R program," Wendell wrote. The newly updated Spark, version 1.4, also includes production-ready machine learning capabilities and a more comprehensive set of visual debugging tools. With more than 2 million users worldwide, R is one of the most widely used programming languages specifically designed for statistical computing and predictive analytics. The new release also comes with a production-ready machine learning pipeline, first introduced as an alpha feature in Spark 1.2. Machine learning is the programmatic approach for computers to infer new information through the use of preset rules and copious amounts of data. The new machine learning pipeline comes with a set of commonly used algorithms for preparing and transforming data. Emerging from alpha status means that developers can safely use the API without worrying that it will change in future editions of Spark.

Read more: http://www.cio.com/article/2935053/spark-update-brings-r-support-andmachine-learning-chops.html



A Simple Approach for Monolingual Event Tracking System in Malayalam Rekha Raj C T1, Sruthi Sankar K P2

Raseek C

Department of CSE Govt. Engg College Sreekrishnapuram, Palakkad Kerala-678633 rekharajct@gmail.com1 sruthisankarkp@gmail.com2

Department of CSE Govt. Engg College Sreekrishnapuram, Palakkad Kerala-678633 rasek.c@gmail.com

ABSTRACT: Topic Detection and Tracking (TDT) is an area of information retrieval that aims to monitor the on-line news stream in order to automatically spot new unreported news events (first story detection) and to assign stories to previously detected events (topic tracking, cluster detection). Given a news story, a TDT system would be able to attach it to any previous discussions about the event portrayed in the story; else the story would be regarded as new. This paper presents a Monolingual Event Tracking system for Malayalam, developed on a newspaper corpus built from a Malayalam newspaper. The system takes as input a news story of a particular date as the initial news story. All the stories within 30 days before and 30 days after the date of the initial news story are considered target stories. The goal of the system is to identify those news stories that describe the same event as the initial news story. An event has been considered as a vector that consists of term, person, location, organization and date, and a news story is represented as a set of such event vectors. All the event vectors of the initial and target stories are checked for similarity, which is determined using Latent Semantic Analysis. Any two news documents are said to be similar if the number of matching event vectors exceeds a certain threshold value.

Keywords: Information Retrieval, Natural Language Processing, tracking, event vector, monolingual event tracking system.

I. INTRODUCTION

An event is some unique thing that happens at some point in time [2]. The notion of an event differs from the broader category of events both in spatial/temporal localization and in specificity [4]. For example, "Magnus Carlsen won the World Chess Championship on 23rd November, 2014" is considered to be an

event, whereas "World Chess Championship" in general is considered to be a class of events. Events might be unexpected, such as the eruption of a volcano, or expected, such as a political election. Event Tracking, an application of Information Retrieval, is the task of monitoring a stream of news stories to find those that discuss the same event as the one covered in a few sample stories [4]. Event Tracking is a subproblem of a much broader problem called Topic Detection



and Tracking (TDT). Given a news story, a TDT system would have to be able to attach it to any previous discussions about the event portrayed in the story; else the story would be regarded as new. In this task, a system is given a small number (Nt = 1, 2, 4, 8 or 16) of sample stories that are known to be on the same news topic. The system's job is to monitor the stream of stories that follow the Nt-th story to find all subsequent stories on the same topic [8]. This paper presents a simple approach to a monolingual event tracking system for Malayalam.

II. LITERATURE REVIEW

Topic Detection and Tracking (TDT) is an area of Information Retrieval (IR). A TDT system takes a stream of stories as input; it identifies the first story that describes an event and finds all stories following a particular event [2]. TDT-related research began in 1996 with a DARPA-funded pilot study [2], in which the researchers conducted many experiments to check the feasibility of TDT systems using existing technology. Allan, Lavrenko and Jin introduced topic tracking [3], and concluded that first story detection through tracking is inefficient and requires different approaches. After the launch of the TDT program, the scope was confined to event detection and tracking, but later the focus returned to spotting dynamic topics that center around seminal events [10]. The methods applied in TDT cover a good portion of the prevailing IR methods: the majority of the approaches in TDT have relied on some sort of clustering: Single-Pass Clustering [2],

[12], [17] or hierarchical Group-Average Clustering. Also, Hidden Markov Models [15], Rocchio [16], k-Nearest Neighbors [16], naive Bayes [14], probabilistic Expectation-Maximization models and Kullback-Leibler divergence [10] have been used. Allan et al. investigated the use of Named Entities (NE) in the vector model. Similarly, Yang et al. [18] extracted locations, names of individuals and organizations, time and date references, and sums of money and percentages for NE weighting. The work reported by Anup Kumar Kolya et al. [8] describes a Monolingual Event Tracking System for Bengali; the proposed system determines whether two news documents within a range of dates describe the same event, representing an event as a vector consisting of person, location, organization, title and date. No work in the area of event tracking (or TDT) can be found involving Indian languages except Bengali.

III. EVENT TRACKING SYSTEM IN MALAYALAM

The event tracking system has been implemented based on a news corpus developed from the web archive of a leading Malayalam newspaper. From the HTML file of each news story, its title, date and content were extracted; the corpus was then automatically tagged with the TnT tagger [13]. A news story of a particular date is given as input to the system, which considers all the stories published in the preceding and following 30 days as target stories. Similarity between the initial and target stories has been computed. The system will find all the stories that describe



the same event as that of the initial story from the target stories, based on the similarity value.

A. Event Vector Creation

An event is considered as a vector consisting of Term, Person, Location, Organization and Date [8]. Person, Location, Organization and other named entities belong to the word class of nouns, and a news story is represented as an event vector consisting of the nouns (in their root form) in the story. Malayalam is a highly agglutinative language, and a root word can be inflected in multiple ways; the nouns have therefore been stemmed using the Silpa stemmer [1] to obtain their stems. Using the root forms of nouns in event vectors helps to reduce the size of the event vectors. Two stories representing the same event may contain different inflections of the same word, so stemming the words helps to achieve better similarity between the stories.

B. Similarity Measurement

Each story is represented as an event vector. To determine whether two stories discuss the same event, their event vectors are compared; the similarity between the initial and target stories is computed using Latent Semantic Analysis [6], [7].

1) Latent Semantic Analysis: Latent Semantic Analysis (LSA) is an indexing and retrieval method that uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. LSA is based on the principle that words used in the same contexts tend to have similar meanings. A key feature of LSA is its ability to extract

the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.

2) Term-Document Matrix: LSA begins by constructing a term-document matrix M to identify the occurrences of the m unique terms within a collection of n documents. In a term-document matrix, each term is represented by a row and each document by a column, with each matrix cell a_ij assigned a weight w_ij associated with the term-document pair [k_i, d_j]. The weight w_ij = tf × idf, where tf is the term frequency and idf is the inverse document frequency. The term-document matrix is usually very large and very sparse.

3) Singular Value Decomposition: Singular Value Decomposition is performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text. Formally, the singular value decomposition of an m×n real or complex matrix M is a factorization of the form M = U S V^t, where U is an m×m real or complex unitary matrix, S is an m×n rectangular diagonal matrix with non-negative real numbers on the diagonal, and V^t (the conjugate transpose of V, or simply the transpose if V is real) is an n×n real or complex unitary matrix. The diagonal entries S_ii of S are known as the singular values of M. The m columns of U and the n columns of V are called the left-singular vectors and right-singular vectors of M, respectively. The singular value decomposition and the eigendecomposition are closely related; namely: 1) The left-singular vectors of M are eigenvectors of M M^t. 2) The right-singular vectors of M are eigenvectors of M^t M.



3) The non-zero singular values of M (found on the diagonal entries of S) are the square roots of the non-zero eigenvalues of both M^t M and M M^t.

In the event tracking system, the document collection consists of the initial story and its target stories. All the terms in the event vectors of the document collection are found; the number of distinct terms in the collection is denoted by m and the number of documents by n. The weight w_ij for each cell in the matrix is calculated. The relationship between any two documents can be obtained from the document-document matrix M^t M, which is computed as (V S)(V S)^t. The next step is to find the correlation value between the initial story and the target stories. Let D denote the document-document matrix, and let the initial story be the i-th story in the document collection. Then D[i, i] is taken as the normalizing factor, so that the correlation of the initial story with itself is 1. The correlation of document i with document j is computed only if D[i, j] is non-zero, and is obtained by dividing D[i, j] by the normalizing factor. Many irrelevant target stories might have a non-zero correlation with the initial story; to avoid retrieving irrelevant stories, only those stories whose correlation value is more than half the correlation value of the maximally correlated target story are retrieved. A sketch of this computation is given below.
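The correlation computation can be sketched with numpy as follows. This is an illustration under simplifying assumptions, not the authors' implementation: raw term frequencies stand in for the tf × idf weights, and docs_terms is assumed to hold one list of stemmed terms per story, with the initial story at index 0.

# LSA story correlation: term-document matrix -> truncated SVD ->
# document-document matrix D = (VS)(VS)^t -> row 0 normalized by D[0, 0].
import numpy as np

def story_correlations(docs_terms, rank=2):
    vocab = sorted({t for doc in docs_terms for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    M = np.zeros((len(vocab), len(docs_terms)))
    for j, doc in enumerate(docs_terms):
        for t in doc:
            M[index[t], j] += 1.0          # raw tf; tf-idf omitted
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    VS = Vt.T[:, :rank] * S[:rank]         # document vectors in latent space
    D = VS @ VS.T                          # rank-reduced M^t M
    return D[0] / D[0, 0]                  # self-correlation normalized to 1

# Target stories whose correlation exceeds half of the maximum
# target-story correlation are then retrieved.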

C. Evaluation

The monolingual event tracking system for Malayalam was developed using a news corpus built from the web archive of a leading Malayalam newspaper. The system is trained with 200 news stories and tested with 50 stories corresponding to 10 different events. When the system was tested with an initial story describing the "French Open quarter final", 6 documents were retrieved, out of which 5 were relevant; the answer set consisted of 9 documents.

Recall (in %) = (stories correctly identified as similar) / (total similar news documents) = 5/9 × 100% = 55.55%

Precision (in %) = (stories correctly identified as similar) / (documents identified by the system as similar) = 5/6 × 100% = 83.33%

IV. CONCLUSION

This paper presents a simple approach for a monolingual event tracking system in Malayalam. The system is developed based on a news corpus built from the web archive of a leading Malayalam newspaper. A news story of a particular date is given as input to the system, which considers all the stories in the preceding and following 30 days of the initial story as target stories. The goal of the system is to retrieve stories that discuss the same event as the initial story. Each news story has been represented as a vector consisting of the nouns in the story in their root form. Similarity between the event vectors was computed using LSA, the correlation between the initial and target stories was computed, and target stories with higher correlation were retrieved. On evaluation, the system demonstrated a precision of 83.33% and a recall of 55.55%.



REFERENCES

[1] Silpa Stemmer, 2014. http://silpa.org.in/Stemmer [Online; accessed 23-December-2014].
[2] James Allan, Jaime G. Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. "Topic detection and tracking pilot study final report". 1998.
[3] James Allan, Victor Lavrenko, and Hubert Jin. "First story detection in TDT is hard". In Proceedings of the ninth international conference on Information and knowledge management, pages 374-381. ACM, 2000.
[4] James Allan, Victor Lavrenko, and Ron Papka. "Event tracking". UMass Computer Science Department, CIIR Technical Report IR-128, 1998.
[5] James Allan, Ron Papka, and Victor Lavrenko. "On-line new event detection and tracking". In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 37-45. ACM, 1998.
[6] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern Information Retrieval, volume 463. ACM Press, New York, 1999.
[7] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. "Indexing by latent semantic analysis". JASIS, 41(6):391-407, 1990.
[8] A. K. Kolya, A. Ekbal, and S. Bandyopadhyay. "A simple approach for Monolingual Event Tracking system in Bengali". In Natural Language Processing, SNLP '09, Eighth International Symposium on, pages 48-53, Oct 2009.
[9] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. "An introduction to latent semantic analysis". Discourse Processes, 25(2-3):259-284, 1998.
[10] Victor Lavrenko, James Allan, Edward DeGuzman, Daniel LaFlamme, Veera Pollard, and Stephen Thomas. "Relevance models for topic detection and tracking". In Proceedings of the second international conference on Human Language Technology Research, pages 115-121. Morgan Kaufmann Publishers Inc., 2002.
[11] Yuen-Yee Lo and Jean-Luc Gauvain. "The LIMSI topic tracking system for TDT2001". Proc. TDT, page 1, 2001.
[12] Juha Makkonen, Helena Ahonen-Myka, and Marko Salmenkivi. "Applying semantic classes in event detection and tracking". In Proceedings of the International Conference on Natural Language Processing (ICON 2002), pages 175-183. Citeseer, 2002.
[13] R. R. Rajeev, Jisha P. Jayan, and Elizabeth Sherly. "Parts of Speech Tagger for Malayalam". IJCSIT International Journal of Computer Science and Information Technology, 2(2):209-213, 2009.
[14] Kristie Seymore and Roni Rosenfeld. "Large-scale topic detection and language model adaptation". 1997.
[15] Paul Van Mulbregt, Ira Carp, Lawrence Gillick, Steve Lowe, and Jon Yamron. "Text segmentation and topic tracking on broadcast news via a hidden Markov model approach". In ICSLP, 1998.
[16] Yiming Yang, Tom Ault, Thomas Pierce, and Charles W. Lattimer. "Improving text categorization methods for event tracking". In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 65-72. ACM, 2000.
[17] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, Thomas Pierce, Brian T. Archibald, and Xin Liu. "Learning approaches for detecting and tracking news events". IEEE Intelligent Systems, 14(4):32-43, 1999.
[18] Yiming Yang and Nianli Ma. "CMU in cross-language information retrieval at NTCIR-3". In Proceedings of the Third NTCIR Workshop. Citeseer, 2003.

MIT Cheetah robot lands the running jump

Robot sees, clears hurdles while bounding at 5 mph

In a leap for robot development, the MIT researchers who built a robotic cheetah have now trained it to see and jump over hurdles as it runs — making this the first four-legged robot to run and jump over obstacles autonomously. In experiments on a treadmill and an indoor track, the cheetah robot successfully cleared obstacles up to 18 inches tall — more than half of the robot's own height — while maintaining an average running speed of 5 miles per hour. Sangbae Kim, an assistant professor of mechanical engineering at MIT, and his colleagues — including research scientist Hae won Park and postdoc Patrick Wensing — will demonstrate their cheetah's running jump at the DARPA Robotics Challenge in June, and will present a paper detailing the autonomous system in July at the conference Robotics: Science and Systems. The robot can "see," with the use of onboard LIDAR — a visual system that uses reflections from a laser to map terrain. The team developed a three-part algorithm to plan out the robot's path, based on LIDAR data. Both the vision and path-planning system are onboard the robot, giving it complete autonomous control. The team tested the MIT cheetah's jumping ability first on a treadmill, then on a track. On the treadmill, after multiple runs, the robot successfully cleared about 70 percent of the hurdles. In comparison, tests on an indoor track proved much easier, as the robot had more space and time in which to see, approach, and clear obstacles; in these runs, the robot successfully cleared about 90 percent of obstacles. Kim is now working on getting the MIT cheetah to jump over hurdles while running on softer terrain, like a grassy field. This research was funded in part by the Defense Advanced Research Projects Agency.

Read more: http://newsoffice.mit.edu/2015/cheetah-robot-lands-running-jump-0529



Article Invitation for CLEAR September 2015

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) Journal, to be published in September 2015. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th September, 2015 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Journal

Representative, SIMPLE Groups



Hello World,

Malayalam is one of the prominent regional languages of the Indian subcontinent, spoken by over 35 million people; in 2013 the Government of India declared it a classical language. Being an agglutinative language, computational modelling of the Malayalam language is really challenging. The field of Artificial Intelligence called Computational Linguistics tries to capture and implement the inherent language capabilities of humans in a computational model. This edition of CLEAR Journal provides a forum for scholars to enhance their background and get exposed to aspiring research areas including machine learning, information retrieval, information extraction and sentiment analysis, basically on Malayalam. This edition also mentions the milestones crossed by PG students and industry scholars, including their works published in the National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR 2014), which was hosted at GEC Sreekrishnapuram. I would like to sincerely thank the contributing authors for the effort taken by them, regardless of their busy schedules, to broadcast their views on Malayalam computational modelling, thereby making it beneficial to all of us. The pace of innovation continues in the field of natural language processing. If words like language model, machine translation, information retrieval and artificial intelligence have meaning for you, then you are at the right place to find the right position in this convoluted world...!!!

Simple group welcomes more aspirants in this area. Wish you all the best!!! Pelja Paul N

peljapaul@gmail.com


