NLP POS tagging using Hidden Markov Models and Mahout
Yogesh Pawar, Big Data Architect and Trainer, STLSOFT, www.stlsoft.in, Pune 411 057
Abstract: This article discusses the use of Hidden Markov Models (HMM) for Natural Language Processing (NLP), implemented on Hadoop. NLP involves processing large volumes of unstructured text to derive value. Hidden Markov Models are widely used in NLP for part-of-speech (POS) tagging. A parameterised Hidden Markov Model implementation is included in Mahout, the machine learning library for Hadoop. Large training datasets can be processed in distributed mode on Hadoop, and subsequently even larger datasets can be analysed. Textual data can be processed in parallel without the need for global state, making Hadoop MapReduce a suitable distributed framework for NLP.
1. Introduction
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. – Wikipedia
Words in natural language can have different meanings depending on context. For example: "John saw the man on the mountain with a telescope." Who has the telescope? John, the man on the mountain, or the mountain? Or: "Flying planes can be dangerous." Either the act of flying planes is dangerous, or planes that are flying are dangerous. Such ambiguity makes NLP a challenging task. Earlier NLP systems were rule based; formulating the rules was complex and the resulting systems were inflexible, so machine learning approaches are now preferred over rule-based systems for NLP. Identifying the part of speech, or word class, of each word helps to remove ambiguity. The process of classifying words into their parts of speech and labelling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Table 1 shows the universal tagset.

Table 1: Universal Part-of-Speech Tagset (source: NLTK documentation)
Tag     Meaning               English Examples
ADJ     adjective             new, good, high, special, big, local
ADP     adposition            on, of, at, with, by, into, under
ADV     adverb                really, already, still, early, now
CONJ    conjunction           and, or, but, if, while, although
DET     determiner, article   the, a, some, most, every, no, which
NOUN    noun                  year, home, costs, time, Africa
NUM     numeral               twenty-four, fourth, 1991, 14:24
PRT     particle              at, on, out, over per, that, up, with
PRON    pronoun               he, their, her, its, my, I, us
VERB    verb                  is, say, told, given, playing, would
.       punctuation marks     . , ; !
X       other                 ersatz, esprit, dunno, gr8, university
Tagging example using the NLTK Python NLP library:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
2. Hidden Markov Models
Hidden Markov Models are widely used in computational linguistics for POS tagging. The Markov property (the future is independent of the past, given the present): in probability theory and statistics, the term Markov property refers to the memoryless property of a stochastic process. It is named after the Russian mathematician Andrey Markov. Markov models assume that only the current state is needed to fully predict the next state. To explain the Markov model, and why it is called hidden, let us consider the example of dropping a marble onto a peg, so that it bounces off to the left or right and lands in one of two cups, a or b.
Table 2: Total drops = 10 = N

No. of drops in cup:  a: N1 = 8,  b: N2 = 2
Distribution:         Bernoulli, with parameters P1 and 1 - P1
Likelihood of b:      N2/N = 0.2
Hence we can estimate the parameters of the model by observing the number of times the ball takes the path to a or b. Now consider that we introduce another two pegs into the system.
Table 3: Total drops = 10 = N

Path                  0        1                        2                        3
No. of drops by path  N0 = 4   N1 = 6 * Pr(1|b; θ(i))   N2 = 6 * Pr(2|b; θ(i))   N3 = 4

where 6 is the number of times the ball dropped into cup b, θ(i) denotes the parameter estimates at EM iteration i, and

Pr(1|b; θ(i)) = P1*(1-P0) / (P1*(1-P0) + P0*(1-P2))
Pr(2|b; θ(i)) = P0*(1-P2) / (P1*(1-P0) + P0*(1-P2))

Now consider that the pegs are covered and we cannot see which path the ball takes to land in cup b. For paths 0 and 3 we can observe the drops, but for paths 1 and 2 we have to estimate probabilities that depend on the model parameters themselves. Hence the EM algorithm is used: starting from some initial values for the parameters P, it iteratively optimises them to obtain likely parameters of the model. Because we cannot observe paths 1 and 2, this Markov model is called hidden.

A Hidden Markov Model H is a quintuple (S, V, π, A, B) where:
S = {s1, ..., sN} is the set of states; N is the number of states.
V = {v1, ..., vM} is the vocabulary, the set of symbols that may be emitted.
π is the initial probability distribution on the states; it gives the probability of starting in each state.
A = (aij) is the transition probability matrix; aij is the probability of moving from state si to state sj.
B = (bjk) is the emission probability matrix; bjk is the probability of emitting symbol vk from state sj.
Time is discrete and starts with 1.
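To make the E-step for the covered-peg example above concrete, here is a minimal Java sketch that computes the posterior path probabilities Pr(1|b; θ(i)) and Pr(2|b; θ(i)) and the expected counts N1 and N2. The class name and the chosen values for P0, P1 and P2 are illustrative assumptions, not from the original example:

public class PegEStep {
  public static void main(String[] args) {
    // current parameter guesses theta(i) for the three pegs -- illustrative values
    double p0 = 0.5, p1 = 0.6, p2 = 0.7;
    int dropsInB = 6; // observed number of balls landing in cup b

    // unnormalised probabilities of the two hidden paths into cup b
    double path1 = p1 * (1 - p0);
    double path2 = p0 * (1 - p2);

    // E-step: posterior probability of each hidden path, given the ball landed in b
    double pr1 = path1 / (path1 + path2); // Pr(1 | b; theta(i))
    double pr2 = path2 / (path1 + path2); // Pr(2 | b; theta(i))

    // expected path counts N1, N2, used to re-estimate the parameters in the M-step
    System.out.printf("Pr(1|b)=%.3f Pr(2|b)=%.3f N1=%.2f N2=%.2f%n",
        pr1, pr2, dropsInB * pr1, dropsInB * pr2);
  }
}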
An HMM can be used to find the probability of a sequence of observations (evaluation), to estimate the model parameter matrices A, B and π (learning), and to find the most likely hidden state sequence (decoding). In the next section we will see how to use an HMM for POS tagging with the Mahout machine learning library on Hadoop.
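These three standard problems map directly onto classes in Mahout's HMM package. The following minimal sketch assumes the Mahout 0.9 sequencelearning.hmm API (HmmModel, HmmTrainer, HmmEvaluator); the toy matrices, the observation sequence and the Baum-Welch settings are invented for illustration:

import java.util.Arrays;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmTrainer;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.DenseVector;

public class HmmThreeProblems {
  public static void main(String[] args) {
    // toy HMM: 2 hidden states, 3 output symbols (all probabilities are illustrative)
    HmmModel model = new HmmModel(
        new DenseMatrix(new double[][] {{0.7, 0.3}, {0.4, 0.6}}),           // A: transitions
        new DenseMatrix(new double[][] {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}}), // B: emissions
        new DenseVector(new double[] {0.6, 0.4}));                          // pi: initial distribution

    int[] observations = {0, 1, 2, 1}; // an observed symbol sequence

    // 1. evaluation: likelihood of the observation sequence under the model
    //    (scaled = true uses a numerically stable forward computation)
    double likelihood = HmmEvaluator.modelLikelihood(model, observations, true);

    // 2. learning: re-estimate A, B and pi from unlabelled observations (Baum-Welch / EM)
    HmmModel trained = HmmTrainer.trainBaumWelch(model, observations, 0.0001, 10, true);

    // 3. decoding: most likely hidden state sequence (Viterbi)
    int[] hiddenStates = HmmEvaluator.decode(trained, observations, true);

    System.out.println("likelihood=" + likelihood
        + " decoded=" + Arrays.toString(hiddenStates));
  }
}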
3. HMM in Hadoop
Mahout, the machine learning library for Hadoop, contains implementations of various important algorithms. The following code shows how to use Mahout's HMM implementation; it is taken from the PosTagger example that ships with Mahout.

public static void main(String[] args) throws IOException {
  // generate the model from the CoNLL 2000 training data
  trainModel("http://www.jaist.ac.jp/~hieuxuan/flexcrfs/CoNLL2000-NP/train.txt");
  // evaluate the model against the held-out test data
  testModel("http://www.jaist.ac.jp/~hieuxuan/flexcrfs/CoNLL2000-NP/test.txt");
  // tag an exemplary sentence
  String test = "ABC is a huge company with many employees .";
  String[] testWords = SPACE.split(test);
  List<String> posTags = tagSentence(test);
  for (int i = 0; i < posTags.size(); ++i) {
    log.info("{}[{}]", testWords[i], posTags.get(i));
  }
}
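The helper methods trainModel, testModel and tagSentence (as well as the SPACE pattern and the log field) are defined elsewhere in the PosTagger example. As a rough sketch of what the tagging step involves, rather than the example's exact code, it looks something like the following; the parameter names, the tagNames list and the catch-all unknownWordId are illustrative assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmEvaluator;
import org.apache.mahout.classifier.sequencelearning.hmm.HmmModel;

public class TagSentenceSketch {
  private static final Pattern SPACE = Pattern.compile(" ");

  // hypothetical reconstruction: words are mapped to integer observation IDs,
  // the HMM decodes the most likely hidden tag-ID sequence, and the IDs are
  // mapped back to tag names
  static List<String> tagSentence(String sentence, HmmModel taggingModel,
      Map<String, Integer> wordIDs, List<String> tagNames, int unknownWordId) {
    String[] words = SPACE.split(sentence);
    int[] observations = new int[words.length];
    for (int i = 0; i < words.length; ++i) {
      Integer id = wordIDs.get(words[i]);
      observations[i] = (id == null) ? unknownWordId : id; // unseen words fall back to a catch-all ID
    }
    int[] tagIds = HmmEvaluator.decode(taggingModel, observations, false); // Viterbi decoding
    List<String> tags = new ArrayList<>(tagIds.length);
    for (int tagId : tagIds) {
      tags.add(tagNames.get(tagId));
    }
    return tags;
  }
}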
Output:
15/02/09 07:46:10 INFO PosTagger: Read 47377 lines containing 2012 sentences.
15/02/09 07:46:11 INFO PosTagger: POS tagged test file in 1.297 seconds!
15/02/09 07:46:11 INFO PosTagger: Tagged the test file with an error rate of: 0.060176879076345065
15/02/09 07:46:11 INFO PosTagger: ABC[NNP]
15/02/09 07:46:11 INFO PosTagger: is[VBZ]
15/02/09 07:46:11 INFO PosTagger: a[DT]
15/02/09 07:46:11 INFO PosTagger: huge[JJ]
15/02/09 07:46:11 INFO PosTagger: company[NN]
15/02/09 07:46:11 INFO PosTagger: with[IN]
15/02/09 07:46:11 INFO PosTagger: many[JJ]
15/02/09 07:46:11 INFO PosTagger: employees[NNS]
15/02/09 07:46:11 INFO PosTagger: .[.]
The output shows the POS tag assigned to each word of the example sentence, together with the error rate measured on the test file.
4. Conclusion
NLP using HMMs can be made more accurate with large datasets, and the distributed Hadoop platform makes processing such datasets faster. Because larger training sets can be used, more accurate patterns can be learned. As of this writing, the Mahout HMM implementation is not parallel. NLTK is a Python library for NLP; Hadoop Streaming can be used with NLTK to run Hadoop MapReduce jobs, although streaming also has its limitations.