International Journal on Natural Language Computing (IJNLC) Vol. 6, No.6, December 2017
SETSWANA PART OF SPEECH TAGGING
Gabofetswe Malema, Boago Okgetheng and Moffat Motlhanka
Department of Computer Science, University of Botswana, Gaborone, Botswana
ABSTRACT Part of speech tagging is one of the basic steps in natural language processing. Although it has been investigated for many languages around the world, very little has been done for the Setswana language. Setswana is written disjunctively, and some words play multiple functions in a sentence. These features make part of speech tagging more challenging. This paper presents a finite state method for identifying one of the compound parts of speech, the relative. Results show an 82% identification rate, which is lower than that for other languages. The results also show that the model can identify the start of a relative 97% of the time but fails to identify where it stops 13% of the time. The model fails due to the limitations of the morphological analyser and due to more complex sentences not accounted for in the model.
KEYWORDS Setswana, part of speech tagging, morphological analysis
1. INTRODUCTION Part of speech tagging is a process that identifies the parts of speech in a sentence for a given language. It is considered to be one of the fundamental stages of natural language processing for any language. It is a pre-processing stage for advanced applications such as machine learning, translation, and grammar checking [1]. Studies in this area fall into three main approaches: statistical, rule-based and hybrid [1][2][3][4]. Statistical methods train the tagger on tagged data. The approach has been shown to give good results, with accuracies in the 90% range for many languages [2]. However, the approach works well only if good training data is available. The rule-based approach applies the language's rules to identify parts of speech. Although the rule-based method relies heavily on knowledge of the language, it can give better accuracy and feedback [2]. The hybrid approach combines the benefits of the statistical and rule-based approaches. Words are first tagged using the statistical approach, and language rules are then used to resolve any words that need disambiguation [4].

The complexity of tagging varies from language to language. In some languages most tags are single words or tokens separated by spaces, which makes them easy to identify. In other languages the problem is nontrivial. For example, in Setswana, which is written disjunctively, several words are in some cases grouped together to form one part of speech. Only a few studies have highlighted the complexity of Setswana part of speech tagging and tokenization. In [5] a statistical approach is used to tag Northern Sotho, which is close to the Setswana language. The study obtained a performance of 94%. However, the study does not indicate whether compound parts of speech were included. A finite state machine approach is used in [6] to analyse Setswana verb morphology, and it also obtained a rate of 94%.
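The hybrid scheme outlined in the introduction (statistical tagging first, language rules only where disambiguation is needed) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy English training data, the tag names, the tie-based ambiguity test and the two context rules are hypothetical and are not taken from the cited studies or from the model presented in this paper.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data (English, for illustration only).
TAGGED = [
    ("the", "DET"), ("can", "NOUN"), ("is", "VERB"), ("full", "ADJ"),
    ("we", "PRON"), ("can", "AUX"), ("go", "VERB"),
    ("a", "DET"), ("can", "NOUN"), ("fell", "VERB"),
    ("they", "PRON"), ("can", "AUX"), ("sing", "VERB"),
]


def train(tagged):
    """Statistical step, training: count how often each word carries each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return counts


def statistical_tag(word, counts):
    """Return (tag, ambiguous).

    A word is treated as ambiguous if it is unseen or if its two most
    frequent tags are tied (an assumed, deliberately simple criterion).
    """
    if word not in counts:
        return "UNK", True
    ranked = counts[word].most_common(2)
    ambiguous = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
    return ranked[0][0], ambiguous


def rule_tag(prev_tag, candidates):
    """Rule step: hypothetical context rules based on the previous tag."""
    if prev_tag == "DET" and "NOUN" in candidates:
        return "NOUN"
    if prev_tag == "PRON" and "AUX" in candidates:
        return "AUX"
    return None  # no rule applies; keep the statistical guess


def hybrid_tag(words, counts):
    """Tag statistically, then let rules resolve only the ambiguous words."""
    tags = []
    for word in words:
        tag, ambiguous = statistical_tag(word, counts)
        if ambiguous:
            prev_tag = tags[-1] if tags else None
            resolved = rule_tag(prev_tag, set(counts.get(word, ())))
            if resolved is not None:
                tag = resolved
        tags.append(tag)
    return tags


if __name__ == "__main__":
    counts = train(TAGGED)
    print(hybrid_tag(["the", "can", "fell"], counts))  # ['DET', 'NOUN', 'VERB']
    print(hybrid_tag(["we", "can", "go"], counts))     # ['PRON', 'AUX', 'VERB']
```

In this sketch the word "can" is tied between two tags in the training counts, so the statistical step alone cannot decide and the context rule makes the final choice. Unseen words keep the placeholder tag UNK when no rule applies.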
The tokenization problem is related to this study in that tagging also requires tokenization. In [6] the main aim is to tokenize or tag verbs. In both studies the performance results are high and promising. However, the results are not good enough to lead to a completely automated part of speech tagger. The studies do not offer general approaches that cover both single words (tokens) and compound tokens or parts of speech.
DOI: 10.5121/ijnlc.2017.6602