
Int. J. on Recent Trends in Engineering and Technology, Vol. 10, No. 1, Jan 2014

Brill's Rule-based Part of Speech Tagger for Kadazan

Marylyn Alex¹ and Lailatul Qadri Zakaria²
CAIT Research Group, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor
Email: alexmarylyn@gmail.com, laila@ftsm.ukm.my

Abstract— This paper presents a Part of Speech (POS) tagger for the Kadazan language that implements Brill's approach, also known as Transformation-Based Error-Driven Learning. The Kadazan language was chosen because no POS tagger has yet been developed for it. This study was therefore carried out to develop a POS tagger specifically for Kadazan that can tag a Kadazan corpus systematically, help to reduce ambiguity, and serve as a language learning tool. The main objective of the study is to automate the tagging process for the Kadazan language. Brill's approach is an enhanced version of the original rule-based approach: it uses a set of predefined rules to transform wrong tags in the corpus into correct ones. To achieve the main goal, several objectives have been set: to create specific lexical and contextual rules for the Kadazan language by applying Brill's rule-based approach, and to evaluate the effectiveness of Kadazan POS tagging using Brill's approach. The tagging process is divided into four main phases. In the first phase, new untagged text is input into the system. In the second phase, the input text passes through the initial state annotator, which assigns every word in the corpus its most likely tag and produces a temporary corpus. In the third phase, the temporary corpus is compared with the goal corpus to detect any errors. In the last phase, the rules are applied to reduce the errors and correct the temporary corpus. The tagger was trained on two Kadazan children's story books containing 2069 words. Evaluation was carried out by comparing the tagging results of Brill's approach with manual tagging. The Kadazan POS tagger achieved an accuracy of around 93%. This study has shown how Brill's tagging approach can be used to identify tags for the Kadazan language.

Index Terms— Kadazan Language, Transformation-Based, POS tagger, Brill's approach, Statistical, Rule-Based
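The four phases described in the abstract can be illustrated with a minimal Python sketch. It is not the tagger developed in this study. The toy English tokens, the tag names (DT, NN, MD, VB, PRP), the single rule template ("change tag A to B when the previous tag is Z") and all function names are assumptions made for illustration only. For simplicity, the context of each rule is checked against the tags as they stood before the rule was applied.

```python
from collections import Counter, defaultdict

def build_lexicon(goal_corpus):
    """Map each word to its most frequent tag in the goal corpus."""
    counts = defaultdict(Counter)
    for word, tag in goal_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def initial_tag(words, lexicon, default="NN"):
    """Phase 2: initial state annotator (most likely tag per word)."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """Apply one contextual rule: from_tag -> to_tag if previous tag is prev_tag."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(tags, goal_tags):
    """Phase 3: compare the temporary corpus with the goal corpus."""
    return sum(t != g for t, g in zip(tags, goal_tags))

def learn_rules(words, goal_tags, lexicon, max_rules=10):
    """Greedily pick the rule that removes the most errors, then repeat."""
    tags = initial_tag(words, lexicon)
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only at positions that are still wrong.
        candidates = {
            (tags[i], goal_tags[i], tags[i - 1])
            for i in range(1, len(tags))
            if tags[i] != goal_tags[i]
        }
        best, best_errors = None, count_errors(tags, goal_tags)
        for rule in sorted(candidates):
            errors = count_errors(apply_rule(tags, rule), goal_tags)
            if errors < best_errors:
                best, best_errors = rule, errors
        if best is None:  # no candidate reduces the error count
            break
        rules.append(best)
        tags = apply_rule(tags, best)
    return rules

def tag(words, lexicon, rules):
    """Phases 1, 2 and 4: take untagged text, tag it, then apply the rules."""
    tags = initial_tag(words, lexicon)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Toy goal corpus (illustrative English tokens, not Kadazan data).
goal_corpus = [
    ("the", "DT"), ("can", "NN"), ("rusted", "VB"),
    ("we", "PRP"), ("can", "MD"), ("go", "VB"),
    ("they", "PRP"), ("can", "MD"), ("run", "VB"),
]
words = [w for w, _ in goal_corpus]
goal_tags = [t for _, t in goal_corpus]

lexicon = build_lexicon(goal_corpus)
rules = learn_rules(words, goal_tags, lexicon)
predicted = tag(words, lexicon, rules)
accuracy = 1 - count_errors(predicted, goal_tags) / len(goal_tags)
print(rules, accuracy)
```

In this toy run the initial annotator tags every "can" as MD, and the learner recovers a single contextual rule that retags MD as NN after DT. Accuracy here is measured against the same toy corpus used for learning, so it only demonstrates the comparison step and says nothing about the 93% figure reported in the abstract.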

I. INTRODUCTION

POS tagging is the process of reading text in a language and marking each word in the text (corpus) with its corresponding part of speech, such as noun, verb, adjective or adverb. In Natural Language Processing, POS tagging is important because it shows how words relate to each other and helps to resolve ambiguity in human language at different levels of analysis. It has been used in many applications, such as machine translation, speech recognition and information retrieval. Hence, the importance of POS tagging cannot be ignored. Several different approaches have been applied to POS tagging. The first technique used to address POS tagging was rule-based. Then, statistical approaches came into existence and

DOI: 01.IJRTET.10.1.1380
© Association of Computer Electronics and Electrical Engineers, 2014

