International Journal on Natural Language Computing (IJNLC) Vol. 2, No.6, December 2013
ROBUST EXTENDED TOKENIZATION FRAMEWORK FOR ROMANIAN BY SEMANTIC PARALLEL TEXTS PROCESSING Eng. Marius Zubac1 and Prof. PhD Eng. Vasile Dădârlat2 1 2
Department of Computer Science, Technical University, Cluj-Napoca, Romania Department of Computer Science, Technical University, Cluj-Napoca, Romania
ABSTRACT Tokenization is considered a solved problem when reduced to just word borders identification, punctuation and white spaces handling. Obtaining a high quality outcome from this process is essential for subsequent NLP piped processes (POS-tagging, WSD). In this paper we claim that to obtain this quality we need to use in the tokenization disambiguation process all linguistic, morphosyntactic, and semantic-level word-related information as necessary. We also claim that semantic disambiguation performs much better in a bilingual context than in a monolingual one. Then we prove that for the disambiguation purposes the bilingual text provided by high profile on-line machine translation services performs almost to the same level with human-originated parallel texts (Gold standard). Finally we claim that the tokenization algorithm incorporated in TORO can be used as a criterion for on-line machine translation services comparative quality assessment and we provide a setup for this purpose.
KEYWORDS Semantic Tokenization, Bi-lingual Corpora, Romanian WordNet, Bing, Google Translate, Comparative MT Quality Assessment
1. INTRODUCTION When faced with the task of text processing today’s systems need to implement robust procedures for input text acceptance no matter the final processing purpose ranging from information retrieval and sentiment analysis to machine translation applications. This acceptance criteria may include various filtering pre-processing of the text, however the vast majority of researchers [1], [2], [3] agree that tokenization should be the first step of the text processing pipe as shown in the Figure 1 preceding any lexical analysis step.
Raw Text 1. Pre-processing/Filtering 2. Sentence Splitting 3. Tokenization 4. Further Lexical Analysis (POS-Tagging, Parsing) Figure 1. Text processing before linguistic analysis DOI : 10.5121/ijnlc.2013.2602
17