
IDL - International Digital Library Of Technology & Research Volume 1, Issue 4,April 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Spell checker for Kannada OCR

Suma S, Sneha N
UG scholars, Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
suma.sinchu.ds@gmail.com, snehan769@gmail.com

Sharathkumar S
Assistant Professor, Department of Information Science and Engineering
Siddaganga Institute of Technology, Tumakuru
skumars@sit.ac.in

Abstract— A spell checker is an application program that helps process natural language in machine-readable form. Checking and correcting spelling is a basic necessity in any language, and doing it by hand is tedious, so spell checker software is required. A spell checker is a set of programs that detects a wrongly spelled word and replaces it with the most probable correct word. The challenge addressed here is doing this for the Kannada language. In software systems, Kannada words are typed in several formats because Kannada text can be written with many fonts. In this paper, we describe some techniques used by a spell checker for the Kannada language. We use natural language processing (NLP), a field of computer science concerned with the interaction between human (i.e., natural) languages and computers. Modern NLP algorithms are usually based on machine learning.

Keywords— Spell checker, NLP, OCR, Dictionary Lookup

I. INTRODUCTION

Kannada is a Dravidian language spoken predominantly by the people of Karnataka and neighboring states. It has roughly forty million native speakers and a total of 50.8 million speakers according to the 2001 census. Spell checking is a critical problem in NLP. A spell checker is an important component that is tightly coupled with various software such as OCR systems, word processors and translators.

1.1 Error Analyzer

A linguistic error analyzer is a tool which studies the types and causes of language errors. Errors may be classified as: conceptualization errors (i.e., thinking), phoneme to grapheme mapping errors (i.e., writing), typing errors, OCR generated errors, and errors generated by a speech recognizer.

Conceptualization errors
Errors that occur due to one's way of thinking.
Ex: [Kannada word] may be wrongly written as [Kannada word].

Phoneme to grapheme mapping errors
Errors that occur while writing dictated words.
Ex: [Kannada word] may be wrongly written as [Kannada word].

Typing errors
Errors that occur while typing, by pressing a wrong key.
Ex: [Kannada word] may be wrongly typed as [Kannada word].

OCR generated errors
Errors that occur when the OCR recognizes a character incorrectly.
Ex: [Kannada word] may be wrongly recognized as [Kannada word].

Errors generated by speech recognizer
Errors that occur due to wrong pronunciation of words or wrong recognition of words by the speech recognizer.
Ex: [Kannada word] may be wrongly recognized as [Kannada word].

1.2 Optical Character Recognition (OCR)

Optical character recognition is a technique for moving text from paper form to electronic form. An OCR system converts an image or written text into a machine-readable format; its input can be a plain document, an image, etc. Sources for OCR include bank statements, ATM transactions, e-statements and mailing documents. For tasks such as speech to text and image to text, the text is analyzed in digitized form, so that it can be easily edited, stored and accessed through an open-access system. OCR is a field of research spanning NLP, machine learning, artificial intelligence and computer vision. Modern applications need an accurate and flexible OCR system that can recognize any type of font across a variety of digital image inputs.


Copyright@IDL-2017


OCR Errors
Due to noise, the following errors may occur:

• Reject error: The machine reading process may not be able to recognize a character.
• Substitution error: OCR may recognize a character incorrectly.
• Character fusion: Two or more character images merge and appear as a single connected component.
• Character fragmentation: A character image is fragmented into more than one sub-image.

1.3 Spell Checker

A spell checker is an application program required by machines to process natural languages effectively. Spell checkers can be used as independent tools or as part of larger applications such as search engines and translators. A simple spell checker performs the following tasks:

• Scanning and extracting the words contained in the text.
• Matching the typed words against correctly written words, including words with special symbols, hyphens, etc. This is the important step.

Handling morphology requires a language-dependent algorithm. Even English needs such handling for related word forms such as plurals and verb forms, so processing these steps for other languages is a complicated issue.
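The two tasks listed above, extracting words and matching them against correctly written words, can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the regular expression, the function names and the tiny English lexicon are hypothetical, and it does no morphological handling. For Kannada, the word pattern would have to match the script or Romanization scheme actually in use.

```python
import re

# Assumption: a word is a run of ASCII letters that may contain internal
# hyphens or apostrophes (e.g. "well-known", "isn't").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def extract_words(text):
    """Scan the text and return the words it contains, in order."""
    return WORD_PATTERN.findall(text)


def find_unknown_words(text, lexicon):
    """Return the words of `text` that are missing from `lexicon`.

    Matching is case-insensitive; `lexicon` is expected to hold lowercase words.
    """
    return [word for word in extract_words(text) if word.lower() not in lexicon]


if __name__ == "__main__":
    # Hypothetical lexicon, for illustration only.
    lexicon = {"the", "spell", "checker", "handles", "well-known", "words"}
    print(find_unknown_words("The spel checker handles well-known words.", lexicon))
```

A word flagged this way is only a candidate error: an inflected form that is valid but absent from the lexicon would also be reported, which is why morphology has to be handled separately.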

Related Work

The literature survey reveals that most research on Kannada spell checkers focuses on normal text, while some efforts have been made for OCR text in other languages such as Punjabi and Hindi. No work on OCR spell checkers has been reported for Kannada directly. Some works on OCR spell checkers in other languages and on Kannada spell checkers are reported here. This review discusses common spell checking approaches and the problems that may occur during the spell checking process. There are two common approaches for implementing a spell checker: the dictionary lookup method and the N-gram approach.

Initially, the bulk of text data is divided into a series of separate words. A built-in morphological analyzer, which uses separate dictionaries, then accesses the root word and the suffix that follows it. A mapping function is necessary to establish the relationship between the different root words and their suffixes. The validity of a word is checked using the morphological analyzer. The analyzer then identifies which case applies: correct root and incorrect suffix, incorrect root and correct suffix, or correct root and matching suffix. Each case is handled individually, and the incorrect words are made evident. The misinterpreted words are corrected with the help of the user, who is offered suitable suggestions. The drawback of this system is that it fails on OCR output text: it cannot efficiently handle special cases in OCR such as character fusion and character fragmentation.

Dictionary lookup method
The dictionary lookup method compares the words in the input file with the correct words in a dictionary. It is useful on OCR output for inspecting ambiguous letters, but at large scale it leads to size overhead, complex probability calculation and a high search cost.

N-gram approach
An N-gram is a contiguous sequence of items from a text, where the items can be phonemes, graphemes, letters, words or even pairs of words. Unigrams, bigrams and trigrams are its varieties. In general, an N-gram model is a probabilistic language model: a Markov model of order n-1 that predicts the next item from the preceding n-1 items.

Design
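As orientation for the design, the following minimal sketch combines dictionary lookup with minimum edit distance on whitespace-tokenized Romanized text. It is an illustration under stated assumptions, not the authors' implementation: the function names, the five-word dictionary (built from Romanized examples that appear in this paper) and the use of plain whitespace tokenization are all assumptions. The suggestion threshold of 3 is the one the paper states.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit cost for insertion, deletion and substitution."""
    # previous[j] holds the distance between the processed prefix of `source`
    # and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def suggest(word, dictionary, max_distance=3):
    """Return dictionary entries within `max_distance` of `word`, closest first."""
    scored = sorted((edit_distance(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]


def check_line(romanized_line, dictionary):
    """Map each token that is missing from the dictionary to its suggestions."""
    report = {}
    for token in romanized_line.split():  # assumption: tokens are whitespace-separated
        if token not in dictionary:
            report[token] = suggest(token, dictionary)
    return report


if __name__ == "__main__":
    # Hypothetical dictionary made of Romanized words quoted in the paper.
    dictionary = {"rAmanu", "kAdig.e", "hodanu", "avanu", "snEha"}
    print(edit_distance("INTENTION", "EXECUTION"))
    print(check_line("rAmanu kAdige hodanu", dictionary))
```

With unit costs, `edit_distance("INTENTION", "EXECUTION")` evaluates to 5, matching the paper's example. A real dictionary would be far larger, so comparing every entry against each misspelled word, as `suggest` does here, would need indexing or pruning to stay practical.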



Romanization
Romanization is the process of converting written text from a specific writing system to Roman script. Romanization includes the following methods:

• Transliteration – for representing written text
• Transcription – for representing the spoken word
• Combination of both transliteration and transcription

Ex:
1. [Kannada word] in Romanized format is written as 'avanu'.
2. [Kannada word] in Romanized format is written as 'snEha'.

In this tool we read the input file line by line, and each line is Romanized to English.
Ex: [Kannada sentence] is Romanized to English as 'rAmanu kAdig.e hodanu'.

Tokenization
Tokenization is the process of splitting text into a set of tokens, which are meaningful elements such as words, phrases and symbols.
Ex: Input: This is a spell checker for Kannada OCR.
Output: This, is, a, spell, checker, for, Kannada, OCR.

In our project we give Romanized text as input to this module and get a list of tokenized words for comparison.
Ex: Input: rAmanu kAdig.e hodanu
Output: ['rAmanu', 'kAdig.e', 'hodanu']

Comparison
Each word is compared with a standard dictionary, and its validity is checked using the minimum edit distance algorithm.

Minimum Edit Distance
This is the Levenshtein distance: the minimum number of substitution, insertion and deletion operations needed to convert one string into another. It indicates how similar or dissimilar two strings or words are. For spelling correction, each word is compared automatically against the standard dictionary, and the dictionary words with the lowest distance to it are chosen as suitable corrections.
Ex: For the two strings INTENTION and EXECUTION, the minimum edit distance is 5, i.e., a minimum of 5 operations is required to change INTENTION into EXECUTION.
Words in the dictionary whose edit distance from a given misspelled word is less than or equal to 3 are suggested for it.

Results



Conclusion
In this project we have implemented a spell checker for Kannada OCR. From this project we learnt various tools for implementing a spell checker. We came to understand the various problems that occur during text processing and how to tackle them. Although there were many problems in processing Kannada text, we arrived at a workable way to implement a spell checker for Kannada OCR. As this project is a first attempt at implementing a spell checker for Kannada OCR, we hope it serves as a platform for beginners to understand various aspects of a spell checker for Kannada OCR.

Future Work
Several directions for future work can be proposed; some involve improving performance, while others can be built on top of the work done here. Some of the work we believe can be performed:

• The methods can be improved to achieve better efficiency.
• A larger dictionary with a much bigger set of words can be used.
• Root words and affixes can be separated to improve performance.
• The work can be extended to semantic errors as well.

Further, the work can be extended by applying a multi-threaded approach in the spell checker tool.

