**TOP 10 NATURAL LANGUAGE PROCESSING PAPERS: RECOMMENDED READING – LANGUAGE RESEARCH**


http://airccse.org/journal/ijnlc/index.html


BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOURCE LANGUAGES TAGSET - A FOCUS ON AN AFRICAN IGBO

Onyenwe Ikechukwu E¹, Onyedinma Ebele G¹, Aniegwu Godwin E² and Ezeani Ignatius M³

¹ Department of Computer Science, Nnamdi Azikiwe University, Awka, Nigeria {ie.onyenwe, eg.osita}@unizik.edu.ng
² Federal College of Education (Technical), Umunze, Nigeria aniegwuge@gmail.com
³ University of Sheffield, United Kingdom ignatius.ezeani@sheffield.ac.uk

ABSTRACT In this paper, we demonstrate the efficacy of a POS annotation method that uses two automatic approaches to assist the creation of a POS-tagged corpus for a language that is new to NLP. The two approaches are cross-lingual and monolingual POS tag projection. We used the cross-lingual approach to automatically create an initial ‘errorful’ tagged corpus for a target language via word alignment. The resources for creating this corpus are derived from a source language rich in NLP resources. A monolingual method is then applied to clean the induced noise via an alignment process and to transform the source language tags into target language tags. We used English and Igbo as our case study. This is possible because parallel texts exist between English and Igbo, and the source language, English, has available NLP resources. The results of the experiment show a steady improvement in accuracy and in the rate of tag transformation, with scores ranging from 6.13% to 83.79% and from 8.67% to 98.37% respectively. The rate of tag transformation measures the rate at which source language tags are translated into target language tags.
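The projection idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the majority-vote rule for multiply aligned tokens, the plain dictionary used as a stand-in for the monolingual transformation step, and the two metric definitions are all assumptions made for illustration. The word alignment itself is taken as given, as a list of (source index, target index) pairs.

```python
from collections import Counter


def project_tags(source_tags, alignment, target_len, unknown="UNK"):
    """Project source-language POS tags onto target tokens.

    alignment: list of (source_index, target_index) pairs (assumed given).
    A target token aligned to several source tokens takes the most frequent
    projected tag; an unaligned target token receives `unknown`.
    """
    votes = [Counter() for _ in range(target_len)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1
    return [v.most_common(1)[0][0] if v else unknown for v in votes]


def transform_tags(tags, mapping):
    """Rewrite source-tagset labels as target-tagset labels.

    A plain lookup table stands in for the monolingual step here;
    tags without a mapping are left unchanged.
    """
    return [mapping.get(tag, tag) for tag in tags]


def transformation_rate(tags, target_tagset):
    """Fraction of tokens whose tag belongs to the target tagset (assumed definition)."""
    if not tags:
        return 0.0
    return sum(tag in target_tagset for tag in tags) / len(tags)


def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical example: three source tokens, three target tokens,
# and made-up target tag names.
source_tags = ["DT", "NN", "VBZ"]
alignment = [(1, 0), (2, 1)]  # target token 2 is unaligned
projected = project_tags(source_tags, alignment, target_len=3)
transformed = transform_tags(projected, {"NN": "TGT_NOUN"})
```

In this toy run only one of the three target tokens ends up with a target-tagset label, which is the kind of low starting point that repeated bootstrapping rounds are meant to improve.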

KEYWORDS Languages, Africa, Part-of-Speech, Corpus, Natural Language Processing, Tagset, Igbo, Bootstrapping.

For More Details: http://aircconline.com/ijnlc/V8N1/8119ijnlc02.pdf

Volume Link: http://airccse.org/journal/ijnlc/vol8.html




ISOLATING WORD LEVEL RULES IN TAMIL LANGUAGE FOR EFFICIENT DEVELOPMENT OF LANGUAGE TOOLS

Suriyah M, Aarthy Anandan, Anitha Narasimhan and Madhan Karky

Karky Research Foundation, India

ABSTRACT With the advent of social media, the amount of text available for processing across different natural languages has become enormous. In the past few decades, there has been a tremendous increase in the number of language processing applications. The tools for natural language computing differ greatly from language to language because each language has its own set of grammatical rules. This paper focuses on identifying the basic inflectional principles of the Tamil language at the word level. Three levels of word inflection concepts are considered: Patterns, Rules and Exceptions. The paper shows how grammatical principles for word inflection in Tamil can be grouped into these three levels and applied to obtain different word forms. These can be used in a wide variety of natural language applications such as morphological analysis, morphological generation, word-level translation, spelling and grammar checking, and information extraction. Tools using these rules will offer faster operation and better implementation of the Tamil grammatical rules drawn from [தொல்காப்பியம் | tholgaappiyam] and [நன்னூல் | nannool] in NLP applications.

KEYWORDS Natural language processing, Rule based approach, word level rules, Tamil tool, language tools

For More Details: http://aircconline.com/ijnlc/V8N1/8119ijnlc03.pdf

Volume Link: http://airccse.org/journal/ijnlc/vol8.html

REFERENCES


[1] Omnicore.[Online]. Statistics/

Available:

Https://Www.Omnicoreagency.Com/Twitter-

[2] L.J.Brinton, The Structure Of Modern English: A Linguistic Introduction. Amsterdam, Philadelphia, PA: John Benjamins, 2000. [3] UC Sandiego Linguistics Department.[Online]. Http://Grammar.Ucsd.Edu/Courses/Lign120/08-Intro_Rev.Pdf

Available:

[4] S. Singh And V. M Sarma, “Hindi Noun Inflection And Distributed Morphology” In Proceedings Of The 17th International Conference On Head-Driven Phrase Structure Grammar, 2010, Pp. 307321 [5] M. Ramscar , “The Role Of Meaning In Inflection: Why The Past-Tense Does Not Require A Rule,” Cognitive Psychology, Vol. 45, No. 1, Pp. 45–94, 2002. [6] Wikipedia.[Online]. Available: Https://En.Wikipedia.Org/Wiki/Agglutination [7] Wikipedia.[Online]. Https://En.Wikipedia.Org/Wiki/Agglutinative_Language

Available:

[8] S. C. Reddaiah. “Dravidian Languages And Its Fundamental Grammar,” Indian Journal Of Research, Vol. 3, No. 2, Pp. 164-166, 2014. [9] Anand Kumar M, Dhanalakshmi V, Soman K.P And Rajendran S, “A Sequence Labeling Approach To Morphological Analyzer For Tamil Language”, International Journal On Computer Science And Engineering, Vol. 2, No. 6, Pp. 1944 – 1951, 2010 [10] P. Anandan, K. Saravanan, R.Parthasarathi And T. V. Geetha, “Morphological Analyzer For Tamil” In Proceedings Of International Conference On Natural Language Processing, 2002 [11] Suriyah M, Aarthy Anandan, Anitha Narasimhan And Madhan Karky, “Piripori Morphological Analyser For Tamil” In International Conference On Artificial Intelligence, Smart Grid And Smart City Applications, 2019. [12] [ ளஞ் சியம் | Kalanjiyam].[Online]. Available: Http://Store.Tamillexicon.Com [13] Maanikkavaasakan, Tholkaappiyam, Chennai, TN : Uma Padhippagam, 2010 [14] A. Manikkam, Nannool Kaandigaiyurai,Chennai, TN : Poompuhar Padhippagam, 1988 [15] Seeni Naina Muhammad, Nalla Tamizh Ilakkanam, CITY, TN : Adayalam Padhippagam, 2013


CITIZENSHIP ACT (591) OF GHANA AS A LOGIC THEOREM AND ITS SEMANTIC IMPLICATIONS

Nakpih Ireneous Callistus

Department of Mathematics and Computing, St. John Bosco’s College of Education, Navrongo, Ghana

ABSTRACT This paper presents excerpts of the natural text of the Citizenship Act of the Republic of Ghana, ACT 2000 (591), as a logic theorem in First Order Logic (FOL). This piece of law is formalised to allow semantic analysis of the text by machines. The results of this research also address the problem of ambiguity in the legal text and reveal the semantic consequences of the natural construction, or textual structure, of the piece of law. One major problem in the semantics of legal text is ambiguity, which largely affects the interpretations that are attributed to legal statements. Some constructed sentences do not reflect the semantic intentions of their composers, owing to inherent ambiguities or technical faults in the structure of some parts of the statements. The formalism of the Act in this paper provides logical proofs and deductions which are asserted to establish clarity and to reveal semantic errors, or otherwise preserve the semantic intentions of the Act.

KEYWORDS Natural Language Processing, Formalism, Logic Theorem, Semantic Analysis, Legal text.

For More Details: http://aircconline.com/ijnlc/V7N5/7518ijnlc02.pdf

Volume Link: http://airccse.org/journal/ijnlc/vol7.html


ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE

Maibam Indika Devi¹ and Prof. Bipul Syam Purkayastha²

¹ Department of Computer Science, Indira Gandhi National Tribal University, RCM, Makhan, Manipur
² Department of Computer Science, Assam University, Silchar, Assam

ABSTRACT Manipuri is both a minority language and a morphologically rich language, with genetic features similar to those of Tibeto-Burman languages. It has Subject-Object-Verb (SOV) order and agglutinative verb morphology, and it is monosyllabic. Morphology and syntax are not clearly distinguished in this language. Natural Language Processing (NLP) is a research field of computer science that deals with processing large amounts of natural language corpora. NLP applications encompass E-Dictionary, Morphological Analyzer, Reduplicated Multi-Word Expression (RMWE), Named Entity Recognition (NER), Part of Speech (POS) Tagging, Machine Translation (MT), Word Net, Word Sense Disambiguation (WSD) etc. In this paper, we present a study of the advancements in NLP applications for the Manipuri language, together with a comparison table of the approaches and techniques adopted and the results obtained for each application, followed by a detailed discussion of each work.

KEYWORDS Manipuri Language, NLP, NER, MT, WSD.

For More Details: http://aircconline.com/ijnlc/V7N5/7518ijnlc05.pdf

Volume Link: http://airccse.org/journal/ijnlc/vol7.html

REFERENCES


[1] Saiful Islam, Maibam Indika Devi, Prof. Bipul Syam Purkayastha. (2017). A Study on Various Applications of NLP Developed for North-East Languages, International Journal o Computer Science and Engineering (IJCSE) Vol. 9 No.06 Jun 2017 ISSN : 0975-3397 [2] D. Nadeau and S. Sekine.(2007).A survey of named entity recognition and classification, Linguisticae Investigationes, 30(7). [3] Khan Md. Anwarus Salam†, Mumit Khan and Tetsuro Nishino Vandeghinste, V., Schuurman, I., Carl, M., Markantonatou, S. and Badia, T.(2006), METIS-II: Machine Translation for Low Resource Languages,In Proceedings of LREC 2006. [4] Alok Ranjan Pal and Diganta Saha. (2015).Word Sense Disambiguation: A Survey,International Journal of Control Theory and Computer Modeling (IJCTCM) Vol.5, No.3, July 2015. [5] Antony P J and Dr. Soman K P.(2011).Parts Of Speech Tagging for Indian Languages: A Literature Survey,International Journal of Computer Applications (0975 – 8887) Volume 34– No.8. [6] Surjit Singh R.K., Gunasekaran S., Anand Kumar M. and Soman K.P.(2014).A Short Review about Manipuri Language Processing,International Science Congress Association,Research Journal of Recent Sciences ISSN 2277-2502,Vol. 3(3), 99-103. [7] Ningombam Shantikumar, Sagolsem Poireiton Meitei, and Bipul Syam Purkayastha. (2011). Building Manipuri-English machine readable dictionary by implementing ontology, International Journal of Engineering Science and Technology. [8] S.Poireiton Meitei, Shantikumar Ningombam, H.Mamata Devi and Bipul Syam Purkayastha. (2012). A Manipuri-English Bilingual Electronic Dictionary -Design and Implementation, International Journal of Engineering and Innovative Technology (IJEIT). [9] S.Poireiton Meitei and H. Mamata Devi. (2017). 
Development Of English To Manipuri Electronic Dictionary:A database approach, International Journal of Innovations & Advancement in Computer Science (IJIACS), Volume 6, Issue 3 [10] Sirajul Islam Choudhury, Leihaorambam Sarbajit Singh,Samir Borgohain, and Pradip Kumar Das. (2004). Morphological Analyzer for Manipuri:Design and Implementation, Proceedings of Second Asian Applied Computing Conference, AACC 2004, Kathmandu,Nep al. [11] Thoudam Doren Singh, and Sivaji Bandyopadhyay. (2006). Word class and sentence type identification in manipuri morphological analyzer, Proceedings of MSPIL, Mumbai, India. [12] Singha, Ksh Krishna B., and Bipul Syam Purkyastha. (2012). Morphological Analysis for Manipuri Nominal Category Words with Finite State Techniques, International Journal of Computer Applications 58.15.


[13] Kishorjit Nongmeikapam, Vidya Raj RK, Yumnam Nirmal and Sivaji Bandyopadhyay. (2012). Manipuri Morpheme Identification. Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP), pages 95-108, COLING. [14] Nongmeikapam Kishorjit, and Sivaji Bandyopadhyay. (2010). Identification of Reduplicated MWEs in Manipuri: A Rule Based Approach,(ICCPOL-2010), Redwood City, San Francisco, pp49-54. [15] Thoudam Doren Singh and Sivaji Bandyopadhyay. (2010).Web Based Manipuri Corpus for Multiword NER and Reduplicated MWEs Identification using SVM, Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pages 35– 42,the 23rd International Conference on Computational Linguistics (COLING). [16] Nongmeikapam Kishorjit, Ningombam Herojit Singh, Sonia Thoudam, and Sivaji Bandyopadhyay. (2011). Manipuri transliteration from Bengali script to Meitei Mayek: A rule based approach, In Information Systems for Indian Languages (pp. 195-198). Springer, Berlin, Heidelberg. [17] Nongmeikapam Kishorjit, Lairenlakpam Nonglenjaoba, Yumnam Nirmal, and Sivaji Bandhyopadhyay. (2012).Improvement of CRF based Manipuri POS tagger by using Reduplicated MWE (RMWE), arXiv preprint. [18] Nongmeikapam Kishorjit and Sivaji Bandyopadhyay. (2011). Genetic algorithm (GA) in feature selection for CRF based manipuri multiword expression (MWE) identification, arXiv. [19] Thoudam Doren Singh, Kishorjit Nongmeikapam, Asif Ekbal, and Sivaji Bandyopadhyay. (2009). Named Entity Recognition for Manipuri using Support Vector Machine, 23rd PACLIC , Hong Kong, Vol.2 pp. 811-818. [20] Nongmeikapam, Kishorjit, Tontang Shangkhunem, Ngariyanbam Mayekleima Chanu, Laisuhram Newton Singh, Bishworjit Salam, and Sivaji Bandyopadhyay. (2011). Crf based name entity recognition (ner) in manipuri: A highly agglutinative indian language, In Emerging Trends and Applications in Computer Science (NCETACS), 2nd National Conference on, pp. 1-6. IEEE. 
[21] Thoudam Doren Singh and Sivaji Bandyopadhyay. (2008). Morphology Driven Manipuri POS Tagger, Proceedings of the Workshop IJCNLP on NLP for Less Privileged Languages, 11 January 2008 IIIT, Hyderabad, India. [22] Thoudam Doren Singh, Asif Ekbal, and Sivaji Bandyopadhyay. (2008). Manipuri POS Tagging using CRF and SVM: A Language Independent Approach, Proceedings of ICON-2008: 6th International Conference on Natural Language Processing, Pune, India, pp 240-245.Volume 2, Issue 1.


[23] Kh Raju Singha, Bipul Syam Purkayastha, Kh Dhiren Singha, and Arindam Roy. (2011). Developing a Tagset for Manipuri Part of Speech Tagging, Journal Of Computer Science And Engineering, Volume 5. [24] Kh Raju Singha,Bipul Syam Purkayastha, and Kh Dhiren Singha. (2012). Part of Speech Tagging in Manipuri: A Rule-based Approach, International Journal of Computer Applications (0975 – 8887) Volume 51– No.14. [25] Nongmeikapam, Kishorjit, Lairenlakpam Nonglenjaoba, Asem Roshan, Tongbram Shenson Singh, Thokchom Naongo Singh, and Sivaji Bandyopadhyay. 2012. Transliterated SVM Based Manipuri POS Tagging, In Advances in Computer Science, Engineering & Applications (pp. 989-999). Springer, Berlin, Heidelberg. [26] Kh. Raju Singha, Bipul Syam Purkayastha, and Dhiren Singha. (2012). Part of Speech Tagging in Manipuri with Hidden Markov Model. IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 6, No 2, ISSN (Online): 1694-0814. [27] Thoudam Doren Singh and Sivaji Bandyopadhyay. (2010). Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations, Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 83–91,COLING, Beijing. [28] Thoudam Doren Singh and Sivaji Bandyopadhyay. (2011). Integration of Reduplicated Multiword Expressions and Named Entities in a Phrase Based Statistical Machine Translation System, Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1304– 1312,Chiang Mai, Thailand. [29] Thoudam Doren Singh and Sivaji Bandyopadhyay. (2010). Statistical machine translation of English– Manipuri using morpho-syntactic and semantic information, Proceedings of the Association for Machine Translation in the Americas (AMTA 2010). [30] Thoudam Doren Singh and Sivaji Bandyopadhyay. (2010). 
Manipuri-English Example Based Machine Translation System, International Journal of Computational Linguistics and Applications (IJCLA), ISSN (2010): 0976-0962. [31] Yumnam Bablu Singh and Bipul Syam Purkyastha.(2011). Experiences in building the indo wordneta wordnet for Manipuri, International Journal of Engineering Science and Technology (IJEST),ISSN : 0975-5462,Vol. 3 No. 5. [32] Richard Laishram Singh, Krishnendu Ghosh, Kishorjit Nongmeikapam and Sivaji Bandyopadhyay. (2014). A Decision Tree Based Word Sense Disambiguation System In Manipuri Language, Advanced Computing: An International Journal (ACIJ), Vol.5, No.4.

SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERN-BASED APPROACH

Santosh Kumar Bharti and Korra Sathya Babu

Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Odisha, India

ABSTRACT In the context of Indian languages, sarcasm detection in Hindi is a tedious job because the language is rich in morphology and complex in structure. Annotated resources of sarcastic Hindi sentences for machine learning analysis are almost non-existent. Here, we propose a pattern-based framework for sarcasm detection in Hindi tweets. It has been observed that a tweet is sarcastic if it intentionally contradicts its temporal facts. Temporal facts are a collection of time-dependent facts which may change over time. We used Hindi news with timestamps as a corpus of temporal facts. The timestamp describes the period for which a fact about an entity holds. In this research, a temporal fact is represented as a key-value pair. To form a pair, one needs to extract a triplet, i.e. subject, verb and object, from every sentence. Next, the key is formed from the subject and verb together, and the value is formed from the object and timestamp together. To predict whether a tweet is sarcastic, one needs to extract the triplet from the input tweet and form a pair. The pair of the input tweet is then mapped to the related pair in the corpus of temporal facts, and the two are checked to see whether they coincide. If they contradict, the input tweet is considered sarcastic. The proposed approach outperforms state-of-the-art techniques for Hindi sarcasm detection, attaining an accuracy of 82.8%.
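The key-value matching described in this abstract can be sketched in Python as follows. This is an illustrative sketch rather than the authors' system: triplet extraction from Hindi text is assumed to have been done already, the function names are hypothetical, English placeholder strings stand in for Hindi tokens, and "contradiction" is simplified to "the tweet's object differs from the most recent fact recorded for the same key at or before the tweet's time".

```python
from datetime import date


def make_pair(subject, verb, obj, timestamp):
    """Build a temporal fact: key = (subject, verb), value = (object, timestamp)."""
    return (subject, verb), (obj, timestamp)


def build_fact_corpus(news_triplets):
    """Index news facts by key; each key keeps a list of (object, timestamp) values."""
    corpus = {}
    for subject, verb, obj, timestamp in news_triplets:
        key, value = make_pair(subject, verb, obj, timestamp)
        corpus.setdefault(key, []).append(value)
    return corpus


def is_sarcastic(tweet_triplet, tweet_time, corpus):
    """Compare a tweet's triplet with the latest related fact.

    Returns True if the objects differ (treated here as a contradiction),
    False if they coincide, and None if no related fact exists.
    """
    subject, verb, obj = tweet_triplet
    facts = [f for f in corpus.get((subject, verb), []) if f[1] <= tweet_time]
    if not facts:
        return None
    latest_obj, _ = max(facts, key=lambda f: f[1])
    return obj != latest_obj


# Hypothetical news facts (placeholder tokens, not real data).
corpus = build_fact_corpus([
    ("team_a", "won", "match", date(2018, 1, 1)),
    ("team_a", "won", "series", date(2018, 3, 1)),
])
verdict = is_sarcastic(("team_a", "won", "trophy"), date(2018, 4, 1), corpus)
```

Keeping the timestamp in the value is what lets the lookup pick the fact that was current when the tweet was posted; a real system would also need a notion of contradiction richer than string inequality.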

KEYWORDS Natural Language Processing, Sentiment, Sarcasm, Social Network, Tweets

For More Details: http://aircconline.com/ijnlc/V7N5/7518ijnlc07.pdf

Volume Link: http://airccse.org/journal/ijnlc/vol7.html

REFERENCES


[1] Chaffey, D. 2016. Global social media research summary 2016. URL http://www.smartinsights.com/social-media-marketing/social-media-strategy/newglobalsocial-mediaresearch. [2] Tan, W., Blake, M. B., Saleh, I. and Dustdar, S. 2013. Social-network sourced big data analytics. Internet Computing, 17(5), pp. 62-69. [3] Gastelum, Z. N., Whattam, K. M. 2013. State-of-the- Art of Social Media Analytics Research. Pacific Northwest National Laboratory (PNNL), pp. 1-9. Vol.7, No.5, October 2018 78 [4] Lee, Parkvall, M. 2007. Varldens 100 storsta sprak 2007. The Worlds 100, 2007. [5] Mesthrie, R. 1992. Language in indenture: a sociolinguistic history of Bhojpuri-Hindi. South Africa, Routledge, pp. 30-32. ISBN 978-0415064040. [6] Language and Culture. 2015. Top 30 Languages by Number of Native Speakers, 2005. http://www.vistawide.com/languages/top 30 languages.htm. [7] Gonzalez-Ibanez, R., Muresan, S. and Wacholder, N. 2011. Identifying sarcasm in twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 2, pp. 581-586. [8] Liebrecht, C. Kunneman, F. and van den Bosch, A. 2013. The perfect solution for detecting sarcasm in tweets #not. In proceedings of the Association for Computational Linguistics, pp. 29-37. [9] Riloff, E., Qadir, A., Surve, P., De Silva, L., Gilbert, N. and Huang, R. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In proceedings of the Empirical methods in natural language processing, pp. 704-714. [10] Rajadesingan, A., Zafarani, R. and Liu, H. 2015. Sarcasm detection on Twitter: A behavioural modelling approach. In Proceedings of the Eighth ACM Inter- national Conference on Web Search and Data Mining, pp. 97-106. [11] Joshi, A., Sharma, V. and Bhattacharyya, P. 2015. Harnessing Context Incongruity for Sarcasm Detection. In proceedings of the Association for Computational Linguistics vol. 2, pp. 757-762. 
[12] Bharti, S., Sathya Babu, K. and Jena, S. 2015. Parsing- based sarcasm sentiment recognition in Twitter data. In International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 1373-1380, IEEE/ACM. [13] Joshi, A., Tripathi, V., Patel, K., Bhattacharyya, P. and Carman, M. 2016. Are Word Embeddingbased Features Useful for Sarcasm Detection?. ArXiv preprint arXiv: 1610.00883. [14] Bouazizi, M. and Ohtsuki, T. O. 2016. A pattern-based approach for sarcasm detection on twitter. IEEE Access, vol. 4, pp. 5477-5488.


[15] Bharti, S.K., Vachha, B. Pradhan, R.K., Babu, K.S. and Jena, S.K. 2016. Sarcastic sentiment detection in tweets streamed in real time: a big data approach. Digital Communications and Networks 2(3), pp. 108-121. [16] Desai, N. and Dave, A.D. 2016. Sarcasm Detection in Hindi sentences using Support Vector machine. International Journal of Advance Research in Computer Science and Management Studies, 4(7), pp. 8- 15. [17] P. Liu, W. Chen, G. Ou, T. Wang, D. Yang, and K. Lei. 2014. Sarcasm detection in social media based on imbalanced classification. In Web Age Information Management, pp. 459–471. [18] D. Tayal, S. Yadav, K. Gupta, B. Rajput, and K. Kumari. 2014. Polarity detection of sarcastic political tweets. In proceedings of International Conference on Computing for Sustainable Global Development (INDIACom), pp. 625–628, IEEE. [19] P. Tungthamthiti, K. Shirai, and M. Mohd. 2014. Recognition of sarcasm in tweets based on concept level sentiment analysis and supervised learning approaches. ACL, pp. 404–413. [20] S. K. Bharti, R. Pradhan, K. S. Babu, and S. K. Jena. 2017. Sarcasm analysis on twitter data using machine learning approaches. In Trends in Social Network Analysis, pp. 51–76, Springer. [21] D. Bamman and N. A. Smith. 2015. Contextualized sarcasm detection on Twitter. In ICWSM, pp. 574–577. [22] Z. Wang, Z. Wu, R. Wang, and Y. Ren. 2015. Twitter sarcasm detection exploiting a context-based model. In International Conference on Web Information Systems Engineering, pp. 77–91, Springer. [23] R. Schifanella, P. de Juan, J. Tetreault, and L. Cao. 2016. Detecting sarcasm in multimodal social platforms. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 1136–1145, ACM. [24] A. Khattri, A. Joshi, P. Bhattacharyya, and M. J. Carman. 2015. Your sentiment precedes you: Using an author’s historical tweets to predict sarcasm. In Proceedings of the WASSA@ EMNLP, pp. 25–30. [25] Bharti, S., Sathya Babu, K. and Jena, S. 2017. 
Harnessing Online News for Sarcasm Detection in Hindi Tweets. Proceeding of PReMI 2017, LNCS, Springer.


SCORE-BASED SENTIMENT ANALYSIS OF BOOK REVIEWS IN HINDI LANGUAGE

Firdous Hussaini¹, S. Padmaja² and S. Sameen Fatima³

¹ Department of CSE, STLW, Osmania University, Hyderabad, India
² Associate Professor, Keshav Memorial Institute of Technology, Hyderabad, India
³ Professor and Head, Department of CSE, UCE Osmania University, Hyderabad, India

ABSTRACT Sentiment analysis has been performed in different languages and in various domains, such as movie reviews, product reviews and tourism reviews. However, not much work has been done on books, despite the high availability of book reviews on Hindi blogs and online forums. In this paper, a score-based sentiment mining system for the Hindi language is discussed, which captures the sentiment behind the words of book review sentences. We conducted three experiments using scores from the HindiSentiWordNet (H-SWN). First, we used the part-of-speech tags of opinion words to extract their potential scores. Then, we focused on word-sense disambiguation (WSD) to increase the accuracy of the system. Finally, the classification results were improved by handling morphological variations. The results were validated against human annotations, achieving an overall accuracy of 86.3%. The work was extended further using the Hindi Subjective Lexicon (HSL). We also developed an annotated corpus of book reviews in Hindi.
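The first experiment, scoring opinion words by part-of-speech tag, can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' system: the lexicon is a hypothetical in-memory dictionary keyed by (word, POS tag) with made-up positive and negative scores, the words are romanized placeholders, and the sentence label is chosen by simply comparing summed scores. Word-sense disambiguation and morphological handling are not covered.

```python
def score_sentence(tagged_tokens, lexicon):
    """Sum positive and negative lexicon scores over (word, pos_tag) tokens.

    Tokens missing from the lexicon contribute nothing.
    """
    pos_total = 0.0
    neg_total = 0.0
    for word, tag in tagged_tokens:
        pos_score, neg_score = lexicon.get((word, tag), (0.0, 0.0))
        pos_total += pos_score
        neg_total += neg_score
    return pos_total, neg_total


def classify(tagged_tokens, lexicon):
    """Label a sentence by comparing its summed positive and negative scores."""
    pos_total, neg_total = score_sentence(tagged_tokens, lexicon)
    if pos_total > neg_total:
        return "positive"
    if neg_total > pos_total:
        return "negative"
    return "neutral"


# Hypothetical lexicon entries: (word, POS tag) -> (positive, negative) score.
lexicon = {
    ("accha", "ADJ"): (0.75, 0.0),
    ("bura", "ADJ"): (0.0, 0.625),
}
label = classify([("kitab", "NOUN"), ("accha", "ADJ")], lexicon)
```

Keying the lexicon on the tag as well as the word keeps a word's score as an adjective from leaking into its use as, say, a noun, which is the motivation for using POS tags in the lookup.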

KEYWORDS
Natural Language Processing, Sentiment Analysis, Lexicon-based, Word-sense Disambiguation

For More Details: http://aircconline.com/ijnlc/V7N5/7518ijnlc11.pdf
Volume Link: http://airccse.org/journal/ijnlc/vol7.html
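The score-based idea in the abstract above can be illustrated with a minimal sketch. It assumes a lexicon keyed by (word, part-of-speech tag) that returns a positive and a negative score, loosely in the spirit of a SentiWordNet-style resource. The lexicon entries, the romanized words, the tag names and the function names below are all invented for illustration; they are not taken from H-SWN or from the paper, and the WSD and morphology steps are not reproduced.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: (word, POS tag) -> (positive score, negative score).
# The entries are invented for illustration and are NOT taken from H-SWN.
LEXICON: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("accha", "ADJ"): (0.75, 0.0),    # romanized "good"
    ("rochak", "ADJ"): (0.625, 0.0),  # romanized "interesting"
    ("bura", "ADJ"): (0.0, 0.75),     # romanized "bad"
    ("ubau", "ADJ"): (0.0, 0.5),      # romanized "boring"
}


def score_sentence(tagged: List[Tuple[str, str]]) -> float:
    """Sum (positive - negative) over the tagged words found in the lexicon."""
    total = 0.0
    for word, pos in tagged:
        positive, negative = LEXICON.get((word.lower(), pos), (0.0, 0.0))
        total += positive - negative
    return total


def classify(tagged: List[Tuple[str, str]], threshold: float = 0.0) -> str:
    """Map the summed score to a polarity label."""
    score = score_sentence(tagged)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    review = [("kitab", "NOUN"), ("accha", "ADJ")]
    print(classify(review))
```

Keying the lexicon on the POS tag as well as the word is what lets the same surface form carry different scores in different grammatical roles, which is the motivation the abstract gives for its first experiment.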



SENTIMENT ANALYSIS OF MIXED CODE FOR THE TRANSLITERATED HINDI AND MARATHI TEXTS
Mohammed Arshad Ansari and Sharvari Govilkar
Department of Information Technology, Pillai College of Engineering, New Panvel, Navi Mumbai, Maharashtra, India 410206

ABSTRACT
The evolution of information technology has led to the collection of large amounts of data, the volume of which has grown to the extent that the data produced in the last two years exceeds all the data previously recorded in human history. This has necessitated the use of machines to understand, interpret and apply data without manual involvement. Many of these texts are available in transliterated code-mixed form, which is very difficult to analyze because of its complexity. Work in this area is progressing at a great pace, and this work hopes to push it further. The designed system automatically classifies transliterated (Romanized) Hindi and Marathi text documents using supervised learning methods (K-Nearest Neighbor (KNN), Naïve Bayes and Support Vector Machine (SVM)) and ontology-based classification, and the results are compared in order to decide which methodology is better suited to handling these documents. As we will see, the plain machine learning algorithms perform just as well as, and in many cases much better than, the more analytical approach.

KEYWORDS
Natural Language Processing, K-NN, K Nearest Neighbor, Support Vector Machine, Naïve Bayes, Ontology, Precision, Recall, F-measure, Confusion Matrix, Random Forest.

For More Details: http://aircconline.com/ijnlc/V7N2/7218ijnlc02.pdf
Volume Link: http://airccse.org/journal/ijnlc/vol7.html
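Of the supervised methods named in the abstract above, Naïve Bayes is the simplest to sketch. The following is a minimal multinomial Naïve Bayes with add-one smoothing over whitespace tokens, written with the standard library only. The romanized training sentences, the labels and the function names are invented for illustration; the paper's actual features, data and preprocessing are not described in the abstract and are not reproduced here.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_nb(docs: List[Tuple[str, str]]):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return class_docs, word_counts, vocab


def predict_nb(model, text: str) -> str:
    """Return the class with the highest log posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_docs):
        logp = math.log(class_docs[label] / n_docs)  # class prior
        total = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training are ignored
            logp += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


if __name__ == "__main__":
    # Invented romanized examples; "pos" and "neg" are placeholder labels.
    training = [
        ("yeh film bahut acchi hai", "pos"),
        ("khana bahut accha tha", "pos"),
        ("yeh film bahut kharab hai", "neg"),
        ("seva bilkul bekar thi", "neg"),
    ]
    model = train_nb(training)
    print(predict_nb(model, "film acchi hai"))
```

Transliterated text has no standard spelling, so a real system would likely need normalization before counting tokens; this sketch treats each spelling variant as a separate word.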


INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP


Rini John and Sharvari Govilkar Department of Computer Engineering of PIIT Mumbai University, New Panvel, India

ABSTRACT
Information retrieval is becoming an integral part of every domain, whether in acquiring data from various sources to form a single unit or in presenting data so that anyone can extract useful information for data analysis, data mining and similar tasks. The field has gained importance in recent years because we are now flooded with many kinds of information from the real world. The growing importance of research data and the retrieval of intelligent data are a main focus for businesses today, so in the coming years this is a field where major work needs to be done. We implement a system for information retrieval from webpages using Natural Language Processing (NLP) and show that it gets better results than the existing system. Webpages are home to a huge amount of information about real-world entities. The system uses Hierarchical Conditional Random Fields (HCRF) and extended Semi-Markov Conditional Random Fields (Semi-CRF) together with Visual Page Segmentation to obtain accurate results. Parallel processing is used to deliver the results within the desired time frame. The system further improves the decision making between HCRF and Semi-CRF by using a bidirectional approach rather than a top-down approach, which enables a better understanding of the content and page structure.

KEYWORDS
Information retrieval, NLP, Entity Extraction, Visual Page Segmentation (VIPS), Semi-CRF (Semi-Markov conditional random fields), HCRF (Hierarchical conditional random field) and Parallel processing.

For More Details: http://aircconline.com/ijnlc/V6N5/6517ijnlc01.pdf
Volume Link: http://airccse.org/journal/ijnlc/vol6.html

REFERENCES


[1] Rini John and Sharvari S Govilkar. Article: Survey of Information Retrieval Techniques for Web using NLP. International Journal of Computer Applications 135(8):23-27, February 2016. Published by Foundation of Computer Science (FCS), NY, USA [2] Rini John and Sharvari S. Govilkar, “ A Novel Approach For Information Retrieval Technique For Web using NLP”, [3] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “VIPS: A vision-based page segmentation algorithm”, Microsoft Tech. Rep., MSR-TR-2003-79, 2003. [4] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, “Simultaneous record detection and attribute labeling in web data extraction”, in Proc. Int. Conf. Knowl. Disc. Data Mining, 2006. [5] S. Sarawagi and W. W. Cohen, “Semi-Markov conditional random fields for information extraction”, in Proc. Conf. Neural Inf. Process. Syst., 2004. [6] Fedor Bakalov, Bahar Sateli, Ren´e Witte, Marie-Jean Meurs, Birgitta K, “Natural Language Processing for Semantic Assistance in Web Portals”, 2012. [7] R. Witte and T. Gitzinger, “Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients,” in 3rd Asian Semantic Web Conference (ASWC 2008), ser. LNCS, vol. 5367. Bangkok, Thailand: Springer, 2008. [8] Paolo Nesi,Gianni Pantaleo and Marco Tenti, “Ge(o)Lo(cator):Geographic Information Extraction from Unstructured Text Data and Web Documents”, in 9th International Workshop on Semantic and Social Media Adaption and Personalization, 2014. [9] Suma Adindla and Udo Kruschwitz, “Combining the Best of Two Worlds: NLP and IR for Intranet Search”, in IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, 2011. [10] Zhong Liu and Ying Wang, “A Novel method of Chinese Web Information Extraction and Applications”, in WASE International Conference on Information Engineering, 2009. [11] Ruiqiang Guo and Fuji Ren, “Towards the Relationship between Semantic Web and NLP”, 2009. 
[12] B.Aysha Banu, Dr.M.Chitra , “A Novel Ensemble Vision Based Deep Web Data Extraction” in IEEE International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), 2012. [13] I.Vijayalakshmi, Sobha Lalitha Devi, “Automatic Information Extraction through Mobile”, in ICCCNT’12, Coimbatore, India 2012.


[14] C. Yang, Y. Cao, Z. Nie, J. Zhou, and J.-R. Wen, “Closing the loop in webpage understanding”, in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2008. [15] Z. Nie, F. Wu, J.-R. Wen, and W.-Y. Ma, “Extracting objects from the web”, in Proc. 22nd Int. Conf. Data Eng., 2006. [16] Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma, “Web object retrieval”, in Proc. 16th Int. Conf. World Wide Web, 2007. [17] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “Block-based web search”, in Proc. Special Interest Group Inf. Retrieval (SIGIR) Conf., 2004. [18] Zaiqing Nie, Ji-Rong Wen, and Wei-Ying Ma, “Statistical Entity Extraction from the Web” IEEE September 2012.


BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATION
Tran Van Nam1, Nguyen Thi Hue2 and Phan Huy Khanh3
1 Department of Computer Engineering, University of Tra Vinh, Vietnam
2 School of Southern Khmer Language, University of Tra Vinh, Vietnam
3 Department of Computer Engineering, Polytechnic University of Da Nang, Vietnam

ABSTRACT
Word segmentation is a basic problem in natural language processing. For languages with a complex writing system, such as the Khmer language spoken in southern Vietnam, the problem is hard to solve and poses significant challenges. Although experts in Vietnam and abroad have researched the problem in depth, no results so far meet the demand; in particular, ambiguity in Khmer language processing has not been treated thoroughly. This paper presents a solution that divides syllables into component clusters using two proposed syllable models, and thereby builds a Khmer syllable database, a resource that is not yet available. The method uses a lexical database, updated from online Khmer dictionaries and some supporting dictionaries, as training data and as a source of complementary linguistic characteristics. Each component cluster is labelled and located by its first and last letter so that a syllable can be identified in its entirety. The approach is workable: the test results achieve high accuracy and eliminate the ambiguity, contributing to solving the word segmentation problem and to efficient Khmer language processing.

KEYWORDS
Ambiguity; component cluster; labelling; lexical database; natural language processing; syllable database; syllable formation; word segmentation.

For More Details: http://aircconline.com/ijnlc/V6N1/6117ijnlc01.pdf
Volume Link: http://airccse.org/journal/ijnlc/vol6.html
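To show how a syllable database can support segmentation, here is a minimal sketch of greedy longest-match lookup against a syllable inventory. This is a generic dictionary-matching technique swapped in for illustration; it is not the paper's two syllable models or its component-cluster labelling, which the abstract does not specify. The inventory uses invented Latin-letter placeholders rather than Khmer script, and the function name is hypothetical.

```python
from typing import List, Set


def segment(text: str, syllables: Set[str]) -> List[str]:
    """Greedy left-to-right longest match; unknown characters become single units."""
    if not syllables:
        return list(text)
    max_len = max(len(s) for s in syllables)
    pieces: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in syllables:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: emit the character on its own and move on.
            pieces.append(text[i])
            i += 1
    return pieces


if __name__ == "__main__":
    # Placeholder inventory, not real Khmer syllables.
    inventory = {"ka", "kam", "pu", "chea"}
    print(segment("kampuchea", inventory))
```

Greedy matching commits to the longest entry at each position, so it cannot resolve the kind of ambiguity the abstract highlights; that limitation is one reason a richer syllable model is of interest.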

REFERENCES


[1] Hắc Sóc Hi, (2012) Ngữ pháp tiếng Khmer, Học viện Giáo dục Dân tộc. [2] Hồ Xuân Mai, (2012) Vài nét về tiếng Khmer Nam Bộ (trường hợp ở tỉnh Sóc Trăng và Trà Vinh). Tạp chí Khoa học Xã hội, Số 12(172). [3] K. Sok, (2004) Khmer Language Grammar, First Edition of Royal Academic of Cambodia. [4] Ly Vattana. (2011) Các tiếp cận tách từ tiếng Khmer dùng trong cơ sở dữ liệu văn bản. Tạp chí Khoa học ĐHQGHN, Khoa học Tự nhiên và Công nghệ 27 pp251-258. [5] Mai Ngọc Chừ, Vũ Đức Nghiệu, Hoàng Trọng Phiến. (1997) Cơ sở ngôn ngữ học và tiếng Việt. NXB Giáo dục. [6] Narin Bi, Nguonly Taing. (2014) Khmer word segmentation based on bi-directional maximal matching for plaintext and microsoft word document. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2014, Chiang Mai, Thailand, IEEE, December 9-12, pages 1–9. [7] Sopheap Seng, Sethserey Sam, Viet-Bac Le, Brigitte Bigi, Laurent Besacier. (2008) Which units for acoustic and language modeling for Khmer automatic speech recognition. International Workshop on Spoken Languages Technologies for UnderRessourced Languages (SLTU'08). [8] Villavon Souksan, Phan Huy Khánh. (2013. Hội thảo quốc gia lần thứ XVI, 1415/11/2013. [9] Villavon Souksan, Phan Huy Khánh. (2013) Khử bỏ nhập nhằng trong bài toán tách từ tiếng Lào. Tạp chí Khoa học & Công nghệ ĐH Đà Nẵng no1 (62), pp 113-119. [10] Vichet Chea, Ye Kyaw Thu, Chenchen Ding, Masao Utiyama, Andrew Finch, Eiichiro Sumita. Khmer Word (2015) Segmentation Using Conditional Random Fields. In Khmer Natural Language Processing, December 4. [11] Chea Sok Huor, Top Rithy, Ros Pich Hemy, Vann Navy. Word Bigram Vs Orthographic Syllable Bigram in Khmer Word Segmentation. pp 249-253 [12] C. Chan P. Wong (1996). Chinese Word Segmentation Based on Maximum Matching and Word Binding Force. Proceedings of Coling 96, p200-203 [13] Paisarn Charoenpornsawat (1998), Các phương pháp tách từ tiếng Thái, trường Đại học Chulalongkorn, Thái Land. 
[14] Dinh Dien, Hoang Kiem, Nguyen Van Toan (2001), Vietnamese Word Segmentation, pp. 749 -756. The 6th Natural Language Processing Pacific Rim Symposium, Tokyo, Japan. [15] Dien Dinh (2005), “Building an Annotated English-Vietnamese parallel Corpus”, MKS: A Journal of Southeast Asian Linguistics and Languages, Vol.35, pp.21-36.


[16] Đinh Điền (3/2005), “Xây dựng và khai thác ngữ liệu song ngữ Anh-Việt điện tử“, luận án tiến sĩ ngôn ngữ học so sánh, ĐH Khoa học Xã hội & Nhân văn, ĐHQG Tp.HCM. [17] Khưn Sóc (2007), Ngữ pháp tiếng Khmer, NXB Viện Phật học Campuchia. [18] Viện nghiên cứu và đào tạo phía Nam (1998), Ngữ pháp tiếng Khmer, NXB Văn hóa dân tộc. [19] Sang Sết (2016), “Cách phiên âm chữ Khmer theo Ngữ âm chữ Việt va Cách phiên âm chữ Viết theo Ngữ âm chữ Khmer”, NXB Văn hóa Dân tộc. [20] Long Siêm (2010), Vấn đề từ vưng tiếng Khmer, NXB Viện ngôn ngữ học Cam Puchia; [21] Chan Som Nop (2010), Từ và Các phương thức cấu tạo từ trong tiếng Khmer, NXB Campuchia. [22] Bộ gõ Khmer Unicode. Web: http://dongthapit.com/tieng-khmer-campuchia/bo-gokhmer-unicodecampuchia/ [23] Từ điển trực tuyến Việt-Khmer. Webs: https://khmermaster.com/vn/tudien-viet-khmer-online/

https://vi.glosbe.com/vi/km/,

[24] Dạy học tiếng Khmer. Web: http://mmhomepage.com/burmese/Easy-Khmer-TiengCampuchia/


A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
Rini John and Sharvari S. Govilkar
Department of Computer Engineering of PIIT Mumbai University, New Panvel, India

ABSTRACT
Webpages are loaded with vast and varied information about real-world entities. Information retrieval from the Web is of great significance today for getting accurate queried data within the desired time frame, which is becoming more difficult with each passing day. There is a need to develop a system that solves entity extraction problems on the web better than the traditional system. A clear understanding of natural language sentence processing in the web page, along with the structure of the web page, is critical to getting the correct information quickly. Here we propose an approach to information retrieval for the Web using NLP, in which Hierarchical Conditional Random Fields (HCRF) and extended Semi-Markov Conditional Random Fields (Semi-CRF) are used together with Visual Page Segmentation to obtain accurate results. Parallel processing is used to deliver the results within the desired time frame. The approach further improves the decision making between HCRF and Semi-CRF by using a bidirectional approach rather than a top-down approach, which enables a better understanding of the content and page structure.

KEYWORDS
Information retrieval, NLP, Entity Extraction, Visual Page Segmentation (VIPS), Semi-CRF (Semi-Markov conditional random fields), HCRF (Hierarchical conditional random field) and Parallel processing.

For More Details: http://aircconline.com/ijnlc/V6N1/6117ijnlc04.pdf
Volume Link: http://airccse.org/journal/ijnlc/vol6.html



