R K Das NLP 2015 Article by Indira Group of Institutes

NLP and its prospect in India
Ranjan Das
dasranjan257@gmail.com

Introduction

Natural language processing (NLP) is a field of study at the intersection of artificial intelligence, computer science, and linguistics, and is closely related to human–computer interaction. The goal of NLP is to build computational models of natural language for its analysis and generation.

NLP research has two main motivations: engineering and scientific. The engineering motivation is to build intelligent computer systems that act as a man-machine interface, so that we can communicate with computers through natural language. Here we design, implement, and test systems that process natural languages for practical applications. The scientific motivation is to gain better insight into the human cognitive and linguistic faculties involved in using language, that is, to better understand how humans communicate using natural language. Here we study and analyze various aspects of natural language and try to formulate a theory of it, taking clues from linguistics, psychology, logic and philosophy. We then identify the computational machinery an agent needs in order to exhibit various forms of linguistic behavior. Because language exists in a living world and expresses the meanings, intentions, and so on of speakers and listeners, the tools of NLP include formalisms for representing world knowledge and reasoning mechanisms, along with grammar formalisms, algorithms, data structures, etc.

The main challenges in NLP are ‘natural language understanding’ and ‘natural language generation’. The ultimate goal of NLP research is to parse and understand language (interpretation), process it according to the instruction, and then generate output in a natural language. The hardest part is enabling computers to derive meaning from natural language input.

The phrase "natural language processing" may or may not be taken as synonymous with "natural language understanding." "Processing" most naturally covers both interpretation and generation, while "understanding" seems better suited to the interpretation part alone. To say that a system is capable of natural language understanding implies only that it can interpret natural language; to say that it can process natural language allows for both understanding (interpretation) and generation (production). Some authors, however, use "natural language understanding" as a synonym for "natural language processing," and on that usage it includes both interpretation and generation.

Major task-areas under NLP

Some of the most commonly researched tasks in NLP are listed below, grouped by sub-field of study (well-defined problem setting) and by application. Some of these tasks have direct real-world applications, while others more commonly serve as sub-tasks that help solve larger tasks.

1.1 Information retrieval (IR) and Information extraction (IE)
IR is concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but it relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP. IE is concerned, in general, with extracting semantic information from text. It covers tasks such as named entity recognition, co-reference resolution and relationship extraction.

1.2 Automatic summarization
Automatic summarization produces a readable summary of a chunk of text. It is often used to summarize text of a known type, such as articles in the financial section of a newspaper. The three main steps are: extract important sentences (compute document keywords and score the document's sentences with respect to these keywords); cohesion check (spot anaphoric references and modify the text accordingly); and balance and coverage (modify the summary so that it has an appropriate text structure).

1.3 Relationship extraction
Given a chunk of text, the task is to identify the relationships among named entities (e.g. who is the brother of whom).
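The first summarization step described in 1.2 (compute document keywords, then score sentences against them) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the stopword list, the frequency-based choice of keywords, the naive sentence splitter and the function names `tokenize` and `summarize` are all invented here, and the cohesion and balance steps are not attempted.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "as",
             "for", "on", "it", "its", "was", "were", "by", "with", "this", "that"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, num_sentences=1, num_keywords=5):
    """Return the top-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Document keywords: the most frequent non-stopword tokens (an assumption).
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    keywords = {word for word, _ in counts.most_common(num_keywords)}
    # Score each sentence by the number of keyword occurrences it contains.
    scored = [(sum(1 for t in tokenize(s) if t in keywords), i)
              for i, s in enumerate(sentences)]
    # Highest score first; ties go to the earlier sentence.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    # Restore document order so the summary reads naturally.
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]


if __name__ == "__main__":
    sample = ("Markets rose sharply today. "
              "Markets in Asia rose as bank shares rose. "
              "The weather was mild.")
    print(summarize(sample, num_sentences=1, num_keywords=2))
```

Scoring by raw keyword counts favours long sentences; a practical system would probably normalize by sentence length or use a weighting such as TF-IDF.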

1.4 Sentiment analysis
Also known as opinion mining, sentiment analysis aims to identify and extract subjective information from source materials. It helps determine the polarity (positive, negative or neutral) of a given document or set of documents, often online reviews. It is very useful for identifying trends in public opinion on social media for marketing purposes.

1.5 Question answering
Given a human-language question, the system aims to supply users with 'just the right information' instead of merely providing a list of hits. Typical questions have a specific right answer (such as "What is the capital of India?"), but open-ended questions are sometimes considered as well (such as "What is the meaning of life?"). START is the world's first Web-based question answering system, developed by Boris Katz and his associates of the InfoLab Group at the MIT Computer Science and Artificial Intelligence Laboratory.

1.6 Machine translation
A Machine Translation (MT) system is an automated system that analyzes text in a Source Language (SL) and produces “equivalent” text in a Target Language (TL), ideally without human intervention. MT is an area of applied research under NLP that draws ideas and techniques from computer science, Artificial Intelligence (AI), translation theory (Linguistics), and statistics. It is one of the most difficult problems in the field and is regarded as "AI-complete". Broadly, there are two approaches to machine translation: the rule based approach and the corpus based approach. Rule based MT techniques require large amounts of linguistic knowledge to be encoded as rules (grammar) and a lexicon (dictionary). The corpus based approach uses a parallel corpus of already translated texts as training data to guide the translation process. The corpus based approach in turn has two variants: (1) Example based machine translation (EBMT), and (2) Statistical Machine Translation (SMT). Statistical MT provides a way of automatically finding correlations between the features of two languages from a parallel corpus, overcoming to some extent the knowledge bottleneck in MT.

1.7 Word sense disambiguation
Many words have more than one meaning, and we have to select the meaning that makes the most sense in the given context. A word sense disambiguation (WSD) module performs this task.

1.8 Co-reference resolution
Co-reference resolution is a well-studied problem in discourse. To derive the correct interpretation of a text, or even to estimate the relative importance of the various subjects mentioned, pronouns and other referring expressions must be connected to the right individuals or objects. Given a sentence or a larger chunk of text, the task is to determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, concerned with matching pronouns to the nouns or names they refer to. The more general task of co-reference resolution also includes identifying so-called "bridging relationships" involving referring expressions.

1.9 Speech Technology
Speech technology, or speech processing, includes Speech Recognition and Text-to-Speech Synthesis (TTS) as its subcomponents. An Automatic Speech Recognition (ASR) system takes a sound clip of one or more people speaking as input and transcribes it into written text. This is one of the extremely difficult problems of AI, colloquially termed "AI-complete". Text-to-Speech Synthesis (TTS) is the opposite process: the system converts the given text into the corresponding spoken waveform.

1.10 Optical character recognition (OCR)
An OCR system takes an image of printed text and converts it into the corresponding text in an editable form.
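The word sense disambiguation task of 1.7 can be illustrated with a simplified Lesk-style overlap heuristic: choose the sense whose dictionary gloss shares the most words with the surrounding context. The article does not prescribe any particular method, so this is only one possible sketch. The sense inventory `SENSES`, its glosses, the stopword list and the function names are hypothetical and were written for this illustration.

```python
import re

# Hypothetical mini sense inventory; the glosses are made up for this example.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river side": "sloping land beside a river or other body of water",
    }
}

# Small illustrative stopword list (an assumption).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on",
             "at", "he", "she", "i", "his", "her"}


def content_words(text):
    """Return the set of lowercase alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(word, context, senses=SENSES):
    """Pick the sense whose gloss overlaps most with the context words."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context_words & content_words(gloss))
        # Strict '>' means the first-listed sense wins when overlaps are tied.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


if __name__ == "__main__":
    print(disambiguate("bank", "She deposits money at the bank every month"))
    print(disambiguate("bank", "They sat on the bank of the river and watched the water"))
```

Exact word overlap is brittle ("deposit" would not match "deposits"), which is one reason such heuristics are usually combined with stemming or replaced by statistical models.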

NLP Activity in India

India is a multilingual country, with 22 official languages and 12 scripts. Languages from four language families are spoken in the country. This linguistic richness and diversity poses a considerable barrier to people-to-people communication in the socio-economic arena. Although English serves as the national link language, its use is restricted to a small segment of the population. In this context, language technology has great prospects in India for breaking the communication barrier. It can play a role in national development by helping the common man obtain the benefits of information technology through software tools and human-machine interface systems that are available in people’s own languages. To enable wide proliferation of information and communication technology in Indian languages, tools, products and resources should be freely available to the general public.

The Department of Electronics and Information Technology (DeitY), MC&IT, Government of India has taken a major initiative, the ‘National Roll-Out Plan’, for wider proliferation of Indian language software tools and fonts. Under its Technology Development for Indian Languages (TDIL) Programme, the Department aims to develop information processing tools and techniques that facilitate human-machine interaction without a language barrier, to create and provide access to multilingual knowledge resources, and to integrate them into innovative user products and services. Initiatives have been taken for long-term research on Machine Translation Systems, Optical Character Recognition, On-line Handwriting Recognition Systems, Cross-lingual Information Access and Speech Processing in Indian languages. [http://tdil-dc.in/] MC&IT and the Ministry of HRD fund a large amount of R&D in NLP in India, both directly to institutes and through consortia. Under Machine Translation Systems there are English to Indian Language Machine Translation Systems (Anuvadaksh and AnglaBharati) and an Indian Language to Indian Language Machine Translation System (Sampark).

Some of the major consortium projects are:
• Cross Lingual Information Access (CLIA): It serves user queries in Bengali, Hindi, Marathi, Punjabi, Tamil and Telugu by retrieving documents in Hindi and English and displaying the content in the query language.
• Indian Language to Indian Language Machine Translation (ILILMT): A consortium of 12 academic institutions building Machine Translation Systems for 9 Indian language pairs, involving languages such as Hindi, Bengali, Marathi, Punjabi, Tamil, Telugu and Urdu. The systems are bi-directional, giving 18 systems in all.
• English to Indian Language Machine Translation (EILMT) (Anuvadaksh): Machine Translation from English to Bengali, Hindi, Marathi, Oriya, Tamil, Urdu, Bodo and Gujarati in the domains of Tourism and Health.
• IndoWordnet
• AnglaBharati Machine Translation Systems: English to Assamese, Bangla, Hindi, Malayalam, Nepali, Punjabi, Telugu and Urdu. It is a rule based machine translation system designed by IIT Kanpur for translating English text into Indian languages using a pseudo-interlingua approach. It analyses the English only once and creates an intermediate structure, with most of the disambiguation already performed, from which the Indian language output is generated. This approach has been adapted by CDAC centres, with the support of TDIL, DeitY, to create eight MT systems.
• Web OCR (Indian Language OCR): The system has been developed for Bangla, Devanagari, Gurumukhi, Kannada, Malayalam, Tamil, Telugu, Urdu and Assamese, and is expected to become available for the Gujarati, Oriya, Tibetan and Manipuri scripts. [http://tdil-dc.in/]
• Sandhan: A monolingual search system for the tourism domain in five Indian languages, viz. Bengali, Hindi, Marathi, Tamil and Telugu. [http://tdil-dc.in/]

The Linguistic Data Consortium for Indian Languages (LDC-IL) at the Central Institute of Indian Languages (CIIL), Mysore is building resources in Indian languages that will be useful to researchers and developers worldwide in corpus linguistics and language technology related to Indian languages.

Some of the leading academic institutes and organisations in India involved in NLP activities are IIT Bombay, IIT Kanpur, IIT Kharagpur, IIIT Hyderabad, IISc Bangalore, CDAC Pune, CDAC Noida, Jadavpur University, IIIT Allahabad, Indian Statistical Institute, Guwahati University, Manipur University, Assam University, Dravidian University, Goa University, Amrita University, University of Hyderabad, Thapar Institute and Punjabi University. The MANTRA system of CDAC Pune (AAI Group) is used by the Rajya Sabha secretariat to produce Hindi proceedings (texts) from English. Many industrial giants in India, such as IBM Research and Microsoft, are also working in the area of NLP. Their interests cover a wide range of topics, from Machine Translation to Information Extraction to Question Answering.

The future of NLP in India

NLP research is gradually bringing more and more semantics into its purview. It is shifting from lexical semantics to compositional semantics and, further on, to discourse and narrative understanding. However, human-level natural language processing is an AI-complete problem. That is, it is equivalent to solving the central artificial intelligence problem: making computers as intelligent as human beings. Still, there is a lot of scope to move forward, and many new applications are coming up. NLP components are increasingly applied in data analytics for business and social media. Semantic search and sentiment analysis have proved very useful in taking crucial business decisions. Social media analytics can understand, respond to and extract valuable insights from social interaction, and then infuse that insight across brand marketing, public and community relations, prediction, classification, etc. Even political opinions and prospects can be inferred from NLP-enriched data analysis. In this context, the commercial value of the field is steadily increasing.

India's multilingual environment not only poses challenges for NLP communities in breaking the language barrier, but also offers a lot of scope for development. Knowledge dissemination is an essential part of a digitally empowered India. For that, we have to attain the technical ability to give people in the lowest strata of society access to information and means, and that access has to be in people's local languages. Thus NLP activities in India have a long way to go.