CLEAR December 2014
Volume-3 Issue-4

CLEAR Journal (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj
Professor and Head
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Dr. Ajeesh Ramanujan
Raseek C
Nisha M
Anagha M

Cover page and Layout
Sarath K S
Manu.V.Nair

Contents
Editorial
News & Updates
Introduction to Ontology Concepts and Terminology (Sreejith C)
Natural Language Generation: Scope, Application and Approaches (Manu Madhavan)
Domain Specific Sentence Level Mood Extraction from Malayalam Text (Revathy P)
Natural Language Generation in Statistical Dialogue Systems (Pelja Paul N)
Cloud Computing and Big Data Analytics with Text Analytics Processing (Archana S M)
CLEAR March 2015 Invitation
Last word
Dear Readers! Greetings! I am happy to reach out to you with the last edition of CLEAR for the year 2014, featuring diverse topics such as ontology basics and terminology, Natural Language Generation, text analytics, and statistical dialogue systems. It also features an article on mood extraction from Malayalam text. More importantly, it gives a brief account of the National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR 2014) that was organized by the CSE department. We plan to bring out the next issue of CLEAR as a special issue featuring selected papers from this conference. Hope you enjoy this edition too. Send us your comments and valuable feedback. The entire CLEAR team wishes the readers a joyous and successful 2015!
Warm Regards, P.C. Reghu Raj (Chief Editor)
National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR, 2014)
The National Conference on Computational Linguistics and Information Retrieval (NC-CLAIR, 2014) was successfully organized during 29-31 December 2014. It is the first national-level conference organized by the Department of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, and incidentally this happens to be the first conference at any level conducted by the college. The conference brought together researchers working in the areas of Indian language computing, Machine Translation, Speech Processing, Information Retrieval, Big Data Analysis, Machine Learning and other related fields. The conference was sponsored by TEQIP-II. There was active participation from both academia and industry.
Placements

• Kavitha Raju, Rekha Raj C T, Sreerakha T V and Vidya P V of M. Tech Computational Linguistics, 2013-15 batch got placement for the post of Associate System Engineer at IBM.

• Abitha Anto of M. Tech Computational Linguistics, 2012-14 batch got placement for the post of Trainee Software Engineer at Experion Technologies, Technopark, Thiruvananthapuram.
Internship

• Amal Babu, Kavitha Raju, Manu.V.Nair, Anagha M, Sarath K S and Sreetha K of M. Tech Computational Linguistics, 2013-15 batch got internships at EY, Thiruvananthapuram.

Publications

• Sarath K S, Manu.V.Nair, Rajeev R R, P.C. Reghu Raj, Dialect Resolution: A Hybrid Approach, accepted for the iDravidian' 2014 Symposium.

• Anagha M, Sreetha K, Raveena R Kumar, Rajeev R R, P.C. Reghu Raj, Lexical Resource Based Hybrid Approach for Cross Domain Sentiment Analysis in Malayalam, accepted for the iDravidian' 2014 Symposium.

• Amal Babu, Alen Jacob, Rajeev R R, P.C. Reghu Raj, TnT Tagger for Malayalam with Fuzzy Rule Based Learning, accepted for the iDravidian' 2014 Symposium.

• Kavitha Raju, Sreerekha T.V, Vidya P.V, Rajeev R. R, P.C. Reghu Raj, Tamil to Malayalam Transliteration, accepted for the IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (IEEE SPICES)-2015.

• Anagha M, Sreetha K, Raveena R Kumar, P.C. Reghu Raj, Fuzzy Logic Based Hybrid Approach for Sentence Level Sentiment Analysis of Malayalam Movie Reviews, accepted for the IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (IEEE SPICES)-2015.

• Alen Jacob, Amal Babu, P.C. Reghu Raj, TnT Tagger with Fuzzy Rule Based Learning, accepted for the IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (IEEE SPICES)-2015.
Introduction to Ontology Concepts and Terminology
Sreejith C
Data Science Engineer, DeCirc, Bangalore
sreejith@decirc.com
This article describes ontology concepts and the techniques used to process them. The article is arranged as follows: Section 1 introduces what an ontology is; Section 2 explains RDF, RDFS and OWL, the means of representing knowledge; the basics of building a base ontology are explained in Section 3; and Section 4 explains how to populate the ontology with instances.
I. Ontology

An ontology is a common, shared and formal description of important concepts in a specific domain. Ontologies are able to operate as repositories to organize information for specific communities. They can be used as a tool for knowledge acquisition. Ontologies allow users to reuse knowledge in new systems. They can form a base to construct knowledge representation languages. There are many definitions of ontology, and perhaps each of them views ontology from a different perspective. In the world of the Semantic Web, let us use the operational definition of ontology from W3C's OWL Requirements Documents: an ontology defines the terms used to describe and represent an area of knowledge.

There are several aspects of this definition that need to be clarified. First, this definition states that ontology is used to describe and represent an area of knowledge. In other words, ontology is domain specific; it is not there to represent all knowledge, but an area of knowledge. A domain is simply a specific subject area or sphere of knowledge, such as photography, medicine, real estate, education, etc. Second, ontology contains terms and the relationships among these terms. Terms are often called classes, or concepts; these words are interchangeable. The relationships between these classes can be expressed by using a hierarchical structure: superclasses represent higher-level concepts and subclasses represent finer concepts, and the finer concepts have all the attributes and features that the higher concepts have. Third, besides the aforementioned relationships among the classes, there is another level of relationship expressed by using a special group of terms:
properties. These property terms describe various features and attributes of the concepts, and they can also be used to associate different classes together. Therefore, the relationships among classes are not only superclass or subclass relationships, but also relationships expressed in terms of properties.

There are several reasons to build an ontology:

• To share a common understanding of the structure of information between people or software agents.
• To enable reuse of domain knowledge.
• To make domain assumptions explicit.
• To separate domain knowledge from operational knowledge.
• To analyse domain knowledge.

Ontologies provide a common vocabulary of an area and define, with different levels of formality, the meaning of the terms and the relationships between them. RDF, RDFS and OWL are means to express increasingly complex information or knowledge.

II. Representing Ontology: RDF, RDFS, OWL

RDF is the basic building block for supporting the Semantic Web. RDF is to the Semantic Web what HTML has been to the Web. RDF is a language recommended by W3C, and it is all about metadata. RDF is capable of describing any fact (resource) independent of any domain. RDF provides a basis for coding, exchanging, and re-using structured metadata. RDF is structured; i.e., it is machine-understandable. Machines can do useful operations with the knowledge expressed in RDF. RDF allows interoperability among applications exchanging machine-understandable information on the Web.
A. Basic Elements of RDF

RESOURCE: The first key element is the resource. RDF is a standard for metadata; i.e., it offers a standard way of specifying data about something. This something can be anything, and in the RDF world we call this something a resource. A resource is identified by a uniform resource identifier (URI), and this URI is used as the name of the resource. Here is an example. The following URI uniquely identifies a resource:

http://www.simplegroups.in/photography/SLR#Nikon-D70

1. This resource is a real-world object, i.e., a Nikon D70 camera; it is a single lens reflex (SLR) camera.
2. The URL "http://www.simplegroups.in/photography/SLR" is used as the first part of the URI. More precisely, it is used as a namespace to guarantee that the underlying resource is uniquely identified; this URL may or may not exist.
3. At the end of the namespace, "#" is used as the fragment identifier symbol to separate the namespace from the local resource name, i.e., Nikon-D70.
4. Now the namespace + "#" + local resource name gives us the final URI for the resource; it is globally named.

PROPERTY: A property is a resource that has a name and can be used as a property; i.e., it can be used to describe some specific aspect, characteristic, attribute, or relation of the given resource. The following is an example of a property:
http://www.simplegroups.in/photography/SLR#weight

This property describes the weight of the D70 camera.

STATEMENT: An RDF statement is used to describe properties of resources. It has the following format:

resource (subject) + property (predicate) + property value (object)

The property value can be a string literal or a resource. Therefore, in general, an RDF statement indicates that a resource (the subject) is linked to another resource (the object) via an arc labelled by a relation (the predicate). It can be interpreted as follows: <subject> has a property <predicate>, whose value is <object>. For example:

http://www.simplegroups.in/photography/SLR#Nikon-D70 has a property http://www.simplegroups.in/photography/SLR#weight whose value is 1.4 lb.
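To see these elements in code, the statement above can also be built programmatically. The following is a minimal sketch using Apache Jena (the toolkit used later in this article); the subject, property and value come from the example, while the class and variable names are illustrative, and current Jena 3.x package names are used (the 2.x releases contemporary with this article used the com.hp.hpl.jena prefix instead):

import org.apache.jena.rdf.model.*;

public class WeightStatement {
    public static void main(String[] args) {
        String ns = "http://www.simplegroups.in/photography/SLR#";
        Model model = ModelFactory.createDefaultModel();
        // Subject: the Nikon-D70 resource, identified by its URI
        Resource nikonD70 = model.createResource(ns + "Nikon-D70");
        // Predicate: the weight property
        Property weight = model.createProperty(ns, "weight");
        // Object: a literal value; subject + predicate + object form one RDF statement
        nikonD70.addProperty(weight, "1.4 lb");
        // Serialize the model; the single statement is printed as a triple
        model.write(System.out, "TURTLE");
    }
}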
B. RDFS

RDFS is written in RDF. RDFS stands for RDF Schema. RDFS is a language one can use to create a vocabulary for describing classes, subclasses, and properties of RDF resources; it is a recommendation from W3C. The RDFS language also associates the properties with the classes it defines. RDFS can add semantics to RDF predicates and resources: it defines the meaning of a given term by specifying its properties and what kinds of objects can be the values of these properties.

Figure 1: A simple camera vocabulary.

RDFS is all about vocabulary. For example, Fig. 1 shows a simple vocabulary. This simple vocabulary tells us the following facts: We have a resource called Camera, and Digital and Film are its two sub-resources. Also, resource Digital has two sub-resources, SLR and Point-and-Shoot. Resource SLR has a property called has-spec, whose value is the resource called Specifications. Also, SLR has another property called owned-by, whose value is the resource Photographer, which is a sub-resource of Person.
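Part of the vocabulary in Fig. 1 can be written down directly with the RDFS terms. A minimal Jena sketch, reusing the namespace from the earlier example (only the Camera/Digital/SLR branch and the has-spec property are shown; the remaining resources of the figure would be added the same way):

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class CameraVocabulary {
    public static void main(String[] args) {
        String ns = "http://www.simplegroups.in/photography/SLR#";
        Model model = ModelFactory.createDefaultModel();

        // Classes of the camera vocabulary
        Resource camera = model.createResource(ns + "Camera", RDFS.Class);
        Resource digital = model.createResource(ns + "Digital", RDFS.Class);
        Resource slr = model.createResource(ns + "SLR", RDFS.Class);
        Resource specifications = model.createResource(ns + "Specifications", RDFS.Class);

        // Class hierarchy: Digital is a sub-resource of Camera, SLR of Digital
        digital.addProperty(RDFS.subClassOf, camera);
        slr.addProperty(RDFS.subClassOf, digital);

        // The has-spec property links SLR (domain) to Specifications (range)
        Property hasSpec = model.createProperty(ns, "has-spec");
        hasSpec.addProperty(RDF.type, RDF.Property);
        hasSpec.addProperty(RDFS.domain, slr);
        hasSpec.addProperty(RDFS.range, specifications);

        model.write(System.out, "TURTLE");
    }
}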
C. OWL

OWL (Web Ontology Language) is the latest recommendation of W3C and is probably the most popular language for creating ontologies today. OWL = RDF Schema + new constructs for expressiveness. With RDF Schema it is possible to define only relations between the hierarchy of classes and properties, or to define the domain and range of these properties. The scientific community needed a language that could be used for more complex ontologies, and therefore they started to work on a richer language that would later be released as the OWL language. OWL is built upon RDF, and therefore the two languages share the same syntax. An OWL document can be seen as an RDF document with some specific OWL constructs. The expressiveness is achieved by adding more constraints on properties. OWL comes in three different forms: OWL Lite, OWL DL and OWL Full.
III. Building the Base Ontology
Building the base ontology requires a complete understanding of the domain for which the ontology is to be built. The process can be explained with an example, an extract from the Wikipedia article about Sachin Tendulkar.

Example 1. Tendulkar was born at Nirmal Nursing Home on 24 April 1973. His father Ramesh Tendulkar was a reputed Marathi novelist and his mother Rajni worked in the insurance industry. On 14 November 1987, Tendulkar was selected to represent Mumbai in the Ranji Trophy. A year later, on 11 December 1988, aged just 15 years and 232 days, Tendulkar made his debut for Mumbai against Gujarat at home and scored 100 not out in that match, making him the youngest Indian to score a century on first-class debut.

In this example, birth, debut, and match are three events. The birth event has person, date and location as entities, and they are related to the event by the relations hasPerson, hasDateOfBirth and hasLocation respectively. The ontology is represented in the Web Ontology Language (OWL), a World Wide Web Consortium (W3C) standard for Semantic Web representation. In OWL, the events and entities are represented as classes and the relations are called properties. OWL provides many features to add semantics to the ontology. For example, a property can be defined as the inverse of another property. In our running example, we can define a property hasPerson, which relates an event to a person; then a property hasEvent can be defined as the inverse of hasPerson, which relates the person back to the event class. The classes and properties can be arranged in a hierarchy by adding sub-class and super-class relations. When a reasoner is enabled, it can infer the implicit relations existing between the entities. A sample base ontology structure is shown in Figure 2.
Figure 2: Base Ontology.
While implementing this base structure using Jena, the ontology graph is referred to as a model object. The model provides methods that create ontology classes (OWL classes) and properties and add them to the ontology. A sample code snippet is given below (the calls follow the Apache Jena ontology API; the Jena 2.x releases that were current when this article was written used the com.hp.hpl.jena package prefix instead of org.apache.jena):

import org.apache.jena.ontology.*;
import org.apache.jena.rdf.model.ModelFactory;

OntModel model = ModelFactory.createOntologyModel();
String ns = "http://example.org/base#";   // illustrative namespace for the ontology terms
OntClass person = model.createClass(ns + "PERSON");
OntClass event = model.createClass(ns + "EVENT");
OntClass birth = model.createClass(ns + "BIRTH");
OntClass date = model.createClass(ns + "DATE");
event.addSubClass(birth);
ObjectProperty hasBirthDate = model.createObjectProperty(ns + "hasBirthDate");
hasBirthDate.addDomain(event);
hasBirthDate.addRange(date);

The corresponding OWL representation is given in Figure 3 and Figure 4.

Figure 3: OWL Representation of Base Ontology.

Figure 4: OWL Representation of Properties.

IV. Populating the Base Ontology

When the information extraction is completed, the next step is storing the information in the knowledge base. The information is added as instances of the ontology classes. Jena provides methods to add individuals to the ontology. The labels given to the chunks help to identify the base class, and the value is added as an individual of that class.

For example, consider the first line in our running example, "Tendulkar was born at Nirmal Nursing Home on 24 April 1973". The IE step will identify this as a birth event, with Nirmal Nursing Home as the location and 24 April 1973 as the date. Then the class birth is selected from the ontology, and the properties hasPerson, hasLocation and hasBirthDate will be populated with Tendulkar, Nirmal Nursing Home and 24 April 1973 respectively. So, after populating the base ontology, the resulting knowledge base will contain triples like:

< Tendulkar was born at Nirmal Nursing Home on 24 April 1973 hasBirthDate 24 April 1973 >

< Tendulkar was born at Nirmal Nursing Home on 24 April 1973 hasPerson Tendulkar >

< Tendulkar was born at Nirmal Nursing Home on 24 April 1973 hasLocation Nirmal Nursing Home >
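A minimal Jena sketch of this population step, continuing the snippet from Section III (model, ns and the classes and properties created there are assumed to be in scope; the individual URIs are illustrative):

// Individuals for the entities identified by the IE step
OntClass location = model.createClass(ns + "LOCATION");
Individual tendulkar = person.createIndividual(ns + "Tendulkar");
Individual nursingHome = location.createIndividual(ns + "Nirmal_Nursing_Home");
Individual birthDate = date.createIndividual(ns + "1973-04-24");

// The extracted birth event becomes an individual of the BIRTH class
Individual birthEvent = birth.createIndividual(ns + "Tendulkar_birth");

// Relate the event to its entities through the ontology properties
ObjectProperty hasPerson = model.createObjectProperty(ns + "hasPerson");
ObjectProperty hasLocation = model.createObjectProperty(ns + "hasLocation");
birthEvent.addProperty(hasPerson, tendulkar);
birthEvent.addProperty(hasLocation, nursingHome);
birthEvent.addProperty(hasBirthDate, birthDate);

Serializing the model, for example with model.write(System.out, "RDF/XML"), then yields the kind of populated OWL representation shown in Figure 5.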
The OWL representation of the populated ontology is shown in Figure 5.

Figure 5: OWL Representation of Populated Ontology.

This is an overview of the various concepts and terminology related to ontologies. You can read more about Apache Jena at http://jena.apache.org/tutorials/.
IBM Cognitive Computers Detect Skin Cancer Quickly with Visual Machine Learning

International Business Machines (IBM) has announced their partnership with Sloan Kettering Cancer Centre on cognitive computer technology to analyse and detect dermatological images of skin lesions, so as to help identify the various cancerous disease states as many as 97 percent of the time. IBM technology, after scanning around 3000 images, could detect melanoma, the most deadly form of skin cancer, with an accuracy of about 95 percent, compared to today's methods which can detect with an accuracy of 75 percent to 84 percent, as per a PC World report. Computers with cognitive visual capabilities are being developed at IBM, which will be trained to identify specific features and patterns by gaining experience and knowledge through analysis of large collections of educational research data. IBM researcher Noel Codella said that the technology has proven to be adept at analysing a large number of images quickly and with a more detailed level of measurement than any doctor could manage manually. The system evaluates the images in less than a second. Google too is working on a project to detect cancer cells in a human body.

Visit: http://www.iamwire.com/2014/12/ibm-cognitive-computers-detect-skincancer-quickly-visual-machine-learning/106661
Natural Language Generation: Scope, Application and Approaches
Manu Madhavan
Asst. Professor, Sreepathy Institute of Management and Technology
mmnamboodiry@gmail.com

Natural Language Generation is a subfield of computational linguistics that is concerned with computer systems which can produce understandable texts in some human language. The system takes a machine-understandable logical form as input and produces syntactically and semantically valid sentences in natural language. The different stages of NLG include content selection, lexical selection, sentence structuring and discourse planning. The applications of NLG include text summarization, machine translation and question answering. The effectiveness of NLG depends on the efficiency of the internal knowledge representation. An ontology-based knowledge representation will improve the output text quality.
I. Introduction

Natural Language Generation (NLG) is an NLP task of generating sentences from world knowledge and information provided in a logical representation. NLG is a fascinating area of research and an emerging technology with many real-world applications. A sentence is an abstract notion of an idea. The capability of a system to generate a meaningful sentence indicates its intelligence to generate an idea. Natural Language Generation is the inverse of natural language understanding (NLU): NLG maps from meaning to text, while NLU maps from text to meaning. The input to an NLG system varies widely from one application to another, whereas in NLU all the texts are governed by a relatively common grammar. NLU has been characterized by ambiguity, underspecification and ill-formed input. On the other hand, the non-linguistic input to the NLG system is relatively unambiguous, well-specified and well-formed [9].

Most NLG systems follow a method of accepting some internal representation as input and producing a natural language output. So, the problem of NLG is twofold [6]:

• Selecting a Knowledge Representation (KR)
• Transforming the information to Natural Language (NL)
The major question in NLG is how one can produce high-quality natural text from some computer-internal representation of information. The effectiveness of this representation involves the understandability of the embodied information. The major task involved here is the design of an unambiguous representation of world knowledge. The Indian language Sanskrit has a systematic approach meant for cognitive knowledge description. This work analyses the scope and implementation issues of Natural Language Generation, its applications, and a Karaka-based input representation for NLG.

II. System Definition

Natural language generation is the process of converting an input knowledge representation into an expression in natural language (either text or speech) according to the application. The input to the system is a four-tuple [8]: (K, C, U, D), where K is the knowledge source, a database of world knowledge; C is the communication goal, specified independently of the language being used; U is the user model on which the system operates (probabilistic models are most commonly used in the generation process); and D is the discourse history, which deals with the ordering of information in the output text. The output is natural language text, which can be fed to a speech synthesizer according to the application.

III. NLG Task

The task of a natural language generation system can be characterized as mapping from some input data to an output text [8]. However, as with most computational processes, it is useful to decompose this task into a number of more neatly characterized sub-steps. Note that this does not mean that an NLG system needs all these modules; most systems use a computational architecture where one module simultaneously performs several tasks. The dominant concern of an NLG system is choice. A generation system must make the following choices [9]:

Content Selection: Content selection is the process of deciding what information should be communicated in the text. This is described as the process of creating a set of messages from the system's inputs or underlying data sources; these messages are the data objects then used by the subsequent language generation processes. Both the message creation process and the form and content of the messages created are highly application-dependent.

Discourse Planning: Discourse structure planning is the process of imposing ordering and structure over the set of messages to be conveyed. A text is not just a random collection of pieces of information: the information is presented in some particular order, and there is usually an underlying structure to the presentation.

Lexical Selection: The system must choose the lexical item most appropriate for expressing particular concepts. Lexicalization is especially important, of course, when the NLG system produces output texts in multiple languages.

Sentence Aggregation: Sentence aggregation is the process of grouping messages together into sentences. Aggregation is not always necessary; each message can be expressed in a separate sentence, but in many cases good aggregation can significantly enhance the fluency and readability of a text.

Referring Expression: The system must determine how to refer to the objects being discussed. Referring expression generation is closely related to lexicalization, since it is also concerned with producing surface linguistic forms which identify domain elements.

Linguistic Realization: Linguistic realization is the process of applying the rules of grammar to produce a text which is syntactically, morphologically, and orthographically correct.

IV. NLG Architecture

The simplest architecture for NLG is to build a separate module for each task and connect these modules via a one-way pipeline. At the other extreme, one can think of a single module which performs all the tasks described in the previous section. From a pragmatic perspective, the most common architecture in present-day applied NLG systems is a three-stage pipeline with the following stages [6]:

Document Planning: This stage combines the content determination and discourse planning tasks described above. This reflects the fact that in many practical applications it is difficult to separate these two activities.

Micro Planning: This stage combines sentence aggregation, lexicalization, and referring expression generation.

Surface Realization: This stage receives the fully specified discourse plan and generates individual sentences as constrained by its lexical and grammatical resources.

Fig. 1 diagrammatically shows the three-stage pipelined architecture [6]. The initial input to the document planner module is a document plan, represented as messages; it is application dependent. The document planner produces a text plan, which is given to the micro-planner module. The micro-planner produces a discourse plan, and the surface realizer applies grammar rules to produce valid natural language output.
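As an illustration of this architecture, the sketch below wires the three stages into a one-way pipeline. The interfaces, class names and placeholder data objects are purely illustrative; they do not come from any particular NLG toolkit.

// Illustrative three-stage NLG pipeline: document planning,
// micro-planning and surface realization connected one way.
interface DocumentPlanner { TextPlan plan(Messages input); }
interface MicroPlanner { DiscoursePlan plan(TextPlan textPlan); }
interface SurfaceRealizer { String realize(DiscoursePlan discoursePlan); }

// Placeholder data objects passed between the stages
class Messages {}
class TextPlan {}
class DiscoursePlan {}

class NlgPipeline {
    private final DocumentPlanner documentPlanner;
    private final MicroPlanner microPlanner;
    private final SurfaceRealizer surfaceRealizer;

    NlgPipeline(DocumentPlanner dp, MicroPlanner mp, SurfaceRealizer sr) {
        this.documentPlanner = dp;
        this.microPlanner = mp;
        this.surfaceRealizer = sr;
    }

    // Each stage sees only the output of the previous one (one-way pipeline)
    String generate(Messages messages) {
        TextPlan textPlan = documentPlanner.plan(messages);
        DiscoursePlan discoursePlan = microPlanner.plan(textPlan);
        return surfaceRealizer.realize(discoursePlan);
    }
}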
V. Karaka Relations

The annotation scheme carries out the analysis of each sentence taking the verb as the central, binding element of the sentence. Sanskrit grammarians like Panini used this idea in their grammar. Panini introduced a set of grammatical functions called Karakas, which specify relations between the nominal and the verbal root. Akshar Bharati et al. [4] observed that the derivation of a sentence proceeds as follows: the speaker selects (or the grammar freely generates) verbs corresponding to an action and nouns as participants in it; Karaka relations between the nouns and the verbs are selected depending on the semantic relation between them, and a time reference is associated with the verb. So, the Karaka relations provide a syntactic-semantic relation between the lexical items of the sentence.

Panini classifies six karakas according to the way in which they participate in the action of the verb. These may be listed as follows [4]:

k1: Karta: central to the action of the verb
k2: Karma: the one most desired by the Karta
k3: Karana: instrument which is essential for the action to take place
k4: Sampradaan: recipient of the action
k5: Apaadaan: movement away from a source
k6: Adhikarana: location of the action

So an action can be represented as a function of the verb: verb(k1, k2, k3, k4, k5, k6). This representation can be used as the input to an NLG system. The system generates lexical items to fill the places denoted by the karakas.

VI. Karaka Based NLG

The karaka-based approach is similar to a template-based generation system. Content determination and discourse planning proceed as described in the previous sections, but the end result is a text plan whose leaves are templates, which may include linguistic annotations for generation. The content determination system chooses a karaka-based verb function which contains appropriate words to describe domain concepts. If it is important not to overuse words, so as to maintain variety in a text, it may be necessary to have several templates for the same basic message and to put in place some mechanism for choosing between them [11].

For example, if the sentence to be generated is "John gave pencil to Mary", the karaka system views this as a gave action of pencil by John to Mary, represented as: gave(John, pencil, k3, Mary, k5, k6). The karakas can be represented as shown in Fig. 2. Other karakas are not specified in the above example. If the sentence to be generated is "John gave pencil to Mary as a birthday gift through courier from Home Town", the input representation to the karaka-based NLG will be: gave(John, Pencil, Courier, Mary, k5, Home Town).

Then the necessary grammatical rules are applied as specified by the surface realizer. The representation of features in this function will improve the effectiveness of the sentence generation task. In an actual generation system, a source text is given for learning. The parser analyses the structure and automatically identifies the role of each lexical item. By including the grammar features in this function, the representation becomes more informative for generation. This creates a generalized grammar function like action(k1, k2, k3, k4, k5, k6). The part-of-speech tagging of lexical items at the parser level will help to find the words suitable for each role. The properties of each argument can be inferred from a knowledge base. Then, from the available vocabulary, the system finds the suitable words and forms the sentence using grammar rules.
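To make this input representation concrete, the sketch below encodes such a verb function as a simple karaka frame and fills a fixed English template for the gave action. The class, the slot handling and the prepositions chosen for each karaka are illustrative assumptions, not part of any Paninian grammar toolkit.

import java.util.LinkedHashMap;
import java.util.Map;

// A karaka frame: a verb plus the six karaka slots k1..k6.
// Unfilled slots keep their role label, as in gave(John, pencil, k3, Mary, k5, k6).
class KarakaFrame {
    final String verb;
    final Map<String, String> slots = new LinkedHashMap<>();

    KarakaFrame(String verb, String k1, String k2, String k3,
                String k4, String k5, String k6) {
        this.verb = verb;
        slots.put("k1", k1); slots.put("k2", k2); slots.put("k3", k3);
        slots.put("k4", k4); slots.put("k5", k5); slots.put("k6", k6);
    }

    boolean filled(String role) { return !slots.get(role).equals(role); }

    // A fixed English template for the 'gave' action; a real realizer would
    // select templates per verb and apply the grammar rules described above.
    String realize() {
        StringBuilder s = new StringBuilder(slots.get("k1") + " " + verb + " " + slots.get("k2"));
        if (filled("k4")) s.append(" to ").append(slots.get("k4"));
        if (filled("k3")) s.append(" through ").append(slots.get("k3"));
        if (filled("k5")) s.append(" from ").append(slots.get("k5"));
        if (filled("k6")) s.append(" at ").append(slots.get("k6"));
        return s.append(".").toString();
    }

    public static void main(String[] args) {
        KarakaFrame gave = new KarakaFrame("gave", "John", "Pencil", "Courier",
                                           "Mary", "k5", "Home Town");
        // Prints: John gave Pencil to Mary through Courier at Home Town.
        System.out.println(gave.realize());
    }
}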
A. Advantages and Disadvantages

The karaka-based NLG approach has the following advantages. It gives a predicate-like representation for NLG, which allows easy implementation using Prolog and other logic programming paradigms. The karaka relations also help to identify the roles of participants unambiguously. This is most applicable to free word order languages like Sanskrit, Malayalam and Hindi, where the roles are identified by case endings (vibhakti).

The disadvantage of this approach is that it is difficult to represent all sentences with a limited set of arguments, which puts some restrictions on the NLG system.

VII. Conclusion

NLG is the process of automatically producing natural language output. The various tasks of NLG include choices on content selection, sentence ordering and lexicalization. The most important task in NLG is coding grammatical knowledge into the system. There are efficient approaches like systemic grammar and FUG, which are extensions of grammar models used for understanding problems. The karaka-based approach for NLG has the advantage of a predicate-like structure and can be used efficiently in free word order languages as well.

VIII. References

[1] Albert Gatt and Ehud Reiter, "SimpleNLG: A realization engine for practical applications", in Proceedings of the 12th European Workshop on Natural Language Generation, 2009.
[2] Allen J, Natural Language Understanding, Benjamin/Cummings Publishing Company, California, 1988.
[3] Ashwini V, Samar H, Prashanth M, and Dipti M S, "A Karaka Based Annotation Scheme for English", in Proceedings of CICLing, 2009, pp. 41-52.
[4] Bharati A, Chaitanya V, Sangal R, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, New Delhi, 1995.
[5] Bharati A, Chaitanya V, Sangal R, "Paninian Grammar Framework Applied to English", Indian Institute of Technology Kanpur, 1996.
[6] Ehud R, "Building Applied Natural Language Generation Systems", in Applied Natural Language Processing Conference, Washington DC, 1997.
[7] John A. Bateman, "Sentence generation and systemic grammar: an introduction", in Iwanami Lecture Series: Language Sciences, Iwanami Shoten Publishers, 1997.
[8] Jurafsky D and Martin H, Speech and Language Processing, Prentice Hall Inc., 2008.
[9] Rick Briggs, "Knowledge Representation in Sanskrit and Artificial Intelligence", in The AI Magazine, Spring 1985.
Domain Specific Sentence Level Mood Extraction from Malayalam Text
Revathy P
M. Tech Computational Linguistics
GEC, Sreekrishnapuram, Palakkad
revathyravindranp@gmail.com

There exists a wide range of applications for NLP, of which sentiment analysis (SA) plays a major role. In sentiment analysis, the emotional polarity of a given text is analyzed and classified as positive, negative or neutral. A more difficult task is to refine the classification into different moods such as happy, sad, angry, etc. Analyzing a natural language for mood extraction is not at all an easy task for a computer. Even with the capability of performing massive amounts of computation within a matter of seconds, understanding the sentiments embodied in phrases and sentences of textual information remains one of the toughest tasks for a computer to date. This paper focuses on tagging the appropriate mood in Malayalam text.
I. Introduction

Sentiment analysis or opinion mining refers to the application of natural language processing, computational linguistics, and text analytics to identify and extract subjective information in source materials. This is an emerging field in Natural Language Processing (NLP) which has a wide scope in many areas. A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level, i.e., whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Sentiment analysis focuses on categorizing the text at the level of its subjective and objective nature. Subjectivity indicates that the text contains or bears opinion content, whereas objectivity indicates that the text is without opinion content.
II. General Approaches

Much of the work has been done for the English language, with less focus given to scarce-resource languages. Some of the common and popular approaches used for sentiment analysis are:

• Using a Subjective Lexicon
• Using N-Gram Modeling
• Using Machine Learning
III. Main Challenges

Some of the general challenges while addressing the problem of sentiment analysis are:

* Unstructured Data: The data available on the internet is largely unstructured; there are different forms of data talking about the same entities, persons, places, things and events. The web contains data from different sources varying from books, journals, web documents, health records, company logs, internal files of an organization, and even data from multimedia platforms comprising texts, images, audio, videos, etc.
* Noise (slangs, abbreviations): The web content available is very noisy.
* Contextual Information: Identifying the context of the text is an important challenge to address.
* Sarcasm Detection: Sarcasm is defined as a sharp, bitter, or cutting expression or remark; a bitter jibe or taunt, usually conveyed through irony or understatement. It is a hard task for human beings to interpret sarcasm; making a machine able to understand the same is an even more difficult task.
* Word Sense Disambiguation: The same word can have multiple meanings, and based on the sense of its usage the polarity of the word also changes.
* Language Constructs: Each language has its own nature and style of writing, which is accompanied by its own challenges and specifications.

IV. Semantic Orientation

Words communicate the speaker's evaluation of the item under discussion as desirable or undesirable. This evaluative character of a word is called its semantic orientation. A word with a positive semantic orientation conveys the evaluation that the item is desirable (e.g., beautiful) and a negative orientation conveys the evaluation that the item is undesirable (e.g., absurd). This article presents a general strategy for inferring semantic orientation from semantic association.

A. Semantic Orientation from Semantic Association

The general strategy in this paper is to infer semantic orientation from semantic association. Seven positive words (good, nice, excellent, positive, fortunate, correct, and superior) and seven negative words (bad, nasty, poor, negative, unfortunate, wrong, and inferior) are used as paradigms of positive and negative semantic orientation. The semantic orientation of a given word is calculated from the strength of its association with the seven positive words, minus the strength of its association with the seven negative words. These fourteen words were chosen using intuition. They are based on opposing pairs (good/bad, nice/nasty, excellent/poor, etc.).

B. Semantic Orientation from PMI-IR

PMI-IR (Turney, 2001) uses Pointwise Mutual Information (PMI) to calculate the strength of the semantic association between words (Church & Hanks, 1989). Word co-occurrence statistics are obtained using Information Retrieval (IR). PMI-IR has been empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL), obtaining a score of 74% (Turney, 2001). For comparison, the Pointwise Mutual Information (PMI) between two words, word1 and word2, is defined as follows (Church & Hanks, 1989):

PMI(word1, word2) = log2 [ p(word1 & word2) / ( p(word1) p(word2) ) ]

Here, p(word1 & word2) is the probability that word1 and word2 co-occur. If the words are statistically independent, the probability that they co-occur is given by the product p(word1) p(word2). The ratio between p(word1 & word2) and p(word1) p(word2) is a measure of the degree of statistical dependence between the words. The log of the ratio is the amount of information that we acquire about the presence of one word when we observe the other. The semantic orientation of a word is calculated by SO-PMI-IR as the sum of its PMI with the positive paradigm words minus the sum of its PMI with the negative paradigm words:

SO-PMI-IR(word) = sum of PMI(word, pword) over the seven positive words - sum of PMI(word, nword) over the seven negative words
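A small sketch of the SO-PMI-IR computation from raw counts is given below. The HitCounter backend is a placeholder for whatever IR machinery supplies the co-occurrence statistics (PMI-IR issued queries to a web search engine); only the arithmetic of the formulas above is shown.

import java.util.List;

// Computes SO-PMI-IR(word) from co-occurrence hit counts, following
// PMI(w1, w2) = log2( p(w1 & w2) / (p(w1) p(w2)) ).
class SoPmi {
    static final List<String> POSITIVE = List.of("good", "nice", "excellent",
            "positive", "fortunate", "correct", "superior");
    static final List<String> NEGATIVE = List.of("bad", "nasty", "poor",
            "negative", "unfortunate", "wrong", "inferior");

    interface HitCounter {
        double hits(String word);                      // documents containing the word
        double cooccurrenceHits(String a, String b);   // documents containing both words
    }

    final double totalDocs;      // size of the corpus or index
    final HitCounter counter;    // placeholder IR backend

    SoPmi(double totalDocs, HitCounter counter) {
        this.totalDocs = totalDocs;
        this.counter = counter;
    }

    double pmi(String w1, String w2) {
        // A real implementation would smooth zero counts before dividing
        double pJoint = counter.cooccurrenceHits(w1, w2) / totalDocs;
        double p1 = counter.hits(w1) / totalDocs;
        double p2 = counter.hits(w2) / totalDocs;
        return Math.log(pJoint / (p1 * p2)) / Math.log(2);   // log base 2
    }

    // Association with the positive paradigm words minus the negative ones
    double semanticOrientation(String word) {
        double so = 0.0;
        for (String p : POSITIVE) so += pmi(word, p);
        for (String n : NEGATIVE) so -= pmi(word, n);
        return so;
    }
}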
C. Semantic Orientation from LSA

SO-LSA applies Latent Semantic Analysis (LSA) to calculate the strength of the semantic association between words (Landauer and Dumais, 1997). LSA uses Singular Value Decomposition (SVD) to analyze the statistical relationships among words in a corpus. The first step is to use the text to construct a matrix A, in which the row vectors represent words and the column vectors represent chunks of text (e.g., sentences, paragraphs, documents). Each cell represents the weight of the corresponding word in the corresponding chunk of text. The weight is typically the TF.IDF score (Term Frequency times Inverse Document Frequency) for the word in the chunk. (TF.IDF is a standard tool in Information Retrieval.) The next step is to apply SVD to A, to decompose A into a product of three matrices U∑V^T, where U and V are in column orthonormal form (i.e., the columns are orthogonal and have unit length) and ∑ is a diagonal matrix of singular values (hence SVD). If A is of rank r, then ∑ is also of rank r. Let ∑k, where k < r, be the matrix produced by removing from ∑ the r - k columns and rows with the smallest singular values, and let Uk and Vk be the matrices produced by removing the corresponding columns from U and V. The matrix Uk∑kVk^T is the matrix of rank k that best approximates the original matrix A, in the sense that it minimizes the sum of the squares of the approximation errors. We may think of this matrix Uk∑kVk^T as a smoothed or compressed version of the original matrix A. SVD may be viewed as a form of principal components analysis. LSA works by measuring the similarity of words using this compressed matrix, instead of the original matrix. The similarity of two words, LSA(word1, word2), is measured by the cosine of the angle between their corresponding compressed row vectors. The semantic orientation of a word is calculated by SO-LSA analogously to SO-PMI-IR, with LSA similarity in place of PMI:

SO-LSA(word) = sum of LSA(word, pword) over the seven positive words - sum of LSA(word, nword) over the seven negative words
Fig: Proposed Method

V. Conclusion and Future Works

This work presents a method of extracting the mood from a Malayalam sentence. The two commonly used classes of methods for sentiment analysis of a given text are machine learning and semantic orientation. Both classes of methods are briefly described above, along with some related work. We propose a method based on the latter, using the SO-PMI-IR algorithm suitably modified for this task. The level of accuracy of this semantic orientation method can be further increased by including some features from supervised learning algorithms. We have focused on a specific domain in order to obtain a minimum level of precision. Cross-domain sentiment analysis would be an interesting and useful topic to work on as another extension of our work.

VI. References

[1] Neethu Mohandas, Janardhanan P S Nair, Govindaru V, "Domain Specific Sentence Level Mood Extraction from Malayalam Text", 2012 International Conference on Advances in Computing and Communications.
[2] Piyush Arora, "Sentiment Analysis for Hindi Language".
[3] B Pang, L Lee, and S Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques", Proc. EMNLP-02, the Conference on Empirical Methods in Natural Language Processing.
[4] Antony P.J, Santhanu P Mohan, Soman K.P, "SVM Based Part of Speech Tagger for Malayalam", Amrita University, Coimbatore, India, International Conference.
[5] P.D. Turney and M.L. Littman, "Unsupervised Learning of Semantic Orientation from a Hundred-Billion-Word Corpus", National Research Council, Institute for Information Technology, Technical Report (2002).
Natural Language Generation in Statistical Dialogue Systems
Pelja Paul N
M. Tech Computational Linguistics
GEC, Sreekrishnapuram, Palakkad
peljapaul@gmail.com

This article discusses natural language generation (NLG) in statistical spoken dialogue systems (SDS) using a data-driven statistical optimization framework for incremental information presentation (IP). The trained IP model is adaptive to variation in the current generation context, and it incrementally adapts the IP policy at the turn level. Reinforcement learning is used to automatically optimize the IP policy with respect to a data-driven objective function. Language generation here faces new challenges. First, in fully statistical dialogue systems, all components can introduce uncertainty, i.e. other components cannot know for sure how "higher up" or "lower down" components in the dialogue system pipeline will perform, but may only have an estimate of their likely behaviour. Secondly, NLG for incremental dialogue systems needs to be able to accommodate user barge-ins. The framework is applied to adaptive information presentation (IP) in spoken dialogue systems; this work was the first to apply a data-driven optimization method to this decision space, from data collection to user testing. The IP model is adaptive to the variability observed in a stochastic SDS, and it incrementally adapts the IP policy at the turn level, with reinforcement learning used to automatically optimize the IP policy with respect to a data-driven objective function.
I. Introduction

Natural language allows us to achieve the same communicative goal ("what to say") using many different expressions ("how to say it"). In a spoken dialogue system (SDS), an abstract communicative goal (CG) can be generated in many different ways. Natural Language Generation (NLG) for Spoken Dialogue Systems serves two goals. On the one hand, the local NLG task is to present enough information to the user while keeping the utterances short and understandable. On the other hand, better information presentation should also contribute to the global, overall dialogue task, so as to maximize task completion. NLG for Spoken Dialogue Systems (SDS) converts speech acts from the Dialogue Manager (DM) into spoken prompts. Information Presentation (IP) is a central aspect of NLG for SDS: information presentation strategies are one of the main contributors to dialogue duration and are positively correlated with task success and user satisfaction.
II. Natural Language Generation as Planning Under Uncertainty

The general framework adopted here is NLG as planning under uncertainty. Some aspects of NLG have been treated as planning, but not as statistical planning. Within an SDS architecture, NLG actions take place in a stochastic environment, consisting for example of a user and a stochastic realizer, where the individual NLG actions have uncertain effects on the environment. For example, presenting differing numbers of attributes to the user makes the user more or less likely to choose an item for multimodal interaction.

Most SDSs employ fixed template-based generation. This work employs a nondeterministic sentence planner and surface realizer for SDS. This introduces additional variation, to which higher-level NLG decisions will need to react. The NLG component must achieve a high-level communicative goal from the Dialogue Manager (e.g. to present a number of items) through planning a sequence of lower-level generation steps or actions, for example first to summarize all the items and then to recommend the highest-ranking one. Each such action has uncertain effects due to the stochastic realizer. For example, the realizer might generate a different sentence structure or employ different numbers of attributes depending on its own processing constraints. The user may be likely to choose an item after hearing a summary, or they may wish to hear more. The problem of planning how to (incrementally) generate an utterance for SDS falls naturally into the class of statistical planning problems, rather than rule-based approaches or supervised learning with classifier learning and re-ranking. Those supervised approaches involve the ranking of a set of completed plans/utterances and do not adapt online to the context or the user. Reinforcement learning (RL) provides a principled, data-driven optimization framework for this type of planning problem, maximizing some notion of long-term reward or utility.
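To make the reinforcement learning formulation concrete, here is a heavily simplified, purely illustrative sketch of optimizing an IP policy with tabular Q-learning. The state encoding, the action set, the simulated reward and the parameter values are all assumptions made for the sake of the example, not the system described in this article.

import java.util.Random;

// Toy Q-learning over Information Presentation actions. The "state" is just
// how many IP actions have been taken in the current turn; the simulated
// reward stands in for a data-driven objective function learned from users.
class IpPolicyLearner {
    static final String[] ACTIONS = {"SUMMARY", "COMPARE", "RECOMMEND"};
    static final int MAX_STEPS = 3;

    final double[][] q = new double[MAX_STEPS + 1][ACTIONS.length];
    final Random rng = new Random(0);
    final double alpha = 0.1, gamma = 0.9, epsilon = 0.2;

    // Epsilon-greedy action selection over the Q-table
    int chooseAction(int state) {
        if (rng.nextDouble() < epsilon) return rng.nextInt(ACTIONS.length);
        int best = 0;
        for (int a = 1; a < ACTIONS.length; a++) if (q[state][a] > q[state][best]) best = a;
        return best;
    }

    // Placeholder reward: favours ending on a recommendation, penalizes length
    double simulatedReward(int state, int action) {
        return (ACTIONS[action].equals("RECOMMEND") ? 1.0 : 0.3) - 0.2 * state;
    }

    // Each episode walks through the turn positions in order and applies
    // the standard Q-learning update with the best next-state value.
    void train(int episodes) {
        for (int e = 0; e < episodes; e++) {
            for (int state = 0; state < MAX_STEPS; state++) {
                int action = chooseAction(state);
                double reward = simulatedReward(state, action);
                double maxNext = 0;
                for (double v : q[state + 1]) maxNext = Math.max(maxNext, v);
                q[state][action] += alpha * (reward + gamma * maxNext - q[state][action]);
            }
        }
    }
}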
III. Wizard-of-Oz Data Collection

The WoZ setup involves four stages. First, the user's utterance is tagged with attribute values by the wizard. For some of the cases, artificial ASR noise is introduced. In the noise condition, it is the experimenter who listens to the user's utterance and does the tagging; in the no-noise condition, the wizard directly listens to the user's utterance and translates it into attribute values. For example, "I am looking for Indian restaurants in the Old Town" gets tagged as cuisine=Indian, location=Old Town. The wizard then queries the database, where a real database of Edinburgh restaurants provided by The List is used, and selects an NLG strategy. There were no general time constraints on the wizards, but they were encouraged to act as quickly as possible. The strategy then gets generated by a surface realiser. The final utterance is played back to the user via TTS, using the Cereproc speech synthesiser. Restaurant names (especially international ones) were manually transcribed in order to improve the TTS quality for proper names. Fig 1 shows the web-based interface for the wizard. The experimenter and the user have similar interfaces, which communicate with the wizard's page using a web-based server-client architecture. The audio is transmitted and recorded using VOIP. The wizard GUI contains 5 main panels (Fig 1):

Fig 1: Wizard interface

A: The wizard receives the user's query as noisy attribute values from the noise model. The experimenter has a similar input panel. There are 5 searchable attributes in total, which can also be negative ("not expensive").

B: The retrieved database items are presented in an ordered list. To use a user-modelling approach for ranking the restaurants, it is assumed that a default user cares about cheap food with high quality and good service.

C: The wizard then chooses which strategy and which attributes to generate next, by clicking radio buttons. The attribute(s) specified in the last user query are preselected by default.

D: An utterance is automatically generated by the NLG surface realiser every time the wizard selects a strategy, and is displayed in an intermediate text panel.

E: The wizard can decide to add the generated utterance to the final output panel. The text in the final panel is sent to the user via TTS, once the wizard decides to stop generating.

IV. NLG Prompt Generation

An NLG surface realiser was implemented in order to generate IP strategies in real time. This generator is based on data from a stochastic sentence planner called SPaRKy. It replicates the variation observed in SPaRKy by analysing high-ranking example outputs (given the highest possible score by both SPaRKy judges) and implementing the variance in dynamic templates. The realisations vary in sentence aggregation, aggregation operators (e.g. 'and', 'full stop', or ellipsis), contrasts (e.g. 'however', 'on the other hand') and referring expressions (e.g. 'it', 'this restaurant') used. The following Information Presentation actions are realised:

• RECOMMEND the top-ranking restaurant (according to the UM).
• COMPARE the top 2 restaurants, by Item or by Attribute.
• SUMMARY of all matching restaurants, with or without a User Model (UM).

The approach using a UM assumes that the user has certain preferences (e.g. cheap) and only tells them about the relevant options, whereas the approach with no UM lists all the options.
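These actions map naturally onto simple templates. A rough sketch of such realisations is given below; the restaurant data, the wording and the contrast marker are illustrative, and the realiser described above derives its templates from SPaRKy output rather than hand-written strings.

import java.util.List;

// Illustrative template realisations of the three IP actions.
class IpRealiser {
    record Restaurant(String name, String cuisine, String price) {}

    String recommend(Restaurant top) {
        return top.name() + " is an excellent " + top.cuisine()
                + " restaurant. It is " + top.price() + ".";
    }

    // Compare the top two items attribute by attribute, with a contrast marker
    String compareByAttribute(Restaurant a, Restaurant b) {
        return a.name() + " serves " + a.cuisine() + " food and is " + a.price()
                + ". " + b.name() + ", on the other hand, serves " + b.cuisine()
                + " food and is " + b.price() + ".";
    }

    // Summarise all matching items; with a user model only relevant options would be listed
    String summary(List<Restaurant> matches) {
        return "There are " + matches.size() + " restaurants matching your query.";
    }

    public static void main(String[] args) {
        IpRealiser r = new IpRealiser();
        Restaurant a = new Restaurant("The Raj", "Indian", "cheap");
        Restaurant b = new Restaurant("Saffron", "Indian", "expensive");
        System.out.println(r.recommend(a));
        System.out.println(r.compareByAttribute(a, b));
        System.out.println(r.summary(List.of(a, b)));
    }
}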
V. Conclusion and Future Work

This article presented and evaluated a novel framework for adaptive natural language generation, where the problem is formulated as stochastic incremental planning under uncertainty, which can be approached using reinforcement learning methods. The user studies show that an adaptive NLG component significantly contributes to the (perceived or objective) task success of the system. Thus, such data-driven adaptive NLG strategies have "global" effects on overall system performance. The data-driven planning methods applied here therefore promise significantly improved performance of generation modules, and thereby of natural language interaction in general.

A remaining challenge for NLG is that of 'generation under uncertainty', where language must be generated for users even though there is some uncertainty about their state. This uncertainty can be about their location, their gaze direction and objects in their field of view, or even about their goals and preferences. Regarding generation under uncertainty, an interesting research direction will be to explicitly represent uncertainty about the generation context. Currently only a small number of "lower level" features, such as sentence length, have been investigated. Future work could also include the predicted TTS quality as a feature for optimizing NLG decisions.
Google to offer real-time voice translation on smartphones Google is planning to make up lost ground on Microsoft's Skype by automatically detecting the language spoken by foreigners and translating their speech. Now, according to the New York Times, Google is planning to beef up its own Translate mobile app. A forthcoming update will automatically detect if someone is speaking one of a few common languages and turn their speech into text. Automatic spoken language detection is already part of the Google Translate desktop app, although it's not clear from the report whether this will eventually be built into Google Hangouts to provide a genuine alternative to Skype Translator. Microsoft's Bing Translator also comes in mobile and desktop versions, and underpins the automatic translation services offered by Twitter and Facebook. Microsoft CEO Satya Nadella revealed last year that machine learning greatly improves the accuracy of these translation services, as they begin to gain a greater understanding of human speech patterns and grammar. Visit: http://www.expertreviews.co.uk/technology/1402413/google-to-offer-realtime-voice-translation-on-smartphones
Cloud Computing and Big Data Analytics with Text Analytics
Archana S M
M. Tech Computational Linguistics
GEC, Sreekrishnapuram, Palakkad
1111archa@gmail.com

Big data is generated by many applications, but it is difficult to manage with traditional relational database management systems. Traditionally, data warehouses have been used to manage data. But for big volumes of data, warehouses are not practical: their infrastructure is costly and the analytics of data is slow. A cloud-based approach offers a means for performing large-scale data analytics in a cost-effective and scalable manner. As with other cloud environments, data management in the cloud benefits end users by offering a pay-as-you-go (or utility-based) model and adaptable resource requirements that free enterprises from the need to purchase traditional hardware. Data management, integration and analytics can be offloaded to clouds. Big data analytics with cloud computing is useful in natural language processing, where there is a need to analyze large amounts of data with many NLP tools. ILLINOIS CLOUD NLP is a cloud-based service that aims to provide scalable text analytics processing capabilities to non-expert end users. This framework is built around Amazon Web Services' Elastic Compute Cloud (EC2). It provides a simple interface through which end users can upload plain text documents, specify a set of text analytics tools (NLP annotations) to apply, and process and store or download the processed data.
I. Introduction

Big data is the collection of datasets so large and complex that they are difficult to manage with conventional tools. Every day we create 2.5 quintillion bytes of data; it may come from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. Querying and analysing such large data is inefficient with traditional relational database systems, because big data exceeds the processing capacity of conventional database systems. Warehouses are not practical either: their infrastructure is costly and the analytics of data is slow. To overcome the shortcomings of traditional warehousing systems, cloud computing is used. The cloud provides IT resources as a service: compute, storage, databases, etc. Comparatively, the cloud is cheap and allows businesses to offload computing tasks while saving IT costs and resources. Cloud computing can therefore be used for performing big data analytics in a cost-effective and scalable manner.

This paper outlines the techniques used to analyse big data, for both data in motion and data at rest. For big data at rest there are two kinds of systems:

1. NoSQL systems
2. Systems for large-scale analytics based on the MapReduce paradigm, such as Hadoop.

For big data in motion there are data stream management systems (DSMS). Natural Language Processing (NLP) continues to grow in popularity in a range of research and commercial applications, and in these areas there is a need to analyse large amounts of data, so the advantages of cloud computing and big data analytics can be used in NLP. ILLINOIS CLOUD NLP is a text analytics service in the cloud which can process large document sets effectively. As a demonstration, the Cognitive Computation Group developed a simple application over ILLINOIS CLOUD NLP by processing 3.05 million documents from the TAC KBP tasks with segmentation, Part of Speech, Named Entity Recognition, and Wikification in approximately 20 hours, at a cost of approximately US$500. This task would require about a month of continuous processing on a single local server, a nontrivial installation effort, and a lot of human expertise and supervision in case of failure.

II. Cloud Data Management

Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network. Big data cannot be managed by traditional warehouses or database systems, but it can be managed effectively by the cloud. The cloud itself provides the additional hardware to end users, so there is no need to purchase it; end users pay the cloud providers according to usage. Cloud computing performs large-scale analytics in a cost-effective manner. Data management, integration and analytics can be offloaded to public and/or private clouds. With private cloud usage, businesses can improve the utilization of existing infrastructure; with public cloud usage, businesses can get processing power and infrastructure as needed. By using cloud computing, businesses can concentrate on their core work and enhancement rather than face data analytics problems.
There are a few factors to be taken care of while choosing a cloud provider:

1. AVAILABILITY GUARANTEES: Apart from real-time transactional data, an organisation may be willing to move its entire analytics infrastructure to the cloud, so every cloud provider must offer a certain level of availability guarantees.

2. RELIABILITY: The user must ensure that the cloud provides its services reliably before loading it with data.

3. SECURITY: For organisations handling confidential information, the cloud must be secure and access to it must be properly controlled.

4. MAINTAINABILITY: The user must make sure that maintenance operations and data organisation facilities are provided by the cloud.

III. Techniques Used

A. NoSQL

NoSQL stands for Not Only SQL; it is a class of non-relational data storage systems. It covers a number of techniques for processing massive data in a distributed manner, including efficient capture, storage, search, sharing, analytics and visualization of massive-scale data. When a dataset becomes large there are issues with scaling, and RDBMSs are not designed to support distributed horizontal scaling, so NoSQL systems are used. Two techniques are used to meet scalability requirements:

1. Replication: A master-slave architecture is used in which all writes go to the master and all reads are performed against the replicated slave databases.

2. Sharding: Also known as partitioning; the data is partitioned first, so there are no longer relationships or joins across partitions.

NoSQL Data Models

NoSQL databases are flexible and support several types of data models:

1. Key-value stores: Read and write operations on a data item are identified uniquely by its key. E.g. Amazon Dynamo.

2. Document stores: The value associated with a key is a document which is not opaque to the database and can therefore be queried. E.g. Amazon SimpleDB, MongoDB and Apache CouchDB (a small example follows this list).

3. Column family stores: Data is organized into tables; each record is identified by a row key, and each row has a number of columns which are organized into column families. Transactions are supported only under a single row key.
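To make the document-store model concrete, the following is a minimal sketch in Python using the pymongo driver against a local MongoDB instance; the database and collection names are purely illustrative and are not part of any system described in this article.

    # A minimal sketch of the document-store model (assumes a local MongoDB
    # server and the pymongo package; names below are illustrative only).
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client["clear_demo"]          # hypothetical database name
    articles = db["articles"]          # hypothetical collection name

    # The value stored under a key is a structured document, not an opaque blob...
    articles.insert_one({
        "title": "Cloud Computing and Big Data Analytics",
        "year": 2014,
        "tags": ["cloud", "big data", "NLP"],
    })

    # ...so, unlike a plain key-value store, it can be queried by its contents.
    for doc in articles.find({"tags": "NLP"}):
        print(doc["title"], doc["year"])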
B. Hadoop

Hadoop is a framework for running applications on large clusters built of commodity hardware. It implements a computational paradigm named MapReduce for writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters in a fault-tolerant manner. A MapReduce computation has two phases, Map and Reduce:

Map: The master node divides the input data and the problem and distributes them to worker nodes. Each map task takes a set of {key, value} pairs and generates one or more intermediate {key, value} pairs for each input key.

Reduce: The intermediate key-value pairs are processed to produce the output of the original problem. Each reduce instance takes a key and an array of values as input and produces its output after processing that array of values.
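As an illustration of the two phases, here is a minimal, self-contained word-count sketch in Python. On a real Hadoop cluster the mapper and reducer would run as separate tasks (for example via Hadoop Streaming), with the framework shuffling and sorting the intermediate pairs between the two phases; here both phases simply run in one process on standard input.

    # wordcount.py -- a minimal sketch of the Map and Reduce phases.
    import sys
    from itertools import groupby

    def mapper(lines):
        # Map: emit an intermediate {word, 1} pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # Reduce: for each key, process the list of values (here, sum the counts).
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        # Run both phases locally on stdin; a real cluster would shuffle and
        # sort the intermediate pairs between the mapper and reducer tasks.
        for word, count in reducer(mapper(sys.stdin)):
            print(f"{word}\t{count}")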
In Hadoop, data is stored in the Hadoop Distributed File System (HDFS), which is designed to run on cheap commodity hardware. MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework. HDFS has a master-slave architecture: one master, called the NameNode, and a number of slave nodes, called DataNodes. The NameNode manages the file system namespace; it divides each file into blocks and replicates them across different machines, which is how fault tolerance is achieved. The DataNodes manage the storage attached to the nodes on which they run. The master continuously monitors the progress of data processing at the slave nodes, and if a node fails or becomes slow it reassigns that node's data blocks to other slave nodes.

C. Data Stream Management Systems

A DSMS is used for the analysis of data in motion. In contrast to conventional databases, the analysis is done in real time and actions are performed just in time. Stream processing systems include Twitter's Storm and IBM InfoSphere.
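To give a flavour of the data-in-motion model, the following is a minimal, self-contained sketch of continuous processing over a stream: a running hashtag count over a sliding window of the most recent messages. It uses only the Python standard library and merely stands in for what a real DSMS such as Storm would do at much larger scale; the window size and messages are made up for the example.

    # A minimal sketch of stream processing: a sliding-window hashtag count.
    from collections import Counter, deque

    WINDOW = 4                      # illustrative window size (last 4 messages)
    window = deque(maxlen=WINDOW)
    counts = Counter()

    def process(message):
        """Update the running counts as each message arrives."""
        if len(window) == window.maxlen:      # the oldest message is about to be evicted
            for tag in window[0]:
                counts[tag] -= 1
                if counts[tag] == 0:
                    del counts[tag]
        tags = [w for w in message.split() if w.startswith("#")]
        window.append(tags)
        counts.update(tags)
        return counts.most_common(3)          # current top hashtags

    stream = [
        "loving the #cloud tutorial",
        "#bigdata meets #cloud",
        "#nlp on #bigdata is fun",
        "more #nlp experiments",
        "yet another #cloud post",
    ]
    for msg in stream:
        print(process(msg))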
IV. Querying Data Over Cloud

Processing data in Hadoop requires writing MapReduce programs in languages such as Java or Python. This is time consuming, it requires highly skilled developers and careful scheduling, and the reducers should receive an even distribution of data to process. To overcome these problems, high-level query languages have been developed.

1. Hive: A system for managing and querying structured data, built on top of Hadoop. Its language is SQL-like and therefore easy to learn and use (a small query sketch follows this list). Facebook uses Hadoop and Hive to generate reports for third-party developers and advertisers who need to track the success of their applications or campaigns.

2. Pig: A platform for analysing large data sets that sits on top of Hadoop. Pig generates and compiles a MapReduce program; Pig Latin is its language for expressing data transformation flows.

3. JAQL: A query language for JavaScript Object Notation (JSON), developed by IBM and designed especially for working with large volumes of structured, semi-structured and unstructured data.
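As an illustration of how such a high-level query replaces hand-written MapReduce code, here is a sketch that runs a SQL-like aggregation against Hive from Python using the third-party PyHive library; the host, table and column names are assumptions made for the example, not details taken from this article.

    # A sketch of querying Hive from Python via PyHive (assumes the library is
    # installed and a HiveServer2 instance is reachable at the given host).
    from pyhive import hive

    conn = hive.Connection(host="hive.example.org", port=10000, database="default")
    cursor = conn.cursor()

    # One declarative statement replaces a hand-written MapReduce job.
    # "ad_clicks" and its columns are illustrative names only.
    cursor.execute(
        "SELECT campaign, COUNT(*) AS clicks "
        "FROM ad_clicks GROUP BY campaign ORDER BY clicks DESC LIMIT 10"
    )
    for campaign, clicks in cursor.fetchall():
        print(campaign, clicks)

    cursor.close()
    conn.close()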
V. ILLINOIS CLOUD NLP - Text Analytics Application

NLP is an active research area, and its use in commercial enterprises has ballooned with the advent of big data analysis. Installing, maintaining and running NLP tools is difficult, and end users may need a large amount of processing capacity. These needs have contributed to the commercial success of cloud services such as Amazon Web Services Elastic Compute Cloud (EC2): a knowledgeable user can select from a wide variety of virtual machine images, start multiple instances on a group of EC2 host machines, and process data in parallel to achieve fast, high-volume data processing.

ILLINOIS CLOUD NLP is a cloud-based service that aims simply to provide scalable text analytics processing capabilities to non-expert end users. The framework is built around NLP CURATOR and Amazon EC2. It provides a simple interface through which end users can deploy one or more NLP CURATOR instances on EC2, upload plain text documents, specify a set of text analytics tools (NLP annotations) to apply, process the documents, and store or download the processed data. The NLP annotations include tokenization, Part of Speech (POS) tagging, shallow parsing, Named Entity Recognition and Wikification. The ILLINOIS CLOUD NLP infrastructure therefore allows the user to process documents on demand, and requires no NLP, Machine Learning or Cloud Computing expertise.
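To make that workflow concrete, here is a purely hypothetical sketch of what a client script for such a service could look like. Every name below (the cloudnlp module, start_cluster, submit and so on) is invented for illustration; it is not the actual ILLINOIS CLOUD NLP interface.

    # Hypothetical client workflow for a cloud NLP service of the kind
    # described above. All names here are invented for illustration and are
    # NOT the real ILLINOIS CLOUD NLP API.
    import cloudnlp  # hypothetical client library

    cluster = cloudnlp.start_cluster(
        instances=4,                                 # EC2 worker instances
        annotations=["token", "pos", "ner", "wikify"],
    )

    job = cluster.submit(documents="s3://my-bucket/raw-docs/")  # upload plain text
    job.wait()                                       # workers annotate in parallel

    # Processed annotations are stored back to S3 for retrieval or later reuse.
    job.download("annotated-docs/")
    cluster.shutdown()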
A. NLP Curator

ILLINOIS CLOUD NLP builds on NLP CURATOR, an NLP management system. NLP CURATOR provides a single point of access to these annotation services and a programmatic interface that allows end users to request and access NLP annotations within applications. It is complemented by EDISON, a Java library that provides a large suite of NLP data structures and supports feature extraction and common experimental NLP tasks. Together, EDISON and NLP CURATOR provide a straightforward API for applying NLP tools to plain text documents.

B. Infrastructure

ILLINOIS CLOUD NLP uses Amazon Elastic Compute Cloud (EC2), an elastic infrastructure provided by AWS to support computing on demand. This provides a very economical way to offer on-demand NLP services, since most state-of-the-art NLP systems require a relatively large amount of RAM and processing power. In ILLINOIS CLOUD NLP, all annotation processing is performed on EC2. Amazon Simple Storage Service (S3) provides cloud-based key-value storage: keys are identifiers associated with some piece of data, while a value can be an arbitrary data structure representing a document or, in this case, a set of annotations over a document. ILLINOIS CLOUD NLP uses S3 to store processed data for retrieval by the user and/or later use on Amazon EC2.
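The following sketch shows the kind of key-value storage S3 provides, using the boto3 library; the bucket and key names are illustrative only, and AWS credentials are assumed to be configured already.

    # A sketch of S3 as a key-value store, using boto3.
    import json
    import boto3

    s3 = boto3.client("s3")

    # The key identifies the document; the value is an arbitrary blob, here a
    # JSON-serialised set of annotations over that document.
    annotations = {"doc_id": "doc-0001", "tokens": ["Cloud", "computing"], "ner": []}
    s3.put_object(
        Bucket="clearnlp-demo-bucket",
        Key="annotations/doc-0001.json",
        Body=json.dumps(annotations).encode("utf-8"),
    )

    # Later (for example from an EC2 worker), retrieve the stored value by its key.
    obj = s3.get_object(Bucket="clearnlp-demo-bucket", Key="annotations/doc-0001.json")
    print(json.loads(obj["Body"].read()))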
C. Cloud Infrastructure

ILLINOIS CLOUD NLP's services are run as sets of clusters on EC2. Each CURATOR cluster has the following components:

Manager: A control node that starts and stops Workers. It runs a central queue that stores all incoming jobs, which are then transferred to Workers.

NLP CURATOR Worker: A number of NLP CURATOR Workers receive documents to be processed from the job queue, annotate each document, and store the processed result (a minimal sketch of this job-queue pattern follows the list).

Training Unit: Stores a trained model to Amazon S3 so that NLP CURATOR Workers can load it from S3 and run it on data.

Shared Data Store: A shared data store, Amazon's S3 service, that is accessed by all ILLINOIS CLOUD NLP components.
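To illustrate the Manager/Worker pattern, here is a minimal, self-contained sketch using threads and an in-process queue. In the real system the queue is a central service on the Manager node and the workers are separate EC2 instances running NLP CURATOR, so this is only a small-scale analogy; the annotate function is a stand-in.

    # A minimal sketch of the Manager/Worker job-queue pattern, scaled down to
    # threads in a single process.
    import queue
    import threading

    job_queue = queue.Queue()

    def annotate(document):
        # Stand-in for real NLP annotation (tokenization, POS, NER, ...).
        return {"doc": document, "tokens": document.split()}

    def worker(results):
        while True:
            doc = job_queue.get()
            if doc is None:                 # sentinel from the manager: shut down
                job_queue.task_done()
                break
            results.append(annotate(doc))
            job_queue.task_done()

    # Manager: start workers, enqueue incoming jobs, then stop the workers.
    results = []
    workers = [threading.Thread(target=worker, args=(results,)) for _ in range(3)]
    for t in workers:
        t.start()
    for doc in ["first document", "second document", "third document"]:
        job_queue.put(doc)
    for _ in workers:
        job_queue.put(None)
    job_queue.join()
    for t in workers:
        t.join()
    print(results)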
VI. Summary and Conclusions

The increasing amount of data has led to a range of technologies: NoSQL stores, Hadoop, streaming data processing, and high-level query systems such as Pig, JAQL and Hive. There are many advantages in moving to cloud resources for big data analytics.

ILLINOIS CLOUD NLP is a cloud-based service that allows users to process plain text documents with a suite of NLP tools via a simple user interface. It is cost-effective for end users with intermittent processing needs and requires no NLP, Machine Learning or Cloud Computing expertise. As a demonstration, the Cognitive Computation Group developed a simple application over ILLINOIS CLOUD NLP that processed 3.05 million documents from the TAC KBP tasks with segmentation, Part of Speech tagging, Named Entity Recognition and Wikification in approximately 20 hours, at a cost of approximately US$500. The same task would require about a month of continuous processing on a single local server, a nontrivial installation effort, and considerable human expertise and supervision in case of failure.
VII. References

[1] Nitin Choudhary, Prateek Singh, "Cloud Computing and Big Data Analytics", International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 12, December 2013.

[2] Aaron Dai, Dan Roth, Zhiye Fei, Hao Wu, Mark Sammons, Stephen Mayhew, "ILLINOIS CLOUD NLP: Text Analytics Services in the Cloud", LREC, 2014.

[3] Tom White, "Hadoop: The Definitive Guide", 2nd Edition, O'Reilly Media, Inc., Sebastopol, CA, October 2010.
Nasa Looks to Machine Learning to Faster Identify Stars

NASA astronomers are now turning to "machine learning" to help them understand the properties of large numbers of stars. The research is part of the growing field of machine learning, in which computers learn from large data sets, finding patterns that humans might not otherwise see. Miller and his colleagues started with 9,000 stars as their training set. They obtained spectra for these stars, which revealed several of their basic properties: sizes, temperatures and the amount of heavy elements, such as iron.

The varying brightness of the stars had also been recorded by the Sloan Digital Sky Survey, producing plots called light curves. By feeding the computer both sets of data, it could then make associations between the star properties and the light curves.

Once the training phase was over, the computer was able to make predictions on its own about other stars by analysing only their light curves. The team's next goal is to get their computers smart enough to handle the more than 50 million variable stars. The report was published in the Astrophysical Journal.

Visit: http://gadgets.ndtv.com/science/news/nasa-looks-to-machine-learning-to-faster-identify-stars-647554
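As a small illustration of the kind of supervised learning described in this news item, the following sketch trains a regression model on synthetic "light-curve features" to predict a stellar property such as temperature. The data is made up, and the choice of a random forest is an assumption for illustration, not a detail taken from the report.

    # A sketch of the supervised-learning setup: learn a mapping from
    # light-curve features to a stellar property, then predict for new stars.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Pretend features extracted from light curves: period, amplitude, skewness.
    n_stars = 9000
    X = rng.normal(size=(n_stars, 3))
    # Pretend label obtained from spectra: temperature, loosely tied to the features.
    y = 5000 + 800 * X[:, 0] - 300 * X[:, 1] + rng.normal(scale=50, size=n_stars)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)              # "training phase" on stars with spectra

    # After training, predict the property of unseen stars from light curves alone.
    print("R^2 on held-out stars:", round(model.score(X_test, y_test), 3))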
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR March 2015
We are inviting thought-provoking articles, interesting dialogues and healthy debates on the multifaceted aspects of Computational Linguistics for the forthcoming issue of the CLEAR (Computational Linguistics in Engineering And Research) Journal, to be published in March 2015. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th March 2015 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Journal
Representative, SIMPLE Groups
Hello World,

The year 2014 has been special for SIMPLE Groups in many ways, and we have had several proud moments that can be cherished for a very long time. 2014 saw CLEAR Magazine transform into CLEAR Journal, a huge step forward. Moreover, we could organize several workshops, expert talks and seminars on various useful topics during the year. Interestingly, the end of 2014 brought yet another joy to celebrate: during 29-31 December 2014 we hosted our debut national-level conference on Computational Linguistics and Information Retrieval, aptly named NC-CLAIR. NC-CLAIR aims to bring together researchers working in the areas of Indian Language Computing, Machine Translation, Speech Processing, Information Retrieval, Big Data Analytics, Machine Learning and other related fields. More on the conference will appear in the forthcoming edition of CLEAR Journal.

This edition of CLEAR Journal provides a forum for students to enhance their background and get exposed to aspiring research areas, including Natural Language Generation and Cloud Computing and Big Data Analytics, and also gives an introduction to ontology concepts and terminology. It also records the milestones crossed by our PG students, including their work published in different national and international conferences and the placements achieved during the previous trimester. I would like to sincerely thank the contributing authors for the effort they have taken, regardless of their busy schedules, to broadcast their views on the latest developments in the field of language engineering, thereby making it beneficial to all of us. The journey of language engineering through SIMPLE Groups continues, crossing milestone after milestone. Let us hope the National Conference on Computational Linguistics and Information Retrieval will be remembered as a grand success, paved into the success tiles of GEC, Sreekrishnapuram, Palakkad.

SIMPLE Groups wishes you all a very happy and prosperous New Year!

Anagha M
anaghamanoharan3@gmail.com