CLEAR June 2014

CLEAR Magazine (Computational Linguistics in Engineering And Research)
Volume 3, Issue 2
M.Tech Computational Linguistics
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj
Professor and Head
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Sreejith C
Reshma O K
Gopalakrishnan G
Neethu Johnson

Cover page and Layout
Sreejith C

Contents

Editorial 4
SIMPLE News & Updates 5
Language Computing - A New Computing Arena (Elizabeth Sherly) 7
Data Completion using Cogent Confabulation (Sushma D K) 12
Malayalam POS Tagging Using Conditional Random Fields (Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V) 17
Topic Modelling and the LDA Algorithm (Indu M) 22
Tamil to Malayalam Transliteration (Kavitha Raju, Sreerekha T V, Vidya P V) 26
Memory-Based Language Processing and Machine Translation (Nisha M) 29
Dialect Resolution (Manu V. Nair, Sarath K. S) 34
M.Tech Computational Linguistics Project Abstracts (2012-2014) 38
CLEAR Call for Articles 46
Last Word 47


Editorial

Greetings! This edition of CLEAR is marked by its variety in content, and by contributions from eminent scholars such as Prof. Elizabeth Sherly of IIITM-K. It is heartening to note that our efforts are attracting wider attention from the academic community. We seek the readers' suggestions and comments on the previous edition, on Indian Language Computing. As our effort is to bring language technology into mainstream academics, we plan to include articles on NLP/CL tools and platforms for specialized tasks, and on the latest computing paradigms, such as MapReduce and MBLP, and their relevance to language technology. So, keep reading!

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)



NEWS & UPDATES

Publications 

"Text Based Language Identification System for Indian Languages Following Devanagiri Script", Indhuja K, Indu M, Sreejith C, Dr. Reghu Raj P C, International Journal of Engineering Research & Technology (IJERT), Vol. 3, Issue 4, April 2014, ISSN: 2278-0181.

"Eigenvector Based Approach for Sentence Ranking in News Summarization", Divya S, Reghuraj P C, IJCLNLP, April 2014.

"A Natural Language Question Answering System in Malayalam Using Domain Dependent Document Collection as Repository", Pragisha K, P C Reghu Raj, IJCLNLP, April 2014.

"Box Item Generation from News Articles Based on Paragraph Ranking using Vector Space Model", Sreejith C, Sruthimol M P, P C Reghuraj, International Journal of Scientific Research in Computer Science Applications and Management Studies (IJSRCSAMS), Volume 3, Issue 2, March 2014, ISSN: 2319-1953.

SIMPLE Groups Congratulates all the authors for their achievement!!!




Industrial Training at IIITM-K

The Virtual Resource Centre for Language Computing (VRC-LC) of the Indian Institute of Information Technology and Management - Kerala (IIITM-K) organized a short course and industrial training on Natural Language Processing exclusively for the M.Tech students of GEC, Sreekrishnapuram. The course focused on recent trends in Computational Linguistics and Machine Learning related to Malayalam Computing. It was a 15-day programme (5th May - 20th May). During the course, various research scholars and eminent faculty of VRC-LC delivered sessions on different aspects of language processing. The discussion of the ongoing work on Malayalam computing at the centre gave the participants a clear idea of the workings and challenges involved in Malayalam Computing.



Language Computing: A New Computing Arena

Elizabeth Sherly, Professor, IIITM-Kerala

The great strides made by Information Technology in every walk of life have slowly given prominence to Language Computing, which has immense importance in today's computing world. A country like India, which values its culture, heritage and language, is greatly in need of local language support so as to combat the dominance of English in computing. India has had about 25 years of history in language computing (LC), which has gone through its ups and downs. But during the last decade there has been a paradigm shift, with a significant leap to LC as a new computing arena. Scientists who were somewhat reluctant to take up language computing for research turned to language technology (LT) as a mainstream of research, and, interestingly, industry giants like Microsoft and Google entered Language Computing in a big way. This phenomenal shift in LT happened as Language Technology tools became inevitable for enhancing products and services in high-growth markets such as mobile applications, healthcare, IT services, financial services, online retail, call centers, publishing and media.

Thanks to Noam Chomsky, the mathematician-linguist who expressed the structure of human language at the level of mathematically viable symbols by introducing the concept of the tree diagram, research in computational linguistics gained good momentum, because there is now a way to represent language in mathematical and logical form that enables us to convert a piece of text into a programmer-friendly data structure. As natural language involves human understanding and processing, Artificial Intelligence also plays a significant role in the development of Computational Linguistics models. The terms Language Computing (LC) and Computational Linguistics (CL) are sometimes used interchangeably; both are


the key terminologies derived from Natural Language Processing (NLP).

Some of the major applications in LC are Machine Translation, Information Retrieval, Automatic Summarization, Question Answering, Automatic Speech Recognition, Language Writing and Spoken Aids, Dialog Systems, Man-Machine Interfaces, Knowledge Representation, etc.

Machine Translation (MT) is one of the major tasks, on which research goes back to the 1950s, but it still remains an open problem. There are no 100% accurate Machine Translation systems for any pair of languages in the world; the day we achieve that, most of the other problems in LT will be resolved. The major tasks involved in Machine Translation systems are to describe the language in terms of syntactic and semantic information. The syntactic information is generated using Parts-of-Speech (POS) tagging, Chunking, and Parsing. The semantic information is obtained using Word-Sense Disambiguation (WSD), Semantic Role Labelling (SRL), Named Entity Recognition (NER), and Anaphora Resolution (AR). The primary research in CL is basically to develop models and tools for the above-mentioned components.
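To make the syntactic side of this pipeline concrete, the short sketch below runs tokenization, POS tagging and shallow chunking. It uses the open-source NLTK toolkit purely as an illustration; the toolkit choice and the toy grammar are assumptions of this sketch, not something prescribed by the article.

    import nltk

    # Requires the 'punkt' and 'averaged_perceptron_tagger' models,
    # e.g. nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
    sentence = "Language computing brings local language support to India."
    tokens = nltk.word_tokenize(sentence)   # tokenization
    tagged = nltk.pos_tag(tokens)           # Parts-of-Speech tagging
    # A toy grammar that chunks optional determiner + adjectives + nouns as NPs.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    print(chunker.parse(tagged))            # shallow parsing (chunking)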


The challenge in this work is that each language has a diverse linguistic nature, with varied morphological features, inflectional sets, grammatical structure, etc., which makes language computing research more complex. The availability of large corpora and dictionaries in each language is another constraint for researchers in LC.

In India, there has been a phenomenal shift in Language Technology research and development in the last several years, as Computational Linguistics and Language Technology tools have become inevitable in many applications. The Department of Electronics and Information Technology (DEITY), India initiated Technology Development for Indian Languages (TDIL, http://tdil.mit.gov.in) with the objective of developing information processing tools and techniques to facilitate human-machine interaction without language barriers; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services. The major activities of TDIL involve research and development of Language Technology, which includes machine translation, multilingual parallel corpora, cross-lingual information access, optical character recognition and text-to-speech conversion, and the development of standards related to Language Technology. Various projects and research efforts have been ongoing under TDIL with the effort of a number of scientific organizations and educational institutions. Institutions like IIIT-Hyderabad, IIITM-Kerala, the IITs in India, CIIL-Mysore, Hyderabad Central University, the Centre for Computer Science and Sanskrit Centre-JNU, New Delhi, and CDAC are the main centres where computational linguistics is taught and researched.

Apart from Machine Translation, the research directions in Language Computing are towards Automatic Speech Recognition, Speech-to-Text Processing, Web and Semantics, Named Entity Recognition (NER), Sentiment Analysis, Anaphora Resolution, Word-Sense Disambiguation, etc. It is not too futuristic to believe that we could be talking to a computer which can act according to our commands. New considerations that employ computational techniques and models have to be explored for better results. Ontology and semantics, and psycholinguistic analysis, are other upcoming areas that rely on research to get more sensible information from the web and other applications. The mobile phone, being one of the handiest and most widely used gadgets, needs local language support for its various features; this is yet another potential area for research. Since the Internet of Things (IoT) is closely associated with Machine-to-Machine, Machine-to-Human and Machine-to-small-device communication, it needs local language enablement to provide information to the local masses.

The opportunities in industry are also very demanding for CL. One look at the 2013 Gartner predictions tells us that there is a huge demand for computational language scientists. Natural Language Processing and Speech Recognition are no doubt among the most prominent and expected technologies in the near future. Industries and computational linguists are tremendously forging ahead because currently most web, mobile, and social media based work needs extensive language support. The industry looks for mainly three roles in Language Computing: linguists, computer programmers (both having knowledge in language computing) and researchers. Researchers in both Linguistics and Computer Science also have good opportunities in the industry. Microsoft, Google, IBM, HP and several other companies working heavily in this field need trained manpower in LT. The global


market value of 19.3 billion in 2011 has been predicted to shoot to 30 billion in 2015. The major thrust areas are in Machine Translation, to create systems and technologies catering to today's multitude of translation scenarios; multilingual systems, to develop a natural-language-neutral approach in all aspects of linguistic computing; and natural-language processing and Automatic Speech Recognition, to design and build software which can analyze, understand, and generate natural human languages, so as to enable addressing a computer like addressing a human being.

Work on Malayalam Computing has been active for the last decade, so as to enable computers to understand and process Malayalam. The major work in Malayalam Computing involves Machine Translation systems from Malayalam to other languages and vice versa, spell checkers, Malayalam search engines, Malayalam text-to-speech and speech-to-text systems, morphological analysers and POS taggers for Malayalam, Malayalam text editors, human interaction interfaces, Malayalam language tutors, language support, corpora building, dictionary building, etc. CDAC Chennai, CDIT, IIITM-K, CIIL-Mysore, AU-KBC, IIIT-Hyderabad and Amrita are some of the major institutions actively pursuing research and product development in Malayalam Computing. There are also certain NGOs and communities actively engaged in work related to Malayalam Computing.

Despite all the efforts made by many contributors, the awareness among the masses is very low. A mechanism for encouraging and promoting Indian Language Computing is the need of the hour. There should be flagship programmes and movements in the state, in academic institutions and other organizations, to give more visibility and accessibility, which can ensure reachability in every corner of the nation. Language Computing and standardization have to be well placed in India's IT policy, both at the national and state levels. Each language group shall actively promote its language and culture, not only among its own state's people, but wherever that language group has existence worldwide. Taking Malayalam as an example, some of the promotional activities that can be done at the state and college levels are listed below.

• Include the importance of Malayalam Computing in Kerala's IT Policy.


• Create an Internet mailing list, say Malayalam.Net, and introduce the various tools and products developed, along with their use.
• Form a Malayalam Computing Task Force; promote and support various activities, and form PMUs for a better outlook and implementation.
• Establish a non-profit, government/non-government body to promote Malayalam Computing.
• Promote periodical workshops, seminars and conferences on Malayalam Computing in educational institutions.
• Make Language Computing mandatory for Computer Science courses at the university level.
• Promote Malayalam Computing using social media.
• Publish articles, research papers, tools and products, and news on LC in magazines, journals and on notice boards.

Due to the inherent structure of Malayalam as a highly agglutinative and inflectional language, the issues in Malayalam Computing are many compared to other languages. The rendering issues in the display of Malayalam fonts require greater attention. Computational models for various tasks, namely morphological analysis, POS tagging, etc., are to be refined for more accurate results. Most of the models are rule-based, statistical or hybrid models, like Support Vector Machines, HMM and TnT, that show 95-100% accuracy. However, Machine Translation systems have shown an accuracy of only 60-65%. Deep Learning has recently shown much promise for NLP applications. The deep neural network, with its capability to learn distributed representations based on the similarity of words and its ability to learn multiple levels of representation, is encouraging for cognitive problem solving like natural language understanding and speech recognition. The days of using a computer with Malayalam input devices and interfaces, with Malayalam processing and understanding the way a human handles it, are not very far.


Data Completion using Cogent Confabulation

Sushma D K, M.Tech, Dept. of Computer Science, SJBIT, Bangalore

Introduction

Information is being generated on a daily basis and continuously uploaded onto the web in massive quantities. This information is of many types, ranging from simple text files to video files, and may sometimes be incomplete due to various reasons, like variations in electrical signals, intentional causes, unknowingly overwriting the data, etc. Cogent confabulation is a unique technique for completing the missing data. The cogent confabulation problem consists of two major tasks: (1) training the available data to generate knowledge bases (matrices storing the appearance counts of words in the training corpus); (2) querying over this trained data to predict the next word or phrase, given the starting few words of the phrase or sentence.

To overcome the difficulty of completing the missing information, the previously available data has to be first studied extensively. Later, this data can be trained at the semantic level using a particular model, and used in the completion of the missing data. Cogent confabulation is a new model of vertebrate cognition used for training and querying the available data. Confabulation is the process of selecting the one symbol (termed the conclusion of the confabulation) whose representing words happen to be receiving the highest level of excitation (appearance count). The confabulation product is the product of the probabilities of occurrence of the assumed words individually with the target word. Cogency is defined as the probability of occurrence of all the assumed words together with the target word. To predict and fill the missing data, the procedure starts with maximizing the cogency p(αβγδ|ε) and the confabulation product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε). The order in which the corpus words are chosen for training is meant to reflect the pattern of text contained in the expository text.
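As a concrete illustration of the training task described above, the following sketch builds a toy knowledge base of the kind the paper mentions: appearance counts between context words and a target word, normalised into conditional probabilities. The corpus, the window size and all names are assumptions of this sketch, not the paper's implementation.

    from collections import defaultdict

    def train_knowledge_base(sentences, window=4):
        # Count how often each context word appears within `window` positions
        # before a target word; normalise by the target's count to
        # approximate the conditional probability p(context | target).
        pair_counts = defaultdict(int)
        target_counts = defaultdict(int)
        for sentence in sentences:
            words = sentence.split()
            for i, target in enumerate(words):
                target_counts[target] += 1
                for context in words[max(0, i - window):i]:
                    pair_counts[(context, target)] += 1
        return {pair: n / target_counts[pair[1]] for pair, n in pair_counts.items()}

    kb = train_knowledge_base(["the cat sat on the mat", "the cat ate the fish"])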


The approach uses lexical analyses based on appearance counts, a retrieval measurement, to determine the extent of the possibility of a particular word being the conclusion word. Phrase completion and sentence continuation using the confabulation model should be useful for many text analysis tasks, including information retrieval and summarization, generating words and artificial stories, etc.

Confabulation is a new model of vertebrate cognition which mimics Hebbian learning and the information processing procedures of the human brain. Cognitive information processing is a direct evolutionary re-application of the neural circuits controlling movement, and thus functions just like movement. Conceptually, brains are composed of many "muscles of thought" (termed thalamocortical modules in mammals). A module contains symbols, each of which is a sparse collection of neurons which functions as a descriptor of the attribute of that module. For example, if the attribute of a module is color, then a single symbol represents a particular color. Each thalamocortical module is connected to many other modules. When two symbols are active simultaneously, they are said to co-occur, which creates the opportunity to associate the two symbols. For instance, after seeing a face and hearing a name together, the symbols representing each may become associated. Each strengthened unidirectional association between two symbols is termed a knowledge link. Collectively, knowledge links comprise all cognitive knowledge.

Each thalamocortical module performs the same information processing operation, which can be thought of as a contraction of a list of symbols, termed a confabulation. Throughout a confabulation, input excitation is delivered to the module through knowledge links from active symbols in other modules' lists of candidate conclusion symbols, driving the activation of these knowledge links' target symbols in the module performing the confabulation. When conclusion symbols contract, there is no physical movement in the brain; rather, the symbols currently on the list compete (based upon their relative excitation levels) for eventual exclusive activation (a so-called winner-take-all competition) within that module, and, as a result, the number of active symbols is gradually reduced.

Crucially, this contraction of the candidate conclusion symbol list in each thalamocortical module is externally


controlled by a thought control signal delivered to the module. A confabulation in a thalamocortical module is controlled by a graded analog control input. The thought control signal determines how many symbols remain in the competition, but has no effect on selecting which symbols are in the competition. Which symbols are in the competition is determined by the excitation level of each symbol as it dynamically reacts to knowledge link input from active symbols in other modules (which causes its excitation level to increase) or to a reduction or cessation of such input (which causes its excitation level to fall). Ultimately, the thought control signal is used to dynamically contract the number of active symbols in a module from an initial many less-active symbols to, at the end of the confabulation, a single maximally-active symbol. The resulting single active symbol is termed the confabulation conclusion.

The learned association between each symbol of a module and its set of action commands is termed skill knowledge. Skill knowledge is stored in the module, but the learning of these associations is controlled by subcortical brain nuclei. When a conclusion is reached in a module, those action commands which have a learned association from that conclusion symbol are instantly launched. These issued action commands are proposed as the source of all non-reflexive and non-autonomic behaviors. Thalamocortical modules performing confabulations, delivering excitation through knowledge links, and applying skill knowledge through the issuance of action commands constitute the complete foundation of all mammalian cognition. These basic concepts of cognition and confabulation are applied in the artificial intelligence and machine learning fields to predict, fill, and complete missing data in sentences. They can also be used for context representation, context exploitation, intelligent text recognition, and to generate artificial stories.

Methodology and applications

One of the methodologies used for data completion is confabulation, which was explained before. The other concept is cogency: the probability of occurrence of all the assumed fact words together with the assumed target word, i.e. p(αβγδ|ε).

A. Terminologies

According to the confabulation model, assuming that the combined assumed facts αβγδ are true, the set of all symbols (in the answer lexicon from which conclusions


are being sought) with p(αβγδ|ε) > 0 is called the expectation, the elements of which are termed answers, in the descending order of their cogencies. Confabulation uses maximization of the product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε) (or, equivalently, the sum of the logarithms of these probabilities) as a surrogate for maximizing cogency. It is assumed that all meaningful pairwise conditional probabilities p(Ψ|λ) between symbols Ψ and λ are known; this exhaustive knowledge assumption is required. Each non-zero p(Ψ|λ) is termed an individual item of knowledge.
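Building on the toy knowledge base from the earlier sketch, the querying step can be illustrated as follows: every candidate in the expectation (non-zero probability for all assumed words) is scored by the sum of log conditional probabilities, and the highest-scoring candidate is the conclusion. Again, the names and structure are this sketch's assumptions.

    import math

    def confabulate(kb, assumed_words, vocabulary):
        # Score each candidate conclusion by the confabulation product
        # p(a|e) * p(b|e) * ..., computed as a sum of logarithms; candidates
        # with any zero probability fall outside the expectation.
        best, best_score = None, float("-inf")
        for candidate in vocabulary:
            probs = [kb.get((w, candidate), 0.0) for w in assumed_words]
            if any(p == 0.0 for p in probs):
                continue
            score = sum(math.log(p) for p in probs)
            if score > best_score:
                best, best_score = candidate, score
        return best

    vocabulary = {target for (_, target) in kb}  # `kb` from the training sketch
    print(confabulate(kb, ["the", "cat"], vocabulary))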

B. Applications

These basic concepts of cognition and confabulation are applied in the artificial intelligence and machine learning fields to predict, fill, and complete missing data in sentences. They can also be used for context representation, context exploitation, intelligent text recognition, and to generate artificial stories.

Conclusion and Future Works

This paper is an initiative for better data completion and manipulation. The major requirement for the process is a previously complete text corpus. Previous works fill in the missing data based on Bayesian models, where the words are selected by increasing their posterior probability (selecting the conclusion which has the highest probability of being correct). Though Bayesian theory has been extensively used in neural networks, it is an awful choice for data completion, because it simply selects the word with the highest probability value, even if it is irrelevant to the given context. Many open source tools can be used for easy Bayesian model development. The basic concepts of cognition and confabulation are applied for context representation, context exploitation, intelligent text recognition, and to generate artificial stories. The longer execution times due to lexicon overhead can be reduced by parallel processing; applications that demand high throughput will have to evaluate the proposed confabulation method depending on the hardware available.

REFERENCES

[1] Robert Hecht-Nielsen, Confabulation Theory: The Mechanism of Thought, 2007.

[2] Qinru Qiu, Qing Wu, Daniel J. Burns, Michael J. Moore, Robinson E. Pino, Morgan Bishop, and Richard W. Linderman, Confabulation Based Sentence Completion for Machine Reading, IEEE, 2011.

[3] Qinru Qiu, Qing Wu, and Richard Linderman, Unified Perception-Prediction Model for Context Aware Text Recognition on a Heterogeneous Many-Core Platform, Proceedings of the International Joint Conference on Neural Networks, San Jose, California, USA, July 31 - August 5, 2011.

[4] Fan Yang, Qinru Qiu, Morgan Bishop, and Qing Wu, Tag-assisted Sentence Confabulation for Intelligent Text Recognition, IEEE, 2012.

[5] Darko Stipaničev, Ljiljana Šerić, Maja Braović, Damir Krstinić, Toni Jakovčevic, Maja Štula, Marin Bugarić and Josip Maras, Vision Based Wildfire and Natural Risk Observers, IEEE, 2012.

Computer chatbot 'Eugene Goostman' passes the Turing test

Eugene Goostman is a chatterbot, first developed by a group of three programmers (the Russian-born Vladimir Veselov, Ukrainian-born Eugene Demchenko, and Russian-born Sergey Ulasen) in Saint Petersburg in 2001. Goostman is portrayed as a 13-year-old Ukrainian boy, a trait that is intended to induce forgiveness in users for his grammar and level of knowledge. The Goostman bot has competed in a number of Turing test contests since its creation, and finished second in the 2005 and 2008 Loebner Prize contests. In June 2012, at an event marking what would have been the 100th birthday of its namesake, Alan Turing, Goostman won what was promoted as the largest-ever Turing test contest, successfully convincing 29% of its judges that it was human. On 7 June 2014, at a contest marking the 60th anniversary of Turing's death, 33% of the event's judges thought that Goostman was human; the event's organizer Kevin Warwick considered it to have "passed" Turing's test as a result, per Turing's prediction that by the year 2000, machines would be capable of fooling 30% of human judges after five minutes of questioning.



Malayalam POS Tagging Using Conditional Random Fields

Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V
Dept. of Computer Science and Engg, Sreepathy Institute of Management and Technology, Vavanoor

Parts-of-speech tagging is the process of assigning tags to the words of a given sentence. This paper presents the building of a Part-Of-Speech (POS) tagger for the Malayalam language using Conditional Random Fields (CRF). A POS tagger plays an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction. The present tagset consists of 100 tags. The system consists of a language model, which is trained on an annotated corpus of 3026 sentences (36,315 words); this model checks the trigram possibility of occurrence of a tag in the training corpus. We present a trigram HMM-based (Hidden Markov Model) part-of-speech tagger for Malayalam, which accepts raw text and produces a POS tagged output. The accuracy of the system can be improved by increasing the size of the annotated corpus. Although the experiments were performed on a small corpus, the results show that the statistical approach works well with a highly agglutinative language like Malayalam.

I. Introduction

India is a large multilingual country of diverse culture. It has many languages with written forms and over a thousand spoken languages. The Constitution of India recognizes 22 languages, spoken in different parts of the country. The languages can be categorized into two major linguistic families, namely Indo-Aryan and Dravidian. These classes of languages have some important differences: their ways of developing words and grammars are different. But both include a lot of Sanskrit words, and both have a similar construction and phraseology that links them together. There is a need to develop information processing tools to facilitate human-machine interaction in Indian languages, and multilingual knowledge resources. A POS tagger forms an integral part of any such processing tool to be developed.

Parts-of-Speech tagging, or grammatical tagging, is the process of marking the words in a text as corresponding to a particular part of speech, based on their definition and context, i.e. the relationship with adjacent and related words in a phrase, sentence, or paragraph. This is the first step towards understanding any language. It finds its major applications in speech and


NLP, like Speech Recognition, Speech Synthesis, Information Retrieval, etc. A lot of work has been done relating to this in the NLP field, particularly in part-of-speech tagging of western languages. These taggers vary in accuracy and also in their implementation. A lot of techniques have also been explored to make tagging more and more accurate. These techniques vary from being purely rule-based in their approach to being completely stochastic. Some of these taggers achieve good accuracy for certain languages. But unfortunately, not much work has been done with regard to Indian languages, especially Malayalam. In this paper we have developed a POS tagger based on Conditional Random Fields (CRF). Conditional Random Fields (CRFs), undirected graphical models, are used to calculate the conditional probability of values on designated output nodes given values on other designated input nodes. It can be seen that the generative model (HMM) performs quite close to CRF here. A Hidden Markov Model (HMM) is a statistical model in which the system modeled is thought to be a Markov process with unknown parameters. The assumptions on which it works are that the probability of a word in a sequence may depend on the word immediately preceding it, and that both the observed and hidden words must be in a sequence. It has been experimentally shown that the accuracy of the POS tagger can be improved significantly by introducing a trigram template, an efficient corpus and a widely accepted tagset. This paper mainly concentrates on designing a POS tagger using CRF++, which is an open source implementation of CRF.

System Architecture

The system consists of three modules, namely Preprocessing, Training and Testing. The architecture of the proposed system is depicted in Fig. 1.

Fig. 1: System Architecture

Preprocessing is the initial stage in the implementation of a Malayalam POS tagger. In preprocessing, the system takes the input sentence and tokenizes it, i.e. it


receives the input as a sentence and splits it into words. The split words are stored in a file, which becomes the input for the testing phase.

The proposed method uses a supervised mode of learning for POS tagging. The simplest statistical approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text. This is done during the training phase. Since we are using a statistical approach for POS tagging, i.e. we train and then test our model, we have to calculate the frequency and probability of the words of the given corpus. During training, the annotated corpus is trained using the CRF template. All the files required for performing the procedure are explained with examples below. For training, the crf_learn command is used as shown below:

crf_learn template_file train_file model_file

The template file provides the correct position of the words to be tagged. If a unigram template is used, then a word is tagged based on the current word. If bigram statistics are used, then the tagging is done based on the current and previous words. On increasing the number of grams, the template becomes more useful and tagging becomes more efficient. The template file can be represented as shown in Fig. 2.

Fig. 2: CRF Template

The tagset used in the system was developed by the Bureau of Indian Standards. A tagset is a list of short forms representing the components of a sentence, like nouns, verbs and their sub-forms. The corpus training is performed with the help of the tagset; the tagset used in the system contains 100 tags. The training creates a model file as output, which contains the learned probabilities of the corpus.
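To make the training setup concrete, here is a small illustrative CRF++ template and data format. These exact feature windows are an assumption of this sketch, not necessarily the template the authors used (theirs is shown in Fig. 2).

    # Hypothetical CRF++ template; lines starting with '#' are comments.
    # %x[row,col] picks the token at the given offset from the current word
    # (column 0 is the word itself). U lines define unigram features.
    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[-1,0]/%x[0,0]
    B

    # The train_file holds one "word<TAB>tag" pair per line,
    # with a blank line separating sentences.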


Fig. 3: Training Corpus

Testing is the process of comparing the trained model file with the tokenized input file and, obeying the rules in the template and the tags in the tagset, finding out the corresponding tags of each and every word. For testing, the crf_test command is used:

crf_test -m model_file test_file

A screenshot of the proposed system is given in Fig. 4.

Fig. 4: Screenshot
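As an illustration of how a Python application can drive CRF++, the hypothetical wrapper below writes the tokenized input in CRF++'s one-token-per-line test format, runs crf_test, and collects the predicted tags. The file names and the lack of error handling are this sketch's assumptions.

    import subprocess

    def tag_sentence(sentence, model_file="model_file"):
        # CRF++ expects one token per line; crf_test appends the predicted
        # tag as the last tab-separated column of each output line.
        with open("test.data", "w", encoding="utf-8") as f:
            f.write("\n".join(sentence.split()) + "\n")
        result = subprocess.run(["crf_test", "-m", model_file, "test.data"],
                                capture_output=True, text=True, check=True)
        return [tuple(line.split("\t"))
                for line in result.stdout.splitlines() if line.strip()]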


Conclusion

The motivation of this project is to help children and foreigners to learn the structure of a Malayalam sentence. By using the Python programming language as the development environment, the application was built keeping in mind design standards and the maintainability of the code. CRF++, the Hidden Markov Model, and Tkinter features provide a rich user experience to the users of the software. This application is very simple to use and is helpful to people who are in the preliminary stage of learning the Malayalam language.

References

1. Rajeev R R, Jisha P Jayan, "Part of Speech Tagger for Malayalam", Computer Engineering and Intelligent Systems, Vol. 2, No. 3, June 2011.

2. Christopher D. Manning, Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.

3. Asif Ekbal, Rejwanul Haque, Sivaji Bandyopadhyay, "Bengali Part of Speech Tagging using Conditional Random Fields", Department of CSE, Jadavpur University, Kolkata-700032, India.

4. Anish, "Part of Speech Tagging for Malayalam", Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore-641105.

5. Steven Bird, et al., Natural Language Processing with Python, O'Reilly Publications, 2011.

6. Michael Collins, "Tagging with Hidden Markov Models", Lecture Notes.

7. CRF++ Home Page, https://code.google.com/p/crfpp/, last visited March 2014.

Making the world's knowledge computable: Wolfram Alpha introduces a fundamentally new way to get knowledge and answers, not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.

http://www.wolframalpha.com/



Topic Modelling and the LDA Algorithm

Indu M, Dept. of Computer Science, Govt. Engineering College, Sreekrishnapuram

A topic model automatically captures the thematic patterns of text streams, identifies emerging topics and tracks their changes over time. It is used to check models, summarize the corpus, and guide exploration of its contents. Topic modelling can enhance information network construction by grouping similar objects, event types and roles together.

Introduction

As more electronic documents become available, it becomes more difficult to find and discover what we need. We need new tools to help us organize, search, and understand these vast amounts of information.

Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics; then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic.

Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. One of the simplest topic models is latent Dirichlet allocation (LDA). The intuition behind LDA is that documents exhibit multiple topics. Most topic models, such as LDA [1], are unsupervised: only the words in the documents are modeled, and the goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection.

The main application of topic modeling is in Information Extraction. It can also be used to analyze, summarize, and categorize a stream of text data at the time of its arrival. For example, as news arrives in streams, organizing it as threads of relevant articles is more efficient and convenient.

Topic modelling programs do not know anything about the meaning of the words in a text. Instead, they assume that any piece of text is composed (by an author) by


selecting words from possible baskets of words, where each basket corresponds to a topic. If that is true, then it becomes possible to mathematically decompose a text into the probable baskets from whence the words first came.

These are the basic ideas behind latent Dirichlet allocation (LDA), the simplest topic model [3]. The intuition behind LDA is that documents exhibit multiple topics. For example, consider the article in Figure 1 [1]. This article, entitled "Seeking Life's Bare (Genetic) Necessities," is about using data analysis to determine the number of genes that an organism needs to survive. In it, the different words used in the article are highlighted: words about data analysis, such as "computer" and "prediction," are highlighted in blue; words about evolutionary biology, such as "life" and "organism," in pink; and words about genetics, such as "sequenced" and "genes," in yellow. The left side of the figure represents a number of topics. Each document is assumed to be generated as follows: first choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the coloured coins) and choose the word from the corresponding topic.

To make a new document in topic modelling, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic. The model specifies the following distribution over words within a document:

p(w_i) = Σ_{j=1}^{T} p(w_i | z_i = j) · p(z_i = j)

where T is the number of topics, p(z) is the distribution over topics z in a particular document, and p(w|z) is the probability distribution over words w given topic z. LDA models can be used to find topics that describe a corpus, with each document exhibiting multiple topics.

[Figure: Graphical model of LDA]


Methodology and applications

One of the methodologies used for topic detection is LDA, which was explained before. Theoretical studies of topic modelling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they need either to assume that each document contains only one topic, or else they can only recover the span of the topic vectors instead of the topic vectors themselves. Other probabilistic models, such as naive-Bayes Dirichlet Compound Multinomial (DCM) mixtures and probabilistic Latent Semantic Indexing (pLSI), were also used to find the relevant topics. We can also simplify the topic distribution by modelling each topic as a discrete probability distribution over words.

A. Algorithm

The corpus contains a collection of documents. For each document in the collection, we generate the words in a two-stage process:

1. Randomly choose a distribution over topics.
2. For each word in the document:
(a) Randomly choose a topic from the distribution over topics in step #1.
(b) Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportions (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics.
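The two-stage process can be written down directly. The sketch below samples toy documents from this generative story using symmetric Dirichlet priors; the dimensions and priors are illustrative assumptions of this sketch.

    import numpy as np

    rng = np.random.default_rng(0)
    T, V, doc_len = 3, 10, 8             # topics, vocabulary size, words per document
    beta = rng.dirichlet(np.ones(V), T)  # one word distribution per topic

    def generate_document():
        theta = rng.dirichlet(np.ones(T))           # step 1: topic proportions
        words = []
        for _ in range(doc_len):
            z = rng.choice(T, p=theta)              # step 2(a): choose a topic
            words.append(rng.choice(V, p=beta[z]))  # step 2(b): word from that topic
        return words

    print(generate_document())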


In order to evaluate the predictive power of a generative model on unseen data, there is a standard measure known as perplexity. By this, it is possible to predict and relate words from different languages.
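For reference (this definition is standard and supplied here, not taken from the article), the perplexity of a held-out test set D_test is the inverse geometric mean of the per-word likelihood:

perplexity(D_test) = exp( - (Σ_d log p(w_d)) / (Σ_d N_d) )

where w_d are the words of test document d and N_d is its length; lower perplexity indicates better generalization.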

B. Applications

Topic models are good for data exploration, when there is some new data set and you don't know what kinds of structures you could possibly find in it. But if you did know what structures you could find in your data set, topic models are still useful if you don't have the time or resources to construct classification models based on supervised machine learning. Lastly, if you did have the time and resources to construct classification models based on supervised learning, topic models would still be useful as extra features to add to the models in order to increase their accuracy. This is the case because topic models act as a kind of "smoothing" that helps combat the sparse data problem that is often seen in supervised learning.

Topic modeling can be used in computer vision; as an inference algorithm applied to natural texts in the service of text retrieval, classification and organization; and to build text hierarchies. It can also be used in WSD, machine learning and information engineering applications, in scientific applications such as genetics and neuroscience, and to organize, summarize, and help users explore large corpora.

Conclusion and Future Works

Documents are partitioned into topics, which in turn have terms associated to varying degrees. However, in practice there are some clear issues: the models are very sensitive to the input data, and small changes to the stemming/tokenization algorithm can result in completely different topics; topics need to be manually categorized in order to be useful; and topics are "unstable" in the sense that adding a new document can cause a significant change to the topic distribution.

REFERENCES

[1] D. Blei, Introduction to probabilistic topic models, Communications of the ACM, pp. 77-84, 2012.

[2] Vivi Nastase, Introduction to topic models: Building up towards LDA, summer semester 2012.

[3] D. Blei et al., Supervised topic models, Princeton University.

[4] David Hall et al., Studying the History of Ideas Using Topic Models, Stanford University.


Tamil to Malayalam Transliteration

Kavitha Raju, Sreerekha T V, Vidya P V
M.Tech CL, GEC, Sreekrishnapuram, Palakkad

Abstract: Transliteration can form an essential part of transcription, which converts text from one writing system to another. This article discusses the applications of and challenges in machine transliteration from Tamil to Malayalam, two languages that belong to the Dravidian family. Transliteration can be used to supplement the machine translation process by handling the issues that can arise due to the presence of named entities.

I. Introduction

The rewriting or conversion of the characters of a text from one writing system to another writing system is called transliteration. Here each character of the source language is assigned a different, unique character of the target language, so that an exact inversion is possible. If the source language consists of more characters than the target language, combinations of characters and diacritics can be used. Machine transliteration systems have great importance in a country like India, which has a fascinating diversity of languages. Even though there are groups of languages that come from a common origin, the difference in scripts makes the task cumbersome.

Machine transliteration systems can be classified into rule-based and corpus-based approaches, in terms of their core methodology. Rule-based systems develop linguistic rules that allow the words to be put in different places, to have different scripts depending on context, etc. The main approach of these systems is based on linking the structure of the given input with the structure of the demanded output, necessarily preserving their unique meaning. Minimally, to get a transliteration of a source language sentence, one needs a dictionary that will map each source language word to an appropriate target language word, rules representing source language word structure, and rules representing target language word structure. Because rule-based machine transliteration uses linguistic information to mathematically break down the source and target languages, it is more predictable and grammatically superior to the statistical method.

Statistical machine transliteration employs a statistical model, based on the analysis of a corpus, to generate text in the target language. The idea behind statistical machine transliteration comes from information theory: a document is transliterated according to the probability distribution p(e|f) that a string e in the target language is the transliteration of a string f in the source language.

There are two primary motivations behind pursuing this project. The first is that in India, the majority of the population still use their mother tongue as the medium of communication; the second is that, in spite of globalization and the wide-spread influence of the West in India, most people still prefer to read, write and talk in their mother tongue.

II. Language Features

Malayalam is a language spoken in India, predominantly in the state of Kerala. It is one of the 22 scheduled languages of India and was designated a Classical Language in India in 2013. Malayalam has official language status in the state of Kerala and in the union territories of Lakshadweep and Puducherry. It belongs to the Dravidian family of languages. The origin of Malayalam, whether it was from a dialect of Tamil or an independent offshoot of the Proto-Dravidian language, has been and continues to be an engaging pursuit among comparative historical linguists.

Tamil is a Dravidian language spoken predominantly by the Tamil people of South India and North-east Sri Lanka. It has official status in the Indian states of Tamil Nadu and Puducherry and in the Andaman and Nicobar Islands. Tamil is also an official language of Sri Lanka and an official language of Singapore. It is legalised as one of the languages of medium of education in Malaysia, along with English, Malay and Mandarin. It is also chiefly spoken in the states of Kerala, Karnataka and Andhra Pradesh and in the Andaman and Nicobar Islands as one of the secondary languages. It is one of the 22 scheduled languages of India and was the first Indian language to be declared a classical language by the Government of India, in 2004.

Both Tamil and Malayalam are languages of the southern states of India. Even though Malayalam is said to belong to the Dravidian family of languages, it is more similar to the Arya language. When the Aryans came to the northeast border of Bharatam, the people of Harappa and Mohenjo-daro moved east and south, and they replanted their civilisation in South India, based in Tamil Nadu. Tamil is one of the oldest languages. When the Dravidians moved south, there were people of the soil to receive and help them, and the geography helped them to receive Tamil. The Dravidians settled in Tamil Nadu and developed Tamil literature and Tamil civilization. In comparison to all Indian languages, Tamil has only 12 vowels and 18 consonants. Malayalam is one of the most updated languages, with clarity in voice, and is comparatively more difficult to study; it has 56 letters, and its vowels most closely resemble Sanskrit. Since Tamil has only a few alphabets in comparison to Malayalam, it is not possible to have a one-to-one mapping between these languages: a letter in Tamil can be transliterated as more than one letter in Malayalam.

III. Applications and Challenges

The various applications of a machine transliteration system are as follows:

• It aids machine translation.
• It helps to eliminate language barriers.
• It supports localization.
• It enables the use of a keyboard in a given script to type in a text in another one.

Transliteration also has many challenges:

• Dissimilar alphabet sets of the source and target languages.
• Multiple possible transliterations for the same word.
• Finding exactly matching tokens is difficult for some of the vowels and a few consonants.
• The size of the corpus required to build an accurate transliteration system is very large.
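The challenges above can be made concrete with a naive character-level baseline. The Tamil (U+0B80-U+0BFF) and Malayalam (U+0D00-U+0D7F) Unicode blocks share the same ISCII-derived layout, so corresponding letters sit at a fixed codepoint offset. The sketch below exploits this, but, as noted above, such a one-to-one mapping cannot resolve the cases where one Tamil letter corresponds to several Malayalam letters; it is an illustrative baseline, not the system discussed here.

    OFFSET = 0x0D00 - 0x0B80  # fixed distance between the two Unicode blocks

    def transliterate(tamil_text):
        # Shift every Tamil codepoint into the Malayalam block; characters
        # outside the Tamil block (digits, punctuation) pass through unchanged.
        # Caveat: a few Tamil codepoints have no Malayalam counterpart, and
        # the one-to-many consonant ambiguities are left unresolved.
        return "".join(
            chr(ord(ch) + OFFSET) if 0x0B80 <= ord(ch) <= 0x0BFF else ch
            for ch in tamil_text
        )

    print(transliterate("தமிழ்"))  # -> തമിഴ്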

IV. Conclusion

In this article, we discussed Tamil to Malayalam transliteration systems in general. Various applications of transliteration and the challenges associated with it were also pointed out. Handling named entities is inevitable for a machine translation system; in the case of dictionary-based translation systems, transliteration is very useful, as it would save a lot of time and resources. But it has been observed that a large corpus is required to model the system accurately.

REFERENCES

[1] K Saravanan, Raghavendra Udupa, A Kumaran, Crosslingual Information Retrieval System Enhanced with Transliteration Generation and Mining, Microsoft Research India, Bangalore, India.

[2] Rishabh Srivastava and Riyaz Ahmad Bhat, Transliteration Systems Across Indian Languages Using Parallel Corpora, Language Technologies Research Center, IIIT-Hyderabad, India.

[3] R. Akilan and Prof. E. R. Naganathan, Morphological Analyzer for Classical Tamil Texts: A Rule-based Approach.


Memory-Based Language Processing and Machine Translation

Nisha M, M.Tech CL, GEC, Sreekrishnapuram, Palakkad

Memory-based language processing (MBLP) is an approach to language processing based on exemplar storage during learning and analogical reasoning during processing. From a cognitive perspective, the approach is attractive because it does not make any assumptions about the way abstractions are shaped, and does not make any a priori distinction between regular and exceptional exemplars, allowing it to explain the fluidity of linguistic categories, and irregularization as well as regularization in processing. Memory-based machine translation can be considered a form of Example-Based Machine Translation. The machine translation problem can be treated as a classification problem, and hence memory-based learning can be applied. This paper demonstrates a memory-based approach to machine translation.

I. Introduction

Memory-based language processing, MBLP, is based on the idea that learning and processing are two sides of the same coin. Learning is the storage of examples in memory, and processing is similarity-based reasoning with these stored examples.

MBLP finds its computational basis in the classic k-nearest neighbor classifier (Cover & Hart, 1967). With k = 1, the classifier searches for the single example A in memory that is most similar to a new input B, and then copies A's memorized mapping A' to B' (as visualized schematically in Figure 1). With k set to higher values, the k nearest neighbors to B are retrieved, and some voting procedure (such as majority voting) determines which value is copied to B'.
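A minimal sketch of this classification step, with overlap counting standing in for the similarity function (the data structures and names here are assumptions of this sketch):

    from collections import Counter

    def knn_classify(memory, features, k=1):
        # memory: list of (feature_tuple, label) exemplars.
        # Similarity here = number of matching feature positions.
        def similarity(exemplar):
            stored_features, _ = exemplar
            return sum(a == b for a, b in zip(stored_features, features))
        nearest = sorted(memory, key=similarity, reverse=True)[:k]
        # With k = 1 this simply copies the nearest exemplar's mapping;
        # with larger k the neighbours vote by majority.
        return Counter(label for _, label in nearest).most_common(1)[0][0]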


Memory-based machine translation can be considered as a form of Example-Based Machine Translation. In characterizing Example-Based Machine Translation, Somers (1999) refers to the common use of a corpus or database of already translated examples, and the process of matching new input against instances in this database. This matching is followed by the extraction of fragments, which are then recombined to form the final translation.

The task of mapping one language to the other can be treated as a classification problem. The method can be described as one in which the sentence to be translated is decomposed into smaller fragments, each of which is passed to a classifier for classification. The memory-based classifier is trained on the basis of a parallel corpus in which the sentence pairs have been reduced to smaller fragment pairs. The assigned classes thus correspond to the translations of the fragments. In the final step, these translated fragments are re-assembled to derive the final translation of the input sentence.

II. Background and Related Literature

Natural language processing models and systems typically employ abstract linguistic representations (syntactic, semantic, or pragmatic) as intermediate working units. Memory-based models enable asking the question whether we can do without them, since any invented intermediate structure is always implicitly encoded somehow in the words at the surface and the way they are ordered; memory-based models may be capable of capturing, in an implicit way, the knowledge that is usually considered to be necessary, so that it does not need to be explicitly computed.

Classes of natural language processing tasks in which this question can be investigated in the extreme are processes in which form is mapped to form, i.e., in which neither the input nor the output contains abstract elements to begin with, such as machine translation. Many current machine translation tools, such as the open source Moses toolkit (Koehn et al., 2007), indeed implement a direct mapping of source to target text, leaving all of syntax and semantics implicit; they hide in the form of statistical translation models between collocationally strong phrases, and of statistical language models of the target language. The MBLP approach to this problem involves using context on the source side, and using memory-based classification as a translation model (Van Gompel, Van den Bosch, & Berck, 2009).

There is an encouraging number of recent studies that attempt to link statistical and memory-based models of language that focus on discovering strong n-grams (for phrase-based statistical machine translation or for statistical language modeling) to the concept of constructions, and to the question of to what extent human language users exploit constructions. To mention two: Mos, Van den Bosch, and Berck have reported that a memory-based language model shows a reasonable correlation with the unit segmentations that test subjects generate in a sentence copy task; the model implicitly captures several strong complex lexical items (constructions), although it fails to capture long-distance dependencies, a common issue with local n-gram based statistical models. In a related study, Arnon and Snider (2010) show that subjects are sensitive to the frequency (a rough approximation of collocational strength) of four-word n-grams such as 'don't have to worry', which are processed faster when they are more frequent. Their argument is again focused on the question of whether strong subsequences need to have linguistic structure that assumes hierarchy, or could simply be taken to be flat n-grams; it is exactly this question that we aim to explore further in our work with memory-based language processing models.

III. Methodology

The process of translating a new sentence is divided into a local phase (corresponding to the first two steps in the process), in which memory-based translation of source trigrams to target trigrams takes place, and a global phase (corresponding to the third step), in which a translation of the sentence is assembled from the local predictions.

A. Local classification

Both in training and in actual translation, when a new sentence in the source language is presented as input, it is first converted into windowed trigrams, where each token is taken as the center of a trigram once. The first trigram of the sentence contains an empty left element, and the last trigram contains an empty right element. At training time, each source language sentence is accompanied by a target language translation. Word alignment should be performed before this step, so that the classifier knows for each source word whether it maps to a target word, and if so, to which. Given the alignment, each source trigram is mapped to a target trigram of which the middle word is the target word to which the word in the middle of the source trigram aligns, and the left and right neighboring words of the target trigram are the center word's actual neighbors in the target sentence.
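The local phase can be sketched as follows; the padding symbol, the exact-match lookup and all names are assumptions of this illustration, which omits the statistical word alignment and the global overlap-resolution phase described next.

    def windowed_trigrams(tokens, pad="_"):
        # Each token becomes the centre of exactly one trigram; the first
        # and last trigrams get an empty (padding) element at the edge.
        padded = [pad] + tokens + [pad]
        return [tuple(padded[i:i + 3]) for i in range(len(tokens))]

    def translate_local(memory, sentence):
        # memory: dict mapping source trigrams to aligned target trigrams,
        # built from a word-aligned parallel corpus at training time.
        # A real system backs off to nearest neighbours on a miss; this
        # sketch simply returns None for unseen trigrams.
        return [memory.get(trigram) for trigram in windowed_trigrams(sentence.split())]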


the center word‘s actual neighbors in the

with the first three Dutch words. Fourth

target sentence.

predicted English trigram, however, overlaps to its left with the fifth predicted trigram, in one position, and overlaps in two positions to the right with the sixth predicted trigram, suggesting that this part of the English sentence is positioned at the end. Note that in this example, the ―fertility‖ words take and this, which are not aligned in the training trigram mappings (cf. Figure 1), play key roles in establishing trigram overlap.

Figure2: an example training pair of sentences, converted into six overlappingtrigrams with their aligned trigram translations.[1] When translating new text, trigram outputs are generated for all words in each new source language sentence to be translated, since our system does not have clues as to which words would be aligned by statistical
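The following toy sketch (ours, not the authors' code) illustrates the local phase under simplifying assumptions: an exact-match trigram memory with a back-off to the middle word stands in for the TiMBL-style k-nearest-neighbour classification used in actual memory-based MT systems, and a one-to-one word alignment is assumed in place of statistical word alignment. The Dutch-English pair is a made-up example.

EMPTY = None  # empty element at sentence boundaries

def trigrams(tokens):
    """Window a sentence so that each token is the centre of one trigram."""
    padded = [EMPTY] + tokens + [EMPTY]
    return [tuple(padded[i:i + 3]) for i in range(len(tokens))]

memory = {}  # source trigram -> aligned target trigram

def train(src_sentence, tgt_sentence):
    # Toy assumption: a one-to-one word alignment; real systems obtain
    # the alignment from statistical word alignment.
    for s_tri, t_tri in zip(trigrams(src_sentence), trigrams(tgt_sentence)):
        memory[s_tri] = t_tri

def classify(src_trigram):
    """Local classification: exact match, else back off to the middle word."""
    if src_trigram in memory:
        return memory[src_trigram]
    for stored, tgt in memory.items():
        if stored[1] == src_trigram[1]:    # same focus word, other context
            return tgt
    return (EMPTY, src_trigram[1], EMPTY)  # leave the word untranslated

train("ik zie hem".split(), "I see him".split())
print(classify(("wij", "zie", "hem")))     # -> ('I', 'see', 'him')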

B. Global search

To convert the set of generated target trigrams into a full sentence translation, the overlap between the predicted trigrams is exploited. Figure 3 illustrates a perfect case of a resolution of the overlap (drawing on the example of Figure 2), causing words in the English sentence to change position with respect to their aligned Dutch counterparts. The first three English trigrams align one-to-one with the first three Dutch words. The fourth predicted English trigram, however, overlaps to its left with the fifth predicted trigram in one position, and overlaps in two positions to the right with the sixth predicted trigram, suggesting that this part of the English sentence is positioned at the end. Note that in this example, the "fertility" words take and this, which are not aligned in the training trigram mappings (cf. Figure 1), play key roles in establishing trigram overlap. A simplified sketch of overlap-driven assembly is given below.
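The sketch below is a deliberately simplified stand-in for the global phase: a greedy left-to-right chaining of predicted trigrams on two-word overlap, rather than the full overlap-resolution search the article describes. The trigram data is the toy example from the previous sketch.

def assemble(predicted):
    """Greedily chain trigrams whose first two words equal the last
    two words of the sentence built so far."""
    remaining = list(predicted)
    sent = list(remaining.pop(0))
    while remaining:
        for tri in remaining:
            if sent[-2:] == list(tri[:2]):   # two-position overlap
                sent.append(tri[2])
                remaining.remove(tri)
                break
        else:                                # no overlap found: append as-is
            sent.extend(remaining.pop(0))
    return [w for w in sent if w is not None]

tris = [(None, "I", "see"), ("I", "see", "him"), ("see", "him", None)]
print(" ".join(assemble(tris)))              # -> "I see him"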

IV. Conclusion and Future Works

The study described in this paper has demonstrated how memory-based learning can be applied to machine translation. Memory-based learning stores all examples in memory, settling for a state of maximal description length. This extreme bias makes memory-based learning an interesting comparative case against so-called eager learners, such as decision tree induction algorithms and rule learners. Phrase-based memory-based machine translation has also been implemented, using statistical toolkits such as Moses for phrase extraction. Future work includes developing memory-based techniques for phrase extraction.

REFERENCES

[1] A. van den Bosch and P. Berck, "Memory-based machine translation and language modeling", The Prague Bulletin of Mathematical Linguistics, 91:17–26, 2009.

[2] Antal van den Bosch and Walter Daelemans, Memory-Based Language Processing, Cambridge University Press, New York.

[3] Antal van den Bosch and Walter Daelemans, "Implicit schemata and categories in memory-based language processing".

The Stanford Natural Language Processing Group

The Natural Language Processing Group at Stanford University is a team of faculty, research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. The work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

The Stanford NLP Group makes parts of their Natural Language Processing software available to everyone. These are statistical NLP toolkits for various major computational linguistics problems, and they can be incorporated into applications with human language technology needs. A distinguishing feature of the Stanford NLP Group is their effective combination of sophisticated and deep linguistic modeling and data analysis with innovative probabilistic and machine learning approaches to NLP. Their research has resulted in state-of-the-art technology for robust, broad-coverage natural language processing in many languages. These technologies include their competition-winning coreference resolution system; a state-of-the-art part-of-speech tagger; a high-performance probabilistic parser; a competition-winning biological named entity recognition system; and algorithms for processing Arabic, Chinese, and German text.

All the software is written in Java. All recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code. A number of helpful people have extended the work with bindings or translations for other languages, so much of this software can also easily be used from Python (or Jython), Ruby, Perl, Javascript, and F# or other .NET languages.

Link: http://nlp.stanford.edu/software/



Dialect Resolution

Manu V. Nair, Sarath K. S
Department of Computer Science and Engineering
Govt. Engg. College Sreekrishnapuram
Palakkad, India - 678 633

Abstract: A dialect is a regional or social variety of a language distinguished by pronunciation, grammar, or vocabulary, especially a way of speaking that differs from the standard variety of the language. It is a recognized and formal variant of the language spoken by a large group from one region, class, or profession. Slang, by contrast, consists of a lexicon of non-standard words and phrases in a given language. Dialect resolution is an approach to render a dialect utterance or word in its formal form without losing its semantics. It is a localized approach through which a local person can express ideas in his or her own style and have them converted into a formal format; it also resolves slang words.

I. Introduction

India is a linguistically highly diverse nation, with 18 officially recognized languages and many other unofficial ones. This diversity exceeds that of China (7 languages and hundreds of dialects), even though India covers only about one third of China's area. A dialect is a form of a language spoken in a particular geographical area or by members of a particular social class or occupational group, distinguished by its vocabulary, grammar, and pronunciation. The term is applied most often to regional speech patterns, but a dialect may also be defined by other factors, such as social class. A dialect associated with a particular social class can be termed a sociolect, a dialect associated with a particular ethnic group an ethnolect, and a regional dialect a regiolect or topolect. According to this definition, any variety of a language constitutes "a dialect", including any standard varieties. A standard dialect is a dialect that is supported by institutions.

Slang consists of words, expressions, and meanings that are informal and are used by people who know each other very well or who have the same interests. It includes mostly expressions that are not considered appropriate for formal occasions, and is often vituperative or vulgar. Use of these words and phrases is typically associated with the subversion of a standard variety and is likely to be interpreted by listeners as implying particular attitudes on the part of the speaker. In some contexts a speaker's selection of slang words or phrases may convey prestige, indicating group membership or distinguishing group members from those who are not a part of the group.

Among Indian languages, Malayalam is a highly inflectional language. Political and geographical isolation, the impact of Christianity and Islam, and the arrival of the Namboothiri Brahmins a little over a thousand years ago all created conditions favourable to the development of the local dialect Malayalam. The Namboothiris grafted a good deal of Sanskrit onto the local dialect and influenced its physiognomy. The Malayalam language itself contains many dialect variations; each district of Kerala has its own dialectal form of the language.

II. Dialect Resolution

Dialect resolution is an approach to render a dialect utterance or word in its formal form without losing its semantics. It is a localized approach through which a local person can express ideas in his or her own style and have them converted into a formal format. Highly informal slang words also have to be resolved.

From the computational point of view, dialect resolution is a difficult task: there are different dialect variations even within a single language. Computational methods for dialect resolution include rule-based, statistical, and machine learning approaches. We can also use a hybrid approach, i.e., a mixture of the above-mentioned approaches. A hybrid approach retains the strengths of the individual approaches, and may therefore achieve higher accuracy; a minimal sketch of such a hybrid step is given below.
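The following sketch illustrates one possible hybrid resolution step, under stated assumptions: the dialect-to-formal mappings and the romanized word forms are hypothetical placeholders, not drawn from an actual Thrissur-dialect lexicon, and the "statistical" part is reduced to unigram frequencies from a tiny formal-sentence corpus.

from collections import Counter

# Rule level: a hand-built dialect/slang dictionary (hypothetical entries).
DIALECT_LEXICON = {
    "entha": "enthaanu",   # hypothetical colloquial -> formal mapping
    "illya": "illa",
}

# Statistical level: unigram counts from a formal-sentence corpus,
# used to pick among candidate normalizations.
formal_corpus = ["enthaanu venDathu", "athu illa", "athu enthaanu"]
unigram = Counter(w for sent in formal_corpus for w in sent.split())

def resolve(word, candidates):
    """Hybrid resolution: dictionary first, then corpus frequency."""
    if word in DIALECT_LEXICON:          # rule-based pass
        return DIALECT_LEXICON[word]
    if candidates:                       # statistical fallback
        return max(candidates, key=lambda c: unigram[c])
    return word                          # leave unknown words untouched

print(resolve("entha", []))              # -> "enthaanu"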

III. Applications

A. Localization

Language localization is the process of adapting a product that has been previously translated into different languages to a specific country or region. Localization can be done for regions or countries where people speak different languages or where the same language is spoken.

Dialect resolution has an important role in localization. Since the morphologically rich Malayalam language has different types of dialects, dialect resolution has to be incorporated into the localization process. Different tribal people can express their ideas and knowledge to the outside world in their own language, and the dialect resolution system will convert it into the formal language to which it belongs. This will encourage them to engage with the outside world, particularly with the government processes and services aimed at them, and they will feel free to use their mother tongue.


B. Machine Translation

Machine translation is the sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another.

Dialect resolution is very useful when a story in a colloquial variety is translated to another language. First, we can run the dialect resolution process and render the story in a standard language format; then it is given to the machine translation system. Without the help of a dialect resolution system, it would be difficult to deal with such colloquial works.

We can apply a similar process to old transcripts, which may contain details of rare medicines, histories, and so on. Such information can be translated to another language only with the support of a dialect resolving system.

C. Speech to text applications

The application of a dialect resolution system in the field of speech-to-text is crucial. Even if speech-to-text conversion works with good accuracy, the resulting text may contain dialect words, which must be resolved before the text can be used as formal text. This is very useful when text in a local variety has to be converted into the formal language and, through that, into other languages, for instance in parliaments and legislative assemblies.

IV. Issues in Dialect Resolution

As already said, dialect resolution is not an easy task; it raises different issues at several levels of implementation.

A. Input Level

The Thrissur dialect can be simple, or more complex with compound and slang words. Some inputs, such as sentences containing metaphorical expressions, compound words, named entities, or largely redundant words, are more complex to resolve together, and each of them should be handled separately. As is well known, Malayalam is a highly agglutinative language with free word order. Many slang words change at the tail, and the remaining part, together with its context words, provides the clue to the actual word; the sketch below illustrates this suffix-based view.
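A small illustration of the tail-change observation: many colloquial forms differ from their formal counterparts only in the suffix. The suffix pairs below are hypothetical placeholders, not a real Malayalam slang inventory, and a longest-suffix rule table is only one of several possible realizations.

SUFFIX_RULES = [
    ("aanda", "aanu"),   # hypothetical colloquial tail -> formal tail
    ("ille",  "illa"),
]

def normalize_tail(word):
    """Replace a known colloquial suffix with its formal counterpart."""
    for colloquial, formal in SUFFIX_RULES:
        if word.endswith(colloquial):
            return word[:-len(colloquial)] + formal
    return word

print(normalize_tail("venamille"))   # -> "venamilla"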

B. Machine Learning Level

The dialect patterns are learned through a learning process. The smaller the corpus for learning, the lower the accuracy on unknown items. Selection of the machine learning method is also a crucial factor, since different learners perform differently on the same corpus. Getting a feature-annotated corpus is also a big task, and named entities in the training data can mislead the learner, so they have to be transliterated.

C. Corpus Level

The availability of a very large corpus of formal Malayalam sentences is a big issue, and a dictionary covering all slang words is also a problem.

D. Output Level

Keeping the output natural is a great challenge. Handling ambiguous words needs context information. Even though Malayalam is a language with highly free word order, generating semantically correct sentences requires proper word ordering.

V. Conclusion & Future works

Dialect resolution is a significant step in localized language processing for Malayalam. The ultimate possibility is to bring all dialectal ideas and information into a formal format that is readable and understandable to others without any boundary within the language; after that, translation to other languages becomes much easier. Dialect resolution for the Thrissur dialect can be adapted to the slang of other languages, given sufficient corpus support. As an extension of this approach, gender, tense, person, and number information can be labeled for disambiguation. Results will improve with a large corpus of formal sentences and a dictionary of all slang words; a better machine learning technique such as TnT, SVM, or CRF trained on such a large corpus will provide still better results.

VI. REFERENCES

[1] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall, 1999.

[2] Thorsten Brants, "TnT – A Statistical Part-of-Speech Tagger", Saarland University, Computational Linguistics, 2000.


M.Tech Computational Linguistics
Department of Computer Science and Engineering
2012-2014 Batch
Details of Master Research Projects

Title: Spoken Language Identification
Name of Student: Abitha Anto
Abstract: Spoken Language Identification (LID) refers to the automatic process through which the identity of the language spoken in a speech sample is determined. This project is based on the phonotactic approach. The phone/phoneme sets differ from one language to another, as does the frequency of occurrence of particular phones/phonemes. Based on the phone sequences of each language, a language-dependent n-gram probability distribution model is estimated. Language identification is done by comparing the frequency of occurrence of particular phones or phone sequences with that of the target languages. The applications of LID systems fall into two categories: preprocessing for machine understanding systems and preprocessing for human understanding systems. This project tries to identify Indian languages such as English (Indian), Hindi, Malayalam, Tamil and Kannada.
Tools: HTK, SRILM, Matlab
Place of Work: Amrita Vishwa Vidyapeetham, Coimbatore
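A minimal sketch of the phonotactic idea behind such a system (ours, not the project's code): per-language phone-bigram models with add-one smoothing score a phone sequence, and the highest-scoring language wins. The phone sequences below are hypothetical placeholders, not real recognizer output.

import math
from collections import Counter

def bigram_model(phone_seqs):
    bi, uni = Counter(), Counter()
    for seq in phone_seqs:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
    V = len(uni)
    def logprob(seq):
        # add-one smoothed phone-bigram log-probability
        return sum(math.log((bi[(a, b)] + 1) / (uni[a] + V))
                   for a, b in zip(seq, seq[1:]))
    return logprob

models = {
    "hi": bigram_model([["k", "a", "r"], ["r", "a", "m"]]),
    "ml": bigram_model([["k", "a", "l"], ["l", "a", "m"]]),
}
utterance = ["k", "a", "l", "a", "m"]
print(max(models, key=lambda lang: models[lang](utterance)))  # -> "ml"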

Title: Statistical Approach to Anaphora Resolution for Improved Summary Generation
Name of Student: Ancy Antony
Abstract: Anaphora resolution deals with finding the noun phrase antecedents of anaphors. It is used in natural language processing applications such as text summarization, information extraction, question answering, machine translation, topic identification, etc. The project aims to resolve the most frequently occurring pronouns in documents, such as third person pronouns. A statistical approach is utilized to accomplish the task: important features are extracted and used for finding the antecedents of anaphors. The proposed system includes components such as a pronoun extractor, noun phrase extractor, gender classifier, subject identifier, named entity recognizer, chunker and part-of-speech tagger.
Tools: NLTK, TnT, Stanford Parser
Place of Work: GEC Sreekrishnapuram



Title: Anaphora Resolution in Malayalam
Name of Student: Athira S
Abstract: An anaphor is a linguistic entity which indicates a referential tie to some other linguistic entity in the same text. Anaphora resolution is the process of automatically finding the pairs of pronouns or noun phrases in a text that refer to the same incidence, thing, person, etc., called the referent. The first member of the pair is called the antecedent and the next member is called the anaphor. This project tries to resolve anaphora in Malayalam. We outline an algorithm for anaphora resolution, working from the output of a Subject-Verb-Object tagger; Person-Number-Gender agreement is also included in the existing tagging system. Anaphora resolution is done based on the tagging and the degree of salience (salience value). The anaphora resolution system itself can improve the performance of many NLP applications such as text summarisation, term extraction and text categorisation.
Tools: TnT
Place of Work: IIITM-K, Thiruvananthapuram, Kerala

Title: Extractive News Summarization using Fuzzy Graph based Document Model
Name of Student: Deepa C A
Abstract: This project describes a news summarization system based on fuzzy graph document models. Modelling documents as fuzzy graphs is used to summarize a set of similar newspaper articles. Each article is represented as a fuzzy graph, whose nodes represent sentences and whose edges connect nodes if there exists a similarity between those sentences. The proposed extractive document summarizer uses a fuzzy similarity measure to weight the edges.
Tools: WebScrapy, NLTK
Place of Work: Government Engineering College, Sreekrishnapuram

Title: Text Summarization using Machine Learning Approach
Name of Student: Sreeja M
Abstract: This project aims at the comparison of summarization algorithms using the Rouge toolkit on the DUC 2001 dataset, the development of a new algorithm for summarization, and its comparison with previous works.
Tools: Eclipse, Rouge, R-Studio
Place of Work: Centre for Artificial Intelligence and Robotics, DRDO, Bangalore



Title: Fuzzy model-based emotion and sentiment analysis of text documents
Name of Student: Divya M
Abstract: Computer systems are inevitable in almost all aspects of everyday life. With the growth of artificial intelligence, the capabilities and functionalities of computer systems have been enhanced. Emotions constitute a key factor in human communication. Human emotion can be expressed through various mediums such as speech, facial expressions, gestures and textual data. Emotion and sentiment analysis of text is a growing area of interest in the field of computational linguistics. Emotion detection approaches use or modify concepts and general algorithms created for subjectivity and sentiment analysis. In this project the emotion and sentiment of text are analysed by means of a fuzzy approach. The proposed method involves the construction of a knowledge base of words known to have emotional content, representing six basic emotions: anger, fear, joy, sadness, surprise and disgust. The system takes natural language sentences as input, analyses them and determines the underlying emotion; it also represents multiple emotions contained in a text document. Experimental results indicate quite satisfactory performance.
Tools: NLTK, Stanford Parser
Place of Work: Government Engineering College, Sreekrishnapuram

Title: Ontology based information retrieval system in legal documents
Name of Student: Gopalakrishnan G
Abstract: An ontology serves as a knowledge base for a domain, used by agents to mine relationships and dependencies and/or to answer user queries. The domain of focus should possess a valid structure and hierarchy. The Indian Penal Code (IPC) is one such realm: the apex criminal code of India, organized into five hundred and eleven sections under twenty three chapters. An ontology for the IPC opens a vista that enables legal persons as well as the common man to access the intended sections and the code in the easiest way. The ontology also provides the judgments produced over each code, which makes the IPC more transparent and closer to the people. Protege and OWL are used to develop the ontology. Once completed, the ontology will serve as an integral reference point for the legal community of the country, and can be applied to information retrieval, decision support/making, agent technology and question answering.
Tools: Protege, Apache Jena, Python regular expressions
Place of Work: Government Engineering College, Sreekrishnapuram



Title: Topic Detection using LDA algorithm
Name of Student: Indu M
Abstract: Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. Topic modelling can enhance information network construction by grouping similar objects, event types and roles together. Each document is considered as a distribution over a small number of topics, where each topic is a distribution over words, so the main aim is to find the most probable topics and the corresponding word distribution over the topics. The main algorithm used here is LDA (latent Dirichlet allocation), a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
Tools: Python NLTK, Gensim
Place of Work: Centre for Artificial Intelligence and Robotics, DRDO, Bangalore
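Since Gensim is among the listed tools, a minimal Gensim usage sketch for LDA looks as follows. The corpus is a toy placeholder; real experiments need preprocessing such as tokenization and stopword removal.

from gensim import corpora, models

texts = [["language", "translation", "grammar"],
         ["topic", "model", "distribution"],
         ["translation", "grammar", "model"]]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=2, num_words=3):
    print(topic_id, words)   # each topic as a weighted word distribution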

Title: Discourse Segmentation using Anaphora Resolution in Malayalam
Name of Student: Lekshmi T S
Abstract: A number of linguistic devices are employed in text-based discourse for the purposes of introducing, defining, refining, and reintroducing discourse entities. This project looks at one of the most pervasive of these mechanisms, anaphora, and addresses the question of how discourse is maintained and segmented properly by resolving anaphora in Malayalam. An anaphor is a linguistic entity which indicates a referential tie to some other linguistic entity in the same text. The behaviour of referring expressions throughout the discourse seems to be closely correlated with segment boundaries. In general, within the limits of a segment, referring expressions adopt a reduced form such as pronouns, whereas a reference across discourse boundaries tends to be realized via unreduced forms like definite descriptions and proper names. We outline an algorithm for anaphora resolution, working from the output of a Subject-Verb-Object tagger. Next, by positioning the anaphoric references, segment boundaries are identified. The focus of this project is on anaphora resolution as an essential prerequisite to building the discourse segments of a text. The anaphora resolution system itself can improve the performance of many NLP applications such as text summarisation, term extraction and text categorisation.
Tools: TnT
Place of Work: IIITM-K, Trivandrum



Title: Speaker Verification using i-vectors
Name of Student: Neethu Johnson
Abstract: Speaker verification is the process of verifying the claimed identity of a speaker based on the speech signal from the speaker. An i-vector is an abstract representation of a speaker utterance. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis. This space is named the total variability space because it models both speaker and channel variabilities. Each speaker utterance is represented by an i-vector in the total variability space. Channel compensation is carried out in the total factor space rather than in the GMM supervector space, followed by a scoring technique. The i-vectors can thus be seen as new speaker recognition features, where the factor analysis plays the role of feature extractor rather than modeling speaker and channel effects. The use of the cosine kernel as a decision score for speaker verification makes the process faster and less complex than other scoring methods.
Tools: HTK, LIA-RAL, ALIZE, Matlab
Place of Work: Amrita Vishwa Vidyapeetham, Coimbatore

Title: Topic Identification Using Fuzzy Graph Based Document Model
Name of Student: Reshma O K
Abstract: Fuzzy graphs can be used to model paragraph-paragraph correlation in documents. In this model, the nodes of the graph represent the paragraphs and the interconnections represent the correlation. This project aims to find the topic from the fuzzy graph model of the document. Topic identification refers to automatically finding the topics that are relevant to a document. The task goes beyond keyword extraction, since relevant topics may not necessarily be mentioned in the document or its title, and instead have to be obtained from the context or the concept underlying the document. The proposed system uses a fuzzy graph for modelling the document; then, using eigen analysis of the correlation values from the fuzzy graph, the important paragraph can be found. The terms and their synonyms in that paragraph are then mapped to predefined concepts, and thus the topic can be extracted from the document content. Topic identification can assist search engines in the retrieval of documents by enhancing relevancy measures.
Tools: NLTK, Stanford CoreNLP, Gensim, Stanford Topic Modelling Toolbox
Place of Work: Government Engineering College, Sreekrishnapuram



Title: Ontology generation from Unstructured Text using Machine Learning Methods
Name of Student: Nibeesh K
Abstract: This project presents a system for ontology instance extraction that is automatically trained to recognize ontology instances using statistical evidence from a training set. This approach has several advantages. First, it eliminates the need for expert language-specific linguistic knowledge: to train the automated system, users only need to tag ontology instances, and if the users' needs change, the system can relearn from new data quickly. Second, system performance can be improved by increasing the amount of training data without requiring extra expert knowledge. Third, if new knowledge sources become available, they can easily be integrated into the system as additional evidence.
Tools: GATE, Protege, Jena, OWL2, Java
Place of Work: Government Engineering College, Sreekrishnapuram

Title: Text Classification using String Kernels
Name of Student: Varsha K V
Abstract: Text classification is the task of assigning predefined categories to free text documents. In this task, the text documents are represented as feature vectors of high dimension, where the feature values can be n-grams, named entities, words, etc. Kernel methods (KMs) make use of kernel functions which give the inner product of the document feature vectors; KMs compute the similarity between text documents without explicitly extracting the features, and are thus considered an effective alternative to classification based on explicit feature extraction. The project makes use of string kernels for text classification. A string kernel computes the similarity between two documents from the substrings they contain; the n-gram kernel and gappy n-gram kernels are used for the classification. In KMs the learning takes place in the feature space, where learning algorithms can be applied. The project uses the Support Vector Machine algorithm, a class of algorithms that combine the principles of statistical learning theory with optimisation techniques and the idea of a kernel mapping. The non-dependence of KMs on the dimensionality of the feature space and the flexibility of using any kernel function make them a good choice for text classification.
Tools: Openkernel, OpenFST, Libsvm
Place of Work: Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore
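A toy illustration of the n-gram (spectrum) string kernel idea: the similarity of two documents is the number of shared character n-grams, counted with multiplicity. This is only a sketch of the concept; the project itself relies on Openkernel/LibSVM implementations.

from collections import Counter

def ngram_kernel(s, t, n=3):
    """Inner product of two documents in character n-gram space."""
    a = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    b = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(a[g] * b[g] for g in a)

print(ngram_kernel("text classification", "text categorisation"))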



Title: Domain Specific Information Extraction: A Comparison using different Machine Learning Methods
Name of Student: Prajitha U
Abstract: Information extraction is the task of extracting structured information from unstructured text. It has been widely used in various research areas including NLP and IR. In this proposed system, a comparison of different machine learning methods, namely TnT, SVM, CRF and TiMBL, is done to extract perpetrator entities from the open-source MUC-3/4 corpus, which contains newswire articles on Latin American terrorism.
Tools: TnT, SVM, CRF, TiMBL
Place of Work: Government Engineering College, Sreekrishnapuram

Title: Eigen Analysis Based Automatic Document Summarization
Name of Student: Sruthimol M P
Abstract: Automatic document summarization is the process of reducing a text document into a summary that retains the points highlighting the original document. Ranking sentences according to the importance of their being part of the summary is the main task in summarization. This project proposes an effective approach to document summarization by sentence ranking. Sentence ranking is done by vector space modelling and eigen analysis methods: the vector space model is used for representing sentences as vectors in an n-dimensional space with tf-idf weighting, and the principal eigenvectors of the characteristic equation rank the sentences according to their relevance. Experimental results using standard test collections of the DUC 2001 corpus and the Rouge evaluation system show that the proposed sentence ranking based on eigen analysis improves on conventional tf-idf language model based schemes.
Tools: NLTK, Stanford CoreNLP, ROUGE Toolkit
Place of Work: GEC Sreekrishnapuram
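A sketch of the eigen-analysis ranking idea in general terms (ours, not the project's code, and not necessarily the exact formulation used): build a tf-idf sentence similarity matrix and rank sentences by the principal eigenvector. The three-sentence "document" is a toy placeholder.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the cat sat on the mat",
             "the cat chased the mouse",
             "stock prices fell sharply"]

X = TfidfVectorizer().fit_transform(sentences)   # tf-idf sentence vectors
S = (X @ X.T).toarray()                          # cosine similarity matrix

vals, vecs = np.linalg.eig(S)
principal = np.abs(vecs[:, np.argmax(vals.real)].real)
ranking = np.argsort(-principal)                 # most central sentence first
print([sentences[i] for i in ranking])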

Title: Speech Nonspeech Detection
Name of Student: Sincy V. Thambi
Abstract: Speech nonspeech detection identifies pure speech and ignores nonspeech, which includes music, noise, various environmental sounds, silence, etc. Detection is performed by analysing features that distinguish the two efficiently. Various time domain, frequency domain and cepstral domain features are analysed over short-time frames of 20 ms, along with their mean and standard deviation over segments of 200 ms. The best features are then selected using various feature dimensionality reduction mechanisms. An accuracy of 95.085% is obtained on a 2-hour speech-nonspeech database using a decision tree approach.
Tools: Matlab, Weka
Place of Work: Amrita Vishwa Vidyapeetham, Coimbatore

Title: Automatic Information Extraction and Visualization from Defence-related Knowledge Bases for Effective Entity Linking
Name of Student: Sreejith C
Abstract: The project aims to develop an intelligent information extraction system capable of extracting entities and relationships from natural language reports using statistical and rule based approaches. In this project, knowledge about the domain is imparted to the machine in the form of an ontology. The extracted entities are stored in a knowledge base and further visualized as a graph based on the relationships existing between them, for effective entity linking and inference.
Tools: Java, Jena, Protege, CRF++
Place of Work: Centre for Artificial Intelligence and Robotics, DRDO, Bangalore



M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com

SIMPLE Groups: Students' Innovations in Morphology, Phonology and Language Engineering

Article Invitation for CLEAR September 2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on the multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in September 2014. Articles on any area of Computational Linguistics and Natural Language Processing are welcome.

The articles may be sent to the Editor on or before 10th September, 2014 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Magazine
Representative, SIMPLE Groups



Hi,

Computational Linguistics is an emerging and promising discipline shaping future research and development activities in academia and industry, in fields ranging from Natural Language Processing and Machine Learning to Algorithms and Data Mining. Whilst considerable progress has been made in the development of Artificial Intelligence, Human Computer Interaction and related fields, the construction of software that can understand and process human language still remains a challenging area. New challenges arise in the modelling of such complex systems, sophisticated algorithms, advanced scientific and engineering computing and associated problem-solving environments.

CLEAR is designed to inform readers of the state of the art in a number of specialized fields related to Computational Linguistics. It addresses all aspects of Computational Linguistics, highlighting computational methods and techniques for science and engineering applications. CLEAR is a platform for everyone, from academic and research communities to industrial professionals, across a range of topics in Natural Language Processing and Computational Linguistics. CLEAR welcomes thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of all areas covered, including audio, speech, and language processing and the sciences and technologies that support them.

Thank you for your time and consideration.

Sreejith C




Students' Innovation in Morphology Phonology and Language Engineering (abbreviated as SIMPLE) is the official website of M.Tech Computational Linguistics, Govt. Engineering College, Palakkad. As the name indicates, SIMPLE is a platform for showcasing our innovations, ideas and activities in the field of Computational Linguistics. The applications of AI become much more effective when systems incorporate natural language understanding capabilities. Here, we are trying to explore how human language understanding capabilities can be used to model intelligent behaviour in computers. We hope our pursuit of excellence will ultimately benefit the society, as we believe "Innovation brings changes to the society."

SIMPLE has plans and proposals for active participation in the research as well as the applications of Computational Linguistics. The association is interested in organizing seminars, workshops and conferences in this area, and is also looking forward to actively networking with the people and organizations in this field. Our activities are led by the common philosophy of innovation, sharing and serving the society. We plan to bring out the association magazine that explains current technology from a CL perspective.

www.simplegroups.in




