CLEAR Magazine (Computational Linguistics in Engineering And Research)
June 2014, Volume-3, Issue-2
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Sreejith C, Reshma O K, Gopalakrishnan G, Neethu Johnson

Cover page and Layout
Sreejith C

CONTENTS

Editorial ... 4
SIMPLE News & Updates ... 5
Language Computing – A New Computing Arena (Elizabeth Sherly) ... 7
Data Completion using Cogent Confabulation (Sushma D K) ... 12
Malayalam POS Tagging Using Conditional Random Fields (Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V) ... 17
Topic Modelling and the LDA Algorithm (Indu M) ... 22
Tamil to Malayalam Transliteration (Kavitha Raju, Sreerekha T V, Vidya P V) ... 26
Memory-Based Language Processing and Machine Translation (Nisha M) ... 29
Dialect Resolution (Manu V. Nair, Sarath K. S) ... 34
M.Tech Computational Linguistics Project Abstracts (2012-2014) ... 38
CLEAR Call for Articles ... 46
Last Word ... 47
Editorial

Greetings! This edition of CLEAR is marked by its variety of content and by contributions from eminent scholars such as Prof. Elizabeth Sherly of IIITM-K. It is heartening to note that our efforts are attracting wider attention from the academic community. We seek the readers' suggestions and comments on the previous edition, on Indian Language Computing. As our effort is to bring language technology into mainstream academics, we plan to include articles on NLP/CL tools and platforms for specialized tasks, and on the latest computing paradigms, like MapReduce and MBLP, and their relevance to language technology. So, keep reading!

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES
Publications
Text Based Language Identification System for Indian Languages Following Devanagiri Script, Indhuja K, Indu M, Sreejith C, Dr. Reghu Raj P C, International Journal of Engineering Research & Technology (IJERT), Volume 3, Issue 04, April 2014, ISSN: 2278-0181.

"Eigenvector Based Approach for Sentence Ranking in News Summarization", Divya S, Reghuraj P C, IJCLNLP, April 2014.

"A Natural Language Question Answering System in Malayalam Using Domain Dependent Document Collection as Repository", Pragisha K and P C Reghu Raj, IJCLNLP, April 2014.

Box Item Generation from News Articles Based on Paragraph Ranking using Vector Space Model, Sreejith C, Sruthimol M P, P C Reghuraj, International Journal of Scientific Research in Computer Science Applications and Management Studies (IJSRCSAMS), Volume 3, Issue 2, March 2014, ISSN 2319-1953.
SIMPLE Groups Congratulates all the authors for their achievement!!!
Industrial Training at IIITM-K

The Virtual Resource Centre for Language Computing (VRC-LC) department of the Indian Institute of Information Technology and Management - Kerala (IIITM-K) organized a short course and industrial training on Natural Language Processing exclusively for the M.Tech students of GEC, Sreekrishnapuram. The course focused on recent trends in Computational Linguistics and Machine Learning related to Malayalam Computing. It was a 15-day programme (5th May to 20th May). During the course, various research scholars and eminent faculty members of VRC-LC delivered sessions on various aspects of language processing. The discussion of the various ongoing works on Malayalam computing at the centre gave the participants a clear idea of the workings of, and the challenges involved in, Malayalam Computing.
Language Computing: A New Computing Arena
Elizabeth Sherly, Professor, IIITM-Kerala

The great strides made by Information Technology in every walk of life have slowly given prominence to Language Computing, which has immense importance in today's computing world. A country like India, which values its culture, heritage and language, is greatly in need of local language support so as to combat the dominance of English in computing. India has about 25 years of history in language computing (LC), which has gone through its ups and downs. But during the last decade there has been a paradigm shift, with a significant leap to LC as a new computing arena. Scientists who were somewhat reluctant to take up language computing for research have turned to Language Technology (LT) as a mainstream of research and, interestingly, industry giants like Microsoft and Google have entered Language Computing in a big way. This phenomenal shift in LT happened as Language Technology tools became inevitable to enhance the products and services in high-growth markets such as mobile applications, healthcare, IT services, financial services, online retail, call centres, publishing and media.

Thanks go to Noam Chomsky, the linguist and mathematician, who turned research in computational linguistics by expressing the structure of human language at the level of mathematically viable symbols, introducing the concept of the TREE diagram. Linguistics then gained good momentum, because there was now a way to represent linguistics in mathematical and logical form, which enables a piece of text to be converted into a programmer-friendly data structure. As natural language involves human understanding and processing, Artificial Intelligence also plays a significant role in the development of Computational Linguistics models. The terms Language Computing (LC) and Computational Linguistics (CL) are sometimes used interchangeably; both are key terminologies derived from Natural Language Processing (NLP).
Some of the major applications in LC are Machine Translation, Information Retrieval, Automatic Summarization, Question Answering, Automatic Speech Recognition, language writing and spoken aids, Dialog Systems, Man-Machine Interfaces, Knowledge Representation etc. Machine Translation (MT) is one of the major tasks; its research goes back to the 1950s, but it still remains an open problem. There are no 100% accurate Machine Translation systems for any pair of languages in the world. The day we achieve that, most of the other problems in LT will be resolved. The major tasks involved in Machine Translation systems are to describe the language in terms of syntactic and semantic information. The syntactic information is generated using Morphological Analysis, Parts-of-Speech (POS) tagging, Chunking, and Parsing. The semantic information is obtained using Word-Sense Disambiguation (WSD), Semantic Role Labelling (SRL), Named Entity Recognition (NER), and Anaphora Resolution (AR). The primary research in CL is basically to develop models and tools for the above mentioned components. The challenge in this work is that each language has a diverse linguistic nature, with varied morphological features, inflectional sets, grammatical structure etc., which makes language computing research more complex. The limited availability of large corpora and dictionaries in each language is another constraint for researchers in LC.
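The syntactic half of this pipeline is easy to see end to end in code. The sketch below uses NLTK (an assumed toolchain, not one the article prescribes) to tokenize a sentence, POS-tag it, and chunk it with a toy noun-phrase grammar; the grammar and sentence are illustrative only.

```python
# Minimal sketch of tokenization -> POS tagging -> chunking with NLTK.
# (Requires the 'punkt' and 'averaged_perceptron_tagger' models,
#  installable via nltk.download().)
import nltk

sentence = "India has a long history in language computing."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # Parts-of-Speech tagging

# Shallow parsing (chunking): group tagged words into noun phrases
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"     # toy NP rule
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)            # shallow parse tree
print(tree)
```

Deeper semantic components such as WSD, SRL, NER and anaphora resolution would then operate on such tagged and chunked output.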
In India, there has been a phenomenal shift in Language Technology and Computational Linguistics research and development in the last several years, as Language Technology tools have become inevitable in many applications. The Department of Electronics and Information Technology (DEITY), India initiated Technology Development for Indian Languages (TDIL, http://tdil.mit.gov.in) with the objective of developing information processing tools and techniques to facilitate human-machine interaction without language barriers; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services. The major activities of TDIL involve research and development of Language Technology, which includes machine translation, multi-lingual parallel corpora, cross-lingual information access, optical character recognition and text-to-speech conversion, and the development of standards related to Language Technology. Various projects and research have been ongoing under TDIL with the effort of a number of scientific organizations and educational institutions. Institutions like IIIT-Hyderabad, IIITM-Kerala, the IITs in India, CIIL-Mysore, Hyderabad Central University, the Centre for Computer Science and Sanskrit Centre-JNU, New Delhi, and CDAC are the main centres where computational linguistics is taught and researched.

Apart from Machine Translation, the research directions in Language Computing are towards Automatic Speech Recognition, Speech-to-Text Processing, Web and Semantics, Named Entity Recognition (NER), Sentiment Analysis, Anaphora Resolution, Word-Sense Disambiguation etc. It is not too futuristic to believe that we could be talking to a computer which can act according to our commands. New computational techniques and models have to be explored for better results. Ontology and Semantics and psycholinguistic analysis are other upcoming areas that rely on research to get more sensible information from the web and other applications. The mobile phone, being one of the handiest and most widely used gadgets, needs local language support for its various features; this is yet another potential area for research. Since the Internet of Things (IoT) is closely associated with Machine-to-Machine, Machine-to-Human and Machine-to-small-device communication, it needs local language enablement to provide information to the local masses.

The opportunities in industry are also very demanding for CL. One look at the 2013 Gartner predictions tells us that there is a huge demand for computational language scientists. Natural Language Processing and Speech Recognition are no doubt among the prominent considerations for the most expected technologies in the near future. Industries that employ computational linguists are tremendously forging ahead, because currently most web, mobile and social media based work needs extensive language support. The industry looks for mainly three roles in Language Computing: linguists, computer programmers (both having knowledge in language computing) and researchers. Microsoft, Google, IBM, HP and several other companies working heavily in this field need trained manpower in LT.
The global market value of 19.3 billion in 2011 has been predicted to shoot to 30 billion in 2015. The major thrust areas are: in Machine Translation, to create systems and technologies catering to today's multitude of translation scenarios; in multilingual systems, to develop a natural-language-neutral approach in all aspects of linguistic computing; and in natural-language processing and Automatic Speech Recognition, to design and build software which can analyze, understand, and generate natural human languages, so as to enable addressing a computer like addressing a human being.

The work on Malayalam Computing has been active for the last decade, so as to enable computers to understand and process Malayalam. The major work in Malayalam Computing involves Machine Translation systems from Malayalam to other languages and vice versa, spell checkers, Malayalam search engines, Malayalam text-to-speech and speech-to-text systems, morphological analysers and POS taggers for Malayalam, Malayalam text editors, human interaction interfaces, Malayalam language tutors, corpora building, dictionary building etc. CDAC Chennai, CDIT, IIITM-K, CIIL-Mysore, AUKBAC, IIIT-Hyderabad and Amrita are some of the major institutions actively pursuing research and product development in Malayalam Computing. There are also certain NGOs and communities actively engaged in work related to Malayalam Computing.

Despite all the efforts made by many contributors, the awareness among the masses is very low. A mechanism for encouraging and promoting Indian Language Computing is the need of the hour. There should be flagship programmes and movements in the state, in academic institutions and other organizations, to give more visibility and accessibility, which can ensure reachability in every corner of the nation. Language Computing and its standardization have to be well placed in India's IT policy, both at the national and state levels. Each language group shall actively promote their language and culture, not only among their own state's people, but wherever that language group has existence worldwide. We may take Malayalam as an example; some of the promotional activities that can be done at the state and college levels are listed below.

• Include the importance of Malayalam Computing in Kerala's IT Policy.
• Create an Internet mailing list, say Malayalam.Net, and introduce various tools and products developed with its use.
• Form a Malayalam Computing Task Force; promote and support various activities and form PMUs for better implementation.
• Establish a non-profit, Government/non-Government body to promote Malayalam Computing.
• Promote periodical seminars, workshops and conferences in Malayalam Computing in educational institutions.
• Make Language Computing mandatory for Computer Science courses at the university level.
• Promote Malayalam Computing using social media.
• Publish articles, research papers, tools and products, and news on LC in magazines, journals and notice boards.

Due to the inherent structure of Malayalam as a highly agglutinative and inflectional language, the issues in Malayalam Computing are many compared to other languages. The rendering issues in the display of Malayalam fonts require greater attention. Computational models for various tasks, namely morphological analysis, POS tagging etc., are to be refined for more accurate results and a better outlook. Most of the models are rule based, statistical or hybrid models like Support Vector Machines, HMM and TnT, which show 95-100% accuracy. However, Machine Translation systems have shown an accuracy of only 60-65%. Deep Learning has recently shown much promise for NLP applications. The deep neural network, with its capability to learn distributed representations based on the similarity of words and its ability to learn multiple levels of representation, is encouraging for cognitive problem solving like natural language understanding and Speech Recognition. The days of using a computer with Malayalam input devices and interfaces, with Malayalam processing and understanding the way a human handles it, are not very far.
Data Completion using Cogent Confabulation
Sushma D K, M.Tech, Dept. of Computer Science, SJBIT, Bangalore

Introduction

Information is being generated on a daily basis and continuously uploaded onto the web in massive quantities. This information is of many types, ranging from simple text files to video files, and it may sometimes be incomplete due to various reasons, like variations in electrical signals, intentional causes, or unknowingly overwriting the data. Cogent confabulation is a unique technique to complete the missing data. The cogent confabulation problem consists of two major tasks: (1) training the available data to generate knowledge bases (matrices storing the appearance counts of words in the training corpus); (2) querying over this trained data to predict the next word or phrase, given the starting few words of the phrase or sentence. To overcome the difficulty of completing the missing information, the previously available data has to be first studied extensively. Later, this data can be trained at the semantic level using a particular model, and used in the completion of the missing data.

Cogent confabulation is a new model of vertebrate cognition used for training and querying the available data. Confabulation is the process of selecting the one symbol (termed the conclusion of the confabulation) whose representing words happen to be receiving the highest level of excitation (appearance count). The confabulation product is the probability of occurrence of each of the assumed words individually with the target word. Cogency is defined as the probability of occurrence of all the assumed words together with the target word. To predict and fill the missing data, the procedure starts with maximizing the cogency p(αβγδ|ε) and the confabulation product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε). The order in which the corpus words are chosen for training is meant to reflect the pattern of text contained in the expository text.
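As a rough illustration of task (1), the knowledge base can be thought of as matrices of pairwise appearance counts. The sketch below (not the author's implementation; the window size and the normalization are assumptions) builds such counts from a toy corpus.

```python
# Toy "knowledge base" training: count how often a target word appears
# within a small window after a source word.
from collections import defaultdict

def train(sentences, window=4):
    counts = defaultdict(int)   # co-occurrence counts N(source, target)
    totals = defaultdict(int)   # total pair counts per source word
    for words in sentences:
        for i, w in enumerate(words):
            for t in words[i + 1:i + 1 + window]:
                counts[(w, t)] += 1
                totals[w] += 1
    return counts, totals

corpus = [s.split() for s in [
    "the cat sat on the mat",
    "the cat chased the mouse",
]]
counts, totals = train(corpus)
# p(target | source) can then be estimated as counts[(source, target)] / totals[source]
print(counts[("the", "cat")], totals["the"])
```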
The approach uses lexical analyses based on the appearance counts of words, an information retrieval measurement, to determine the extent of the possibility of a particular word being the conclusion word. Phrase completion and sentence continuation using the confabulation model should be useful for many text analysis tasks, including information retrieval and summarization, generating artificial stories etc.

Confabulation is a new model of cognition which mimics vertebrate Hebbian learning and the information processing procedures of the human brain. Cognitive information processing is a direct evolutionary re-application of the neural circuits controlling movement, and thus functions just like movement. Conceptually, brains are composed of many muscles of thought (termed thalamocortical modules in mammals). A module contains symbols, each of which is a sparse collection of neurons which functions as a descriptor of the attribute of that module. For example, if the attribute of a module is color, then a single symbol represents a particular color. Each thalamocortical module is connected to many other modules. When two symbols are active simultaneously, they are said to co-occur, which creates the opportunity to associate the two symbols. For instance, after seeing a face and hearing a name together, the symbols representing each may become associated. Each strengthened unidirectional association between two symbols is termed a knowledge link. Collectively, knowledge links comprise all cognitive knowledge.

Each thalamocortical module performs the same information processing operation, which can be thought of as a contraction of a list of symbols, termed a single confabulation. Throughout a confabulation, input excitation is delivered to the module through knowledge links from active symbols in other modules, driving the activation of these knowledge links' target symbols (the lists of candidate conclusion symbols) in the module performing the confabulation. When conclusion symbols contract, there is no physical movement in the brain; rather, symbols currently on the list compete (based upon their relative excitation levels) for eventual exclusive activation (a so-called winner-take-all competition) within that module and, as a result, the number of active symbols is gradually reduced.

Crucially, this contraction of the candidate conclusion symbol list in each thalamocortical module is externally controlled by a thought control signal delivered to the module. A confabulation in a thalamocortical module is controlled by a graded analog control input. The thought control signal determines how many symbols remain in the competition, but has no effect on selecting which symbols are in the competition. Which symbols are in the competition is determined by the excitation level of a symbol as it dynamically reacts to knowledge link input from active symbols in other modules (which causes its excitation level to increase) or to a reduction or cessation of such input (which causes its excitation level to fall). Ultimately, the thought control signal is used to dynamically contract the number of active symbols in a module from an initial many less-active symbols to, at the end of the confabulation, a single maximally-active symbol. The resulting single active symbol is termed the confabulation conclusion.

The learned association between each symbol of a module and its set of action commands is termed skill knowledge. Skill knowledge is stored in the module, but the learning of these associations is controlled by subcortical brain nuclei. When a conclusion is reached in a module, those action commands which have a learned association from that conclusion symbol are instantly launched. These issued action commands are proposed as the source of all non-reflexive and non-autonomic behaviors. Thalamocortical modules performing confabulations, delivering excitation through knowledge links, and applying skill knowledge through the issuance of action commands constitute the complete foundation of all mammalian cognition.

Methodology and applications

One of the methodologies used for data completion is confabulation, which is explained above. The other concept is cogency: the probability of occurrence of all the assumed fact words together with the assumed target word, i.e. p(αβγδ|ε).

A. Terminologies

According to the confabulation model, assuming that the combined assumed facts αβγδ are true, the set of all symbols (in the answer lexicon from which conclusions
are being sought) with p(αβγδ|ε) > 0 is called the expectation; its elements are termed answers, in descending order of their cogencies. Confabulation uses maximization of the product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε) (or, equivalently, the sum of the logarithms of these probabilities) as a surrogate for maximizing cogency (it is assumed that all required pairwise conditional probabilities p(Ψ|λ) between symbols Ψ and λ are known; this exhaustive assumption is termed meaningful knowledge). Each non-zero p(Ψ|λ) is termed an individual item of knowledge.
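To make the query step concrete, here is a hedged sketch of ranking candidate conclusions by the logarithmic form of the confabulation product; the pairwise model p is assumed to come from a training step like the one sketched earlier, and the smoothing floor is an illustrative choice.

```python
# Score each candidate conclusion e by sum of log p(assumed_word | e),
# the log form of the confabulation product above, and pick the maximum.
import math

def complete(assumed_words, candidates, p):
    """p(w, e) returns the estimated conditional probability p(w | e)."""
    def score(e):
        # floor zero probabilities so the log is defined
        return sum(math.log(p(w, e) or 1e-12) for w in assumed_words)
    return max(candidates, key=score)

# e.g. complete(["the", "cat", "sat", "on"], vocabulary, p) -> "mat"
```

Candidates with non-zero support form the expectation; the max over the summed log probabilities is the confabulation conclusion.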
B. Applications

These basic concepts of cognition and confabulation are applied in the artificial intelligence and machine learning fields to predict, fill, and complete missing data in sentences. They can also be used for context representation, context exploitation, intelligent text recognition, and generating artificial stories.

Conclusion and Future Works

This paper is an initiative for better data completion and manipulation. The major requirement for the process is a previously available complete text corpus. Previous works fill in the missing data based on Bayesian models, where words are selected by increasing their posterior probability (selecting the conclusion which has the highest probability of being correct). Though Bayesian model theory has been extensively used in neural networks, it is a poor model for cognition and data completion, because it simply selects the word with the highest probability value, even if it is irrelevant to the given context. Many open source tools can be used for easy development. The longer execution times due to lexicon overhead can be reduced by parallel processing. Applications that demand high throughput will have to evaluate the proposed confabulation method depending on the hardware available.
REFERENCES

[1] Robert Hecht-Nielsen, Confabulation Theory: The Mechanism of Thought, 2007.
[2] Qinru Qiu, Qing Wu, Daniel J. Burns, Michael J. Moore, Robinson E. Pino, Morgan Bishop, and Richard W. Linderman, Confabulation Based Sentence Completion for Machine Reading, IEEE, 2011.
[3] Qinru Qiu, Qing Wu, and Richard Linderman, Unified Perception-Prediction Model for Context Aware Text Recognition on a Heterogeneous Many-Core Platform, Proceedings of the International Joint Conference on Neural Networks, San Jose, California, USA, July 31 - August 5, 2011.
[4] Fan Yang, Qinru Qiu, Morgan Bishop, and Qing Wu, Tag-assisted Sentence Confabulation for Intelligent Text Recognition, IEEE, 2012.
[5] Darko Stipaničev, Ljiljana Šerić, Maja Braović, Damir Krstinić, Toni Jakovčević, Maja Štula, Marin Bugarić and Josip Maras, Vision Based Wildfire and Natural Risk Observers, IEEE, 2012.
Computer chatbot 'Eugene Goostman' passes the Turing test

Eugene Goostman is a chatterbot. First developed by a group of three programmers - the Russian-born Vladimir Veselov, Ukrainian-born Eugene Demchenko, and Russian-born Sergey Ulasen - in Saint Petersburg in 2001, Goostman is portrayed as a 13-year-old Ukrainian boy, a trait intended to induce forgiveness in users for his grammar and level of knowledge. The Goostman bot has competed in a number of Turing test contests since its creation, and finished second in the 2005 and 2008 Loebner Prize contests. In June 2012, at an event marking what would have been the 100th birthday of their namesake, Alan Turing, Goostman won what was promoted as the largest-ever Turing test contest, successfully convincing 29% of its judges that it was human. On 7 June 2014, at a contest marking the 60th anniversary of Turing's death, 33% of the event's judges thought that Goostman was human; the event's organizer, Kevin Warwick, considered it to have "passed" Turing's test as a result, per Turing's prediction that by the year 2000, machines would be capable of fooling 30% of human judges after five minutes of questioning.
Malayalam POS Tagging Using Conditional Random Fields
Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V
Dept. of Computer Science and Engg, Sreepathy Institute of Management and Technology, Vavanoor

Parts-of-speech tagging is the process of assigning tags to the words of a given sentence. This paper presents the building of a Part-Of-Speech (POS) tagger for the Malayalam language using Conditional Random Fields (CRF). A POS tagger plays an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction. The present tagset consists of 100 tags. The system consists of a language model, trained on an annotated corpus of 3026 sentences (36,315 words), which checks the trigram possibility of occurrence of a tag in the training corpus. We also consider a trigram HMM-based (Hidden Markov Model) part-of-speech tagger for Malayalam, which accepts raw text and produces POS-tagged output. We can improve the accuracy of the system by increasing the size of the annotated corpus. Although the experiments were performed on a small corpus, the results show that the statistical approach works well with a highly agglutinative language like Malayalam.
I. Introduction

India is a large multilingual country of diverse culture. It has many languages with written forms and over a thousand spoken languages. The Constitution of India recognizes 22 languages, spoken in different parts of the country. The languages can be categorized into two major linguistic families, namely Indo-Aryan and Dravidian. These classes of languages have some important differences: their ways of developing words and grammars are different, but both include a lot of Sanskrit words, and both have a similar construction and phraseology that links them together. There is a need to develop information processing tools to facilitate human-machine interaction in Indian languages, and multilingual knowledge resources. A POS tagger forms an integral part of any such processing tool.

Parts-of-speech tagging, a form of grammatical tagging, is the process of marking the words in a text as corresponding to a particular part of speech, based on their definition and context, i.e. the relationship with adjacent and related words in a phrase, sentence or paragraph. This is the first step towards understanding any language. It finds its major applications in speech and NLP tasks like Speech Recognition, Speech Synthesis, Information Retrieval etc. A lot of work has been done relating to this in the NLP field, particularly in part-of-speech tagging of western languages. These taggers vary in accuracy and also in their implementation. A lot of techniques have been explored to make tagging more and more accurate; these vary from being purely rule based in their approach to being completely stochastic. Some of these taggers achieve good accuracy for certain languages. But unfortunately, not much work has been done with regard to Indian languages, especially Malayalam. In this paper we have developed a POS tagger based on Conditional Random Fields (CRF).

Conditional Random Fields (CRFs), undirected graphical models, are used to calculate the conditional probability of values on designated output nodes given values on other designated input nodes. It can clearly be seen that the generative model (HMM) here has performed quite close to CRF. A Hidden Markov Model (HMM) is a statistical model in which the system modeled is thought to be a Markov process with unknown parameters. In this model, the working assumptions are that the probability of a word in a sequence may depend on the word immediately preceding it, and that both the observed and hidden words must be in a sequence. It has been experimentally shown that the accuracy of the POS tagger can be significantly improved by introducing a trigram template, an efficient corpus and a widely accepted tagset. This paper mainly concentrates on designing a POS tagger using CRF++, which is an open source implementation of CRF.

System Architecture

The system consists of 3 modules, namely Preprocessing, Training and Testing. The architecture of the proposed system is depicted in Fig. 1.

Fig 1. System Architecture
Preprocessing is the initial stage in the implementation of a Malayalam POS tagger. In preprocessing, the system takes the input sentence and tokenizes it, i.e. it receives the sentence and splits it into words. The split words are stored in a file, which becomes the input for the testing phase.

The proposed method uses a supervised mode of learning for POS tagging. The simplest statistical approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text. This is done during the training phase. We use a statistical approach for POS tagging, i.e. we train and test our model, and for this we have to calculate the frequencies and probabilities of the words of the given corpus. During training, the annotated corpus is trained using the CRF template. All the required files for performing the procedure are explained with an example below. For training, the crf_learn command is used as shown below:

crf_learn template_file train_file model_file

The template file provides the correct position of the words to be tagged. If a unigram template is used, then a word is tagged based on the current word. If bigram statistics are used, then the tagging is done based on the current and previous words. On increasing the number of grams, the template becomes more useful and the tagging becomes more efficient. The template file can be represented as shown in Fig. 2.

Fig. 2 CRF Template

The tagset used in the system is developed by the Bureau of Indian Standards. A tagset is a list of short forms representing the components of a sentence, like nouns, verbs and their sub-forms. The corpus training is performed with the help of the tagset, which contains 100 tags. The training will create a model file as output, which contains the learned probabilities of the corpus.

Fig. 3 Training Corpus

Testing is the process of comparing the trained model file with the tokenized input file and, obeying the rules in the template and the tags in the tagset, finding the corresponding tags of each and every word. Use the crf_test command:

crf_test -m model_file test_file

A screenshot of the proposed system is given in Fig. 4.

Fig. 4: Screenshot
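The whole train/test cycle can also be scripted. The sketch below (with hypothetical file names) writes a small feature template and the tab-separated word/tag corpus that CRF++ expects, then invokes the two commands shown above via Python.

```python
# Driving CRF++ from Python: prepare files, then call crf_learn/crf_test.
import subprocess

def write_corpus(tagged_sentences, path):
    """CRF++ training format: one 'word<TAB>tag' per line,
    with a blank line separating sentences."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in tagged_sentences:
            for word, tag in sent:
                f.write(f"{word}\t{tag}\n")
            f.write("\n")

# A small unigram/bigram feature template, as discussed above.
TEMPLATE = "U00:%x[-1,0]\nU01:%x[0,0]\nU02:%x[1,0]\nB\n"
with open("template_file", "w", encoding="utf-8") as f:
    f.write(TEMPLATE)

subprocess.run(["crf_learn", "template_file", "train_file", "model_file"], check=True)
subprocess.run(["crf_test", "-m", "model_file", "test_file"], check=True)
```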
Conclusion

The motivation of this project is to help children and foreigners learn the structure of a Malayalam sentence. Using the Python programming language as the development environment, the application is built keeping in mind design standards and the maintainability of the code. CRF++, the Hidden Markov Model, and Tkinter features provide a rich user experience to the users of the software. This application is very simple to use and is helpful to people who are in the preliminary stage of learning the Malayalam language.
References

1. Rajeev R R, Jisha P Jayan, "Part of Speech Tagger for Malayalam", Computer Engineering and Intelligent Systems, Vol 2, No. 3, June 2011.
2. Christopher D. Manning, Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
3. Asif Ekbal, Rajwanul Haque, Sivaji Bandhopadhyay, "Bengali Part of Speech Tagging using Conditional Random Fields", Department of CSE, Jadavpur University, Kolkata-700032, India.
4. Anish, "Part of Speech Tagging for Malayalam", Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore-641105.
5. Steven Bird, et al., "Natural Language Processing with Python", O'Reilly Publications, 2011.
6. Michael Collins, "Tagging with Hidden Markov Models", Lecture Notes.
7. CRF++ Home Page, https://code.google.com/p/crfpp/, last visited March 2014.
Making the world’s knowledge computable Wolfram Alpha introduces a fundamentally new way to get knowledge and answers not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
http://www.wolframalpha.com/
Topic Modelling and the LDA Algorithm
Indu M
Dept. of Computer Science, Govt. Engineering College, Sreekrishnapuram

A topic model automatically captures the thematic patterns of text streams and identifies emerging topics and their changes over time. It is used to check models, summarize the corpus, and guide exploration of its contents. Topic modelling can enhance information network construction by grouping similar objects, event types and roles together.
Introduction

As electronic documents become available, it becomes more difficult to find and discover what we need. We need new tools to help us organize, search, and understand these vast amounts of information. Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic.

Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. One of the simplest topic models is latent Dirichlet allocation (LDA). The intuition behind LDA is that documents exhibit multiple topics. Most topic models, such as LDA [1], are unsupervised: only the words in the documents are modeled, and the goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection.

The main application of topic modeling is in Information Extraction. It can also be used to analyze, summarize, and categorize streams of text data at the time of their arrival. For example, as news arrives in streams, organizing it as threads of relevant articles is more efficient and convenient.

Topic modelling programs do not know anything about the meaning of the words in a text. Instead, they assume that any piece of text is composed (by an author) by selecting words from possible baskets of words, where each basket corresponds to a topic. If that is true, then it becomes possible to mathematically decompose a text into the probable baskets from whence the words first came. The model specifies the following distribution over words within a document:

p(w) = Σ_{z=1}^{T} p(w|z) p(z)

where T is the number of topics, p(z) is the distribution over topics z in a particular document, and p(w|z) is the probability distribution over words w given topic z.

The basic idea behind latent Dirichlet allocation (LDA), the simplest topic model [3], is that documents exhibit multiple topics. For example, consider the article in Figure 1 [1]. This article, entitled "Seeking Life's Bare (Genetic) Necessities," is about using data analysis to determine the number of genes that an organism needs to survive. The figure highlights the different words that are used in the article. Words about data analysis, such as "computer" and "prediction," are highlighted in blue; words about evolutionary biology, such as "life" and "organism," are highlighted in pink; and words about genetics, such as "sequenced" and "genes," are highlighted in yellow. The left side of the figure represents some number of topics. Each document is assumed to be generated as follows: first choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the coloured coins) and choose the word from the corresponding topic. LDA models can be used to find topics that describe a corpus, with each document exhibiting multiple topics.

Graphical model of LDA
Other probabilistic models, such as naive-Bayes Dirichlet Compound Multinomial (DCM) mixtures and probabilistic Latent Semantic Indexing (pLSI), have also been used to find the relevant topics. We can also simplify the topic distribution by modelling each topic as a discrete probability distribution over documents.

Methodology and applications

One of the methodologies used for topic detection is LDA, explained above. Theoretical studies of topic modelling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they need either to assume that each document contains only one topic, or else they can only recover the span of the topic vectors instead of the topic vectors themselves.

A. Algorithm

The corpus contains a collection of documents. For each document in the collection, we generate the words in a two-stage process:
1. Randomly choose a distribution over topics.
2. For each word in the document:
(a) Randomly choose a topic from the distribution over topics in step #1.
(b) Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportions (step #1); each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution.

In order to evaluate the predictive power of a generative model on unseen data, there is a standard measure known as perplexity. By this, the model's ability to predict related words, even across different languages, can be assessed.
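In practice, this generative story is inverted by an inference algorithm. As a brief illustration, the sketch below fits an LDA model with the open-source gensim library (one possible tool; the article does not prescribe an implementation) on a toy, pre-tokenized corpus.

```python
# Fitting LDA on a toy corpus with gensim.
from gensim import corpora, models

docs = [
    ["gene", "dna", "genetic", "life"],
    ["computer", "prediction", "data", "analysis"],
    ["gene", "computer", "analysis", "organism"],
]  # pre-tokenized documents

dictionary = corpora.Dictionary(docs)          # word <-> id mapping
bow = [dictionary.doc2bow(d) for d in docs]    # bag-of-words vectors

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Per-document topic proportions (the mixture chosen in step #1 above)
print(lda.get_document_topics(bow[0]))
```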
B. Applications

Topic models are good for data exploration, when there is some new data set and you don't know what kinds of structures you could possibly find in there. But if you did know what structures you could find in your data set, topic models are still useful if you didn't have the time or resources to construct classification models based on supervised machine learning. Lastly, if you did have the time and resources to construct classification models based on supervised learning, topic models would still be useful as extra features to add to the models in order to increase their accuracy. This is the case because topic models act as a kind of "smoothing" that helps combat the sparse data problem that is often seen in supervised learning.

Topic modeling can be used in computer vision; as an inference algorithm applied to natural texts in the service of text retrieval, classification and organization; to build text hierarchies; and in WSD, machine learning, information engineering, and scientific applications such as genetics and neuroscience. It can also organize, summarize, and help users to explore large corpora.

Conclusion and Future Works

Documents are partitioned into topics, which in turn have terms associated to varying degrees. However, in practice there are some clear issues: the models are very sensitive to the input data, and small changes to the stemming/tokenization algorithm can result in completely different topics; topics need to be manually categorized in order to be useful; and topics are "unstable" in the sense that adding new documents can cause significant changes to the topic distribution.

REFERENCES

[1] D. Blei, Introduction to Probabilistic Topic Models, Communications of the ACM, pp. 77-84, 2012.
[2] Vivi Nastase, Introduction to Topic Models: Building up towards LDA, summer semester 2012.
[3] D. Blei et al., Supervised Topic Models, Princeton University.
[4] David Hall et al., Studying the History of Ideas Using Topic Models, Stanford University.
Tamil to Malayalam Transliteration
Kavitha Raju, Sreerekha T V, Vidya P V
M.Tech CL, GEC, Sreekrishnapuram, Palakkad

Abstract: Transliteration can form an essential part of transcription, which converts text from one writing system to another. This article discusses the applications of and challenges in machine transliteration from Tamil to Malayalam, two languages that belong to the Dravidian family. Transliteration can be used to supplement the machine translation process by handling the issues that arise from the presence of named entities.
I. Introduction

The rewriting or conversion of the characters of a text from one writing system to another is called transliteration. Here each character of the source language is assigned to a different unique character of the target language, so that an exact inversion is possible. If the source language consists of more characters than the target language, combinations of characters and diacritics can be used. Machine transliteration systems have a great importance in a country like India, which has a fascinating diversity of languages. Even though there are groups of languages that come from a common origin, the difference in scripts makes the task cumbersome.

Machine transliteration systems can be classified into rule-based and corpus-based approaches, in terms of their core methodology. Rule-based systems develop linguistic rules that allow the words to be put in different places, to have different scripts depending on context, etc. The main approach of these systems is based on linking the structure of the given input with the structure of the demanded output, necessarily preserving their unique meaning. Minimally, to get a transliteration of a source language sentence, one needs a dictionary that maps each source language word to an appropriate target language word, rules representing source language word structure, and rules representing target language word structure. Because rule-based machine transliteration uses linguistic information to mathematically break down the source and target languages, it is more predictable and grammatically superior to the statistical method. Statistical machine transliteration employs a statistical model, based on the analysis of a corpus, to generate text in the target language. The idea behind statistical machine transliteration comes from information theory: a document is transliterated according to the probability distribution p(e|f) that a string e in the target language is the transliteration of a string f in the source language.
There are two primary motivations behind pursuing this project. The first is that in India, the majority of the population still use their mother tongue as the medium of communication; the second is that in spite of globalization and the widespread influence of the West in India, most people still prefer to read, write and talk in their mother tongue.

II. Language Features

Malayalam is a language spoken in India, predominantly in the state of Kerala. It is one of the 22 scheduled languages of India and was designated a Classical Language in India in 2013. Malayalam has official language status in the state of Kerala and in the union territories of Lakshadweep and Puducherry. It belongs to the Dravidian family of languages. The origin of Malayalam, whether it was from a dialect of Tamil or an independent offshoot of the Proto-Dravidian language, has been and continues to be an engaging pursuit among comparative historical linguists.

Tamil is a Dravidian language spoken predominantly by the Tamil people of South India and North-east Sri Lanka. It has official status in the Indian states of Tamil Nadu, Puducherry and the Andaman and Nicobar Islands. Tamil is also an official language of Sri Lanka and an official language of Singapore, and it is legalised as one of the languages of medium of education in Malaysia, along with English, Malay and Mandarin. It is also chiefly spoken as a secondary language in the states of Kerala, Karnataka and Andhra Pradesh and in the Andaman and Nicobar Islands. It is one of the 22 scheduled languages of India and was the first Indian language to be declared a classical language by the Government of India, in 2004.

Both Tamil and Malayalam are languages of the southern states of India. Even though Malayalam is said to belong to the Dravidian family of languages, it is more similar to the Arya languages. When Aryans came to the north-east border of Bharatam, the people of Harappa and Mohenjo-daro moved east and south, and they replanted their civilisation in South India, based in Tamil Nadu. Tamil is one of the oldest languages. When Dravidians moved south, there were people of the soil to receive and help them, and the geography helped them to receive Tamil. Dravidians settled in Tamil Nadu and developed Tamil literature and Tamil civilization. In comparison to other Indian languages, Tamil has only 12 vowels and 18 consonants. Malayalam is one of the most updated languages, with clarity in voice, and is comparatively more difficult to study; it has 56 letters, and its vowels most closely resemble Sanskrit. Since Tamil has few alphabets in comparison to Malayalam, it is not possible to have a one-to-one mapping between these languages: a letter in Tamil can be transliterated as more than one letter in Malayalam.
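Because the Tamil (U+0B80-U+0BFF) and Malayalam (U+0D00-U+0D7F) Unicode blocks share a common ISCII-derived layout, a crude first approximation of such a mapping can be obtained with a fixed code-point offset. The sketch below is only a baseline, not a complete system: it deliberately ignores the one-to-many cases discussed above, which need context-sensitive rules or a statistical model on top.

```python
# Naive Tamil -> Malayalam script mapping via a fixed code-point offset.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF
OFFSET = 0x0D00 - 0x0B80  # many letters sit 0x180 code points apart

def naive_transliterate(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if TAMIL_START <= cp <= TAMIL_END:
            out.append(chr(cp + OFFSET))  # assumes the parallel slot exists
        else:
            out.append(ch)                # leave non-Tamil characters alone
    return "".join(out)

print(naive_transliterate("தமிழ்"))  # -> "തമിഴ്" (rendered in Malayalam script)
```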
III. Applications and Challenges

The various applications of a machine transliteration system are as follows:
• It aids machine translation.
• It helps to eliminate language barriers.
• It supports localization.
• It enables the use of a keyboard in a given script to type in a text in another one.

Transliteration also has many challenges:
• Dissimilar alphabet sets of the source and target languages.
• Multiple possible transliterations for the same word.
• Finding exactly matching tokens is difficult for some of the vowels and a few consonants.
• The size of the corpus required to build an accurate transliteration system is very large.

IV. Conclusion

In this article, we discussed Tamil to Malayalam transliteration systems in general. Various applications of transliteration and the challenges associated with it were also pointed out. Transliteration is inevitable for a machine translation system that must handle named entities. In the case of dictionary based translation systems, this is very useful, as it saves a lot of time and resources. But it has been observed that a large corpus is required to model the system accurately.

REFERENCES

[1] K Saravanan, Raghavendra Udupa, A Kumaran: Crosslingual Information Retrieval System Enhanced with Transliteration Generation and Mining, Microsoft Research India, Bangalore, India.
[2] Rishabh Srivastava and Riyaz Ahmad Bhat: Transliteration Systems Across Indian Languages Using Parallel Corpora, Language Technologies Research Center, IIIT-Hyderabad, India.
[3] R. Akilan and Prof. E.R. Naganathan: Morphological Analyzer for Classical Tamil Texts: A Rule-based Approach.
Memory-Based Language Processing and Machine Translation
Nisha M
M.Tech CL, GEC, Sreekrishnapuram, Palakkad
Memory-based language processing (MBLP) is an approach to language processing based on exemplar storage during learning and analogical reasoning during processing. From a cognitive perspective, the approach is attractive because it does not make any assumptions about the way abstractions are shaped, and does not make any a priori distinction between regular and exceptional exemplars, allowing it to explain the fluidity of linguistic categories, and irregularization as well as regularization in processing. Memory-based machine translation can be considered a form of Example-Based Machine Translation. The machine translation problem can be treated as a classification problem, and hence memory-based learning can be applied. This paper demonstrates a memory-based approach to machine translation.
I. Introduction

Memory-based language processing, MBLP, is based on the idea that learning and processing are two sides of the same coin. Learning is the storage of examples in memory, and processing is similarity-based reasoning with these stored examples. MBLP finds its computational basis in the classic k-nearest neighbor classifier (Cover & Hart, 1967). With k = 1, the classifier searches for the single example in memory that is most similar to B, say A, and then copies its memorized mapping A' to B' (as visualized schematically in Figure 1). With k set to higher values, the k nearest neighbors to B are retrieved, and some voting procedure (such as majority voting) determines which value is copied to B'.

Memory-based machine translation can be considered a form of Example-Based Machine Translation. In characterizing Example-Based Machine Translation, Somers (1999) refers to the common use of a corpus or database of already translated examples, and the process of matching new input against instances in this database. This matching is followed by extraction of fragments, which are then again recombined to form the final translation.

The task of mapping one language to the other can be treated as a classification problem. The method can be described as one in which the sentence to be translated is decomposed into smaller fragments, each of which is passed to a classifier for classification. The memory-based classifier is trained on the basis of a parallel corpus in which the sentence pairs have been reduced to smaller fragment pairs. The assigned classes thus correspond to the translations of the fragments. In the final step, these translated fragments are re-assembled to derive the final translation of the input sentence.
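As a toy illustration of this classification view, the sketch below stores a few (source fragment, target fragment) pairs in memory and copies the mapping of the nearest stored example, i.e. a 1-nearest-neighbour classifier; the token-overlap similarity and the Dutch-English pairs are deliberate simplifications, not the paper's implementation.

```python
# 1-nearest-neighbour memory-based classification of translation fragments.
def overlap(a, b):
    return len(set(a) & set(b))  # crude similarity: shared tokens

MEMORY = [
    (("ik", "zag", "haar"), ("I", "saw", "her")),  # (source, target) pairs
    (("ik", "zag", "hem"), ("I", "saw", "him")),
]

def classify(fragment):
    example, mapping = max(MEMORY, key=lambda ex: overlap(ex[0], fragment))
    return mapping  # copy the memorized mapping A' to the new input B

print(classify(("ik", "zag", "haar")))  # -> ('I', 'saw', 'her')
```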
II. Background and Related Literature

Natural language processing models and systems typically employ abstract linguistic representations (syntactic, semantic, or pragmatic) as intermediate working units. Memory-based models enable asking the question whether we can do without them, since any invented intermediate structure is always implicitly encoded somehow in the words at the surface and the way they are ordered, and memory-based models may be capable of capturing the knowledge that is usually considered to be necessary, in an implicit way, so that it does not need to be explicitly computed.

Classes of natural language processing tasks in which this question can be investigated in the extreme are processes in which form is mapped to form, i.e., in which neither the input nor the output contains abstract elements to begin with, such as machine translation. Many current machine translation tools, such as the open source Moses toolkit (Koehn et al., 2007), indeed implement a direct mapping of source to target text, leaving all of syntax and semantics implicit; they hide in the form of statistical translation models between collocationally strong phrases, and of statistical language models of the target language. The MBLP approach to this problem involves using context on the source side, and using memory-based classification as a translation model (Van Gompel, Van den Bosch, & Berck, 2009).

There is an encouraging number of recent studies that attempt to link statistical and memory-based models of language, focusing on discovering strong n-grams (for phrase-based statistical machine translation or for statistical language modeling), on the concept of constructions, and on the question of to what extent human language users exploit constructions. To mention two, we note that Mos, Van den Bosch, and Berck have reported that a memory-based language model shows a reasonable correlation with the unit segmentations that test subjects generate in a sentence copy task; the model implicitly captures several strong complex lexical items (constructions), although it fails to capture long-distance dependencies, a common issue with local n-gram based statistical models. In a related study, Arnon and Snider (2010) show that subjects are sensitive to the frequency (a rough approximation of collocational strength) of four-word n-grams such as "don't have to worry", which are processed faster when they are more frequent. Their argument is again focused on the question whether strong subsequences need to have linguistic structure that assumes hierarchy, or could simply be taken to be flat n-grams; it is exactly this question that we aim to explore further in our work with memory-based language processing models.
III. Methodology

The process of translating a new sentence is divided into a local phase (corresponding to the first two steps in the process), in which memory-based translation of source trigrams to target trigrams takes place, and a global phase (corresponding to the third step), in which a translation of a sentence is assembled from the local predictions.

A. Local classification

Both in training and in actual translation, when a new sentence in the source language is presented as input, it is first converted into windowed trigrams, where each token is taken as the center of a trigram once. The first trigram of the sentence contains an empty left element, and the last trigram contains an empty right element. At training time, each source language sentence is accompanied by a target language translation. Word alignment should be performed before this step, so that the classifier knows for each source word whether it maps to a target word, and if so, to which. Given the alignment, each source trigram is mapped to a target trigram of which the middle word is the target word to which the word in the middle of the source trigram aligns, and the neighboring words of the target trigram are the center word's actual neighbors in the target sentence.
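The windowing step itself is compact enough to sketch; the padding symbol and tuple representation below are illustrative choices, not the authors' exact encoding.

```python
# Convert a token sequence into windowed trigrams: each token becomes the
# centre of one trigram, padded with empty elements at the sentence edges.
def to_trigrams(tokens, pad="_"):
    padded = [pad] + tokens + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(tokens))]

print(to_trigrams(["ik", "zag", "hem"]))
# [('_', 'ik', 'zag'), ('ik', 'zag', 'hem'), ('zag', 'hem', '_')]
```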
Figure 2: An example training pair of sentences, converted into six overlapping trigrams with their aligned trigram translations [1].

When translating new text, trigram outputs are generated for all words in each new source language sentence to be translated, since the system does not have clues as to which words would be aligned by statistical word alignment.

B. Global search

To convert the set of generated target trigrams into a full sentence translation, the overlap between the predicted trigrams is exploited. Figure 3 illustrates a perfect case of a resolution of the overlap (drawing on the example of Figure 2), causing words in the English sentence to change position with respect to their aligned Dutch counterparts. The first three English trigrams align one-to-one with the first three Dutch words. The fourth predicted English trigram, however, overlaps to its left with the fifth predicted trigram in one position, and overlaps in two positions to the right with the sixth predicted trigram, suggesting that this part of the English sentence is positioned at the end. Note that in this example, the "fertility" words take and this, which are not aligned in the training trigram mappings (cf. Figure 1), play key roles in establishing trigram overlap.
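As a rough illustration of the global phase, the following sketch stitches predicted target trigrams together by greedily maximizing the overlap between the growing sentence and the remaining trigrams. This is a simplification of the search described in [1], assuming clean one-best trigram predictions.

```python
def stitch(trigrams):
    """Greedy sketch of the global phase: repeatedly append the trigram that
    overlaps most with the tail of the partial translation."""
    remaining = list(trigrams)
    sentence = list(remaining.pop(0))

    def overlap(tri):
        # size of the longest suffix of `sentence` matching a prefix of `tri`
        for k in (2, 1):
            if sentence[-k:] == list(tri[:k]):
                return k
        return 0

    while remaining:
        best = max(remaining, key=overlap)
        sentence.extend(best[overlap(best):])
        remaining.remove(best)
    return [w for w in sentence if w != "_"]   # drop empty boundary elements

predicted = [("_", "i", "see"), ("i", "see", "it"), ("see", "it", "_")]
print(stitch(predicted))  # -> ['i', 'see', 'it']
```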
IV. Conclusion and Future Works

The study described in this paper has demonstrated how memory-based learning can be applied to machine translation. Memory-based learning stores all examples in memory, settling for a state of maximal description length. This extreme bias makes memory-based learning an interesting comparative case against so-called eager learners, such as decision tree induction algorithms and rule learners. Phrase-based memory-based machine translation has also been implemented, using statistical toolkits such as Moses for phrase extraction. Future work includes developing memory-based techniques for phrase extraction.

REFERENCES

[1] A. van den Bosch and P. Berck. Memory-based machine translation and language modeling. The Prague Bulletin of Mathematical Linguistics, 91:17–26, 2009.
[2] Antal van den Bosch and Walter Daelemans. Memory-Based Language Processing. Cambridge University Press, New York.
[3] Antal van den Bosch and Walter Daelemans. Implicit schemata and categories in memory-based language processing.
The Stanford Natural Language Processing Group

The Natural Language Processing Group at Stanford University is a team of faculty, research scientists, postdocs, programmers and students who work together on algorithms that allow computers to process and understand human languages. The work ranges from basic research in computational linguistics to key applications in human language technology, and covers areas such as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical information extraction, grammar induction, word sense disambiguation, and automatic question answering.

The Stanford NLP Group makes parts of its Natural Language Processing software available to everyone: statistical NLP toolkits for various major computational linguistics problems, which can be incorporated into applications with human language technology needs. A distinguishing feature of the group is its effective combination of sophisticated, deep linguistic modeling and data analysis with innovative probabilistic and machine learning approaches to NLP. Its research has resulted in state-of-the-art technology for robust, broad-coverage natural language processing in many languages. These technologies include a competition-winning coreference resolution system; a state-of-the-art part-of-speech tagger; a high-performance probabilistic parser; a competition-winning biological named entity recognition system; and algorithms for processing Arabic, Chinese, and German text.

All the software is written in Java. All recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code. A number of helpful people have extended the work with bindings or translations for other languages, so much of this software can also easily be used from Python (or Jython), Ruby, Perl, JavaScript, and F# or other .NET languages.

Link: http://nlp.stanford.edu/software/
Dialect Resolution
Manu V. Nair, Sarath K. S.
Department of Computer Science and Engineering, Govt. Engg. College Sreekrishnapuram, Palakkad, India - 678 633

Abstract: A dialect is a regional or social variety of a language distinguished by pronunciation, grammar, or vocabulary, especially a way of speaking that differs from the standard variety of the language; it is a recognized and formal variant of the language spoken by a large group from one region, class or profession. Slang, which is different from dialect, consists of a lexicon of non-standard words and phrases in a given language. Dialect resolution is an approach to render a dialect dialogue or word in its formal form without losing its semantics. It is a localized approach through which a local person can express his ideas, in his own style, in a formal format. It is also a method to resolve slang words.
I. Introduction

India is a special nation, holding a highly linguistically diverse population with 18 officially recognized languages and many other unofficial languages. This diversity is greater than that of China (7 languages and hundreds of dialects), even though India covers only one third of China's area. A dialect is a form of a language spoken in a particular geographical area or by members of a particular social class or occupational group, distinguished by its vocabulary, grammar, and pronunciation. The term is applied most often to regional speech patterns, but a dialect may also be defined by other factors, such as social class. A dialect associated with a particular social class can be termed a sociolect, a dialect associated with a particular ethnic group can be termed an ethnolect, and a regional dialect may be termed a regiolect or topolect. According to this definition, any variety of a language constitutes "a dialect", including any standard varieties. A standard dialect is a dialect that is supported by institutions.

Slang consists of words, expressions, and meanings that are informal and are used by people who know each other very well or who have the same interests. It includes mostly expressions that are not considered appropriate for formal occasions, and is often vituperative or vulgar. Use of these words and phrases is typically associated with the subversion of a standard variety and is likely to be interpreted by listeners as implying particular attitudes on the part of the speaker. In some contexts a speaker's selection of slang words or phrases may convey prestige, indicating group membership or distinguishing group members from those who are not part of the group.

Among Indian languages, Malayalam is a highly inflectional language. Political and geographical isolation, the impact of Christianity and Islam, and the arrival of the Namboothiri Brahmins a little over a thousand years ago all created conditions favourable to the development of the local dialect, Malayalam. The Namboothiris grafted a good deal of Sanskrit onto the local dialect and influenced its physiognomy. The Malayalam language itself contains many dialect variations; each district of Kerala has its own dialect.

II. Dialect Resolution

Dialect resolution is an approach to render a dialect dialogue or word in its formal form without losing its semantics. It is a localized approach through which a local person can express his ideas, in his own style, in a formal format. Highly informal slang words are also to be resolved.

From the computational point of view, dialect resolution is a difficult task: there are different types of dialect variation even within a single language. Computational methods for dialect resolution include rule-based, statistics-based, and machine learning approaches. We can also use a hybrid approach, i.e., a mixture of the above-mentioned approaches. A hybrid approach combines the strengths of the single approaches, and so may achieve higher accuracy.
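As a concrete illustration of the rule-based option, the sketch below resolves tokens through a dialect-to-formal lexicon backed by suffix-rewriting rules. All lexicon entries and suffixes here are invented placeholders, not real Thrissur-dialect data.

```python
# Toy rule-based dialect resolver: a direct lexicon lookup backed by
# suffix-rewriting rules. All entries below are invented placeholders,
# not real Malayalam/Thrissur-dialect data.
LEXICON = {"dialect_word": "formal_word"}   # word-level mappings
SUFFIX_RULES = [("aa", "a"), ("ee", "e")]   # (dialect suffix, formal suffix)

def resolve(token):
    if token in LEXICON:                    # rule 1: exact lexicon match
        return LEXICON[token]
    for old, new in SUFFIX_RULES:           # rule 2: tail rewriting, since
        if token.endswith(old):             # many slang words change at the tail
            return token[:-len(old)] + new
    return token                            # rule 3: pass through unchanged

def resolve_sentence(sentence):
    return " ".join(resolve(t) for t in sentence.split())

print(resolve_sentence("dialect_word stays samee"))  # -> formal_word stays same
```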
III. Applications

A. Localization

Language localization is the process of adapting a product that has been previously translated into different languages to a specific country or region. Localization can be done for regions or countries where people speak different languages or where the same language is spoken. Dialect resolution has an important role in localization: since the morphologically rich Malayalam language has many different dialects, dialect resolution has to be incorporated into the localization process.

Different tribal peoples can express their ideas and knowledge to the outside world in their own language, and the dialect resolution system will convert it into the formal language to which it belongs. This will encourage them to engage with the outside world, mainly with the government processes and services aimed at them, and they will feel free to use their mother tongue.
B. Machine Translation

Machine translation is the sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. Dialect resolution is very useful when a story written in a colloquial style is translated into another language: first, the dialect resolution process renders the story in a normal language format; then it is given to the machine translation system. Without the help of a dialect resolution system, it is difficult to deal with such colloquial works. We can apply a similar process to old transcripts, which may contain details of rare medicines, various histories, and so on; this information can be translated into another language only with the support of a dialect resolving system.

C. Speech to text applications

The application of a dialect resolution system in the field of speech-to-text is crucial. Even if speech-to-text conversion is possible with good accuracy, the resulting text may contain dialect words, which must be resolved before the text can be used as formal text. This is very useful when text in a local language has to be converted into the formal language and, through that, into other languages, as in places like parliaments, legislative assemblies, etc.
IV. Issues in Dialect Resolution

As already said, dialect resolution is not an easy task; it is difficult, and it raises several implementation issues.

A. Input Level

The Thrissur dialect can be simple, or more complex with compound and slang words. Some inputs, such as sentences containing metaphorical expressions, compound words, named entities, or large redundant words, are more complex to resolve together; each of them should be handled separately. As all know, Malayalam is a highly agglutinative language with free word order. Many slang words change at the tail, and the remaining part, together with its context words, provides a clue about the actual word.

B. Machine Learning Level

The dialect patterns are learned through a learning process. The smaller the training corpus, the lower the accuracy on unknown items. The selection of the machine learning method is also a crucial factor: different learners perform differently on the same corpus. Getting a feature-annotated corpus is also a big task. Named entities in the training data will mislead the learner; they have to be transliterated.
C. Corpus Level

The availability of a very large corpus of formal Malayalam sentences is a big issue. A dictionary covering all slang words is also a problem.

D. Output Level

Keeping the output natural is a great challenge. Handling ambiguous words needs context information. Even though Malayalam is a language with highly free word order, generating semantically correct sentences requires proper word ordering.

V. Conclusion & Future works

Dialect resolution in the Thrissur slang is a great step in localized language resolution for the Malayalam language. The ultimate possibility is to bring all dialect ideas and information into a formal format, readable and understandable to others without any language boundary within the language; after that, translation into other languages becomes much easier. The approach can be adapted to the slang of other languages, given sufficient corpus support. As an extension of this approach, gender, tense, person and number information can be labeled for disambiguation. Results will improve with a large corpus of formal sentences and a dictionary of all slang words; a better machine learning technique such as TnT, SVM or CRF trained on this large corpus will provide still better results.

VI. REFERENCES

[1] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 1999.
[2] Thorsten Brants. TnT – A Statistical Part-of-Speech Tagger. Saarland University, Computational Linguistics, 2000.
M.Tech Computational Linguistics
Department of Computer Science and Engineering
2012-2014 Batch: Details of Master Research Projects

Title: Spoken Language Identification
Name of Student: Abitha Anto
Abstract: Spoken Language Identification (LID) refers to the automatic process through which the identity of the language spoken in a speech sample is determined. This project is based on the phonotactic approach: the phone/phoneme sets differ from one language to another, as does the frequency of occurrence of particular phones/phonemes. Based on the phone sequences of each language, a language-dependent n-gram probability distribution model is estimated, and language identification is done by comparing the frequency of occurrence of particular phones or phone sequences with that of the target languages. The applications of LID systems fall into two categories: preprocessing for machine understanding systems and preprocessing for human understanding systems. This project tries to identify Indian languages such as English (Indian), Hindi, Malayalam, Tamil and Kannada.
Tools: HTK, SRILM, Matlab
Place of Work: Amrita Vishwa Vidyapeetham, Coimbatore
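A toy sketch of the phonotactic idea: per-language phone-bigram models compared by log-likelihood over a phone sequence. The phone strings below are invented; in the actual project the phone streams would come from an HTK recognizer and the n-gram models from SRILM.

```python
import math
from collections import Counter

def bigram_model(phone_seqs):
    """Maximum-likelihood phone-bigram probabilities with add-one smoothing."""
    pairs, unigrams = Counter(), Counter()
    for seq in phone_seqs:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
            unigrams[a] += 1
    vocab = len(unigrams) + 1
    return lambda a, b: (pairs[(a, b)] + 1) / (unigrams[a] + vocab)

# Invented phone strings standing in for phone-recogniser output.
models = {
    "malayalam": bigram_model([["k", "a", "zh", "a"], ["p", "a", "zh", "a"]]),
    "hindi":     bigram_model([["k", "a", "h", "a"], ["r", "a", "h", "a"]]),
}

def identify(phones):
    """Pick the language whose bigram model gives the highest log-likelihood."""
    score = lambda m: sum(math.log(m(a, b)) for a, b in zip(phones, phones[1:]))
    return max(models, key=lambda lang: score(models[lang]))

print(identify(["k", "a", "zh", "a"]))  # -> malayalam
```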
Title: Statistical Approach to Anaphora Resolution for Improved Summary Generation
Name of Student: Ancy Antony
Abstract: Anaphora resolution deals with finding the noun phrase antecedents of anaphors. It is used in natural language processing applications such as text summarization, information extraction, question answering, machine translation, and topic identification. The project aims to resolve the most frequently occurring pronouns in documents, namely third person pronouns. A statistical approach is used: important features are extracted and used for finding the antecedents of anaphors. The proposed system includes components such as a pronoun extractor, noun phrase extractor, gender classifier, subject identifier, named entity recognizer, chunker and part-of-speech tagger.
Tools: NLTK, TnT, Stanford Parser
Place of Work: GEC Sreekrishnapuram
Title: Anaphora Resolution in Malayalam
Name of Student: Athira S
Abstract: An anaphor is a linguistic entity which indicates a referential tie to some other linguistic entity in the same text. Anaphora resolution is the process of automatically finding the pairs of pronouns or noun phrases in a text that refer to the same incidence, thing, person, etc., called the referent; the first member of the pair is called the antecedent and the next member is called the anaphor. This project tries to resolve anaphora in Malayalam. We outline an algorithm for anaphora resolution that works from the output of a Subject-Verb-Object tagger; Person-Number-Gender agreement is also included in the existing tagging system. Anaphora resolution is done based on the tagging and the degree of salience (salience value). The anaphora resolution system itself can improve the performance of many NLP applications such as text summarisation, term extraction and text categorisation.
Tools: TnT
Place of Work: IIITMK, Thiruvananthapuram, Kerala

Title: Extractive News Summarization using Fuzzy Graph based Document Model
Name of Student: Deepa C A
Abstract: This project describes a news summarization system based on fuzzy graph document models. Modelling documents as fuzzy graphs is used to summarize a set of similar newspaper articles. Each article is represented as a fuzzy graph whose nodes represent sentences, and an edge connects two nodes if there exists a similarity between those sentences. The proposed extractive document summarizer uses a fuzzy similarity measure to weight the edges.
Tools: WebScrapy, NLTK
Place of Work: Government Engineering College, Sreekrishnapuram
Title: Text Summarization using Machine Learning Approach
Name of Student: Sreeja M
Abstract: This project aims at the comparison of summarization algorithms using the Rouge toolkit on the DUC 2001 dataset, the development of a new algorithm for summarization, and its comparison with previous works.
Tools: Eclipse, Rouge, R-Studio
Place of Work: Centre for Artificial Intelligence and Robotics, DRDO, Bangalore
Title: Fuzzy model-based emotion and sentiment analysis of text documents
Name of Student: Divya M
Abstract: Computer systems are inevitable in almost all aspects of everyday life. With the growth of artificial intelligence, the capabilities and functionalities of computer systems have been enhanced. Emotions constitute a key factor in human communication, and human emotion can be expressed through various media such as speech, facial expressions, gestures and textual data. Emotion and sentiment analysis of text is a growing area of interest in the field of computational linguistics; emotion detection approaches use or modify concepts and general algorithms created for subjectivity and sentiment analysis. In this project the emotion and sentiment of text are analysed by means of a fuzzy approach. The proposed method involves the construction of a knowledge base of words known to have emotional content, representing six basic emotions: anger, fear, joy, sadness, surprise and disgust. The system takes natural language sentences as input, analyses them and determines the underlying emotion; it also represents multiple emotions contained in a text document. Experimental results indicate quite satisfactory performance.
Tools: NLTK, Stanford Parser
Place of Work: Government Engineering College, Sreekrishnapuram
Title: Ontology based information retrieval system in legal documents
Name of Student: Gopalakrishnan G
Abstract: An ontology serves as a knowledge base for a domain, used by agents to mine relationships and dependencies and/or to answer user queries; the domain of focus should possess a valid structure and hierarchy. The Indian Penal Code (IPC) is one such realm: the apex criminal code of India, organized into five hundred and eleven sections under twenty three chapters. An ontology for the IPC helps create a vista that enables legal persons as well as the common man to access the intended sections and the code in the easiest way. The ontology also provides the judgments produced over each code, making the IPC more transparent and closer to people. Protege and OWL are used to develop the ontology. Once completed, the ontology can serve as an integral reference point for the legal community of the country, and can be applied to information retrieval, decision support/making, agent technology and question answering.
Tools: Protege, Apache Jena, Python regular expressions
Place of Work: Government Engineering College, Sreekrishnapuram
Title: Topic Detection using LDA algorithm
Name of Student: Indu M
Abstract: Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. Topic modelling can enhance information network construction by grouping similar objects, event types and roles together. Each document is considered as a distribution over a small number of topics, where each topic is a distribution over words; the main aim is therefore to find the most probable topics and the corresponding word distribution over the topics. The main algorithm used here is LDA (latent Dirichlet allocation), a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
Tools: Python NLTK, Gensim
Place of Work: Centre for Artificial Intelligence and Robotics, DRDO, Bangalore
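A minimal Gensim sketch of this pipeline, with invented toy documents standing in for a large preprocessed archive, might look as follows.

```python
from gensim import corpora, models

# Toy tokenized documents; a real run would use a large preprocessed archive.
texts = [
    ["match", "team", "goal", "score"],
    ["team", "player", "goal", "win"],
    ["election", "vote", "party", "minister"],
    ["party", "vote", "government", "minister"],
]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

# LDA treats each document as a mixture of topics, and each topic as a
# distribution over words.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=20)
for topic in lda.print_topics(num_words=4):
    print(topic)
```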
Title: Discourse Segmentation using Anaphora Resolution in Malayalam
Name of Student: Lekshmi T S
Abstract: A number of linguistic devices are employed in text-based discourse for the purposes of introducing, defining, refining, and reintroducing discourse entities. This project looks at one of the most pervasive of these mechanisms, anaphora, and addresses the question of how discourse is maintained and segmented properly by resolving anaphora in Malayalam. An anaphor is a linguistic entity which indicates a referential tie to some other linguistic entity in the same text. The behaviour of referring expressions throughout the discourse seems to be closely correlated with segment boundaries: in general, within the limits of a segment, referring expressions adopt a reduced form such as pronouns, whereas a reference across a discourse boundary tends to be realized via unreduced forms like definite descriptions and proper names. We outline an algorithm for anaphora resolution that works from the output of a Subject-Verb-Object tagger; then, by positioning the anaphoric references, segment boundaries are identified. The focus of this project is on anaphora resolution as an essential prerequisite to building the discourse segments of a text. The anaphora resolution system itself can improve the performance of many NLP applications such as text summarisation, term extraction and text categorisation.
Tools: TnT
Place of Work: IIITM-K, Trivandrum
Title: Speaker Verification using i-vectors
Name of Student: Neethu Johnson
Abstract: Speaker verification is the process of verifying the claimed identity of a speaker based on the speech signal from the speaker. An i-vector is an abstract representation of a speaker utterance. In this modeling, a new low-dimensional speaker- and channel-dependent space is defined using a simple factor analysis; this space is named the total variability space because it models both speaker and channel variabilities. Each speaker utterance is represented by an i-vector in the total variability space. Channel compensation is carried out in the total factor space rather than in the GMM supervector space, followed by a scoring technique. The i-vectors can thus be seen as new speaker recognition features, where the factor analysis plays the role of feature extractor rather than modeling speaker and channel effects. The use of the cosine kernel as a decision score for speaker verification makes the process faster and less complex than other scoring methods.
Tools: HTK, LIA-RAL, ALIZE, Matlab
Place of Work: Amrita Vishwa Vidyapeetham, Coimbatore
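The cosine decision score mentioned in the abstract reduces to a normalized dot product between two i-vectors. The sketch below uses invented vectors and an assumed threshold.

```python
import numpy as np

def cosine_score(ivec_enrol, ivec_test):
    """Cosine decision score between two i-vectors; higher means more likely
    the same speaker."""
    return float(ivec_enrol @ ivec_test /
                 (np.linalg.norm(ivec_enrol) * np.linalg.norm(ivec_test)))

# Invented low-dimensional i-vectors; real ones come from a factor-analysis
# front end (e.g. ALIZE) over GMM supervectors.
enrol = np.array([0.2, -0.5, 0.8])
test = np.array([0.25, -0.4, 0.7])
THRESHOLD = 0.5   # tuned on development data (assumed value)
print("accept" if cosine_score(enrol, test) > THRESHOLD else "reject")
```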
Title: Topic Identification Using Fuzzy Graph Based Document Model
Name of Student: Reshma O K
Abstract: Fuzzy graphs can be used to model the paragraph-paragraph correlation in documents: nodes of the graph represent the paragraphs and the interconnections represent the correlation. This project aims to find the topic from the fuzzy graph model of the document. Topic identification refers to automatically finding the topics that are relevant to a document. The task goes beyond keyword extraction, since relevant topics may not necessarily be mentioned in the document or its title, and instead have to be obtained from the context or the concept underlying the document. The proposed system uses a fuzzy graph to model the document; then, using eigen analysis of the correlation values from the fuzzy graph, the important paragraph can be found. The terms and their synonyms in that paragraph are then mapped to predefined concepts, and thus the topic can be extracted from the document content. Topic identification can assist search engines in the retrieval of documents by enhancing relevancy measures.
Tools: NLTK, Stanford CoreNLP, Gensim, Stanford Topic Modelling Toolbox
Place of Work: Government Engineering College, Sreekrishnapuram
Title: Ontology generation from Unstructured Text using Machine Learning Methods
Name of Student: Nibeesh K
Abstract: This project presents a system for ontology instance extraction that is automatically trained to recognize ontology instances using statistical evidence from a training set. This approach has several advantages. First, it eliminates the need for expert language-specific linguistic knowledge: to train the automated system, users only need to tag ontology instances, and if the users' needs change, the system can relearn from new data quickly. Second, system performance can be improved by increasing the amount of training data without requiring extra expert knowledge. Third, if new knowledge sources become available, they can easily be integrated into the system as additional evidence.
Tools: GATE, Protege, Jena, OWL2, Java
Place of Work: Government Engineering College, Sreekrishnapuram
Title: Text Classification using String Kernels
Name of Student: Varsha K V
Abstract: Text classification is the task of assigning predefined categories to free text documents. In this task, text documents are represented as feature vectors of high dimension, where the feature values can be n-grams, named entities, words, etc. Kernel methods (KMs) make use of kernel functions that give the inner product of the document feature vectors; they compute the similarity between text documents without explicitly extracting the features, and are therefore considered an effective alternative to classification based on explicit feature extraction. The project makes use of string kernels for text classification. A string kernel computes the similarity between two documents by the substrings they contain; the n-gram kernel and gappy n-gram kernels are used for the classification. With KMs, learning takes place in the feature space, where learning algorithms can be applied. The project uses the Support Vector Machine algorithm: SVMs are a class of algorithms that combine the principles of statistical learning theory with optimisation techniques and the idea of a kernel mapping. The non-dependence of KMs on the dimensionality of the feature space and the flexibility of using any kernel function make them a good choice for text classification.
Tools: OpenKernel, OpenFST, Libsvm
Place of Work: Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore
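As a self-contained illustration of the simplest of these kernels, the sketch below computes an n-gram (spectrum) kernel between two documents by counting shared character substrings. The project itself uses OpenKernel and Libsvm rather than this toy code.

```python
from collections import Counter

def ngram_spectrum(text, n=3):
    """Character n-gram counts of a document (its 'spectrum')."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_kernel(a, b, n=3):
    """Inner product of the two spectra: the similarity two documents get
    from the substrings they share, without building explicit vectors."""
    sa, sb = ngram_spectrum(a, n), ngram_spectrum(b, n)
    return sum(count * sb[gram] for gram, count in sa.items())

print(ngram_kernel("text classification", "text categorisation"))
# The resulting Gram matrix can be handed to an SVM (e.g. libsvm's
# precomputed-kernel mode) for training.
```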
Title: Domain Specific Information Extraction: A Comparison using different Machine Learning Methods
Name of Student: Prajitha U
Abstract: Information extraction is the task of extracting structured information from unstructured text. It has been widely used in various research areas including NLP and IR. In the proposed system, different machine learning methods, namely TnT, SVM, CRF and TiMBL, are compared on the task of extracting perpetrator entities from the open-source MUC-3/4 corpus, which contains newswire articles on Latin American terrorism.
Tools: TnT, SVM, CRF, TiMBL
Place of Work: Government Engineering College, Sreekrishnapuram

Title: Eigen Analysis Based Automatic Document Summarization
Name of Student: Sruthimol M P
Abstract: Automatic document summarization is the process of reducing a text document to a summary that retains the main points of the original document. Ranking sentences according to their importance for inclusion in the summary is the main task in summarization. This project proposes an effective approach to document summarization by sentence ranking, done by vector space modeling and eigen analysis. The vector space model is used to represent sentences as vectors in an n-dimensional space under a tf-idf weighting scheme, and the principal eigenvectors of the characteristic equation rank the sentences according to their relevance. Experimental results using the standard DUC 2001 test collection and the Rouge evaluation system show that the proposed eigen-analysis-based sentence ranking improves on conventional tf-idf language model based schemes.
Tools: NLTK, Stanford CoreNLP, ROUGE Toolkit
Place of Work: GEC Sreekrishnapuram
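A numpy sketch of the eigen-analysis ranking step is shown below. The tf-idf numbers are invented for illustration; the project computes real tf-idf weights over the DUC 2001 corpus.

```python
import numpy as np

def rank_sentences(tfidf):
    """Rank sentences by the principal eigenvector of the sentence-sentence
    similarity matrix built from tf-idf vectors."""
    # cosine similarity between every pair of sentence vectors
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    unit = tfidf / norms
    sim = unit @ unit.T
    eigvals, eigvecs = np.linalg.eigh(sim)    # sim is symmetric
    principal = np.abs(eigvecs[:, -1])        # eigenvector of largest eigenvalue
    return np.argsort(-principal)             # most central sentences first

# Toy tf-idf matrix: 4 sentences x 5 terms (invented numbers).
tfidf = np.array([
    [0.9, 0.1, 0.0, 0.3, 0.0],
    [0.8, 0.2, 0.1, 0.2, 0.0],
    [0.0, 0.0, 0.7, 0.0, 0.6],
    [0.1, 0.9, 0.0, 0.8, 0.1],
])
print(rank_sentences(tfidf))  # sentence indices, best summary candidates first
```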
Title: Speech Nonspeech Detection
Name of Student: Sincy V Thambi
Abstract: Speech/nonspeech detection identifies pure speech and ignores nonspeech, which includes music, noise, various environmental sounds, silence, etc. Detection is performed by analysing features that distinguish the two efficiently. Various time domain, frequency domain and cepstral domain features are analysed for short-time frames of 20 ms, along with their mean and standard deviation over segments of 200 ms. The best features are then selected using various feature dimensionality reduction mechanisms. An accuracy of 95.085% is obtained on a 2-hour speech-nonspeech database using a decision tree approach.
Tools: Matlab, Weka
Place of Work: Amrita Vishwa Vidyapeetham, Coimbatore
Title: Automatic Information Extraction and Visualization from Defence-related Knowledge Bases for Effective Entity Linking
Name of Student: Sreejith C
Abstract: The project aims to develop an intelligent information extraction system capable of extracting entities and relationships from natural language reports using statistical and rule-based approaches. The knowledge about the domain is imparted to the machine in the form of an ontology. The extracted entities are stored in a knowledge base and further visualized as a graph based on the relationships existing between them, for effective entity linking and inference.
Tools: Java, Jena, Protege, CRF++
Place of Work: Centre for Artificial Intelligence and Robotics, DRDO, Bangalore
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR - September 2014. We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in September 2014. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th September, 2014 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
Representative, SIMPLE Groups
Hi,

Computational Linguistics is an emerging and promising discipline shaping future research and development activities in academia and industry, in fields ranging from Natural Language Processing and Machine Learning to Algorithms and Data Mining. Whilst considerable progress has been made in the development of Artificial Intelligence, Human Computer Interaction and related areas, the construction of software that can understand and process human language still remains a challenging problem. New challenges arise in the modelling of such complex systems, sophisticated algorithms, advanced scientific and engineering computing, and associated problem-solving environments.

CLEAR is designed to inform readers about the state of the art in a number of specialized fields related to Computational Linguistics. CLEAR addresses all aspects of Computational Linguistics, highlighting computational methods and techniques for science and engineering applications. It is a platform for all, ranging from academic and research communities to industry professionals, across a range of topics in Natural Language Processing and Computational Linguistics. CLEAR welcomes thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of all areas covered, which include audio, speech, and language processing and the sciences and technologies that support them.

Thank you for your time and consideration.
Sreejith C
Students' Innovation in Morphology Phonology and Language Engineering (abbreviated as SIMPLE) is the official group of M.Tech Computational Linguistics, Govt. Engineering College, Palakkad. As the name indicates, SIMPLE is a platform for showcasing our innovations, ideas and activities in the field of Computational Linguistics. The applications of AI become much more effective when systems incorporate natural language understanding capabilities; here, we are trying to explore how human language understanding capabilities can be used to model intelligent behaviour in computers. We hope our pursuit of excellence will ultimately benefit society, as we believe "Innovation brings changes to the society." SIMPLE has plans and proposals for active participation in research as well as in the applications of Computational Linguistics. The association is interested in organizing seminars, workshops and conferences in this area, and is also looking forward to actively networking with people and organizations in this field. Our activities are led by the common philosophy of innovation, sharing and serving the society. We plan to bring out the association magazine that explains current technology from a CL perspective.

www.simplegroups.in