CLEAR June2013
CLEAR Magazine (Computational Linguistics in Engineering And Research)
CLEAR June 2013, Volume-2 Issue-2
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
simplequest.in@gmail.com

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad
Editors: Manu Madhavan, Robert Jesuraj. K, Athira P M, Sreejith C
Cover page and Layout: MujeebRehman. O

Contents
Editorial
SIMPLE News & Updates
A Novel Approach for Automated Question Answering: Before the internet and electronic data storage, the time to search ...
Software Localization and Malayalam Computing: localize software in order to overcome cultural barriers for their ...
Extracting Precise Answers using Question Answering System: Support Vector Machine employed tree kernel with a SVM classifier for question classification ...
Human Language Processing: Biological Perspective: Scientific understanding of the role of genes in hearing is also increasing at an ...
ClearTK: Can I be a Competitor for NLTK and CoreNLP?: The ClearTK feature extraction library is highly configurable and ...
Details of M.Tech Projects
CLEAR Sep 2013 Invitation
Last word
Greetings!
We are happy to release this edition of CLEAR at the start of the new academic year with contributions on a variety of topics. Interestingly, there is an article on the biological perspective of language processing as well. This is a positive signal for the CLEAR team, as their efforts are being received by an interdisciplinary audience. We hope that this edition fuels further thoughts on the mysteries of the language phenomenon!
With Best Wishes, Dr. P. C. Reghu Raj (Chief Editor)
NEWS & UPDATES
Industrial Training at IIITM-K: The Virtual Resource Centre for Language Computing (VRC-LC) department of IIITM-K organized a short course and industrial training on Natural Language Processing exclusively for the PG students of GEC, Sreekrishnapuram. The course was mainly related to Malayalam computing, with an emphasis on the need for enabling localization. It was a 10-day programme (18th May to 28th May 2013). During the course, eminent faculty members and research scholars of VRC-LC delivered sessions on various aspects of language processing.
Congratulations!!!!
Publication
The paper titled "N-gram based Algorithm for distinguishing between Hindi and Sanskrit texts", authored by Sreejith C and Indu M from M.Tech Computational Linguistics, has been accepted for presentation at the 2013 IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT)-2013. SIMPLE Groups congratulates Sreejith and Indu for their achievement!!!

Divya S from M.Tech Computational Linguistics presented a paper titled "News Summarization based on Sentence Clustering and Sentence Ranking" in ICMCMM 2013, conducted by MACFAST at Thiruvalla. SIMPLE Groups congratulates Divya S for her achievement!!!
A Novel Approach for Automated Question Answering
K. Ramya, Student, M.Tech (CSE), Periyar Maniammai University, Vallam, ramya.devi43@gmail.com
K. M. Arivuchelvan, Assistant Professor (CSE), Periyar Maniammai University, Vallam, arivu@pmu.edu
Question Generation (QG) is a key challenge for systems that interact with natural languages. Using automated systems to generate questions reduces the dependency on humans. In particular, anaphora resolution and Up-Keys are used to generate questions from input documents (i.e. paragraphs). This paper presents an approach to generate questions from a paragraph. Since the paragraph may contain complex sentences, the system first generates simple sentences from them. The simple sentences are transformed into interrogative sentences, and hybrid ranking is used to select the best questions. Current automatic question generation focuses on factual question generation for reading comprehension or vocabulary assessment; here the aim is to find different types of question from a paragraph (Yes-No, Who, What, Where and How questions). The system also generates the answers to the questions framed above.
I. Introduction
Question Generation (QG) is the task of generating reasonable questions from an input, which can be structured (e.g. a database) or unstructured (e.g. a text). Question Generation is believed to play a crucial role in a variety of cognitive faculties, such as comprehension and reasoning. Asking good questions is a great human skill, and we cannot expect it from everyone. People would therefore benefit from automated QG systems that assist them in meeting their inquiry needs. This section reports some of the research that supports our claim that human question asking is extremely limited in both quantity and quality. This work addresses question generation for the purpose of creating reading assessments about the factual information that is present in expository texts such as encyclopedia articles and text books.

In the field of Automatic Question Generation (AQG), most systems focus on the text-to-question task, where a set of content-related questions is generated based on a given text. Usually, the answers to the generated questions are contained in the text. Question Generation can be divided into deep QG and shallow QG. Deep QG generates deep questions that involve more logical thinking (such as why, why not, what-if, what-if-not and how questions), whereas shallow QG generates shallow questions that focus more on facts (such as who, what, when, where, which, how many/much and yes/no questions).

A QG system can be helpful in the following areas:
Intelligent tutoring systems. QG can ask questions based on learning materials in order to check learners' accomplishment or help them focus on the keystones in study. QG can also help tutors to prepare questions intended for learners or prepare for potential questions from learners.
Closed-domain Question Answering (QA) systems. Some closed-domain QA systems use predefined (sometimes hand-written) question-answer pairs to provide QA services. By employing a QG approach such systems could be ported to other domains with little or no effort.
Natural language summarization/generation systems. QG can help to generate, for instance, Frequently Asked Questions from the provided information source in order to provide a list of FAQ candidates.

The advantage of this approach is that the mapping from declarative sentence to interrogative is done on the semantic representations. In this way, we are able to use an independently developed parser and generator for the analysis and generation stage.

II. Background and Related Literature
NLP techniques have been used to develop a number of tutoring and feedback systems for academic writing support. In the field of computational linguistics, Question Generation (QG) is getting more attention from researchers [6]. Before the internet and electronic data storage, the time to search and find an answer for questions could extend for weeks, hunting for documents in the library. Electronic books and information sources will be the mainstream in the future. In the last few years, new preoccupations appeared for automatic question generation. In ICITA'05 [1], a template-based approach was introduced to generate questions on four types of entities.

An existing approach to question generation uses parse tree manipulation, named entity recognition, and Up-Keys (significant phrases in a document). It describes two methods: one for generating factoid questions, and another for generating definitional questions. Its authors showed how the approach can generate multiple questions from a single input sentence, demonstrated through an evaluation that the factoid question generation method shows promise, and discussed plans to use question generation for question answering [8].

Numerous approaches to text compression and simplification have been proposed; see [5, 4] for reviews of various techniques. One particularly closely related method is discussed in [3]. That method extracts simple sentences from each verb in the syntactic dependency trees of complex sentences.

The task of generating a question about a given text can be decomposed into three subtasks. First, given the source text, a content selection step is necessary to select a target to ask about, such as the desired answer. Second, given a target answer, select the question type, i.e., the form of question to ask, such as a cloze or why question. Third, given the content and question type, construct the actual question in a question construction step. These steps are called Concept Selection, Question Type Determination and Question Construction [9], [10].
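The three subtasks described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' system: the function names, the "year" heuristic for content selection and the single "X was ... in YEAR." sentence pattern are all hypothetical choices made for this sketch.

```python
import re
from dataclasses import dataclass

# Assumption for this sketch: a four-digit year is the only kind of target answer.
YEAR = re.compile(r"\b(1\d{3}|20\d{2})\b")
# Assumption: the only sentence shape we can turn into a "When" question.
WHEN_PATTERN = re.compile(r"^(\w+) was (.+) in \d{4}\.$")


@dataclass
class Target:
    sentence: str
    answer: str


def select_content(text):
    """Step 1 (content selection): first sentence that contains a year."""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match = YEAR.search(sentence)
        if match:
            return Target(sentence=sentence, answer=match.group(0))
    return None


def determine_question_type(target):
    """Step 2 (question type determination): 'when' if the pattern fits, else cloze."""
    return "when" if WHEN_PATTERN.match(target.sentence) else "cloze"


def construct_question(target, question_type):
    """Step 3 (question construction)."""
    if question_type == "when":
        subject, rest = WHEN_PATTERN.match(target.sentence).groups()
        return f"When was {subject} {rest}?"
    # Cloze question: blank out the target answer.
    return target.sentence.replace(target.answer, "_____", 1)


def generate(text):
    """Run the three steps; returns (question, answer) or None."""
    target = select_content(text)
    if target is None:
        return None
    return construct_question(target, determine_question_type(target)), target.answer


print(generate("Lincoln was elected president in 1860."))
```

A real system would replace the regular expressions with a syntactic parser and named entity recognition, but the division of labour between the three functions stays the same.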
III. Methodology
A. Question Taxonomy
We follow 18 question categories, defined according to the content of the information sought rather than the interrogative words (i.e. why, how, where, etc.) [11].
1. Verification: invites a yes or no answer.
2. Disjunctive: Is X, Y, or Z the case?
3. Concept completion: Who? What? When? Where?
4. Example: What is an example of X?
5. Feature specification: What are the properties of X?
6. Quantification: How much? How many?
7. Definition: What does X mean?
8. Comparison: How is X similar to Y?
9. Interpretation: What does X mean?
10. Causal antecedent: Why/how did X occur?
11. Causal consequence: What next? What if?
12. Goal orientation: Why did an agent do X?
13. Instrumental/procedural: How did an agent do X?
14. Enablement: What enabled X to occur?
15. Expectation: Why didn't X occur?
16. Judgmental: What do you think of X?
17. Assertion
18. Request/Directive
These are the various types of question used in the question generation module to produce the questions.

B. Input Paragraph Processing
In paragraph processing the complex sentences are converted into simple sentences. This in turn helps to extract the important keywords from each sentence.

Figure 1 System Flow Diagram

C. Sentence Classification
In this module the input is the elementary sentences. Using a syntactic parser to parse each elementary sentence, and based on the associated POS and NE tagged information, we get from each elementary sentence the subject, object, preposition and verb. This information is used to classify the sentences.
1. Human: This will have any subject that is the name of a person.
2. Entity: This includes animals, plants, mountains and any object.
3. Location: This will be the words that represent locations, such as country, city, school, etc.
4. Time: This will be any time, date or period such as year, Monday, 9 am, last week, etc.
5. Count: This class will hold all the counted elements, such as 9 men, 7 workers, and measurements like weight and size, etc.

D. Question Generation from Paragraph
The Question Generation from Paragraphs (QGP) task has been defined such that it is application-independent. Application-independent means questions will be judged based on content analysis of the input paragraph. For this task, questions are considered important if they ask about the core idea(s) in the paragraph. Questions are considered interesting if an average person reading the paragraph would consider them so, based on a quick analysis of the contents of the paragraph. Simple, trivial questions such as "what is X?" or generic questions such as "what is the paragraph about?" were avoided. In addition, implied questions were not allowed, as the emphasis is on questions triggered and answered by the paragraph. Questions should not be compounded, as in "what is … and who …?". Questions must be grammatically and semantically correct and related to the topic of the given input paragraph. Question types (who/what/why/…) generated for each paragraph should be diverse, if possible. Unique question types are preferred in the set of returned questions.

E. Question Generation with Answer Retrieval
... sentences in a target document and extracted the question-answer pair. So when we click the question, it will display the answer.

Example of Input Paragraph
Abraham Lincoln (February 12, 1809 – April 15, 1865), the 16th President of the United States, successfully led his country through its greatest internal crisis, the American Civil War, preserving the Union and ending slavery. As an outspoken opponent of the expansion of slavery in the United States, Lincoln won the Republican Party nomination in 1860 and was elected president later that year. His tenure in office was occupied primarily with the defeat of the secessionist Confederate States of America in the American Civil War. He introduced measures that resulted in the abolition of slavery, issuing his Emancipation Proclamation in 1863 and promoting the passage of the Thirteenth Amendment to the Constitution. As the civil war was drawing to a close, Lincoln became the first American president to be assassinated.

Examples of Questions
Who is Abraham Lincoln?
What major measures did President Lincoln introduce?
How did President Lincoln die?
When was Abraham Lincoln elected president?
When was President Lincoln assassinated?
What party did Abraham Lincoln belong to?

IV. Conclusion and Future Works
In this paper we presented an approach to question generation using Up-Keys (significant phrases in a document). We showed how our question generation approach can generate multiple questions from an input. The proposed approach automatically generates questions for a given text. We extracted elementary sentences from complex sentences using syntactic information and classified the elementary sentences. We generated questions based on the subject, verb, object and preposition using predefined interaction rules. Based on the questions, the system generates the answers. Since human-generated questions tend to have words with different meanings and senses, the system can be improved with the inclusion of semantic information and word sense disambiguation.

REFERENCES
[1] Andrenucci, A. & Sneiders, E. (2005). Automated Question Answering: Review of the Main Approaches. In Proceedings of the 3rd International Conference on Information Technology and Applications (ICITA'05), Sydney, Australia.
[2] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics, Morristown, NJ, USA, 2002.
[3] Beigman Klebanov, B., Knight, K., Marcu, D.: Text simplification for Information Seeking applications. On the Move to Meaningful Internet Systems (2004)
[4] Clarke, J.: Global Inference for Sentence Compression: An Integer Linear Programming Approach. Ph.D. thesis, University of Edinburgh (2008)
[5] Dorr, B., Zajic, D.: Hedge Trimmer: A parse-and-trim approach to headline generation. In: Proc. of Workshop on Automatic Summarization (2003)
[6] Leung, H., Li, F. & Lau, R. Advances in Web Based Learning - ICWL 2007: 6th International Conference
[7] In V. Rus and A. Graesser, editors, The Question Generation Shared Task and Evaluation Challenge Workshop Report. The University of Memphis, 2009.
[8] Heilman, M., & Smith, N. (2009). Question generation via overgenerating transformations and ranking. Technical Report CMU-LTI-09-013, Carnegie Mellon University
[9] Kristy E. Boyer, William Lahti, Robert Phillips, Michael Wallis, Mladen Vouk and James Lester (2009). An Empirically-Derived Question Taxonomy for Task-Oriented Tutorial Dialogue. In Proceedings of The 2nd Workshop on Question Generation.
[10] Ming Liu, Rafael A. Calvo, Vasile Rus. G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support.
Software Localization and Malayalam Computing
Sreejith C, M. Tech Computational Linguistics, Govt. Engineering College, Palakkad
"Enabling computers to understand human language is one of the major challenges in the field of technology."

Extending your global reach is challenging, and when it comes to software, application quality and tight release deadlines add to the complexity. The English language is sometimes described as the lingua franca of computing. In comparison to other sciences, where Latin and Greek are the principal sources of vocabulary, Computer Science borrows more extensively from English. Due to the technical limitations of early computers, and the lack of international standards on the Internet, computer users were limited to using English and the Latin alphabet. However, this historical limitation is less present today. Most software products are localized in numerous languages, and the use of the Unicode character encoding has resolved problems with non-Latin alphabets. Some limitations have only been changed recently, such as with domain names, which previously allowed only ASCII characters.

In computing, internationalization and localization are means of adapting computer software to different languages, regional differences and technical requirements of a target market. Internationalization is the process of designing a software application so that it can be adapted to various languages and regions without engineering changes. Localization is the process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text. Hence the area of computing is developing from the global English language toward the local languages.

Software Localization
Software localization is the process of adapting a software product to the linguistic, cultural and technical requirements of a target market. Software localization is more than the translation of a product's User Interface. Companies localize software in order to overcome cultural barriers for their products to reach a much larger target audience. Software localization is the translation and adaptation of a software or web product, including the software itself and all related product documentation.

Traditional translation is typically an activity performed after the source document has been finalized. Software localization projects, on the other hand, often run in parallel with the development of the source product to enable simultaneous shipment of all language versions. For example, the translation of software strings may often start while the software product is still in the beta phase. Translation is only one of the activities in a localization project; there are other tasks involved such as project management, software engineering, testing and desktop publishing.

A software product that has been localized properly has the look and feel of a product originally written and designed for the target market. Here are a number of points that have to be considered, as well as the language, in order to effectively localize a software product or website: measuring units, number formats, address formats, time and date formats (long and short), paper sizes, fonts, default font selection, case differences, character sets, sorting, word separation and hyphenation, local regulations, copyright issues, data protection, payment methods, currency conversion, taxes.

The standard localization process includes the following basic steps:
● Analysis of the material received and evaluation of the tools and resources required for localization
● Cultural, technical and linguistic assessment
● Creation and maintenance of terminology glossaries
● Translation to the target language
● Adaptation of the user interface, including resizing of forms and dialogs, as required
● Localization of graphics, scripts or other media containing visible text, symbols, etc.
● Compilation and build of the localized files for testing
● Linguistic and functional quality assurance
● Project delivery
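Two of the points mentioned above, translated interface text and locale-specific date formats, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the catalog contents, the key names, the locale codes and the date patterns are hypothetical, and real projects normally keep such data in resource files managed by a localization toolkit rather than in code.

```python
from datetime import date

# Hypothetical message catalogs, one per language.
CATALOGS = {
    "en": {"greeting": "Hello", "file_menu": "File"},
    "ml": {"greeting": "നമസ്കാരം"},  # "file_menu" deliberately left untranslated
}

# Illustrative date conventions per locale (assumed for this sketch).
DATE_FORMATS = {
    "en_US": "%m/%d/%Y",
    "en_IN": "%d/%m/%Y",
    "ml_IN": "%d/%m/%Y",
}


def translate(key, locale_code, default_language="en"):
    """Look up a UI string; fall back to the default language, then to the key."""
    language = locale_code.split("_")[0]
    catalog = CATALOGS.get(language, {})
    if key in catalog:
        return catalog[key]
    return CATALOGS[default_language].get(key, key)


def format_date(value, locale_code):
    """Format a date with the locale's pattern; ISO format if the locale is unknown."""
    return value.strftime(DATE_FORMATS.get(locale_code, "%Y-%m-%d"))


print(translate("greeting", "ml_IN"))
print(translate("file_menu", "ml_IN"))
print(format_date(date(2013, 5, 28), "en_US"))
```

The fallback in `translate` reflects the fact that translation often runs in parallel with development: a string that has not been translated yet still shows up in the source language instead of breaking the interface.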
Malayalam as classical language
After nearly three years of deliberation at various levels, the Union Cabinet on Thursday, May 23, 2013 declared Malayalam a classical language. This is welcome news for bringing together the governments, the academia, the research institutions, developers and the industry associations on a common ground for promoting local language computing. Securing the classical language tag has benefits, as it will result in a flow of resources and in recognition of scholars and writers of eminence in Malayalam through awards. The benefits extended to Classical Languages include two major annual international awards for scholars of eminence and the setting up of a Centre of Excellence for Studies in Classical Languages. This will also lead to more work in the area of Malayalam computing.

Thunchath Ezhuthachan Malayalam University and Kerala State IT Mission have embarked on a journey to give a fillip to Malayalam computing and research. The aim is to provide a common platform for existing isolated works and to set a benchmark for future research works, which until now was a missing link in promoting Malayalam computing. There are also several other groups and institutes working in this area, such as:

Swathanthra Malayalam Computing
Swathanthra Malayalam Computing (SMC) [3] is a free software collective engaged in development, localization, standardization and popularization of various Free and Open Source Software in the Malayalam language. The slogan of the organization translates to "My language for/on My Computer". SMC has been active since October 2002 and has been working to provide Malayalam language tools that work on all layers of computing, including and not limited to rendering fixes, fonts, input mechanisms, translations (localization), text-to-speech engines, dictionaries, spell checkers and other Indic script based language computing specific tools across operating systems. They are the upstream for Malayalam fonts and tools for popular GNU/Linux based operating systems such as Fedora and Debian. They also maintain localizations for popular Free Software Desktops (GNOME/KDE) and popular applications such as Firefox and Libre Office.

Virtual Resource Centre For Language Computing
VRC-LC [4] is a research and project lab of IIITM-K to promote and strengthen local language with the support of Information Technology. The VRC-LC web portal will act as the repository of various information about the research and projects related to language computing, and of the software tools and standards that enable localization in a computer. The work mainly concentrates on the Malayalam language, to address the linguistic barrier faced by our people when using computers and to enable the common man to use various e-Governance projects.
Simple groups Computational Linguistics Lab @ GEC Sreekrishnapuram
Students' Innovation in Morphology Phonology and Language Engineering (abbreviated as SIMPLE) [5] is the official group of M.Tech Computational Linguistics students at Govt. Engineering College, Palakkad. Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of human language processing. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science that aims at computational models of human cognition. As the name indicates, SIMPLE is a platform for showcasing innovations, ideas and activities in the field of Computational Linguistics. The SIMPLE group is also working to promote and strengthen local language with the support of Information Technology and computationally driven statistical approaches. The Malayalam computing activities of Simple groups include Malayalam indexing, POS tagger, subject object identifier, spell checker, lemmatization, question answering systems, etc.

India is a multilingual country, with 22 official languages and 12 scripts. In India only about 5% of people know English, and the rest are deprived of the benefits of information technology development. The benefits of information technology can reach the common man only when software tools and human machine interface systems are available in their own languages. To enable wide proliferation in Indian languages, tools, products and resources should be freely available to the public. Thus in a multilingual country like India the scope of localization is enormous. Much effort is needed in this area. The Malayalam language and Malayalam computing should also be encouraged, and more works should be proposed and initiated. India is thus poised to emerge as a Multilingual Computing hub.

Reference
1 http://malayalam.kerala.gov.in
2 http://www.sdl.com/technology/language-technology/what-is-software-localization.html
3 http://smc.org.in/
4 http://www.iiitmk.ac.in/vrclc/en/index.html
5 http://www.simplegroups.in
6 http://tdil.mit.gov.in/
Extracting Precise Answers Using Question Answering System
K. Subalokshini, M.Tech, CSE, Periyar Maniammai University, Vallam, suba.candy@gmail.com
R. Poonguzhali, Assistant Professor (SS), CSE, Periyar Maniammai University, Vallam, rpg_pmu@rediffmail.com
Question Answering Systems provide answers to users' questions in a concise form that fulfills the expectation of the user. A question answering system is based on keyword search, which is similar to Web search. The Question Answering System should be able to provide answers to the user's questions in a user friendly way. Judging the correctness of the answer is an important issue in the field of question answering. In this paper, question classification is one of the heuristics for answer validation. Question classification is used to determine the type of question. This paper focuses on context based retrieval of information. It provides an efficient method for extracting exact textual answers from the documents returned by a traditional IR system over a large-scale collection of texts.
Introduction
The World Wide Web is the major source of information for everyone. All kinds of information are available on the World Wide Web in one form or another. Managing such a huge volume of data is not an easy task. Search engines like Google and Yahoo return links to documents for the user query. Most often, web pages retrieved by these search engines do not provide precise information and may contain irrelevant information even in the top ranked results. This makes the user look for an alternate information retrieval system that can provide answers to user queries in succinct form. Question Answering Systems, unlike other information retrieval systems, combine question classification, information retrieval, and information extraction techniques to present precise answers to user questions posed in a natural language.

Natural Language Processing (NLP) is the computerized approach to analyzing text that is based on both a set of theories and a set of technologies. As it is a very active area of research and development, there is not a single agreed-upon definition that would satisfy everyone, but there are some aspects which would be part of any knowledgeable person's definition. The definition is: Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. The goal of NLP as stated above is to accomplish human-like language processing.
A typical pipeline Question Answering System consists of different phases: Question Classification, Question Processing, Document Processing, Answer Processing and Answer Extraction.

Background and Related Work
NLP techniques are used in applications that make queries to databases, extract information from text, retrieve relevant documents from a collection, translate from one language to another, generate text responses, or recognize spoken words and convert them into text. A common feature of NLP systems is that they convert text input into a formal representation of meaning such as logic (first order predicate calculus), semantic networks, conceptual dependency diagrams, or frame-based representations [1]. NLP-based question answering systems (QASs) may utilize machine learning to improve their syntax rules [1], lexicon [4], semantic rules [5], or the world model [4].

Early QA systems, e.g. the 1960s' Intelligent Question-Answering Systems by Coles et al. [3], focused on how to resolve the semantic ambiguity of questions using artificial intelligence (AI), and evolved into expert systems [2].

There are two main approaches for question classification: manual and automatic. Question Answering Systems using manual classification (Hermjakob, 2001) apply hand-crafted rules to identify expected answer types. These rules may be very accurate, but they are time consuming, tedious, and non-extendible in nature. Automatic question classification is further divided into two main approaches, known as machine learning and language modelling. The primary machine learning algorithm used for question classification is the Support Vector Machine. One approach employed a tree kernel with an SVM classifier for question classification and reported 80.2% accuracy without the use of syntactic or semantic features. Language modelling based question classification approaches try to compute the probability of the question for a given question class.

The use of context in information retrieval systems has been extensively studied. Recently, the effectiveness of spatial and temporal contextual clues [8], category labels [9], and top-ranking related sentences [6] has been explored empirically through user studies in a Web environment. Furthermore, the Interactive Track at TREC has generated further interest in interface issues associated with information retrieval systems. [7] compared a single-document and a multi-document view of IR results for a question answering task.

Methodology
A. Question Classification
Question classification is used to determine the question type. The question type gives the user a clear view of the expected answer type. With the help of the question type it is easy to retrieve the answer from a large collection of documents. There are different types of question. Some of them are listed below.
Who/Whose/Whom Questions: Questions falling under this category usually ask about an individual or an organization.
Example: Who wrote "Thirukural"?

Why Questions: "Why Questions" always ask for certain reasons or explanations.
Example: Why do heavier objects travel downhill faster?

How Questions: "How Questions" have two types of patterns. For the first pattern, the expected answer type is a description of some process, while the second pattern returns some number as the answer.
Example: How does data travel on the internet? How many states are there in India?

When Questions: "When Questions" start with the "When" keyword and usually ask for a date or time.
Example: When was Lincoln born?

Where Questions: "Where Questions" start with the "Where" keyword and are usually related to a location. It may be a mountain, a geographical boundary, a man-made location such as a temple, or some virtual location or fictional place.
Example: Where is Tajmahal?

Which Questions: "Which Questions" start with the "Which" keyword, and the expected answer is usually tied to the noun phrase in the question.
Example: Which is the best laptop?

Functional Word Questions: All non-Wh questions (except "How") fall under the category of Functional Word Questions. Functional word questions usually start with non-significant verb phrases.
Example: List the properties of acids.

What Questions: "What Questions" have several types of patterns and can ask for virtually anything. Many "What Questions" are disguised in the form of "Functional Word Questions".
Example: What is android?

Figure 1. System Flow Diagram

B. Keyword Extraction:
Keyword extraction is used to extract only the keywords from the user's question, removing stop words and stemming the remaining words. With the keywords it becomes much easier to extract the answers. The keywords are extracted from the question and are further used as a root to extract the answers from the available online resources. To obtain the keywords we can use syntax parsing tools.
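The question types and the keyword extraction step described above can be illustrated with a minimal Python sketch. It is only a toy illustration under stated assumptions, not the system described in this article: the rule table `WH_TYPES`, the small `STOP_WORDS` list and the function names are all hypothetical, the classifier looks only at the first word of the question, and stemming is left out.

```python
import re

# Hypothetical rule table: leading question word -> question type named in the text.
WH_TYPES = {
    "who": "Who/Whose/Whom", "whose": "Who/Whose/Whom", "whom": "Who/Whose/Whom",
    "why": "Why", "how": "How", "when": "When",
    "where": "Where", "which": "Which", "what": "What",
}

# Tiny illustrative stop-word list (an assumption; real systems use a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "was", "are", "do", "does", "did",
              "in", "of", "to", "by", "on"}


def tokenize(question):
    """Lower-case the question and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())


def classify_question(question):
    """Return the question type based on the first token only."""
    tokens = tokenize(question)
    if not tokens:
        return "Unknown"
    # Anything that does not start with a Wh-word (or "how") is treated
    # as a Functional Word Question.
    return WH_TYPES.get(tokens[0], "Functional Word")


def extract_keywords(question):
    """Drop question words and stop words, keeping the other tokens in order."""
    return [t for t in tokenize(question)
            if t not in STOP_WORDS and t not in WH_TYPES]


for q in ["Who wrote 'Thirukural'?", "When was Lincoln born?",
          "List the properties of acids."]:
    print(classify_question(q), extract_keywords(q))
```

A first-word rule like this is exactly the kind of hand-crafted rule the manual approach relies on; it is easy to read, but it would miss "What Questions" that are phrased as Functional Word Questions.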
C. Information Retrieval:
Information retrieval is the activity of obtaining information relevant to an information need from a collection of information resources. Here it is used to get information related to the question asked by the user. The information can be retrieved from many online resources such as Google and Wikipedia. The keywords obtained in the previous step help to retrieve the needed information from the available online resources.

D. Collecting Frequent Item Sets
Collecting frequent item sets helps to identify whether a keyword occurs frequently in a document or not, so that it is easy to identify the most relevant documents.

E. Answer Extraction:
Answer extraction is used to get a succinct form of the answer to the given question. From the top N relevant documents or passages returned by information retrieval, answer extraction performs a detailed analysis and pinpoints the answer to the question. Usually answer extraction produces a list of answer candidates and ranks them according to some scoring function.

IV Conclusion and Future Work
This paper summarizes the categories of QA systems, and also helps us to understand the types of question classification. Question answering is one of the hot spots in natural language processing. Compared with a traditional keyword-based search engine, a question answering system allows users to ask questions in natural language. The goal of a question answering system is to retrieve answers to questions rather than full documents or best-matching passages, as most information retrieval systems do. Our method takes advantage of the redundancy of answers across sources, which significantly reduces the number of incorrect answers presented to the user, and the question answering system gives a succinct answer to the user's question in natural language.

REFERENCES
[1] H. Feili, Natural Language Processing Projects, [PowerPoint], Sharif UT, Tehran, Iran, 2003.
[2] Robert F. Simmons, Answering English questions by computer: a survey, Communications of the ACM, Vol. 8, No. 1, pp. 53-70, Jan. 1965.
[3] L. Stephen Coles, An on-line question-answering systems with natural language and pictorial input, Proceedings of the 23rd ACM national conference, Princeton, ACM, August 1968.
[4] A. Kirschenbaum, S. Wintner, "Minimally supervised transliteration for machine translation", Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), April 2009.
[5] E. Sneiders, Automated Question Answering: Template-Based Approach, PhD thesis, Stockholm University / KTH press, Sweden, 2002.
[6] White, R., Ruthven, I., and Jose, J. Finding relevant documents using top ranking sentences: An evaluation of two alternative schemes. In Proceedings of SIGIR 2002.
[7] Belkin, N., Keller, A., Kelly, D., Carballo, J., Sikora, C., and Sun, Y. Support for question-answering in interactive information retrieval: Rutgers' TREC-9 interactive track experience. In Proceedings of TREC-9, 2000.
[9] Park, J. and Kim, J. Effects of contextual navigation aids on browsing diverse web systems. In Proceedings of CHI 2000.
[10] Dumais, S., Cutrell, E., and Chen, H. Optimizing search by showing results in context. In Proceedings of CHI 2001.
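As a supplementary illustration of steps C to E of the methodology (retrieval, keyword frequency, and answer extraction), the following Python sketch ranks documents by keyword frequency and then ranks sentences as answer candidates. It is a minimal sketch under stated assumptions: the documents are a made-up in-memory toy collection rather than results from Google or Wikipedia, a plain keyword count stands in for the frequent item set step, and the function names and scoring functions are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(keywords, documents, top_n=2):
    """Score each document by how often the keywords occur in it; keep the top N."""
    scored = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        score = sum(counts[k] for k in keywords)
        scored.append((score, doc))
    # Stable sort: documents with equal scores keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]


def rank_answer_candidates(keywords, documents):
    """Rank sentences by the number of distinct keywords they contain."""
    candidates = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            overlap = len(set(keywords) & set(tokenize(sentence)))
            if overlap:
                candidates.append((overlap, sentence.strip()))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for overlap, sentence in candidates]


# Toy collection (assumption: stands in for retrieved web pages).
docs = [
    "Abraham Lincoln was born on February 12, 1809. Lincoln became president in 1861.",
    "The Taj Mahal is in Agra.",
    "Lincoln is a city in Nebraska.",
]
keywords = ["lincoln", "born"]  # keywords for "When was Lincoln born?"

top_docs = rank_documents(keywords, docs)
print(rank_answer_candidates(keywords, top_docs)[0])
```

The highest-ranked sentence still has to be reduced to the expected answer type (here a date) before it can be presented as a succinct answer; that final step is not shown.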
“When a language dies, a way of understanding the world dies with it, a way of looking at the world.” - George Steiner
Human Language Processing: Biological Perspective
Priyesh Sankar
MBBS Student
Government Medical College Kozhikode

"Communication is truly a multisensory experience. For most individuals, the pathway from creating sound (speaking) to receiving, processing, and interpreting sound (hearing) is critical."

1. Introduction
Sound offers us a powerful means of communication. Our sense of hearing enables us to experience the world around us through sound. Because our sense of hearing allows us to gather, process, and interpret sounds continuously and without conscious effort, we may take this special sense of communication for granted. But, did you know that
- Human communication is multisensory, involving visual, tactile, and sound cues?
- The range of human hearing, from just audible to painful, is over 100-trillion-fold?
- Tiny specialized cells in the inner ear, known as hair cells, are responsible for converting the vibrational waves of sound into electrical signals that can be interpreted by the brain?
- Tinnitus, commonly known as "ringing in the ears," is actually a problem that originates in the brain?

Contemporary hearing research is guided by lessons learned from sensory research, namely that specialized nerve cells respond to different forms of energy (mechanical, chemical, or electromagnetic) and convert this energy into electrochemical impulses that can be processed by the brain. The brain then works as the central processor of sensory impulses. It perceives and interprets them using a "computational" approach that involves several regions of the brain interacting all at once. This notion is different from the long-held view that the brain processes information one step at a time in a single brain region. Over the past decade, scientists have begun to understand the intricate mechanisms that enable the ear to convert the mechanical vibrations of sound to electrical energy, thereby allowing the brain to process and interpret these signals.

Scientific understanding of the role of genes in hearing is also increasing at an impressive rate. The first gene associated with hearing was isolated in 1993. By the end of 2000, more than 60 genes related to hearing were identified. In addition, scientists have pinpointed over 100 chromosomal regions believed to harbor genes affecting the hearing pathway. Many genes were first isolated in the mouse, and from this, the human genes were
identified. Completion of the Mouse and Human Genome Projects is helping scientists isolate these genes.

The rapid growth in our understanding is of more than academic interest. In a practical sense, sharing this information with young people can enable them to adopt a lifestyle that promotes the long-term health of their sense of hearing. With this in mind, this supplement will address several key issues, including
- What is the nature of sound?
- What mechanism allows us to process sounds with great precision, from the softest whisper to the roar of a jet engine, from a high-pitched whistle to a low rumble?
- What are the roles of hearing, processing, and speaking in human communication?
- What happens when the hearing mechanism is altered or damaged? How does sound processing change?
- What can be done to prevent or accommodate damage to our sense of hearing?

2. Language Processing
The language center of the brain (Wernicke's area in the dominant temporal lobe) is the "dictionary" of the brain, translating words into concepts and concepts into words. Wernicke's area has input from auditory and visual areas of the brain, which makes sense. In essence, Wernicke's area hears speech and then translates those sounds into words that have abstract meaning. Our language cortex can recognize a limited set of speech sounds, or components (called percepts). We learn these in the first four years of our life from hearing speech, and then the "language window" closes and we are limited to those sounds for the rest of our lives. Anything we hear in the context of speech after that will be sorted into one of the pre-existing percepts.

3. Major Concepts Related to Hearing and Communication

3.1 Communication is multisensory
Communication with others makes use of sound and vision.
Although some people might define communication as an interaction between two or more living creatures, it involves much more than this. For example, we are constantly receiving information from, and changing our relationship with, our environment. This communication is received through our senses of smell, taste, touch, vision, and hearing. Communication with others makes use of vision (making eye contact or assessing body language) and sound (using speech or other sounds, such as laughing and crying). When a group of people shares a need or desire to communicate, language is born. The most common human language is the language of words. Words may be communicated in various ways. Although they are usually spoken, they also may be written, finger spelled, or expressed through sign language. Words may be communicated by writing, speaking, and signing.

3.2 Language acquisition: imprinting and critical periods
Our brains have specific regions devoted to speech, hearing, and language functions.
Since the time of Plato, there has been debate over the nature of language. Some believe that language is inborn and purposeful, while others
believe it to be artificial and arbitrary. Some consider language to be an evolutionary product, while others do not. It appears that words are not "built into" the brain, because language is a relatively recent evolutionary development and also because languages differ substantially from one another. Language and communication are made possible by specialized structures. We have evolved a sophisticated apparatus for both speech and hearing. Our brains have specific regions devoted to speech, hearing, and language functions. Still, the mechanisms by which children acquire language are only partially understood.

There are two concepts important to the acquisition of language. One is imprinting, which refers to the ability of some animals to learn rapidly at a very early age and during a well-defined period in their development. Imprinting generally refers to the ability of offspring to acquire the behaviors characteristic of their parents. This process, once it occurs, is not reversible.

A second concept, related to imprinting, is critical periods. A nonhuman example of a critical period is the limited time frame within which a male bird must acquire his song. For instance, a male white-crowned sparrow usually begins singing his full song between 100 and 200 days of age. Proper song acquisition is needed for mating and for marking territory. However, to learn his song, the young bird must be exposed to an adult bird's song consistently and frequently between one week and two months after hatching.

Very soon after birth, human infants learn to distinguish speech sounds from other types of sound. Within the next month or two, the infant learns to distinguish between different speech sounds. An 18-month-old toddler can recognize and use the sounds (called phonemes) of his or her language and can construct two-word phrases. A 3½-year-old child can construct nearly all of the possible sentence types. From this point on, vocabulary and language continue to expand and be refined.

3.4 Perception of sound has a biological basis
When sound, as vibrational energy, arrives at the ear, it is processed in a complex but distinct series of steps. These steps reflect the
anatomical division of the ear into the outer ear, middle ear, and inner ear.

Figure: Anatomy of the human ear.

The pathway from the outer ear to the inner ear is remarkable in its ability to precisely process sounds from the very softest to the very loudest and to distinguish very small changes in the frequency of sound (pitch). Humans can discern a difference in frequency of just 0.1 percent. This means that humans can tell the difference between sounds at frequencies of 1,000 Hz and 1,001 Hz.

The outer ear. The outer ear is composed of two parts. The pinna is the outside portion of the ear and is composed of skin and cartilage. The second part is called the ear canal (also called the external auditory canal). The pinna, with its twists and folds, serves to enhance high-frequency sounds and to focus sound waves into the middle and inner portions of the ear. The pinna also helps us determine the direction from which a sound originates. However, the greatest asset in judging the location of a sound is having two ears. Because one ear is closer to the source of a sound than the other, the brain detects slight differences in the times and intensities of the arriving signals. This allows the brain to approximate the sound's location. Interestingly, the position and orientation of the pinna, at the side of the head, help reduce sounds that originate behind us. This helps us hear sounds that originate in the direction we are looking and reduces distracting background noises.

Some students (and adults) may believe that the size of the ear is an indication of the organism's hearing ability, that is, the larger the ear, the better the ability to hear. This misperception doesn't take into account the internal structures of the ear that process sound vibrations. A large pinna may serve a function that is unrelated to hearing. For example, the external ear of the African elephant is filled with small blood vessels that help the animal dissipate excess heat. The external ear may be specialized in other ways, as well. Cat owners, for example, have undoubtedly observed the rather dramatic movement of their pet's pinnae as the animal attempts to locate the source of a sound.

The cochlea is divided into an upper chamber, called the scala vestibuli or vestibular canal, and a lower chamber, called the scala tympani or tympanic canal. These are seen most easily if the cochlea is represented as uncoiled.
Both the upper and lower chambers are filled with a fluid, called perilymph, which is nearly identical to spinal fluid. The stapes vibrates against the oval window, creating fluid vibrations that are transmitted as pressure waves all the way through the cochlea. As represented by the arrows in the figure, these waves move from the upper chamber to the lower chamber, to the round window. The round window allows the release of the hydraulic pressure caused by vibration of the stapes in the oval window. Additionally, the diameter of the chambers decreases from base (closest to the windows) to apex.
“All communication involves faith; indeed, some linguisticians hold that the potential obstacles to acts of verbal understanding are so many and diverse that it is a minor miracle that they take place at all.” - Terry Eagleton
ClearTK: Can I be a Competitor for NLTK and Stanford CoreNLP?
Robert Jesuraj K
M.Tech Computational Linguistics
Government Engineering College, Sreekrishnapuram

ClearTK is a toolkit for developing statistical natural language processing components in Java and is based on the Apache Unstructured Information Management Architecture (UIMA) framework for text analysis. It is developed by the Center for Computational Language and Education Research (CLEAR) at the University of Colorado at Boulder.

The overall size of ClearTK (cleartk-release-1.4.1-bin) is 177 MB.

Features:
- A common interface and wrappers for popular machine learning libraries such as SVMlight, LIBSVM, OpenNLP MaxEnt, and Mallet.
- A rich feature extraction library that can be used with any of the machine learning classifiers. Under the covers, ClearTK understands each of the native machine learning libraries and translates your features into a format appropriate to whatever model you're using.
- Infrastructure for creating NLP components for specific tasks such as part-of-speech tagging, BIO-style chunking, named entity recognition, semantic role labeling, temporal relation tagging, etc.
- Wrappers for common NLP tools such as the Snowball stemmer, the OpenNLP tools, the MaltParser dependency parser, and the Stanford CoreNLP tools.
- Corpus readers for collections like the Penn Treebank, ACE 2005, CoNLL 2003, Genia, TimeBank and TempEval.

Most of ClearTK is distributed under the BSD license. However, there are a couple of subprojects that are licensed under the GPL because they depend on GPL-licensed third-party libraries. ClearTK can be used to achieve state-of-the-art performance on biomedical part-of-speech tagging.

UIMA provides a set of interfaces for defining components for analyzing unstructured information and provides infrastructure for creating, configuring, running, debugging, and visualizing these components. ClearTK, however, focuses on UIMA's ability to process textual data. All components are organized around a type system which defines the structure of the annotations that can be associated with each document. This information is instantiated in a data structure called the Common Analysis Structure (CAS). There is one CAS per document, which all components that act on the document can access and update. Every annotation that is created is posted to the CAS, which is then made available for other UIMA components to use and modify. Here is a short list of the most important kinds of components:

Collection Reader – a component that reads in documents and initializes the
CAS with any available annotation information.

Analysis Engine – a component that performs analysis on the document and adds annotations to the CAS or modifies existing ones.

CAS Consumer – a component that processes the resulting CAS data (e.g. writes annotations to a database or a file).

Collection Processing Engine (CPE) – an aggregate component that defines a pipeline that typically consists of one collection reader, a sequence of analysis engines, and one or more CAS consumers.

While UIMA provides a solid foundation for processing text, it does not directly support statistical NLP. ClearTK provides a framework for creating UIMA components that use statistical learning as the foundation for decision making and annotation creation.

Statistical NLP in ClearTK
ClearTK was designed and implemented with special attention given to creating reusable and flexible code for performing statistical NLP. As such, the library provides classes that facilitate extracting features, generating training data, building classifiers, and classifying annotations.

ClearTK introduces classifier annotators, which are analysis engines that perform feature extraction, classify the extracted features using a machine learning model, and interpret the results of the classification by, e.g., labelling annotations or creating new annotations. A classifier annotator can also be run in training mode, in which it performs feature extraction and then writes out training data which is then used for building a model.

Feature Extraction
The ClearTK feature extraction library is highly configurable and easily extensible. Each feature extractor produces a feature or set of features for a given annotation (or pair or collection of annotations, as the feature extractor requires) for the purpose of characterizing the annotation in a machine learning context. A feature in ClearTK is a simple object that contains a value (i.e. a string, boolean, integer, or float value), a name, and a context that describes how the feature value was extracted. Most features are created by querying the CAS for information about existing annotations. Because features are typically many in number, short lived, and dynamic in nature (i.e. features often derive from previous classifications), they are not represented in the CAS but rather as simple Java objects.

The spanned text extractor is a very simple example of a feature extractor that takes an annotation and returns a feature corresponding to the covered text of that annotation. The type path extractor is a slightly more complicated feature extractor that extracts features based on a path that describes a location of a value defined by the type system with respect to the annotation type being examined.

For example, the figure below shows a simple hypothetical type system. A type path extractor initialized with the path
headword/partOfSpeech can extract features corresponding to the part-of-speech of the head word of examined constituents.

A much more sophisticated feature extractor is the window feature extractor. It operates in conjunction with a simple feature extractor (such as the spanned text extractor or type path extractor) and extracts features over some numerically bounded and oriented range of annotations (e.g. five tokens to the left) relative to a focus annotation (e.g. a named entity annotation or syntactic constituent) that are within some window annotation (e.g. a sentence or paragraph annotation). The "featured" annotations, the focus annotation and the window annotation are all configurable with respect to the type system. This allows the window feature extractor to be used in a wide array of contexts. The window feature extractor also handles boundary conditions such that, e.g., words appearing outside the sentence that the focus annotation appears in would be considered as "out-of-bounds." This feature extractor allows one to extract features such as:
- The three part-of-speech tags to the left of a word.
- The part-of-speech tag of the head word of constituents to the right of an annotation.
- The identifiers of recognized concepts to the left of an annotation.
- The penultimate word of a named entity mention annotation.
- The last three letters of the first two words of a named entity mention annotation.
- The lengths of the previous 10 sentences.

A feature extractor is any class that generates feature objects. For example, the window extractor has a method that takes a focus annotation (e.g. a word) and a window annotation (e.g. a sentence) and produces features relative to these two annotations according to how the feature extractor was initialized. Many feature extractors implement an interface that designates them as simple feature extractors, which allows them to be used by more complicated feature extractors such as the window extractor. It is the responsibility of the classifier annotator to know how to initialize feature extractors and how to call them.

NLP Components in ClearTK
ClearTK provides a growing library of UIMA components that support a variety of NLP tasks. The library consists of three main types of components: collection readers, analysis engines, and classifier annotators, which are summarized in the table below.
Component – Description
Penn Treebank reader – Reads the Penn Treebank corpus
PropBank reader – Reads the PropBank corpus
ACE2005 reader – Reads in named entity mentions from the ACE 2004 and 2005 tasks
CoNLL2003 reader – Reads in named entity mentions from the CoNLL 2003 task
GENIA reader – Reads in the GENIA corpus
Tokenizer – Penn Treebank style tokenizer
Sentence detector – Wrapper around OpenNLP sentence detector
Syntax parser – Wrapper around OpenNLP syntax parser
Stemmer – Wrapper around the Snowball stemmer
Gazetteer annotator – Finds mentions of entries in a gazetteer using simple string matching
POS tagger – Performs part-of-speech tagging
BIO chunker – Performs BIO-style chunking
Predicate annotator – Identifies predicates
Argument annotator – Identifies and classifies semantic arguments of predicates
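To make the idea behind the window feature extractor more concrete, here is a minimal Python sketch of window-based feature extraction with out-of-bounds handling. It is not ClearTK's Java API: the function `window_features`, its parameters, the feature naming scheme and the "OOB" marker are all assumptions made for this illustration, and a plain list of part-of-speech tags stands in for the annotations inside one sentence.

```python
def window_features(tokens, focus, size=3, direction="left", name="Word"):
    """Return (feature_name, value) pairs for `size` positions next to tokens[focus].

    `tokens` plays the role of the window annotation (one sentence) and
    `focus` is the index of the focus annotation. Positions that fall
    outside the sentence are reported as "OOB" (out of bounds).
    """
    step = -1 if direction == "left" else 1
    prefix = "L" if direction == "left" else "R"
    features = []
    for offset in range(1, size + 1):
        index = focus + step * offset
        value = tokens[index] if 0 <= index < len(tokens) else "OOB"
        features.append((f"{name}_{prefix}{offset}", value))
    return features


# Part-of-speech tags of a five-token sentence; the focus is the third token.
pos_tags = ["DT", "JJ", "NN", "VBD", "RB"]
print(window_features(pos_tags, focus=2, size=3, direction="left", name="POS"))
```

Each returned pair carries a name and a value, loosely mirroring the description of a feature as a simple object; the third position to the left does not exist in this sentence, so it is reported as out of bounds.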
The collection readers of particular interest provided by ClearTK are those that read in widely used annotated corpora such as the Penn Treebank or PropBank. The Penn Treebank reader reads constituent parse trees into the CAS such that the full syntactic parse of each sentence is represented in the CAS and constituents and their relations can be retrieved. The PropBank reader extends this reader by layering on the predicate/argument structure provided by the PropBank corpus. There are also collection readers for reading in the ACE 2005 corpus and the CoNLL 2003 shared task data.

The analysis engines provided by ClearTK include a pattern-based tokenizer, a gazetteer annotator, and various wrappers around other NLP libraries. The tokenizer is based on Penn Treebank tokenization rules. The gazetteer annotator finds entries from a gazetteer in text using simple string matching. Other analysis engines include wrappers around the OpenNLP part-of-speech tagger, sentence detector, and syntax parser, and a wrapper around the Snowball stemmer.

ClearTK currently provides a small handful of classifier annotators: a part-of-speech tagger, a BIO-style chunker, and a pair of classifier annotators that support semantic role labelling. The BIO chunker performs text chunking using the popular Begin, Inside, Outside labelling scheme for classifying annotations as members of some kind of "chunk." For example, in named entity recognition, labels such as "B-person" or "I-location" are used for words that begin a person mention or are inside a location mention, respectively. The BIO chunker is used for named entity recognition, shallow parsing, and tokenization. Semantic role labelling is achieved by the predicate and argument annotators. The predicate annotator decides whether constituents of a syntactic parse are predicates or not. The argument annotator runs subsequently and finds the arguments of a predicate.
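The Begin, Inside, Outside labelling scheme can be illustrated with a small Python sketch that turns a sequence of BIO labels back into chunks. This is only an illustration of the labelling scheme, not ClearTK's BIO chunker; the function name `bio_to_chunks` and the example sentence with its labels are assumptions made for this sketch.

```python
def bio_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, text) pairs from Begin/Inside/Outside labels."""
    chunks = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        # A new chunk starts at every B- label, and also at an I- label
        # whose type differs from the chunk currently being built.
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if label == "O" or starts_new:
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if label != "O":
            if not current_tokens:
                current_type = label[2:]
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks


tokens = ["Abraham", "Lincoln", "was", "born", "in", "Kentucky"]
labels = ["B-person", "I-person", "O", "O", "O", "B-location"]
print(bio_to_chunks(tokens, labels))
```

In a classifier-based chunker the labels themselves would be predicted token by token; the sketch only shows how predicted labels map back to named entity mentions.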
M.Tech Computational Linguistics
Department of Computer Science and Engineering
Details of Master Research Projects

Title: Opinion Mining
Name of Student: Ancy K Sunny
Abstract: Opinion mining can be performed with various methods and in various domains. The proposed system finds information about a product on the internet, extracts the sentences which express opinions, and finds out the features which are commented on. It then calculates the polarity of the overall opinions. The first task is to identify whether a collected sentence is subjective (opinionated) or objective. This phase uses a bootstrap method which employs high-precision (and low-recall) classifiers to extract a number of subjective sentences. The labelled sentences are then fed to an extraction pattern learner, which produces a set of extraction patterns that are statistically correlated with the subjective sentences. These patterns are then used to identify more sentences within the un-annotated texts that can be classified as subjective. The next step is to extract the object features that have been commented on in each sentence. In the last phase the system finds the polarity of the opinion and summarizes the opinions on the same features. Adverb-adjective combinations are used to find the polarity of the opinion.
Tools: Python, NLTK, SentiWordNet.
Place of Work: Govt. Engineering College, Sreekrishnapuram

Title: Discourse Analysis: Clustering Approach
Name of Student: Christopher Augustine
Abstract: Discourse analysis is concerned with coherent processing of text segments larger than the sentence and assumes that this requires something more than just the interpretation of the individual sentences. While syntax and semantics work with sentence-length units, the discourse level of NLP works with units of text longer than a sentence. Several types of discourse processing can occur at this level, two of the most common being anaphora resolution and discourse structure recognition. A discourse usually concentrates on a group of nouns. Clustering these nouns, with appropriate boundary corrections, can segment a text at the discourse level.
Tools: Python
Place of Work: Govt. Engineering College, Sreekrishnapuram

Title: Word Sense Disambiguation in Malayalam
Name of Student: Mujeeb Rehman O
Abstract: A peculiarity of any language is that it may contain many ambiguous words. Word Sense Disambiguation (WSD) is the task of determining which of the senses of a word is invoked in a particular context. A standard approach to WSD is to consider the context of the word's use, in particular the words that occur in some predefined neighbouring context. Like many other languages, Malayalam also has ambiguous words, which are called Nanarthas. This project adopts the Lesk algorithm for Malayalam word sense disambiguation, in other words for resolving Nanartha words.
Tools: Python
Place of Work: Govt. Engineering College, Sreekrishnapuram

Title: Dysarthric Speech Recognition and Enhancement
Name of Student: Divya Das
Abstract: Dysarthria is a motor-neuro disorder. It affects the functioning of the speech production system. Clinically, it can be treated with medicines and speech therapy. Computer-based approaches can be used to improve the intelligibility of speech. Here the input to the computer-based system is disordered speech from dysarthric people. This paper deals with a speech enhancement method to improve the intelligibility of dysarthric speech. The Nemours database is used as the source of dysarthric speech, and mild dysarthric speech is used for the experiment. The directories BB, FB, LL and MF of the Nemours database contain the speech of mild dysarthria. Praat is used for analyzing the dysarthric speech. The dysarthric speech is separated into voiced and unvoiced components, and speech enhancement is done on the voiced part, which carries most of the information. From this voiced speech, formant frequencies are extracted using the Burg algorithm. The extracted formant frequencies, especially F1 and F2, are then passed through a 4th-order high-pass filter. In dysarthric speech, formant frequencies do not vary as much as in normal speech, so applying high-pass filtering introduces more variation into the dysarthric speech. In dysarthric speech the spectral slope is also smaller than that of normal speech. By applying high-pass filtering to the formant frequencies, the spectral slope can also be increased.
Tools: HTK, Praat, Matlab, Colea, Wavesurfer, P563, P862, Composite, fAI.
Place of Work: Amrita University, Coimbatore

Title: Chronological News Summarization
Name of Student: Divya S
Abstract: News articles are one of the fastest-growing types of documents on the Internet, to the point that finding and recalling relevant news events has become a difficult task. News summarization aims to identify common information among multiple related news documents and fuse it into a coherent text to produce an abstract of a news event. The proposed system is intended to produce informative summaries, highlighting the common and most relevant information found in news documents in a user-friendly manner. This will help Web users to pinpoint the information they need without extensive reading. The system takes as input a cluster of news stories on the same event and produces a summary which synthesizes common information across the input stories. For a particular news event, the system collects all the related stories from a particular time stamp (the beginning of the news event) and produces the abstract using statistical approaches and natural language processing techniques. The summary is intended to contain all the relevant points of the news event from the start of the event till date.
Tools: Python, NLTK, Hierarchical cluster, WordNet, Hadoop
Place of Work: Govt. Engineering College, Sreekrishnapuram
Title
Ontology-based Domain-specific Natural Language Question Answering System
Name of Student Abstract
Athira PM A question answering (QA) system aims at retrieving precise information from a large collection of documents. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain, based on ontological information. The proposed system comprises four basic modules suited to enhancing current QA capabilities with the ability to process complex questions. The first module, question processing, analyses and classifies the question and reformulates the user query. The second module retrieves the relevant documents. The third module processes the retrieved documents, and the last module extracts and generates the response. Ontology and domain knowledge are used for reformulating queries and identifying relations. The aim of the system is to generate short, specific answers to questions asked in natural language in a specific domain.
Tools
Python, NLTK, Stanford CoreNLP, VerbNet, Protégé
Place of Work
Govt. Engineering College, Sreekrishnapuram
Title
Scalable Natural Language Report Management using Distributed IE and NLG from Ontology
Name of Student Abstract
Manu Madhavan Automatic text analysis and the creation of a knowledge base from natural language reports are key ideas in the field of the semantic web. In the age of information explosion, performing these tasks on big data becomes tedious and impractical. MapReduce, a programming paradigm proposed by Google, offers a new approach to big-data analysis by harnessing the power of multiple machines. This project makes use of Hadoop, an open-source implementation of MapReduce, to model a scalable natural language report management system using distributed information extraction from large-scale natural language reports (in a specific domain). In this project, knowledge is imparted to the machine in the form of an ontology. Persistent storage of the ontology is done using the open-source graph database Neo4j. The project also uses Natural Language Generation (NLG) techniques for querying and analysing the knowledge base. ANTLR, an open-source parser generator, is used with a domain-specific grammar for rule-based information extraction.
Tools
Hadoop, Antlr, Jena, DOM Parser, SPARQL, Pellet, NaturalOWL, Postgres
Place of Work
Centre for Artificial Intelligence and Robotics(CAIR), DRDO, Bangalore
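The MapReduce idea behind the abstract above can be sketched in a single process. This is plain Python rather than Hadoop, and the extraction rule, report texts, and function names are hypothetical; in the project the rules come from an ANTLR grammar.

```python
import re
from collections import defaultdict

# Hypothetical extraction rule: "<Reporter> reported <event>".
RULE = re.compile(r"(?P<reporter>[A-Z]\w+) reported (?P<event>\w+)")


def map_phase(report):
    """Emit (event, reporter) pairs for every rule match in one report."""
    for match in RULE.finditer(report):
        yield match.group("event"), match.group("reporter")


def reduce_phase(pairs):
    """Group values by key, as the shuffle and reduce steps would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in grouped.items()}


# Made-up reports; on a cluster each one could be mapped on a different node.
reports = [
    "Unit7 reported flooding near the bridge.",
    "Unit2 reported fire. Unit9 reported flooding.",
]
pairs = [pair for report in reports for pair in map_phase(report)]
facts = reduce_phase(pairs)
```

Because each report is mapped independently, the map step is the part that distributes naturally across machines; only the grouping by key needs the results brought together.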
Title
Question Answering in Domain Specific Malayalam Documents
Name of Student Abstract
Pragisha K This work attempts to find answers to Malayalam factual questions using a repository of Malayalam documents. It uses Information Retrieval and Natural Language Processing in Malayalam to extract appropriate responses. The proposed system is designed with three modules. The first, question analysis, identifies the question word(s) and query words, and also generates answer templates. The next module performs text retrieval and answer snippet extraction: an IR module interacts with the document repository to obtain the documents for answer selection, and these documents are analysed to extract answer snippets. The third module identifies the answer using a scoring method. For Natural Language Processing in Malayalam, the system uses a stemmer, a POS tagger, a named entity recognition system and a wordnet as language resources.
Tools
Python, NLTK
Place of Work
Govt. Engineering College, Sreekrishnapuram
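The three modules described above might be sketched as follows. The abstract does not give the scoring method, so the overlap-plus-type-bonus score is an assumption, as are the function names; English text stands in for Malayalam, and the entity types are supplied by hand instead of by a named entity recognizer.

```python
# Hypothetical mapping from question word to expected answer type.
ANSWER_TYPE = {"who": "PERSON", "where": "PLACE", "when": "DATE"}


def analyse_question(question):
    """Split a question into its expected answer type and its query words."""
    words = question.lower().rstrip("?").split()
    question_words = [w for w in words if w in ANSWER_TYPE]
    query_words = [w for w in words if w not in ANSWER_TYPE]
    expected = ANSWER_TYPE[question_words[0]] if question_words else None
    return expected, query_words


def score_snippet(snippet, expected, query_words):
    """snippet: (text, set of entity types found in the text)."""
    text, entity_types = snippet
    tokens = set(text.lower().replace(".", "").split())
    overlap = sum(1 for w in query_words if w in tokens)
    bonus = 2 if expected in entity_types else 0  # assumed weight
    return overlap + bonus


def best_answer(question, snippets):
    """Return the text of the highest-scoring snippet."""
    expected, query_words = analyse_question(question)
    return max(snippets, key=lambda s: score_snippet(s, expected, query_words))[0]


snippets = [
    ("Indulekha was published in 1889.", {"DATE"}),
    ("O. Chandu Menon wrote Indulekha.", {"PERSON"}),
]
answer = best_answer("Who wrote Indulekha?", snippets)
```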
Title
HMM-based Malayalam Text to Speech Synthesis
Name of Student Abstract
Rechitha C R Since speech is one of the most important ways for humans to communicate, there have been a great number of efforts to incorporate speech into human-computer communication environments. The function of a Text-to-Speech (TTS) system is to convert text in some language into its spoken equivalent through a series of modules. This involves the integration of speech technology and language technology. The task of a TTS system is thus a complex one that involves mimicking what human readers do. A TTS synthesis system contains components supporting front-end processing of the input text, language modelling, and speech synthesis using its signal processing module. The proposed work involves the design of a TTS synthesis system for Malayalam.
Tools
Hidden Markov Model Toolkit (HTK), The Festival Speech Synthesis System, Speech Signal Processing Toolkit (SPTK), HMM-based Speech Synthesis System (HTS)
Place of Work
Amrita University, Coimbatore
Title
Interlingua for Malayalam
Name of Student Abstract
Sibi S Automatic translation between human languages ('Machine Translation') is a science fiction staple, and a long-term scientific dream of enormous social, political, and scientific importance. Machine Translation (MT) has many applications in multilingual countries like India. A good MT system can help an individual read and write in any language, even as a novice in that language. It is therefore necessary to implement a system that translates from one language to another. This paper proposes a method for generating an intermediate form for Machine Translation from Malayalam. The intermediate form contains words with the necessary information, such as subject-object roles, gender, person, number, case and tense. The target language can then be easily constructed from the intermediate form.
Tools
SVM, TnT
Place of Work
Virtual Language Resource Centre, IIITM-Kerala
Title Name of Student Abstract
Malayalam WordNet Renuka Babu T Malayalam is one of the 22 official languages of India, spoken by nearly 33 million people. WordNets are being built for about thirteen of these official languages at different institutions. Hindi WordNet, developed at IIT Bombay, is the first WordNet developed for an Indian language. Malayalam is a morphologically very rich, free word order Indian language for which very little computational work has been reported. This paper is one of the efforts towards building a Malayalam WordNet. Malayalam WordNet is a database of Malayalam word forms (words and collocations) that are grouped together in the form of synsets. The synsets are connected to other synsets via a number of lexical and semantic relations such as hypernymy and hyponymy (the is-a relation), meronymy and holonymy (the part-of relation), antonymy, etc. The lexical relationships hold between semantically related word forms, and the semantic relationships hold between related word definitions.
Tools
Python, NLTK
Place of Work
Govt. Engineering College, Sreekrishnapuram
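The synset structure described above can be pictured with a small sketch. The class layout and helper are assumptions rather than the project's schema, and English placeholder words stand in for Malayalam word forms.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    name: str
    words: list
    hypernyms: list = field(default_factory=list)  # more general synsets (is-a)
    holonyms: list = field(default_factory=list)   # wholes this synset is part of


def hypernym_chain(synset):
    """Follow the first hypernym link upward until a root synset is reached."""
    chain = [synset.name]
    while synset.hypernyms:
        synset = synset.hypernyms[0]
        chain.append(synset.name)
    return chain


# Toy hierarchy for illustration only.
organism = Synset("organism", ["organism"])
plant = Synset("plant", ["plant", "flora"], hypernyms=[organism])
tree = Synset("tree", ["tree"], hypernyms=[plant])
leaf = Synset("leaf", ["leaf"], holonyms=[tree])
```

Here `hypernym_chain(tree)` walks the is-a links from `tree` up to `organism`, while `leaf.holonyms` records the part-of link to `tree`.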
Title
Morphological Analyzer for Malayalam
Name of Student Abstract
Rinju O R Enabling computers to understand human language is one of the major challenges in the field of computing. A morphological analyzer is a very important part of many NLP applications. An NLP system starts by analyzing its input, so if the morphological analyzer does not have reasonably good accuracy, the accuracy of the whole system is affected. This paper proposes a morphological analyzer for Malayalam, as part of ongoing research on various NLP applications in Malayalam. The morphological analyzer takes a word as input and returns its morphemes along with grammatical information, depending on the word category. For nouns, the tool provides gender, number and case information. For verbs, it provides tense and aspect. A Malayalam morphological analyzer would help in automatic spelling and grammar checking, natural language understanding, machine translation, speech recognition, speech synthesis, part-of-speech tagging and parsing applications.
Tools
Python 3.3
Place of Work
Virtual Language Resource Centre, IIITM-Kerala
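For nouns, the analysis described above could look roughly like the suffix-stripping sketch below. The romanized suffix table is a hypothetical toy and the function name is an assumption; a real analyzer would work on Malayalam script with sandhi rules and a lexicon, and would also report gender.

```python
# Hypothetical romanized suffix table, for illustration only.
CASE_SUFFIXES = {"il": "locative", "kku": "dative"}
PLURAL_SUFFIX = "kal"


def analyse_noun(word):
    """Strip one case suffix, then the plural suffix, and report the features."""
    features = {"number": "singular", "case": "nominative"}
    # Try longer suffixes first so a short suffix cannot shadow a longer one.
    for suffix in sorted(CASE_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[: -len(suffix)]
            features["case"] = CASE_SUFFIXES[suffix]
            break
    if word.endswith(PLURAL_SUFFIX) and len(word) > len(PLURAL_SUFFIX):
        word = word[: -len(PLURAL_SUFFIX)]
        features["number"] = "plural"
    return word, features


root, features = analyse_noun("kuttikalil")
```

The romanized input `kuttikalil` is split into the root `kutti`, the plural marker and the locative case marker.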
Title
Semantic Framework for Natural Language Report Management using Distributed Information Extraction and Scalable Ontology Processing
Name of Student Abstract
Robert Jesuraj The project aims to develop an intelligent report management system using distributed ontology processing. The system analyzes the documents (corpus) in a specific domain and builds an ontology, which is later used to query the system. The system will have an option to generate an intelligent report based on the query. Hadoop, a free, open-source framework for the Map/Reduce paradigm, will help the system process data faster. At present, computer science is focusing on how to process big data (the big data problem) and on ways to make computers speak and understand human languages. To solve problems of this kind, knowledge has to be imparted to the system, and some level of Artificial Intelligence (AI) is needed. The system is said to be intelligent if the output it produces is similar to what a human would produce. To build such a system, an ontology is required for the particular domain, so the creation of domain-specific ontologies is an important research area in AI and computer science. Even if such a system is developed, the solution will be generated very slowly, as text processing by computers is slow. To generate results at a faster rate, the processing has to be distributed across different nodes. Jena is used to build the ontology, and the Pellet reasoner is used to draw inferences.
Tools
ANTLR, Hadoop, Neo4j, SWOOP, Protégé, Jena, Pellet, PostgreSQL
Place of Work
Centre for Artificial Intelligence and Robotics(CAIR), DRDO, Bangalore
Title
Information Extraction from Advertisements using Classification Approach
Name of Student Abstract
Saani H Advertisements on websites such as Craigslist are largely unstructured text, even though individuals would naturally want to perform structured searches over certain attributes of interest for purposes such as buying a car or a book, or searching for a job. My aim is to build a system that performs information extraction over unstructured job advertisements using various natural language processing techniques, including machine learning, Bayesian classification and named entity recognition, along with rule-based approaches. The information extracted from these advertisements can be used to search over the attributes of interest.
Tools Place of Work
Govt. Engineering College, Sreekrishnapuram
Title
HMM Based Malayalam Speech Recognition
Name of Student Abstract
Sumi S Nair Speech is the primary means of communication between people. People are comfortable with speech and also wish to interact with computers via speech. The goal of a speech recognition system is to translate acoustic signals into a sequence of words. The recognizer makes use of phone-based continuous density Hidden Markov Models (HMM) for acoustic modelling and n-gram statistics estimated on text material. To deal with phonological variability, alternate pronunciations are included in the lexicon. The speech recognition system is given an acoustic observation O, and the goal is to find the corresponding word sequence W that has the maximum posterior probability P(W | O). The proposed work is to design an Automatic Speech Recognition system for Malayalam.
Tools
HTK, Audacity, Praat, wavesurfer
Place of Work
Amrita University, Coimbatore
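The decoding rule in the abstract above follows from Bayes' rule: since P(O) is the same for every candidate, maximising P(W | O) is equivalent to maximising P(O | W) P(W), the product of the acoustic-model and language-model scores. A minimal sketch with made-up probabilities and a hypothetical `decode` helper:

```python
import math


def decode(candidates):
    """candidates: word sequence -> (P(O|W), P(W)). Returns the argmax of their product."""
    def log_score(item):
        _, (acoustic, language) = item
        # Log probabilities are summed to avoid underflow on long utterances.
        return math.log(acoustic) + math.log(language)
    return max(candidates.items(), key=log_score)[0]


# Made-up scores: the second candidate fits the audio slightly better,
# but the language model considers it far less likely.
candidates = {
    "recognise speech": (0.020, 0.30),
    "wreck a nice beach": (0.025, 0.01),
}
best = decode(candidates)
```

In a real recognizer the acoustic score comes from the HMMs and the language score from the n-gram model, and the search runs over a lattice rather than an explicit list.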
Title
Extraction of Semantic Relation from Medical Records
Name of Student Abstract
Ayisha Noori V K Biomedical natural language processing deals with the application of text mining techniques to clinical documents and to scientific publications in the areas of biology and medicine. A crucial area of Natural Language Processing is semantic analysis, the study of the meaning of linguistic utterances. Natural language processing of biomedical text benefits from the ability to recognize broad semantic classes in different clinical notes. This thesis proposes a method that extracts semantics from patients' medical records using statistical machine learning techniques. In particular, it is concerned with identifying relationships between different diseases and listing the medical tests (ECG, CT scan, etc.) required for a patient. For example, if a patient has pneumonia, the method is intended to identify other diseases the patient may develop and list the tests the patient has to undergo.
Tools
Python, NLTK
Place of Work
Government Engineering College, Sreekrishnapuram
Title Name of Student Abstract
Conceptual Indexing and Semantic Searching for Malayalam documents Radhika K. T Conceptual search, i.e., search based on meaning rather than just character strings, has been the motivation for a large body of research in the IR field. This work proposes a system for indexing Malayalam documents using index terms based on concept information, not merely word strings. The system then searches for documents based on those conceptual index terms and finally produces two sets of ranked files. One set contains the documents exactly relevant to the query concept, and the second set contains documents related to the query concept.
Tools
Python3
Place of Work
Govt. Engineering College, Sreekrishnapuram
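One way to picture the concept-based index described above is the sketch below. The word-to-concept lexicon, the function names, and the way "exact" and "related" results are separated are all assumptions; English stands in for Malayalam, and ranking within each set is omitted.

```python
from collections import defaultdict

# Hypothetical word-to-concept lexicon.
CONCEPT_OF = {
    "car": "vehicle", "bus": "vehicle", "automobile": "vehicle",
    "doctor": "medicine", "hospital": "medicine",
}


def build_index(docs):
    """Map each concept to the set of documents that mention any of its words."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            concept = CONCEPT_OF.get(word)
            if concept:
                index[concept].add(doc_id)
    return index


def search(index, docs, query_word):
    """Return (exact, related): documents containing the query word itself,
    and documents that only share its concept."""
    query = query_word.lower()
    concept = CONCEPT_OF.get(query)
    exact, related = [], []
    for doc_id in sorted(index.get(concept, ())):
        target = exact if query in docs[doc_id].lower().split() else related
        target.append(doc_id)
    return exact, related


docs = {
    "d1": "the car broke down",
    "d2": "a bus arrived late",
    "d3": "the doctor is in",
}
index = build_index(docs)
exact, related = search(index, docs, "car")
```

A query for `car` finds `d1` through the word itself and `d2` only through the shared `vehicle` concept, which a plain string index would miss.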
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR Sep 2013 We invite thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in September 2013. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th Sep, 2013 through the email simplequest.in@gmail.com. For more details visit: http://simplegec.blogspot.in
Editor, CLEAR Magazine
Representative, SIMPLE Groups
Hello World, As we bring out the fourth issue of CLEAR, the recent recognition of Malayalam as a Classical language makes this edition more special. Languages are not only a medium of communication, but also a strong idol of our culture and tradition. The knowledge encoded in each language is incredible and invaluable.
Of course, fast-growing technology and globalization have dismantled all such idols, and the civilizations that evolved with local languages are under threat. "Natural Selection" is an evergreen truth. By standing away from technology, no language can survive this digital era. To keep pace with technology, it is the combined responsibility of technocrats and linguists to develop computational resources for their own languages.
We expect that the long-awaited recognition for Malayalam will benefit Malayalam computing and related projects. SIMPLE Groups extends a hand to language enthusiasts for their volunteer work.
Wish you all the best.
Manu Madhavan.