Approximate/Fuzzy String Matching using Mutation Probability Matrices

Authors:
Dr. Achuthsankar S Nair, Hon. Director, Centre for Bioinformatics, University of Kerala
Sajilal Divakaran, FTMS School of Computing, Kuala Lumpur

We consider the approximate/fuzzy string matching problem in the Malayalam language and propose a log-odds scoring matrix for score-based alignment. We report a pilot study designed and conducted to collect statistics about what we have termed "accepted mutation probabilities" of characters in Malayalam, as they naturally occur. Based on the statistics, we show how a scoring matrix can be produced for Malayalam which can be used effectively in numeric scoring for approximate/fuzzy string matching. Such a scoring matrix would enable search engines to widen the search operation in Malayalam. This being a unique and first attempt, we point out a large number of areas in which further research and consequent improvement are required. We limit ourselves to a chosen set of consonant characters, and the matrix we report is a prototype for further improvement.

Linguistic computing issues in non-English languages are generally addressed with less depth and breadth, especially for languages which have a small user base. Malayalam, one such language, is one of the four major Dravidian languages, with a rich literary tradition. The native language of the South Indian state of Kerala and of the Lakshadweep Islands off the west coast of India, Malayalam is spoken by 4% of India's population. While Malayalam is integrated fairly well with computers, its user base may not generate huge market interest, and so such fine issues of language computing for Malayalam remain unaddressed and unattended.

If we were to search Google for information on the senior author of this paper, Achuthsankar, and gave the query as Achutsankar or Achudhsankar, in both cases Google would land us correctly on the official web page of the author. This "Did you mean" feature of Google is managed by the Google-diff-match-patch [4]. The match part of the algorithm uses a technique known as approximate string matching or fuzzy pattern matching [10]. The close/fuzzy match to any query received by the search engine is routine and obvious to the English language user. However, when a non-English language such as Malayalam is used to query Google, the same facility is not seen in action. When the word Pathinaayiram (the Malayalam word for the number ten thousand) is used as a query in Google Malayalam search, we are directed to documents that contain a similar word (Payinaayiaram, a common mispronunciation of the original word) but not the word itself. This is because approximate/fuzzy string matching has not been addressed in Malayalam. In this paper we make preliminary attempts toward addressing this very special issue of approximate/fuzzy string matching in Malayalam.

Approximate/Fuzzy String Matching
The field described as approximate or fuzzy string matching in computer science has been firmly established since the 1980s. Patrick & Geoff [5] define the approximate string matching problem as follows: Given a string s drawn from some set S of possible strings (the set of all strings composed of symbols drawn from some alphabet A), find a string t which approximately matches this string, where t is in a subset T of S. The task is either to find all those strings in T that are "sufficiently like" s, or the N strings in T that are "most like" s. One of the important requirements for analyzing similarity is to have a scientifically derived measure of similarity. The soundex system of Odell and Russell [13] is perhaps one of the earliest attempts to use such a measure. It uses a soundex code of one letter and three digits. These have been used successfully in hospital databases and airline reservation systems [8]. The Damerau-Levenshtein metric [2] proposed a measure: the smallest number of operations (insertions, deletions, substitutions, or reversals) needed to change one string into another. This metric can be used with standard optimization techniques [14] to derive the optimal score for each string match and thereby choose matches in the order of closeness.

Approximate or fuzzy string matching is in vogue not only in natural languages but also in artificial languages. In fact, approximate string matching has been developed into a fine art in computational sciences such as bioinformatics. Bioinformatics deals mainly with bio sequences derived from DNA, RNA, and amino acid sequences [9]. Dynamic programming algorithms (the Needleman-Wunsch and Smith-Waterman algorithms) [11], which enable fast approximate string matching using carefully crafted scoring matrices, are in great use in bioinformatics. The equivalent of Google for the modern biologist is the basic local alignment search tool (BLAST) [1], which uses scoring matrices such as point accepted mutation matrices (PAM) [3] and the BLOcks of Amino Acid SUbstitution Matrix (BLOSUM) [6]. To the best of the knowledge of the authors, such a scoring system is not in existence for any natural language including English. Recently an attempt has been made in this direction for the English language [7]. The statistics for accepted mutation in English were cleverly derived from already designed Google searches. In the case of Malayalam, statistics of character mutations are not easily derivable from any corpus, existing search engine, or other language computing tool. Hence, data for this needs to be generated before a scoring matrix system can be developed. We will now describe the generation of primary data on natural mutation in Malayalam.

Occurrence and Mutation Probabilities
Malayalam has a set of 51 characters, and basic statistics of its occurrence and mutation are required for developing a scoring matrix. The occurrence probabilities are available, derived from a corpus of considerable size in 1971 and again in 2003 [12]. We describe here only a subset of characters in view of economy of space. In Table 1, we give the probabilities of one set of consonants, which we have extracted from a small test corpus of Malayalam text derived from periodicals.

We then designed and conducted a study to extract the character mutation probabilities. We selected 150 words that cover all the chosen consonant characters. A dictation was administered among a small group of school children (N=30). The observed mistakes (natural mutations) are tabulated in Table 2 as probabilities. It is noted that the sample size of N=30 is inadequate for a linguistic study of this kind. However, as already highlighted, this paper reports a pilot study to demonstrate proof of the concept. Moreover, the sample size can be made larger once the research community vets the approach put forward by us.

Log-odds Scoring Matrix
It is possible to use Table 2 itself for scoring string matches. However, it might be unwieldy in practice. For long strings we will need to multiply probabilities, which might result in numeric underflow. Hence, we will use a logarithmic transformation. Another transformation that we will use is to convert from probability to odds. The odds can be defined as the ratio of the probability of occurrence of an event to the probability that it does not occur. If the probability of an event is p, then the odds is p/(1-p). We will, however, not use this formula directly, but define the odds score for any given match i-j as:

$$S_{ij} = 10 \log \frac{p_{ij}}{p_j}$$

In the above equation, $p_{ij}$ is the probability that character i mutates to character j and $p_j$ is the probability of natural occurrence of character j. Thus the negative score for a mutation of a less frequently occurring character will be more in this scheme. The multiplier 10 is used just to bring the scores to a convenient range. Table 3 shows the log-odds scores thus derived using the occurrence probabilities and mutation probabilities given in Tables 1 and 2. These can be used to score approximate matches and select the most similar one.

Results, Discussions, and Conclusion
The prototype scoring matrix we have designed above can be demonstrated to be capable of scoring approximate matches and can therefore be a means of selecting the closest match. We will demonstrate this with an example of scoring four approximate matches for the word k. Table 4 lists the scores for the four different matches; the exact match scores best. The next best match as per the new scoring scheme is കക.

Our demonstration has been on a chosen set of consonant characters, but it can be expanded to cover all Malayalam characters. For demonstrating more general words, a scoring matrix for vowels is essential. We have computed the same and will be reporting it in a forthcoming publication. During our studies, we also noticed that the grouping of characters as done conventionally may not suit our studies. For example, we found that a character can, very rarely, be a possible mutation for another even though the two are not grouped together conventionally. A regrouping based on natural mutations is a work we see as requiring attention. To the best of our knowledge, our work is a unique proposition for the Malayalam language, which can be incorporated into Malayalam search engines. We would like to reiterate that our work is at the prototype stage. Neither the size of the corpus nor the number of subjects in the survey is substantial. The authors hope to expand the work with a sizable database from which statistics are extracted, so that the scoring matrix can be made more reliable. We also propose to validate the scoring approach with sample trials involving language experts.

References
[1] Altschul, S F, et al. (1990). "Basic local alignment search tool", Journal of Molecular Biology, 215(3), 403-410.
[2] Damerau, F J (1964). "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7(3), 171-176.
[3] Dayhoff, M O, et al. (1978). "A model of Evolutionary Change in Proteins", Atlas of Protein Sequence and Structure, 5(3), 345-358.
[4] Google-diff-match-patch, [Online]. Available: http://code.google.com/p/google-diff-match-patch/, Accessed on 20 Jan. 2012.
[5] Hall, P A V and Dowling, G R (1980). "Approximate String Matching", ACM Computing Surveys, 12(4), 381-402.
[6] Henikoff, S and Henikoff, J G (1992). "Amino Acid Substitution Matrices from Protein Blocks", Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
[7] Kanitha, D (2011). "A scoring matrix for English", MPhil Dissertation in Computational Linguistics, Dept. of Linguistics, University of Kerala.
[8] Leon, D (1962). "Retrieval of misspelled names in an airlines passenger record system", Communications of the ACM, 5, 169-171.
[9] Nair, A S (2007). "Computational Biology & Bioinformatics: A Gentle Overview", Communications of the Computer Society of India, 31(1), 1-13.
[10] Navarro, G (2001). "A Guided Tour to Approximate String Matching", ACM Computing Surveys, 33(1), 31-88.
[11] Needleman, S B and Wunsch, C D (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, 48(3), 443-453.
[12] Prema, S (2004). "Report of Study on Malayalam Frequency Count", Dept. of Linguistics, University of Kerala.
[13] Soundex, [Online]. Available: http://en.wikipedia.org/wiki/Soundex, Accessed on 2 Dec. 2011.
[14] Wagner, R A and Fischer, M J (1974). "The String-to-String Correction Problem", Journal of the ACM, 21(1), 168-178.

This article was published in CSI MAY 2012 and reused here with author's permission.

CLEAR Sep.2012
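As a rough illustration of the log-odds scoring described in this article, the sketch below builds a score matrix from occurrence and mutation probabilities and ranks candidate matches by the sum of their per-character scores. Everything in it is an assumption made for illustration: the three placeholder characters and all probability values are invented (they are not the values of Tables 1 and 2), the logarithm is taken to base 10, the denominator is the occurrence probability of the resulting character j, and the function names are hypothetical. It only handles strings of equal length.

```python
import math

# Hypothetical occurrence probabilities p_j for three placeholder characters.
# Illustrative numbers only; not the values of Table 1.
occurrence = {"a": 0.5, "b": 0.3, "c": 0.2}

# Hypothetical mutation probabilities p_ij: probability that character i
# is written as character j (each row sums to 1). Not the values of Table 2.
mutation = {
    "a": {"a": 0.90, "b": 0.08, "c": 0.02},
    "b": {"a": 0.10, "b": 0.85, "c": 0.05},
    "c": {"a": 0.05, "b": 0.15, "c": 0.80},
}


def log_odds_matrix(mutation, occurrence, multiplier=10.0):
    """Return S[i][j] = multiplier * log10(p_ij / p_j)."""
    return {
        i: {j: multiplier * math.log10(p_ij / occurrence[j])
            for j, p_ij in row.items()}
        for i, row in mutation.items()
    }


def score(query, candidate, matrix):
    """Sum the per-position scores of two equal-length strings.

    Adding log scores replaces multiplying probabilities, which is what
    avoids numeric underflow on long strings.
    """
    if len(query) != len(candidate):
        raise ValueError("this sketch only scores equal-length strings")
    return sum(matrix[q][c] for q, c in zip(query, candidate))


def rank(query, candidates, matrix):
    """Order candidates from best to worst score against the query."""
    return sorted(candidates, key=lambda c: score(query, c, matrix), reverse=True)


S = log_odds_matrix(mutation, occurrence)
ranking = rank("abc", ["abb", "cbc", "abc"], S)
```

With these made-up numbers the exact match ranks first, which mirrors the behaviour the article reports for its own example.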
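The article mentions that dynamic programming algorithms such as Needleman-Wunsch use scoring matrices for approximate matching. The sketch below shows how a substitution score could be combined with a gap penalty to score strings of different lengths. The gap penalty, the toy substitution function and the function names are assumptions of this sketch; the article does not define a gap penalty, and a real system would plug in log-odds scores in place of `toy_sub`.

```python
def global_alignment_score(s, t, sub, gap=-5.0):
    """Needleman-Wunsch score of the best global alignment of s and t.

    sub(a, b) returns the substitution score for aligning a with b;
    gap is an assumed linear penalty for each insertion or deletion.
    """
    rows, cols = len(s) + 1, len(t) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # match or substitution
                dp[i - 1][j] + gap,                          # character of s against a gap
                dp[i][j - 1] + gap,                          # character of t against a gap
            )
    return dp[-1][-1]


def toy_sub(a, b):
    """Placeholder substitution score: reward identity, penalize mismatch."""
    return 2.0 if a == b else -1.0
```

For instance, `global_alignment_score("abc", "ac", toy_sub)` aligns `b` against a gap, so a dropped character is penalized once instead of shifting every later character out of position.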
INDIAN SEMANTICS AND NATURAL LANGUAGE PROCESSING

Author:
M. Jathavedan, Emeritus Professor, Department of Computer Applications, CUSAT, Cochin
mjvedan@cusat.ac.in

The history of modern linguistics is chronologically divided into two as BC (Before Chomsky) and AD (After Dissertation). Here dissertation means the thesis which Chomsky submitted to the University of Pennsylvania for his doctorate degree. His ideas are considered epoch-making, comparable to Darwin's theory of evolution, and like Darwin's they took time to gain recognition. Therefore Chomsky himself published it as 'Syntactic Structures'.

Paninian grammar was introduced to modern linguistics as a forerunner of Chomsky's generative grammar introduced in the above book. 'Many linguists, foreign and Indian, joined the bandwagon and posed as experts in Paninian grammar in Chomskian terms' (Joshy S.D.). The renewed interest influenced the interpretation of Paninian grammar itself as generative grammar: the idea that grammar consists of modules in a hierarchy of levels. The first contribution in this direction was due to Kiparsky and Staal (1969), who proposed a hierarchy of four levels of representation. This was criticized by Hauben (2002) as they did not permit semantic factors. Other important contributions are due to Cardona (1976).

Joshy continues: 'Somewhat later Chomsky had drastically reversed his ideas and after the enthusiasm for Chomsky subsided, it became clear that the idea of transformation is alien to Panini. Now a new type of linguistics has come up, called Sanskrit Computational Linguistics with three capital letters. Although Chomsky is out, Panini is still there ready to be acclaimed as the forerunner of SCL.' But SCL was identified as a branch of study only in 2007, and there were other factors that led to its formation.

In a paper entitled 'Knowledge representation in Sanskrit and Artificial Intelligence', a NASA scientist, Rick Briggs, drew the attention of computer scientists to the works on semantics in Sanskrit literature instead of Paninium. The important fact to note is that he was referring to the 'Vaiyakarana Siddhanta Laghu Manjusha' of Bhatta-Nagesa (1730-1810), perhaps the last Sanskrit scholar in the Indian tradition. This paper, rightly or wrongly, aroused great enthusiasm among Sanskrit scholars. Some of them even went to the extent of claiming that the future direction of research in artificial language would be decided by Sanskrit. The immediate result was the 'First Seminar on Knowledge Representation and Samskritam' (1986) held at Bangalore, in which Briggs presented a paper entitled 'Sastric Sanskrit: An Inter-lingua for Machine Translation'.

Thus computational Sanskrit emerged as a new branch of research. Apart from computer assisted teaching and research of Sanskrit (like any other subject), automated reconstruction of Sanskrit texts and machine aided translation (MAT), designing a working system of the Paninian grammatical framework for machine translation, especially for Indian languages, and its possible applications in cognitive science and AI are some areas of active research in Sanskrit departments of many universities and computer science departments of many institutes.

It is a surprising fact that we are not able to locate any further contribution of Briggs in this field. Further, comments are pouring in on the internet for and against the arguments put forward by Briggs. Another point to be noted is that the authority behind the paper is Briggs in person and not NASA, as misconceived by many.
A question that naturally arose was the role of Sanskrit as a dedicated programming language, which meant the development of a compiler for the use of Sanskrit instructions. C-DAC, Bangalore had initiated some work in this direction in the early 1990s itself. It was claimed that the Astadhyayi (Paninium) was useful in this matter, i.e., that the meta-rules, meta-language and linguistic marker system of Panini could be used to draw up the specification and requirements of such a processor. To what extent the search has been successful after twenty years is a question.

The International Symposiums on Sanskrit Computational Linguistics (SCL) were the results of the attempt to provide a common platform for traditional Sanskrit scholars and computational linguists. They were a culmination of the World Sanskrit Conferences, especially the thirteenth one held at Edinburgh, and the First National Symposium on Modeling and Shallow Parsing of Indian Languages in Mumbai, both held in the year 2006. The first Symposium was held in France in 2007 and the last one at Jawaharlal Nehru University, New Delhi (2010).

LINGUISTICS AND PHILOSOPHY
Linguistics is considered a part of philosophy in India. It is often said that 'the grammatical method of Panini is as fundamental to the Indian thought as is the geometrical method of Euclid for the western thought.'

Semantics in Sanskrit was never a well-defined domain of a separate discipline (Hauben, ). Rather, it remained the battlefield for exegetes, logicians and grammarians with various backgrounds and philosophical commitments. It was only a few centuries after Bhartrhari (4th century A.D.) that a sophisticated specialized language and terminology were developed for discussing semantic problems and theories of verbal understanding. Thus during the period from the thirteenth to the sixteenth centuries, semantic issues were seriously taken up for discussion between different philosophical schools, focussing not only on language but also on a religious point of view.

The formal categories in their discussions were mainly those established in Paninium and investigated semantically and philosophically by Bhartrhari. We will consider two or three of them.

As an example we consider the sentence:

'Rama cooks rice'

In the subdivision of a sentence into words, the grammarians take the verb as important. Other words are related to this meaning-bearing word in one way or other. Kriya is the action of the verb in the sentence. The other words, which are "factors in the action" of the verb, are called karakas. Panini has defined six karakas.

For the sentence in our example the grammarians may give the following analytical description:

It is the activity of cooking, taking place in the present time, having an agent which is identical with Rama, having an object identical with rice.

Thus the sentence is split into elements such as stem, root, affix and ending, and a well-defined meaning is attributed to each linguistic element. The central element in this analysis is the meaning expressed by the verb 'cooks', or to be more precise, the meaning of the verb root 'to cook' (pac). The verbal form (in Sanskrit the verbal ending ti in pa(ca)ti) indicates that the activity takes place in the present time. The agent of the action is expressed by the grammatical subject, Rama; the object of the action is the grammatical object, rice.
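One way to picture the grammarians' analysis of 'Rama cooks rice' is as a small record centred on the verb, with the agent and object attached to it. The sketch below is only an illustration of that idea: the class name, field names and the `describe` method are hypothetical, and it models just the two karakas named in the example, not Panini's full set of six.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbCentredAnalysis:
    """Verb-centred description of a sentence (illustrative only)."""
    root: str      # verbal root, e.g. "pac"
    activity: str  # activity named by the root, e.g. "cooking"
    time: str      # time indicated by the verbal ending, e.g. "present"
    agent: str     # karaka expressed by the grammatical subject
    obj: str       # karaka expressed by the grammatical object

    def describe(self) -> str:
        # Render the analysis in the wording of the analytical description.
        return (
            f"It is the activity of {self.activity}, taking place in the "
            f"{self.time} time, having an agent which is identical with "
            f"{self.agent}, having an object identical with {self.obj}."
        )


rama_cooks_rice = VerbCentredAnalysis(
    root="pac", activity="cooking", time="present", agent="Rama", obj="rice"
)
```

Calling `rama_cooks_rice.describe()` reproduces the analytical description quoted above, which shows that the description is a fixed template filled from the verb and its karakas.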
For the Mimamsa thinkers also, the verb is the central element in a sentence. While grammarians take the verbal root and the activity expressed by it as more important than the verbal ending and its meaning, the latter are more important for Mimamsakas. According to them the basic meaning of all verbs is a creative urge which stimulates action. This basic urge is expressed (transmitted to the listener) by the verbal ending, not by the verbal root, which merely qualifies this creative urge. Thus according to them the sentence in our example can be given the following structural description:

"It is the creative urge which is conducive to cooking, taking place in the present time, having the same substratum as the agent residing in Rama, having as object rice."

Now for the Nyaya school, it is not the verb which is the central element in the sentence but, generally, the noun in the first ending (nominative). Thus the structure of the verbal knowledge in our example according to them is:

"It is Rama who possesses the volitional effort conducive to cooking which produces the softening and moistening which is based in rice."

Underlying all these descriptions is the presupposition that the main structural relation in the sentence is that between the qualifier and the thing to be qualified (visesana/visesya). Unlike grammarians and Mimamsakas, for whom the visesya is the verb, for Nyaya thinkers the visesya is the noun in the first ending.

SANSKRIT COMPUTATIONAL LINGUISTICS
I have already quoted S.D. Joshy. The sentences were from his paper 'Background of the Astadhyayi' read at the third International Symposium on Sanskrit Computational Linguistics held in 2009 at Hyderabad. He continues: 'Contrary to some western misconceptions the starting point of Panini's analysis is not meaning or the intention of the speaker, but words from elements. Panini starts from morphology to arrive at a finished word.' But 'he developed a number of theoretical concepts which can be applied to other languages also.' Coming back to Briggs, we note that, in contrast to other works, his paper for the first time drew the attention of computer scientists to the semantic theories available in Sanskrit. Since it is meaning that is important in a sentence, syntax is developed to tackle the semantic problem.

But centuries elapsed after Panini (4th century B.C.) before Bhartrhari (4th century A.D.) developed his sphota theory. Again centuries elapsed before Bhatta Nagesa gave completion to the sphota theory in the eighteenth century. The later development of linguistics can be considered a continuation of this.

There are four factors involved in a proper cognition: expectancy, mutual compatibility, proximity and the intention of the speaker. It is difficult to include the last one in any syntactic solution. According to Bhartrhari, a speaker can seldom communicate through words all that he intended to, and the hearer understands more, or at times less, than what he hears!

Thus there is mutual dependency between the Indian theories of syntax and semantics. It is said that the Indian linguists of the fifth century B.C. knew more of the subject than western linguists of the nineteenth century A.D. Further, if there is any area where the ancient Sanskrit scholars have been much ahead of modern developments, it is in the field of semantics and the systems of knowledge representation.
REFERENCES:
1. Briggs, Rick, 1985, Knowledge representation in Sanskrit and artificial intelligence, The AI magazine.
2. Briggs, Rick, 1986, Shastric Sanskrit: an interlingua for machine translation, First National Conference on Knowledge Representation, Bangalore.
3. Chomsky, N, 1957, Syntactic Structures, The Hague, Mouton.
4. Cardona, George, 1976, Panini: A survey of Research, The Hague, Mouton.
5. Kiparsky, Paul and Staal J.F., 1969, Syntactic and semantic relations in Panini, FL 5.
6. Hauben, E.M, 2002, Semantic in the Sanskrit tradition on the eve of colonialism, Project report, Leiden University.
7. Joshy, S.D., 2009, Background of the Astadhyayi, Third International Symposium on Sanskrit Linguistics, Hyderabad.
Overview of Question Answering System

Authors
K.M. Arivuchelvan, Research Scholar, Periyar Maniammai University.
K. Lakshmi, Professor, Periyar Maniammai University.

Interaction between humans and computers is one of the most important active areas of research in this modern world. In particular, interaction in natural language is becoming more popular. Natural Language Processing is a computational technique for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis to achieve human-like language processing for a wide range of applications.

One of the most powerful applications of NLP is the Question Answering System. The need for automated question answering systems is becoming more urgent due to the enormous growth of digital information in text form. A QA system involves analysis of both questions and answers. In this overview, we focus on Question Type Classification, Question Generation, and Answer Generation for both closed and open domains.

Introduction
Research in Natural Language Processing [1] has been going on for several decades, dating back to the late 1940s. The goal of NLP is to accomplish human-like language processing. The disciplines contributing to the practice of NLP are: Linguistics, which focuses on formal, structural models of language and the discovery of language universals (in fact the field of NLP was originally referred to as Computational Linguistics); Computer Science, which is concerned with developing internal representations of data and efficient processing of these structures; and Cognitive Psychology, which looks at language usage as a window into human cognitive processes, and has the goal of modelling the use of language in a psychologically plausible way.

The most explanatory method for presenting what actually happens within a Natural Language Processing system is by means of the 'levels of language' approach. Phonology concerns how words are related to the sounds that realize them. Morphology concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language. The syntax level concerns how words can be put together to form correct sentences and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. The semantic level concerns what words mean and how these meanings combine in sentences to form sentence meanings. The pragmatic level concerns how sentences are used in different situations and how use affects the interpretation of the sentence. The discourse level concerns how the immediately preceding sentences affect the interpretation of the next sentence. World knowledge includes the general knowledge about the structure of the world that language users must have in order to maintain a conversation.

Natural language processing is used for a wide range of applications. The most frequent applications utilizing NLP include Information Retrieval (IR), Information Extraction (IE), Question-Answering, Summarization, Machine Translation and Dialogue Systems. In this paper we focus on Question-Answering.

Question answering can be performed in two domains: closed and open. Closed-domain question answering [4] deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, closed-domain might refer to a situation where only a limited type of questions is accepted, such as questions asking for descriptive rather than procedural information. Open-domain question answering [4] deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
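The contrast between the two domains can be sketched in a few lines. In this toy example a closed-domain answerer looks a question up in hand-built knowledge, while an open-domain answerer returns the sentence from a document collection that shares the most keywords with the question. The stopword list, the data structures and the function names are all assumptions of this sketch, and real systems use far richer analysis than keyword overlap.

```python
import re

# Small, assumed list of function words to ignore when comparing texts.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "what", "which",
             "does", "do", "with", "how", "are"}


def keywords(text):
    """Lowercase word set with the function words above removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def answer_closed(question, knowledge):
    """Closed domain: look the question up in hand-built domain knowledge.

    knowledge maps a frozenset of required keywords to an answer string.
    Returns None when the question falls outside the domain.
    """
    q = keywords(question)
    for required, answer in knowledge.items():
        if required <= q:
            return answer
    return None


def answer_open(question, documents):
    """Open domain: return the sentence sharing most keywords with the question."""
    q = keywords(question)
    sentences = [s.strip()
                 for d in documents
                 for s in re.split(r"(?<=[.!?])\s+", d)
                 if s.strip()]
    best = max(sentences, key=lambda s: len(q & keywords(s)), default=None)
    if best is None or not q & keywords(best):
        return None  # nothing in the collection overlaps with the question
    return best


knowledge = {
    frozenset({"morpheme"}): "A morpheme is the primitive unit of meaning in a language.",
}
documents = [
    "Phonology concerns how words are related to the sounds that realize them. "
    "Syntax concerns how words can be put together to form correct sentences.",
]
```

The closed-domain function answers only what its knowledge table anticipates, whereas the open-domain function can respond to any question for which the collection happens to contain an overlapping sentence.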
Question Answering [5] is a specialized form of information retrieval. Given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Open-domain question answering requires question answering systems to be able to answer questions about any conceivable topic. Such systems cannot, therefore, rely on hand-crafted domain-specific knowledge to find and extract the correct answers.

Question Classification
Question Classification [2] is an important task in Question-Answering. The best-known question taxonomy is the one proposed by Graesser and Person (1994), based on their two studies of human tutors' and students' questions during tutoring sessions in a college research methods course and a middle school algebra course. Six trained human judges coded the questions in the transcripts obtained from the tutoring sessions on four dimensions: Question Identification, Degree Specification (e.g. High Degree means questions contain more words that refer to the elements of desired information), Question-content Category, and Question Generation mechanism (the reasons for generating questions include knowledge deficit in the learner's own knowledge base, common ground between dialogue participants, social actions among dialogue participants, and conversation control). They defined the following 18 question categories according to the content of information sought rather than on the interrogative words (i.e. why, how, where, etc.).

1. Interpretation: What does X mean?
2. Causal antecedent: Why/how did X occur?
3. Causal consequence: What next? What if?
4. Goal orientation: Why did an agent do X?
5. Instrumental/procedural: How did an agent do X?
6. Enablement: What enabled X to occur?
7. Expectation: Why didn't X occur?
8. Judgmental: What do you think of X?
9. Assertion
10. Request/Directive
11. Verification: invites a yes or no answer.
12. Disjunctive: Is X, Y, or Z the case?
13. Concept completion: Who? What? When? Where?
14. Example: What is an example of X?
15. Feature specification: What are the properties of X?
16. Quantification: How much? How many?
17. Instrumental/procedural: How did an agent do X?
18. Comparison: How is X similar to Y?

After analyzing 5,117 questions in the research methods sample and 3,174 questions in the algebra sample, they found four frequent question categories: verification, instrumental-procedural, concept completion, and quantification questions.

Question Generation (QG)
For the first time in history [], a person can ask a question on the web and receive answers in a few seconds. Twenty years ago it would take hours or weeks to receive answers to the same questions as a person hunted through documents in a library. In the future, electronic textbooks and information sources will be mainstream and they will be accompanied by sophisticated question asking and answering facilities. Applications of automated QG facilities are endless and far reaching. Below is listed a small sample, some of which are addressed in this report:

1. Suggested good questions that learners might ask while reading documents and other media.
2. Questions that human and computer tutors might ask to promote and assess deeper learning.
3. Suggested questions for patients and caretakers in medicine.
4. Suggested questions that might be asked in legal contexts by litigants or in security contexts by interrogators.
5. Questions automatically generated from information repositories as candidates for Frequently Asked Question (FAQ) facilities.

The time is ripe for a coordinated effort to tackle QG in the field of computational linguistics and to launch a multi-year campaign of shared tasks in Question Generation (QG). We can build on the disciplinary and interdisciplinary work on QG that has been evolving in the fields of education, the social sciences and computer science. The QG system operates directly on the input text, executes implemented QG algorithms, and consults relevant information sources. Very often there are specific goals that constrain the QG system.

Question Answering
Today's question answering [7] is not limited by the type of document or data repository: it can address both traditional databases and more advanced ones that contain text, images, audio and video. Structured and unstructured data collections can be considered as information sources in question answering. Unstructured data allows querying of raw features (for example, words in a body of text), extracting information with clear semantics attached. Related to this distinction between structured and unstructured data, there is a traditional distinction between restricted domain question answering (RDQA) and open domain question answering (ODQA). RDQA systems are designed to answer questions posed by users in a specific domain of competence, and usually rely on manually constructed data or knowledge sources. They often target a category of users who know and use the domain-specific terminology in their query formulation, as, for example, in the medical domain. ODQA focuses on answering questions regardless of the subject domain. Extracting answers from a large corpus of textual documents is a typical example of an ODQA system.

Recently, we have witnessed an approach of question answering involving semi-structured data. These data often comprise text documents in which the structure of the document or certain extracted information is expressed by a markup. Such markups can be attributed manually (e.g., the structure of a document) and/or in an automatic way, e.g., markups for identified person and company names and their relationships in newspaper articles.

Conclusion
Question answering is a complex task needing effective improvements in different research areas, including question generation, question ranking, question classification, information retrieval, natural language processing, database technologies, Semantic Web technologies, human computer interaction, speech processing and computer vision.
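As a rough illustration of the open-domain setting described in the Question Answering section, the sketch below picks the sentence from a small text collection that shares the most content words with a question. The corpus, the stop-word list and the function names are assumptions made for this example only; a real ODQA system would add question classification, retrieval over a large index and answer extraction.

```python
import re

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "what", "who",
              "which", "by", "was", "on", "and", "are", "for"}

def tokenize(text):
    """Lower-case a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def best_sentence(question, documents):
    """Return the sentence sharing the most content words with the question.

    Returns None when no sentence shares any content word.
    """
    q_terms = set(tokenize(question)) - STOP_WORDS
    best, best_score = None, 0
    for doc in documents:
        # Split each document into sentences at ., ! or ? followed by space.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            score = len(q_terms & set(tokenize(sentence)))
            if score > best_score:
                best, best_score = sentence, score
    return best

# Toy unstructured corpus (illustrative only).
corpus = [
    "Malayalam is spoken in Kerala. It is one of the Dravidian languages.",
    "Swoogle is a search engine for the semantic web. It indexes ontologies.",
]

print(best_sentence("Where is Malayalam spoken?", corpus))
```

Word overlap is only the simplest possible ranking signal; it is used here because it works on raw features of unstructured text without any manually built knowledge source.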
REFERENCES
1. Liddy, E. D. In Encyclopaedia of Library and Information Science, 2nd Ed. Marcel Dekker, Inc.
2. Ming Liu, Rafael A. Calvo, “G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support”, Dialogue and Discourse 3(2) (2012) 101–124.
3. Mark Andrew Greenwood, “Open-Domain Question Answering”, September 2005.
4. http://en.wikipedia.org/wiki/Question_answering
5. Andrew Lampert, “A Quick Introduction to Question Answering”, December 2004.
6. Workshop Report, “The Question Generation Shared Task and Evaluation Challenge”, sponsored by the National Science Foundation.
7. Oleksandr Kolomiyets, Marie-Francine Moens, “A survey on question answering technology from an information retrieval perspective”, Information Sciences 181 (2011) 5412–5434.
I-Search... Future of Search Engines

Author
Manu Madhavan
M. Tech Computational Linguistics
Govt. Engg. College, Sreekrishnapuram
mmnamboodiry@gmail.com

In this web age, searching (or, more precisely, surfing the web) has become a casual phrase in day-to-day business. Netizens continuously enrich the web vocabulary with words like “Googling”, which shows how important search engines are in this digital era. A web search engine is designed to search for information on the World Wide Web.

Today's search engines come in two types. Directory-based engines, like Yahoo, are still built manually. What that means is that you decide what your directory categories are going to be (Business, Health, Entertainment) and then you put a person in charge of each category, and that person builds up an index of relevant links. Crawler-based engines, like Google, employ a software program, called a crawler, that goes out and follows links, grabs the relevant information, and brings it back to build your index. Then you have an index engine that allows you to retrieve the information in some order, and an interface that allows you to see it. It's all done automatically.

As the Web continues to grow, however, and to be more and more important for commerce, communication, and research, information-retrieval problems become a more serious handicap. The percentage of Web content that shows up on search engines continues to wane. And as search engines struggle to add more and more content, the information they provide may be increasingly out-of-date.

Recent advances in intelligent search suggest that these limitations can be partially overcome by providing search engines with more intelligence and with the user's underlying knowledge. That is called natural language processing. The engine might also have to understand what the user needs, even when he doesn't say it. And that requires some knowledge of the user. These ideas led to the birth of a new generation of web technologies, popularly known as the Semantic Web.

Semantic Search
A semantic search engine attempts to make sense of search results based on context. It automatically identifies the concepts structuring the texts. For instance, if you search for “election”, a semantic search engine might retrieve documents containing the words “vote”, “campaigning” and “ballot”, even if the word “election” is not found in the source document. Semantic Search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Major search engines like Google and Bing incorporate some elements of Semantic Search. The objective of this article is to discuss the recent advances in the area of Semantic Search.

Google's Knowledge Graph:
Google usually returns the search result for any query based on the text and the content. To put it right, it does not understand the exact meaning of the words. It matches the keywords of the query with those of the sites and returns pages that have a significant authority on those words.

Amit Singhal, Google's senior VP of engineering, said [1]: “The introduction of Knowledge Graph enables Google to understand whether a search for 'Mars' refers to the planet or the confectionery manufacturer. 'Search is a lot about discovery' – the basic human need to learn and broaden your horizons”.
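The difference between keyword matching and the “election” example of semantic search can be sketched in a few lines of Python. The concept map below is hand-written and purely hypothetical; it is not how Google, Bing or any engine named in this article works, and a real system would derive such relations from an ontology or from corpus statistics.

```python
# Hypothetical concept map: each query term maps to words that express
# the same concept. Hand-written here purely for illustration.
CONCEPTS = {
    "election": {"election", "vote", "campaigning", "ballot"},
}

def expand(term):
    """Return the set of words related to a query term (at least the term itself)."""
    term = term.lower()
    return CONCEPTS.get(term, {term})

def keyword_search(term, documents):
    """Plain keyword matching: the query word itself must occur in the document."""
    term = term.lower()
    return [d for d in documents if term in d.lower().split()]

def semantic_search(term, documents):
    """Concept matching: any word related to the query term is enough."""
    related = expand(term)
    return [d for d in documents if related & set(d.lower().split())]

docs = [
    "citizens cast their vote before noon",
    "the ballot boxes were sealed",
    "the match was postponed due to rain",
]

print(keyword_search("election", docs))   # no document contains the word itself
print(semantic_search("election", docs))  # documents about voting are found
```

The keyword search returns nothing because no document contains the literal word, while the expanded search retrieves the two documents that mention “vote” and “ballot”.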
Bing's Semantic Search
Microsoft specifically brands Bing as a "decision engine," and not as a general purpose search engine (even though it provides that functionality as well), in order to differentiate it from Google Search. Bing's search is based on semantic technology from Powerset, which was acquired by Microsoft in 2008. Notable changes include the listing of search suggestions as queries are entered and a list of related searches (called "Explore pane"). Bing features semantic capabilities like presenting more readable captions based on linguistic analysis of content. The concept of semantic entity extraction is leveraged in Bing, providing knowledge on phrases and what they uniquely refer to. [2]

Bing's new product Adaptive Search strives to capitalize on semantic search technology. Adaptive Search will take into consideration your user behaviour, then tailor your Bing results to be most appropriate. So if you've searched for a word and then clicked on a specific site previously, Bing will predict that what you're searching for is likely to fall into the context of that site, and thus it can provide you with results that are more tailored. [5]

Powerset
Powerset is a Microsoft-owned company building a transformative consumer search engine based on natural language processing. Their unique innovations in search are rooted in breakthrough technologies that take advantage of the structure and nuances of natural language. Using these advanced techniques, Powerset is building a large-scale search engine that breaks the confines of keyword search. By making search more natural and intuitive, Powerset is fundamentally changing how we search the web, and delivering higher quality results. [3]

Hakia:
Hakia is a general purpose semantic search engine that searches structured corpora (text) like Wikipedia. For some queries (typically popular queries and queries where there is little ambiguity), Hakia produces resumes. These are portals to all kinds of information on the subject. Every resume has an index of links to the information presented on the page for quick reference. Often, Hakia will propose related queries, which is also great for research. [3]

Cognition
Cognition has a search business based on a semantic map, built over the past 24 years, which the company claims is the most comprehensive and complete map of the English language available today. It is used in support of business analytics, machine translation, document search, context search, and much more. [3]

Swoogle:
Swoogle, the Semantic web search engine, is a research project carried out by the ebiquity research group in the Computer Science and Electrical Engineering Department at the University of Maryland. It's an engine tailored towards finding documents on the semantic web.
Swoogle is capable of searching over 10,000 ontologies and indexes more than 1.3 million web documents. It also computes the importance of a Semantic Web document. The techniques used for indexing are Google-type page ranking and mining the documents for the interrelationships that are the basis of the semantic web. [4]
PyLucene
PyLucene is a GCJ-compiled version of Java Lucene integrated with Python. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python.

Conclusion
NLP is a complex area of research, requiring a solid understanding of grammars (not just grammar), and a good grounding in computational linguistics (in order to apply the techniques to machines, which is not always easy). Understanding the techniques used in NLP allows us to provide the best format and patterns for the search engine. Seeing as NLP seeks to mimic human language understanding, using common sense is a good idea. But before any broader, more sophisticated sort of intelligence can be placed into a machine, we humans will have to get a better grasp on just what intelligence is.
References:
1. http://mashable.com
2. http://semanticweb.com
3. http://thenextweb.com
4. http://web2innovations.com
5. http://blogs.wsj.com
Google synonyms and natural language processing
Google just blogged about synonyms as they relate to searcher intent. They provide several examples of how a concept as simple as a synonym complicates natural language processing. This also brings up some important recommendations for site owners with respect to SEO. Prospective customers type in all kinds of variations on your most obvious keywords (hence the need for keyword research). Often they make use of synonyms, some common, some not. These variations often represent less competitive opportunities for high search engine rankings if you can incorporate those synonyms into your website. In particular:
• Use common variations within your existing copy rather than using the same phrase repeatedly. (This also tends to make long blocks of text more readable.)
• Develop pages that specifically focus on each of the most common and valuable synonyms.
• If there are enough synonyms and industry-specific terms, consider developing a glossary of terms.
• Find opportunities to talk about the synonyms, such as a blog post or article that talks about how synonyms may actually be somewhat different or whose similarity is up for debate (e.g. SEM vs. Search Engine Advertising).
http://www.web1marketing.com
Remolding Professional sectors: the SaaS way..

Author
Dr. Sudheer S Marar
MCA MBA PhD
Associate Professor and HOD, Department of MCA
Nehru College of Engineering and Research Centre

SaaS : Purpose and Functions
The cost and time-to-market benefits of outsourcing business services like payroll, storage space, Customer Relationship Management (CRM) applications, and company websites have been proven for many businesses. The term for these types of outsourced services is most recently known as Software As A Service (SaaS).

Introducing new technology is an expensive undertaking, usually requiring high capital outlays, and can take many months of training, installation and integration before service can be delivered in the network. Outsourcing these services to organizations that are experts in the technology lowers costs, increases uptime, accelerates revenue realization and provides increased flexibility and functionality.

Due to these results, hosting for these critical business functions continues to grow and many companies are looking for similar opportunities in other operational areas.

Effects of Downturn
As stated in the Movius Corporation annual report, the economic downturn has globally forced many companies to reduce spending across the board. This has put companies that are in highly competitive and innovation-driven industries, say telecommunications, in an exigent balancing act. While they need to try to control expenses, if they are not also continuing to introduce the latest applications and services, they will quickly begin to lose their market share.

The ideal situation for a carrier would be to introduce new services almost immediately without risking precious in-hand capital. Under the best possible scenario the carrier could begin generating revenue in a matter of weeks after making the decision to launch a new service. If the service could be introduced without the need to add additional staff, the solution is essentially risk free.

Some applications are an immediate success in the market while others take time or, in the worst case, never get a toehold in a given market. The ideal introduction scenario for a carrier would be that they could try a new service in a particular market without having to make a significant investment, all the while gaining key market data.

Therefore, companies today are faced with the challenges of controlling operating costs, protecting their current investments and having the ability to deploy new equipment and applications quickly. To add to the challenge, many carriers are faced with older application platforms that are limited in capability and potentially approaching end of life. These companies need cost-effective solutions that permit conversion from the legacy networks to IP infrastructure without major changes to network infrastructure or application design.
Enterprise-level applications
As extracted from a lead article of an IDC-SAP initiated paper, “..Professional service firms focus their business management energy on optimizing the utilization of an expert's or a consultant's time. They attempt to develop service offerings or skill sets that clients will find compelling. Ultimately, they focus on properly charging and receiving payment from clients. Larger firms tend to broaden their offerings to ensure a greater wallet share. Meanwhile smaller firms tend toward key-field focusing and deep industry expertise, hoping to foster continuing relationships with a small number of clients.”

In short, all firms balance developing a talent pipeline with maximizing utilization rates. Client satisfaction and trusting relationships drive both repeat business and referrals in most professional services segments.

Therefore, firms seek to ensure deliverables of the highest possible quality and strive to fully meet client expectations throughout the engagement process. Firms increasingly use technology to support all parts of their business: finance and scheduling software are common, knowledge management and data warehouse capability help improve service quality, and client management and engagement management software are increasingly used to monitor and maximize customer satisfaction. The increased use of technology has both aided and hindered professional services firms in improving their key value propositions.

Benefits of SaaS
The cost of a complex business management software implementation is often the starting point for a discussion, and often the point where the discussion meets a quick end. In their research, IDC has identified several areas where SaaS system delivery costs differ from on-premise delivery costs. Primarily, they are the following:
• License fees, both initial and maintenance.
• Hardware costs.
• IT infrastructure costs.
• Test environment maintenance and development costs.
• IT personnel/support costs.
• Security, backups, and disaster recovery.

The Futuristic.
Clearly SaaS applications are maturing. The number of companies that either are using SaaS applications or plan to use SaaS applications in the next year has grown considerably over the past few years, suggesting that the barriers to adoption, either real or perceived, are being overcome. We see a bright future for SaaS across a broad range of application areas and for large and small professional services firms.

SaaS is not without its problems, however. Functionality and security concerns linger, and while these concerns are more a perception than reality, it is important when considering applications from a SaaS vendor that appropriate due diligence be applied to ensure that the functionality meets critical business needs. It is important for any corporate client to choose its SaaS vendor well, as not all are created equal. As this domain is a maturing capability, one should make sure to select a vendor that brings experience, financial stability, and a good reputation for working effectively with professional services companies, thereby assuring the client of business benefits, scalable growth, and business continuity.
Apple's SIRI

Author
Robert Jesuraj K
M. Tech Computational Linguistics
Govt. Engg College Sreekrishnapuram
rajaroberjesuraj@gmail.com

What is Siri?
Siri (Speech Interpretation and Recognition Interface) is an intelligent personal assistant and knowledge navigator which works as an application for Apple's iOS. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

Siri was originally introduced as an iOS application available in the App Store by Siri, Inc. Siri, Inc. was acquired by Apple on April 28, 2010. Siri, Inc. had announced that their software would be available for BlackBerry and for Android-powered phones, but all development efforts for non-Apple platforms were cancelled after the acquisition by Apple. Siri is now an integral part of iOS 5, and available only on the iPhone 4S, launched on October 14, 2011. On November 8, 2011, Apple publicly announced that it had no plans to support Siri on any of its older devices.

Using Siri
The app transcribes spoken text and then takes these commands and routes them to the right web services. If you try to book a table at a Thai restaurant ("get me a table at a good Thai restaurant nearby"), for example, Siri will check where you are, query Yelp for reviews of nearby Thai restaurants, show you the options and then pre-populate a reservation form on OpenTable with your information. All you have to do is to confirm Siri's selection.

The software is surprisingly good at translating voice queries into text. The application works so well because it is able to recognize the context of your queries. This kind of semantic analysis is a very computing-intensive problem, so most of the actual number crunching happens on Siri's servers. Siri outsources the voice recognition to Nuance, and if you are not comfortable with speaking into your phone, you can always use a regular text query as well.

Obviously, Siri won't be able to answer every query, and sadly the app doesn't use Wolfram Alpha to give you answers to factual questions (yet). Should that happen, Siri will just route your query to a search engine and display the search results. As the Siri team told us, however, users tend to learn which queries work best pretty quickly (just like we learned how to structure effective queries for Google).

To use the iPhone app, you just have to say aloud a command like "Book a table for six at 7pm at McDonalds" (I'm sure you're classier than that, but let's stick with it for now), and then, using speech-recognition technology and the iPhone's GPS capabilities, your command is translated and processed by the app, which responds with confirmation of booking, or lack of availability.
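The routing step described above (a transcribed command goes to the matching web service, and anything unrecognised falls back to a search engine) can be sketched as follows. This is a minimal rule-based illustration, not Siri's actual implementation: the keyword patterns and the service names are hypothetical, and the real system relies on much richer semantic analysis performed on its servers.

```python
import re

# Hypothetical routing table: (pattern, service name). The service names
# are placeholders and do not correspond to real APIs.
ROUTES = [
    (re.compile(r"\b(table|reservation|restaurant)\b"), "restaurant_booking"),
    (re.compile(r"\b(movie|ticket|tickets)\b"), "movie_tickets"),
    (re.compile(r"\b(taxi|cab)\b"), "taxi"),
]

def route(transcript):
    """Return the name of the service that should handle a transcribed command."""
    text = transcript.lower()
    for pattern, service in ROUTES:
        if pattern.search(text):
            return service
    # Nothing matched: fall back to a plain web search, as the article describes.
    return "web_search"

for command in [
    "Get me a table at a good Thai restaurant nearby",
    "Book a taxi to the airport",
    "Who wrote Hamlet?",
]:
    print(command, "->", route(command))
```

The first matching rule wins, so the order of the table acts as a crude priority; a production assistant would instead score competing interpretations using the context of the query.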
Siri, which has ties with Stanford Research Institute and DARPA, has collaborated with OpenTable, MovieTickets, StubHub, CitySearch and TaxiMagic to help with bookings and information, which pretty much wipes out the reason why you'd want to download any of those services' apps individually.

Siri is all this and something that could only be held to the definition of true synergy, that is: “Two or more things functioning together to produce a result not independently obtainable”. None of the individual parts are "new" but the combination Siri created has never really been seen before. It has been the Holy Grail of computer researchers to one day create a device that could become conversational and intelligent in such a way that it would appear that the dialog is human generated.

Apple Siri can speak Hindi now
When Siri was announced with the iPhone 4S, everyone thought the device would never understand the Indian accent, let alone be able to speak Hindi. We were however left bewildered when we found a video online where Siri responds to users' queries in Hindi! Siri's support for Hindi comes to us courtesy of Kunal Kaul. The hack connects Siri to Kunal's Google API server and interacts in Hindi.

Another interesting aspect of the video is that the questions are asked in English while the responses given by Siri are in Hindi, and the Devanagari script appears on screen. The fact that the questions are asked in English has led us to believe that Siri does not understand questions asked in Hindi.

DARPA Helps Invent The Internet And Helps Invent Siri
With Siri, Apple is using the results of over 40 years of research funded by DARPA (http://www.darpa.mil/) via SRI International's Artificial Intelligence Center (http://www.ai.sri.com/; Siri Inc. was a spin-off of SRI International) through the Personalized Assistant That Learns Program (PAL, https://pal.sri.com) and the Cognitive Agent that Learns and Organizes Program (CALO). This includes the combined work of research teams from Carnegie Mellon University, the University of Massachusetts, the University of Rochester, the Institute for Human and Machine Cognition, Oregon State University, the University of Southern California, and Stanford University. This technology has come a very long way with dialog and natural language understanding, machine learning, evidential and probabilistic reasoning, ontology and knowledge representation, planning, reasoning and service delegation.

Similar applications for hand-held devices
1) S Voice is an intelligent personal assistant and knowledge navigator which works as an application for Samsung's Android smartphones, similar to Apple Inc.'s Siri on the iPhone. It first appeared on the Samsung Galaxy S III on May 3, 2012. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

2) Assistant is the codename of a rumored upcoming Google application that will integrate voice recognition and a virtual assistant into Android. It is expected to launch in Q4 of 2012. Before March 2, 2012, the project was known as "Google Majel", a name derived from Majel Barrett-Roddenberry, the actress best known as the voice of the Federation Computer from Star Trek.
The software is an evolution of Google's Voice Actions, which is currently available on most Android phones, with natural language processing added. Where Voice Actions required the users to issue specific commands like "send text to…" or "navigate to…", "Assistant" will allow the users to perform actions in their natural language. According to search engineer Mike Cohen, the "Assistant" project has three parts: getting the world's knowledge into a format a computer can understand; creating a personalization layer (experiments like Google +1 and Google+ are Google's way of gathering data on precisely how people interact with content); and building a mobile, voice-centered "Do engine" ('Assistant') that's less about returning search results and more about accomplishing real-life goals.

3) Iris is a personal assistant application for Android. The application uses natural language processing to answer questions based on user voice request. Iris currently supports Call, Text, Contact Lookup, and Web Search actions including playing videos and looking for lyrics, movie reviews, recipes, news, weather, places and others. It was developed in 8 hours by Narayan Babu and his team at Dexetra Software Solutions Private Limited, a Kochi (India) based firm. The name is actually Siri spelled backwards, Siri being the original application for the same use built by Apple Inc.

With the app, an Android user can just "ask" Iris instead of "Google-searching" for information. The developers claim Iris can talk on topics ranging from philosophy, culture, history and science to general conversation. However, Android users need to have "Voice Search" and "TTS library" installed in their phones for Iris to work. Among its features are voice actions including calling, texting, searching on the web, and looking for a contact.

About Whoosh
Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Some of Whoosh's features include:
• Pythonic API.
• Pure-Python. No compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval -- faster than any other pure-Python search solution I know of. See Benchmarks.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure Python spell-checker (as far as I know, the only one).
http://packages.python.org/Whoosh/quickstart.html#a-quick-introduction
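To give a feel for what "fielded indexing and search" means, here is a toy inverted index written with the Python standard library only. It is a sketch of the general idea and deliberately does not use or imitate the Whoosh API; the class name `TinyIndex` and its methods are invented for this example, and there is no scoring, stemming or query language.

```python
import re
from collections import defaultdict

class TinyIndex:
    """A toy fielded inverted index: (field, term) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, **fields):
        """Index every word of every field of one document."""
        self.docs[doc_id] = fields
        for field, text in fields.items():
            for term in re.findall(r"\w+", text.lower()):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose `field` contains every query term."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        # Start from the first term's postings and intersect with the rest.
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result

index = TinyIndex()
index.add(1, title="Future of search engines", body="Semantic search uses context")
index.add(2, title="Apple's Siri", body="Siri routes queries to web services")

print(index.search("body", "semantic search"))  # documents whose body has both terms
print(index.search("title", "siri"))            # documents whose title mentions Siri
```

Keeping the field name in the posting key is what lets the same word be searched in the title without matching documents that only mention it in the body.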
Inviting Articles for CLEAR Dec 2012
We cordially invite thought-provoking articles, interesting dialogues and healthy debates on multi-faceted aspects of Computational Linguistics for the second issue of CLEAR, to be published in December 2012. The topics of the articles would preferably be related to the areas of Natural Language Processing, Computational Linguistics and Information Retrieval. Authors are requested to send their articles in doc/odt format to the Editor before 15th November 2012, by email to simplequest.in@gmail.com.
-Editor
Thanks To Principal, Govt. Engg. College Sreekrishnapuram, Staffs and Students, Dept. of CSE, Govt. Engg. College Sreekrishnapuram, Authors of CLEAR Sep 2012- Dr. Achutsankar, Prof. Jathavedan M, Dr. Sudheer S Marar, Mr. Sajilal D, Dr. Lakshi K, Mr. Arivuchelvan