CLEAR March 2013
Editorial
SIMPLE News & Updates
CLEAR Jun 2013 Invitation
Last word
CLEAR MARCH 2013, Volume-2 Issue-1
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
simplequest.in@gmail.com
Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad
Editors: Manu Madhavan, Robert Jesuraj. K, Athira P M, Sreejith C
Cover page and Layout: Mujeeb Rehman. O
Conceptual Indexing and Compound Word Splitter for better Information Retrieval in Malayalam: "...search engine fails to present them. This issue can be solved by a new indexing technique where concept of the document is chosen for the search instead of mere keywords..."
The Unfinished Symphony: Sanskrit and AI: "...One of the main differences between the Indian approach to language analysis and that of most of the current linguistic theories is that the..."
Towards Efficient search: "...Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, ..."
Google Translate: "...machine-translation service provided by Google Inc. to translate written text from one language into another. Google used a SYSTRAN based translator..."
Ontology Tools: An overview: "...applications designed to assist in the creation or manipulation of ontologies. They often express ontologies in one of many ontology languages. This article discusses some of the popular tools that can be used..."
Greetings! As we bring you the third edition of CLEAR, we have some reasons to cheer. First, the prestigious GARUDA Challenge award (in the students category), instituted by C-DAC to popularize their Grid computing platform, was given to Robert Jesuraj, who is our second year student and an active volunteer of CLEAR. Secondly, this edition's contributors are all our students, who share their understanding and insights in areas related to Computational Linguistics and Information Retrieval. Some have taken up local language computing (Malayalam) seriously, and are pursuing their project work in that direction. Among these works are Conceptual Indexing and Compound Word Splitter for Better Information Retrieval in Malayalam by Radhika, an overview of Ontology Tools by Manu, the relevance of Sanskrit to AI by Athira, and some pointers to efficient search methods by Sreejith. I am happy that our efforts have started bearing fruit, and I place before you this edition of CLEAR on this positive note. With Best Wishes, Dr. P. C. Reghu Raj (Chief Editor)
NEWS & UPDATES TEQIP phase II: GEC Sreekrishnapuram selected for direct central assistance
Technical Education Quality Improvement Programme (TEQIP) is a scheme of the Ministry of Human Resource Development (MHRD) aimed at strengthening the quality of technical education in the country. Govt. Engineering College, Sreekrishnapuram has been selected for direct central assistance under the second phase of the TEQIP scheme. Under the scheme, we have to set up improved lab facilities, introduce new PG and doctoral programs, achieve academic autonomy, and establish centers of excellence in engineering and technology. This significant recognition will give a fillip to postgraduate education and to the research and development activities of the College.
Robert Won First Prize in GARUDA Challenge
Robert Jesuraj of SIMPLE Groups won the first prize in GARUDA Challenge 2012. The competition was conducted by CDAC for GRID enthusiasts. The winner received a certificate, a memento and a cash award of Rs. 50000/-. Congratulations!

Publications
The following papers were published in the 3rd National Conference on Indian Language Computing (NCILC), held on January 19-20, 2013 at CUSAT. The papers were also selected for the CSI digital resource center.
1. Manu Madhavan, Mujeeb Rehman O, P. C. Reghu Raj, "Computing Prosodic Pattern for Malayalam". CSI Digital Resource Center link: http://csidl.org/xmlui/handle/123456789/543
2. Robert Jesuraj, P. C. Reghu Raj, "MBLP approach applied to POS tagging in Malayalam Language". CSI Digital Resource Center link: http://hdl.handle.net/123456789/544
Conceptual Indexing and Compound Word Splitter: For Better Malayalam Information Retrieval
Radhika K T
radhydev@yahoo.com

Today computers are used as entry points to the information highways. But building an efficient Malayalam search engine is a challenging area in the field of Malayalam computing. Even though it is possible to search Malayalam documents nowadays, the user seldom experiences relevant results. A Malayalam search engine displays a list of documents after finding exact matches with the query terms. This approach produces many non-relevant documents which are least expected by the user. Also, many relevant documents are missed when the query terms are not exactly matched with the document terms. These issues can be solved in two ways: one is by adding a compound word splitter at the document side, and another is by applying an alternate indexing technique called Conceptual Indexing.
The working of a traditional Malayalam search engine is based on Keyword Indexing. An obvious option is to scan the text sequentially, but that is time consuming. Another option is to build data structures over the text, called indices, to speed up the search. Here, the first and most important step in building a search engine is the preparation of the index table, and this process is called indexation. The indexation process is conducted by software called a spider or crawler. In order to start the journey, the search engine provides a seed URL to the crawler. The crawler visits each document on the web starting from the seed URL and collects the terms. For each term, it keeps a list that records in which documents the term occurs. Thus an index always maps back from terms to the parts of a document where they occur. When the user enters a query (information need), the search engine goes through this index table and returns the URLs of the documents identified from the table. Otherwise it simply returns "no results have been found for your query".
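The keyword indexing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of any actual Malayalam search engine: the function names `build_index` and `search`, the document identifiers, and the English stand-in text are all hypothetical, and whitespace tokenization replaces whatever tokenizer a real crawler would use.

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: term -> set of document URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs whose text contains every query term exactly."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Toy collection standing in for crawled pages (hypothetical URLs and text).
documents = {
    "doc1": "rain water harvesting methods",
    "doc2": "rainwater harvesting in kerala",
}
index = build_index(documents)

results = search(index, "rain harvesting")
print(results or "no results have been found for your query")
```

In this toy collection the second document is about the same subject but is not returned, because "rainwater" is stored as a single index term and exact matching never sees "rain" inside it.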
Present condition through an illustration
The results for the queries “[…]” and “[…]” are shown in Figure 1 and Figure 2 respectively.

Figure 1: Results for the query “[…]”
Figure 2: Results for the query “[…]”

The results are different even though both queries express the same concept. Also, the relevant document which tells exactly about “[…]” is missed in the first search result. Adding a compound word splitter at the document side will give a better solution for this issue.

Compound Word Splitter
As Malayalam is one of the highly inflected and agglutinative languages in the Dravidian family, index term matching fails in the case of compound words. Splitting is needed for compound words whose morphemes are of different lexical categories. The hybrid sandhi-splitter is a program developed [1] to split compound words into their constituent morphemes. In Malayalam, it is easy to join a noun and another word which starts with a “swaram” (vowel). The second word can be a verb, adverb or adjective. Such word compounding is done mainly for the benefit of easy pronunciation. A compound word generally consists of a noun-noun, noun-adjective, noun-verb, adverb-verb or adjective-noun combination. In some cases all the words of an entire sentence may be combined to form a single one. Examples of such word combinations are: “[…]”, “[…]”.

The proposed Hybrid Sandhi Splitter system has two main modules: one for training the system
to automatically detect compound words and another for splitting them into their constituent morphemes. The hybrid sandhi-splitter is used as a preprocessor for a morphological analyzer. When indexing is done after applying the compound word splitter at the document side, documents having the same concept are represented using the same common index terms.

Conceptual Indexing
Let us discuss the second issue. An online information seeker often fails to find what is wanted because the words used in the request are different from the words used in the relevant material. This issue is illustrated in Figure 3 and Figure 4.

Figure 3: Example for searching the word “[…]”
The top-ranked document is a relevant one and describes the whole concept of […] and also its […]. But when a user wants to know about […], this relevant document is not retrieved.

Figure 4: Result of the query “[…]”
Even though the query is most specific, and the appropriate document is there, the search engine fails to present it. This issue can be solved by a new indexing technique where the concept of the document is chosen for the search instead of mere keywords. This new technique for organizing information, called Conceptual Indexing, combines techniques from both knowledge representation and NLP and adds meaning to index terms. In order to find the concept, the conceptual indexing and retrieval system automatically extracts words and phrases from text and organizes them into a semantic network that integrates syntactic, semantic, and morphological relationships. As a back end, it needs a lexicon containing syntactic, semantic, and morphological information about words, word senses, and phrases.
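The two remedies discussed in this article, splitting compounds at the document side and indexing by concept rather than by keyword, can be sketched together as follows. This is only an illustration under loud assumptions: the greedy dictionary-based `split_compound` is a simple stand-in and not the trained hybrid sandhi-splitter, the flat `LEXICON` mapping morphemes to concept labels is a hypothetical substitute for the semantic network and lexicon described above, and English words stand in for Malayalam.

```python
from collections import defaultdict

# Hypothetical toy lexicon: known morpheme -> concept label it denotes.
LEXICON = {
    "rain": "precipitation",
    "shower": "precipitation",
    "water": "water",
    "harvesting": "collection",
    "collection": "collection",
}

def split_compound(word, lexicon):
    """Greedy longest-prefix split of a word into known morphemes.

    Returns [word] unchanged when the word cannot be fully split.
    """
    parts, rest = [], word
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in lexicon:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return [word]  # no known prefix: keep the original word
    return parts

def concept_terms(text, lexicon):
    """Index terms for a text: split compounds, then map morphemes to concepts."""
    terms = set()
    for word in text.lower().split():
        for part in split_compound(word, lexicon):
            terms.add(lexicon.get(part, part))
    return terms

def build_concept_index(documents, lexicon):
    """Inverted index keyed by concept labels instead of surface words."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in concept_terms(text, lexicon):
            index[term].add(url)
    return index

def concept_search(index, query, lexicon):
    """Return documents that contain every concept named by the query."""
    postings = [index.get(t, set()) for t in concept_terms(query, lexicon)]
    return sorted(set.intersection(*postings)) if postings else []

documents = {
    "doc1": "rain water harvesting methods",
    "doc2": "rainwater collection in kerala",
}
concept_index = build_concept_index(documents, LEXICON)
hits = concept_search(concept_index, "shower harvesting", LEXICON)
print(hits)
```

Neither document contains the words "shower" or, in the case of doc2, "harvesting", yet both are retrieved because query and documents meet at the concept level. A greedy split can fail on words where a shorter first morpheme would have worked; a practical splitter would need backtracking or the learned boundary detection the article describes.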
According to the architecture of the proposed conceptual indexing system, the most general query, like “[…]”, gets documents telling about […], […] etc., and the most specific query, like “[…]”, gives the exact result about […] and does not retrieve documents telling about […].

References
[1] Barzilay R. and Elhadad M., “Using Lexical chains for text summarization”, in Proc. of the ACL Workshop on Intelligent Scalable Text Summarization. (1997) Madrid, Spain, 10-17.
[2] Berry M., Dumais T., Landauer T., and OBrian G., “Using linear algebra for intelligent information retrieval”, SIAM Review. 37 (1995) 573-595.
[3] Dhanya P.M, Jathavedan M, “Text summarization using language understanding: A survey”, in Proceedings of Second National Conference on Indian Language Computing, Dept. of Computer Applications, CUSAT, 2012.
[4] Edmundson H. P., New methods in automatic extracting, Journal of the ACM, 16(2):264-285, April 1969.
[5] Hajime M. and Manabu O., “A comparison of summarization methods based on task-based evaluation”, 2nd International Conference on language resources and evaluation, LREC-2000. (2000) Athens, Greece, 633-639.
[6] Hovy, E.H., Automated Text Summarization. In R. Mitkov (ed), The Oxford Handbook of Computational Linguistics, pp. 583-598. Oxford: Oxford University Press, 2005.
[7] Julian M. Kupiec, J. Pedersen and F. Chen, “A Trainable Document Summarizer”, in Proc. of 18th ACM-SIGIR conference on Research and Development in Information Retrieval. Seattle, USA, July 1995, 68-73.
[8] Luhn H. P., The Automatic Creation of Literature Abstracts, Presented at IRE National Convention, New York, 159-165, 1958.
[9] Pierre-Etienne Genest and Guy Lapalme, “Framework for Abstractive Summarization using Text-to-Text Generation”, RALI-DIRO.
[10] Saravan M., “A probabilistic approach to Multi document Summarization for Generating A Tiled Summary”, International Journal of Computational Intelligence and Applications, 2006.
[11] Tanveer Siddiqui, Tiwary U.S., Natural Language Processing and Information Retrieval. Oxford University Press, 2008.

Congratulations!!!
The author of this article presented this paper, “Conceptual Indexing and Compound word splitting for better Information retrieval in Malayalam”, at a National level workshop at Thiruvananthapuram, jointly organized by Malayalam University and Kerala IT Mission. The CLEAR Team and SIMPLE Groups heartily congratulate Radhika for her achievement.
The Unfinished Symphony: Sanskrit and AI
Athira P. M.
athira69@gmail.com

“The extraordinary thing about Sanskrit is that it offers direct accessibility to anyone to that elevated plane where the two -- mathematics and music, brain and heart, analytical and intuitive, scientific and spiritual -- become one.”

Whitehead's Modes of Thought speaks highly of language: "...The mentality of mankind and the language of mankind created each other. If we like to assume the rise of language as a given fact, then it is not going too far to say that the souls of men are the gift from language to mankind. The account of the sixth day should be written: He gave them speech, and they became souls."

But Whitehead's words are somewhat ambiguous, and may have created in readers as many different responses as there are readers. One may perceive his statement as a noble and inspiring truth. Another may react to the notion that a 'soul' could depend on language. Still another may be completely in the dark about what Whitehead is saying.

A Sacred Language?
Sanskrit is principally known outside India as the sacred language of Hinduism. However, one effect of this sacred status has been the long-term development of linguistic science in India, on a rigorous empirical basis. In fact, the attitude to Sanskrit as sacred has been the solid foundation and justification for its position as the leading focus of Indian studies of language for three millennia. These studies have ranged over the full gamut of the scientific study of language, and have for the most part been preserved up to the present day.

We have greatly under-estimated the real sacred power of language. When the power of language to create and discover life is recognized, language becomes sacred; in ancient times, language was held in this regard. Nowhere was this more so than in ancient India. It is evident that the ancient scientists of language were acutely aware of the function of language as a tool for exploring and understanding life, and their intention to discover truth was so consuming that in the process of using language with greater and greater rigor, they discovered perhaps the most perfect tool for fulfilling such a search that the world has ever known: the Sanskrit language.

Of all the discoveries that have occurred and developed in the course of human history, language is the most significant and probably the most taken for granted. Without language, civilization could obviously not exist. On the other hand, to the degree that language
becomes sophisticated and accurate in describing the subtlety and complexity of human life, we gain power and effectiveness in meeting its challenges.

Panini's Language Theory
It was Panini who formalised Sanskrit's grammar and usage about 2500 years ago. No new 'classes' have needed to be added to it since then. "Panini should be thought of as the forerunner of the modern formal language theory used to specify computer languages," say J J O'Connor and E F Robertson. Their article also quotes: "Sanskrit's potential for scientific use was greatly enhanced as a result of the thorough systemisation of its grammar by Panini. ... On the basis of just under 4000 sutras [rules expressed as aphorisms], he built virtually the whole structure of the Sanskrit language, whose general 'shape' hardly changed for the next two thousand years."

Panini was a Sanskrit grammarian who gave a comprehensive and scientific theory of phonetics, phonology, and morphology. Sanskrit was the classical literary language of the Indians and Panini is considered the founder of the language and literature. It is interesting to note that the word "Sanskrit" means "complete" or "perfect" and it was thought of as the divine language.

A treatise called Astadhyayi (or Astaka) is Panini's major work. It consists of eight chapters, each subdivided into quarter chapters. In this work Panini distinguishes between the language of sacred texts and the usual language of communication. Panini gives formal production rules and definitions to describe Sanskrit grammar. Starting with about 1700 basic elements like nouns, verbs, vowels and consonants, he put them into classes. The construction of sentences, compound nouns etc. is explained as ordered rules operating on underlying structures in a manner similar to modern theory. In many ways Panini's constructions are similar to the way that a mathematical function is defined today.

The Ashtadhyayi is one of the earliest known grammars of Sanskrit, although Pāṇini refers to previous texts like the Unadisutra, Dhatupatha, and Ganapatha. It is the earliest known work on descriptive linguistics, and together with the work of his immediate predecessors (Nirukta, Nighantu, Pratishakhyas) stands at the beginning of the history of linguistics itself. His theory of morphological analysis was more advanced than any equivalent Western theory before the mid 20th century, and his analysis of noun compounds still forms the basis of modern linguistic theories of compounding, which have borrowed Sanskrit terms such as bahuvrihi and dvandva.

Sanskrit - A Scientific Language?
Panini should be thought of as the fore-runner of the modern formal language theory used to specify computer languages. The Backus Normal Form was discovered independently by John Backus in 1959, but Panini's notation is equivalent in its power to that of Backus and has many similar properties. It is remarkable to think that concepts which are fundamental to today's theoretical computer science should have their origin with an Indian genius around 2500 years ago. Sanskrit linguistics is in an excellent position to make immediate use of most modern techniques in language processing, since it is already provided with most of the infrastructural tools which are currently seen as desirable.
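To make the comparison with Backus's notation concrete, here is a minimal sketch of what "production rules operating on classes of elements" look like in code. The grammar below is a toy invented purely for illustration; it is not a rendering of any sutra of the Ashtadhyayi, and the names `GRAMMAR` and `expand` are hypothetical.

```python
from itertools import product

# Each class (non-terminal) maps to a list of alternatives; each alternative
# is a sequence of classes or words. In BNF this would read, for example:
#   <sentence> ::= <agent> <object> <verb>
GRAMMAR = {
    "sentence": [["agent", "object", "verb"]],
    "agent": [["rama"], ["sita"]],
    "object": [["fruit"], ["water"]],
    "verb": [["eats"], ["brings"]],
}

def expand(symbol, grammar):
    """Return every word sequence derivable from symbol.

    Only suitable for non-recursive grammars; a recursive rule would
    never terminate here.
    """
    if symbol not in grammar:  # a terminal word
        return [[symbol]]
    results = []
    for alternative in grammar[symbol]:
        # Expand each part of the alternative, then combine the pieces.
        pieces = [expand(part, grammar) for part in alternative]
        for combo in product(*pieces):
            results.append([word for piece in combo for word in piece])
    return results

sentences = expand("sentence", GRAMMAR)
for words in sentences:
    print(" ".join(words))
```

A handful of rules over a few classes generates every sentence of this tiny language, which is the sense in which a compact rule set can describe "virtually the whole structure" of a language.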
Lakshmi Thathachar's view of Sanskrit's nature may be paraphrased as follows: All modern languages have etymological roots in classical languages. And some say all Indo-European languages are rooted in Sanskrit, but let us not get lost in that debate. Words in Sanskrit are instances of pre-defined classes, a concept that drives object oriented programming [OOP] today. All words have the OOP approach, except that defined classes in Sanskrit are so exhaustive that they cover the material and abstract --indeed cosmic-- experiences known to man. So in Sanskrit the connection is more than etymological.

Every 'philosophy' in Sanskrit is in fact a 'theory of everything'. [The many strands are synthesised in Vedanta --Veda + anta--, which means the 'last word in Vedas']. Thathachar believes it is not a 'language' as we know the term but the only front-end to a huge, interlinked, analogue knowledge base. The current time in human history is ripe, he feels, for India's young techno wizards to turn to researching Mimamsa and developing the ultimate programming language around it; nay, an operating system itself.

“The modern world”, Professor Thathachar declares, “needs Sanskrit,” because Sanskrit is such a systematic and scientific language. Lord Macaulay, the British politician who famously foisted an English-medium education system upon India, thought it a dead language. Now that Panini's grammar is recognised almost as a meta-grammar for the world by those such as the American linguist Noam Chomsky, the professor welcomes Sanskrit's ascendant star in the IT era.

Rick Briggs: Sanskrit and AI
Forbes magazine (July 1987) is widely quoted as saying, "Sanskrit is the most convenient language for computer software programming." Relevant to this is the work of a researcher at the NASA research centre. The following quote is from the article Sanskrit & Artificial Intelligence, which appeared in AI (Artificial Intelligence) magazine in the spring of 1985, written by NASA researcher Rick Briggs:

"In the past twenty years, much time, effort, and money has been expended on designing an unambiguous representation of natural languages to make them accessible to computer processing. These efforts have centred around creating schemata designed to parallel logical relations with relations expressed by the syntax and semantics of natural languages, which are clearly cumbersome and ambiguous in their function as vehicles for the transmission of logical data. Understandably, there is a widespread belief that natural languages are unsuitable for the transmission of many ideas that artificial languages can render with great precision and mathematical rigor."

But this dichotomy, which has served as a premise underlying much work in the areas of linguistics and artificial intelligence, is a false one. There is at least one language, Sanskrit, which for the duration of almost 1000 years was a living spoken language with a considerable literature of its own. Besides works of literary value, there was a long philosophical and grammatical tradition that has continued to exist with undiminished vigor until the present century. Among the accomplishments of the grammarians can be
reckoned a method for paraphrasing Sanskrit in a manner that is identical not only in essence but in form with current work in Artificial Intelligence. This article demonstrates that a natural language can serve as an artificial language also, and that much work in AI has been reinventing a wheel millennia old.

In early AI research it was discovered that in order to clear up the inherent ambiguity of natural languages for computer comprehension, it was necessary to utilize semantic net systems to encode the actual meaning of a sentence. He further comments, "The degree to which a semantic net (or any unambiguous non-syntactic representation) is cumbersome and odd-sounding in a natural language is the degree to which that language is 'natural' and deviates from the precise or 'artificial.' As we shall see, there was a language (Sanskrit) spoken among an ancient scientific community that has a deviation of zero."

One of the main differences between the Indian approach to language analysis and that of most of the current linguistic theories is that the analysis of the sentence was not based on a noun-phrase model with its attending binary parsing technique but instead on a conception that viewed the sentence as springing from the semantic message that the speaker wished to convey. In its origins, sentence description was phrased in terms of a generative model: From a number of primitive syntactic categories (verbal action, agents, object, etc.) the structure of the sentence was derived so that every word of a sentence could be referred back to the syntactic input categories. Secondarily and at a later period in history, the model was reversed to establish a method for analytical descriptions.

In the analysis of the Indian grammarians, every sentence expresses an action that is conveyed both by the verb and by a set of "auxiliaries." The verbal action is represented by the verbal root of the verb form; the "auxiliary activities" by the nominals (nouns, adjectives, indeclinables) and their case endings (one of six).

The meaning of the verb is said to be both vyapara (action, activity, cause), and phala (fruit, result, effect). Syntactically, its meaning is invariably linked with the meaning of the verb "to do". This allows us to rephrase the sentence in terms of the verb "to do" or one of its synonyms, and an object formed from the verbal root which expresses the verbal action as an action noun. This information in Sanskrit is indicated by the fact that there is an agent who is engaged in an act and that the action is taking place in the present time. The next step in the process of isolating the verbal meaning is to rephrase the description in such a way that the agent and number categories appear as qualities of the verbal action.

The Sanskrit language has seven case endings, and six of these are definable representations of specific "auxiliary activities." The seventh,
the genitive, represents a set of auxiliary activities that are not defined by the other six. The auxiliary actions are listed as a group of six: Agent, Object, Instrument, Recipient, Point of Departure, and Locality. They are the semantic correspondents of the syntactic case endings: nominative, accusative, instrumental, dative, ablative and locative, but these are not in exact equivalence since the same syntactic structure can represent different semantic messages. There is a good deal of overlap between the karakas and the case endings, and a few of them are also used for syntactic information.

Word order in Sanskrit has usually no more than stylistic significance, and the Sanskrit theoreticians paid no more than scant attention to it. The language is then very suited to an approach that eliminates syntax and produces basically a list of semantic messages associated with the karakas.

The comparison of the analyses shows that the Sanskrit sentence matches the analysis arrived at through the application of computer processing. That is surprising, because the form of the Sanskrit sentence is radically different from that of the English. Of course, many versions of semantic nets have been proposed, some of which match the Indian system better than others do in terms of specific concepts and structure. The important point is that the same ideas are present in both traditions and that in the case of many proposed semantic net systems it is the Indian analysis which is more specific.

The semantic net analysis resembles the Sanskrit analysis remarkably, but the latter has an interesting flavour. Instead of a change from one location to another, as the semantic net analysis prescribes, the Indian system views the process as a uniting and disuniting of an agent. This process is equivalent to the concept of addition to and deletion from sets. A leaf falling to the ground can be viewed as a leaf disuniting from the set of leaves still attached to the tree followed by a uniting with (addition to) the set of leaves already on the ground. This theory is very useful and necessary to formulate changes or statements of state.

Last word
The main point in which the two lines of thought have converged is that the decomposition of each prose sentence into karaka-representations of action and focal verbal-action, yields the same set of triples as those which result from the decomposition of a semantic net into nodes, arcs, and labels. It is interesting to speculate as to why the Indians found it worthwhile to pursue studies into unambiguous coding of natural language into semantic elements. It is tempting to think of them as computer scientists without the hardware, but a possible explanation is that a search for clear, unambiguous understanding is inherent in the human being. The analysis of language casts doubt on the humanistic distinction between natural and artificial intelligence, and may throw light on how research in AI may finally solve the natural language understanding and machine translation problems.

“Let us not forget that among the great accomplishments of the Indian thinkers was the invention of zero and of the binary number system, a thousand years before the West re-invented them.”

It is by no means the case that these analyses have been exhausted, or that their potential has been exploited to the full. On the contrary, it would seem that detailed analyses of sentences and discourse units had just received a great impetus, […] research as historical and structural linguistics, and lately generative linguistics, has for a long time acted as an impediment to further research along the traditional ways. Lately, however, serious and responsible research into Indian semantics has been resumed, especially at the University of Pune and Central Institute of Indian Languages, Mysore.

Let us end with an evaluation of Panini's contribution by Cardona in 'Panini: a survey of research' (Paris, 1976): "Panini's grammar has been evaluated from various points of view. After all these different evaluations, I think that the grammar merits asserting ... that it is one of the greatest monuments of human intelligence."
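The convergence described in this article, where a sentence decomposed into karaka roles yields the same kind of triples as a semantic net, can be sketched in a few lines. This is only an illustration under stated assumptions: the role names follow the six auxiliary actions listed above, while the `Triple` type, the `to_triples` and `move` functions, and the example sentence are hypothetical and are not taken from Briggs's paper.

```python
from typing import NamedTuple

# The six auxiliary actions (karakas) as listed in the article.
KARAKAS = ("agent", "object", "instrument", "recipient",
           "point_of_departure", "locality")

class Triple(NamedTuple):
    """One arc of a semantic net: action node, arc label, participant node."""
    action: str
    role: str
    participant: str

def to_triples(action, **roles):
    """Decompose a sentence, given as an action plus karaka roles, into triples."""
    unknown = set(roles) - set(KARAKAS)
    if unknown:
        raise ValueError(f"not a karaka role: {sorted(unknown)}")
    # Word order plays no part: triples come out in a fixed role order.
    return [Triple(action, role, roles[role]) for role in KARAKAS if role in roles]

# "Rama gives a book to Sita with his hand" (an invented example).
triples = to_triples("giving", recipient="Sita", agent="Rama",
                     object="book", instrument="hand")
for t in triples:
    print(t)

def move(item, source, target):
    """Model a change of state as disuniting from one set and uniting with another."""
    source.remove(item)  # raises KeyError if the item is not in source
    target.add(item)

# The falling leaf: it leaves the set on the tree and joins the set on the ground.
on_tree = {"leaf1", "leaf2"}
on_ground = set()
move("leaf1", on_tree, on_ground)
```

Because each triple carries its own role label, the list can be shuffled without changing its meaning, which mirrors the observation that word order in Sanskrit has little more than stylistic significance.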
Ivy Guide Translator Pen or Ivy Guide Mini Translator
It is a unique device that fits over any pen or pencil and scans words for translation. As a translator device, it basically helps you understand the language better. Reading a foreign book or magazine is not a problem with Ivy Guide. Designed by Shi Jian, Sun Jiahao and Li Ke, this mini translator pen is rechargeable via USB and adapts to your grip with ease. With international students crossing borders and the global community shrinking, this could prove to be a good learning device.
Towards Efficient Search
Sreejith C
Sreejithc321@gmail.com

“A perfect search engine is one that understands exactly what you mean and gives you exactly what you want.”
Larry Page, Google's CEO and co-founder

In recent years, it's become fashionable to declare that we're facing the end of search as Google's algorithms mature and newcomers like […] challenge the status quo. Many observers claim that keyword-based SEO is no longer the defining aspect of search and that inherently social platforms like Facebook are the future of search online. While it's tempting to accept this simple assessment of the state of search and its future, the reality is a bit more complicated. In truth, the evolution of search is ongoing and we're really just starting to explore the potential of the industry.

Semantic Search
Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, whether on the Web or within a closed system, to generate more relevant results. Semantic search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Major web search engines like Google and Bing incorporate some elements of semantic search.

Guha et al. distinguish two major forms of search: navigational and research. In navigational search, the user is using the search engine as a navigation tool to navigate to a particular intended document. Semantic search is not applicable to navigational searches. In research search, the user provides the search engine with a phrase which is intended to denote an object about which the user is trying to gather/research information. There is no particular document which the user knows about that s/he is trying to get to. Rather, the user is trying to locate a number of documents which together will give him/her the information s/he is trying to find. Semantic search lends itself well to this approach, which is closely related to exploratory search.
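One of the points listed above, handling synonyms, can be illustrated with a small query-expansion sketch. It is a deliberately simplified stand-in for semantic search, not how Google or Bing work: the `SYNONYMS` table, the function names, and the sample text are hypothetical, and a real system would draw on a thesaurus or ontology and on the context and intent of the search.

```python
# Hypothetical synonym table; a real system would draw on a thesaurus or ontology.
SYNONYMS = {
    "buy": {"purchase"},
    "car": {"automobile", "vehicle"},
}

def expand_query(query, synonyms):
    """Turn each query term into the set of terms accepted in its place."""
    return [{term} | synonyms.get(term, set()) for term in query.lower().split()]

def keyword_match(query, document):
    """Plain keyword matching: every query term must appear verbatim."""
    doc_terms = set(document.lower().split())
    return all(term in doc_terms for term in query.lower().split())

def expanded_match(query, document, synonyms):
    """Match if, for every query term, the term or one of its synonyms appears."""
    doc_terms = set(document.lower().split())
    return all(group & doc_terms for group in expand_query(query, synonyms))

document = "where to purchase a used automobile"
keyword_hit = keyword_match("buy car", document)
expanded_hit = expanded_match("buy car", document, SYNONYMS)
print(keyword_hit, expanded_hit)
```

The document shares no word with the query "buy car", so keyword matching rejects it, while the expanded query accepts it. The table here is one-directional; a query for "purchase" would not find "buy" unless the reverse entry were added.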
Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results. However, Google itself has subsequently also announced its own Semantic Search project.

Other authors primarily regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies as found on the Semantic Web. Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify his intent in more detail at query time.

Google unveils its semantic search plans - the Knowledge Graph
[…] variety of sources. Knowledge Graph display was added to Google's search engine in 2012, starting in the United States, having been announced on May 16, 2012. It provides structured and detailed information about the topic in addition to a list of links to other sites. The goal is that users would be able to use this information to resolve their query without having to navigate to other sites and assemble the information themselves.

According to Google, this information is derived from many sources, including the CIA World Factbook, Freebase and Wikipedia. The feature is similar in intent to answer engines such as Ask Jeeves and Wolfram Alpha. As of 2012, its semantic network contained over 570 million objects and more than 18 billion facts about and relationships between these different objects, which are used to understand the meaning of the keywords entered for the search.

“What search engines have lacked so far, until today, was the notion
that those words refer to a thing. If we maintain a representation of a thing, we can use that to better understand both what you are
What Is Google's Semantic Search? One of Google’s stated goals is to index all of the mass
world’s of
information,
combined
the
knowledge
ever-changing and
snarky
asking for and what the web itself
commentary that lives on the Internet. Today
is talking about.”
this index is getting some context, with billions of attributes and connections linking millions of
(Jack Menzel, product management director of
individual
nouns
—
Things,
in
Google’s
parlance. This type of context-informed dataset
search at Google)
is frequently known as the semantic web, but Google is avoiding that term and calling it Knowledge Graph The Knowledge
Knowledge Graph. Graph is
base used
by Google to
engine's
search
search information
CLEAR March 2013
a knowledge
enhance
results gathered
its search
If you’re logged into Google, you may be seeing
with semantic-
this new function already — it started rolling
from
a
wide
out May 16 and will be complete for all logged-
18
in English language users by the 18th. Type in
Definitions of things are inherently contextual
a search term, and instead of listing what you
— whether your first definition of Kings is a
might interested in, the search will provide you
hockey team, a basketball team, a TV show or
a set of options. Use
“Andromeda” as an
a gang depends on who you are and where you
example. You could choose between the galaxy,
are. Google will also make some determinations
the Greek myth, the Swedish metal band, and
based on your search profile and especially
so on.
your
location,
but
personalization
is
still
incomplete. Google will also be bringing its semantic search to tablets and smartphones soon, so to see just what
Google's
Knowledge
on
about
(interesting)
when Graph
it
says
(not-so
interesting)
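The idea that one search term can refer to several "things", and that context such as location helps choose among them, can be sketched with a tiny hand-made entity table. Everything here (the entity ids, the fields, the ranking rule) is a hypothetical illustration and not a description of how Google's Knowledge Graph is implemented.

```python
# Hypothetical mini knowledge base: several entities may share the same label.
ENTITIES = {
    "kings_hockey": {"label": "Kings", "type": "hockey team", "place": "Los Angeles"},
    "kings_basketball": {"label": "Kings", "type": "basketball team", "place": "Sacramento"},
    "andromeda_galaxy": {"label": "Andromeda", "type": "galaxy", "place": None},
    "andromeda_myth": {"label": "Andromeda", "type": "Greek myth", "place": None},
}


def candidates(term, entities):
    """Return the ids of all entities whose label matches the search term."""
    return sorted(eid for eid, e in entities.items() if e["label"].lower() == term.lower())


def rank(term, entities, user_place=None):
    """Order the candidates so that entities tied to the user's location come first."""
    def key(eid):
        local = user_place is not None and entities[eid]["place"] == user_place
        return (not local, eid)  # local entities sort first, then alphabetical by id

    return sorted(candidates(term, entities), key=key)
```

For a user in Sacramento, `rank("Kings", ENTITIES, "Sacramento")` lists the basketball team first; without any location, the candidates are simply offered as a set of options in alphabetical order.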
To make this possible, Google set about indexing universal definitions, using every public database from Wikipedia to the CIA World Factbook to Google’s own products. The result is a new set of 500 million people, places and things, with 3.5 billion connections among them. Along with allowing you to narrow your context, search results now contain little connections and suggestions to augment an initial search term. People search results come with biographical information, for instance; places results come with data about the place; and so on. Search for Frank Lloyd Wright, and you’ll see a Wikipedia-based summary of him, a biographical sketch, and a Google-curated list of houses he designed, which will take you to further information if you click.

The Future of Search
Search has come a long way since the early days of AltaVista and Yahoo. We've gone from relying solely on search engines with a bunch of blue links to using more advanced tools like Google's Knowledge Graph, Siri, Google Now, Yelp, and Foursquare. But search still hasn't been completely solved.
Here's the future of search:
Search will become even more personal, and Google will be able to know what you're looking for based on who you are.
Answers, not links, will become more prevalent.
Search will do things, rather than simply suggest things.
Search engines will not only find, but interpret what they find by generating their own algorithms.
Our digital lives will be combined into one searchable platform.
Advances in artificial intelligence and natural language understanding will result in deeper descriptions and understanding of web pages.
Google Translate
Robert Jesuraj K
rajarobertjesuraj@gmail.com

Google Translate is a free statistical multilingual machine-translation service provided by Google Inc. to translate written text from one language into another. Google used a SYSTRAN based translator, which is used by other translation services such as Babel Fish, AOL, and Yahoo. SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. SYSTRAN has done extensive work for the United States Department of Defense and the European Commission. Commercial versions of SYSTRAN can run on Microsoft Windows (including Windows Mobile), Linux, and Solaris. Historically, SYSTRAN systems used rule-based machine translation (RbMT) technology. With the release of SYSTRAN Server 7 in 2010, SYSTRAN implemented a hybrid rule-based/statistical machine translation (SMT) technology, which was the first of its kind in the marketplace.

The following is a list of the source and target languages with which SYSTRAN works. Many of the pairs are to or from English or French.
Russian into English (1968)
1. English into Russian (1973) for the Apollo-Soyuz project
2. English source (1975) for the European Commission
3. Arabic
4. Chinese
5. Danish
6. Dutch
7. French
8. German
9. Greek
10. Hindi
11. Italian
12. Japanese
13. Korean
14. Norwegian
15. Serbo-Croatian
16. Spanish
17. Swedish
18. Persian
19. Polish
20. Portuguese
21. Ukrainian
22. Urdu

Google Translate Features
The service limits the number of paragraphs, or range of technical terms, that will be translated. It is also possible to enter searches in a source language that are first translated to a destination language, allowing users to browse and interpret results from the selected destination language in the source language. For some languages, users are asked for alternate translations, such as for technical terms, to be included in future updates to the translation process. Text in a foreign language can be typed, and if "Detect language" is selected, it will not only detect the language but will translate it into English by default.

Google Translate, like other automatic translation tools, has its limitations. While it can help the reader to understand the general content of a foreign language text, it does not always deliver accurate translations. Some languages produce better results than others. Google Translate performs well especially when English is the target language and the source language is one of the languages of the European Union. Results of analyses were reported in 2010, showing that French to English translation is relatively accurate, and in 2011 and 2012, showing that Italian to English translation is relatively accurate as well. However, rule-based machine translations perform better if the text to be translated is shorter; this effect is particularly evident in Chinese to English translations.

Texts written in the Greek, Devanagari, Cyrillic and Arabic scripts can be transliterated automatically from phonetic equivalents written in the Latin alphabet. A number of Firefox extensions exist for Google services, and likewise for Google Translate, which allow right-click command access to the translation service. An extension for Google's Chrome browser also exists; in February 2010, Google Translate was integrated into the standard Google Chrome browser for automatic webpage translation.

Translate Application for Android
Google Translate is available as a free downloadable application for Android OS users. The first version was launched in January 2010. It works simply like the browser version. Google Translate for Android contains two main options: "SMS translation" and "History". An early 2011 version supported Conversation Mode when translating between English and Spanish (in alpha testing). This new interface within Google Translate allows users to communicate fluidly with a nearby person in another language. In October it was expanded to 14 languages. The application supports 53 languages and voice input for 15 languages. It is available for devices running Android 2.1 and above and can be downloaded by searching for “Google Translate” in Android Market. It was first released in January 2010, with an improved version available on January 12, 2011.

Indian languages (in alpha) and a transliterated input method were first introduced in the following languages:
1. Assamese
2. Bengali
3. Gujarati
4. Kannada
5. Malayalam
6. Marathi
7. Oriya
8. Punjabi
9. Tamil
10. Telugu
Ontology Tools: An overview
Manu Madhavan
mmnamboodiry@gmail.com

Semantic Web is an attempt to add semantics (meaning) to the traditional web. It is a collaborative movement led by the international standards body, the World Wide Web Consortium (W3C). According to the W3C, "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries." The current web pages tell the browsers how to present the page. Semantic web specifies some standards to add more information so that the contents of the web pages can be understood (by browsers). This standard (known as Resource Description Framework, RDF) promotes common data formats on the internet.

In order to share the common knowledge in the semantic web realm, a special knowledge representation standard is defined. An ontology is the standard format for representing the common vocabulary in a domain. In theory, an ontology is a "formal, explicit specification of a shared conceptualisation". An ontology renders shared vocabulary and taxonomy which models a domain with the definition of objects/concepts, as well as their properties and relations.

The specific languages used to represent the ontology include RDF and OWL (Web Ontology Language). The languages have some interrelation with each other. OWL is defined over RDF by adding more constraints.

Ontology editors are applications designed to assist in the creation or manipulation of ontologies. They often express ontologies in one of many ontology languages. This article discusses some of the popular tools that can be used for creating, editing, visualizing and querying ontologies.

Sesame and Jena: APIs for Ontology
The ontology APIs provide methods and interfaces that can be used to create ontologies programmatically. The popular APIs for ontology manipulation are Jena and Sesame.

Sesame
Sesame is an open-source framework for querying and analysing RDF data. It was created, and is still being maintained, by the Dutch software company Aduna. It was originally developed as part of "On-To-Knowledge", a semantic web project that ran from 1999 to 2002. It contains a triple store. Sesame supports two query languages: SeRQL and SPARQL. Sesame as an API allows for mapping Java classes onto ontologies and for generating Java source files from ontologies. This makes it possible to use specific ontologies like RSS, FOAF and the Dublin Core directly from Java.
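To make the notion of RDF data more concrete: RDF describes a domain as a set of (subject, predicate, object) triples, and a triple store answers queries by matching patterns against them. The following is a minimal sketch of that idea in plain Python; it does not use Sesame or Jena, the `ex:` names and the sample data are hypothetical, and the `None` wildcard only loosely imitates a SPARQL variable.

```python
# A toy in-memory triple store; the ex: identifiers and the data are illustrative only.
TRIPLES = {
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


def names_of_people(triples):
    """Join two patterns: find every foaf:Person, then look up its foaf:name."""
    people = [s for (s, _, _) in match(triples, predicate="rdf:type", obj="foaf:Person")]
    return sorted(
        o
        for person in people
        for (_, _, o) in match(triples, subject=person, predicate="foaf:name")
    )
```

Here `names_of_people(TRIPLES)` plays the role of a small query that combines two triple patterns, which is the kind of question a query language such as SPARQL is designed to express declaratively.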
Jena
Jena is an open source Semantic Web framework for Java. It provides an API to extract data from and write to RDF graphs. The graphs are represented as an abstract "model". A model can be sourced with data from files, databases, URLs or a combination of these. A model can be written out as an OWL/RDF file. A model can also be queried through SPARQL and updated through SPARUL. Jena is similar to Sesame; though, unlike Sesame, Jena provides support for OWL (Web Ontology Language). The framework has various internal reasoners, and the Pellet reasoner (an open source Java OWL-DL reasoner) can be set up to work in Jena.

SWOOP
Swoop is an ontology development toolkit that provides an integrated environment to build and edit ontologies, check for errors and inconsistencies (using a reasoner), browse multiple ontologies, and share and reuse existing data by establishing mappings among different ontological entities. The most exciting feature of SWOOP is its web browser like look and feel, with hyperlinks to URIs. It uses many reusable ontologies and namespaces. All ontology editing in SWOOP is done in-line with the HTML renderer, using different colour codes and font styles to emphasize ontology changes, e.g. different representations for added axioms vs. deleted axioms vs. inferred axioms. It supports ABox and TBox reasoning on ontologies.

Protégé
Protégé is a free, open source ontology editor and a knowledge acquisition system. Like Swoop, Protégé is a framework for which various other projects suggest plugins. This application is written in Java and heavily uses Swing to create the rather complex user interface. Protégé has over 200,000 registered users. Protégé is being developed at Stanford University in collaboration with the University of Manchester and is made available under the Mozilla Public License 1.1. Protégé provides a SPARQL query engine to query the ontology. OntoViz, an ontology visualization tool associated with Protégé, helps to visualize the ontology graphs.
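The reasoning support mentioned for these tools can be illustrated with a very small sketch: the TBox holds statements about classes (here only subclass axioms), the ABox holds statements about individuals, and a reasoner derives facts that were never stated explicitly. The class names and individuals below are hypothetical, and the code covers only subclass inference; it is not how Pellet or Jena's reasoners work internally, and OWL reasoning is far more expressive than this.

```python
# Hypothetical toy ontology.
# TBox: (subclass, superclass) axioms. ABox: (individual, class) assertions.
TBOX = {
    ("Professor", "Faculty"),
    ("Faculty", "Person"),
    ("Student", "Person"),
}
ABOX = {
    ("alice", "Professor"),
    ("bob", "Student"),
}


def superclasses(cls, tbox):
    """Return every class reachable from cls through subclass axioms (transitive closure)."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in tbox:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def inferred_types(individual, abox, tbox):
    """Combine the asserted classes of an individual with all their superclasses."""
    asserted = {c for (i, c) in abox if i == individual}
    inferred = set(asserted)
    for c in asserted:
        inferred |= superclasses(c, tbox)
    return inferred
```

Although the ABox only states that alice is a Professor, `inferred_types("alice", ABOX, TBOX)` also reports Faculty and Person, which is the kind of inferred axiom an editor such as SWOOP displays differently from asserted ones.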
SIMPLE Groups M.Tech Computational Linguistics Dept. of Computer Science and Engineering Govt. Engineering College, Sreekrishnapuram simplequest.in@gmail.com www.simpegroups.in
CLEAR Online Magazine
Article Invitation for CLEAR - June 2013
We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in June 2013. The suggested areas of discussion are:
Machine Translation Web Information Extraction/Mining Speech Recognition/Understanding Information Retrieval and Extraction Speech Analysis/Synthesis Quantitative Linguistics Computational Models Language Learning Computational Linguistics Linguistics Modelling Techniques Computational Theories Multilingual/Cross-lingual Language Processing Natural Language Processing Corpus Linguistics Spoken Dialog Systems Software Engineering for NLP Formal Linguistics-Theoretic and Grammar Induction Language and Social Networks Computational Semantics Applying NLP on Domain Specific Data Automatic Text Summarization Lexical Semantics, Word Sense Disambiguation Sentiment Analysis and Opinion Mining Human-Computer Dialogue Systems Models of Cognitive Processes Discourse and Anaphora Ques/Ans and Dialogue Systems Morph Analyzer Textual Entailment Multimodal Representations and Processing Natural Language Interface for Database Deep Learning in NLP
The articles may be sent to the Editor on or before 10th June, 2013 through the email simplequest.in@gmail.com. For more details visit: http://simplegec.blogspot.in
Editor CLEAR Magazine
Hello World, Here I’m sharing a true motivational story for all NLP aspirants. It’s about Mr. Sathyaseelan, who conquered his blindness with the torch of technology. Mr. Sathyaseelan is an office bearer of an association for the blind and a faculty member at a school for the blind at Kasargod. We (with Robert and Mujeeb) met him at CUSAT, during his demonstration of software tools for visually challenged people, as part of NCILC13. There, with the help of his tools, he read e-papers and books, sent e-mails and even installed a complete system using speech tools. He not only uses the system for his own needs, but also teaches and motivates his fellow people to use it. The system was developed by integrating many open source tools, with the help of his 17-year-old son, Nalin. Now he is inviting the ILC communities and NLP aspirants to improve the system. Mr. Sathyaseelan’s success is a real example of how NLP can touch society. Fast growing technology should not be out of reach for such challenged people. They should also get the chance to experience the craziness of technology. Socially motivated technocrats should come forward to make this true. During the discussion, he invited our volunteer contributions to improve Indian language computing resources. When he shared his visions on ILC, open source technology and current issues in ILC, he became more talkative and energetic. There we saw a light in his blind eyes: a light of hope!! Dear readers, the tech-pads are open!!! Wish you all the best. Manu Madhavan.