CLEAR March 2013


Editorial
SIMPLE News & Updates
CLEAR Jun 2013 Invitation
Last word

CLEAR MARCH 2013
Volume-2 Issue-1
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj
Professor and Head
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Manu Madhavan
Robert Jesuraj. K
Athira P M
Sreejith C

Cover page and Layout
Mujeeb Rehman. O


Conceptual Indexing and Compound Word Splitter for better Information Retrieval in Malayalam
…search engine fails to present them. This issue can be solved by a new indexing technique where concept of the document is chosen for the search instead of mere keywords…

The Unfinished Symphony: Sanskrit and AI
…One of the main differences between the Indian approach to language analysis and that of most of the current linguistic theories is that the…

Towards Efficient Search
…Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space…

Google Translate
…machine-translation service provided by Google Inc. to translate written text from one language into another. Google used a SYSTRAN based translator…

Ontology Tools: An Overview
…applications designed to assist in the creation or manipulation of ontologies. They often express ontologies in one of many ontology languages. This article discusses some of the popular tools that can be used…



Greetings!

When we bring you the third edition of CLEAR, we have some reasons to cheer. First, the prestigious GARUDA Challenge award (in the students category), instituted by C-DAC to popularize their Grid computing platform, was given to Robert Jesuraj, who is our second-year student and an active volunteer of CLEAR. Secondly, this edition's contributors are all our students, who share their understanding and insights in areas related to Computational Linguistics and Information Retrieval. Some have taken up local language computing (Malayalam) seriously, and are pursuing their project work in that direction. Among these works are Conceptual Indexing and Compound Word Splitter for Better Information Retrieval in Malayalam by Radhika, an overview of Ontology Tools by Manu, the relevance of Sanskrit to AI by Athira, and some pointers to efficient search methods by Sreejith.

I have reasons to be happy that our efforts have started bearing fruit, and I place before you this edition of CLEAR on this positive note.

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)



NEWS & UPDATES

TEQIP phase II: GEC Sreekrishnapuram selected for direct central assistance

Technical Education Quality Improvement Programme (TEQIP) is a scheme of the Ministry of Human Resource Development (MHRD) aimed at strengthening the quality of technical education in the country. Govt. Engineering College, Sreekrishnapuram has been selected for direct central assistance under the second phase of the TEQIP scheme. Under the scheme, we have to set up improved lab facilities, introduce new PG and doctoral programs, achieve academic autonomy, and establish centers of excellence in engineering and technology. This significant recognition will give a fillip to postgraduate education and to the research and development activities of the College.

Robert Won First Prize in GARUDA Challenge

Robert Jesuraj of SIMPLE Groups won the first prize in GARUDA Challenge 2012. The competition was conducted by C-DAC for Grid enthusiasts. The winner received a certificate, a memento and a cash award of Rs. 50000/-. Congratulations!

Publications

The following papers were published in the 3rd National Conference on Indian Language Computing (NCILC), held on January 19-20, 2013 at CUSAT. The papers were also selected for the CSI digital resource center.

1. Manu Madhavan, Mujeeb Rehman O, P. C. Reghu Raj, "Computing Prosodic Pattern for Malayalam". CSI Digital Resource Center link: http://csidl.org/xmlui/handle/123456789/543

2. Robert Jesuraj, P. C. Reghu Raj, "MBLP approach applied to POS tagging in Malayalam Language". CSI Digital Resource Center link: http://hdl.handle.net/123456789/544


Conceptual Indexing and Compound Word Splitter: For Better Malayalam Information Retrieval

Radhika K T
radhydev@yahoo.com

Today computers are used as entry points to the information highways. But building an efficient Malayalam search engine is a challenging area in the field of Malayalam computing. Even though it is possible to search Malayalam documents nowadays, the user seldom gets relevant results. A Malayalam search engine displays a list of documents after finding exact matches with the query terms. This approach produces many non-relevant documents that the user does not expect. Also, many relevant documents are missed when the query terms do not exactly match the document terms. These issues can be solved in two ways: one is by adding a compound word splitter at the document side, and the other is by applying an alternate indexing technique called Conceptual Indexing.

The working of a traditional Malayalam search engine is based on keyword indexing. An obvious option is to scan the text sequentially, but this is time consuming. Another option is to build data structures over the text, called indices, to speed up the search. Here, the first and most important step in building a search engine is the preparation of the index table; this process is called indexation. The indexation process is carried out by software called a spider or crawler. To start the journey, the search engine provides a seed URL to the crawler. The crawler visits each document on the web, starting from the seed URL, and collects the terms. For each term, it keeps a list that records the documents in which the term occurs. Thus an index always maps back from terms to the parts of a document where they occur. When the user enters a query (information need), the search engine goes through this index table and returns the URLs of the documents identified from the table. Otherwise it simply returns "no results have been found for your query".
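The keyword indexing described above can be sketched in a few lines. This is only a minimal illustration, not the code of any actual Malayalam search engine: the page URLs, the page text and the function names `build_index` and `search` are all assumptions made for the example, and the crawler is replaced by a small dictionary of already-fetched pages.

```python
from collections import defaultdict

NO_RESULTS = "no results have been found for your query"

def build_index(documents):
    """Map each term to the set of document URLs in which it occurs."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every query term (exact match only)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical pages standing in for what a crawler would collect.
documents = {
    "http://example.org/a": "kerala flood relief news",
    "http://example.org/b": "kerala tourism news",
}
index = build_index(documents)

print(search(index, "kerala news") or NO_RESULTS)
print(search(index, "floods") or NO_RESULTS)
```

The second query shows the weakness the article goes on to discuss: "floods" does not exactly match the indexed term "flood", so a relevant page is missed.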

Present condition through an illustration

The results for the queries “[…]” and “[…]” are shown in Figure 1 and Figure 2 respectively.

Figure 1: Results for the query “[…]”

Figure 2: Results for the query “[…]”

The results are different even though both queries have the same concept. Also, the relevant document, which tells exactly about “[…]”, is missed in the first search result. Adding a compound word splitter at the document side will give a better solution for this issue.

Compound Word Splitter

As Malayalam is one of the highly inflected and agglutinative languages of the Dravidian family, index term matching fails in the case of compound words. Splitting is needed for compound words whose morphemes are of different lexical categories. The hybrid sandhi splitter is a program developed [1] to split compound words into their constituent morphemes. In Malayalam, it is easy to join a noun and another word that starts with a “swaram” (vowel). The second word can be a verb, adverb or adjective. Such word compounding is done mainly for ease of pronunciation. A compound word generally consists of a noun-noun, noun-adjective, noun-verb, adverb-verb or adjective-noun combination. In some cases all the words of an entire sentence may combine to form a single one. Examples of such word combinations are: “[…]”, “[…]”.
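To see why splitting at the document side helps indexing, here is a toy sketch. It is not the hybrid sandhi splitter of the article: it uses a plain dictionary lookup, English stand-in words, and a made-up lexicon, and it ignores the sound changes that real Malayalam sandhi introduces at the junction. The names `split_compound` and `index_terms` are hypothetical.

```python
def split_compound(word, lexicon):
    """Split `word` into known lexicon entries, or return [word] unchanged."""
    if word in lexicon:
        return [word]
    # Try the longest known prefix first, then recurse on the remainder.
    for cut in range(len(word) - 1, 0, -1):
        head, tail = word[:cut], word[cut:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if all(part in lexicon for part in rest):
                return [head] + rest
    return [word]

def index_terms(text, lexicon):
    """Produce index terms for a document, splitting compounds where possible."""
    terms = []
    for token in text.lower().split():
        terms.extend(split_compound(token, lexicon))
    return terms

# Made-up lexicon of simple words (an assumption for this example).
lexicon = {"rain", "water", "harvesting"}

print(index_terms("Rainwater harvesting", lexicon))
```

With the compound split before indexing, a query for "water" can match a document that only contains "rainwater", which exact keyword matching would miss.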

The proposed hybrid sandhi splitter system has two main modules: one for training the system to automatically detect compound words, and another for splitting them into their constituent morphemes. The hybrid sandhi splitter is used as a preprocessor for a morphological analyzer. When indexing is done after applying the compound word splitter at the document side, documents having the same concept will be represented using the same common index terms.

Conceptual Indexing

Let us discuss the second issue. An online information seeker often fails to find what is wanted because the words used in the request are different from the words used in the relevant material. This issue is illustrated in Figure 3 and Figure 4.

Figure 3: Example for searching the word “[…]”

The top-ranked document is a relevant one and describes the whole concept of […] and also its […]. But when a user wants to know about […], this relevant document is not retrieved.

Figure 4: Result of the query “[…]”

Even though the query is most specific, and the appropriate document is there, the search engine fails to present it. This issue can be solved by a new indexing technique where the concept of the document is chosen for the search instead of mere keywords. This new technique for organizing information, called Conceptual Indexing, combines techniques from both knowledge representation and NLP and adds meaning to index terms. In order to find the concept, the conceptual indexing and retrieval system automatically extracts words and phrases from text and organizes them into a semantic network that integrates syntactic, semantic, and morphological relationships. It needs, as a back end, a lexicon containing syntactic, semantic, and morphological information about words, word senses, and phrases.

According to the architecture of the proposed conceptual indexing system, the most general query like […] gets documents telling about […], […] etc., and the most specific query like […] gives exact results about […] and does not retrieve documents telling about […].
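The behaviour of general and specific queries under conceptual indexing can be illustrated with a much reduced sketch. The article's system builds a semantic network with syntactic, semantic and morphological relations; the example below assumes only a tiny hand-made lexicon and a single "broader concept" relation, and every word, concept and document in it is invented for illustration.

```python
from collections import defaultdict

# Hypothetical lexicon: surface word -> concept, and concept -> broader concept.
WORD_TO_CONCEPT = {"car": "car", "automobile": "car", "bus": "bus", "vehicle": "vehicle"}
BROADER = {"car": "vehicle", "bus": "vehicle"}

def concepts_with_ancestors(word):
    """Return the concept of `word` followed by all of its broader concepts."""
    concept = WORD_TO_CONCEPT.get(word)
    found = []
    while concept is not None:
        found.append(concept)
        concept = BROADER.get(concept)
    return found

def build_concept_index(documents):
    """Index each document under its concepts and their broader concepts."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            for concept in concepts_with_ancestors(word):
                index[concept].add(doc_id)
    return index

def search(index, query_word):
    """Look up documents by the concept behind the query word."""
    concept = WORD_TO_CONCEPT.get(query_word.lower())
    return sorted(index.get(concept, set()))

documents = {"d1": "automobile repair guide", "d2": "bus timetable"}
index = build_concept_index(documents)

print(search(index, "vehicle"))  # general query: documents about cars and buses
print(search(index, "car"))      # specific query: only the car document
```

Note that the query "car" finds the document that says "automobile", although the two share no keyword, while it does not return the bus document.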

References

[1] Barzilay R. and Elhadad M., “Using Lexical Chains for Text Summarization”, in Proc. of the ACL Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997, 10-17.

[2] Berry M., Dumais T., Landauer T., and O'Brien G., “Using linear algebra for intelligent information retrieval”, SIAM Review, 37 (1995) 573-595.

[3] Dhanya P. M., Jathavedan M., “Text summarization using language understanding: A survey”, in Proceedings of Second National Conference on Indian Language Computing, Dept. of Computer Applications, CUSAT, 2012.

[4] Edmundson H. P., “New methods in automatic extracting”, Journal of the ACM, 16(2):264-285, April 1969.

[5] Hajime M. and Manabu O., “A comparison of summarization methods based on task-based evaluation”, 2nd International Conference on Language Resources and Evaluation, LREC-2000, Athens, Greece, 2000, 633-639.

[6] Hovy E. H., “Automated Text Summarization”, in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics, pp. 583-598. Oxford: Oxford University Press, 2005.

[7] Julian M. Kupiec, J. Pedersen and F. Chen, “A Trainable Document Summarizer”, in Proc. of 18th ACM-SIGIR Conference on Research and Development in Information Retrieval, Seattle, USA, July 1995, 68-73.

[8] Luhn H. P., “The Automatic Creation of Literature Abstracts”, presented at IRE National Convention, New York, 159-165, 1958.

[9] Pierre-Etienne Genest, Guy Lapalme, “Framework for Abstractive Summarization using Text-to-Text Generation”, RALI-DIRO.

[10] Saravan M., “A probabilistic approach to Multi document Summarization for Generating A Tiled Summary”, International Journal of Computational Intelligence and Applications, 2006.

[11] Tanveer Siddiqui, Tiwary U. S., Natural Language Processing and Information Retrieval. Oxford University Press, 2008.

Congratulations!!!

The author of this article presented this paper, “Conceptual Indexing and Compound word splitting for better Information retrieval in Malayalam”, at a national-level workshop at Thiruvananthapuram, jointly organized by Malayalam University and Kerala IT Mission. The CLEAR Team and SIMPLE Groups heartily congratulate Radhika for her achievement.


The Unfinished Symphony: Sanskrit and AI

Athira P. M.
athira69@gmail.com

“The extraordinary thing about Sanskrit is that it offers direct accessibility to anyone to that elevated plane where the two (mathematics and music, brain and heart, analytical and intuitive, scientific and spiritual) become one.”

Whitehead's Modes of Thought speaks highly of language: "...The mentality of mankind and the language of mankind created each other. If we like to assume the rise of language as a given fact, then it is not going too far to say that the souls of men are the gift from language to mankind. The account of the sixth day should be written: He gave them speech, and they became souls."

But Whitehead's words are somewhat ambiguous, and may have created in readers as many different responses as there are readers. One may perceive his statement as a noble and inspiring truth. Another may react to the notion that a 'soul' could depend on language. Still another may be completely in the dark about what Whitehead is saying.

A Sacred Language?

Sanskrit is principally known outside India as the sacred language of Hinduism. However, one effect of this sacred status has been the long-term development of linguistic science in India, on a rigorous empirical basis. In fact, the attitude to Sanskrit as sacred has been the solid foundation and justification for its position as the leading focus of Indian studies of language for three millennia. These studies have ranged over the full gamut of the scientific study of language, and have for the most part been preserved up to the present day.

We have greatly under-estimated the real sacred power of language. When the power of language to create and discover life is recognized, language becomes sacred; in ancient times, language was held in this regard. Nowhere was this more so than in ancient India. It is evident that the ancient scientists of language were acutely aware of the function of language as a tool for exploring and understanding life, and their intention to discover truth was so consuming that in the process of using language with greater and greater rigor, they discovered perhaps the most perfect tool for fulfilling such a search that the world has ever known: the Sanskrit language.

Of all the discoveries that have occurred and developed in the course of human history, language is the most significant and probably the most taken for granted. Without language, civilization could obviously not exist. On the other hand, to the degree that language becomes sophisticated and accurate in describing the subtlety and complexity of human life, we gain power and effectiveness in meeting its challenges.

Panini's Language Theory

It was Panini who formalised Sanskrit's grammar and usage about 2500 years ago. No new 'classes' have needed to be added to it since then. "Panini should be thought of as the forerunner of the modern formal language theory used to specify computer languages," say J J O'Connor and E F Robertson. Their article also quotes: "Sanskrit's potential for scientific use was greatly enhanced as a result of the thorough systemisation of its grammar by Panini. ... On the basis of just under 4000 sutras [rules expressed as aphorisms], he built virtually the whole structure of the Sanskrit language, whose general 'shape' hardly changed for the next two thousand years."

Panini was a Sanskrit grammarian who gave a comprehensive and scientific theory of phonetics, phonology, and morphology. Sanskrit was the classical literary language of the Indians and Panini is considered the founder of the language and literature. It is interesting to note that the word "Sanskrit" means "complete" or "perfect" and it was thought of as the divine language.

A treatise called Astadhyayi (or Astaka) is Panini's major work. It consists of eight chapters, each subdivided into quarter chapters. In this work Panini distinguishes between the language of sacred texts and the usual language of communication. Panini gives formal production rules and definitions to describe Sanskrit grammar. Starting with about 1700 basic elements like nouns, verbs, vowels and consonants, he put them into classes. The construction of sentences, compound nouns etc. is explained as ordered rules operating on underlying structures in a manner similar to modern theory. In many ways Panini's constructions are similar to the way that a mathematical function is defined today.

The Astadhyayi is one of the earliest known grammars of Sanskrit, although Panini refers to previous texts like the Unadisutra, Dhatupatha, and Ganapatha. It is the earliest known work on descriptive linguistics, and together with the work of his immediate predecessors (Nirukta, Nighantu, Pratishakhyas) stands at the beginning of the history of linguistics itself. His theory of morphological analysis was more advanced than any equivalent Western theory before the mid 20th century, and his analysis of noun compounds still forms the basis of modern linguistic theories of compounding, which have borrowed Sanskrit terms such as bahuvrihi and dvandva.

Sanskrit - A Scientific Language?

Panini should be thought of as the forerunner of the modern formal language theory used to specify computer languages. The Backus Normal Form was discovered independently by John Backus in 1959, but Panini's notation is equivalent in its power to that of Backus and has many similar properties. It is remarkable to think that concepts which are fundamental to today's theoretical computer science should have their origin with an Indian genius around 2500 years ago. Sanskrit linguistics is in an excellent position to make immediate use of most modern techniques in language processing, since it is already provided with most of the infrastructural tools which are currently seen as desirable.

Lakshmi Thathachar's view of Sanskrit's nature may be paraphrased as follows: All modern languages have etymological roots in classical languages. And some say all Indo-European languages are rooted in Sanskrit, but let us not get lost in that debate. Words in Sanskrit are instances of pre-defined classes, a concept that drives object oriented programming [OOP] today. All words have the OOP approach, except that defined classes in Sanskrit are so exhaustive that they cover the material and abstract (indeed cosmic) experiences known to man. So in Sanskrit the connection is more than etymological.

Every 'philosophy' in Sanskrit is in fact a 'theory of everything'. [The many strands are synthesised in Vedanta (Veda + anta), which means the 'last word in Vedas'.] Thathachar believes it is not a 'language' as we know the term but the only front-end to a huge, interlinked, analogue knowledge base. The current time in human history is ripe, he feels, for India's young techno wizards to turn to researching Mimamsa and developing the ultimate programming language around it; nay, an operating system itself.

“The modern world”, Professor Thathachar declares, “needs Sanskrit,” because Sanskrit is such a systematic and scientific language. Lord Macaulay, the British politician who famously foisted an English-medium education system upon India, thought it a dead language. Now that Panini's grammar is recognised almost as a meta-grammar for the world by those such as the American linguist Noam Chomsky, the professor welcomes Sanskrit's ascendant star in the IT era.

Rick Briggs: Sanskrit and AI

According to Forbes magazine (July 1987), "Sanskrit is the most convenient language for computer software programming." Relevant to this, there has recently been an astounding discovery made at the NASA research centre. The following quote is from an article, Sanskrit & Artificial Intelligence, which appeared in AI (Artificial Intelligence) magazine in spring of 1985, written by NASA researcher Rick Briggs:

"In the past twenty years, much time, effort, and money has been expended on designing an unambiguous representation of natural languages to make them accessible to computer processing. These efforts have centred around creating schemata designed to parallel logical relations with relations expressed by the syntax and semantics of natural languages, which are clearly cumbersome and ambiguous in their function as vehicles for the transmission of logical data. Understandably, there is a widespread belief that natural languages are unsuitable for the transmission of many ideas that artificial languages can render with great precision and mathematical rigor."

But this dichotomy, which has served as a premise underlying much work in the areas of linguistics and artificial intelligence, is a false one. There is at least one language, Sanskrit, which for the duration of almost 1000 years was a living spoken language with a considerable literature of its own. Besides works of literary value, there was a long philosophical and grammatical tradition that has continued to exist with undiminished vigor until the present century. Among the accomplishments of the grammarians can be reckoned a method for paraphrasing Sanskrit in a manner that is identical not only in essence but in form with current work in Artificial Intelligence. This article demonstrates that a natural language can serve as an artificial language also, and that much work in AI has been reinventing a wheel millennia old.

In early AI research it was discovered that in order to clear up the inherent ambiguity of natural languages for computer comprehension, it was necessary to utilize semantic net systems to encode the actual meaning of a sentence. He further comments, "The degree to which a semantic net (or any unambiguous non-syntactic representation) is cumbersome and odd-sounding in a natural language is the degree to which that language is 'natural' and deviates from the precise or 'artificial.' As we shall see, there was a language (Sanskrit) spoken among an ancient scientific community that has a deviation of zero."

One of the main differences between the Indian approach to language analysis and that of most of the current linguistic theories is that the analysis of the sentence was not based on a noun-phrase model with its attending binary parsing technique but instead on a conception that viewed the sentence as springing from the semantic message that the speaker wished to convey. In its origins, sentence description was phrased in terms of a generative model: from a number of primitive syntactic categories (verbal action, agents, object, etc.) the structure of the sentence was derived so that every word of a sentence could be referred back to the syntactic input categories. Secondarily and at a later period in history, the model was reversed to establish a method for analytical descriptions.

In the analysis of the Indian grammarians, every sentence expresses an action that is conveyed both by the verb and by a set of "auxiliaries." The verbal action is represented by the verbal root of the verb form; the "auxiliary activities" by the nominals (nouns, adjectives, indeclinables) and their case endings (one of six).

The meaning of the verb is said to be both vyapara (action, activity, cause), and phala (fruit, result, effect). Syntactically, its meaning is invariably linked with the meaning of the verb "to do". This allows us to rephrase the sentence in terms of the verb "to do" or one of its synonyms, and an object formed from the verbal root which expresses the verbal action as an action noun. This information in Sanskrit is indicated by the fact that there is an agent who is engaged in an act and that the action is taking place in the present time. The next step in the process of isolating the verbal meaning is to rephrase the description in such a way that the agent and number categories appear as qualities of the verbal action.

The Sanskrit language has seven case endings, and six of these are definable representations of specific "auxiliary activities." The seventh, the genitive, represents a set of auxiliary activities that are not defined by the other six. The auxiliary actions are listed as a group of six: Agent, Object, Instrument, Recipient, Point of Departure, and Locality. They are the semantic correspondents of the syntactic case endings: nominative, accusative, instrumental, dative, ablative and locative, but these are not in exact equivalence since the same syntactic structure can represent different semantic messages. There is a good deal of overlap between the karakas and the case endings, and a few of them are also used for syntactic information.

Word order in Sanskrit has usually no more than stylistic significance, and the Sanskrit theoreticians paid no more than scant attention to it. The language is then very suited to an approach that eliminates syntax and produces basically a list of semantic messages associated with the karakas.

The comparison of the analyses shows that the Sanskrit sentence matches the analysis arrived at through the application of computer processing. That is surprising, because the form of the Sanskrit sentence is radically different from that of the English. Of course, many versions of semantic nets have been proposed, some of which match the Indian system better than others do in terms of specific concepts and structure. The important point is that the same ideas are present in both traditions and that in the case of many proposed semantic net systems it is the Indian analysis which is more specific.

The semantic net analysis resembles the Sanskrit analysis remarkably, but the latter has an interesting flavour. Instead of a change from one location to another, as the semantic net analysis prescribes, the Indian system views the process as a uniting and disuniting of an agent. This process is equivalent to the concept of addition to and deletion from sets. A leaf falling to the ground can be viewed as a leaf disuniting from the set of leaves still attached to the tree, followed by a uniting with (addition to) the set of leaves already on the ground. This theory is very useful and necessary to formulate changes or statements of state.

Last word....

The main point in which the two lines of thought have converged is that the decomposition of each prose sentence into karaka-representations of action and focal verbal-action yields the same set of triples as those which result from the decomposition of a semantic net into nodes, arcs, and labels. It is interesting to speculate as to why the Indians found it worthwhile to pursue studies into unambiguous coding of natural language into semantic elements. It is tempting to think of them as computer scientists without the hardware, but a possible explanation is that a search for clear, unambiguous understanding is inherent in the human being. The analysis of language casts doubt on the humanistic distinction between natural and artificial intelligence, and may throw light on how research in AI may finally solve the natural language understanding and machine translation problems.

“Let us not forget that among the great accomplishments of the Indian thinkers was the invention of zero and of the binary number system, a thousand years before the West re-invented them.”

It is by no means the case that these analyses have been exhausted, or that their potential has been exploited to the full. On the contrary, it would seem that detailed analyses of sentences and discourse units had just received a great impetus, […] research as historical and structural linguistics, and lately generative linguistics, has for a long time acted as an impediment to further research along the traditional ways. Lately, however, serious and responsible research into Indian semantics has been resumed, especially at the University of Pune and Central Institute of Indian Languages, Mysore.

Let us end with an evaluation of Panini's contribution by Cardona in 'Panini: a survey of research' (Paris, 1976):

"Panini's grammar has been evaluated from various points of view. After all these different evaluations, I think that the grammar merits asserting ... that it is one of the greatest monuments of human intelligence."
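The convergence described under "Last word" can be made concrete with a small sketch. It assumes an English stand-in sentence ("John gives a book to Mary") and an illustrative role assignment; the function name `to_triples` and the data layout are inventions for this example, not Briggs's notation or Panini's. The six role names follow the list of auxiliary actions given in the article.

```python
KARAKA_ROLES = ("agent", "object", "instrument", "recipient",
                "point_of_departure", "locality")

def to_triples(action, **roles):
    """Decompose a sentence analysis into (action, role, filler) triples."""
    unknown = set(roles) - set(KARAKA_ROLES)
    if unknown:
        raise ValueError(f"not one of the six roles: {sorted(unknown)}")
    return {(action, role, filler) for role, filler in roles.items()}

# Karaka-style analysis of the stand-in sentence (illustrative only).
triples = to_triples("give", agent="John", object="book", recipient="Mary")

# The same content as a semantic net: a node with labelled arcs to other nodes.
semantic_net = {"give": {"agent": "John", "object": "book", "recipient": "Mary"}}
net_triples = {(node, arc, target)
               for node, arcs in semantic_net.items()
               for arc, target in arcs.items()}

print(triples == net_triples)

# The falling leaf, viewed as disuniting from one set and uniting with another.
on_tree, on_ground = {"leaf1", "leaf2"}, {"leaf0"}
on_tree.remove("leaf1")   # disuniting from the leaves on the tree
on_ground.add("leaf1")    # uniting with the leaves on the ground
print(sorted(on_tree), sorted(on_ground))
```

Both decompositions produce the same set of triples, which is the sense in which the article says the two lines of thought converge.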

Ivy Guide Translator Pen (Ivy Guide Mini Translator)

It is a unique device that fits over any pen or pencil and scans words for translation. As a translator device, it helps you understand the language better. Reading a foreign book or magazine is not a problem with Ivy Guide. Designed by Shi Jian, Sun Jiahao and Li Ke, this mini translator pen is rechargeable via USB and adapts to your grip with ease. With international students crossing borders and the global community shrinking, this could prove to be a good learning device.



Towards Efficient Search Sreejith C Sreejithc321@gmail.com

“A perfect search engine is one that understands exactly what you mean and gives you exactly what you want.� According to Larry Page, Google's CEO and co-founder

In recent years, it's become fashionable to declare that we're facing the end of search as Google's algorithms mature and newcomers like Facebook challenge the status quo. Many observers claim that keyword-based SEO is no longer the defining aspect of search and that inherently social platforms like Facebook are the future of search online. While it's tempting to accept this simple assessment of the state of search and its future, the reality is a bit more complicated. In truth, the evolution of search is ongoing and we're really just starting to explore the potential of the industry.

Semantic Search

Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual meaning of terms as they appear in the searchable data space, whether on the Web or within a closed system, to generate more relevant results. Semantic search systems consider various points including context of search, location, intent, and variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Major web search engines like Google and Bing incorporate some elements of semantic search.

Guha et al. distinguish two major forms of search: navigational and research. In navigational search, the user is using the search engine as a navigation tool to navigate to a particular intended document. Semantic search is not applicable to navigational searches. In research search, the user provides the search engine with a phrase which is intended to denote an object about which the user is trying to gather/research information. There is no particular document which the user knows about that s/he is trying to get to. Rather, the user is trying to locate a number of documents which together will give him/her the information s/he is trying to find. Semantic search lends itself well to this approach, which is closely related to exploratory search.
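To make the role of synonyms and word variation in semantic search more concrete, here is a minimal Python sketch that contrasts plain keyword matching with synonym-expanded matching. Everything in it is an assumption made for illustration: the SYNONYMS table, the three sample documents and the function names are invented, and real semantic search engines rely on far richer signals (context, location, intent) than a hand-written synonym list.

```python
# Toy illustration: keyword matching vs. synonym-expanded matching.
# The synonym table and the documents are made up for this example.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "cheap": {"affordable", "inexpensive"},
}

DOCUMENTS = {
    "d1": "affordable automobile dealers in the city",
    "d2": "cheap flights to london",
    "d3": "history of the bicycle",
}


def keyword_search(query, documents):
    """Return ids of documents that contain every query term literally."""
    terms = query.lower().split()
    return sorted(
        doc_id for doc_id, text in documents.items()
        if all(term in text.lower().split() for term in terms)
    )


def expanded_search(query, documents, synonyms):
    """Return ids of documents matching each term or one of its synonyms."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A term counts as matched if the document shares a word with
        # the set {term} plus the term's listed synonyms.
        if all(words & ({term} | synonyms.get(term, set()))
               for term in query.lower().split()):
            hits.append(doc_id)
    return sorted(hits)


print(keyword_search("cheap car", DOCUMENTS))
print(expanded_search("cheap car", DOCUMENTS, SYNONYMS))
```

With this sample data the literal keyword search finds nothing for "cheap car", while the expanded search returns the document about affordable automobile dealers, which is the kind of gap semantic search tries to close.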

Rather than using ranking algorithms such as Google's PageRank to predict relevancy, semantic search uses semantics, or the science of meaning in language, to produce highly relevant search results. In most cases, the goal is to deliver the information queried by a user rather than have a user sort through a list of loosely related keyword results. However, Google itself has subsequently also announced its own Semantic Search project.

Other authors primarily regard semantic search as a set of techniques for retrieving knowledge from richly structured data sources like ontologies as found on the Semantic Web. Such technologies enable the formal articulation of domain knowledge at a high level of expressiveness and could enable the user to specify his intent in more detail at query time.

Google unveils its semantic search plans - the Knowledge Graph

“What search engines have lacked so far, until today, was the notion that those words refer to a thing. If we maintain a representation of a thing, we can use that to better understand both what you are asking for and what the web itself is talking about.” (Jack Menzel, product management director of search at Google)

What Is Google's Semantic Search?

One of Google’s stated goals is to index all of the world’s information, the combined mass of knowledge and ever-changing snarky commentary that lives on the Internet. Today this index is getting some context, with billions of attributes and connections linking millions of individual nouns (Things, in Google’s parlance). This type of context-informed dataset is frequently known as the semantic web, but Google is avoiding that term and calling it Knowledge Graph.

Knowledge Graph

The Knowledge Graph is a knowledge base used by Google to enhance its search engine's search results with semantic-search information gathered from a wide variety of sources. Knowledge Graph display was added to Google's search engine in 2012, starting in the United States, having been announced on May 16, 2012. It provides structured and detailed information about the topic in addition to a list of links to other sites. The goal is that users would be able to use this information to resolve their query without having to navigate to other sites and assemble the information themselves.

According to Google, this information is derived from many sources, including the CIA World Factbook, Freebase and Wikipedia. The feature is similar in intent to answer engines such as Ask Jeeves and Wolfram Alpha. As of 2012, its semantic network contained over 570 million objects and more than 18 billion facts about and relationships between these different objects, which are used to understand the meaning of the keywords entered for the search.

If you’re logged into Google, you may be seeing this new function already: it started rolling out on May 16 and will be complete for all logged-in English language users by the 18th. Type in a search term, and instead of listing what you might be interested in, the search will provide you a set of options. Use “Andromeda” as an example. You could choose between the galaxy, the Greek myth, the Swedish metal band, and so on.

Definitions of things are inherently contextual: whether your first definition of Kings is a hockey team, a basketball team, a TV show or a gang depends on who you are and where you are. Google will also make some determinations based on your search profile and especially your location, but personalization is still incomplete. Google will also be bringing its semantic search to tablets and smartphones soon, so to see just what

Google's

Knowledge

on

about

(interesting)

when Graph

it

says

(not-so

interesting)

To do this, Google set about indexing universal definitions, using every public database from Wikipedia to the CIA World Factbook to Google’s own products. The result is a new set of 500 million people, places and things, with 3.5 billion connections among them. Along with allowing you to narrow your context, search results now contain little connections and suggestions to augment an initial search term. People search results come with biographical information, for instance; places results come with data about the place; and so on. Search for Frank Lloyd Wright, and you’ll see a Wikipedia-based summary of him, a biographical sketch, and a Google-curated list of houses he designed, which will take you to further information if you click.

The Future of Search

Search has come a long way since the early days of AltaVista and Yahoo. We've gone from relying solely on search engines with a bunch of blue links to using more advanced tools like Google's Knowledge Graph, Siri, Google Now, Yelp, and Foursquare. But search still hasn't been completely solved.


Here's the future of search:

Search will become even more personal, and Google will be able to know what you're looking for based on who you are.

Answers, not links, will become more prevalent.

Search will do things, rather than simply suggest things.

Search engines will not only find, but interpret what they find by generating their own algorithms.

Our digital lives will be combined into one searchable platform.

Advances in artificial intelligence and natural language understanding will result in deeper descriptions and understanding of web pages.
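The idea of returning answers rather than links, and of asking the user which "Andromeda" is meant, can be illustrated with a toy knowledge graph. The Python sketch below is only an illustration under stated assumptions: the entity identifiers, the FACTS list and the functions candidates, describe and disambiguate are all hypothetical, and nothing here reflects how Google's Knowledge Graph is actually implemented.

```python
# Hypothetical miniature "knowledge graph" stored as
# (subject, predicate, object) facts. Identifiers are invented.
FACTS = [
    ("andromeda_galaxy", "label", "Andromeda"),
    ("andromeda_galaxy", "type", "galaxy"),
    ("andromeda_myth", "label", "Andromeda"),
    ("andromeda_myth", "type", "figure in Greek myth"),
    ("andromeda_band", "label", "Andromeda"),
    ("andromeda_band", "type", "Swedish metal band"),
    ("frank_lloyd_wright", "label", "Frank Lloyd Wright"),
    ("frank_lloyd_wright", "type", "architect"),
]


def candidates(term, facts):
    """Return every entity whose label matches the search term."""
    return sorted(
        s for s, p, o in facts
        if p == "label" and o.lower() == term.lower()
    )


def describe(entity, facts):
    """Collect all predicate/object pairs known about one entity."""
    return {p: o for s, p, o in facts if s == entity}


def disambiguate(term, facts):
    """Map each candidate entity to its type: the options shown to the user."""
    return {e: describe(e, facts)["type"] for e in candidates(term, facts)}


print(disambiguate("Andromeda", FACTS))
print(describe("frank_lloyd_wright", FACTS))
```

An ambiguous term yields several typed options for the user to choose from, while an unambiguous one can be answered directly from the stored facts.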




Google Translate

Robert Jesuraj K
rajarobertjesuraj@gmail.com

Google Translate is a free statistical multilingual machine-translation service provided by Google Inc. to translate written text from one language into another. Google used a SYSTRAN based translator, which is also used by other translation services such as Babel Fish, AOL, and Yahoo. SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. SYSTRAN has done extensive work for the United States Department of Defense and the European Commission. Commercial versions of SYSTRAN can run on Microsoft Windows (including Windows Mobile), Linux, and Solaris. Historically, SYSTRAN systems used rule-based machine translation (RbMT) technology. With the release of SYSTRAN Server 7 in 2010, SYSTRAN implemented a hybrid rule-based/statistical machine translation (SMT) technology, which was the first of its kind in the marketplace.

The following is a list of the source and target languages with which SYSTRAN works. Many of the pairs are to or from English or French.

Russian into English (1968)
1. English into Russian (1973) for the Apollo-Soyuz project
2. English source (1975) for the European Commission
3. Arabic
4. Chinese
5. Danish
6. Dutch
7. French
8. German
9. Greek
10. Hindi
11. Italian
12. Japanese
13. Korean
14. Norwegian
15. Serbo-Croatian
16. Spanish
17. Swedish
18. Persian
19. Polish
20. Portuguese
21. Ukrainian
22. Urdu

Google Translate Features:

The service limits the number of paragraphs, or range of technical terms, that will be translated. It is also possible to enter searches in a source language that are first translated to a destination language, allowing users to browse and interpret results from the selected destination language in the source language. For some languages, users are asked for alternate translations, such as for technical terms, to be included in future updates to the translation process. Text in a foreign language can be typed, and if "Detect language" is selected, the service will not only detect the language but will also translate it into English by default.

Google Translate, like other automatic translation tools, has its limitations. While it can help the reader to understand the general content of a foreign language text, it does not always deliver accurate translations. Some languages produce better results than others. Google Translate performs well especially when English is the target language and the source language is one of the languages of the European Union. Analyses reported in 2010 showed that French to English translation is relatively accurate, and analyses reported in 2011 and 2012 showed that Italian to English translation is relatively accurate as well. However, rule-based machine translations perform better if the text to be translated is shorter; this effect is particularly evident in Chinese to English translations.

Texts written in the Greek, Devanagari, Cyrillic and Arabic scripts can be transliterated automatically from phonetic equivalents written in the Latin alphabet. A number of Firefox extensions exist for Google services, and likewise for Google Translate, which allow right-click command access to the translation service. An extension for Google's Chrome browser also exists; in February 2010, Google Translate was integrated into the standard Google Chrome browser for automatic webpage translation.

Translate Application for Android

Google Translate is available as a free downloadable application for Android OS users. The first version was launched in January 2010. It works simply like the browser version. Google Translate for Android contains two main options: "SMS translation" and "History". An early 2011 version supported Conversation Mode when translating between English and Spanish (in alpha testing). This new interface within Google Translate allows users to communicate fluidly with a nearby person in another language. In October it was expanded to 14 languages. The application supports 53 languages and voice input for 15 languages. It is available for devices running Android 2.1 and above and can be downloaded by searching for “Google Translate” in Android Market. It was first released in January 2010, with an improved version available on January 12, 2011.

Indian languages (in alpha) and a transliterated input method were first introduced for the following languages:

1. Assamese
2. Bengali
3. Gujarati
4. Kannada
5. Malayalam
6. Marathi
7. Oriya
8. Punjabi
9. Tamil
10. Telugu
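The transliteration feature described in this article, where text typed phonetically in the Latin alphabet is turned into another script, can be sketched with a simple greedy longest-match table lookup. The Python example below is a rough illustration only: the LATIN_TO_CYRILLIC table is a small, incomplete mapping chosen for this example, the function name transliterate is hypothetical, and this is not the scheme Google Translate itself uses.

```python
# Illustrative Latin-to-Cyrillic table; far from complete and not
# Google's actual transliteration scheme.
LATIN_TO_CYRILLIC = {
    "shch": "щ", "zh": "ж", "ch": "ч", "sh": "ш", "ya": "я", "yu": "ю",
    "a": "а", "b": "б", "v": "в", "g": "г", "d": "д", "e": "е", "z": "з",
    "i": "и", "y": "й", "k": "к", "l": "л", "m": "м", "n": "н", "o": "о",
    "p": "п", "r": "р", "s": "с", "t": "т", "u": "у", "f": "ф",
}
MAX_KEY = max(len(key) for key in LATIN_TO_CYRILLIC)


def transliterate(text):
    """Greedy longest-match transliteration; unknown characters pass through."""
    lowered = text.lower()
    out = []
    i = 0
    while i < len(lowered):
        # Try the longest possible chunk first so that digraphs such as
        # "ch" win over the single letters they contain.
        for size in range(MAX_KEY, 0, -1):
            chunk = lowered[i:i + size]
            if chunk in LATIN_TO_CYRILLIC:
                out.append(LATIN_TO_CYRILLIC[chunk])
                i += len(chunk)
                break
        else:
            # No table entry matched: copy the character unchanged.
            out.append(lowered[i])
            i += 1
    return "".join(out)


print(transliterate("moskva"))
print(transliterate("chay"))
```

Trying longer chunks before shorter ones is the main design choice here; a production system would also need context-sensitive rules, which a plain lookup table cannot express.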


Ontology Tools: An overview

Manu Madhavan
mmnamboodiry@gmail.com

The Semantic Web is an attempt to add semantics (meaning) to the traditional web. It is a collaborative movement led by the international standards body, the World Wide Web Consortium (W3C). According to the W3C, "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries." Current web pages tell browsers only how to present the page. The Semantic Web specifies standards for adding more information, so that the contents of web pages can be understood by browsers. One such standard, the Resource Description Framework (RDF), promotes common data formats on the internet.

In order to share common knowledge in the Semantic Web realm, a special knowledge representation standard is defined. An ontology is the standard format for representing the common vocabulary of a domain. In theory, an ontology is a "formal, explicit specification of a shared conceptualisation". An ontology renders a shared vocabulary and taxonomy which models a domain with the definition of objects/concepts, as well as their properties and relations.

The specific languages used to represent ontologies include RDF and OWL (Web Ontology Language). These languages are related to each other: OWL is defined over RDF by adding more constraints.

Ontology editors are applications designed to assist in the creation or manipulation of ontologies. They often express ontologies in one of many ontology languages. This article discusses some of the popular tools that can be used for creating, editing, visualizing and querying ontologies.

Sesame and Jena: APIs for Ontology

The ontology APIs provide usable methods and interfaces that can be used to create ontologies programmatically. The popular APIs for ontology manipulation are Jena and Sesame.

Sesame

Sesame is an open-source framework for querying and analysing RDF data. It was created, and is still being maintained, by the Dutch software company Aduna. It was originally developed as part of "On-To-Knowledge", a semantic web project that ran from 1999 to 2002. It contains a triple store. Sesame supports two query languages: SeRQL and SPARQL. Sesame as an API allows for mapping Java classes onto ontologies and for generating Java source files from ontologies. This makes it possible to use specific ontologies like RSS, FOAF and the Dublin Core directly from Java.
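To give a feel for what a triple store holds and how a query pattern selects data from it, here is a minimal in-memory sketch in Python. It is a teaching aid only and does not use or imitate the Sesame or Jena APIs: the TripleStore class, its add and match methods, and the ex: resources are all invented for this example, while foaf:name and foaf:knows are written as plain strings borrowed from the FOAF vocabulary.

```python
# Minimal in-memory triple store; a sketch, not the Sesame or Jena API.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        """Store one (subject, predicate, object) statement."""
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
# Example data with FOAF-style property names (prefixes kept as plain text).
store.add("ex:alice", "foaf:name", "Alice")
store.add("ex:alice", "foaf:knows", "ex:bob")
store.add("ex:bob", "foaf:name", "Bob")

# Roughly what the SPARQL pattern  ?person foaf:name ?name  would select.
names = [(s, o) for s, _, o in store.match(predicate="foaf:name")]
print(names)
```

A real triple store adds persistence, indexing and a full query language on top of this basic idea of matching patterns with wildcards against stored statements.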


Jena

Jena is an open source Semantic Web framework for Java. It provides an API to extract data from and write to RDF graphs. The graphs are represented as an abstract "model". A model can be sourced with data from files, databases, URLs or a combination of these. A model can be written out as an OWL/RDF file. A model can also be queried through SPARQL and updated through SPARUL. Jena is similar to Sesame; though, unlike Sesame, Jena provides support for OWL (Web Ontology Language). The framework has various internal reasoners, and the Pellet reasoner (an open source Java OWL-DL reasoner) can be set up to work in Jena.

SWOOP

Swoop is an ontology development toolkit that provides an integrated environment to build and edit ontologies, check for errors and inconsistencies (using a reasoner), browse multiple ontologies, and share and reuse existing data by establishing mappings among different ontological entities. The most exciting feature of SWOOP is its web-browser-like look and feel, with hyperlinks to URIs. It uses many reusable ontologies and namespaces. All ontology editing in SWOOP is done in-line with the HTML renderer, using different colour codes and font styles to emphasize ontology changes, e.g. different representations for added axioms vs. deleted axioms vs. inferred axioms. It supports ABox and TBox reasoning on ontologies.

Protégé

Protégé is a free, open source ontology editor and a knowledge acquisition system. Like Swoop, Protégé is a framework for which various other projects suggest plugins. This application is written in Java and heavily uses Swing to create the rather complex user interface. Protégé now has over 200,000 registered users. Protégé is being developed at Stanford University in collaboration with the University of Manchester and is made available under the Mozilla Public License 1.1. Protégé provides a SPARQL query engine to query the ontology. OntoViz, an ontology visualization tool associated with Protégé, helps to visualize ontology graphs.
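The ABox and TBox reasoning mentioned for SWOOP can be illustrated with a very small example: the TBox holds class-level axioms such as subclass relations, the ABox holds facts about individuals, and a reasoner derives new facts by combining the two. The Python sketch below is a simplified illustration under stated assumptions: each class has at most one direct superclass, all class and individual names are invented, and real OWL reasoners such as Pellet handle far richer axioms than this.

```python
# TBox: class-level axioms (here only "X is a subclass of Y").
# ABox: assertions about individuals (here only "x is an instance of X").
# All names are invented for the example.
TBOX = {"Dog": "Mammal", "Mammal": "Animal", "Cat": "Mammal"}
ABOX = {"rex": "Dog", "tom": "Cat"}


def superclasses(cls, tbox):
    """Follow subclass links upward and return all ancestors of a class."""
    found = []
    # The second condition stops the walk if the axioms contain a cycle.
    while cls in tbox and tbox[cls] not in found:
        cls = tbox[cls]
        found.append(cls)
    return found


def inferred_types(individual, abox, tbox):
    """Return the asserted class plus every class inferred via the TBox."""
    asserted = abox[individual]
    return [asserted] + superclasses(asserted, tbox)


def instances_of(cls, abox, tbox):
    """Return individuals that belong to a class, directly or by inference."""
    return sorted(
        name for name in abox
        if cls in inferred_types(name, abox, tbox)
    )


print(inferred_types("rex", ABOX, TBOX))
print(instances_of("Mammal", ABOX, TBOX))
```

Nothing in the ABox says that rex is an Animal; that fact only appears once the subclass axioms are applied, which is the essence of this kind of inference.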


SIMPLE Groups M.Tech Computational Linguistics Dept. of Computer Science and Engineering Govt. Engineering College, Sreekrishnapuram simplequest.in@gmail.com www.simpegroups.in

CLEAR Online Magazine

Article Invitation for CLEAR June 2013

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in June 2013. The suggested areas of discussion are:

Machine Translation Web Information Extraction/Mining Speech Recognition/Understanding Information Retrieval and Extraction Speech Analysis/Synthesis Quantitative Linguistics Computational Models Language Learning Computational Linguistics Linguistics Modelling Techniques Computational Theories Multilingual/Cross-lingual Language Processing Natural Language Processing Corpus Linguistics Spoken Dialog Systems Software Engineering for NLP Formal Linguistics-Theoretic and Grammar Induction Language and Social Networks Computational Semantics Applying NLP on Domain Specific Data Automatic Text Summarization Lexical Semantics, Word Sense Disambiguation Sentiment Analysis and Opinion Mining Human-Computer Dialogue Systems Models of Cognitive Processes Discourse and Anaphora Ques/Ans and Dialogue Systems Morph Analyzer Textual Entailment Multimodal Representations and Processing Natural Language Interface for Database Deep Learning in NLP

The articles may be sent to the Editor on or before 10th June, 2013 through the email simplequest.in@gmail.com. For more details visit: http://simplegec.blogspot.in

Editor CLEAR Magazine



Hello World,

Here I’m sharing a true motivational story for all NLP aspirants. It’s about Mr. Sathyaseelan, who conquered his blindness with the torch of technology. Mr. Sathyaseelan is an office bearer of an association for the blind and a faculty member at a school for the blind at Kasargod. We (with Robert and Mujeeb) met him at CUSAT, during his demonstration of software tools for visually challenged people, as part of NCILC13. There, with the help of his tools, he read e-papers and books, sent e-mails and even installed a complete system using speech tools. He not only uses these tools for his own needs, but also teaches and motivates his fellow people to use them. The system was developed by integrating many open source tools, with the help of his 17-year-old son, Nalin. Now he is inviting the ILC community and NLP aspirants to improve the system.

Mr. Sathyaseelan’s success is a real example of how NLP can touch society. Fast-growing technology should not be out of reach for such challenged people. They should also get the chance to experience the craziness of technology. Socially motivated technocrats should come forward to make this come true. During the discussion, he invited our volunteer contributions to improve Indian language computing resources. When he shared his vision of ILC, open source technology and the current issues in ILC, he became more talkative and energetic. There we saw a light in his blind eyes: a light of hope!

Dear readers, the tech-pads are open! Wish you all the best.

Manu Madhavan.
