
CLEAR March 2014


CLEAR September 2013
Volume-3 Issue-1
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj
Professor and Head
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Reshma O K
Sreejith C
Gopalakrishnan K
Neethu Jhonson

Cover Page and Layout
Sreejith C

Editorial
SIMPLE News & Updates
Indian Language Computing: From Dream to Reality, Mr. Manu Madhavan
The Sphota Theory of Bhartrhari, Ms. Kavitha Raju, Mr. Manu V. Nair
Natural Language Processing: A Paninian Perspective, Ms. Anagha Manoharan, Ms. Reji Rahmath, Ms. Rekha Raj C.T.
Anaphora Resolution - An Overview, Ms. Athira S, Ms. Lekshmi T.S.
Indian Language Computing Platforms, Mr. Sreejith C
CLEAR June 2014 Invitation
Last word



Greetings!

India has a well-acclaimed history of excellence in the mathematical and philosophical domains of knowledge. Scientific theories related to syntax, semantics and other aspects of language flourished in ancient India. Long before ASCII and EBCDIC were invented, Indians used the 'katapayaadi' notation to denote numbers in scientific literature and even in poetry. Messages and text were compressed into very short codes, which were used to specify scientific principles and rules of grammar. Shiksha and Niruktham were two important branches of study in the Indian system of education.

This edition of CLEAR is a humble tribute to those great philosophers and scientists who gave a sound logical framework to the science of language. The articles deal with the Sphota theory and the Paninian grammatical framework, among others. It is therefore aptly tagged as a Special Edition on Indian Language Computing.

Please send in your valuable feedback on the content so that we keep improving in the times to come.

Warm Regards,
Dr. P. C. Reghu Raj
(Chief Editor)



NEWS & UPDATES

Workshop on Soft Computing

A three-day workshop on soft computing was held at GEC Sreekrishnapuram from 9 to 11 January 2014. Eminent personalities from IIST, IIT Delhi and M.S.U Baroda gave sessions on genetic algorithms, neural networks, fuzzy logic and data mining. The workshop was inaugurated by Dr. P. C. Reghu Raj, Principal of GEC Sreekrishnapuram. The first session, on fuzzy logic, was given by Dr. Deepak Mishra, IIST Trivandrum, and was followed by another wonderful session on the Google page ranking algorithm by Dr. Sumitra S. Nair, IIST Trivandrum. The first day came to an end with a session on artificial neural networks by Prof. Raju K. George, IIST Trivandrum, followed by a lab session.

The second day began with a session on genetic algorithms by Prof. V. D. Pathak, M.S.U Baroda, followed by a session on 'Theory of Kernel Methods' by Dr. Sumitra S. Nair, IIST Trivandrum. The day ended with a lab session on ANNs. The third day focused mainly on data mining and its applications, in an interesting session by Prof. B. Chandra, IIT Delhi. The workshop came to an end with the distribution of certificates to all the participants by Prof. B. Chandra and Prof. V. D. Pathak.

Publications

- Sreejith C, Nibeesh K, and P. C. Reghu Raj, "CHODYOTHARI: Question Answering System for Malayalam", in proceedings of 4th NATCON, Kerala, 2014.
- Nibeesh K, Sreejith C, P. C. Reghu Raj, "Text Classification for Effective News Filtering Using Support Vector Machine", in proceedings of 4th NATCON, Kerala, 2014.
- Christopher Augustine, P. C. Reghu Raj, "Agglomerative Sentence Clustering Approach for Discourse Segmentation", International Journal of Engineering Research & Technology (IJERT), Vol. 2 Issue 12, December 2013.

SIMPLE Groups congratulates all of them on their achievements!



Indian Language Computing: From Dream to Reality

Manu Madhavan
Asst. Professor, SIMAT

Nowadays, language computing and related technologies have made some notable contributions to research. But turning them into applications for the common man is still a dream. Of course, there are many language enthusiasts who are bridging the gap between applications and research in language-oriented computation. In the Indian scenario especially, this field is still in its childhood. Indian engineers and scientists are a dominant force in the IT world, but have also faced criticism for being grossly negligent of the needs of the common man from their own region. "This has pushed India to the top of the list of countries suffering from the Digital Divide," argue campaigners like Venkataramanan. This article reviews, in a positive light, the scope, challenges and applications of Indian language computing.

The first point to be checked is the reach of Indian language computing. India is a multilingual country with as many as 22 scheduled languages. Computer technology breaks the language barrier and bridges the gap between the various sections of society by giving people easier access to information in their own languages. Language computing thus becomes central to the exchange of information across speakers of various languages.

However innovative a product is, its reach depends on how it inspires the common man. The main barrier between technology and the common man is language. All the main information from the technology side is English-centric. Great innovators may sometimes be biased by the wrong assumption that most of their requirements can be met by an English-only interface. The real world is multilingual, and the majority of the world does not know English well. To quote Sri Santhosh Thottingal, popularly known as Malayalam's Wiki Warrior: "Nowadays, if you are writing a program that can work only with English, it rarely meets the requirements. The assumption that the input will be only English, or that we will have to process only English, is wrong. So multi-language computing is becoming part of software engineering, not a different field."

Researchers working on Indian language computing soon realized that the tools present in the global market cannot be replicated in India, owing to the complexity of the multiple languages that exist in the country. (India has not only 22 major languages and as many as 1,652 dialects, but also 11 scripts to represent these languages.) Swarn Lata, head of the TDIL program and director of the Human Centred Computing group in the DIT, explains that "in Indian languages one-to-one mapping or translation of each word as it is to form a sentence is not workable. The methodology to be followed here is to first process the source language, convert words according to the target language, and then process it all again with respect to the target language for the conversion to make sense."

Apart from the typical nature of Indian languages, cultures also affect our language usage and pronunciation. For example, in the northern parts of India, Hindi is spoken in varied forms across different states and cities. Thus we cannot have a generic tool, especially for translation, and all tools have to be developed for all of the languages, adds Shukla. In a C-DAC report he mentions that although all Indian languages have emerged out of Sanskrit, the core ancient language, and almost all of them follow Paninian grammar, that itself is a problem, as different languages depend on Sanskrit and Panini in different ways. Therefore the accuracy of any of these systems is not 100 percent.

Most language computing work in India concentrates on translation. A direct translation between Indian languages is not simple, but can be made easier by considering their etymology and historical relations with Sanskrit. Other areas like OCR, speech, sentiment analysis and conceptualization are not yet fully developed and offer large scope.

Even though difficulties exist, the enthusiasm shown by young techies towards language computing promises a positive future. The work done by active open source groups like Swathanthra Malayalam Computing (SMC) and by institutions like IIITH, IIITM-K and C-DAC is highly appreciable and is a good guideline for newcomers.


The Sphota Theory of Bhartrhari

Kavitha Raju
Dept. of Computer Science and Engineering
Govt. Engg. College Sreekrishnapuram
Palakkad, India - 678 633
kavitharaju18@gmail.com

Manu V Nair
Dept. of Computer Science and Engineering
Govt. Engg. College Sreekrishnapuram
Palakkad, India - 678 633
manunair1990@gmail.com

Sphota theory is one of the major contributions of Indian philosophers (mainly Bhartrhari) to linguistics. It deals mainly with the semantic aspects of language. Sphota theory originates with Bhartrhari, but the term is already used in the literature of Vyasa. In Sanskrit, sphota is etymologically derived from the root 'sphut', which means 'to burst'. Sounds have spatial and temporal relations; they are produced differently by different speakers. But the word as meaning bearer has to be regarded as having no size or temporal dimension. It is indivisible and eternal. Distinguished from the sphota are the abstract sound pattern (prakrta dhvani) and the utterances (vaikrta dhvani). Bhartrhari held that the sentence is not a collection of words or an ordered series of them. A word is rather an abstraction from a sentence; thus, the sentence-sphota is the primary unit of meaning. A word is also grasped as a unity by an instantaneous flash of insight (pratibha).

I. Introduction

Sphota is an important concept in the Indian grammatical tradition of Vyakarana, relating to the problem of speech production: how the mind orders linguistic units into coherent discourse and meaning. It is a milestone in Indian linguistics. Across cultures, the early history of linguistics is associated with a need to disambiguate discourse, especially for ritual texts or in arguments. Linguistics in ancient India derives its impetus from the need to correctly recite and interpret the Vedic texts. Already in the oldest Indian text, the Rigveda, vak ("speech") is deified. By 1200 BCE, the oral performance of these texts becomes standardized, and treatises on ritual recitation suggest splitting up the Sanskrit compounds into words, stems, and phonetic units, providing an impetus for morphology and phonetics.


In India, Hindu philosophy is traditionally divided into six astika schools of thought, or darsanam, which accept the Vedas as supreme revealed scriptures. These darsanas deal with almost everything that humans seek, and thus include linguistics as well. The six systems of Indian philosophy are: Nyaya, Vaisheshika, Sankhya, Yoga, Purva Mimamsa and Uttara Mimamsa (Vedanta).

Each of these systems differs in one way or the other in terms of its concepts, phenomena, laws and dogmas. Each system has its own founder as well. It is important to know that the founders of these systems of philosophy are sages of the highest order who devoted their lives to the study and propagation of philosophy. Each system of Indian philosophy is called a Darshana. Thus the Sanskrit word Shad-Darshana refers to the six systems of philosophy. Of these six schools of thought, Nyaya, Vaisheshika and Purva Mimamsa are almost directly related to linguistics. Naiyayika is the name given to followers of the Nyaya school. Mimamsaka is the name given to followers of Jaimini.

There are many philosophers and grammarians in India who contributed many ideas to the world of linguistics. Some of them are Vyasa, Yaska, Bhartrhari, Katyayana, Kumarila Bhatta, Mandana Misra, Panini, Patanjali, Pingala, Prabhakara, Sakatayana and Vacaspati Misra.

II. The meaning of ‘MEANING’

The most common Sanskrit term for meaning is artha. In the Western literature on the notion of meaning in the Indian tradition, various terms, such as 'sense', 'reference', 'denotation', 'connotation', 'designatum' and 'intension', have been used to render the Sanskrit. Artha basically refers to the object signified by a word. In numerous contexts, it stands for an object in the sense of an element of external reality.

According to Indian thought there are two main approaches to the study of the problem of meaning.

1) Khandapaksa
2) Akhandapaksa

According to khandapaksa, the word is considered an autonomous unit of thought and sense, and the sentence is taken as a concatenation of words. In khandapaksa, the conditions for a syntactic relation between words in a sentence are mutual expectancy (akanksa), consistency (yogyata) and proximity (samnidhi). This approach also recognizes the importance of contextual factors and the intention of the speaker in determining the meaning of words.

According to akhandapaksa, the fundamental linguistic fact is the sentence. Bhartrhari, who accepts this view, defines the sentence as 'a single integral symbol' (eko navayavah sabdah). The sentence is revealed by individual letters and words. The meaning conveyed by the sentence is an 'instantaneous flash of insight or intuition' (pratibha). Words are only hints that help the listener to arrive at the meaning.

According to the theory of vyanjana by Anandavardhana, artha or meaning covers not only cognitive and logical meaning but also emotive elements and the socio-cultural significance of utterances, which are suggested with the help of contextual factors.

III. The History of Sphota

The theory of Sphota is one of the ancient Indian linguistic concepts. It was mainly formulated by the philosopher Bhartrhari of the 5th century CE. The word Sphota means 'to burst' in Sanskrit.

According to this theory, a sentence should not be considered as being made up of words, and words in turn of letters. They are single indivisible meaning bearing units. The word or sentence, thus considered as a single meaning bearing unit, is called the 'sphota'. The articulated sounds used in linguistic discourse are merely the means by which the symbol is revealed; it is the symbol which is the meaning bearer.

Though this theory was fully developed and systematized by Bhartrhari in his work Vakyapadiya, some of the ideas underlying it can be found in earlier grammatical works. There are also works that mention the term Sphota in different senses.

It is believed that the ancient grammarian Vyadi, who was earlier than Patanjali and Katyayana, might have started the discussion on Sphota theory in his lost work, the Samgraha. The school of thought started by Audumbarayana may be considered the forerunner of the Sphota theory of Bhartrhari. According to Yaska, Audumbarayana held the view that only the sentence is found in the minds of the listener and the speaker. Panini seems not to know of this theory, but he mentions the sage Sphotayana as an authority. Katyayana, who wrote the commentary on Panini's sutras, drew the distinction between permanent varnas and vrtti, but he never used the term Sphota.

The Mimamsakas say that varnas are permanent and that it is in varnas that the meaning is carried. Patanjali is not sure whether the meaning is contained in varnas or in words. He is of the opinion that the articulated sounds are of course ephemeral and do not exist together, but that intelligent humans somehow grasp them together in the mind, and this reveals the Sphota. The Naiyayikas say varnas do not have meaning and are ephemeral. According to them, the last sound heard, aided by the memory of the preceding sounds, presents the meaning of the word to the mind. To explain this phenomenon they use the term samskara: the traces left on the mind by experience that can produce a recollection of what was experienced. The concept of samskara is similar to the concept of engrams in modern psychology.

In the Vakyapadiya, Bhartrhari lists three different concepts of sphota that prevailed among different schools. The first is that the Sphota happens first and leads to the generation of dhvani. This dhvani has the ability to propagate by the generation of wavelets. The second theory is that both the sphota and the dhvani are created at the same time. This is explained as analogous to a flame and its light. There is also a third philosophy that defines the Sphota as a class and the dhvanis as its members.

The modern linguist de Saussure says that there is a signifiant (signifier) and a signifié (signified). His concepts match the dhvani and Sphota concepts of Bhartrhari. Present-day psychologists accept the existence of an expression plane and a meaning plane in the mind. This also supports the Sphota theory.

IV. The Concept of Sphota; As Stated by Bhartrhari

According to Bhartrhari, speech and thought are only two aspects of the same speech principle. A sentence is to be considered as a single undivided utterance and its meaning as an instantaneous flash of insight (pratibha). The sentence is the fundamental linguistic fact and words are only its unreal abstractions. The sentence meaning is to be grasped as a unity. The division into words and word-meanings is useful only in the study of language.

Bhartrhari begins his discussion of sphota theory with the observation that words or sentences can be considered under two aspects: as sound patterns or as meaning bearing symbols. The sound pattern, which is the external facet of the language symbol, is called the prakrta dhvani. It is the abstract sound pattern, with the time sequence still attached to it. The second, meaning-bearing aspect of the language symbol is its semantic facet. This artha, which is a partless integral linguistic symbol, is called the Sphota.

The external aspect is what is relevant in a grammatical context. The semantic aspect is of no relevance there. In the cognition of speech, on the other hand, it is only the semantic aspect that is perceived, not the sound pattern. When we say we add an affix, say 'van', to a word 'maram', we refer to the external facet of the word, not to the actual thing symbolized by it. Thus a word has a double power: one to indicate itself and the other to indicate the thing symbolized by it. It is like the power of a fire to reveal itself and at the same time reveal other things.

Bhartrhari's analysis envisages three aspects of the language situation:

1) The vaikrta dhvani, the actual sound pattern pronounced and heard by a person during speech. It includes all the various differences in intonation, tempo, pitch etc. depending on the individual speakers.
2) The prakrta dhvani, the normal phonological pattern or the expression in the mind. It is indicated by the vaikrta dhvani, but all the non-linguistic personal variations are eliminated at this stage. Both the speaker and the listener are conscious of the normal phonological pattern alone. The time sequence is still present in this.
3) The Sphota, the integral linguistic symbol, which is the unit of meaning. It cannot be pronounced or written. It is revealed by the prakrta dhvani. It is indivisible and timeless. There is no sphota without meaning. It is the meaning-bearing nature of an expression that makes it a sphota.

V. How Sphota is Comprehended

The Sphota (the word or the sentence located in the minds of the speaker and listener, and taken as an integral symbol) is revealed by the sounds produced in a fixed order. The sounds are only the manifesting agencies and have no function other than that of revealing the symbol. Each sound helps in manifesting this sphota, the first one vaguely and the next one more clearly and so on, until the last one, aided by the impressions of the preceding perceptions, reveals it clearly and distinctly.

This sphota is one and indivisible; the sounds uttered to reveal this sphota cannot be considered as parts of the essential word or sphota, but only as diacritical marks that reveal the identity of the whole word. The process of revelation of the word by the sound starts from the indeterminate stage; it begins from complete ignorance, passes through partial knowledge and ends in complete knowledge.

The process of comprehension of sphota is illustrated by grammarians by means of various analogies. They explain it as similar to the way a jeweler examines a diamond. Bhartrhari gives the example of a student learning a verse by heart through repeated reading. It is the last reading, aided by the impressions left behind by the previous readings, that helps him to know the verse fully.

The sphota is the object of cognition. It is revealed with the help of the parts (letters). It is not the existence of the cognition of parts that is denied, for we do undoubtedly cognize individual letters; it is their significance that is in question. The whole taken as an integral symbol is something different from the parts that constitute it, and the parts may be considered as irrelevant and illusory. Their function is only to differentiate the word from other words. They do not carry any meaning.

The cognition that takes place by means of letters is a series of errors finally leading to the truth. Even invalid cognitions can sometimes lead to valid knowledge. Bhartrhari explains this by means of some examples. When we see a tree from a distance we may mistake it for an elephant. But as we get closer to it we get a clearer view, which leads to a valid cognition. Similar is the case of mistaking a rope for a snake. When we hear each syllable of the word kamalam, it is just as if we are getting closer to the tree and getting a clearer view. By the perception of the last letter, we arrive at a valid cognition.

VI. Classifications of Sphota

A general overview of the concepts and different approaches to Sphota must be given here in order to show the richness and the precision of the topics discussed among ancient and medieval grammarians in India. There are eight major approaches to the theory of Sphota.


1) Varnasphota
A single phoneme, stem or affix is treated as denotative (vacaka); when it is found to be so, varna sphota is said to take place. This theory uses analysis from 'bottom to top', which is mainly found in grammatical treatises such as Panini's descriptive grammar. Varna sphota has its difficulties in immediate application to the analysis of the word, especially when synthetic forms of the word are examined, such as ghatena, 'with the pot', for it cannot clearly divide them into separate and meaningful units.

2) Padasphota
It maintains that the finished word as a unique entity conveys the meaning, and that the division into morphological components such as suffixes and stems does not occur when the speaker or the listener understands the speech. This theory claims that the text can be understood by listening to the words and their meanings, as well as by perceiving the relation between them in the syntactic structure of the sentence. It is by listening to the meaning of every word and linking it with another word that the meaning of the sentence can be understood. But since the meaning of the sentence is the final meaning which is to be understood, the pada sphota theory is found insufficient for describing the perception of meaning and leads to the next level of synthesis: vakyasphota.

3) Vakyasphota
It maintains that the sentence is a unique entity which conveys the meaning. The sentence in itself is a unit of meaning. Vakya sphota, however, does not claim that the constituents of the sentence do not have meaning. The main point of this theory is that the word should always be seen and understood in a context. Words have their meaning only when they form part of a sentence.

4) Akhandapadasphota
It maintains that the word is perceived as an undivided single meaning bearing unit. It is not perceived through its parts (suffixes, stems etc.), but as a single and undivided meaningful entity.

5) Akhandavakyasphota
It says that it is insufficient to perceive the separate word, for in ordinary communication the sentence as a whole is perceived as meaningful and not a separate word. Bhartrhari thinks that such a division of the sentence into words, stems etc. does not exist in the ordinary perception of speech. In the common use of speech the meaning is taken as a whole, including the context. It is only once the utterance is made that the speaker can dwell on it and analyse it in parts as words, stems etc., but not while he is speaking. And if he attends to the parts of speech, such as syllables, he will lose the meaning of it all. According to this theory, the varna and pada sphota describe language in its functions, but not in its use.

6) Vyaktisphota and Jatisphota
To answer the question whether Sphota is particular or universal there are two theories, the Vyaktisphotavada and the Jatisphotavada. The Jatisphotavada maintains that the non-difference among the varied individual elements is generic, while the Vyaktisphotavada says that the difference is associative. For the Jatisphotavada the meaning bearing word is the class (for instance gotva, cowness), which is revealed by the individual instances (vyaktis). The individuals are not meaning bearers.

One more distinction is important to mention here, which gave rise to two different approaches to the understanding of Sphota: the anvitabhidhanavada and abhihitanvayavada theories. The abhihitanvayavada theory maintains that words and grammatical units have their own meanings and, by joining together through their syntactic relations, build up the meaning of the sentence. The anvitabhidhanavada theory, on the contrary, affirms that the meaning of a word can be understood only in the context of the sentence.

All these theories of Sphota, with many other variations and commentaries, provide a rich layout for linguistic studies of meaning in terms of structural semantics, and together represent a holistic view in defining all possible approaches to meaning within the grammatical structures (morphology and syntax).

VII. Back to Bhartrhari's Philosophy

Bhartrhari gave the theory of sphota a deeper dimension than the linguistic aspect on which we analyse it. For him it is the sabda brahman, the ultimate reality. The entire universe is the manifestation of this speech-principle.


The whole world as it is has a Meaning which can be grasped only as an indivisible unity. This meaning is inherent in the consciousness of man from his very birth; he later finds its partial correspondence in his language and reproduces it through articulation, and that is Sphota. On the semantic level, as it was developed by later grammarians, Sphota makes the text correspond with a universal Text-Totality, sabda-brahman, and therefore the text can be easily understood as such. And once the inner perception (pratibha) of the hearer flashes out, reflecting something from that totality, the Sphota, the revelation of the meaning of the text, takes place in his consciousness.

So, the Sphota can be seen as a communication device based on recognition of the truth of existence (satta) through a word or text in the hearer-speaker. It is therefore of a psychological nature, as any human speech is, for the meaning of the text is perceived by a consciousness which lies beyond the analytic capacity of the external mind and carries in itself all meanings; as such, its understanding requires a proper psychological experience.

According to Bhartrhari the speech principle has three stages in its manifestation, namely pasyanti, madhyama and vaikhari. They are respectively the sphota, prakrta dhvani and vaikrta dhvani of the grammarians. A fourth stage, called para, has been identified by another school of thought, but according to Bhartrhari pasyanti is the supreme Reality, Sabdabrahman.

VIII. Conclusion

The theory of sphota is one that has been actively discussed and debated from ancient times to the present day. Some philosophers and grammarians say that the theory is the most complete investigation into the profundities of language, making a considerable contribution to the Philosophy of Language, the Psychology of Speech, and especially Semiotics. But the metaphysical superstructure built on the basis of this theory has kept many modern linguists from fully appreciating the importance of the sphota doctrine in language-symbolism. Bhartrhari developed a monistic doctrine of philosophy in which this mystical Speech-essence is the first principle of the universe; that is, the entire universe is the manifestation of this Speech-essence or Sabda-brahman.

Setting aside this metaphysical aspect of the theory, we can concentrate on the linguistic facet. It gives us a new insight into the semantic level of language. It says that meaning is not something that is inferred, but is actually perceived at the sentence level or expression level. The meaning also corresponds to a universal text-totality. This text-totality is something that is inherent in the consciousness of man from his very birth.

This text-totality refers to a semantic representation that can represent anything and everything that is cognizable to the human mind. It can be mapped to an expression in any language (prakrta dhvani) once we know how it can be done for one language. This would be the ultimate solution for all worries in language computation. Language generation, translation, understanding: everything would be solved.

References:
1. K. Kunjunni Raja, "Indian Theories of Meaning", Adyar Library and Research Centre (1963).
2. K. Raghavan Pillai, "Studies in the Vakyapadiya, Vol. I, Critical Text of Cantos I and II", Motilal Banarsidass, First Edition (1971).
3. Ravi Sheorey, "Bhartrhari's Sphota Theory: An Exploration in Semantics", The Emporia State Research Studies, The Graduate Publication of the Emporia Kansas State College.
4. Harold George Coward, "A Philosophical and Psychological Analysis of the Sphota Theory of Language as Revelation", Open Access Dissertations and Theses, Paper 2943 (1973).


Natural Language Processing: A Paninian Perspective

Anagha Manoharan
M.Tech Computational Linguistics
G.E.C Sreekrishnapuram
anaghamanoharan3@gmail.com

Reji Rahmath K
M.Tech Computational Linguistics
G.E.C Sreekrishnapuram
rrahmathrejik@gmail.com

Rekha Raj C.T
M.Tech Computational Linguistics
G.E.C Sreekrishnapuram
rekahrajct@gmail.com

A majority of human languages, including Indian and other languages, have relatively free word order. Most existing computational grammars are based on context free grammars, which are basically positional grammars. It is therefore important to develop a suitable computational grammar formalism for free word order languages. The Paninian framework is such a formalism and has been successfully applied to Indian languages. Paninian grammar uses the notion of karaka relations between verbs and nouns in a sentence. This paper describes the Paninian framework as applied to the processing of modern Indian languages and the development of a parser based on Paninian grammar. The paper also describes a machine translation system called Anusaraka, developed on the basis of Paninian grammar.
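As an illustration of the contrast drawn above between positional grammars and karaka relations, here is a minimal Python sketch. It is not the parser described in this paper: the marker table, the function name karaka_roles and the transliterated Hindi example are assumptions made only for illustration, and real karaka assignment is far richer (the marker 'ko', for instance, can signal more than one relation). The sketch only shows that when roles are read from the postposition following each noun, reordering the noun groups leaves the analysis unchanged.

```python
# Toy illustration only: this marker table is a simplified assumption,
# not the rule set of any actual Paninian parser.
MARKER_TO_KARAKA = {
    "ne": "karta",   # agent of the action
    "ko": "karma",   # object of the action
}


def karaka_roles(sentence):
    """Map each noun to a karaka role using the postposition after it.

    Assumes a whitespace-tokenized, verb-final sentence in which every
    noun is immediately followed by one of the markers above.
    """
    tokens = sentence.split()
    verb = tokens[-1]
    roles = {}
    # Nouns sit at even positions, their markers at the odd positions.
    for noun, marker in zip(tokens[:-1:2], tokens[1:-1:2]):
        roles[MARKER_TO_KARAKA[marker]] = noun
    return verb, roles


# "Ram saw Mohan", written with two different word orders.
first = karaka_roles("raam ne mohan ko dekhaa")
second = karaka_roles("mohan ko raam ne dekhaa")
print(first)
print(first == second)
```

A purely positional rule such as "the first noun is the subject" would give different analyses for the two orders, whereas the marker-based reading gives the same one.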

I. Introduction

1.1 Panini and Ashtadhyayi

Panini [4] was a Sanskrit grammarian who lived during the fifth or sixth century B.C. Panini's grammar, the Ashtadhyayi (The Eight Chapters), deals with the Sanskrit language; however, it presents the framework for a universal grammar that may (and probably does) apply to any language. His book consists of just under 4000 rules and aphorisms. In this work Panini distinguishes between the language of sacred texts and the usual language of communication. Panini gives formal production rules and definitions to describe Sanskrit grammar. He gave a comprehensive and scientific theory of phonetics, phonology, and morphology. Starting with about 1700 basic elements like nouns, verbs, vowels and consonants, he put them into classes. The construction of sentences, compound nouns etc. is explained as ordered rules operating on underlying structures in a manner similar to modern theory. In many ways Panini's constructions are similar to the way that a mathematical function is defined today.


1.2 Paninian Grammar and Free Word Order Languages

A majority of human languages, including Indian and other languages, have relatively free word order. In free word order languages, the order of words carries only secondary information such as emphasis. Primary information relating to meaning is contained elsewhere. Most existing computational grammars are based on context-free grammars, which are basically positional grammars. It is important to develop a suitable computational grammar formalism for free word order languages for two reasons. First, a suitably designed formalism is likely to be more efficient. Second, such a formalism is also likely to be linguistically more elegant and satisfying. The Paninian framework is such a framework and has been successfully applied to Indian languages.

Paninian grammar uses the notion of karaka relations between verbs and nouns in a sentence. The notion of karaka relations is central to the Paninian model. The karaka relations are syntactico-semantic (or semantico-syntactic) relations between the verbals and other related constituents in a sentence.

This paper describes the Paninian framework as applied to the processing of modern Indian languages and the development of a parser using Paninian grammar. The paper also describes a machine translation system called Anusaraka, developed using Paninian grammar.

II. Literature Survey

Bharati Akshar, Vineet Chaitanya and Rajeev Sangal [1] present a Paninian perspective on natural language processing. The unique aspect of the computational grammar described is that it is designed for free word order languages and makes special use of vibhakti. It takes the concepts of vibhakti and karaka relations from the Paninian framework, and uses them to give an elegant account of Indian languages. The notions of karaka charts, karaka assignment, etc. are discussed. It is argued and shown that a grammar based on these notions successfully handles karaka relations, control, active-passives etc. A parser developed on the basis of the Paninian framework turns out to be elegant and efficient. Anusaraka is a Machine Translation system developed on the basis of the Paninian framework. It has been shown that it is possible to overcome the language barriers in India using Anusaraka. The Paninian grammar framework is compared with existing modern western computational grammar frameworks such as Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG), and Government and Binding (GB).

Rick Briggs [3] compares a typical knowledge representation scheme using Semantic Nets with the method based on karaka relations used by the ancient Indian grammarians to analyze sentences unambiguously. Finally, the clear parallelism between the two is demonstrated.

Subhash C. Kak [4] in his article reviews the Paninian approach to natural language processing (NLP) and compares it with the current computer-based understanding systems. The current knowledge representation systems of AI agree with the requirements of the Paninian approach and therefore, it is argued that Paninian-style generative rules and meta-rules could assist in further advances in NLP.

III. Paninian Grammar

The goal of the Paninian approach is to construct a theory of human natural language communication that answers questions like how natural language is used to convey information between speakers (or writers) and hearers (or readers). The main problem that the Paninian approach addresses is how to extract karaka relations from a sentence. As it is inspired by Sanskrit, it emphasizes the roles of case endings and markers such as post-positions (or pre-positions). Position or word order is brought into consideration only when necessary.

3.1 Paninian Theory

Paninian grammar is particularly suited to free word order languages. It makes use of vibhakti information for mapping to semantic relations, and uses position information only secondarily. As the Indian languages have relatively free word order, they are eminently suited to be described by Paninian grammar.

3.1.1 Karaka Relations

A karaka [2] is a semantic relation between the verb and a noun. Panini isolated six such relations, namely

1. kartr - Agent
2. karman - Patient
3. karana - Instrument
4. sampradana - Target
5. apadana - Donor
6. adhikarana - Place

Consider the sentence


In the kitchen, Rama cooks rice for Sita with firewood from the forest.

The verb in the sentence denotes an event, viz. a cooking, and the denotation of each of the nouns stands in a semantic relation with that event. Thus, the event has an Agent - Rama, a Patient - rice, an Instrument - firewood, a Donor - the forest, a Target - Sita, and a Location - the kitchen.

In the Paninian framework, a mapping is specified between karaka relations and vibhakti. This mapping depends on the verb and its tense aspect modality (TAM) label. The mapping is represented by two structures: default karaka charts and karaka chart transformations. The default karaka chart for a verb or a class of verbs gives the mapping for the TAM label `tA hE', called basic. It specifies the vibhakti permitted for the applicable karaka relations for a verb when the verb has the basic TAM label. This basic TAM label roughly corresponds to present indefinite tense and is purely syntactic in nature. For other TAM labels there are karaka chart transformation rules. Thus, for a given verb with some TAM label, the appropriate karaka chart can be obtained using its basic karaka chart and the transformation rule depending on its TAM label.

IV. Paninian Parser

This section discusses how a parser can be built using the Paninian framework. It turns out that the Paninian theory is extremely suitable from the computational viewpoint. The theory can be used in a natural manner for structuring a parser which is extremely efficient. Figure 1 shows the structure of the Paninian Parser.

Figure 1: Structure of the parser

4.1 Morphological Analyzer

The morphological analyzer takes as its input a sentence, that is, a sequence of words. For each of its words, it looks up a lexicon and retrieves information such as the root of the word, its lexical category, gender, number, person, tense, etc. In case a word has multiple meanings, grammatical information is returned for each of the meanings.

4.2 Local Word Grouper (LWG)

The function of this block is to form the word groups on the basis of local information. These are the word groups at the vibhakti level (i.e., typically each word group is a noun or verb with its vibhakti, TAM label, etc.). These involve grouping post-positional markers with nouns, auxiliaries with main verbs etc. Rules for local word grouping are given by finite state machines.

4.3 Core Parser

The function of the core parser is to accept the local word groups produced by LWG, and produce the parse structure. Figure 2 shows the parse structure for the sentence

laDake ne paanii piyaa
boy ergative water drink

Figure 2: A Parse Structure

Given the local word groups in a sentence, the task of the core parser is two-fold:

1. To identify karaka relations among the word groups,
2. To identify senses of words.

The first task requires knowledge of karaka-vibhakti mapping, optionality of karakas, and transformation rules. A data structure called the karaka chart stores this information. The second task requires lakshan charts for nouns and verbs.

V. Machine Translation

A story written in a language (say Malayalam) is fed into a computer system and out comes its translation in other languages. It is inexpensive, immediate and simultaneous. The language barriers melt away. The richness of other literatures opens up to everyone. Texts in any language are accessible to the common man without prior knowledge of that language. These are the goals of Machine Translation (MT).

5.1 Anusaraka or Language Accessor


Anusaraka is a Machine Translation system based on Paninian grammar. It is possible to overcome the language barrier in India today using Anusaraka. Anusaraka tries to take advantage of the relative strengths of the computer and the human reader, where the computer takes the language load and leaves the world knowledge load on the reader [5]. It is particularly effective when the languages are close, as is the case with Indian languages. It bridges the gap between languages by choosing the most appropriate or nearest construction available in the target language together with suitable additional notation.

There are only three major differences between the south Indian languages and Hindi. All these can be bridged by simple additional notation in Hindi. The resulting language can be viewed as a southern dialect of Hindi. Anusaraka uses this dialect to make source text in southern languages accessible to Hindi readers.

5.1.1 Structure of Anusaraka System

The structure of Anusaraka is shown in Figure 3. A source language sentence is first processed by the morphological analyzer (morph). The morph considers one word at a time, and for each word it checks whether the word is in the dictionary. If found, it returns its grammatical features. It also uses the word paradigms to see whether the input word can be derived from a root and its paradigm. If the derivation is possible, it returns the grammatical features associated with the word form (obtained from the root and the paradigm). In case the input word cannot be derived, it is possibly a compound word and is given to the sandhi package to split it into two or more words, which are then again analyzed by morph.

The output of morph is given as input to the local word grouper. Its main task is to group function words with the content words based on local information such as post-position markers that follow a noun, or auxiliary verbs following a main verb. This grouping of markers (or case endings in the case of inflectional languages) identifies the vibhakti of nouns and verbs. The vibhakti of verbs is also called the TAM (tense-aspect-modality) label.

The next stage of processing is that of the mapping block. This stage uses a noun vibhakti dictionary, a TAM dictionary, and a bilingual dictionary. For each word group, the system finds a suitable root and vibhakti in the target language. Thus, it generates a local word group in the target language.
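The analysis and mapping stages described above (morph, local word grouper, mapping block) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny dictionary, the root and paradigm tables, the notation strings "ERG", "0" and "PAST", and the function names are hypothetical and are not taken from the Anusaraka system. The stems are simplified, the sandhi package is not modelled, and the article's own Hindi example with its English glosses stands in for a real source and target language pair.

```python
from dataclasses import dataclass

# Toy resources; every entry is an illustrative assumption.
DICTIONARY = {"ne": ("ne", "postposition"), "paanii": ("paanii", "noun")}
ROOTS = {"laDak": "noun", "pi": "verb"}            # simplified stems
PARADIGMS = {"noun": {"e": "oblique"}, "verb": {"yaa": "yaa"}}
BILINGUAL = {"laDak": "boy", "paanii": "water", "pi": "drink"}
NOUN_VIBHAKTI = {"ne": "ERG", "": "0"}            # source vibhakti -> target notation
TAM = {"yaa": "PAST"}                             # source TAM label -> target notation


@dataclass
class Analysis:
    root: str
    category: str
    feature: str = ""


def morph(word):
    """Dictionary lookup first, then derivation from a root and its paradigm."""
    if word in DICTIONARY:
        root, category = DICTIONARY[word]
        return Analysis(root, category)
    for root, category in ROOTS.items():
        if word.startswith(root):
            suffix = word[len(root):]
            if suffix in PARADIGMS[category]:
                return Analysis(root, category, PARADIGMS[category][suffix])
    return None  # a fuller system would hand the word to a sandhi splitter here


def group_words(analyses):
    """Attach a following post-position to its noun; keep the verb's TAM label."""
    groups = []
    for a in analyses:
        if a.category == "postposition" and groups and groups[-1][1] == "noun":
            root, category, _ = groups[-1]
            groups[-1] = (root, category, a.root)
        elif a.category == "verb":
            groups.append((a.root, "verb", a.feature))
        else:
            groups.append((a.root, a.category, ""))
    return groups


def map_group(group):
    """Look up the target root and the target vibhakti or TAM notation."""
    root, category, vibhakti = group
    table = TAM if category == "verb" else NOUN_VIBHAKTI
    return (BILINGUAL[root], table[vibhakti])


def analyse_and_map(sentence):
    analyses = []
    for word in sentence.split():
        analysis = morph(word)
        if analysis is None:
            raise ValueError(f"cannot analyse {word!r}")
        analyses.append(analysis)
    return [map_group(g) for g in group_words(analyses)]


print(analyse_and_map("laDake ne paanii piyaa"))
# [('boy', 'ERG'), ('water', '0'), ('drink', 'PAST')]
```

The sketch keeps the three stages as separate functions so that each one only consumes the output of the previous stage, which mirrors the block structure described in the text.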


The local word groups in the target language are passed on to a local word splitter (LWS) followed by a morphological synthesizer (GEN). LWS splits the local word groups into elements consisting of root and features. Finally, GEN generates the words from a root and the associated grammatical features.

Figure 3: Block Schematic of Anusaraka

5.1.2 User Interface

Anusaraka output is usually not the target language, but close to it. Thus, the Kannada-Hindi anusaraka produces a dialect of Hindi that does not have agreement etc. It can be called a sort of Dakshini (southern) Hindi. Some additional notation may also be used in the output. A certain amount of training is needed for a user to get used to the anusaraka output language. The role of the anusaraka interface in Figure 4 is to facilitate the reading of output by the reader.

Figure 4: Different Interfaces for Anusaraka
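The two generation stages of Section 5.1.1, the local word splitter (LWS) and the morphological synthesizer (GEN), can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not the system's implementation: the suffix table, the simplified stems, the rule that a noun takes its oblique form before a post-position, and the names `split_group`, `generate` and `synthesize` are all hypothetical.

```python
# Hypothetical synthesizer table: (category, feature) -> suffix.
SUFFIXES = {
    ("noun", "oblique"): "e",
    ("noun", "direct"): "",
    ("verb", "yaa"): "yaa",
    ("postposition", ""): "",
}


def split_group(group):
    """LWS: break a local word group into (root, category, feature) elements."""
    root, category, vibhakti = group
    if category == "noun" and vibhakti:
        # Assumed rule: the noun is oblique and the marker becomes its own word.
        return [(root, "noun", "oblique"), (vibhakti, "postposition", "")]
    if category == "noun":
        return [(root, "noun", "direct")]
    return [(root, category, vibhakti)]


def generate(element):
    """GEN: build a word form from a root and its grammatical feature."""
    root, category, feature = element
    return root + SUFFIXES[(category, feature)]


def synthesize(groups):
    return " ".join(generate(e) for g in groups for e in split_group(g))


groups = [("laDak", "noun", "ne"), ("paanii", "noun", ""), ("pi", "verb", "yaa")]
print(synthesize(groups))
# laDake ne paanii piyaa
```

A table-driven synthesizer of this kind keeps the language-specific knowledge in data, so supporting another target language would mainly mean supplying another table.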


5.1.3 Pre-editing and Post-editing

The Anusaraka system has been designed so that the combination of man and machine together can perform translations. There are two principal points in this whole process at which the user can help: pre-editing the input and post-editing the output.

Pre-editing: In the pre-editing task, the input text is corrected and edited by the user: words spelt with non-standard spellings are changed to their standard spellings, external sandhi between words is broken (unless it changes meaning), etc. This is an important task for Indian languages because of the lack of standardization and consequent variation.

Post-editing: The anusaraka output is close to the target language, and in general is not grammatical from the target language viewpoint. Therefore it would normally be post-edited by a person before distribution or publication. In post-editing, the output is corrected grammatically and stylistically, considering the cultural and social background of the reader.

5.1.4 Training

The reader of the Anusaraka output would need to undergo training. Besides covering the special symbols used in the output, the training would also familiarize him with the differences in the source and target language.

5.2 Implications

If the anusarakas enter into common use, it has major implications for national integration. The users of anusaraka will learn the features of the source languages they read. Thus, a reader of anusaraka Hindi will learn features of the South Indian language if he uses a Telugu to Hindi anusaraka. Many new constructions will also enter into the language. Thus, on the one hand, it will encourage people to work in their own languages, and thus strengthen the various Indian languages; on the other hand, it will further contribute to the mixing of languages. Government can support this activity; but what is needed is for volunteers to come forward for the task. This should happen for the love of our languages.

VI. Paninian Grammar v/s Tree Adjoining Grammar

For Indian languages which do not have long distance dependencies, Paninian Grammar (PG) should perform better than Tree Adjoining Grammar (TAG) because of the price paid for adjunction. To handle free word order, TAGs have to relax ordering, which may lead to a further price in efficiency. PG, on the other hand, utilizes the vibhakti constraints to build an efficient system. It remains an open question how well PG handles long distance dependencies.

VII. Conclusion and Future Works

Indian languages have relatively free word order. They also have a rich system of case-endings and post-positions. The unique aspect of Paninian grammar is that it is designed for free word order languages and makes use of vibhakti. It builds on the concepts of vibhakti and karaka relations. The Paninian parser turns out to be elegant as well as efficient. It is able to handle diverse phenomena like karaka assignment, active-passives, and control in a unified manner.

Anusaraka is a Machine Translation system developed on the basis of the Paninian framework. It is possible to overcome the language barriers in India using Anusaraka. Anusaraka tries to take advantage of the relative strengths of the computer and the human reader, where the computer takes the language load and leaves the world knowledge load on the reader. It is particularly effective when the languages are similar, as in the case of Indian languages.

References

1. Bharati Akshar, Vineet Chaitanya, Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice Hall of India, Delhi, 1995.
2. Jonardon Ganeri, Artha: Meaning, Oxford University Press, 2006.
3. Rick Briggs, Knowledge Representation in Sanskrit and Artificial Intelligence, AI Magazine, 1985; Volume 6 Number 1: 32-39.
4. Subhash C. Kak, The Paninian Approach to Natural Language Processing, International Journal of Approximate Reasoning, 1987; 1: 117-130.
5. Kulkarni, Amba P., Design and Architecture of anusAraka: An Approach to Machine Translation, Satyam Technical Review vol 3, Oct 2003, pp 57-64.


Anaphora Resolution - An Overview

Athira S, Lekshmi T S
M.Tech Computational Linguistics, GEC Sreekrishnapuram

Anaphora is the phenomenon of a linguistic expression which acts as a substitute or reference to some other linguistic form, which generally precedes it. Anaphora Resolution (AR) is the process of automatically finding the pairs of pronouns or noun phrases in a text that refer to the same incidence, thing, person, etc., called the referent.

I. Introduction

Anaphora is the phenomenon of a linguistic expression which acts as a substitute or reference to some other linguistic form, which generally precedes it. Anaphora Resolution (AR) is the process of automatically finding the pairs of pronouns or noun phrases in a text that refer to the same incidence, thing, person, etc., called the referent. The first member of the pair is called the antecedent and the next member is called the anaphor. Antecedents and anaphors can co-occur in the same sentence or can be in different sentences. Anaphora Resolution can be done in two steps. First, the pronouns or noun phrases that may be used for indicating a referent are identified. These are called markables. Then, the pairs of such markables that have the antecedent-anaphor relation are identified. AR involves a good understanding of the interaction between the syntax, semantics and pragmatics of a language.

An AR system can be helpful in the following areas:

• Machine Translation. In machine translation, anaphora must be resolved for languages that mark the gender of pronouns. One main drawback with most current machine translation systems is that the translation usually does not go beyond sentence level, and so does not deal with discourse understanding successfully. Inter-sentential anaphora resolution would thus be a great assistance to the development of machine translation systems.

• Text Summarization. Many automatic text summarization systems apply a scoring mechanism to identify the most salient sentences. However, the selected sentences are not always guaranteed to be coherent with each other. It could lead to errors if a selected sentence contains anaphoric expressions. To improve the accuracy of extracting important sentences, it is essential to solve the problem of anaphoric references in advance.

• Question-Answering system. A QA system can be improved by resolving anaphora in queries and retrieved documents.

A real-world anaphora resolution system vitally depends on the efficiency of the preprocessing tools which analyse the input before feeding it to the resolution algorithm. Inaccurate pre-processing could lead to a considerable drop in the performance of the system, however accurate an anaphora resolution algorithm may be. In the preprocessing stage a number of hard preprocessing problems such as morphological analysis / POS tagging, named entity recognition, unknown word recognition, NP extraction, parsing, identification of pleonastic pronouns, selectional constraints, etc. have to be dealt with. Each one of these tasks introduces error and thus contributes to a reduction of the success rate of the anaphora resolution system. Anaphora Resolution is a challenging task, particularly for resource-poor languages like Indian languages.

II. Properties of Anaphora

A number of proposed characteristic properties of anaphora are the following.

• Dependency of Interpretation. An expression is anaphoric only if it depends for its interpretation on a contextually given item.

• Type of Antecedent. The item on which the anaphor depends for its interpretation is called the antecedent. It can be of three types:
  1. Linguistic expressions, e.g. full noun phrases;
  2. Representations of objects. These representations can be thought of as theoretical constructs or models of the mental representations which hearers employ when interpreting a discourse.
  3. Objects in the real world, in particular, that part of the world in which the communication takes place, i.e., the utterance situation.


• Location of Antecedent. The antecedent usually precedes the anaphor. It is, however, possible to have the anaphor precede the antecedent. In such situations, it is called cataphora.

• Type of Relation. Coreference does not seem to be a necessary property of anaphora. There exist non-referring expressions: these are normally not used to single out one or more real world individuals. Therefore they cannot be coreferential with some other expression. Two expressions are coreferential if their real world referents are identical. Alternatively, we can adopt a notion of coreference which applies to individuals on the level of mental representations.

• Constraints on the Relation. In general, an indefinite NP establishes a permanent discourse referent just in case the quantifier associated with it is attached to a sentence that is asserted, implied, or presupposed to be true, and there are no higher quantifiers involved.

III. Existing Anaphora Resolution Approaches

Discourse-oriented theories and formalisms such as DRT and Centering have inspired new research on the computational treatment of anaphora. The drive towards corpus-based robust NLP solutions has further stimulated interest in alternative and/or data-enriched approaches. Last, but not least, application-driven research in areas such as automatic abstracting and information extraction has independently identified the importance of anaphora and coreference resolution.

➢ Rule-Based Approaches
Rule-based approaches integrate knowledge sources/factors that discount unlikely candidates until a minimal set of plausible candidates is obtained. Here constraints work as a filter to eliminate unlikely candidates within a set of defined rules. Thereafter, preference-based factors are applied to the candidates that remain.

➢ Corpus-Based Approaches
Corpus-oriented approaches make use of the available corpora. These corpora have been created specifically for the discourse task.

➢ Knowledge-Poor Approaches
Knowledge-poor approaches are inexpensive, fast and reliable. They help in the emergence of cheaper and more reliable corpus-based NLP tools. This approach is suitable for a vast category of languages, as it does not rely on the syntax and semantics of the language. It does not rely on linguistic or domain knowledge and works even without parsing.
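The rule-based scheme described above, in which constraints first filter out unlikely candidates and preferences then choose among the rest, can be illustrated with a short Python sketch. It is a minimal example under stated assumptions and does not reproduce any published algorithm: the agreement constraints, the subject bonus, the distance penalty and all names (`Mention`, `passes_constraints`, `preference_score`, `resolve`) are hypothetical, and the example sentence is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    gender: str        # "m", "f" or "n"
    number: str        # "sg" or "pl"
    position: int      # order of the mention in the discourse
    is_subject: bool = False


def passes_constraints(pronoun, candidate):
    """Hard filter: the candidate must precede the pronoun and agree with it."""
    return (candidate.position < pronoun.position
            and candidate.gender == pronoun.gender
            and candidate.number == pronoun.number)


def preference_score(pronoun, candidate):
    """Soft ranking: favour subjects and nearer candidates (weights are arbitrary)."""
    distance = pronoun.position - candidate.position
    return (10 if candidate.is_subject else 0) - distance


def resolve(pronoun, candidates):
    """Return the preferred antecedent, or None if no candidate survives."""
    plausible = [c for c in candidates if passes_constraints(pronoun, c)]
    if not plausible:
        return None
    return max(plausible, key=lambda c: preference_score(pronoun, c))


# "Rama cooked rice for Sita. She thanked him."
mentions = [
    Mention("Rama", "m", "sg", 0, is_subject=True),
    Mention("rice", "n", "sg", 1),
    Mention("Sita", "f", "sg", 2),
]
she = Mention("She", "f", "sg", 3, is_subject=True)
him = Mention("him", "m", "sg", 4)

print(resolve(she, mentions).text)  # Sita
print(resolve(him, mentions).text)  # Rama
```

Keeping the constraints and the preferences in separate functions reflects the two phases named in the text: a candidate that fails a constraint is never scored, however strongly the preferences would favour it.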


➢ Discourse-Based Approaches
Discourse is modeled through a sequence of utterances. A single entity is centered at any given point in the discourse and it has to be distinguished from all other entities that have been evoked. In order to resolve anaphora, world knowledge and inferencing are also employed in this approach.

IV. Works in Indian Languages

4.1 In Sanskrit Texts [Sobha L and Pravin, 2010]

In this work, the computational grammar implemented uses very familiar concepts such as clause, subject, object etc., which are identified with the help of morphological information and concepts such as precede and follow. It makes limited use of grammatical rules and uses only morphological markings to identify subject, object, clause etc. It uses limited parsing: the information required from the parser is limited to parts of speech tagging, clause identification, subject of the clauses and person-number-gender of the noun phrases.

4.2 Random Tree Algorithm [Chatterji, Dhar, Barik, Sartar and Basu, 2011]

In this paper, a data-driven approach for anaphora resolution in three Indian languages, Bengali, Hindi and Tamil, is discussed. The work consists of two steps: identifying markables and identifying links. Markable identification is done using Conditional Random Fields. The identification of links between markables is done using a decision tree algorithm. Weka is used to implement the decision tree algorithm.

4.3 Using CRF in Tamil [Akilandeswari and Sobha, 2012]

Conditional Random Fields are used in this work since they can combine a number of feature functions. Salience features and other features are trained into the system for resolving anaphora.

4.4 GuiTAR Based Resolution in Bengali [Senapati and Utpal, 2013]

This AR system uses an off-the-shelf system. Preprocessing and anaphora resolution are the two modules in the architecture of this system.

VASISTH [Sobha L and Patnaik, 1999]

VASISTH is a multilingual system, which presently handles two languages from two different language families: Malayalam, from the Dravidian family, and Hindi, from the Indo-Aryan family. It can easily be extended to handle other Indian languages and, more generally, other morphologically rich languages.

References

[1] Shalom Lappin, Herbert J Leass, An Algorithm for Pronominal Anaphora Resolution, Computational Linguistics Journal, J94-4002, Pg no: 535-561, 1994
[2] Sobha L, Patnaik B N, VASISTH - An Anaphora Resolution System, Unpublished Doctoral Dissertation, MG University, 1999
[3] Daniel Jurafsky, James H Martin, Speech and Language Processing, 2000
[4] Sobha L, Pravin Pralayankar, Algorithm for Anaphora Resolution in Sanskrit Texts, Proceedings of the 2nd Sanskrit Computational Linguistics Symposium, Brown University, USA, 2008
[5] S Chatterji, A Dhar, B Barik, S Sartar, B Basu, Anaphora Resolution for Bengali, Hindi, and Tamil Using Random Tree Algorithm in Weka, Proceedings of the ICON, 2011
[6] Akilandeswari A, Sobha Lalitha Devi, Resolution for Pronouns in Tamil Using CRF, Proceedings of the workshop on machine translation and parsing in Indian languages, Pg no: 103-112, 2012
[7] Apurbalab Senapati, Utpal Garain, GuiTAR-based Pronominal Anaphora Resolution in Bengali, Proceedings of the 51st annual Meeting of the Association for Computational Linguistics, Pg no: 126-136, 2013


Indian Language Computing Platforms

Sreejith C
M.Tech Computational Linguistics

Hi all ☺, this edition of CLEAR focuses on various aspects of Indian language computing. So in this article I would like to introduce some tools and standards available for Indian language computing.

India is a multilingual country with as many as 22 scheduled languages. Computer technology breaks the language barrier and bridges the gap between the various sections of society through easier access to information in their respective languages, and hence language computing becomes central to the exchange of information across speakers of various languages. A lot of Indian language computing projects are going on. They involve some government sector companies, some volunteer groups and individuals.

The Department of Electronics and Information Technology, India initiated TDIL (Technology Development for Indian Languages, http://tdil.mit.gov.in) with the objective of developing Information Processing Tools and Techniques to facilitate human-machine interaction without language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services.

The Department of Information Technology launched another major initiative called the National Rollout Plan to aggregate these software tools and to make them available through a web based Indian Language Data Centre (ILDC - http://www.ildc.gov.in/). This activity is being executed in close coordination with CDAC, GIST, Pune. Under this, user friendly software tools and fonts are being made available free to the public through language CDs and web downloads for the benefit of the masses. The availability of these software tools, fonts and resources in local languages at no cost is intended to motivate the general public to use ICT tools and technology in their day to day work like Word Processing, Presentation preparation, Spread Sheets preparation, Web Page Surfing & Designing, Messaging etc. in local languages. Further, the consolidated availability of linguistic resources and tools at one place will help researchers to carry out their research in a smooth and efficient manner.

Project Bhasha (http://www.bhashaindia.com/) is a key milestone in Microsoft's effort to stimulate local language computing and take IT to the masses, driven by the fact that 95 percent of Indians use their local language rather than English in their work and personal life. Being a comprehensive program, it aims to localize (provide local interfaces to) Microsoft's flagship products, Windows and Office.

Bhasha is a cohesive program for bringing together the governments, the academia and the research institutions, the local ISVs and developers and the industry associations on a common ground for promoting local language computing. Microsoft has localized Windows and Office (provided localized User Interface as well as User Assistance) in 12 Indian languages that include Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. Microsoft also provides Indian languages' support throughout the platform and across the range of products.

C-DAC GIST has always been at the forefront of the development of new tools and technologies. A leader in the area, the GIST Labs have carved out their expertise with technologies as varied as Natural Language Processing (NLP), Video, Embedded Systems and Word-processing, to name only a few. This tradition of cutting-edge technologies is continually upheld at the GIST Labs, where new tools compatible with the needs and requirements of today's fast developing digital world are being developed (http://pune.cdac.in/html/gist/research-areas/nlp.aspx). C-DAC GIST has participated in various standardization activities for language technology. GIST is also involved in standardization of heritage scripts of India such as Vedic, Grantha, Samavedic, etc. Heritage scripts based tools are used for digitising and preserving data for making web-pages or data mining applications, especially by researchers.

Contributions of GIST:

• Need for Standards for Indian language computing
• W3C (Indian Languages on Web)
• Internationalized Domain names (IDN)
• E-Governance in Indian languages
• Linguistics formats
• Storage standards
• Input standards for Indian languages
• Indian language Display Fonts


Swathanthra Malayalam Computing (SMC) is a free software collective engaged in the development, localization, standardization and popularization of various Free and Open Source Software in the Malayalam language. SMC (http://smc.org.in/) has been active since October 2002 and works to provide Malayalam language tools at all layers of computing, including but not limited to rendering fixes, fonts, input mechanisms, translations (localization), text-to-speech engines, dictionaries, spell checkers and other Indic-script-based language computing tools across operating systems. SMC is the upstream for Malayalam fonts and tools for popular GNU/Linux based operating systems such as Fedora and Debian. It also maintains localizations for popular Free Software desktops (GNOME/KDE) and popular applications such as Firefox and LibreOffice.

The Acharya Web Site (http://www.acharya.gen.in:8080/about.html) disseminates information relating to computing with Indian languages. The information presented at this site reflects the experiences gained at the Systems Development Laboratory in the Department of Computer Science and Engineering at IIT Madras. These experiences relate to the project undertaken at the laboratory, where software tools for developing interactive computer applications in different regional languages of India have been set up.

The Center for Indian Language Technology (CFILT) was set up in 2000 at the Department of Computer Science and Engineering, IIT Bombay, with a generous grant from the Department of Information Technology (DIT), Ministry of Communication and Information Technology, Government of India. Prior to this, the Natural Language Processing (NLP) activity of the CSE Department, IIT Bombay took off in 1996 with a grant from the United Nations University, Tokyo to create a multilingual information exchange system for the web. The project, called Universal Networking Language (UNL; www.undl.org), involved 15 research groups across continents. (http://www.cfilt.iitb.ac.in)

Today Indian-language computing is reaching out for help from varied sources; we are heading towards a boom in Indian language computing. I hope that SIMPLE groups will also contribute to the development of various aspects of Indian language computing in the future.



M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com

SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering

Article Invitation for CLEAR, June 2014

We invite thought-provoking articles, interesting dialogues and healthy debates on the multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in June 2014. The suggested areas of discussion are:

Articles may be sent to the Editor on or before 10th June, 2014 by email to simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Magazine

Representative, SIMPLE Groups



Hello World, With this issue of CLEAR magazine, we are bringing out an edition on INDIAN LANGUAGE COMPUTING. As the term conveys, it involves developing software in Indian languages, i.e., localization of computer applications, web development etc. in Indian languages. Several organisations and volunteer groups have come forward with Indian Language Computing projects. A few of them have already been discussed by the authors in their articles in this edition. Microsoft is one among them; it works in close collaboration with state government bodies and nodal IT agencies to create glossaries, which are used to build local language interfaces. Its key initiative, Project Bhasha, aims to localize Microsoft's flagship products, Windows and Office. CDAC, IIITM-K, SMC etc. are some of the organisations working on Language Computing projects in Malayalam. This reveals a lot of opportunities in the area of Language Computing.

Reshma O.K.
