CLEAR March 2014
CLEAR September 2013 Volume-3 Issue-1
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
simplequest.in@gmail.com
Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad
Editors: Reshma O K, Sreejith C, Gopalakrishnan K, Neethu Jhonson
Editorial
SIMPLE News & Updates
CLEAR June 2014 Invitation
Last word
Indian Language Computing: From Dream to Reality, by Mr. Manu Madhavan
The Sphota Theory of Bhartrhari, by Ms. Kavitha Raju and Mr. Manu V. Nair
Natural Language Processing: A Paninian Perspective, by Ms. Anagha Manoharan, Ms. Reji Rahmath and Ms. Reka Raj C.T.
Anaphora Resolution - An Overview, by Ms. Athira S and Ms. Lekshmi T.S.
Indian Language Computing Platforms, by Mr. Sreejith C
Cover Page and Layout: Sreejith C
Greetings!

India has a well-acclaimed history of excellence in mathematical and philosophical domains of knowledge. Scientific theories related to syntax, semantics and other aspects of language flourished in ancient India. Long before ASCII and EBCDIC were invented, Indians used the 'katapayaadi' notation to denote numbers in scientific literature and even in poetry. Messages and text were compressed into very short codes, which were used to specify scientific principles and rules of grammar. Shiksha and Niruktham were two important branches of study in the Indian system of education.

This edition of CLEAR is a humble tribute to those great philosophers and scientists who gave a sound logical framework to the science of language. The articles deal with the Sphota theory and the Paninian grammatical framework, among others. It is therefore aptly tagged as a Special Edition on Indian Language Computing. Please send in your valuable feedback on the content so that we keep improving in the times to come.

Warm Regards,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES
Workshop on Soft Computing
A three-day workshop on soft computing was held at GEC Sreekrishnapuram from 9 to 11 January 2014. Eminent personalities from IIST, IIT Delhi and M.S.U Baroda gave sessions on genetic algorithms, neural networks, fuzzy logic and data mining. The workshop was inaugurated by Dr. P. C. Reghu Raj, Principal of GEC Sreekrishnapuram. The first session, on fuzzy logic, was given by Dr. Deepak Mishra, IIST Trivandrum, and was followed by a session on the Google page ranking algorithm by Dr. Sumitra S. Nair, IIST Trivandrum. The first day ended with a session on artificial neural networks by Prof. Raju K. George, IIST Trivandrum, followed by a lab session. The second day opened with a session on genetic algorithms by Prof. V. D. Pathak, M.S.U Baroda, followed by a session on 'Theory of Kernel Methods' by Dr. Sumitra S. Nair, IIST Trivandrum, and ended with a lab session on ANNs. The third day focused mainly on data mining and its applications, in a session by Prof. B. Chandra, IIT Delhi. The workshop came to an end with the distribution of certificates to all the participants by Prof. B. Chandra and Prof. V. D. Pathak.
Publications
Sreejith C, Nibeesh K, and P. C. Reghu Raj, "CHODYOTHARI: Question Answering System for Malayalam", in proceedings of 4th NATCON, Kerala, 2014.
Nibeesh K, Sreejith C, P. C. Reghu Raj, "Text Classification for Effective News Filtering Using Support Vector Machine", in proceedings of 4th NATCON, Kerala, 2014.
Christopher Augustine, P. C. Reghu Raj, "Agglomerative Sentence Clustering Approach for Discourse Segmentation", International Journal of Engineering Research & Technology (IJERT), Vol. 2 Issue 12, December 2013.
SIMPLE Groups congratulates all of them on their achievements!
Indian Language Computing: From Dream to Reality
Manu Madhavan
Asst. Professor, SIMAT

Nowadays, language computing and related technologies have made some notable contributions to research. But turning that research into applications for the common man is still a dream. Of course, there are many language enthusiasts who are working to close the gap between research and applications in language-oriented computation. In the Indian scenario especially, this field is still in its childhood. Indian engineers and scientists are a dominant force in the IT world, but have also faced criticism for being grossly negligent of the needs of the common man from their own region. "This has pushed India to the top of the list of countries suffering from the Digital Divide," argue campaigners like Venkataramanan. This article reviews the scope, challenges and applications of Indian language computing.

The first point to be checked is the reach of Indian language computing. India is a multilingual country with as many as 22 scheduled languages. Computer technology breaks the language barrier and bridges the gap between the various sections of society through easier access to information in their respective languages; hence language computing becomes central to the exchange of information across speakers of various languages.

However innovative a product is, its reach depends on how it inspires the common man. The main barrier between technology and the common man is language. Most information from the technology side is English-centric. Even great innovators may sometimes be biased by the wrong assumption that most requirements can be met by an English-only interface. The real world is multilingual, and the majority of the world does not know English well. To quote Sri Santhosh Thottingal, popularly known as Malayalam's Wiki Warrior: "Nowadays it is like if you are writing a program that can work only with English, it rarely meets the requirements. Such an assumption that the input will be only English or we will have to process only English is wrong. So multi-language computing is becoming part of software engineering, not a different field."

Researchers working on Indian language computing soon realized that the tools present in the global market cannot be replicated in India owing to the complexity of the multiple languages that exist in the country. (India has not only 22 major languages and as many as 1,652 dialects, but also 11 scripts to represent these languages.) Swarn Lata, head of the TDIL program and director of the Human Centred Computing group in the DIT, explains that "in Indian languages one-to-one mapping or translation of each word as it is to form a sentence is not workable. The methodology to be followed here is to first process the source language, convert words according to the target language, and then process it all again with respect to the target language for the conversion to make sense."

Apart from the typical nature of Indian languages, cultures also affect our language usage and pronunciation. For example, in the northern parts of India, Hindi is spoken in varied forms across different states and cities. Thus we cannot have a generic tool, especially for translation, and all tools have to be developed for all of the languages, adds Shukla. In a C-DAC report he mentions that although all Indian languages have emerged out of Sanskrit, the core ancient language, and almost all of them follow Paninian grammar, that itself is a problem, as different languages depend on Sanskrit and Panini in different ways. Therefore the accuracy of any of these systems is not 100 percent.

Most language computing work in India concentrates on translation. Direct translation between Indian languages is not simple, but it can be made easier by considering their etymology and historical relations with Sanskrit. Other areas such as OCR, speech, sentiment analysis and conceptualization are not yet fully developed and have large scope.

Even though the difficulties exist, the enthusiasm shown by young techies towards language computing promises a positive future. The work done by active open source groups like Swathanthra Malayalam Computing (SMC) and by institutions like IIITH, IIITM-K and C-DAC is highly appreciable and is a good guideline for newcomers.
The Sphota Theory of Bhartrhari
Kavitha Raju, Manu V Nair
Dept. of Computer Science and Engineering, Govt. Engg. College Sreekrishnapuram, Palakkad, India - 678 633
kavitharaju18@gmail.com, manunair1990@gmail.com
Sphota theory is one of the major contributions of Indian philosophers (mainly Bhartrhari) to linguistics. It deals mainly with the semantic aspects of language. Sphota theory as such originates with Bhartrhari, but the term is already used in the literature of Vyasa. In Sanskrit, sphota is etymologically derived from the root 'sphut', which means 'to burst'. Sounds have spatial and temporal relations; they are produced differently by different speakers. But the word as meaning bearer has to be regarded as having no size or temporal dimension. It is indivisible and eternal. Distinguished from the sphota are the abstract sound pattern (prakrtadhvani) and the utterances (vaikrtadhvani). Bhartrhari held that the sentence is not a collection of words or an ordered series of them. A word is rather an abstraction from a sentence; thus, the sentence-sphota is the primary unit of meaning. A word is also grasped as a unity by an instantaneous flash of insight (pratibha).
I. Introduction

Sphota is an important concept in the Indian grammatical tradition of Vyakarana, relating to the problem of speech production: how the mind orders linguistic units into coherent discourse and meaning. It is a milestone in Indian linguistics. Across cultures, the early history of linguistics is associated with a need to disambiguate discourse, especially for ritual texts or in arguments. Linguistics in ancient India derives its impetus from the need to correctly recite and interpret the Vedic texts. Already in the oldest Indian text, the Rigveda, vak ("speech") is deified. By 1200 BCE, the oral performance of these texts had become standardized, and treatises on ritual recitation suggest splitting up the Sanskrit compounds into words, stems, and phonetic units, providing an impetus for morphology and phonetics.
In India, Hindu philosophy is traditionally divided into six astika schools of thought, or darsanas, which accept the Vedas as supreme revealed scriptures. These darsanas deal with almost everything of human concern, and thus include linguistics as well. The six systems of Indian philosophy are: Nyaya, Vaisheshika, Sankhya, Yoga, Purva Mimamsa and Uttara Mimamsa (Vedanta).

Each of these systems differs in one way or another in terms of its concepts, phenomena, laws and dogmas. Each system has its own founder as well. It is important to know that the founders of these systems of philosophy were sages of the highest order who devoted their lives to the study and propagation of philosophy. Each system of Indian philosophy is called a Darshana; thus the Sanskrit word Shad-Darshana refers to the six systems of philosophy. Of these six schools of thought, Nyaya, Vaisheshika and Purva Mimamsa are almost directly related to linguistics. Naiyayika is the name given to followers of the Nyaya school. Mimamsaka is the name given to followers of Jaimini.

There are many philosophers and grammarians in India who contributed ideas to the world of linguistics. Some of them are Vyasa, Yaska, Bhartrhari, Katyayana, Kumarila Bhatta, Mandana Misra, Panini, Patanjali, Pingala, Prabhakara, Sakatayana and Vacaspati Misra.

II. The meaning of 'MEANING'

The most common Sanskrit term for meaning is artha. In the Western literature on the notion of meaning in the Indian tradition, various terms, such as 'sense', 'reference', 'denotation', 'connotation', 'designatum' and 'intension', have been used to render the Sanskrit. Artha basically refers to the object signified by a word. In numerous contexts, it stands for an object in the sense of an element of external reality.

According to Indian thought there are two main approaches to the study of the problem of meaning:
1) Khandapaksa
2) Akhandapaksa

According to khandapaksa, the word is considered an autonomous unit of thought and sense, and the sentence is taken as the concatenation of words. In khandapaksa, the conditions for syntactic relation between words in a sentence are mutual expectancy (akanksa), consistency (yogyata) and proximity (samnidhi). This approach also recognizes the importance of contextual factors and the intention of the speaker in determining the meaning of words.

According to akhandapaksa, the fundamental linguistic fact is the sentence. Bhartrhari, who accepts this view, defines the sentence as 'a single integral symbol' (eko navayavah sabdah). The sentence is revealed by individual letters and words. The meaning conveyed by the sentence is an 'instantaneous flash of insight or intuition' (pratibha). Words are only hints that help the listener to arrive at the meaning.

According to the theory of vyanjana by Anandavardhana, artha or meaning covers not only cognitive and logical meaning but also the emotive elements and the socio-cultural significance of utterances, which are suggested with the help of contextual factors.

III. The History of Sphota

The theory of Sphota is one of the ancient Indian linguistic concepts. It was mainly formulated by the philosopher Bhartrhari in the 5th century CE. The word Sphota derives from a root meaning 'to burst' in Sanskrit.

According to this theory, a sentence should not be considered as being made up of words, and words in turn of letters. They are single indivisible meaning-bearing units. The word or sentence, thus considered as a single meaning-bearing unit, is called the 'sphota'. The articulated sounds used in linguistic discourse are merely the means by which the symbol is revealed; it is the symbol which is the meaning bearer.

Though this theory was fully developed and systematized by Bhartrhari in his work Vakyapadiya, some of the ideas underlying it can be found in earlier grammatical works. There are also works that mention the term Sphota in different senses.

It is believed that the ancient grammarian Vyadi, who was earlier than Patanjali and Katyayana, might have started the discussion on Sphota theory in his lost work, the Samgraha. The school of thought started by Audumbarayana may be considered the forerunner of the Sphota theory of Bhartrhari. According to Yaska, Audumbarayana held the view that only the sentence is found in the minds of the listener and the speaker. Panini seems not to have known of this theory, but he mentions the sage Sphotayana as an authority. Katyayana, who wrote the commentary on Panini's sutras, made the distinction between permanent varnas and vrtti, but he never used the term Sphota.

The Mimamsakas say that varnas are permanent and that it is in varnas that the meaning is carried. Patanjali is not sure whether the meaning is contained in varnas or in words. He is of the opinion that the articulated sounds are of course ephemeral and do not exist together, but that intelligent humans somehow grasp them together in the mind, and this reveals the Sphota. The Naiyayikas say varnas do not have meaning and are ephemeral. According to them, the last sound heard, aided by the memory of the preceding sounds, presents the meaning of the word to the mind. To explain this phenomenon they use the term samskara: the traces left on the mind by experience that can produce a recollection of what was experienced. The concept of samskara is similar to the concept of engrams in modern psychology.

In the Vakyapadiya, Bhartrhari lists three different concepts of sphota that prevailed among different schools. The first is that the Sphota happens first and leads to the generation of dhvani; this dhvani has the ability to propagate by the generation of wavelets. The second is that both the sphota and the dhvani are created at the same time; this is explained as analogous to a flame and its light. A third view defines the Sphota as a class and the dhvanis as its members.

The modern linguist de Saussure says there is a signifiant (signifier) and a signifié (signified). His concepts match the dhvani and Sphota concepts of Bhartrhari. Present-day psychologists accept the existence of an expression plane and a meaning plane in the mind. This also supports the Sphota theory.

IV. The Concept of Sphota, As Stated by Bhartrhari

According to Bhartrhari, speech and thought are only two aspects of the same speech principle. A sentence is to be considered as a single undivided utterance and its meaning as an instantaneous flash of insight (pratibha). The sentence is the fundamental linguistic fact and words are only its unreal abstractions. The sentence meaning is to be grasped as a unity. The division into words and word-meanings is useful only in the study of language.

Bhartrhari begins his discussion of sphota theory with the observation that words or sentences can be considered under two aspects: as sound patterns or as meaning-bearing symbols. The sound pattern, which is the external facet of the language symbol, is called the prakrta dhvani. It is the abstract sound pattern, with the time sequence still attached to it. The second, meaning-bearing aspect of the language symbol is its semantic facet. This artha, which is a partless integral linguistic symbol, is called the Sphota.

The external aspect is what is relevant in a grammatical context; the semantic aspect is of no relevance there. In the cognition of speech, by contrast, it is only the semantic aspect that is perceived, not the sound pattern. When we say we add an affix, say 'van', to a word 'maram', we refer to the external facet of the word, not to the actual thing symbolized by it. Thus a word has a double power: one to indicate itself and the other to indicate the thing symbolized by it. It is like the power of a fire to reveal itself and at the same time reveal other things.

Bhartrhari's analysis envisages three aspects of the language situation:
1) The vaikrta dhvani, the actual sound pattern pronounced and heard by a person during speech. It includes all the various differences in intonation, tempo, pitch etc. that depend on the individual speakers.
2) The prakrta dhvani, the normal phonological pattern or the expression in the mind. It is indicated by the vaikrta dhvani, but all the non-linguistic personal variations are eliminated at this stage. Both the speaker and the listener are conscious of the normal phonological pattern alone. The time sequence is still present in it.
3) The Sphota, the integral linguistic symbol, which is the unit of meaning. It cannot be pronounced or written. It is revealed by the prakrta dhvani. It is indivisible and timeless. There is no sphota without meaning; it is the meaning-bearing nature of an expression that makes it a sphota.

V. How Sphota is Comprehended

The Sphota (the word or the sentence located in the minds of the speaker and listener and taken as an integral symbol) is revealed by the sounds produced in a fixed order. The sounds are only the manifesting agencies and have no function other than revealing the symbol. Each sound helps in manifesting this sphota, the first one vaguely, the next one more clearly and so on, until the last one, aided by the impressions of the preceding perceptions, reveals it clearly and distinctly.

This sphota is one and indivisible; the sounds uttered to reveal it cannot be considered parts of the essential word or sphota, but only diacritical marks that reveal the identity of the whole word. The process of revelation of the word by the sound starts from the indeterminate stage; it begins from complete ignorance, passes through partial knowledge and ends in complete knowledge.

The process of comprehension of sphota is illustrated by grammarians by means of various analogies. They explain it as similar to the way a jeweler examines a diamond. Bhartrhari gives the example of a student learning a verse by heart through repeated reading. It is the last reading, aided by the impressions left behind by the previous readings, that helps him to know the verse fully.

The sphota is the object of cognition. It is revealed with the help of the parts (letters). It is not the existence of the cognition of parts that is denied, for we do undoubtedly cognize individual letters; it is their significance that is in question. The whole taken as an integral symbol is something different from the parts that constitute it, and the parts may be considered irrelevant and illusory. Their function is only to differentiate the word from other words. They do not carry any meaning.

The cognition that takes place by means of letters is a series of errors finally leading to the truth. Even invalid cognitions can sometimes lead to valid knowledge. Bhartrhari explains this by means of some examples. When we see a tree from a distance we may mistake it for an elephant. But as we get closer to it we get a clearer view, which leads to a valid cognition. Similar is the case of mistaking a rope for a snake. Hearing each syllable of the word kamalam is like getting closer to the tree and getting a clearer view. With the perception of the last letter, we arrive at a valid cognition.

VI. Classifications of Sphota

A general overview of the concepts and different approaches to Sphota must be given here in order to show the richness and the precision of the topics discussed among ancient and medieval grammarians in India. There are eight major approaches to the theory of Sphota.
1) Varnasphota
A single phoneme, stem or affix is taken to be denotative (vacaka); when it is found to be so, varna sphota takes place. This theory uses analysis 'from bottom to top', which is mainly found in grammatical treatises such as Panini's descriptive grammar. Varna sphota has its difficulties in immediate application to the analysis of the word, especially when synthetic forms of the word are examined, such as ghatena, 'with the pot', for it cannot clearly divide them into separate and meaningful units.

2) Padasphota
It maintains that the finished word as a unique entity conveys the meaning, and that the division into morphological components (suffixes, stems etc.) does not occur when the speaker or the listener understands the speech. This theory claims that the text can be described by listening to the words and their meanings, as well as by perceiving the relations between them in the syntactic structure of the sentence. It is by listening to the meaning of every word and linking it with another word that the meaning of the sentence can be understood. But since the meaning of the sentence is the final meaning to be understood, the pada sphota theory is found insufficient as a description of the perception of meaning, and leads to the next level of synthesis: vakyasphota.

3) Vakyasphota
It maintains that the sentence is a unique entity which conveys the meaning. The sentence in itself is a unit of meaning. Vakya sphota does not, however, claim that the constituents of the sentence have no meaning. The main point of this theory is that the word should always be seen and understood in a context. Words have their meaning only when they form part of a sentence.

4) Akhandapadasphota
It maintains that the word is perceived as an undivided single meaning-bearing unit. It is not perceived through its parts (suffixes, stems etc.), but as a single and undivided meaningful entity.

5) Akhandavakyasphota
It says that it is insufficient to perceive the separate word, for in ordinary communication it is the sentence as a whole that is perceived as meaningful, not a separate word. Bhartrhari thinks that such division of the sentence into words, stems etc. does not exist in the ordinary perception of speech. In the common use of speech the meaning is taken as a whole, including the context. It is only after the utterance is made that the speaker can dwell on it and analyse it in parts as words, stems etc., not while he is speaking. And if he attends to the parts of speech, such as syllables, he will lose the meaning of it all. According to this theory the varna and pada sphota describe language in its functions, but not in its use.

6) Vyaktisphota and Jatisphota
To answer the question whether Sphota is particular or universal there are two theories, the Vyaktisphotavada and the Jatisphotavada. The Jatisphotavada maintains that the non-difference among the varied individual elements is generic, while the Vyaktisphotavada says that the difference is associative. For the Jatisphotavada the meaning-bearing word is the class (for instance gotva, cowness), which is revealed by the individual instances (vyaktis). The individuals are not meaning bearers.

One more distinction is important to mention here, which gave rise to two different approaches to the understanding of Sphota: the abhihitanvayavada and anvitabhidhanavada theories. The abhihitanvayavada theory maintains that words and grammatical units have their own meaning and, joined together through their syntactic relations, build up the meaning of the sentence. The anvitabhidhanavada theory, on the contrary, affirms that the meaning of a word can be understood only in the context of the sentence. All these theories of Sphota, with many other variations and commentaries, provide a rich basis for linguistic studies of meaning in terms of structural semantics, and together represent a holistic view of the possible approaches to meaning within the grammatical structures (morphology and syntax).

VII. Back to Bhartrhari's Philosophy

Bhartrhari gave the theory of sphota a deeper dimension than the linguistic aspect on which we have analysed it. For him it is the sabda-brahman, the ultimate reality. The entire universe is the manifestation of this speech-principle.
The whole world, as it is, has a Meaning which can be grasped only as an indivisible unity. This meaning is inherent in the consciousness of man from his very birth; he later finds its partial correspondence in his language and reproduces it through articulation, and that is Sphota. On the semantic level, as developed by later grammarians, Sphota makes the text correspond with a universal Text-Totality, sabda-brahman, and therefore the text can be easily understood as such. And once the inner perception (pratibha) of the hearer flashes out, reflecting something from that totality, the Sphota, the revelation of the meaning of the text, takes place in his consciousness.

So the Sphota can be seen as a communication device based on recognition of the truth of existence (satta) through a word or text in the hearer-speaker. It is therefore of a psychological nature, as any human speech is, for the meaning of the text is recognized by a consciousness which lies beyond the analytic capacity of the external mind and carries in itself all meanings; as such, its understanding requires a proper psychological experience.

According to Bhartrhari the speech principle has three stages in its manifestation, namely pasyanti, madhyama and vaikhari. They are respectively the sphota, prakrta dhvani and vaikrta dhvani of the grammarians. A fourth stage, called para, has been identified by another school of thought, but according to Bhartrhari pasyanti is the supreme Reality, Sabdabrahman.

VIII. Conclusion

The theory of sphota has been actively discussed and debated from ancient times to the present day. Some philosophers and grammarians say that the theory is the most complete investigation into the profundities of language, making a considerable contribution to the Philosophy of Language, the Psychology of Speech, and especially Semiotics. But the metaphysical superstructure built on the basis of this theory has kept many modern linguists from fully appreciating the importance of the sphota doctrine in language-symbolism. Bhartrhari developed a monistic doctrine of philosophy in which this mystical Speech-essence is the first principle of the universe; that is, the entire universe is the manifestation of this Speech-essence or Sabda-brahman.

Setting aside this metaphysical aspect of the theory, we can concentrate on the linguistic facet. It gives us a new insight into the semantic level of language. It says that meaning is not something that is inferred, but is actually perceived at the sentence level or expression level. The meaning also corresponds to a universal text-totality, something that is inherent in the consciousness of man from his very birth.

This text-totality refers to a semantic representation that can represent anything and everything that is cognizable to the human mind. It can be mapped to an expression in any language (prakrta dhvani) once we know how this can be done for one language. This would be the ultimate solution for all worries in language computation: language understanding, generation and translation would all be solved.
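The account of comprehension in Section V, where each successive sound reveals the word "the first one vaguely, the next one more clearly", has a loose computational analogue in incremental narrowing of candidates. The sketch below is only an analogy under stated assumptions: the four-word lexicon, its syllable splits and the function name `comprehend` are hypothetical and chosen for illustration. It also treats syllables as filters over a word list, whereas the sphota doctrine itself denies that the sounds are parts of the word.

```python
# Toy analogy only: a tiny hypothetical lexicon, each word split into syllables.
LEXICON = {
    "kamalam": ["ka", "ma", "lam"],
    "kamandalu": ["ka", "ma", "nda", "lu"],
    "kanakam": ["ka", "na", "kam"],
    "kavi": ["ka", "vi"],
}


def comprehend(heard):
    """Return the list of surviving candidate words after each heard syllable."""
    candidates = set(LEXICON)
    trace = []
    for position, syllable in enumerate(heard):
        # Keep only the words whose syllable at this position matches what was heard.
        candidates = {
            word
            for word in candidates
            if len(LEXICON[word]) > position and LEXICON[word][position] == syllable
        }
        trace.append(sorted(candidates))
    return trace


for step, remaining in enumerate(comprehend(["ka", "ma", "lam"]), start=1):
    print(step, remaining)
```

With this lexicon the first syllable leaves all four words open, the second leaves two, and only the last syllable singles out kamalam, which mirrors the movement from vague to distinct cognition described in the text.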
Natural Language Processing: A Paninian Perspective
Anagha Manoharan, Reji Rahmath K, Rekha Raj C.T
M.Tech Computational Linguistics, G.E.C Sreekrishnapuram
anaghamanoharan3@gmail.com, rrahmathrejik@gmail.com, rekahrajct@gmail.com
A majority of human languages, including Indian and other languages, have relatively free word order. Most existing computational grammars are based on context free grammars, which are basically positional grammars. It is important to develop a suitable computational grammar formalism for free word order languages. The Paninian framework is such a framework and has been successfully applied to Indian languages. Paninian grammar uses the notion of karaka relations between verbs and nouns in a sentence. This paper describes the Paninian framework applied to the processing of modern Indian languages and the development of a parser using Paninian grammar. The paper also describes a machine translation system called Anusaraka, developed on the basis of Paninian grammar.
I. Introduction

1.1 Panini and Ashtadhyayi

Panini [4] was a Sanskrit grammarian who lived during the fifth or sixth century B.C. Panini's grammar, the Ashtadhyayi (The Eight Chapters), deals with the Sanskrit language; however, it presents the framework for a universal grammar that may (and probably does) apply to any language. His book consists of just under 4000 rules and aphorisms. In this work Panini distinguishes between the language of sacred texts and the usual language of communication. Panini gives formal production rules and definitions to describe Sanskrit grammar. He gave a comprehensive and scientific theory of phonetics, phonology, and morphology. Starting with about 1700 basic elements like nouns, verbs, vowels and consonants, he put them into classes. The construction of sentences, compound nouns etc. is explained as ordered rules operating on underlying structures, in a manner similar to modern theory. In many ways Panini's constructions are similar to the way that a mathematical function is defined today.
1.2 Paninian Grammar and Free Word Order Languages
A majority of human languages, including Indian languages, have relatively free word order. In free word order languages, the order of words carries only secondary information such as emphasis. Primary information relating to meaning is contained elsewhere. Most existing computational grammars are based on context free grammars, which are basically positional grammars. It is important to develop a suitable computational grammar formalism for free word order languages for two reasons: first, a suitably designed formalism is likely to be more efficient; second, such a formalism is also likely to be linguistically more elegant and satisfying. The Paninian framework is such a formalism and has been successfully applied to Indian languages.
Paninian grammar uses the notion of karaka relations between verbs and nouns in a sentence. The notion of karaka relations is central to the Paninian model. The karaka relations are syntactico-semantic (or semantico-syntactic) relations between the verbals and other related constituents in a sentence.
This paper describes the Paninian framework as applied to the processing of modern Indian languages and the development of a parser using Paninian grammar. The paper also describes a machine translation system called Anusaraka, developed using Paninian grammar.
II. Literature Survey
Akshar Bharati, Vineet Chaitanya and Rajeev Sangal [1] present a Paninian perspective on natural language processing. The unique aspect of the computational grammar they describe is that it is designed for free word order languages and makes special use of vibhakti. It takes the concepts of vibhakti and karaka relations from the Paninian framework and uses them to give an elegant account of Indian languages. The notions of karaka charts, karaka assignment, etc. are discussed. It is argued and shown that a grammar based on these notions successfully handles karaka relations, control, active-passives etc. A parser developed on the basis of the Paninian framework turns out to be elegant and efficient. Anusaraka is a Machine Translation system developed on the basis of the Paninian framework. It has been shown that it is possible to overcome the language barriers in India using Anusaraka. The Paninian grammar framework is compared with existing modern western computational grammar frameworks such as Lexical Functional Grammar (LFG), Tree Adjoining Grammar (TAG), and Government and Binding (GB).
Rick Briggs [3] compares a typical knowledge representation scheme using Semantic Nets with the method based on karaka relations used by the ancient Indian grammarians to analyze sentences unambiguously. Finally, the clear parallelism between the two is demonstrated.
Subhash C. Kak [4] reviews the Paninian approach to natural language processing (NLP) and compares it with current computer-based understanding systems. The current knowledge representation systems of AI agree with the requirements of the Paninian approach, and it is therefore argued that Paninian-style generative rules and meta-rules could assist in further advances in NLP.
III. Paninian grammar
The goal of the Paninian approach is to construct a theory of human natural language communication that answers questions such as how natural language is used to convey information between speakers (or writers) and hearers (or readers). The main problem that the Paninian approach addresses is how to extract karaka relations from a sentence. As it is inspired by Sanskrit, it emphasizes the roles of case endings and markers such as post-positions (or pre-positions). Position or word order is brought into consideration only when necessary.
3.1 Paninian Theory
Paninian grammar is particularly suited to free word order languages. It makes use of vibhakti information for mapping to semantic relations, and uses position information only secondarily. As the Indian languages have relatively free word order, they are eminently suited to be described by Paninian grammar.
3.1.1 Karaka Relations
A karaka [2] is a semantic relation between the verb and a noun. Panini isolated six such relations, namely
1. kartr - Agent
2. karman - Patient
3. karana - Instrument
4. sampradana - Target
5. apadana - Donor
6. adhikarana - Place
Consider the sentence
In the kitchen, Rama cooks rice for Sita with firewood from the forest.
The verb in the sentence denotes an event, viz. a cooking, and the denotation of each of the nouns stands in a semantic relation with that event. Thus, the event has an Agent - Rama, a Patient - rice, an Instrument - firewood, a Donor - the forest, a Target - Sita, and a Location - the kitchen.
In the Paninian framework, a mapping is specified between karaka relations and vibhakti. This mapping depends on the verb and its tense aspect modality (TAM) label. The mapping is represented by two structures: default karaka charts and karaka chart transformations. The default karaka chart for a verb or a class of verbs gives the mapping for the TAM label `tA hE', called basic. It specifies the vibhakti permitted for the applicable karaka relations for a verb when the verb has the basic TAM label. This basic TAM label roughly corresponds to the present indefinite tense and is purely syntactic in nature. For other TAM labels there are karaka chart transformation rules. Thus, for a given verb with some TAM label, the appropriate karaka chart can be obtained using its basic karaka chart and the transformation rule depending on its TAM label.
IV. Paninian Parser
This section discusses how a parser can be built using the Paninian framework. It turns out that the Paninian theory is extremely suitable from the computational viewpoint. The theory can be used in a natural manner for structuring a parser which is extremely efficient. Figure 1 shows the structure of the Paninian parser.
Figure 1: Structure of the parser
4.1 Morphological Analyzer
The morphological analyzer takes as its input a sentence, that is, a sequence of words. For each word, it looks up a lexicon and retrieves information such as the root of the word, its lexical category, gender, number, person, tense, etc. In case a word has multiple meanings, grammatical information is returned for each of the meanings.
Figure 2: A Parse Structure
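The lexicon lookup performed by the morphological analyzer of Section 4.1 can be sketched roughly as follows. This is only a minimal illustration: the lexicon entries, the feature names and the function name analyze are assumptions made for this example, not the data or interface of any actual analyzer. It shows the one property stressed in the text, namely that an ambiguous word form returns one analysis per reading.

```python
# Toy lexicon: surface form -> list of possible analyses.
# All entries and feature names are illustrative assumptions.
LEXICON = {
    "laDake": [
        {"root": "laDakA", "category": "noun", "gender": "m",
         "number": "sg", "case": "oblique"},
        {"root": "laDakA", "category": "noun", "gender": "m",
         "number": "pl", "case": "direct"},
    ],
    "ne": [{"root": "ne", "category": "postposition"}],
    "paanii": [{"root": "paanii", "category": "noun", "gender": "m",
                "number": "sg"}],
    "piyaa": [{"root": "pii", "category": "verb", "gender": "m",
               "number": "sg", "tam": "yA"}],
}


def analyze(sentence):
    """Return (word, analyses) pairs; unknown words get an empty list."""
    return [(word, LEXICON.get(word, [])) for word in sentence.split()]


if __name__ == "__main__":
    for word, analyses in analyze("laDake ne paanii piyaa"):
        print(word, analyses)
```

Returning every analysis, instead of picking one, leaves the choice to the later stages (local word grouping and the core parser), which have more context available.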
4.2 Local Word Grouper (LWG)
The function of this block is to form word groups on the basis of local information. These are the word groups at the vibhakti level (i.e., typically each word group is a noun or verb with its vibhakti, TAM label, etc.). This involves grouping post-positional markers with nouns, auxiliaries with main verbs etc. Rules for local word grouping are given by finite state machines.
4.3 Core Parser
Given the local word groups in a sentence, the task of the core parser is two-fold:
1. To identify karaka relations among the word groups,
2. To identify senses of words.
The first task requires knowledge of the karaka-vibhakti mapping, optionality of karakas, and transformation rules. A data structure called the karaka chart stores this information. The second task requires lakshan charts for nouns and verbs.
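As a rough illustration of the karaka chart and its TAM-dependent transformation, the sketch below stores, for each karaka, the permitted vibhakti and whether the karaka is mandatory, and then searches for an assignment of noun groups to karakas. The chart values, the rule that the TAM label yA makes the karta take ne, and the function names are simplified assumptions chosen for this example; the exhaustive search over permutations is merely a stand-in and is not claimed to be the method used by the actual parser.

```python
from itertools import permutations

# Default karaka chart for the basic TAM label 'tA hE'.
# "0" stands for the empty vibhakti. All values are illustrative assumptions.
DEFAULT_CHART = {
    "karta": {"vibhakti": {"0"}, "mandatory": True},
    "karma": {"vibhakti": {"0", "ko"}, "mandatory": True},
    "karana": {"vibhakti": {"se"}, "mandatory": False},
}

# Karaka chart transformation rules, keyed by TAM label (assumed for the sketch).
TRANSFORMATIONS = {
    "tA hE": {},                 # basic label: chart is used unchanged
    "yA": {"karta": {"ne"}},     # karta now takes the vibhakti 'ne'
}


def karaka_chart(tam):
    """Build the chart for a TAM label from the default chart plus its rule."""
    chart = {k: {"vibhakti": set(v["vibhakti"]), "mandatory": v["mandatory"]}
             for k, v in DEFAULT_CHART.items()}
    for karaka, vibhakti in TRANSFORMATIONS[tam].items():
        chart[karaka]["vibhakti"] = set(vibhakti)
    return chart


def assign_karakas(noun_groups, tam):
    """noun_groups: list of (root, vibhakti). Returns {karaka: root} or None."""
    chart = karaka_chart(tam)
    for perm in permutations(chart, len(noun_groups)):
        fits = all(vib in chart[k]["vibhakti"]
                   for k, (_, vib) in zip(perm, noun_groups))
        complete = all(k in perm for k in chart if chart[k]["mandatory"])
        if fits and complete:
            return {k: root for k, (root, _) in zip(perm, noun_groups)}
    return None


if __name__ == "__main__":
    print(assign_karakas([("laDakA", "ne"), ("paanii", "0")], "yA"))
```

When several assignments satisfy the chart, this sketch simply returns the first one it finds; a real parser would need further constraints, such as the lakshan charts mentioned above, to choose among them.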
The function of the core parser is to accept the local word groups produced by the LWG and produce the parse structure. Figure 2 shows the parse structure for the sentence
laDake ne paanii piyaa
boy ergative water drink
V. Machine Translation
A story written in one language (say Malayalam) is fed into a computer system and out comes its translation in other languages. It is inexpensive, immediate and simultaneous. The language barriers melt away. The richness of other literatures opens up to everyone. Texts in any language are accessible to the common man without prior knowledge of that language. These are the goals of Machine Translation (MT).
5.1 Anusaraka or Language Accessor
Anusaraka is a Machine Translation system based on Paninian grammar. It is possible to overcome the language barrier in India today using anusaraka. Anusaraka tries to take advantage of the relative strengths of the computer and the human reader, where the computer takes the language load and leaves the world knowledge load on the reader [5]. It is particularly effective when the languages are close, as is the case with Indian languages. It bridges the gap between languages by choosing the most appropriate or nearest construction available in the target language together with suitable additional notation.
There are only three major differences between the south Indian languages and Hindi. All these can be bridged by simple additional notation in Hindi. The resulting language can be viewed as a southern dialect of Hindi. Anusaraka uses this dialect to make source text in southern languages accessible to Hindi readers.
5.1.1 Structure of Anusaraka System
The structure of the Anusaraka is shown in Figure 3. A source language sentence is first processed by the morphological analyzer (morph). The morph considers a word at a time, and for each word it checks whether the word is in the dictionary. If found, it returns its grammatical features. It also uses the word paradigms to see whether the input word can be derived from a root and its paradigm. If the derivation is possible, it returns the grammatical features associated with the word form (obtained from the root and the paradigm). In case the input word cannot be derived, it is possibly a compound word and is given to the sandhi package to split it into two or more words, which are then again analyzed by morph.
The output of morph is given as input to the local word grouper. Its main task is to group function words with the content words based on local information such as post-position markers that follow a noun, or auxiliary verbs following a main verb. This grouping (or the case endings, in the case of inflectional languages) identifies the vibhakti of nouns and verbs. The vibhakti of verbs is also called the TAM (tense-aspect-modality) label.
The next stage of processing is that of the mapping block. This stage uses a noun vibhakti dictionary, a TAM dictionary, and a bilingual dictionary. For each word group, the system finds a suitable root and vibhakti in the target language. Thus, it generates a local word group in the target language.
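The local word grouping and mapping stages just described can be sketched as below. This is a toy illustration under stated assumptions: the source-language tokens (src_boy, src_acc and so on) are invented placeholders and not words of any real language, the three dictionaries contain made-up entries, and the function names are hypothetical. The sketch only shows the flow of data: function words are folded into the preceding content word as its vibhakti, and each group is then mapped to a target-language root and vibhakti.

```python
# Hypothetical dictionaries for the mapping block (placeholder entries only).
NOUN_VIBHAKTI = {"0": "0", "src_acc": "ko", "src_inst": "se"}
TAM_DICT = {"src_past": "yA"}
BILINGUAL = {"src_boy": "laDakA", "src_water": "paanii", "src_drink": "pii"}

FUNCTION_CATEGORIES = ("postposition", "auxiliary")


def group_local_words(tokens):
    """tokens: list of (root, category) pairs as produced by morph.

    A function word is attached to the preceding content word as its vibhakti.
    """
    groups = []
    for root, category in tokens:
        if category in FUNCTION_CATEGORIES and groups:
            group = groups[-1]
            if group["vibhakti"] == "0":
                group["vibhakti"] = root
            else:
                group["vibhakti"] += "_" + root
        else:
            groups.append({"root": root, "category": category, "vibhakti": "0"})
    return groups


def map_group(group):
    """Map one source word group to a target-language word group."""
    table = TAM_DICT if group["category"] == "verb" else NOUN_VIBHAKTI
    return {
        # Unknown roots and markers are passed through unchanged in this sketch.
        "root": BILINGUAL.get(group["root"], group["root"]),
        "category": group["category"],
        "vibhakti": table.get(group["vibhakti"], group["vibhakti"]),
    }


if __name__ == "__main__":
    tokens = [("src_boy", "noun"), ("src_water", "noun"),
              ("src_acc", "postposition"), ("src_drink", "verb"),
              ("src_past", "auxiliary")]
    for group in group_local_words(tokens):
        print(map_group(group))
```

Because each group is mapped independently and word order is left untouched, the output stays close to the source sentence, which matches the idea that the reader, not the machine, supplies the world knowledge.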
Figure 3: Block Schematic of Anusaraka
The local word groups in the target language are passed on to a local word splitter (LWS) followed by a morphological synthesizer (GEN). LWS splits the local word groups into elements consisting of root and features. Finally, GEN generates the words from a root and the associated grammatical features.
5.1.2 User Interface
Anusaraka output is usually not the target language, but close to it. Thus, the Kannada-Hindi anusaraka produces a dialect of Hindi that does not have agreement etc. It can be called a sort of Dakshini (southern) Hindi. Some additional notation may also be used in the output. A certain amount of training is needed for a user to get used to the anusaraka output language. The role of the anusaraka interface in Figure 4 is to facilitate the reading of the output by the reader.
Figure 4: Different Interfaces for Anusaraka
5.1.3 Pre-editing and Post-editing
The Anusaraka system has been designed so that man and machine together can perform translations. There are two principal points in this process at which the user can help: pre-editing the input and post-editing the output.
Pre-editing: In the pre-editing task, the input text is corrected and edited by the user: words spelt with non-standard spellings are changed to their standard spellings, external sandhi between words is broken (unless it changes meaning), etc. This is an important task for Indian languages because of lack of standardization and consequent variation.
Post-editing: The anusaraka output is close to the target language, but in general is not grammatical from the target language viewpoint. Therefore it would normally be post-edited by a person before distribution or publication. In post-editing, the output is corrected grammatically and stylistically, considering the cultural and social background of the reader.
5.1.4 Training
The reader of the Anusaraka output would need to undergo training. Besides covering the special symbols used in the output, the training would also familiarize him with the differences in the source and target language.
5.2 Implications
If the anusarakas enter into common use, this has major implications for national integration. The users of anusaraka will learn the features of the source languages they read. Thus, a reader of anusaraka Hindi will learn features of the South Indian language if he uses a Telugu to Hindi anusaraka. Many new constructions will also enter into the language. Thus, on the one hand it will encourage people to work in their own languages, and thus strengthen the various Indian languages; on the other hand, it will further contribute to the mixing of languages. Government can support this activity, but what is needed is for volunteers to come forward for the task. This should happen for the love of our languages.
VI. Paninian Grammar v/s Tree Adjoining Grammar
For Indian languages which do not have long distance dependencies, Paninian Grammar (PG) should perform better than Tree Adjoining Grammar (TAG) because of the price paid for adjunction. To handle free word order, TAGs have to relax ordering, which may lead to a further price in efficiency. PG, on the other hand, utilizes the
vibhakti constraints to build an efficient system. It remains an open question how well PG handles long distance dependencies.
VII. Conclusion and Future Works
Indian languages have relatively free word order. They also have a rich system of case-endings and post-positions. The unique aspect of Paninian grammar is that it is designed for free word order languages and makes use of vibhakti and karaka relations. The Paninian parser turns out to be elegant as well as efficient. It is able to handle diverse phenomena like karaka assignment, active-passives, and control in a unified manner.
Anusaraka is a Machine Translation system developed on the basis of the Paninian framework. It is possible to overcome the language barriers in India using Anusaraka. Anusaraka tries to take advantage of the relative strengths of the computer and the human reader, where the computer takes the language load and leaves the world knowledge load on the reader. It is particularly effective when the languages are similar, as in the case of Indian languages.
References
1. Akshar Bharati, Vineet Chaitanya, Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice Hall of India, Delhi, 1995.
2. Jonardon Ganeri, Artha: Meaning, Oxford University Press, 2006.
3. Rick Briggs, Knowledge Representation in Sanskrit and Artificial Intelligence, AI Magazine, 1985; Volume 6 Number 1: 32-39.
4. Subhash C. Kak, The Paninian Approach to Natural Language Processing, International Journal of Approximate Reasoning, 1987; 1: 117-130.
5. Kulkarni, Amba P., Design and Architecture of anusAraka: An Approach to Machine Translation, Satyam Technical Review vol 3, Oct 2003, pp 57-64.
Anaphora Resolution - An Overview
Athira S, M.Tech Computational Linguistics, GEC Sreekrishnapuram
Lekshmi T S, M.Tech Computational Linguistics, GEC Sreekrishnapuram
Anaphora is the phenomenon of a linguistic expression which acts as a substitute for or reference to some other linguistic form, which generally precedes it. Anaphora Resolution (AR) is the process of automatically finding the pairs of pronouns or noun phrases in a text that refer to the same incident, thing, person, etc., called the referent.
I. Introduction
Anaphora is the phenomenon of a linguistic expression which acts as a substitute for or reference to some other linguistic form, which generally precedes it. Anaphora Resolution (AR) is the process of automatically finding the pairs of pronouns or noun phrases in a text that refer to the same incident, thing, person, etc., called the referent. The first member of the pair is called the antecedent and the next member is called the anaphor. Antecedents and anaphors can co-occur in the same sentence or can be in different sentences. Anaphora Resolution can be done in two steps. First, the pronouns or noun phrases that may be used for indicating a referent are identified. These are called markables. Then, the pairs of such markables that stand in the antecedent-anaphor relation are identified. AR involves a good understanding of the interaction between the syntax, semantics and pragmatics of a language.
An AR system can be helpful in the following areas:
• Machine Translation. In machine translation, anaphora must be resolved for languages that mark the gender of pronouns. One main drawback of most current machine translation systems is that the translation usually does not go beyond sentence level, and so does not deal with discourse understanding successfully. Inter-sentential anaphora resolution would thus be a great assistance to the development of machine translation systems.
• Text Summarization. Many automatic text summarization systems apply a scoring mechanism
to identify the most salient sentences. However, the selected sentences are not always guaranteed to be coherent with each other. Errors can arise if a selected sentence contains anaphoric expressions. To improve the accuracy of extracting important sentences, it is essential to solve the problem of anaphoric references in advance.
• Question-Answering system. A QA system can be improved by resolving anaphora in queries and retrieved documents.
A real-world anaphora resolution system vitally depends on the efficiency of the preprocessing tools which analyse the input before feeding it to the resolution algorithm. Inaccurate pre-processing could lead to a considerable drop in the performance of the system, however accurate the anaphora resolution algorithm may be. In the preprocessing stage a number of hard problems such as morphological analysis / POS tagging, named entity recognition, unknown word recognition, NP extraction, parsing, identification of pleonastic pronouns, selectional constraints, etc. have to be dealt with. Each one of these tasks introduces error and thus contributes to a reduction of the success rate of the anaphora resolution system. Anaphora Resolution is a challenging task, particularly for resource poor languages like the Indian languages.
II. Properties of Anaphora
A number of characteristic properties of anaphora have been proposed, including the following.
• Dependency of Interpretation. An expression is anaphoric only if it depends for its interpretation on a contextually given item.
• Type of Antecedent. The item on which the anaphor depends for its interpretation is called the antecedent. The antecedent may be one of the following:
1. Linguistic expressions, e.g. full noun phrases;
2. Representations of objects. These representations can be thought of as theoretical constructs or models of the mental representations which hearers employ when interpreting a discourse.
3. Objects in the real world, in particular, that part of the world in which the communication takes place, i.e., the utterance situation.
• Location of Antecedent. The antecedent usually precedes the anaphor. It is, however, possible to have the anaphor precede the antecedent. In such situations, it is called cataphora.
• Type of Relation. Coreference does not seem to be a necessary property of anaphora. There exist nonreferring expressions: these are normally not used to single out one or more real world individuals. Therefore they cannot be coreferential with some other expression. Two expressions are coreferential if their real world referents are identical. Alternatively, we can adopt a notion of coreference which applies to individuals on the level of mental representations.
• Constraints on the Relation. In general, an indefinite NP establishes a permanent discourse referent just in case the quantifier associated with it is attached to a sentence that is asserted, implied, or presupposed to be true, and there are no higher quantifiers involved.
III. Existing Anaphora Resolution Approaches
Discourse-oriented theories and formalisms such as DRT and Centering have inspired new research on the computational treatment of anaphora. The drive towards corpus-based robust NLP solutions has further stimulated interest in alternative and/or data-enriched approaches. Last, but not least, application-driven research in areas such as automatic abstracting and information extraction has independently identified the importance of anaphora and coreference resolution.
Rule-Based Approaches
Rule based approaches integrate knowledge sources/factors that discount unlikely candidates until a minimal set of plausible candidates is obtained. Here constraints work as a filter to eliminate unlikely candidates within a set of defined rules. Thereafter, preference-based factors are applied.
Corpus Based Approaches
Corpus oriented approaches make use of the available corpora. These corpora have been created specifically for the discourse task.
Knowledge Poor Approaches
Knowledge poor approaches are inexpensive, fast and reliable. They help in the emergence of cheaper and more reliable corpus-based NLP tools. These approaches are suitable for a vast category of languages as they do not rely on the syntax and semantics of the language. They do not rely on linguistic or domain knowledge and work even without parsing.
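A rule-based, knowledge-poor resolver of the kind outlined above can be sketched in a few lines. The example is deliberately tiny and rests on assumptions: the markables are taken as already identified by preprocessing and annotated with gender and number, the pronoun table covers only three English pronouns, and the single preference used is recency. It is meant to show the filter-then-prefer structure, not a complete or competitive algorithm.

```python
# Agreement features of a few pronouns (None means "unspecified").
# The table and the feature names are assumptions made for this sketch.
PRONOUNS = {
    "he": {"gender": "m", "number": "sg"},
    "she": {"gender": "f", "number": "sg"},
    "they": {"gender": None, "number": "pl"},
}


def agrees(pronoun_features, candidate):
    """Constraint: gender and number must not clash."""
    for key in ("gender", "number"):
        wanted = pronoun_features[key]
        found = candidate.get(key)
        if wanted is not None and found is not None and wanted != found:
            return False
    return True


def resolve(markables):
    """markables: dicts with 'text', 'gender', 'number', in text order.

    Returns (anaphor_index, antecedent_index) pairs.
    """
    links = []
    for i, markable in enumerate(markables):
        features = PRONOUNS.get(markable["text"].lower())
        if features is None:
            continue  # not a pronoun, so not treated as an anaphor here
        # Step 1: constraints filter out unlikely candidates.
        candidates = [j for j in range(i)
                      if markables[j]["text"].lower() not in PRONOUNS
                      and agrees(features, markables[j])]
        # Step 2: a preference (recency) picks among the survivors.
        if candidates:
            links.append((i, max(candidates)))
    return links


if __name__ == "__main__":
    text = [
        {"text": "Sita", "gender": "f", "number": "sg"},
        {"text": "Rama", "gender": "m", "number": "sg"},
        {"text": "she", "gender": "f", "number": "sg"},
        {"text": "he", "gender": "m", "number": "sg"},
    ]
    print(resolve(text))
```

Only preceding markables are considered as candidates, so cataphora is out of scope for this sketch; handling it would mean also looking at markables that follow the pronoun.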
Discourse Based Approaches
Discourse is modeled through a sequence of utterances. A single entity is centered at any given point in the discourse and it has to be distinguished from all other entities that have been evoked. In order to resolve anaphora, world knowledge and inferencing are also employed in this approach.
IV. Works in Indian Languages
4.1 In Sanskrit Texts [Sobha L and Pravin, 2010]
In this work, the computational grammar implemented uses very familiar concepts such as clause, subject, object etc., which are identified with the help of morphological information and concepts such as precede and follow. It makes limited use of grammatical rules and uses only morphological markings to identify subject, object, clause etc. It uses limited parsing: the information required from the parser is limited to parts of speech tagging, clause identification, subject of the clauses and person-number-gender of the noun phrases.
4.2 Random Tree Algorithm [Chatterji, Dhar, Barik, Sartar and Basu, 2011]
In this paper, a data driven approach for anaphora resolution in three Indian languages (Bengali, Hindi, and Tamil) is discussed. The work consists of two steps: identifying markables and identifying links. Markable identification is done using a Conditional Random Field. The identification of links between markables is done using a decision tree algorithm. Weka is used to implement the decision tree algorithm.
4.3 Using CRF in Tamil [Akilandeswari and Sobha, 2012]
Conditional Random Fields are used in this work since they contain a number of feature functions. Salience features and other features are trained into the system for resolving anaphora.
4.4 GuiTAR Based Resolution in Bengali [Senapati and Utpal, 2013]
This AR system uses an off-the-shelf system, GuiTAR. Preprocessing and the anaphora resolution system are the two modules in the architecture of this system.
4.5 VASISTH [Sobha L and Patnaik, 1999]
VASISTH is a multilingual system which presently handles two different languages from two different language families: Malayalam, from the Dravidian family, and Hindi, from the Indo-Aryan family. It can easily be extended to handle other Indian languages, more
generally, other morphologically rich languages.
References
[1] Shalom Lappin, Herbert J Leass, An Algorithm for Pronominal Anaphora Resolution, Computational Linguistics Journal, J94-4002, Pg no: 535-561, 1994
[2] Sobha L, Patnaik B N, VASISTH - An Anaphora Resolution System, Unpublished Doctoral Dissertation, MG University, 1999
[3] Daniel Jurafsky, James H Martin, Speech and Language Processing, 2000
[4] Sobha L, Pravin Pralayankar, Algorithm for Anaphora Resolution in Sanskrit Texts, Proceedings of the 2nd Sanskrit Computational Linguistics Symposium, Brown University, USA, 2008
[5] S Chatterji, A Dhar, B Barik, S Sartar, B Basu, Anaphora Resolution for Bengali, Hindi, Tamil Using Random Tree Algorithm in Weka, Proceedings of the ICON, 2011
[6] Akilandeswari A, Sobha Lalitha Devi, Resolution for Pronouns in Tamil Using CRF, Proceedings of the workshop on machine translation and parsing in Indian languages, Pg no: 103-112, 2012
[7] Apurbalab Senapati, Utpal Garain, GuiTAR-based Pronominal Anaphora Resolution in Bengali, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Pg no: 126-136, 2013
Indian Language Computing Platforms
Sreejith C, M.Tech Computational Linguistics
Hi all, this edition of CLEAR focuses on various aspects of Indian language computing. So in this article I would like to introduce some tools and standards available for Indian language computing.
India is a multilingual country with as many as 22 scheduled languages. Computer technology breaks the language barrier and bridges the gap between the various sections of society through easier access to information in their respective languages, and hence language computing becomes central to the exchange of information across speakers of various languages. A lot of Indian language computing projects are going on. They involve some government sector companies, some volunteer groups and individuals.
The Department of Electronics and Information Technology, India initiated TDIL (Technology Development for Indian Languages, http://tdil.mit.gov.in) with the objective of developing Information Processing Tools and Techniques to facilitate human-machine interaction without language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services.
The Department of Information Technology launched another major initiative called the National Rollout Plan to aggregate these software tools and to make them available through a web based Indian Language Data Centre (ILDC - http://www.ildc.gov.in/). This activity is being executed in close coordination with CDAC, GIST, Pune. Under this plan, user friendly software tools and fonts are being made available free to the public through language CDs and web downloads for the benefit of the masses.
The availability of these software tools, fonts and resources in local languages at no cost is intended to motivate the general public to use ICT tools and technology in their day to day work like Word Processing, Presentation preparation, Spread Sheet preparation, Web Page Surfing & Designing, Messaging etc. in local languages. Further, the consolidated availability of linguistic resources and tools
at one place will help researchers to carry out their research in a smooth and efficient manner.
Project Bhasha (http://www.bhashaindia.com/) is a key milestone in Microsoft's effort to stimulate local language computing and take IT to the masses, driven by the fact that 95 percent of Indians use their local language rather than English in their work and personal life. Being a comprehensive program, it aims to localize (provide local interfaces to) Microsoft's flagship products, Windows and Office.
Bhasha is a cohesive program for bringing together the governments, the academia and the research institutions, the local ISVs and developers and the industry associations on a common ground for promoting local language computing. Microsoft has localized Windows and Office (provided localized User Interface as well as User Assistance) in 12 Indian languages that include Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. Microsoft also provides Indian language support throughout the platform and across the range of products.
C-DAC GIST has always been at the forefront of the development of new tools and technologies. A leader in the area, the GIST Labs have built their expertise with technologies as varied as Natural Language Processing (NLP), Video, Embedded Systems and Word-processing, to name only a few. This tradition of cutting-edge technologies is continually upheld at the GIST Labs, where new tools compatible with the needs and requirements of today's fast developing digital world are being developed (http://pune.cdac.in/html/gist/research-areas/nlp.aspx).
C-DAC GIST has participated in various standardization activities for language technology. GIST is also involved in standardization of heritage scripts of India such as Vedic, Grantha, Samavedic, etc. Heritage script based tools are used for digitising and preserving data for making web-pages or data mining applications, especially by researchers.
Contributions of GIST:
• Need for Standards for Indian language computing
• W3C (Indian Languages on Web)
• Internationalized Domain Names (IDN)
• E-Governance in Indian languages
• Linguistics formats
• Storage standards
• Input standards for Indian languages
• Indian language Display Fonts
Swathanthra Malayalam Computing (SMC) is a free software collective engaged in development, localization, standardization and popularization of various Free and Open Source Software in the Malayalam language. SMC (http://smc.org.in/) has been active since October 2002 and has been working to provide Malayalam language tools that work on all layers of computing, including but not limited to rendering fixes, fonts, input mechanisms, translations (localization), text-to-speech engines, dictionaries, spell checkers and other Indic script based language computing tools across operating systems. SMC is the upstream for Malayalam fonts and tools for popular GNU/Linux based operating systems such as Fedora and Debian. It also maintains localizations for popular Free Software desktops (GNOME/KDE) and popular applications such as Firefox and Libre Office.
The Acharya Web Site (http://www.acharya.gen.in:8080/about.html) disseminates information relating to computing with Indian languages. The information presented at this site reflects the experiences gained at the Systems Development Laboratory in the Department of Computer Science and Engineering at IIT Madras. These experiences relate to the project undertaken at the laboratory, where software tools for developing interactive computer applications in different regional languages of India have been set up.
The Center for Indian Language Technology (CFILT, http://www.cfilt.iitb.ac.in) was set up with a generous grant from the Department of Information Technology (DIT), Ministry of Communication and Information Technology, Government of India in 2000 at the Department of Computer Science and Engineering, IIT Bombay. Prior to this, the Natural Language Processing (NLP) activity of the CSE Department, IIT Bombay took off in 1996 with a grant from the United Nations University, Tokyo to create a multilingual information exchange system for the web. The project, called Universal Networking Language (UNL; www.undl.org), had the participation of 15 research groups across continents.
Today Indian language computing is reaching out for help from varied sources; we are heading towards a boom in Indian language computing. I hope that SIMPLE groups will also contribute to the development of various aspects of Indian language computing in the future.
M.Tech Computational Linguistics, Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram, Palakkad
www.simplegroups.in
simplequest.in@gmail.com
SIMPLE Groups: Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR June 2014
We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in June 2014. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th June, 2014 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
Representative, SIMPLE Groups
Hello World,
With this issue of CLEAR magazine, we are bringing out an edition on INDIAN LANGUAGE COMPUTING. As the term conveys, it involves developing software in Indian languages, i.e., localization of computer applications, web development etc. in Indian languages. There are several organisations and volunteer groups who have come forth with Indian Language Computing projects. A few of them have already been discussed by the authors in their articles in this edition. Microsoft is one among them; it works in close collaboration with state government bodies and nodal IT agencies to create glossaries which are used to create local language interfaces. Its key milestone, Project Bhasha, aims to localize Microsoft's flagship products, Windows and Office. CDAC, IIITM-K, SMC etc. are some organisations working on several Language Computing projects in Malayalam. This reveals a lot of opportunities in the Language Computing area.
Reshma O.K.