CLEAR December 2013, Volume-2 Issue-4
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Sreejith C, Reshma O K, Neethu Johnson, Gopalakrishnan G

Cover page and Layout
Sreejith C

Contents
Editorial ... 4
SIMPLE News & Updates ... 5
1. HTK - The HMM Toolkit for Speech Processing (Abitha Anto, Neethu Johnson, Sincy V Thambi, Varsha K V) ... 8
2. Lucene (Nibeesh K) ... 16
3. FrameNet (Sreejith C) ... 21
4. A Fruitful Data Mining using ORANGE (Manu Madhavan) ... 25
5. R Programming Language (Robert Jesuraj) ... 27
6. Malayalam Morphological Analyser using Apertium Toolkit (Deepa C A, Ancy Antony) ... 30
CLEAR Mar 2014 Invitation ... 34
Last word ... 35
Greetings! This edition of CLEAR, the last one of 2013, focuses on some of the latest tools used in computational linguistics and allied disciplines. They include HTK, a toolkit that is becoming popular among the speech processing community, and Apache Lucene, a free/open-source information retrieval software library, among many others. Lucene, originally created in Java by Doug Cutting, is an open-source Jakarta project used to build and search indexes, and it can index any text-based information you like. There is also an article on the FrameNet corpus, a lexical database of English that is both human- and machine-readable, based on annotated examples. Other articles peep into data mining (with ORANGE), a new programming paradigm (with the R language), and Malayalam morphological analysis (with Apertium). As the New Year approaches, let us hope that these gentle introductions to new tools will help readers chart out exciting journeys into newer domains, resulting in useful findings. The CLEAR team wishes all readers a happy and prosperous 2014!

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES

Publications

• Semantic Role Extraction and General Concept Understanding in Malayalam using Paninian Grammar, Radhika K T, P. C. Reghu Raj, International Journal of Engineering Research and Development, Volume 9, Issue 3, December 2013.
• Extraction of Disease Relationship from Medical Records: Vector Based Approach, Ayisha Noori, P. C. Reghu Raj, International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 3, Issue 2, November 2013.
• Malayalam Wordnet: A Relational Database Approach, Mujeeb Rehman, P. C. Reghu Raj, International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 3, Issue 2, November 2013.
• Morphological Analyzer for Malayalam: Probabilistic Method Vs Rule Based Method, Rinju O. R., Rajeev R. R., Reghu Raj P. C., Elizabeth Sherly, International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), Volume 2, Issue 10, October 2013.
• Architecture of an Ontology-Based Domain-Specific Natural Language Question Answering System, Athira P. M., Sreeja M., P. C. Reghuraj, International Journal of Web & Semantic Technology (IJWesT), Volume 4, No. 4, October 2013.
• Structured Information Extraction from On-line Advertisements - A Bayesian Approach, Saani H, Reghu Raj P. C., International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), Volume 3, Issue 9, September 2013.
• LALITHA: A Light Weight Malayalam Stemmer Using Suffix Stripping Method, Prajitha U, Sreejith C, Reghu Raj P. C., Proceedings of the IEEE International Conference on Control, Communication and Computing, December 13-15, 2013.
• An Effective Malayalam Information Retrieval System Using Query Expansion, Reshma O K, Sreejith C, Reghu Raj P. C., Proceedings of the IEEE International Conference on Control, Communication and Computing, December 13-15, 2013.
• STHREE: Stemmer for Malayalam Using Three Pass Algorithm, Pragisha, Reghu Raj P. C., Proceedings of the IEEE International Conference on Control, Communication and Computing, December 13-15, 2013.
• Design of a Scalable Natural Language Report Management System, Manu Madhavan, Robert Jesuraj, Reghu Raj P. C., Proceedings of the International Conference on Natural Language Processing, 18-20 December 2013.
Workshop on Multilingual Computing

The M.Tech Computational Linguistics programme, Department of Computer Science and Engineering, GEC Sreekrishnapuram, organized a one-day workshop on "Multilingual Computing" (sponsored by TEQIP Phase 2) in association with Swathanthra Malayalam Computing (SMC) on October 7, 2013. The purpose of the workshop was to familiarize participants with the various challenges and issues related to language engineering and computing. The workshop was inaugurated by Dr. P. C. Reghu Raj, Head of the Department of Computer Science and Engineering. The keynote speech was delivered by Mr. Santhosh Thottingal, Senior Software Engineer in the Language Engineering team at the Wikimedia Foundation and project admin of SMC. Mr. Santhosh argued for greater emphasis on software localization and for the need for multilingual computing systems. The workshop was attended by more than 70 participants.
Workshop on Research Methodology and Intellectual Property Rights

A three-day training programme-cum-workshop on Research Methodologies and Intellectual Property Rights was held at GEC Sreekrishnapuram during 21-23 November 2013. The programme was organized by CERD (Centre for Engineering Research and Development) and KSCSTE (Kerala State Council for Science, Technology and Environment), Trivandrum. Eminent personalities from CERD and KSCSTE, along with academicians from reputed institutions, gave sessions on various aspects of research, and more than 30 faculty members and PG students attended. The workshop started with a welcome address by Dr. P. C. Reghu Raj, Principal of GEC Sreekrishnapuram, and was inaugurated by Dr. K. Balan, Director, CERD, Trivandrum, who also delivered the inaugural speech. The first day came to an end with a motivational talk by Dr. K. K. Sasi, Professor, Dept. of EEE, Amrita Viswa Vidyapeetham, Coimbatore. On the second day, the session commenced with a talk by Dr. V. Ajith Prabhu and Mr. Shafikh S on the various funding schemes available from KSCSTE for promoting technical education and research activities. The final day was handled by two eminent academicians, Dr. K. P. Mohandas, Dean (Academic), MES College of Engineering, Kuttipuram, and Dr. K. B. M. Nambudiripad, former Professor of Mechanical Engineering, NIT Calicut.
HTK: The HMM Toolkit for Speech Processing

Abitha Anto, Neethu Johnson, Sincy V Thambi, Varsha K V
M.Tech IIIrd Sem, GEC, Palakkad

Abstract: The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models (HMMs). HMMs can be used to model any time series, and the core of HTK is similarly general-purpose. HTK is primarily designed for building HMM-based speech processing tools, in particular recognizers; thus, much of the infrastructure support in HTK is dedicated to this task. There are two major processing stages involved. First, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions. Second, unknown utterances are transcribed using the HTK recognition tools.
I. Introduction

Speech recognition is the process of transcribing a recorded speech utterance into its corresponding sequence of words. Speech recognition systems generally assume that the speech signal is a realization of some message encoded as a sequence of one or more symbols. To effect the reverse operation of recognizing the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform, on the basis that for the duration covered by a single vector the speech waveform can be regarded as being stationary. Every speech recognition system consists of three major stages: first, the feature extraction stage; second, the phone recognition stage; and then the decoding stage.

HTK can help in all three stages of speech recognition. HTK is a toolkit for building Hidden Markov Models (HMMs). It is primarily designed for building HMM-based speech processing tools, in particular recognizers. The HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions, and the HTK recognition tools then transcribe unknown utterances.

HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions, and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples.

II. THE TOOLKIT

There are mainly four phases in working with the HTK toolkit: data preparation, training, testing and analysis. These processing stages are depicted in Fig. 1.

[Fig. 1: The HTK processing stages]

1. Data Preparation

In order to build a set of HMMs, a set of speech data files and their associated transcriptions are required. Before the speech data can be used in training, it must be converted into the appropriate parametric form, and any associated transcriptions must be converted to have the correct format and use the required phone or word labels. If the speech needs to be recorded, the tool HSLab can be used both to record the speech and to manually annotate it with any required transcriptions.

HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files. The tool HList can be used to check the contents of any speech file, and since it can also convert input on-the-fly, it can be used to check the results of any conversions before processing large quantities of data. The tool HLEd is a script-driven label editor designed to make the required transformations to label files; HLEd can also output files to a single Master Label File (MLF), which is usually more convenient for subsequent processing. HDMan is used to create the dictionary, and HParse creates a word network from the grammar used.

2. Training

The initialization procedure depends on the information available at that time. The commands used are:
• HCompV: computes the overall mean and variance. Input: a prototype HMM.
• HInit: Viterbi segmentation + parameter estimation; for mixture distributions it uses K-means. Input: a prototype HMM, time-aligned transcriptions.
• HRest: Baum-Welch re-estimation. Input: an initialized model set, time-aligned transcriptions.
• HERest: performs embedded Baum-Welch training. Input: an initialized model set, timeless transcriptions.
• HSmooth: smooths a set of context-dependent models according to their context-independent counterparts.
• HHEd: increases the number of Gaussians.
3. Testing

HVite implements the Viterbi algorithm. It takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced, and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance. Recognition can then be performed on either a list of stored speech files or on direct audio input. Testing uses the dictionary and word network created in the data preparation step.

4. Analysis

Once the HMM-based recognizer has been built, it is necessary to evaluate its performance. This is usually done by using it to transcribe some prerecorded test sentences and matching the recognizer output against the correct reference transcriptions. This comparison is performed by a tool called HResults, which uses dynamic programming to align the two transcriptions and then counts substitution, deletion and insertion errors.

III. ALGORITHMS USED IN HTK

1. EM algorithm: Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate the parameters θ of a model. The EM algorithm is a general method for finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. It alternates between two steps, an expectation step and a maximization step: the expectation step assumes values for the unknown parameters, whereas the maximization step maximizes the probability of the expected value.

2. Baum-Welch algorithm: To determine the parameters of an HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum-likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation formulae. The Baum-Welch algorithm uses the EM algorithm and updates the model parameters iteratively until convergence. It uses the forward algorithm, which calculates the probability of the partial sequence of speech from the start time up to time t, ending up in state i. It also uses backward probabilities, which give the probability of the remaining partial sequence of speech given that we started at state i at time t.
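For readers who want the usual notation (this is the standard textbook formulation, not something spelled out in the article itself): the forward probability is alpha_t(i) = P(o_1, ..., o_t, state i at time t | model), and the backward probability is beta_t(i) = P(o_{t+1}, ..., o_T | state i at time t, model). Baum-Welch combines these two quantities to re-estimate the transition and output parameters, and the process is repeated until the likelihood of the training data stops improving.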
3. Viterbi algorithm: This algorithm, illustrated in Fig. 2, can be visualized as finding the best path through a matrix where the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time). Each large dot in the picture represents the log probability of observing that frame at that time, and each arc between dots corresponds to a log transition probability. The log probability of any path is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left to right, column by column.

[Fig. 2: Viterbi search as a best path through the state-time matrix]

IV. AN EXAMPLE: A SIMPLE PHONE RECOGNIZER

Step 1: Create the grammar

HTK provides a grammar definition language for specifying simple task grammars. A grammar consists of a set of variable definitions followed by a regular expression describing the words to recognize, for example:

$word = aa | ae | ... ;

The HTK recognizer actually requires a word network to be defined using a low-level notation called HTK Standard Lattice Format (SLF), in which each word instance and each word-to-word transition is listed explicitly. This word network can be created automatically from the grammar above using the HParse tool. Assuming that the file gram contains the above grammar, executing

HParse gram wdnet

will create an equivalent word network in the file wdnet.
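To make Step 1 concrete, a complete phone-loop grammar file of the kind sketched above could look like the following (the phone list here is only illustrative, not the full phone set used by the authors):

$phone = aa | ae | ah | iy | k | t | sil ;
( < $phone > )

In HTK's grammar notation the angle brackets denote one or more repetitions, so this network accepts any sequence of phones; passing the file through HParse then yields the SLF network that HVite needs.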
Step 2: The dictionary

The dictionary contains all the phones used by the recognizer. It can be built from a standard source using HDMan. Executing

HDMan -m -w wlist -n monophones1 -l dlog dict beep names

will create a new dictionary called dict by searching the source dictionaries beep and names to find pronunciations for each word in wlist. Training sentences can be extracted, for example, from the prompts used with the TIMIT database, and the desired training word list (wlist) can then be extracted automatically from these.

Step 3: Recording the data and creating the transcription files

The training and test data are recorded using the HTK tool HSLab, which is a combined waveform recording and labeling tool. If we do not have preexisting training sentences (such as those from the TIMIT database), we can create them either from preexisting text (as described above) or by labeling our own training utterances using HSLab. HSLab is invoked by typing

HSLab noname

This will cause a window to appear with a waveform display area in the upper half and a row of buttons, including a record button, in the lower half. When the name of a normal file is given as argument, HSLab displays its contents; here, the special file name noname indicates that new data is to be recorded.

To train a set of HMMs, every file of training data must have an associated phone-level transcription. We create a Master Label File (MLF), a file containing a complete set of phone-level transcriptions. Phone-level transcriptions are made out of word-level transcriptions using the HLEd command.

Step 4: Extracting features

The final stage of data preparation is to parameterize the raw speech waveforms into sequences of feature vectors. HTK supports both FFT-based and LPC-based analysis. Here Mel Frequency Cepstral Coefficients (MFCCs) [5], which are derived from FFT-based log spectra, will be used. Coding can be performed using the tool HCopy configured to automatically convert its input into MFCC vectors. To do this, a configuration file (config) is needed which specifies all of the conversion parameters.

HCopy will copy one or more data files to a designated output file, optionally converting the data into a parameterized form. While the source files can be in any supported format, the output format is always HTK. By default, the whole of the source file is copied to the target, but options exist to copy only a specified segment. Hence, this program is used to convert data files in other formats to the HTK format, to concatenate or segment data files, and to parameterize the result. Here, HCopy is used to convert the parameter kind of a file, for example from WAVEFORM to MFCC, depending on the configuration options. To run HCopy, a list of each source file and its corresponding output file is needed; this list is created and acts as the script file. Executing the command

HCopy -T 1 -C config -S script

creates the MFCC features at the locations specified in the script file.
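For illustration, the configuration file for this step could look like the sketch below. The parameter values are typical HTK defaults rather than anything prescribed by the article:

# config: conversion parameters read by HCopy
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
NUMCHANS     = 26
NUMCEPS      = 12

The script file passed with -S is simply a list with one "source target" pair per line, e.g. data/train/s0001.wav data/train/s0001.mfc (the file names here are made up for the example). With these two files in place, HCopy -T 1 -C config -S script writes one .mfc file per input waveform.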
Step 5: Making initial models

The first step in HMM training is to define a prototype model. A prototype HMM set (the initial models) for the phones is created. The starting point is a set of identical monophone HMMs in which every mean is set to zero and every variance to one. These initial models are then retrained.

Step 6: Re-training

The HTK tool HCompV will scan a set of data files, compute the global mean and variance, and set all of the Gaussians in a given HMM to have that same mean and variance. Hence, assuming that a list of all the training files is stored in train.scp, the command

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

will create a new version of proto in the directory hmm0 in which the zero means and unit variances above have been replaced by the global speech means and variances.

Given this new prototype model stored in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone. The phone models stored in the directory hmm0 are then re-estimated using the embedded re-estimation tool HERest. The command is

HERest -C config -I phones0.mlf -S train.scp -H hmm0/hmmdefs -M hmm1 monophones0

where phones0.mlf is the phone-level transcription, train.scp is the script file containing the mfc paths, hmmdefs holds the initial models and hmm1 is the new directory being created. Each time HERest is run it performs a single re-estimation, and each new HMM set is stored in a new directory. Execution of HERest should be repeated twice more, changing the names of the input and output directories.
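Spelled out, the two further passes simply shift the directory names along, following the hmmN naming convention used above:

HERest -C config -I phones0.mlf -S train.scp -H hmm1/hmmdefs -M hmm2 monophones0
HERest -C config -I phones0.mlf -S train.scp -H hmm2/hmmdefs -M hmm3 monophones0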
After the appropriate iterations of HERest, the number of Gaussians can be increased using the command HHEd. HHEd works in a similar way to HLEd: it applies a set of commands in a script to modify a set of HMMs. In this case, it is executed as follows:

HHEd -H hmm4/hmmdefs -M hmm5 sil.hed monophones1

where hmm4 is the input directory and hmm5 is the output directory. The models are then retrained again with HERest using the increased Gaussian count, as appropriate.

Step 7: Testing

The phone models created so far can be used to realign the training data and create new transcriptions. This can be done with a single invocation of the HTK recognition tool HVite. HVite is a general-purpose Viterbi word recognizer. It will match a speech file against a network of HMMs and output a transcription for each. When performing N-best recognition, a word-level lattice containing multiple hypotheses can also be produced. HVite is invoked via the command line:

HVite -b silence -C config -H hmm7/hmmdefs -i out.mlf -S test.scp -w wdnet dummy dict

This command uses the HMMs stored in hmm7 to transform the test data, according to the final HMM model, into the new phone-level transcription out.mlf, using the pronunciations stored in the dictionary dict. Here the recognizer considers all pronunciations for each word and outputs the pronunciation that best matches the acoustic data.

Step 8: Analysis

HResults is the HTK performance analysis tool. It reads in a set of label files (typically output from a recognition tool such as HVite) and compares them with the corresponding reference transcription files. Suppose that out.mlf contains the recognizer output transcriptions, ref.mlf contains the corresponding reference transcriptions, and dict.txt contains a list of all labels appearing in these files. Then typing the command

HResults -I ref.mlf dict.txt out.mlf

will return the performance analysis of the phone recognizer.

V. CONCLUSION

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.

References

[1] Christopher D. Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, May 1999.
[2] Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics", 2nd edition, Prentice-Hall, 2009.
[3] Shanthi Therese S and Chelpa Lingam, "Review of Feature Extraction Techniques in Automatic Speech Recognition", International Journal of Scientific Engineering and Technology, Volume 2, Issue 6, pp. 479-484.
[4] http://htk.eng.cam.ac.uk/docs/docs.shtml
[5] http://www.ling.ohio-state.edu/~bromberg/htk_problems.html
Jibbigo is a mobile offline language translation application that was developed by Mobile Technologies, LLC and Dr. Alex Waibel, a professor at Carnegie Mellon. Jibbigo is an offline voice translator and does not need phone or data connectivity to function. Spanish-English Jibbigo was released in September, 2009 as the first offline Speech Translation application. The company has since expanded its offerings to include ten language pairs sold on both Apple's App Store and Google Play. In Jibbigo, the user holds down a record button and says a phrase. The phrase then appears as text in both languages and is spoken aloud in the target language. The app also includes an add name function, a background dictionary, and other features. On iOS, it is compatible with Voice Over for vision impaired users. Ref: http://jibbigo.com/
Lucene

Nibeesh K
M.Tech Computational Linguistics, GEC, Palakkad
nibeesh@gmail.com

Abstract: Apache Lucene is a free/open-source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to other programming languages including Delphi, Perl, C#, C++, Python, Ruby, and PHP. Doug Cutting originally wrote Lucene in 1999, and it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently, it included a number of sub-projects, such as Lucene.NET, Mahout, Solr and Nutch. Solr has since merged into the Lucene project itself, while Mahout, Nutch, and Tika have moved on to become independent top-level projects. Version 4.0 was released on 12 October 2012.
I. Introduction

Apache Lucene is an open-source project that provides Java-based indexing and search technology. Using its API, it is easy to implement full-text search. The Apache Lucene project includes the following:

• Lucene Core provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
• Solr is a high-performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
• The Open Relevance Project is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.
• PyLucene is a Python port of the Core project.

One of the key factors behind Lucene's popularity is its simplicity. You don't need in-depth knowledge about how Lucene's information indexing and retrieval work in order to start using it. Moreover, Lucene's straightforward API requires using only a handful of classes to get started.

Lucene is used in a surprisingly diverse and growing number of places: NetFlix, Digg, MySpace, LinkedIn, Ticketmaster, Fedex, SalesForce.com, Apple, the Encyclopedia Britannica CD-ROM/DVD, the Eclipse IDE, the Mayo Clinic, New Scientist magazine, Atlassian (JIRA), Epiphany, MIT's OpenCourseWare and DSpace, the Hathi Trust Digital Library, and Akamai's Edge-Computing platform.
II. What Lucene can do

Lucene allows you to add search capabilities to your application. Lucene can index and make searchable any data that you can extract text from. This means you can index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, XML or HTML or PDF files, or any other format from which you can extract textual information. With Lucene you can likewise index and search email messages, mailing-list archives, instant messenger chats, and much more.

III. The core indexing classes

The following classes are used to perform the simplest indexing procedure:

1. IndexWriter
2. Directory
3. Analyzer
4. Document
5. Field

[Figure 1: Classes used when indexing documents]

[1] IndexWriter: This is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index.

[2] Directory: The Directory class represents the location of a Lucene index. It is an abstract class. We can use FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and pass that in turn to IndexWriter's constructor.

[3] Analyzer: Before text is indexed, it is passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting from the text those tokens that should be indexed.

[4] Document: The Document class represents a collection of fields. The fields of a document represent the document itself or metadata associated with that document; metadata such as author, title, subject and date modified is indexed and stored separately as fields of the document. A document is simply a container for multiple fields; Field is the class that holds the textual content to be indexed.

[5] Field: Each document in an index contains one or more named fields, embodied in a class called Field. A document may have more than one field with the same name; in this case, the values of the fields are appended, during indexing, in the order they were added to the document.

[Figure 2: A Lucene index consists of documents. Each document has a number of fields, and the contents of a field can consist of one or more terms. The number of unique terms is one of the criteria determining the memory requirements of an index.]

IV. The core searching classes

Only a few classes are needed to perform the basic search operation:

1. IndexSearcher
2. Term
3. Query
4. TermQuery
5. TopDocs

[1] IndexSearcher: IndexSearcher is the class that opens an index in read-only mode. It requires a Directory instance holding the previously created index, and then offers a number of search methods.

[2] Term: This is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field.

[3] Query: Lucene comes with a number of concrete Query subclasses. These include BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery.
[4] TermQuery: This is the most basic type of query supported by Lucene, and it is one of the primitive query types. It is used for matching documents that contain fields with specific values.

[5] TopDocs: The TopDocs class is a simple container of pointers to the top N ranked search results, i.e. documents that match a given query.
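As a quick illustration of how the classes above fit together, here is a minimal sketch in Python using the PyLucene port mentioned in the Introduction. It follows the flat-namespace style of the older PyLucene releases (the same era as the demo referenced in [4]); the class names are the ones described above, but exact constructor signatures vary between Lucene versions, so treat this as a sketch rather than copy-paste code.

import lucene
lucene.initVM()

# Indexing: Directory + Analyzer + IndexWriter + Document + Field
directory = lucene.SimpleFSDirectory(lucene.File("index"))
analyzer = lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT)
writer = lucene.IndexWriter(directory, analyzer, True,
                            lucene.IndexWriter.MaxFieldLength.UNLIMITED)
doc = lucene.Document()
doc.add(lucene.Field("contents", "Apache Lucene is a search library",
                     lucene.Field.Store.YES, lucene.Field.Index.ANALYZED))
writer.addDocument(doc)
writer.close()

# Searching: IndexSearcher + Term + TermQuery + TopDocs
searcher = lucene.IndexSearcher(directory, True)
query = lucene.TermQuery(lucene.Term("contents", "lucene"))
top_docs = searcher.search(query, 10)   # a TopDocs object
print top_docs.totalHits
searcher.close()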
V. How to use

We can use Lucene from the command line or programmatically. From the command line, we can use

java org.apache.lucene.demo.IndexFiles {full-path-to-documents-directory} [4]

to create the search index. This command will create a directory called index which contains an index of all the text documents in the given directory. After that we can search for documents using

java org.apache.lucene.demo.SearchFiles [4]

This will prompt for a query; type in a search term and press the enter key, and you will get the search results. The same thing can be implemented in web servers such as Apache Tomcat. For this you should type

java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

in any subdirectory of the {tomcat}/webapps directory (make sure you don't leave off the "..", or you'll get a null pointer exception). {index-dir} should be a directory that Tomcat has permission to read and write, but that is outside of a web-accessible context. By default the webapp is configured to look in /opt/lucene/index for this index. After that you can use the URL http://localhost:8080/luceneweb to get a search interface. [4]

VI. Advantages of Lucene

• Scalable, high-performance indexing: incremental indexing as fast as batch indexing, with an index size roughly 20-30% of the size of the text indexed.
• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more.
• Fielded searching (e.g. title, author, contents).
• Cross-platform solution.

References

[1] Michael McCandless, Erik Hatcher and Otis Gospodnetic, Lucene in Action (Second Edition), Manning Publications Co., Stamford, 2010.
[2] http://lucene.apache.org/, visited on 26 Nov 2013.
[3] http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html, visited on 26 Nov 2013.
[4] http://lucene.apache.org/core/2_9_4/demo.html, visited on 26 Nov 2013.
[5] http://lucene.apache.org/core/features.html, visited on 26 Nov 2013.
Yahoo Snags SkyPhrase for Natural-Language Processing.... Yahoo has gone and bought another company, pushing its acquisition tally for the year up to 23. The Internet giant has purchased natural-language outfit SkyPhrase for an undisclosed sum. SkyPhrase and its small team of four will become part of the Yahoo Labs team in New York, where they will presumably bring their natural-language-processing acumen to bear on Yahoo’s search and mail products.
FrameNet

Sreejith C
M.Tech IIIrd Sem, GEC, Palakkad

Abstract: FrameNet is a project housed at the International Computer Science Institute in Berkeley, California, which produces an electronic resource based on a theory of meaning called frame semantics. FrameNet reveals, for example, that the sentence "John sold a car to Mary" essentially describes the same basic situation (semantic frame) as "Mary bought a car from John", just from a different perspective.
FrameNet is a project housed at the International Computer Science Institute in Berkeley, California, which produces an electronic resource based on a theory of meaning called frame semantics. FrameNet reveals, for example, that the sentence "John sold a car to Mary" essentially describes the same basic situation (semantic frame) as "Mary bought a car from John", just from a different perspective. A semantic frame can be thought of as a conceptual structure describing an event, relation, or object and the participants in it. The FrameNet lexical database contains around 1,200 semantic frames, 13,000 lexical units (a pairing of a word with a meaning; polysemous words are represented by several lexical units) and over 190,000 example sentences.

FrameNet is largely the creation of Charles J. Fillmore, who developed the theory of frame semantics that the project is based on and was initially the project leader when the project began in 1997.[1] Collin Baker became the project manager in 2000.[2] The FrameNet project has been influential in both linguistics and natural language processing, where it led to the task of automatic Semantic Role Labeling.

The FrameNet project is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. From the student's point of view, it is a dictionary of more than 10,000 word senses, most of them with annotated examples that show the meaning and usage. For the researcher in Natural Language Processing, the more than 170,000 manually annotated sentences provide a unique training dataset for semantic role labeling, used in applications such as information extraction, machine translation, event recognition, sentiment analysis, etc. For students and teachers of linguistics it serves as a valence dictionary, with uniquely detailed evidence for the combinatorial properties of a core set of the English vocabulary. The project has been in operation at the International Computer Science Institute in Berkeley since 1997, supported primarily by the National Science Foundation, and the data is freely available for download; it has been downloaded and used by researchers around the world for a wide variety of purposes.

FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is straightforward: the meanings of most words can best be understood on the basis of a semantic frame, a description of a type of event, relation, or entity and the participants in it. For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the FrameNet project, this is represented as a frame called Apply_heat, and Cook, Food, Heating_instrument and Container are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. Other frames are more complex, such as Revenge, which involves more FEs (Offender, Injury, Injured_Party, Avenger, and Punishment); others are simpler, such as Placing, with only an Agent (or Cause), a thing that is placed (called a Theme) and the location in which it is placed (Goal).

Many common nouns, such as tree, hat or tower, usually serve as dependents which head FEs, rather than clearly evoking their own frames, so we have devoted less effort to annotating them, since information about them is available from other lexicons, such as WordNet (Miller et al. 1990). We do, however, recognize that such nouns also have a minimal frame structure of their own, and in fact the FrameNet database contains slightly more nouns than verbs.
Formally, FrameNet annotations are sets of triples that represent the FE realizations for each annotated sentence, each consisting of a frame element name (for example, Food), a grammatical function (say, Object) and a phrase type (say, noun phrase (NP)). We can think of these three types of annotation on each FE as "layers", but the grammatical function and phrase-type layers are not displayed in the web-based report system, to avoid visual clutter. The downloadable XML version of the data includes these three layers (and several more not discussed here) for all of the annotated sentences, along with complete frame and FE descriptions, frame-frame relations, and lexical entries for each annotated LU. Most of the annotations are of separate sentences annotated for only one LU, but there is also a collection of texts in which all the frame-evoking words have been annotated; the overlapping frames provide a rich representation of much of the meaning of the entire text. The FrameNet team have defined more than 1,000 semantic frames and have linked them together by a system of frame relations, which relate more general frames to more specific ones and provide a basis for reasoning about events and intentional actions.

Because the frames are basically semantic, they are often similar across languages; for example, frames about buying and selling involve the FEs Buyer, Seller, Goods, and Money, regardless of the language in which they are expressed. Several projects are underway to build FrameNets parallel to the English FrameNet project for languages around the world, including Spanish, German, Chinese, and Japanese, and frame semantic analysis and annotation has been carried out in specialized areas from legal terminology to soccer to tourism.

FrameNet has proven useful in a number of computational applications, because computers need additional knowledge in order to recognize that "John sold a car to Mary" and "Mary bought a car from John" describe essentially the same situation, despite using two very different verbs, different prepositions and a different word order. FrameNet has been used in applications like question answering, paraphrasing, recognizing textual entailment, and information extraction, either directly or by means of Semantic Role Labeling tools. The first automatic system for Semantic Role Labeling (SRL, sometimes also referred to as "shallow semantic parsing") was developed by Daniel Gildea and Daniel Jurafsky based on FrameNet in 2002, and Semantic Role Labelling has since become one of the standard tasks in natural language processing. Since frames are essentially semantic descriptions, they are similar across languages, and several projects have arisen over the years that have relied on the original FrameNet as the basis for additional non-English FrameNets, for Spanish, Japanese, and German, among others.
The Kicktionary: a multilingual electronic dictionary of football (soccer) language. Is it based on FrameNet?

The Kicktionary is a multilingual (German - English - French) electronic dictionary of the language of football (soccer). It was developed between September 2005 and July 2006 at the FrameNet project at the International Computer Science Institute (ICSI) in Berkeley. The main aim of the project was (and is) to explore how linguistic theories about lexical semantics, methods from corpus linguistics, technologies for hypertext and hypermedia, and techniques from computer language processing can help to make lexical resources that are better than (or: good in a manner different from) traditional paper dictionaries.

The Kicktionary currently contains close to 1,900 lexical units (nouns, verbs, adjectives and idiomatic expressions) in German, English and French. For each lexical unit, there are between one and ten annotated example sentences from a corpus of football match reports. The annotations identify the lexical unit itself as well as its arguments and, as the case may be, a support verb or support preposition.

Based on an analysis of their semantics and argument structure, lexical units are grouped into roughly a hundred frames, such that lexical units in the same frame share important semantic and syntactic characteristics. The frames, in turn, are assigned to one of 16 scenes, where each scene corresponds to a prototypical event (e.g. a goal or a one-on-one situation) of a football match.

In addition to the scenes-and-frames hierarchy, lexical units are also organised into synsets, i.e. into groups of words with identical or largely similar meanings. Synsets, in turn, are the building blocks of a number of concept hierarchies, each of which organises a set of synsets into a tree via lexical relations such as hypernymy/hyponymy (X is-a-kind-of Y), holonymy/meronymy (X is-a-part-of Y) and troponymy (to X is to Y in some way).

References

[1] https://framenet.icsi.berkeley.edu/fndrupal/
[2] http://www.kicktionary.de/background.html
[3] http://en.wikipedia.org/wiki/FrameNet
A Fruitful Data Mining Using ORANGE

Manu Madhavan
Assistant Professor, Dept. of CSE, SIMAT, Palakkad

This article is a brief, not-too-detailed introduction to the Orange data mining tool and its use with Python scripts. Have fun!

Data Mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The area has special applications in fields like medical science, business intelligence, and similar areas of real life. There are many open-source and proprietary software tools available for data mining applications and research. Orange (http://orange.biolab.si) is a general-purpose, open-source machine learning and data mining tool. It includes a set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. Here, Python scripting using the orange package is discussed.

Install Orange

To build and install Orange you can use the setup.py in the root orange directory (this requires GCC, Python and the numpy development headers). The details are available on the Orange documentation page (http://orange.biolab.si/download/).

Test the installation

After installing Orange, type the following in a Python interactive shell:

>>> import Orange
>>> Orange.version.version
'2.6a2.dev-a55510d'

If this gives no error or warning, Orange and Python are properly installed and you are ready to continue.

Data Input

Orange can read files in its native format and in other data formats. The native format starts with the feature (attribute) names and their types (continuous, discrete, string). The third line contains meta information to identify dependent features (class), irrelevant features (ignore) or meta features (meta). You may download lenses.tab to a target directory and open a Python shell there:

>>> import Orange
>>> data = Orange.data.Table("lenses")

Data mining using Orange-Python

Orange-Python can be used for all data mining applications, such as classification, prediction, clustering and learning. Here I illustrate how Orange can be used for classification. For classification, you need a training data set and a test data set. The data readable by Orange methods are stored in a .tab file (stored as Tables). The table contains the different features of the data and a class value. For training data, the class value is the actual class to which the feature vector belongs. In the case of the test data set, the class value is absent and has to be predicted by a learning function. Orange has a vast variety of learning functions, such as KNN, least mean square error, Naive Bayes and logistic/linear regression. The following sample code uses the KNN method to predict the class values of a test data set.

Let the training data set be stored in trainset.tab and the test data set in testset.tab. The following script will print the class of each test instance:

>>> train = Orange.data.Table("trainset")
>>> test = Orange.data.Table("testset")
>>> learner = Orange.classification.knn.kNNLearner()
>>> classifier = learner(train)
>>> for i in range(len(test)):
...     print i, classifier(test[i])

This will print the class of each vector in the test set. You can verify the result by comparing it with the output of other learning tools. Hope you got it; enjoy the experiment!
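Beyond predicting labels one by one, the same Orange 2.x package also ships evaluation helpers, so a learner can be scored with cross-validation instead of inspecting predictions by hand. A small sketch (module paths as in Orange 2.x, which this article targets; the API changed later in Orange 3):

>>> learners = [Orange.classification.knn.kNNLearner(),
...             Orange.classification.bayes.NaiveLearner()]
>>> results = Orange.evaluation.testing.cross_validation(learners, train, folds=5)
>>> print Orange.evaluation.scoring.CA(results)

CA returns one classification-accuracy value per learner, which makes it easy to compare the KNN model above against an alternative on the same data.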
R Programming Language

Robert Jesuraj
R&D Engineer @ Arcadix

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display, either directly at the computer or on hard copy, and
• a well developed, simple and effective programming language (called 'S') which includes conditionals, loops, user-defined recursive functions and input and output facilities. (Indeed most of the system-supplied functions are themselves written in the S language.)

R uses CRAN (the Comprehensive R Archive Network). The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices, import/export capabilities, reporting tools, etc. These packages are developed primarily in R, and sometimes in Java, C and Fortran. A core set of packages is included with the installation of R, with 5300 additional packages available.

Natural Language Processing using R

The CRAN task view collects relevant R packages that support computational linguists in conducting analysis of speech and language on a variety of levels, setting focus on words, syntax, semantics, and pragmatics. In recent years, we have elaborated a framework to be used in packages dealing with the processing of written material: the package tm.

Some of the frameworks and packages:

• tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article "Text Mining Infrastructure in R" gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
• tm.plugin.dc allows for distributing corpora across storage devices (local files or the Hadoop Distributed File System).
• tm.plugin.mail helps with importing mail messages from archive files such as those used by Thunderbird (mbox, eml).
• tm.plugin.factiva allows importing press/Web corpora from Dow Jones Factiva.
• RcmdrPlugin.temis is an R Commander plug-in providing an integrated solution to perform a series of text mining tasks such as importing and cleaning a corpus, and analyses like term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis and hierarchical clustering.
• openNLP provides an R interface to OpenNLP, a collection of natural language processing tools including a sentence detector, tokenizer, POS tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models. Trained models for English and Spanish to be used with openNLP are available from http://datacube.wu.ac.at/ as the packages openNLPmodels.en and openNLPmodels.es, respectively.
• RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Especially useful in the context of natural language processing is its functionality for tokenization and stemming.

Semantics:

• Lsa provides routines for performing latent semantic analysis with R. The basic idea of latent semantic analysis (LSA) is that texts do have a higher-order (latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) over a given document-term matrix, this variability problem can be overcome. The article "Investigating Unstructured Texts with Latent Semantic Analysis" gives a detailed overview and demonstrates the use of the package with examples from the area of technology-enhanced learning.
• Topicmodels provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors, and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
• Lda implements Latent Dirichlet Allocation and related models, similar to Lsa and Topicmodels.
• Kernlab allows one to create and compute with string kernels, like full string, spectrum, or bounded range string kernels. It can directly use the document format used by tm as input.
• Skmeans helps with clustering, providing several algorithms for spherical k-means partitioning.
• MovMF provides another clustering alternative (approximations are fitted with von Mises-Fisher distributions of the unit length vectors).
• RTextTools is a machine learning package for text classification. It implements nine different algorithms (svm, slda, boosting, bagging, rf, glmnet, tree, nnet, and maxent) and routines supporting the evaluation of accuracy.
• Textir is a suite of tools for text and sentiment mining.
• Textcat provides support for n-gram based text categorization.
• Corpora offers utility functions for the statistical analysis of corpus frequency data.

Pragmatics:

• Qdap helps with quantitative discourse analysis of transcripts.

In conclusion, R is very useful for natural language processing, so get into it. To start learning, use the reference links.

References

[1] http://www.r-project.org/
[2] http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Malayalam Morphological Analyser using Apertium Toolkit

Ancy Antony, Deepa C A
M.Tech Computational Linguistics, University of Calicut

Abstract: Malayalam morphology exhibits a wide range of inflections, multiple suffixes and a tendency of adjacent words to concatenate. The agglutinative nature of the Malayalam language creates several challenges in producing reasonable morphological processing results. Here an attempt is made to explore Malayalam morphology and to create morphologically processed results using a rule-based machine translation toolkit known as Apertium.

I. Introduction

The Dravidian language family is one of the important groups of languages spoken in South India. There are four recognized Dravidian languages: Telugu, Malayalam, Kannada and Tamil. The most characteristic feature of the Dravidian languages is that they are agglutinative and exhibit the inclusive and exclusive feature. Morphological analysis is the segmentation of words into (usually) their component morphemes, the assignment of grammatical information to grammatical categories, and the assignment of lexical information to a particular lexeme or lemma. Morphological analysis consists of the identification of parts of words or, more technically, constituents of words. Malayalam is an inflectionally rich language: the major lexical items like nouns and verbs are inflected for plural marking, case, tense, aspect and mood respectively. There are different methods for the morphological analysis of natural language:

• Brute force method
• Root-driven method
• Affix stripping method

In the root-driven method, the root or stem is identified first and the affixes are then parsed. In the affix stripping method, the process takes place in the reverse direction: the affixes are identified first and the remaining part is assumed to be the stem or root. This document describes the implementation of a morphological analyser for Malayalam using the Apertium tool.

II. The Apertium Tool

A Malayalam morphological analyser can be built using the Apertium toolbox. Apertium is free software released under the terms of the GNU General Public License. It is an open-source shallow-transfer machine translation system which originated within the project "Open-Source Machine Translation for the Languages of Spain". Lttoolbox, a module in the Apertium package, is the main tool used for designing the system. It tokenizes the text in surface form and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. For example, for a Malayalam noun it will give its category, gender, case marker and other features of a noun.

A. Lttoolbox

Lttoolbox can be used for lexical processing, morphological analysis and generation. Lttoolbox performs its processing with the help of a lexical dictionary, using a finite state transducer (FST) approach; the class of FST used in Lttoolbox is the letter transducer. The package is split into three programs: lt-comp, the compiler; lt-proc, the processor; and lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.

Compilation: lt-comp compiles the dictionary file (.dix) and produces a binary file. Compiling LR (left to right) creates an analyser, and compiling RL (right to left) creates a generator. The compilation step is: "lt-comp [ lr | rl ] dictionaryfile binaryfile".

Processing: lt-proc processes the binary file. Its two main modes are analysis (-a), which converts surface forms into the set of possible lexical forms, and generation (-g), which converts a lexical form into the corresponding surface form. It is executed using "lt-proc [-a|-g] binaryfile [inputfile [outputfile]]".

Expansion: lt-expand enables one to see the complete output of the dictionary, that is, all of the mappings between lexical and surface forms.

B. Dictionary Structure

A dictionary is used by Lttoolbox for the morphological processing. The dictionary is in XML format with a .dix or .metadix extension and follows a block structure. The dictionary includes the alphabet definition, the definition of symbols, the definition of paradigms, and a section (conditional or unconditional) with the dictionary entries:

• Alphabet definition: the list of alphabets used in the dictionary file.
• Definition of symbols: the grammatical symbols that are present in the files.
• Definition of paradigms: the several paradigm definitions used in the dictionary.

C. Paradigms and their definitions

A paradigm is a complete set of related inflectional and productive derivational word forms of a given category. Here the paradigms for the noun and verb classes have been implemented. The definition refers to the features and feature values of the root, such as category, gender, number, person and case marking in the case of nouns, and tense, aspect and modal category information in the case of verbs, and so forth.

The paradigms can be viewed as small dictionaries which specify regularities in the lexical processing of the dictionary entries. To specify these regularities, each paradigm has a list of entries <e> like the ones in the dictionary. The paradigm entries consist of a pair (<p>) with a left side (<l>) and a right side (<r>). These elements contain text or grammatical symbols (<s>), for example: <s n="N_NOUN" />. Sometimes a paradigm definition contains entries of another paradigm definition.

III. The Malayalam Morphological Analyser

The Malayalam morphological analyser is implemented by following a paradigm approach. It uses a dictionary that contains the inflections of Malayalam words, covering about 24 noun paradigms and 56 verb paradigms.

[Figure: Architecture of the Malayalam Morphological Analyser]

The input Malayalam text is provided to the morphological processing system through the user interface. The input text is processed by the paradigms defined in the dictionary. The dictionary file, which contains all the noun and verb paradigms, is compiled by lt-comp, which results in a binary file. This is then processed by the lt-proc and lt-expand commands: the binary file is morphologically analysed by the lt-proc instruction to produce the lexical form, and all the mappings between the surface form and the lexical form are given by lt-expand. The output is given to a post-processing module which puts the result into a proper format, and it is then displayed through the output interface.

A. Noun and verb paradigms

To understand the paradigm design, the surface form of a word can be considered as two parts, a root and a suffix. By examining the suffix portion of a noun word, grammatical information like case, gender and number can be obtained. The main task is identifying the suffix and the root of a particular surface form. With the help of a compiled dictionary file, Lttoolbox will perform the morphological analysis provided by a particular noun entry present in the dictionary. For example, consider a noun and the seven case markers in its suffix part:

[Example: a Malayalam noun with its paradigm definition and seven case-marker forms]

The plural form of the word is the same word itself; its plural case marker is 'ര'. This paradigm can be used for the nouns which have the same inflection pattern. The various tags used include N_NOUN for the nominative noun, SG for singular, PL for plural, MF for masculine/feminine, NEUTER for neuter, M for masculine and F for feminine.

The lemma ('lm') contains the surface form of the word. The '<i>' element contains the root form of the word, which in Malayalam may or may not be the actual root word; the actual root is obtained after processing. When an input word is given, lt-proc checks whether it starts with any of the '<i>' elements in the paradigms. If a match occurs, the corresponding paradigm is used for processing the word. Malayalam verbs are somewhat complex and have multiple suffixes; there are about 53 verb paradigms [1].

[Example: dictionary entry and verb paradigm definition]

All the paradigms are provided, and they are processed by the system, which gives the lexical forms.

IV. Conclusion and Future Works

A small step towards developing a simple morphological analyser for Malayalam has been taken using the Apertium toolbox. Many issues, such as multiple suffixes and the large number of inflections, still exist and have to be resolved. All the rules are provided in the dictionary, so the accuracy depends upon the precision of the dictionary. Some other statistical method combined with the tool would be a better approach for the morphological analysis of Malayalam.

References

[1] Vinod P M, Jayan V, Bhadran V K, "Implementation of Malayalam Morphological Analyser using Hybrid Approach", Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing, 2012.
[2] Parameshwari K, "An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil", Language in India (www.languageinindia.com), Special Volume: Problems of Parsing in Indian Languages, 11:5, May 2011.
[3] Vinod P M, Jayan V, Sulochana K G, "Malayalam Morphological Analyser: A Hybrid Approach with Apertium Lttoolbox", Proceedings of ICON-2011: 9th International Conference on NLP, Macmillan Publishers, India, pp. 219-224, 2011.
[4] Jisha P Jayan, Rajeev R R, S Rajendran, "Morphological Analyser for Malayalam: A Comparison of Different Approaches", IJCSIT, Vol. 2, No. 2, Dec 2009, pp. 155-160.
[5] http://www.apertium.org/, visited on Nov 2013.
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR Mar 2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on the multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering and Research) magazine, to be published in March 2014. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th Dec, 2013 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
SIMPLE Groups
Hello World,

Innovation can be considered a key aspect of modern society. Our lifestyles, society and work have been changed by new technological innovations. The academic study of innovation (its exploitation, the policy issues that arise, its driving forces, and its social and economic consequences) has grown rapidly from a modest start about half a century ago. Present practices are inadequate to meet changes in work and knowledge while serving a greater number of students with diverse backgrounds and educational objectives. A paradigm shift from instruction to learning is required to adequately serve the clients of educational institutions, which in turn requires an alteration in procedures for improved outcomes. Educational institutions, like all other organisations, require constant monitoring to identify areas for potential improvement. However, educational reforms are often not well implemented, and this results in massive wastage of finances, human resources, and lost potential. Educational practices, and the structures that support them, must change in order to ensure that the citizens of the future - our school children of the present - can exist and grow in a world characterised by change, unpredictability and enterprise.

Wishing you all a Merry Christmas and a Happy New Year!
Sreejith C