CLEAR December 2013, Volume-2 Issue-4
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor
Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors
Sreejith C, Reshma O K, Neethu Johnson, Gopalakrishnan G

Cover page and Layout
Sreejith C

Contents
Editorial ... 4
SIMPLE News & Updates ... 5
1. HTK - The HMM Toolkit for Speech Processing (Abitha Anto, Neethu Johnson, Sincy V Thambi, Varsha K V) ... 8
2. Lucene (Nibeesh K) ... 16
3. FrameNet (Sreejith C) ... 21
4. A Fruitful Data Mining using ORANGE (Manu Madhavan) ... 25
5. R Programming Language (Robert Jesuraj) ... 27
6. Malayalam Morphological Analyser using Apertium Toolkit (Deepa C A, Ancy Antony) ... 30
CLEAR Mar 2014 Invitation ... 34
Last word ... 35
Greetings! This edition of CLEAR, the last one of 2013, focuses on some of the latest tools used in computational linguistics and allied disciplines. They include HTK, a toolkit that is becoming popular among the speech processing community, and Apache Lucene, a free/open-source information retrieval software library, among many others. Lucene, originally created in Java by Doug Cutting, is an open-source Jakarta project used to build and search indexes, and it can index any text-based information you like. There is also an article on the FrameNet corpus, a lexical database of English that is both human- and machine-readable, based on annotated examples. Other articles peep into data mining (with ORANGE), a new programming paradigm (with the R language), and Malayalam morphological analysis (with Apertium). As the New Year approaches, let us hope that these gentle introductions to new tools will help readers chart out exciting journeys into newer domains, resulting in useful findings. The CLEAR team wishes all readers a happy and prosperous 2014!

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES

Publications

• Semantic Role Extraction and General Concept Understanding in Malayalam using Paninian Grammar, Radhika K T, P. C. Reghu Raj, International Journal of Engineering Research and Development, Volume 9, Issue 3, December 2013.
• Extraction of Disease Relationship from Medical Records: Vector Based Approach, Ayisha Noori, P. C. Reghu Raj, International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 3, Issue 2, November 2013.
• Malayalam Wordnet: A Relational Database Approach, Mujeeb Rehman, P. C. Reghu Raj, International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 3, Issue 2, November 2013.
• Morphological Analyzer for Malayalam: Probabilistic Method Vs Rule Based Method, Rinju O. R., Rajeev R. R., Reghu Raj P. C., Elizabeth Sherly, International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), Volume 2, Issue 10, October 2013.
• Architecture of an Ontology-Based Domain-Specific Natural Language Question Answering System, Athira P. M., Sreeja M., P. C. Reghuraj, International Journal of Web & Semantic Technology (IJWesT), Volume 4, No. 4, October 2013.
• Structured Information Extraction from On-line Advertisements - A Bayesian Approach, Saani H, Reghu Raj P. C., International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), Volume 3, Issue 9, September 2013.
• LALITHA: A Light Weight Malayalam Stemmer Using Suffix Stripping Method, Prajitha U, Sreejith C, Reghu Raj P. C., Proceedings of the IEEE International Conference on Control, Communication and Computing, December 13-15, 2013.
• An Effective Malayalam Information Retrieval System Using Query Expansion, Reshma O K, Sreejith C, Reghu Raj P. C., Proceedings of the IEEE International Conference on Control, Communication and Computing, December 13-15, 2013.
• STHREE: Stemmer for Malayalam Using Three Pass Algorithm, Pragisha, Reghu Raj P. C., Proceedings of the IEEE International Conference on Control, Communication and Computing, December 13-15, 2013.
• Design of a Scalable Natural Language Report Management System, Manu Madhavan, Robert Jesuraj, Reghu Raj P. C., Proceedings of the International Conference on Natural Language Processing, 18-20 December 2013.
Workshop on Multilingual Computing

The M.Tech Computational Linguistics programme, Department of Computer Science and Engineering, GEC Sreekrishnapuram, organized a one-day workshop on "Multilingual Computing" (sponsored by TEQIP Phase 2) in association with Swathanthra Malayalam Computing (SMC) on October 7, 2013. The purpose of the workshop was to familiarize participants with the various challenges and issues related to language engineering and computing. The workshop was inaugurated by Dr. P. C. Reghu Raj, Head of the Department of Computer Science and Engineering. The keynote speech was delivered by Mr. Santhosh Thottingal, Senior Software Engineer in the Language Engineering team at the Wikimedia Foundation and project admin of SMC. Mr. Santhosh argued for greater emphasis on software localization and for the need for multilingual computing systems. The workshop was attended by more than 70 participants.
Workshop on Research Methodology and Intellectual Property Rights

A three-day training programme-cum-workshop on Research Methodologies and Intellectual Property Rights was held at GEC Sreekrishnapuram during 21-23 November 2013. The programme was organized by CERD (Centre for Engineering Research and Development) and KSCSTE (Kerala State Council for Science, Technology and Environment), Trivandrum. Eminent personalities from CERD and KSCSTE, along with academicians from reputed institutions, gave sessions on various aspects of research, and more than 30 faculty members and PG students attended. The workshop started with a welcome address by Dr. P. C. Reghu Raj, Principal of GEC Sreekrishnapuram, and was inaugurated by Dr. K. Balan, Director, CERD, Trivandrum, who also delivered the inaugural speech. The first day came to an end with a motivational talk by Dr. K. K. Sasi, Professor, Dept. of EEE, Amrita Viswa Vidyapeetham, Coimbatore. On the second day, the session commenced with a talk by Dr. V. Ajith Prabhu and Mr. Shafikh S on the various funding schemes available from KSCSTE for promoting technical education and research activities. The final day was handled by two eminent academicians, Dr. K. P. Mohandas, Dean (Academic), MES College of Engineering, Kuttipuram, and Dr. K. B. M. Nambudiripad, former Professor of Mechanical Engineering, NIT Calicut.
HTK: The HMM Toolkit for Speech Processing

Abitha Anto, Neethu Johnson, Sincy V Thambi, Varsha K V
M.Tech IIIrd Sem, GEC, Palakkad

Abstract: The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models (HMMs). HMMs can be used to model any time series, and the core of HTK is similarly general-purpose. HTK is primarily designed for building HMM-based speech processing tools, in particular recognizers; thus, much of the infrastructure support in HTK is dedicated to this task. There are two major processing stages involved. First, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions. Second, unknown utterances are transcribed using the HTK recognition tools.
I. Introduction

Speech recognition is the process of transcribing a recorded speech utterance into its corresponding sequence of words. Speech recognition systems generally assume that the speech signal is a realization of some message encoded as a sequence of one or more symbols. To effect the reverse operation of recognizing the underlying symbol sequence given a spoken utterance, the continuous speech waveform is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of parameter vectors is assumed to form an exact representation of the speech waveform, on the basis that for the duration covered by a single vector the speech waveform can be regarded as being stationary. Every speech recognition system consists of three major stages: first, the feature extraction stage; second, the phone recognition stage; and then the decoding stage.

HTK can help in all three stages of speech recognition. HTK is a toolkit for building Hidden Markov Models (HMMs). It is primarily designed for building HMM-based speech processing tools, in particular recognizers. The HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions, and the HTK recognition tools then transcribe unknown utterances.

HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions, and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples.

II. THE TOOLKIT

There are mainly four phases in working with the HTK toolkit: data preparation, training, testing and analysis. These processing stages are depicted in Fig. 1.

[Fig. 1: The HTK processing stages]

1. Data Preparation

In order to build a set of HMMs, a set of speech data files and their associated transcriptions are required. Before the speech data can be used in training, it must be converted into the appropriate parametric form, and any associated transcriptions must be converted to have the correct format and use the required phone or word labels. If the speech needs to be recorded, the tool HSLab can be used both to record the speech and to manually annotate it with any required transcriptions.

HCopy is used to copy one or more source files to an output file. Normally, HCopy copies the whole file, but a variety of mechanisms are provided for extracting segments of files and concatenating files. The tool HList can be used to check the contents of any speech file, and since it can also convert input on-the-fly, it can be used to check the results of any conversions before processing large quantities of data. The tool HLEd is a script-driven label editor designed to make the required transformations to label files; HLEd can also output files to a single Master Label File (MLF), which is usually more convenient for subsequent processing. HDMan is used to create the dictionary, and HParse creates a word network from the grammar used.

2. Training

The initialization procedure depends on the information available at that time. The commands used are:
• HCompV: computes the overall mean and variance. Input: a prototype HMM.
• HInit: Viterbi segmentation + parameter estimation; for mixture distributions it uses K-means. Input: a prototype HMM, time-aligned transcriptions.
• HRest: Baum-Welch re-estimation. Input: an initialized model set, time-aligned transcriptions.
• HERest: performs embedded Baum-Welch training. Input: an initialized model set, timeless transcriptions.
• HSmooth: smooths a set of context-dependent models according to their context-independent counterparts.
• HHEd: increases the number of Gaussians.
3. Testing

HVite implements the Viterbi algorithm. It takes as input a network describing the allowable word sequences, a dictionary defining how each word is pronounced, and a set of HMMs. It operates by converting the word network to a phone network and then attaching the appropriate HMM definition to each phone instance. Recognition can then be performed on either a list of stored speech files or on direct audio input. Testing uses the dictionary and word network created in the data preparation step.

4. Analysis

Once the HMM-based recognizer has been built, it is necessary to evaluate its performance. This is usually done by using it to transcribe some prerecorded test sentences and matching the recognizer output against the correct reference transcriptions. This comparison is performed by a tool called HResults, which uses dynamic programming to align the two transcriptions and then counts substitution, deletion and insertion errors.

III. ALGORITHMS USED IN HTK

1. EM algorithm: Expectation-Maximization (EM) is a technique used in point estimation. Given a set of observable variables X and unknown (latent) variables Z, we want to estimate the parameters θ of a model. The EM algorithm is a general method for finding the maximum-likelihood estimate of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. It alternates between two steps, an expectation step and a maximization step: the expectation step assumes values for the unknown parameters, whereas the maximization step maximizes the probability of the expected value.

2. Baum-Welch algorithm: To determine the parameters of an HMM it is first necessary to make a rough guess at what they might be. Once this is done, more accurate (in the maximum-likelihood sense) parameters can be found by applying the so-called Baum-Welch re-estimation formulae. The Baum-Welch algorithm uses the EM algorithm and updates the model parameters iteratively until convergence. It uses the forward algorithm, which calculates the probability of the partial sequence of speech from the start time up to time t, ending up in state i. It also uses backward probabilities, which give the probability of the remaining partial sequence of speech given that we started at state i at time t.
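For readers who want the usual notation (this is the standard textbook formulation, not something spelled out in the article itself): the forward probability is alpha_t(i) = P(o_1, ..., o_t, state i at time t | model), and the backward probability is beta_t(i) = P(o_{t+1}, ..., o_T | state i at time t, model). Baum-Welch combines these two quantities to re-estimate the transition and output parameters, and the process is repeated until the likelihood of the training data stops improving.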
3. Viterbi algorithm: This algorithm, illustrated in Fig. 2, can be visualized as finding the best path through a matrix where the vertical dimension represents the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time). Each large dot in the picture represents the log probability of observing that frame at that time, and each arc between dots corresponds to a log transition probability. The log probability of any path is computed simply by summing the log transition probabilities and the log output probabilities along that path. The paths are grown from left to right, column by column.

[Fig. 2: Viterbi search as a best path through the state-time matrix]

IV. AN EXAMPLE: A SIMPLE PHONE RECOGNIZER

Step 1: Create the grammar

HTK provides a grammar definition language for specifying simple task grammars. A grammar consists of a set of variable definitions followed by a regular expression describing the words to recognize, for example:

$word = aa | ae | ... ;

The HTK recognizer actually requires a word network to be defined using a low-level notation called HTK Standard Lattice Format (SLF), in which each word instance and each word-to-word transition is listed explicitly. This word network can be created automatically from the grammar above using the HParse tool. Assuming that the file gram contains the above grammar, executing

HParse gram wdnet

will create an equivalent word network in the file wdnet.
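To make Step 1 concrete, a complete phone-loop grammar file of the kind sketched above could look like the following (the phone list here is only illustrative, not the full phone set used by the authors):

$phone = aa | ae | ah | iy | k | t | sil ;
( < $phone > )

In HTK's grammar notation the angle brackets denote one or more repetitions, so this network accepts any sequence of phones; passing the file through HParse then yields the SLF network that HVite needs.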
Step 2: The dictionary

The dictionary contains all the phones used by the recognizer. It can be built from a standard source using HDMan. Executing

HDMan -m -w wlist -n monophones1 -l dlog dict beep names

will create a new dictionary called dict by searching the source dictionaries beep and names to find pronunciations for each word in wlist. Training sentences can be extracted, for example, from the prompts used with the TIMIT database, and the desired training word list (wlist) can then be extracted automatically from these.

Step 3: Recording the data and creating the transcription files

The training and test data are recorded using the HTK tool HSLab, which is a combined waveform recording and labeling tool. If we do not have preexisting training sentences (such as those from the TIMIT database), we can create them either from preexisting text (as described above) or by labeling our own training utterances using HSLab. HSLab is invoked by typing

HSLab noname

This will cause a window to appear with a waveform display area in the upper half and a row of buttons, including a record button, in the lower half. When the name of a normal file is given as argument, HSLab displays its contents; here, the special file name noname indicates that new data is to be recorded.

To train a set of HMMs, every file of training data must have an associated phone-level transcription. We create a Master Label File (MLF), a file containing a complete set of phone-level transcriptions. Phone-level transcriptions are made out of word-level transcriptions using the HLEd command.

Step 4: Extracting features

The final stage of data preparation is to parameterize the raw speech waveforms into sequences of feature vectors. HTK supports both FFT-based and LPC-based analysis. Here Mel Frequency Cepstral Coefficients (MFCCs) [5], which are derived from FFT-based log spectra, will be used. Coding can be performed using the tool HCopy configured to automatically convert its input into MFCC vectors. To do this, a configuration file (config) is needed which specifies all of the conversion parameters.

HCopy will copy one or more data files to a designated output file, optionally converting the data into a parameterized form. While the source files can be in any supported format, the output format is always HTK. By default, the whole of the source file is copied to the target, but options exist to copy only a specified segment. Hence, this program is used to convert data files in other formats to the HTK format, to concatenate or segment data files, and to parameterize the result. Here, HCopy is used to convert the parameter kind of a file, for example from WAVEFORM to MFCC, depending on the configuration options. To run HCopy, a list of each source file and its corresponding output file is needed; this list is created and acts as the script file. Executing the command

HCopy -T 1 -C config -S script

creates the MFCC features at the locations specified in the script file.
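For illustration, the configuration file for this step could look like the sketch below. The parameter values are typical HTK defaults rather than anything prescribed by the article:

# config: conversion parameters read by HCopy
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
NUMCHANS     = 26
NUMCEPS      = 12

The script file passed with -S is simply a list with one "source target" pair per line, e.g. data/train/s0001.wav data/train/s0001.mfc (the file names here are made up for the example). With these two files in place, HCopy -T 1 -C config -S script writes one .mfc file per input waveform.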
Step 5: Making initial models

The first step in HMM training is to define a prototype model. A prototype HMM set (the initial models) for the phones is created. The starting point is a set of identical monophone HMMs in which every mean is set to zero and every variance to one. These initial models are then retrained.

Step 6: Re-training

The HTK tool HCompV will scan a set of data files, compute the global mean and variance, and set all of the Gaussians in a given HMM to have that same mean and variance. Hence, assuming that a list of all the training files is stored in train.scp, the command

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

will create a new version of proto in the directory hmm0 in which the zero means and unit variances above have been replaced by the global speech means and variances.

Given this new prototype model stored in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of the required monophone HMMs is constructed by manually copying the prototype and relabeling it for each required monophone. The phone models stored in the directory hmm0 are then re-estimated using the embedded re-estimation tool HERest. The command is

HERest -C config -I phones0.mlf -S train.scp -H hmm0/hmmdefs -M hmm1 monophones0

where phones0.mlf is the phone-level transcription, train.scp is the script file containing the mfc paths, hmmdefs holds the initial models and hmm1 is the new directory being created. Each time HERest is run it performs a single re-estimation, and each new HMM set is stored in a new directory. Execution of HERest should be repeated twice more, changing the names of the input and output directories.
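Spelled out, the two further passes simply shift the directory names along, following the hmmN naming convention used above:

HERest -C config -I phones0.mlf -S train.scp -H hmm1/hmmdefs -M hmm2 monophones0
HERest -C config -I phones0.mlf -S train.scp -H hmm2/hmmdefs -M hmm3 monophones0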
After the appropriate iterations of HERest, the number of Gaussians can be increased using the command HHEd. HHEd works in a similar way to HLEd: it applies a set of commands in a script to modify a set of HMMs. In this case, it is executed as follows:

HHEd -H hmm4/hmmdefs -M hmm5 sil.hed monophones1

where hmm4 is the input directory and hmm5 is the output directory. The models are then retrained again with HERest using the increased Gaussian count, as appropriate.

Step 7: Testing

The phone models created so far can be used to realign the training data and create new transcriptions. This can be done with a single invocation of the HTK recognition tool HVite. HVite is a general-purpose Viterbi word recognizer. It will match a speech file against a network of HMMs and output a transcription for each. When performing N-best recognition, a word-level lattice containing multiple hypotheses can also be produced. HVite is invoked via the command line:

HVite -b silence -C config -H hmm7/hmmdefs -i out.mlf -S test.scp -w wdnet dummy dict

This command uses the HMMs stored in hmm7 to transform the test data, according to the final HMM model, into the new phone-level transcription out.mlf, using the pronunciations stored in the dictionary dict. Here the recognizer considers all pronunciations for each word and outputs the pronunciation that best matches the acoustic data.

Step 8: Analysis

HResults is the HTK performance analysis tool. It reads in a set of label files (typically output from a recognition tool such as HVite) and compares them with the corresponding reference transcription files. Suppose that out.mlf contains the recognizer output transcriptions, ref.mlf contains the corresponding reference transcriptions, and dict.txt contains a list of all labels appearing in these files. Then typing the command

HResults -I ref.mlf dict.txt out.mlf

will return the performance analysis of the phone recognizer.

V. CONCLUSION

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide.

References

[1] Christopher D. Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, May 1999.
[2] Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics", 2nd edition, Prentice-Hall, 2009.
[3] Shanthi Therese S and Chelpa Lingam, "Review of Feature Extraction Techniques in Automatic Speech Recognition", International Journal of Scientific Engineering and Technology, Volume 2, Issue 6, pp. 479-484.
[4] http://htk.eng.cam.ac.uk/docs/docs.shtml
[5] http://www.ling.ohio-state.edu/~bromberg/htk_problems.html
Jibbigo is a mobile offline language translation application that was developed by Mobile Technologies, LLC and Dr. Alex Waibel, a professor at Carnegie Mellon. Jibbigo is an offline voice translator and does not need phone or data connectivity to function. Spanish-English Jibbigo was released in September, 2009 as the first offline Speech Translation application. The company has since expanded its offerings to include ten language pairs sold on both Apple's App Store and Google Play. In Jibbigo, the user holds down a record button and says a phrase. The phrase then appears as text in both languages and is spoken aloud in the target language. The app also includes an add name function, a background dictionary, and other features. On iOS, it is compatible with Voice Over for vision impaired users. Ref: http://jibbigo.com/
Lucene

Nibeesh K
M.Tech Computational Linguistics, GEC, Palakkad
nibeesh@gmail.com

Abstract: Apache Lucene is a free/open-source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene has been ported to other programming languages including Delphi, Perl, C#, C++, Python, Ruby, and PHP. Doug Cutting originally wrote Lucene in 1999, and it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation's Jakarta family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. Until recently, it included a number of sub-projects, such as Lucene.NET, Mahout, Solr and Nutch. Solr has since merged into the Lucene project itself, while Mahout, Nutch, and Tika have moved on to become independent top-level projects. Version 4.0 was released on 12 October 2012.
I. Introduction

Apache Lucene is an open-source project that provides Java-based indexing and search technology. Using its API, it is easy to implement full-text search. The Apache Lucene project includes the following:

• Lucene Core provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
• Solr is a high-performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface.
• The Open Relevance Project is a subproject with the aim of collecting and distributing free materials for relevance testing and performance.
• PyLucene is a Python port of the Core project.

One of the key factors behind Lucene's popularity is its simplicity. You don't need in-depth knowledge about how Lucene's information indexing and retrieval work in order to start using it. Moreover, Lucene's straightforward API requires using only a handful of classes to get started.

Lucene is used in a surprisingly diverse and growing number of places: NetFlix, Digg, MySpace, LinkedIn, Ticketmaster, Fedex, SalesForce.com, Apple, the Encyclopedia Britannica CD-ROM/DVD, the Eclipse IDE, the Mayo Clinic, New Scientist magazine, Atlassian (JIRA), Epiphany, MIT's OpenCourseWare and DSpace, the Hathi Trust Digital Library, and Akamai's Edge-Computing platform.
II. What Lucene can do

Lucene allows you to add search capabilities to your application. Lucene can index and make searchable any data that you can extract text from. This means you can index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, XML or HTML or PDF files, or any other format from which you can extract textual information. With Lucene you can likewise index and search email messages, mailing-list archives, instant messenger chats, and much more.

III. The core indexing classes

The following classes are used to perform the simplest indexing procedure:

1. IndexWriter
2. Directory
3. Analyzer
4. Document
5. Field

[Figure 1: Classes used when indexing documents]

[1] IndexWriter: This is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index.

[2] Directory: The Directory class represents the location of a Lucene index. It is an abstract class. We can use FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and pass that in turn to IndexWriter's constructor.

[3] Analyzer: Before text is indexed, it is passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting from the text those tokens that should be indexed.

[4] Document: The Document class represents a collection of fields. The fields of a document represent the document itself or metadata associated with that document; metadata such as author, title, subject and date modified is indexed and stored separately as fields of the document. A document is simply a container for multiple fields; Field is the class that holds the textual content to be indexed.

[5] Field: Each document in an index contains one or more named fields, embodied in a class called Field. A document may have more than one field with the same name; in this case, the values of the fields are appended, during indexing, in the order they were added to the document.

[Figure 2: A Lucene index consists of documents. Each document has a number of fields, and the contents of a field can consist of one or more terms. The number of unique terms is one of the criteria determining the memory requirements of an index.]

IV. The core searching classes

Only a few classes are needed to perform the basic search operation:

1. IndexSearcher
2. Term
3. Query
4. TermQuery
5. TopDocs

[1] IndexSearcher: IndexSearcher is the class that opens an index in read-only mode. It requires a Directory instance holding the previously created index, and then offers a number of search methods.

[2] Term: This is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field.

[3] Query: Lucene comes with a number of concrete Query subclasses. These include BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery.
[4] TermQuery: This is the most basic type of query supported by Lucene, and it is one of the primitive query types. It is used for matching documents that contain fields with specific values.

[5] TopDocs: The TopDocs class is a simple container of pointers to the top N ranked search results, i.e. documents that match a given query.
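As a quick illustration of how the classes above fit together, here is a minimal sketch in Python using the PyLucene port mentioned in the Introduction. It follows the flat-namespace style of the older PyLucene releases (the same era as the demo referenced in [4]); the class names are the ones described above, but exact constructor signatures vary between Lucene versions, so treat this as a sketch rather than copy-paste code.

import lucene
lucene.initVM()

# Indexing: Directory + Analyzer + IndexWriter + Document + Field
directory = lucene.SimpleFSDirectory(lucene.File("index"))
analyzer = lucene.StandardAnalyzer(lucene.Version.LUCENE_CURRENT)
writer = lucene.IndexWriter(directory, analyzer, True,
                            lucene.IndexWriter.MaxFieldLength.UNLIMITED)
doc = lucene.Document()
doc.add(lucene.Field("contents", "Apache Lucene is a search library",
                     lucene.Field.Store.YES, lucene.Field.Index.ANALYZED))
writer.addDocument(doc)
writer.close()

# Searching: IndexSearcher + Term + TermQuery + TopDocs
searcher = lucene.IndexSearcher(directory, True)
query = lucene.TermQuery(lucene.Term("contents", "lucene"))
top_docs = searcher.search(query, 10)   # a TopDocs object
print top_docs.totalHits
searcher.close()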
V. How to use

We can use Lucene from the command line or programmatically. From the command line, we can use

java org.apache.lucene.demo.IndexFiles {full-path-to-documents-directory} [4]

to create the search index. This command will create a directory called index which contains an index of all the text documents in the given directory. After that we can search for documents using

java org.apache.lucene.demo.SearchFiles [4]

This will prompt for a query; type in a search term and press the enter key, and you will get the search results. The same thing can be implemented in web servers such as Apache Tomcat. For this you should type

java org.apache.lucene.demo.IndexHTML -create -index {index-dir} ..

in any subdirectory of the {tomcat}/webapps directory (make sure you don't leave off the "..", or you'll get a null pointer exception). {index-dir} should be a directory that Tomcat has permission to read and write, but that is outside of a web-accessible context. By default the webapp is configured to look in /opt/lucene/index for this index. After that you can use the URL http://localhost:8080/luceneweb to get a search interface. [4]

VI. Advantages of Lucene

• Scalable, high-performance indexing: incremental indexing as fast as batch indexing, with an index size roughly 20-30% of the size of the text indexed.
• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more.
• Fielded searching (e.g. title, author, contents).
• Cross-platform solution.

References

[1] Michael McCandless, Erik Hatcher and Otis Gospodnetic, Lucene in Action (Second Edition), Manning Publications Co., Stamford, 2010.
[2] http://lucene.apache.org/, visited on 26 Nov 2013.
[3] http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html, visited on 26 Nov 2013.
[4] http://lucene.apache.org/core/2_9_4/demo.html, visited on 26 Nov 2013.
[5] http://lucene.apache.org/core/features.html, visited on 26 Nov 2013.
Yahoo Snags SkyPhrase for Natural-Language Processing.... Yahoo has gone and bought another company, pushing its acquisition tally for the year up to 23. The Internet giant has purchased natural-language outfit SkyPhrase for an undisclosed sum. SkyPhrase and its small team of four will become part of the Yahoo Labs team in New York, where they will presumably bring their natural-language-processing acumen to bear on Yahoo’s search and mail products.
FrameNet

Sreejith C
M.Tech IIIrd Sem, GEC, Palakkad

Abstract: FrameNet is a project housed at the International Computer Science Institute in Berkeley, California, which produces an electronic resource based on a theory of meaning called frame semantics. FrameNet reveals, for example, that the sentence "John sold a car to Mary" essentially describes the same basic situation (semantic frame) as "Mary bought a car from John", just from a different perspective.
FrameNet is a project housed at the International Computer Science Institute in Berkeley, California, which produces an electronic resource based on a theory of meaning called frame semantics. FrameNet reveals, for example, that the sentence "John sold a car to Mary" essentially describes the same basic situation (semantic frame) as "Mary bought a car from John", just from a different perspective. A semantic frame can be thought of as a conceptual structure describing an event, relation, or object and the participants in it. The FrameNet lexical database contains around 1,200 semantic frames, 13,000 lexical units (a pairing of a word with a meaning; polysemous words are represented by several lexical units) and over 190,000 example sentences.

FrameNet is largely the creation of Charles J. Fillmore, who developed the theory of frame semantics that the project is based on and was initially the project leader when the project began in 1997.[1] Collin Baker became the project manager in 2000.[2] The FrameNet project has been influential in both linguistics and natural language processing, where it led to the task of automatic Semantic Role Labeling.

The FrameNet project is building a lexical database of English that is both human- and machine-readable, based on annotating examples of how words are used in actual texts. From the student's point of view, it is a dictionary of more than 10,000 word senses, most of them with annotated examples that show the meaning and usage. For the researcher in Natural Language Processing, the more than 170,000 manually annotated sentences provide a unique training dataset for semantic role labeling, used in applications such as information extraction, machine translation, event recognition, sentiment analysis, etc. For students and teachers of linguistics it serves as a valence dictionary, with uniquely detailed evidence for the combinatorial properties of a core set of the English vocabulary. The project has been in operation at the International Computer Science Institute in Berkeley since 1997, supported primarily by the National Science Foundation, and the data is freely available for download; it has been downloaded and used by researchers around the world for a wide variety of purposes.

FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is straightforward: the meanings of most words can best be understood on the basis of a semantic frame, a description of a type of event, relation, or entity and the participants in it. For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the FrameNet project, this is represented as a frame called Apply_heat, and Cook, Food, Heating_instrument and Container are called frame elements (FEs). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame. Other frames are more complex, such as Revenge, which involves more FEs (Offender, Injury, Injured_Party, Avenger, and Punishment); others are simpler, such as Placing, with only an Agent (or Cause), a thing that is placed (called a Theme) and the location in which it is placed (Goal).

Many common nouns, such as tree, hat or tower, usually serve as dependents which head FEs, rather than clearly evoking their own frames, so we have devoted less effort to annotating them, since information about them is available from other lexicons, such as WordNet (Miller et al. 1990). We do, however, recognize that such nouns also have a minimal frame structure of their own, and in fact the FrameNet database contains slightly more nouns than verbs.
Formally, FrameNet annotations are sets of triples that represent the FE realizations for each annotated sentence, each consisting of a frame element name (for example, Food), a grammatical function (say, Object) and a phrase type (say, noun phrase (NP)). We can think of these three types of annotation on each FE as "layers", but the grammatical function and phrase-type layers are not displayed in the web-based report system, to avoid visual clutter. The downloadable XML version of the data includes these three layers (and several more not discussed here) for all of the annotated sentences, along with complete frame and FE descriptions, frame-frame relations, and lexical entries for each annotated LU. Most of the annotations are of separate sentences annotated for only one LU, but there is also a collection of texts in which all the frame-evoking words have been annotated; the overlapping frames provide a rich representation of much of the meaning of the entire text. The FrameNet team have defined more than 1,000 semantic frames and have linked them together by a system of frame relations, which relate more general frames to more specific ones and provide a basis for reasoning about events and intentional actions.

Because the frames are basically semantic, they are often similar across languages; for example, frames about buying and selling involve the FEs Buyer, Seller, Goods, and Money, regardless of the language in which they are expressed. Several projects are underway to build FrameNets parallel to the English FrameNet project for languages around the world, including Spanish, German, Chinese, and Japanese, and frame semantic analysis and annotation has been carried out in specialized areas from legal terminology to soccer to tourism.

FrameNet has proven useful in a number of computational applications, because computers need additional knowledge in order to recognize that "John sold a car to Mary" and "Mary bought a car from John" describe essentially the same situation, despite using two very different verbs, different prepositions and a different word order. FrameNet has been used in applications like question answering, paraphrasing, recognizing textual entailment, and information extraction, either directly or by means of Semantic Role Labeling tools. The first automatic system for Semantic Role Labeling (SRL, sometimes also referred to as "shallow semantic parsing") was developed by Daniel Gildea and Daniel Jurafsky based on FrameNet in 2002, and Semantic Role Labelling has since become one of the standard tasks in natural language processing. Since frames are essentially semantic descriptions, they are similar across languages, and several projects have arisen over the years that have relied on the original FrameNet as the basis for additional non-English FrameNets, for Spanish, Japanese, and German, among others.
The Kicktionary: a multilingual electronic dictionary of football (soccer) language. Is it based on FrameNet?

The Kicktionary is a multilingual (German - English - French) electronic dictionary of the language of football (soccer). It was developed between September 2005 and July 2006 at the FrameNet project at the International Computer Science Institute (ICSI) in Berkeley. The main aim of the project was (and is) to explore how linguistic theories about lexical semantics, methods from corpus linguistics, technologies for hypertext and hypermedia, and techniques from computer language processing can help to make lexical resources that are better than (or: good in a manner different from) traditional paper dictionaries.

The Kicktionary currently contains close to 1,900 lexical units (nouns, verbs, adjectives and idiomatic expressions) in German, English and French. For each lexical unit, there are between one and ten annotated example sentences from a corpus of football match reports. The annotations identify the lexical unit itself as well as its arguments and, as the case may be, a support verb or support preposition.

Based on an analysis of their semantics and argument structure, lexical units are grouped into roughly a hundred frames, such that lexical units in the same frame share important semantic and syntactic characteristics. The frames, in turn, are assigned to one of 16 scenes, where each scene corresponds to a prototypical event (e.g. a goal or a one-on-one situation) of a football match.

In addition to the scenes-and-frames hierarchy, lexical units are also organised into synsets, i.e. into groups of words with identical or largely similar meanings. Synsets, in turn, are the building blocks of a number of concept hierarchies, each of which organises a set of synsets into a tree via lexical relations such as hypernymy/hyponymy (X is-a-kind-of Y), holonymy/meronymy (X is-a-part-of Y) and troponymy (to X is to Y in some way).

References

[1] https://framenet.icsi.berkeley.edu/fndrupal/
[2] http://www.kicktionary.de/background.html
[3] http://en.wikipedia.org/wiki/FrameNet
A Fruitful Data Mining Using ORANGE

Manu Madhavan
Assistant Professor, Dept. of CSE, SIMAT, Palakkad

This article is a brief, not-too-detailed introduction to the Orange data mining tool and its use with Python scripts. Have fun!

Data Mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The area has special applications in fields like medical science, business intelligence, and similar areas of real life. There are many open-source and proprietary software tools available for data mining applications and research. Orange (http://orange.biolab.si) is a general-purpose, open-source machine learning and data mining tool. It includes a set of components for data preprocessing, feature scoring and filtering, modeling, model evaluation, and exploration techniques. Here, Python scripting using the orange package is discussed.

Install Orange

To build and install Orange you can use the setup.py in the root orange directory (this requires GCC, Python and the numpy development headers). The details are available on the Orange documentation page (http://orange.biolab.si/download/).

Test the installation

After installing Orange, type the following in a Python interactive shell:

>>> import Orange
>>> Orange.version.version
'2.6a2.dev-a55510d'

If this gives no error or warning, Orange and Python are properly installed and you are ready to continue.

Data Input

Orange can read files in its native format and in other data formats. The native format starts with the feature (attribute) names and their types (continuous, discrete, string). The third line contains meta information to identify dependent features (class), irrelevant features (ignore) or meta features (meta). You may download lenses.tab to a target directory and open a Python shell there:

>>> import Orange
>>> data = Orange.data.Table("lenses")

Data mining using Orange-Python

Orange-Python can be used for all data mining applications, such as classification, prediction, clustering and learning. Here I illustrate how Orange can be used for classification. For classification, you need a training data set and a test data set. The data readable by Orange methods are stored in a .tab file (stored as Tables). The table contains the different features of the data and a class value. For training data, the class value is the actual class to which the feature vector belongs. In the case of the test data set, the class value is absent and has to be predicted by a learning function. Orange has a vast variety of learning functions, such as KNN, least mean square error, Naive Bayes and logistic/linear regression. The following sample code uses the KNN method to predict the class values of a test data set.

Let the training data set be stored in trainset.tab and the test data set in testset.tab. The following script will print the class of each test instance:

>>> train = Orange.data.Table("trainset")
>>> test = Orange.data.Table("testset")
>>> learner = Orange.classification.knn.kNNLearner()
>>> classifier = learner(train)
>>> for i in range(len(test)):
...     print i, classifier(test[i])

This will print the class of each vector in the test set. You can verify the result by comparing it with the output of other learning tools. Hope you got it; enjoy the experiment!
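Beyond predicting labels one by one, the same Orange 2.x package also ships evaluation helpers, so a learner can be scored with cross-validation instead of inspecting predictions by hand. A small sketch (module paths as in Orange 2.x, which this article targets; the API changed later in Orange 3):

>>> learners = [Orange.classification.knn.kNNLearner(),
...             Orange.classification.bayes.NaiveLearner()]
>>> results = Orange.evaluation.testing.cross_validation(learners, train, folds=5)
>>> print Orange.evaluation.scoring.CA(results)

CA returns one classification-accuracy value per learner, which makes it easy to compare the KNN model above against an alternative on the same data.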
R Programming Language

Robert Jesuraj
R&D Engineer @ Arcadix

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display, either directly at the computer or on hard copy, and
• a well developed, simple and effective programming language (called 'S') which includes conditionals, loops, user-defined recursive functions and input and output facilities. (Indeed most of the system-supplied functions are themselves written in the S language.)

R uses CRAN (the Comprehensive R Archive Network). The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices, import/export capabilities, reporting tools, etc. These packages are developed primarily in R, and sometimes in Java, C and Fortran. A core set of packages is included with the installation of R, with 5300 additional packages available.

Natural Language Processing using R

The CRAN task view collects relevant R packages that support computational linguists in conducting analysis of speech and language on a variety of levels, setting focus on words, syntax, semantics, and pragmatics. In recent years, we have elaborated a framework to be used in packages dealing with the processing of written material: the package tm.

Some of the frameworks and packages:

• tm provides a comprehensive text mining framework for R. The Journal of Statistical Software article "Text Mining Infrastructure in R" gives a detailed overview and presents techniques for count-based analysis methods, text clustering, text classification and string kernels.
• tm.plugin.dc allows for distributing corpora across storage devices (local files or the Hadoop Distributed File System).
• tm.plugin.mail helps with importing mail messages from archive files such as those used by Thunderbird (mbox, eml).
• tm.plugin.factiva allows importing press/Web corpora from Dow Jones Factiva.
• RcmdrPlugin.temis is an R Commander plug-in providing an integrated solution to perform a series of text mining tasks such as importing and cleaning a corpus, and analyses like term and document counts, vocabulary tables, term co-occurrences and document similarity measures, time series analysis, correspondence analysis and hierarchical clustering.
• openNLP provides an R interface to OpenNLP, a collection of natural language processing tools including a sentence detector, tokenizer, POS tagger, shallow and full syntactic parser, and named-entity detector, using the Maxent Java package for training and using maximum entropy models. Trained models for English and Spanish to be used with openNLP are available from http://datacube.wu.ac.at/ as the packages openNLPmodels.en and openNLPmodels.es, respectively.
• RWeka is an interface to Weka, a collection of machine learning algorithms for data mining tasks written in Java. Especially useful in the context of natural language processing is its functionality for tokenization and stemming.

Semantics:

• Lsa provides routines for performing latent semantic analysis with R. The basic idea of latent semantic analysis (LSA) is that texts do have a higher-order (latent semantic) structure which, however, is obscured by word usage (e.g. through the use of synonyms or polysemy). By using conceptual indices that are derived statistically via a truncated singular value decomposition (a two-mode factor analysis) over a given document-term matrix, this variability problem can be overcome. The article "Investigating Unstructured Texts with Latent Semantic Analysis" gives a detailed overview and demonstrates the use of the package with examples from the area of technology-enhanced learning.
• Topicmodels provides an interface to the C code for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM) by David M. Blei and co-authors, and to the C++ code for fitting LDA models using Gibbs sampling by Xuan-Hieu Phan and co-authors.
• Lda implements Latent Dirichlet Allocation and related models, similar to Lsa and Topicmodels.
• Kernlab allows one to create and compute with string kernels, like full string, spectrum, or bounded range string kernels. It can directly use the document format used by tm as input.
• Skmeans helps with clustering, providing several algorithms for spherical k-means partitioning.
• MovMF provides another clustering alternative (approximations are fitted with von Mises-Fisher distributions of the unit length vectors).
• RTextTools is a machine learning package for text classification. It implements nine different algorithms (svm, slda, boosting, bagging, rf, glmnet, tree, nnet, and maxent) and routines supporting the evaluation of accuracy.
• Textir is a suite of tools for text and sentiment mining.
• Textcat provides support for n-gram based text categorization.
• Corpora offers utility functions for the statistical analysis of corpus frequency data.

Pragmatics:

• Qdap helps with quantitative discourse analysis of transcripts.

In conclusion, R is very useful for natural language processing, so get into it. To start learning, use the reference links.

References

[1] http://www.r-project.org/
[2] http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Malayalam Morphological Analyser using Apertium Toolkit

Ancy Antony, Deepa C A
M.Tech Computational Linguistics, University of Calicut

Abstract: Malayalam morphology exhibits a wide range of inflections, multiple suffixes and a tendency of adjacent words to concatenate. The agglutinative nature of the Malayalam language creates several challenges in producing reasonable morphological processing results. Here an attempt is made to explore Malayalam morphology and to create morphologically processed results using a rule-based machine translation toolkit known as Apertium.

I. Introduction

The Dravidian language family is one of the important groups of languages spoken in South India. There are four recognized Dravidian languages: Telugu, Malayalam, Kannada and Tamil. The most characteristic feature of the Dravidian languages is that they are agglutinative and exhibit the inclusive and exclusive feature. Morphological analysis is the segmentation of words into (usually) their component morphemes, the assignment of grammatical information to grammatical categories, and the assignment of lexical information to a particular lexeme or lemma. Morphological analysis consists of the identification of parts of words or, more technically, constituents of words. Malayalam is an inflectionally rich language: the major lexical items like nouns and verbs are inflected for plural marking, case, tense, aspect and mood respectively. There are different methods for the morphological analysis of natural language:

• Brute force method
• Root-driven method
• Affix stripping method

In the root-driven method, the root or stem is identified first and the affixes are then parsed. In the affix stripping method, the process takes place in the reverse direction: the affixes are identified first and the remaining part is assumed to be the stem or root. This document describes the implementation of a morphological analyser for Malayalam using the Apertium tool.

II. The Apertium Tool

A Malayalam morphological analyser can be built using the Apertium toolbox. Apertium is free software released under the terms of the GNU General Public License. It is an open-source shallow-transfer machine translation system which originated within the project "Open-Source Machine Translation for the Languages of Spain". Lttoolbox, a module in the Apertium package, is the main tool used for designing the system. It tokenizes the text in surface form and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information. For example, for a Malayalam noun it will give its category, gender, case marker and other features of a noun.

A. Lttoolbox

Lttoolbox can be used for lexical processing, morphological analysis and generation. Lttoolbox performs its processing with the help of a lexical dictionary, using a finite state transducer (FST) approach; the class of FST used in Lttoolbox is the letter transducer. The package is split into three programs: lt-comp, the compiler; lt-proc, the processor; and lt-expand, which generates all possible mappings between surface forms and lexical forms in the dictionary.

Compilation: lt-comp compiles the dictionary file (.dix) and produces a binary file. Compiling LR (left to right) creates an analyser, and compiling RL (right to left) creates a generator. The compilation step is: "lt-comp [ lr | rl ] dictionaryfile binaryfile".

Processing: lt-proc processes the binary file. Its two main modes are analysis (-a), which converts surface forms into the set of possible lexical forms, and generation (-g), which converts a lexical form into the corresponding surface form. It is executed using "lt-proc [-a|-g] binaryfile [inputfile [outputfile]]".

Expansion: lt-expand enables one to see the complete output of the dictionary, that is, all of the mappings between lexical and surface forms.

B. Dictionary Structure

A dictionary is used by Lttoolbox for the morphological processing. The dictionary is in XML format with a .dix or .metadix extension and follows a block structure. The dictionary includes the alphabet definition, the definition of symbols, the definition of paradigms, and a section (conditional or unconditional) with the dictionary entries:

• Alphabet definition: the list of alphabets used in the dictionary file.
• Definition of symbols: the grammatical symbols that are present in the files.
• Definition of paradigms: the several paradigm definitions used in the dictionary.

C. Paradigms and their definitions

A paradigm is a complete set of related inflectional and productive derivational word forms of a given category. Here the paradigms for the noun and verb classes have been implemented. The definition refers to the features and feature values of the root, such as category, gender, number, person and case marking in the case of nouns, and tense, aspect and modal category information in the case of verbs, and so forth.

The paradigms can be viewed as small dictionaries which specify regularities in the lexical processing of the dictionary entries. To specify these regularities, each paradigm has a list of entries <e> like the ones in the dictionary. The paradigm entries consist of a pair (<p>) with a left side (<l>) and a right side (<r>). These elements contain text or grammatical symbols (<s>), for example: <s n="N_NOUN" />. Sometimes a paradigm definition contains entries of another paradigm definition.

III. The Malayalam Morphological Analyser

The Malayalam morphological analyser is implemented by following a paradigm approach. It uses a dictionary that contains the inflections of Malayalam words, covering about 24 noun paradigms and 56 verb paradigms.

[Figure: Architecture of the Malayalam Morphological Analyser]

The input Malayalam text is provided to the morphological processing system through the user interface. The input text is processed by the paradigms defined in the dictionary. The dictionary file, which contains all the noun and verb paradigms, is compiled by lt-comp, which results in a binary file. This is then processed by the lt-proc and lt-expand commands: the binary file is morphologically analysed by the lt-proc instruction to produce the lexical form, and all the mappings between the surface form and the lexical form are given by lt-expand. The output is given to a post-processing module which puts the result into a proper format, and it is then displayed through the output interface.

A. Noun and verb paradigms

To understand the paradigm design, the surface form of a word can be considered as two parts, a root and a suffix. By examining the suffix portion of a noun word, grammatical information like case, gender and number can be obtained. The main task is identifying the suffix and the root of a particular surface form. With the help of a compiled dictionary file, Lttoolbox will perform the morphological analysis provided by a particular noun entry present in the dictionary. For example, consider a noun and the seven case markers in its suffix part:

[Example: a Malayalam noun with its paradigm definition and seven case-marker forms]

The plural form of the word is the same word itself; its plural case marker is 'ര'. This paradigm can be used for the nouns which have the same inflection pattern. The various tags used include N_NOUN for the nominative noun, SG for singular, PL for plural, MF for masculine/feminine, NEUTER for neuter, M for masculine and F for feminine.

The lemma ('lm') contains the surface form of the word. The '<i>' element contains the root form of the word, which in Malayalam may or may not be the actual root word; the actual root is obtained after processing. When an input word is given, lt-proc checks whether it starts with any of the '<i>' elements in the paradigms. If a match occurs, the corresponding paradigm is used for processing the word. Malayalam verbs are somewhat complex and have multiple suffixes; there are about 53 verb paradigms [1].

[Example: dictionary entry and verb paradigm definition]

All the paradigms are provided, and they are processed by the system, which gives the lexical forms.

IV. Conclusion and Future Works

A small step towards developing a simple morphological analyser for Malayalam has been taken using the Apertium toolbox. Many issues, such as multiple suffixes and the large number of inflections, still exist and have to be resolved. All the rules are provided in the dictionary, so the accuracy depends upon the precision of the dictionary. Some other statistical method combined with the tool would be a better approach for the morphological analysis of Malayalam.

References

[1] Vinod P M, Jayan V, Bhadran V K, "Implementation of Malayalam Morphological Analyser using Hybrid Approach", Proceedings of the Twenty-Fourth Conference on Computational Linguistics and Speech Processing, 2012.
[2] Parameshwari K, "An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil", Language in India (www.languageinindia.com), Special Volume: Problems of Parsing in Indian Languages, 11:5, May 2011.
[3] Vinod P M, Jayan V, Sulochana K G, "Malayalam Morphological Analyser: A Hybrid Approach with Apertium Lttoolbox", Proceedings of ICON-2011: 9th International Conference on NLP, Macmillan Publishers, India, pp. 219-224, 2011.
[4] Jisha P Jayan, Rajeev R R, S Rajendran, "Morphological Analyser for Malayalam: A Comparison of Different Approaches", IJCSIT, Vol. 2, No. 2, Dec 2009, pp. 155-160.
[5] http://www.apertium.org/, visited on Nov 2013.
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR Mar 2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on the multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering and Research) magazine, to be published in March 2014. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th Dec, 2013 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
SIMPLE Groups
Hello World,

Innovation can be considered a key aspect of modern society. Our lifestyles, society and work have been changed by new technological innovations. The academic study of innovation (its exploitation, the policy issues that arise, its driving forces, and its social and economic consequences) has grown rapidly from a modest start about half a century ago. Present practices are inadequate to meet changes in work and knowledge while serving a greater number of students with diverse backgrounds and educational objectives. A paradigm shift from instruction to learning is required to adequately serve the clients of educational institutions, which in turn requires an alteration in procedures for improved outcomes. Educational institutions, like all other organisations, require constant monitoring to identify areas for potential improvement. However, educational reforms are often not well implemented, and this results in massive wastage of finances, human resources, and lost potential. Educational practices, and the structures that support them, must change in order to ensure that the citizens of the future - our school children of the present - can exist and grow in a world characterised by change, unpredictability and enterprise.

Wishing you all a Merry Christmas and a Happy New Year!
Sreejith C