CLEAR September 2013

Contents

Editorial 5
SIMPLE News & Updates 6
PRISM - a Language for Statistical Modelling and Learning, Ms. Prajitha U 6
Statistical NLP Toolkits for Various Computational Linguistics Problems, Mr Sreejith C 12
Automatic Headline Generation Using Context Free Grammars, Mr Krishnaprasad P, Mr Vinayachandran K K 16
Automatic NLP and Semantic Web for Competitive Intelligence, Mr Manu Madhavan 24
ScalaNLP, Mr Robert Jesuraj 28
Adieu to First Batch of SIMPLE Groups, Ms Reshma O K 31
M.Tech Computational Linguistic Batch 2011-2013 35
CLEAR Dec 2013 Invitation 37
Last word 38

CLEAR September 2013, Volume-2 Issue-3
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
simplequest.in@gmail.com

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors: Reshma O K, Sreejith C, Gopalakrishnan G, Neethu Johnson

Cover page and Layout: Sreejith C



Greetings!

With immense pleasure we inform the readers that CLEAR turns one year old, and this September issue coincides with its birthday. This fifth edition also coincides with the commencement of the third batch of M.Tech students in Computational Linguistics. This edition of CLEAR consists of short articles on PRISM (a language for statistical modeling and learning), statistical NLP toolkits, ScalaNLP, the semantic web, and a method for automatic headline generation. This edition also bids farewell to the first outgoing batch on this platform.

We are extremely happy that our multi-pronged approach to making this program more visible in the industrial and research landscapes has started yielding results, in terms of placements, publications, and collaborative research. We also focus on Malayalam computing, with the design of a Malayalam WordNet and a morphological analyzer, which will soon be made publicly available.

The readers are invited to give frank opinions about the content and style of CLEAR, and we promise to do our best to improve further. CLEAR wishes a happy ONAM to all.

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)



NEWS & UPDATES

Adieu to First Batch of SIMPLE Groups

The pioneer batch (2011-2013) of SIMPLE Groups successfully completed their course in Computational Linguistics. The students of the 2012-14 batch organized a farewell party for their seniors on 31st July 2013. The function kicked off in the forenoon session with a bundle of games and entertainment. Later, the faculty members, including the HOD, joined the students for a sumptuous lunch. The formal meeting in the afternoon session was inaugurated by Dr. P.C. Reghu Raj, the Head of the Computer Science department, with an inaugural speech. To read more go to page: 31

Publications

"Dysarthric Speech Enhancement using Formant Trajectory Refinement", Divya Das, Dr. C. Santhosh Kumar, Dr. P. C. Reghu Raj, International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 2, Issue 4, July 2013, ISSN 2278-621X

"Rule-Based Grapheme to Phoneme Converter for Malayalam", Rechitha C R, Sumi S Nair, C Santhosh Kumar, International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), Vol 2, Issue 7, July 2013, ISSN 2279-0756

SIMPLE Groups Congratulates Divya Das, Rechitha C R, Sumi S Nair for their achievement!!!



PRISM - a Language for Statistical Modelling and Learning

Prajitha U
M.Tech Computational Linguistics
GEC Sreekrishnapuram

PRISM is a general programming language intended for symbolic-statistical modeling. It is a new programming language with a learning ability for the statistical parameters embedded in programs. Its programming system, called the "PRISM system" here, is a powerful tool for building complex statistical models. The theoretical background of the PRISM system is distribution semantics for parameterized logic programs and EM learning of their parameters from observations. The PRISM system comprises two subsystems, one for learning and the other for execution, just like a human being has an artery and a vein. The execution subsystem supports various probabilistic built-in predicates conforming to the distribution semantics and makes it possible to use a program as a random sampler.

I. Introduction

Speech and language processing comprises the fields of Computational Linguistics, Natural Language Processing, and Speech Recognition and Synthesis. The main concern of these fields is to solve the problems in the interaction between a machine and a human via language. Speech and language processing has been split into two paradigms, namely symbolic and stochastic. The most popular statistical models are HMMs (Hidden Markov Models), PCFGs (Probabilistic Context Free Grammars) and Bayesian networks. They are statistical in the sense that they involve the notion of probability or other concepts from statistical theory.

The HMM is the popular statistical model used in speech recognition systems. An HMM is nothing more than a probabilistic function of a Markov process, whereas a Markov chain is a weighted automaton in which the input sequence uniquely determines the states. In an HMM the state sequence is unknown; only a probabilistic function of it is observed. One widespread use of HMMs is in tagging. They are one of a class of models for which training is possible through the EM algorithm. An HMM is specified by a five-tuple (S, K, Π, A, B), where S and K are the set of states and the output alphabet, and Π, A, and B are the probabilities for the initial state, state transitions, and symbol emissions, respectively.

A PCFG (Probabilistic Context Free Grammar) is the simplest probabilistic grammar: it is simply a CFG with probabilities added to the rules, indicating how likely different rewritings are. A PCFG is a 5-tuple (N, Σ, P, S, D), where N is a set of non-terminal symbols, Σ is a set of terminal symbols, P is the set of rules, S is the start symbol, and D is a function assigning a probability to each rule in P. A PCFG can be used to estimate a number of useful probabilities concerning a sentence and a parse tree. This helps in disambiguation. Another attribute is that it assigns a probability to the string of words constituting a sentence, which is important in language modelling.

Bayesian reasoning provides a probabilistic approach to inference. Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h).

Bayes Theorem:

    P(h|D) = P(D|h) P(h) / P(D)

A Bayesian belief network, or Bayesian network, describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities. In contrast to the naive Bayes classifier, which assumes that all the variables are conditionally independent given the value of the target variable, Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the variables. Thus, Bayesian belief networks provide an intermediate approach that is less constraining than the global assumption of conditional independence made by the naive Bayes classifier. They are an active focus of current research, and a variety of algorithms have been proposed for learning them and for using them for inference.
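As a small worked illustration of Bayes theorem as stated in this section, the following Python sketch computes a posterior for a binary hypothesis. The function name and the numbers are illustrative assumptions, not taken from the article; P(D) is obtained from the law of total probability over h and its complement.

```python
def posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """Return P(h|D) = P(D|h) P(h) / P(D) for a binary hypothesis h.

    All three arguments are assumed to be probabilities in [0, 1].
    """
    # P(D) by total probability: P(D|h) P(h) + P(D|not h) P(not h)
    evidence = p_d_given_h * prior_h + p_d_given_not_h * (1.0 - prior_h)
    if evidence == 0.0:
        raise ValueError("P(D) is zero, so the posterior is undefined")
    return p_d_given_h * prior_h / evidence


if __name__ == "__main__":
    # Illustrative numbers only: a rare hypothesis with fairly reliable evidence.
    print(posterior(prior_h=0.01, p_d_given_h=0.9, p_d_given_not_h=0.1))
```

Even with evidence that is nine times more likely under h than under its complement, a small prior keeps the posterior low, which is the usual reason the prior term cannot be dropped.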

II. PRISM

PRISM is an acronym of Programming In Statistical Modelling. PRISM was developed by T. Sato and Y. Kameya. It is a programming language for symbolic-statistical modelling. PRISM employs a proof-theoretic approach to learning. It conducts learning in two phases: the first phase searches for all the explanations for the observed data, and the second phase estimates the probability distributions by using the EM algorithm. PRISM is a probabilistic extension of Prolog. PRISM programs are executed in a top-down, left-to-right manner just like Prolog. In short, PRISM = Prolog + probability + parameter learning. The most characteristic feature of PRISM is that it provides random switches to make probabilistic choices. A random switch has a name, a space of possible outcomes, and a probability distribution. The example given below uses just one random switch:

Example:

    target(direction/1).            % direction/1 is the observable event
    values(coin,[head,tail]).       % outcome space of the switch coin
    direction(D) :-
        msw(coin,Face),             % probabilistic choice, like a coin toss
        ( Face == head -> D = left ; D = right ).

The predicate direction(D) indicates that a person decides the direction to go as D. The decision is made by tossing a coin: D is bound to left if the head is shown, and to right if the tail is shown. In this sense, we can say the predicate direction/1 is probabilistic. It is allowed to use disjunctions (;), cut symbols (!) and if-then (->) statements as long as they work as expected according to the execution mechanism of the programming system. Besides the definitions of probabilistic predicates, we need to make some declarations. The clause values(coin,[head,tail]) declares the outcome space of a switch named coin, and the call msw(coin,Face) makes a probabilistic choice (Face will be bound to the result), just like tossing a coin. On the other hand, the clause target(direction/1) declares that the observable event is represented by the predicate direction/1. This means that we can observe the direction he/she goes.

PRISM is a powerful tool for building complex statistical models. Its programs are also able to learn from examples with the help of the EM learning algorithm. Most popular probabilistic modelling formalisms, such as HMMs, PCFGs and Bayesian networks, can be described using PRISM. PRISM offers a common vehicle for these diverse research fields. We can train arbitrarily large programs, as we desire, in PRISM.

A PRISM program DB can be written as a set of definite clauses, DB = F U R, where F is a set of facts (unit clauses) and R is a set of rules (non-unit clauses). What makes PRISM programs differ from usual logic programs is a basic joint probability distribution Pf given to F. One of the significant achievements of PRISM is the elimination of the need for deriving new EM algorithms for new applications.
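To make the idea of a random switch and its learnable parameter more concrete, here is a minimal Python sketch that mimics the coin example of this section. It is not PRISM and does not use the PRISM system: the functions msw, direction and estimate_p_head are hypothetical stand-ins written for illustration. In this particular program every observed direction has exactly one explanation (left means head, right means tail), so the maximum-likelihood estimate of the switch parameter reduces to a relative frequency; that simplification is an assumption of this sketch, not a description of how PRISM's EM learner is implemented.

```python
import random
from collections import Counter


def msw(outcomes, probs, rng):
    """Sample one outcome of a random switch (hypothetical stand-in for msw/2)."""
    return rng.choices(outcomes, weights=probs, k=1)[0]


def direction(p_head, rng):
    """Toss the coin switch and map head -> left, tail -> right."""
    face = msw(["head", "tail"], [p_head, 1.0 - p_head], rng)
    return "left" if face == "head" else "right"


def estimate_p_head(observed_directions):
    """Estimate P(coin = head) from observed directions by relative frequency."""
    counts = Counter(observed_directions)
    total = counts["left"] + counts["right"]
    if total == 0:
        raise ValueError("no observations to learn from")
    return counts["left"] / total


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    samples = [direction(0.7, rng) for _ in range(10000)]
    print(estimate_p_head(samples))  # should be close to the true value 0.7
```

The sketch uses the program twice, once as a random sampler and once as the model whose parameter is recovered from observations, which mirrors the execution and learning roles described for PRISM.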

III. PRISM programming system:

PRISM programming consists of three phases:

1. Programming
2. Learning
3. Execution

Figure 1 shows a PRISM programming system. The original program must be treated differently in the learning phase and the execution phase. A translator translates it into two specialized programs. A PRISM program comprises three parts: a model, a utility program and control declarations. The purpose of the model part, which is a logic program, is to generate possible proof trees of a target atom. The utility part also contains a logic program that makes use of a probability distribution PDB (an extension of Pf). The control declarations give the data needed for the learning phase. The structure of a PRISM program is shown in Figure 2. The structure of the language is simple; most instructions are processed independently. The power lies in a very rich set of instructions, functions and pre-defined variables. Data types are both simple and very powerful.


IV. Installing PRISM:

Windows: To install PRISM on Windows, the following steps are needed:

1. Download the package prism111_win.zip.
2. Unzip the downloaded package under C:\
3. Append C:\prism\bin to the environment variable PATH so that PRISM can be started from every working folder.

Linux:

1. Download the package prism111_linux.tar.gz
2. Unpack the downloaded package using the tar command.
3. Append $HOME/prism/bin to the environment variable PATH so that PRISM can be started from every working directory.

The package contains binaries for both 32-bit and 64-bit systems. The start-up commands such as prism, upprism and mpprism automatically choose a binary.

As far as PRISM is concerned, NLP is a promising area because of the obvious need for describing statistical correlations between syntactic structures and semantic structures. A cross-fertilization of computation power and learning power will give a new dimension to programming, which will be enabled through PRISM. Using probabilities can sometimes be more effective than using hard rules for handling many NLP problems. It is hoped that these statistical models, together with PRISM, will pave the way for success in many areas of NLP.

References:

1. Tom M. Mitchell, "Machine Learning"
2. Daniel Jurafsky and James H. Martin, "Speech and Language Processing"
3. Taisuke Sato and Yoshitaka Kameya, "PRISM: A Language for Symbolic Statistical Modeling"
4. Taisuke Sato and Neng-Fa Zhou, "A New Perspective of Statistical Modeling by PRISM"
5. Henning Christiansen, "Logical-Statistical Models and Parameter Learning in the PRISM system"

Technology Development for Indian Languages (TDIL)

The Technology Development for Indian Languages (TDIL) Programme, initiated by the Department of Electronics & Information Technology (DeitY), Ministry of Communication & Information Technology (MC&IT), Govt. of India, has the objective of developing Information Processing Tools and Techniques to facilitate human-machine interaction without language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services. The Programme also promotes Language Technology standardization through active participation in international and national standardization bodies such as ISO, UNICODE, the World Wide Web Consortium (W3C) and BIS (Bureau of Indian Standards) to ensure adequate representation of Indian languages in existing and future language technology standards. Visit: http://tdil.mit.gov.in/


Statistical NLP Toolkits for Various Computational Linguistics Problems

Sreejith C
M.Tech Computational Linguistics
GEC Sreekrishnapuram

In this article I would like to introduce some of the useful and widely used toolkits available for various linguistics problems. Most of them are open source and can be freely downloaded from the internet. The scope of applications of natural language processing is enormous and covers several areas, such as:

Semantic Search
Morphology, Syntax, Named Entity Recognition
Opinion, Emotions, Textual Entailment
Text and Speech Generation
Machine Translation
Information Retrieval and Text Clustering
Educational Applications

In the CLEAR March 2013 edition, Mr. Manu Madhavan introduced various ontology-based tools such as Protégé, Jena, SWOOP etc. In this article I would like to introduce some other basic tools which will be useful for various language processing tasks, especially for text mining.

I. Hidden Markov Model Toolkit (HTK)

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples.


HTK is available for free download, but you must first agree to its license. You must then register for a username and password, which will allow you to download the HTK Book and source code. Registration is free but does require a valid email address; your password for site access will be sent to this address.

Ref: http://htk.eng.cam.ac.uk/

II. The Stanford Natural Language Processing Group

The Stanford NLP Group makes parts of its natural language processing software available to everyone. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs. All these software distributions are open source, licensed under the GNU General Public License (v2 or later). Note that this is the full GPL, which allows many free uses, but does not allow its incorporation into any type of distributed proprietary software, even in part or in translation.

1. Stanford CoreNLP
An integrated suite of natural language processing tools for English in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference.

2. Stanford Parser
Implementations of probabilistic natural language parsers in Java: highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser.

3. Stanford POS Tagger
A maximum-entropy (CMM) part-of-speech (POS) tagger for English, Arabic, Chinese, French, and German, in Java.

4. Stanford Named Entity Recognizer
A Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition in English and German.

5. Stanford Word Segmenter
A CRF-based word segmenter in Java. Supports Arabic and Chinese.

IV. Weka: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

Ref: http://www.cs.waikato.ac.nz/ml/weka/

V. WordFreak

WordFreak is a Java-based linguistic annotation tool designed to support human and automatic annotation of linguistic data, as well as to employ active learning for human correction of automatically annotated data. For the latest news about WordFreak and to participate in discussions, check out WordFreak's Sourceforge project page.

Ref: http://wordfreak.sourceforge.net/

VI. NLTK - Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is a free, open source, community-driven project. NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."

Ref: http://nltk.org/

VII. Ngram Statistics Package (NSP)

NSP allows you to identify word and character N-grams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice coefficient, etc. NSP has been designed to allow a user to add their own tests with minimal effort.

Ref: http://ngram.sourceforge.net/

VIII. MALLET

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

Ref: http://mallet.cs.umass.edu/

IX. Emdros

Emdros is a text database engine for analyzed or annotated text. It is applicable especially in linguistics, corpus linguistics, and computational linguistics. Emdros is Open Source and Free Software (Libre Software).

Ref: http://emdros.org/

X. Ellogon

Ellogon is an effort that tries to keep all the excitement and reduce all the complexity. Ellogon is different from other similar software. First of all, it respects the user's time by offering a simple and user-friendly graphical interface. But beneath this simple appearance a powerful engine is hidden, which has been proved able to support a wide range of uses, from simple research prototypes to commercial applications. Ellogon is licensed under the GNU LGPL license, is easy to install and administer, and is reliable. Running under all major operating systems, Ellogon offers a comfortable environment for computational linguists, language engineers or plain users.

Ref: http://www.ellogon.org/

These are only some of the natural language processing toolkits available over the internet. There are several other tools. Have a look at them and get familiar with them. I hope that this article will help you to start your "experimentation" with languages. Happy coding...


Automatic Headline Generation Using Context Free Grammars

Krishnaprasad. P
Student, B.Tech (IT)
Govt. Engineering College Sreekrishnapuram

Vinayachandran K.K
Assistant Professor (IT)
Govt. Engineering College Sreekrishnapuram

This article presents a novel way of generating a headline by exploiting the benefits of context free grammars. The system first summarizes the given document to obtain a condensed form of the text, and then identifies the content words. The input for sentence generation is obtained by separating named entities, nouns, verbs etc. from the content words. The system generates the summary of the text document using a word-frequency based scoring technique. For generating the title, the paper presents a context free grammar which produces a suitable sentence from the given content words. The experiments showed that the generated titles are effective and that the suggested titles are helpful in picking out the important documents.

I. Introduction

Natural Language Processing has received a great deal of attention in recent research because of its wide applicability. Research on automatic text summarization provides a basis for research on headline generation. The rapid growth of the Internet has resulted in enormous amounts of information that has become increasingly difficult to access efficiently. The ability to summarize information automatically and present results to the end user in a compressed, yet complete, form would help to solve this problem. A headline of a text, especially an article, is a succinct representation of the relevant points of the input text. Headline generation differs from the task of producing abstracts in the size of the generated text, and focuses on compressing the output. Headlines are terse, while abstracts are expressed using relatively more words. While headlines focus on pointing out the most relevant theme expressed in the input text, abstracts summarize the important points. This makes both headline generation and summarization extremely valuable.


Early work on generating headlines for documents and on text summarization was based purely on statistical extraction. Most of the summarization work done to date is based on extraction of sentences from the original document. Sentence extraction techniques compute a score for each sentence based on features such as the position of the sentence in the document, word or phrase frequency, and key phrases (terms which indicate the importance of the sentence towards the summary). There have been some attempts to use machine learning (to identify important features) and natural language processing (to identify key passages, or to use relationships between words rather than a bag of words). The vector representation model is also used in text summarization techniques.

Headlines are commonly associated with news articles but have a wide range of applications. Application areas of headline generation range from generating a table of contents for a document to providing support for interactive query refinement in search engines. Headlines extracted from search result web pages can be used to augment a user search query. The resultant query can be used to further re-rank and improve upon the search results. This approach of augmenting a user query with key words extracted from text is being increasingly used in contextual text search and Information Retrieval.

Automatic headline generation tries to automate the process of providing more relevant or reflective insight into the input text, rather than producing catchy lines. Automating in this context has to involve some form of learning rather than an algorithmic approach, given the potentially infinite stretch of natural language text. Many machine learning techniques have been explored, involving varying degrees of use of natural language understanding techniques.

Context Free Grammars are relatively recent techniques used in natural language processing. In this work we try to produce a natural headline using a context free grammar. The advantage of using CFGs is that they do not require any training data. We also present a summarization technique based on a sentence scoring method as a part of content word extraction.

II. Context Free Grammar Model For Headline Generation

Here we present a model for generating a headline for a text document using context free grammars. Our model is composed of two parts: the first part is summarization and the second part is headline synthesis. The algorithm for headline synthesis depends only on the input text document. Natural language sentence generation is the heart of the algorithm.

A. Summarization

The summarization system has both a text analysis component and a summary generation component. The text analysis component is used to identify the features associated with each sentence. Before the extraction process, text normalization is performed (text normalization involves splitting the text into sentences). After text normalization, the normalized text is passed through a feature extraction module. Feature extraction includes extracting the features associated with the sentences and the features associated with words, such as named entities, word frequency, characters per word etc. Later, in order to summarize, the system calculates the score for each sentence based on the features identified in the previous step. Sentence refinement is done on the sentences with high scores, and the resulting sentences are selected for the summary in the same order as they were found in the input text document. The various steps in summarization can be summarized as follows:

1. Sentence marking:

This module divides the document into sentences. It appears that using end-of-sentence punctuation marks, such as periods, question marks, and exclamation points, is sufficient for marking sentence boundaries. It should be noted that exclamation points and question marks are somewhat less ambiguous, whereas periods can also appear in non-standard words like web URLs, emails etc.

2. Feature extraction:

The system extracts both sentence level and word level features. We are actually interested only in word level features because we do not require a high quality summary for the title generation process. Our aim is to supply a brief summary as input to the headline synthesis part. For picking out the best sentences from the given document we follow the sentence scoring technique based on word frequency and average number of characters per sentence.


3. Summary Generation:

B. Headline synthesis

Summary generation include tasks such as

Headline Synthesis involves generating a

calculating the score for each sentence,

suitable headline for given input text file

selecting the sentences with high score, and

based on the content words extracted from

refinement of the selected collection of

the

sentences.

components:

4. Sentence Ranking:

1. Extracting content words:

Summarization system follow simple but

Content words are the word which represents

efficient sentence ranking technique based

over all text. Identification of content words

on word frequency of particular word in the

has special importance, because the quality

sentence and average number of characters

of the title generated will depends on the

in each word. Mathematical model of

exact identification of the content words. We

sentence ranking is discussed later in this

can follow any prominent method for

paper.

extracting out the content words.

5. Sentence Selection:

2. Identification of Elements for title

document.

It

comprised

of

three

generation: After the sentences are scored, we need to select

the

good

Analyzing the headlines, it is noted that the

summary. One strategy is to pick the top N

headline is formed by named entities or/and

sentence towards the summary, but this

frequent nouns or verbs. Out of number of

creates the problem of coherence. The

selected content words we are actually

selection of sentence is dependent upon the

interested only in nouns, verbs, and named

type

entities.

of

sentences

sentences

the are

that

make

summary requested. selected

based

on

The the

percentage of output text required with

3. Generation of Headline:

respect to the input document.

CLEAR September 2013

19


The heart of this paper lies in the fact that context free grammars are used for generating the title from the identified elements. The effectiveness of the title depends on how well a natural language sentence is generated from the identified headline elements and on the refinement that follows. It is worth noting that the model depends entirely on the given input text file. The key advantage of this headline synthesis model is that the system does not need any separate learning process.

III. Mathematical Modeling

A. HS Algorithm

Through this work we implement a new algorithm named the HS (Headline Synthesis) algorithm. As already mentioned, the context free grammar based algorithm is the basis for headline synthesis. The algorithm first summarizes the given text and then generates the headline from that summary. The abstract view of the algorithm is given below.

A.1 SSS Algorithm

Following the discussion of the summarization system, we implement the Sentence Scoring Summarization (SSS) algorithm as the mathematical basis for the sentence scoring summarization method. Since the key part of summarization lies in sentence scoring and sentence selection followed by coherence, the quality of the generated summary can be ensured through the scoring function. Sentence scoring can be performed using:

Term frequency: It takes into account only the frequency of a term inside the document:

$TF_{i,j}$ = number of occurrences of term $i$ in document $j$.

Document length: It is logical to assume that terms appear more frequently in bigger files, so if a term is relatively more frequent in a short file than in a big file, then it is more important. To incorporate document length in the weighting formula we define:

$DL_j$ = total number of term occurrences in document $j$.

This can be normalized by the average length of a document:

$NDL_j = DL_j / (\text{average } DL \text{ of all documents})$

A.2 CWE Algorithm

The CWE (Content Word Extraction) algorithm extracts the content words of the text document. Once the content words are separated, we make dictionaries of nouns, verbs and named entities.
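The term frequency and normalized document length defined for sentence scoring can be sketched in Python as follows. The article does not say how the two quantities are combined or how a sentence score is aggregated, so dividing TF by NDL and averaging the weights over a sentence are assumptions made only for illustration, and all function names are hypothetical.

```python
from collections import Counter


def term_frequencies(tokens):
    """TF_{i,j}: number of occurrences of each term i in one document j."""
    return Counter(tokens)


def normalised_lengths(docs):
    """NDL_j = DL_j / (average DL over all documents)."""
    lengths = [len(doc) for doc in docs]
    average = sum(lengths) / len(lengths)
    return [length / average for length in lengths]


def term_weights(tokens, ndl):
    """Assumed combination: TF divided by the normalised document length."""
    return {term: count / ndl for term, count in term_frequencies(tokens).items()}


def select_sentences(sentences, weights, percent):
    """Keep the top `percent` of sentences by mean term weight, in original order."""
    def score(sentence):
        return sum(weights.get(t, 0.0) for t in sentence) / max(len(sentence), 1)

    keep = max(1, round(len(sentences) * percent / 100))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]
```

Here `percent` plays the role of the requested summary size relative to the input document; sentences are tokenized lists of words, and at least one sentence is always kept.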


A.3 HS-CFGs Algorithm

The heart of the Headline Synthesis algorithm is the HS-CFGs (Headline Synthesis using Context Free Grammars) algorithm.

Input: Let l1, l2 and l3 be the lengths of the dictionaries representing nouns, verbs and named entities respectively. Assume that the nouns, verbs and named entities have already been extracted from the content words and stored in the dictionaries noun, verb and performers list respectively.

Output: A suitable headline for the input text.

Method:
[1] If l2 > 0, add production S -> NP, VP
[2] Else add production S -> NP
[3] Add production NP -> 'the', N
[4] If l3 > 0, then for each item in performers list
[5]     add production N -> item
[6] Else, for each item in noun
[7]     add production N -> item
[8] If l2 > 0, then
[9]     add production VP -> V, 'the', N
[10]    for each item in verb, add production V -> item
[11] Initialize the rules and set the expansion list to empty
[12] Starting from S, for the current symbol:
[13]    if the symbol is in the set of rules, grab one possible expansion at random
[14]    and expand each element of that expansion in turn
[15]    if the rule was not found, the symbol is a terminal: simply append the string to the expansion
[16] For every word in the expansion, if the word is repeated, eliminate the repeated word
[17] Output the headline

IV. Manual Evaluation Technique

Manual evaluation is a simple setup in which the machine generated headline is evaluated manually. For a number of documents, the machine generated headline is compared against a human generated headline. A suitable score (mark) is assigned to both headlines (human generated and machine generated). The quality of the headline generation system can then be analyzed using a suitable graphical method (bar graph or line graph).

V. Evaluation using vector space analysis

A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term, then the value within the vector is greater than zero. In this method both machine generated and human generated headlines are converted into document vectors. The deviation between the two is measured, and plotted, using the cosine of the angle between the vectors:

$\cos A = \dfrac{t_1 \cdot t_2}{\sqrt{t_1^2}\,\sqrt{t_2^2}}$

where t1 is the document vector of the machine generated headline and t2 is that of the human synthesized headline. The graph obtained is shown in Figure 1.

Figure 1: Cosine Similarity

VI. Conclusion

The headline generation system is very successful for scientific and technical documents, but less powerful for poetic language. The reason is that poetic language contains fewer content words than a scientific document. The size of the document also has an impact on the generated headline. The machine generated title also depends on how effectively the document is summarized. The extraction of content words can also influence the headline. The grammar used for sentence generation can influence the accuracy of the headline. The advantage of this headline generation system is that it does not require any machine learning and the generated title depends entirely on the input text file. Improved methods for keyword extraction and a novel way of generating accurate sentences from input words will make the system more powerful.
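As a rough illustration of the HS-CFGs method described in Section A.3, the sketch below builds the productions from the three dictionaries, expands the start symbol S by random choice, and removes repeated words. It is a minimal Python sketch, not the authors' implementation: the function names, the dictionary-based grammar representation and the `seed` parameter are assumptions, and it presumes that at least one noun or named entity is available.

```python
import random


def build_grammar(nouns, verbs, performers):
    """Build the productions as a dict: symbol -> list of possible expansions."""
    grammar = {
        "S": [["NP", "VP"]] if verbs else [["NP"]],
        "NP": [["the", "N"]],
        # Named entities ("performers") are used for N when present, else plain nouns.
        "N": [[w] for w in (performers or nouns)],
    }
    if verbs:
        grammar["VP"] = [["V", "the", "N"]]
        grammar["V"] = [[w] for w in verbs]
    return grammar


def expand(symbol, grammar, rng):
    """Recursively expand a symbol; a symbol without a rule is a terminal."""
    if symbol not in grammar:
        return [symbol]
    words = []
    for element in rng.choice(grammar[symbol]):
        words.extend(expand(element, grammar, rng))
    return words


def synthesise_headline(nouns, verbs, performers, seed=0):
    """Expand S, then drop repeated words, keeping the first occurrence of each."""
    rng = random.Random(seed)
    words = expand("S", build_grammar(nouns, verbs, performers), rng)
    seen, headline = set(), []
    for word in words:
        if word not in seen:
            seen.add(word)
            headline.append(word)
    return " ".join(headline)
```

Because N is filled from the performers list whenever it is non-empty, subject and object are drawn from the same dictionary, so the duplicate-removal step often shortens the expansion considerably.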

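The cosine measure used in the vector space evaluation of Section V can be sketched as below. This is a minimal sketch under stated assumptions: headlines are split on whitespace and lower-cased, each distinct word is one dimension holding its count, and the function names are hypothetical.

```python
import math
from collections import Counter


def headline_vector(headline):
    """Term-count vector of a headline; each distinct lower-cased word is a dimension."""
    return Counter(headline.lower().split())


def cosine_similarity(machine_headline, human_headline):
    """Cosine of the angle between the two headline vectors (0.0 if either is empty)."""
    t1 = headline_vector(machine_headline)
    t2 = headline_vector(human_headline)
    dot = sum(t1[w] * t2[w] for w in t1.keys() & t2.keys())
    norm1 = math.sqrt(sum(c * c for c in t1.values()))
    norm2 = math.sqrt(sum(c * c for c in t2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

A value near 1 means the machine generated and human generated headlines share most of their words; a value of 0 means they share none.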


References

[1] Automated Natural Language Headline Generation Using Discriminative Machine Learning Models. Akshay Kishore Gattani. B.E. (Honors), Birla Institute of Technology and Science, Pilani (India), 2004.

[2] Automatic Text Summarization using a Machine Learning Approach. Joel Larocca Neto, Alex A. Freitas, Celso A. A. Kaestner. Pontifical Catholic University of Parana (PUCPR), Rua Imaculada Conceicao, 1155.

[3] Bengali Text Summarization By Sentence Extraction. Kamal Sarkar. Computer Science Engineering Department, Jadavpur University, Kolkata 700 032, India.

[4] Challenges and Trends of Automatic Text Summarization. Oi Mean Foong, Alan Oxley, Suziah Sulaiman. Universiti Teknologi Petronas, Malaysia.

[5] Improved Algorithms For Keyword Extraction and Headline Generation From Unstructured Text. Amit Kumar Mondal and Dipak Kumar Maji. Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India 208016.

[6] Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper.

[7] Optimizing Machine Learning Approach Based on Fuzzy Logic in Text Summarization. Farshad Kyoomarsi, Hamid Khosravi, Esfandiar Eslami, and Pooya Khosravyan Dehkordy. Islamic Azad University (Shahrekord branch); International Center for Science and High Technology and Environmental Sciences, University of Shahid Bahonar Kerman; Shahid Bahonar University of Kerman, The Center of Excellence for Fuzzy System and Applications.

[8] Sentence Extraction Based Single Document Summarization. Jagadeesh J, Prasad Pingali, Vasudeva Varma. Workshop on Document Summarization, 19th and 20th March, 2005, IIIT Allahabad. Report No: IIIT/TR/2008/97.

[9] Using Machine Learning for Medical Document Summarization. Kamal Sarkar, Mita Nasipuri, Suranjan Ghose. Computer Science and Engineering Department, Jadavpur University, Kolkata-700 032, India.



NLP and Semantic Web for Competitive Intelligence

Manu Madhavan
Asst. Professor, SIMAT, Vavannur

Although Natural Language Processing and Semantic Web technologies are both “Semantic Technologies,” they are, in a way, opposites. NLP tools focus on unstructured information, such as long-form documents, emails, and web pages, while Semantic Web tools typically deal with more structured information on a much more granular level. However, there are many important problems that span the two worlds of structured and unstructured information where the combination of NLP and Semantic Web tools is highly complementary. In fact, the flexibility of the Semantic Web’s data model is a particularly good fit for problems involving lots of unstructured information, making the combination particularly powerful.

I. Competitive Intelligence

Competitive Intelligence (CI) is the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization. Experts also call this process early signal analysis. This definition focuses attention on the difference between the dissemination of widely available factual information (such as market statistics, financial reports, newspaper clippings), performed by functions such as libraries and information centers, and competitive intelligence, which is a perspective on developments and events aimed at yielding a competitive edge [1].


II. The Scenario

There has been huge development in computer technology and an accelerated growth in the quantity of information produced in the last two decades of the 20th century. But how do companies use these published data, mainly in the digital media? What do they use to increase their competitive advantages? It is true that most companies recognize information as an asset and believe in its value for strategic planning. The big difficulty, however, is to deal with information in a changing environment. The temporal character of information is becoming more and more critical. Information valid today may not be valid tomorrow. Data are not static blocks to become a building block of a temporal reality. Information analysis is no longer an action; it has become a process [2].

Due to the lack of structure in news clippings, it is very difficult for a pharmaceutical competitive intelligence officer to get answers to questions such as, "Which companies have published information in the last 6 months referencing compounds that target a specific pathway that we're targeting this year?" At the moment, the most common approach to this problem is for certain people to read thousands of articles and keep this information in their heads, or in workbooks like Excel, or, more likely, nowhere at all.

To make a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy. The data required for analysis must be obtained by identifying and extracting target attribute values in unstructured and often very large (multi-terabyte or petabyte) data stores. This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms.

Taking advantage of the large amount of information on the World Wide Web, a methodology (WeCIM) is proposed to develop applications that gather, filter and analyze web data and turn it into usable intelligence. In order to enhance information search quality, the use of ontologies that allow computers to “understand” particular knowledge domains is proposed.

III. Combining NLP and Semantic Web Functionalities

Sean Martin, CTO of Cambridge Semantics, commented that, "although Natural Language Processing and Semantic Web technologies are both “Semantic Technologies,” they are, in a way, opposites. NLP tools focus on unstructured information, such as long-form documents, emails, and web pages, while Semantic Web tools typically deal with more structured information on a much more granular level".

So how can NLP technologies realistically be used in conjunction with the Semantic Web? The answer is that the combination can be utilized in any application where you are contending with a large amount of unstructured information, particularly if you are also dealing with related, structured information stored in conventional databases. Clearly, then, the primary pattern is to use NLP to extract structured data from text based documents. These data are then linked via Semantic technologies to pre-existing data located in databases and elsewhere, thus bridging the gap between documents and formal, structured data [2].

IV. Text mining functionalities for CI

The first supporting functionality for the analyst comes from the limited daily reading capacity of a human being [3]. The Filtering functionality has been created with the purpose of allowing pre-selection of reading contents, assuming that the important information is, very likely, within the filtered subset. The technological Event Alert functionality was developed with the objective of advising the analyst as soon as possible of pre-specified events that are important to the business. The third and last functionality is a Semantic Search tool, which becomes necessary for information demanded ad hoc. This demand comes from the fact that both Filtering and Event Alert are planned and predefined. The objective of this tool is, therefore, to allow the analyst to reach the information required in a particular instance as soon as possible.

V. Conclusion

The combination of NLP and Semantic Web technology enables the competitive intelligence officer to ask complicated questions and actually get reasonable answers in return. By their very nature, NLP technologies can extract a wide variety of information, and Semantic Web technologies are by their very nature created to store such varied and changing data. In cases such as this, a fixed relational model of data storage is clearly inadequate.

References

1. http://www.cambridgesemantics.com/semantic-university/nlp-and-the-semantic-web
2. Christian Aranha, Emmanuel Passos, “Automatic NLP for Competitive Intelligence”, Pontifical Catholic University of Rio de Janeiro, Brazil, 2008.
3. Juan Antonio, “Semantic Web meets Competitive Intelligence”, Master Thesis, University of Granada, 2009.

A web of meaningful data

"The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help. One of the major obstacles to this has been the fact that most information on the Web is designed for human consumption, and even if it was derived from a database with well defined meanings (in at least some terms) for its columns, that the structure of the data is not evident to a robot browsing the web. Leaving aside the artificial intelligence problem of training machines to behave like people, the Semantic Web approach instead develops languages for expressing information in a machine processable form" [Tim Berners-Lee, Semantic Web Road Map, Sept. 1998]
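The extract-then-link pattern described in Section III of the article above can be made concrete with a toy Python sketch. Everything in it is hypothetical: the regular expression stands in for a real NLP extractor, the `COMPOUND_PATHWAYS` dictionary stands in for a conventional database, and the company, compound and pathway names are invented. It also ignores the time window ("the last 6 months") in the example question.

```python
import re

# Hypothetical structured data, standing in for a conventional database table.
COMPOUND_PATHWAYS = {"CompoundX": "PathwayA", "CompoundY": "PathwayB"}

# Toy pattern standing in for a real NLP extraction step (an assumption).
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+) published (?:a study|results) on (?P<compound>\w+)"
)


def extract_triples(text):
    """Turn unstructured sentences into (subject, predicate, object) triples."""
    return [(m["company"], "publishedOn", m["compound"]) for m in PATTERN.finditer(text)]


def link_triples(triples, compound_pathways):
    """Link extracted triples to pre-existing structured data about compounds."""
    linked = list(triples)
    for _, _, compound in triples:
        if compound in compound_pathways:
            linked.append((compound, "targets", compound_pathways[compound]))
    return linked


def companies_targeting(pathway, triples):
    """Which companies published on compounds that target `pathway`?"""
    compounds = {s for s, p, o in triples if p == "targets" and o == pathway}
    return sorted({s for s, p, o in triples if p == "publishedOn" and o in compounds})
```

Storing both the extracted facts and the database facts as triples is what lets a single query span documents and structured data, which is the point the article makes about the flexibility of the Semantic Web data model.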


ScalaNLP

Robert Jesuraj
M.Tech Computational Linguistics

Breeze is the new scientific computing library for Scala, supporting linear algebra, numerics, statistics, machine learning, and natural language processing. Breeze merges two formerly separate projects: ScalaNLP and Scalala. Scalala provides linear algebra and numerics, while ScalaNLP provides the rest. The Scalala portions are largely rewritten, with high performance being the priority.

I. An Introduction to Scala Programming language

Scala is an object-functional programming and scripting language for general software applications, statically typed, designed to concisely express solutions in an elegant, type-safe and lightweight (low ceremonial) manner. Scala includes full support for functional programming (including currying, pattern matching, algebraic data types, lazy evaluation, tail recursion, immutability, etc.). It cleans up what are often considered to have been poor design decisions in Java (e.g. type erasure, checked exceptions, the non-unified type system) and adds a number of other features designed to allow cleaner, more concise and more expressive code to be written.

It is intended to be compiled to Java bytecode (executable on the JVM) or .NET. Like Java, Scala is statically typed and object-oriented, uses curly-brace syntax reminiscent of C, and compiles code into Java bytecode, allowing Scala code to be run on the JVM and permitting Java libraries to be freely called from Scala (and vice-versa) without the need for a glue layer in between. Compared with Java, Scala adds many features of functional programming languages like Scheme, Standard ML and Haskell, including anonymous functions, type inference, list comprehensions (known in Scala as "for-comprehensions"), lazy initialization, extensive language and library support for side-effect-less code, pattern matching, case classes, delimited continuations, higher-order types, much better support for covariance and contravariance than in Java, etc.

The name Scala is a blend of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.

II. ScalaNLP: Natural Language Processing and Machine Learning

ScalaNLP is a suite of machine learning and numerical computing libraries. ScalaNLP is the umbrella project for Breeze and Epic. Breeze is a set of libraries for machine learning and numerical computing.

Breeze consists of five parts:

• breeze-math contains high-performance linear algebra and numerics.
• breeze-process contains tokenization and other NLP-related routines.
• breeze-learn contains machine learning and optimization routines.
• breeze-viz contains plotting and visualization routines.
• breeze-core contains some basic data structures and configuration.

Breeze also provides a fairly large number of built-in probability distributions. These come with access to either a probability mass function (for discrete distributions) or a probability density function (for continuous distributions). Many distributions also have methods for giving the mean and the variance.

Breeze's optimization package includes several convex optimization routines and a simple linear program solver. Convex optimization routines typically take a DiffFunction[T], which is a Function1 extended to have a gradient method, which returns the gradient at a particular point. Most routines will require a breeze.linalg-enabled type: something like a Vector or a Counter.

Breeze Data: Most of the classifiers in breeze-learn expect Examples for training, which are simply traits that have a label, some features, and an id. Example is generic about what those types are. Observation is Example's parent type, and it differs only in not having a label. There are also some routines for reading in different standard formats of datasets.

Breeze.text.tokenize: This package provides methods for tokenizing text and turning it into more useful forms. For example, we provide routines for segmenting sentences, tokenizing text into a form expected by most parsers, and stemming words.

Breeze util: util contains a number of useful things. Most notable is Index, which specifies a mapping from objects to integers. In NLP, we often need to map structured objects (strings, say) to unique integers for efficiency's sake. This class, along with Encoder, allows for that.

Breeze classify: Breeze provides 4 classifiers: a standard Naive Bayes classifier, an SVM, a Perceptron, and a Logistic Classifier (also known as a softmax or maximum entropy classifier). All classifiers (except Naive Bayes) have a Trainer class in their companion object, which can be used to train up a classifier from breeze.data.Examples. This classifier can then be applied to new observations.

References

[1] https://github.com/scalanlp/breeze
[2] http://www.scala-lang.org/
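The idea behind the Index class described under Breeze util, mapping objects such as strings to unique integers, can be illustrated independently of any library. The following is a small Python analogue written for this illustration only; it is not the Breeze API, and the class layout, method names and the `encode` helper are assumptions.

```python
class Index:
    """Bidirectional mapping between hashable objects and consecutive integers."""

    def __init__(self):
        self._to_id = {}
        self._objects = []

    def index(self, obj):
        """Return the integer for obj, assigning the next free one on first sight."""
        if obj not in self._to_id:
            self._to_id[obj] = len(self._objects)
            self._objects.append(obj)
        return self._to_id[obj]

    def get(self, i):
        """Return the object that was assigned integer i."""
        return self._objects[i]

    def __len__(self):
        return len(self._objects)


def encode(tokens, index):
    """Hypothetical helper: replace each token by its integer id."""
    return [index.index(t) for t in tokens]
```

Once tokens are integers, feature vectors can be stored as dense or sparse arrays indexed by those integers instead of as string-keyed maps, which is the efficiency argument made in the text.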

Sandhan

'Sandhan' (http://www.tdil-dc.in/sandhan), a monolingual search engine for five Indian languages (Bangla, Hindi, Marathi, Tamil and Telugu), was released on September 20, 2012, at Electronics Niketan, New Delhi. The 'Sandhan' project was taken up under the umbrella of the Technology Development in Indian Languages (TDIL) programme of the Ministry of Communication and Information Technology, and executed by institutions such as AUKBC, AUCEG, CDAC Noida (Co-coordination), CDAC Pune, DAICT Gandhinagar, Gauhati University, IIT Bombay (Coordination), IIT Kharagpur, IIIT Bhubaneshwar, IIIT Hyderabad, ISI Kolkata, and Jadavpur University.



Adieu to First Batch of SIMPLE Groups

Reshma O.K.
M.Tech Computational Linguistics
GEC Sreekrishnapuram

“We will never forget them nor the last time we saw them this morning as they prepared for their journey and waved goodbye and 'slipped the surly bonds of earth to touch the face of God.'” - Ronald Reagan

The pioneer batch (2011-2013) of SIMPLE Groups has successfully completed its course in Computational Linguistics. The students of the 2012-14 batch organized a farewell party for them on 31st July 2013. They were real role models for all of us in every respect. We are proud to say that SIMPLE Groups is the result of their intense hard work, determination and enthusiasm. Though it is sad that they are not with us when the 'CLEAR birthday issue' is presented, it is a privilege to sustain their fruitful effort.

Apart from their team effort, each one of them put in individual effort to draw attention to SIMPLE Groups and to M.Tech CL at GEC Sreekrishnapuram. They made an effort to participate in various conferences, seminars and competitions, to share and showcase their understanding and insights in the area of Computational Linguistics. We are glad to say that Robert Jesuraj of the M.Tech CL (2011-2013) batch was awarded the esteemed Garuda Challenge 2013 award by CDAC for GRID enthusiasts.

Robert Jesuraj of SIMPLE Groups won the first prize in GARUDA Challenge 2012

The National Seminar on Relevance of Malayalam in the Field of Information Technology, jointly conducted by the Dept. of Linguistics, Kerala University and KSCSTE on 1-2 November 2012, was attended by some of them. They presented their papers on Malayalam computing, thus taking part in the promotion of the use of Malayalam for the dissemination of information to common people. Christopher, Sibi, Divya, Rinju, Sumi, Divya Das, Radhika, Ayisha, Renuka, Saani, Athira, Ancy and Pragisha presented their papers.

Pragisha presenting at National Seminar @ Kerala University

Christopher Augustine and Manu Madhavan presented their papers on collocation generation and Karaka relations respectively at the National Conference on Indian Language Computing (NCILC 2012), conducted by CUSAT on February 19 and 20. NCILC 2013, conducted by the Dept. of Computer Application, CUSAT, was held on 19-20 January 2013. Manu Madhavan, Mujeeb Rehman and Robert Jesuraj represented SIMPLE Groups and presented their papers related to Malayalam prosodic patterns and POS tagging respectively.

Manu Madhavan, Mujeeb Rehman, Rechitha and Robert Jesuraj of M.Tech Computational Linguistics attended the 4th IASNLP workshop at IIIT-H from 5th to 14th July 2012. This gave them a good chance to meet a lot of researchers, academicians and industrialists in NLP, and to introduce the course to them.

A one day workshop on Malayalam Computing, jointly conducted by Thunchath Ezhuthachan Malayalam University and Kerala State IT Mission, was held on February 8th 2013. Radhika K.T. represented SIMPLE Groups and presented a paper on Conceptual Indexing and Compound Word Splitter in Malayalam.

The Computational Engineering and Networking department of Amrita Viswa Vidhyapeedom organized a one week workshop on Computational Linguistics and Machine Translation (CLMT) from English to Indian Languages. Manu Madhavan, Nibeesh, Robert Jesuraj and Sreejith C represented SIMPLE Groups at CLMT.

The International Conference on Mathematical Modelling in Computer Management and Medical Sciences 2013 (ICMCMM 2013) was held on 13-15 June 2013 at Thiruvalla. Divya S from SIMPLE Groups presented a paper on News Summarization.

A paper on Grapheme to Phoneme Converter for Malayalam by Rechitha and Sumi was published in the July issue of the International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP). The paper on Dysarthric Speech Enhancement by Divya Das of SIMPLE Groups was published in Volume 2, Issue 4 (July 2013) of the International Journal of Latest Trends in Engineering and Technology (IJLTET). Thus each one of them made an immense effort in making SIMPLE Groups popular.

The farewell function commenced in the forenoon session with a bundle of games and entertainment. Later, the faculty members, including the HOD, joined the students for a sumptuous lunch. The afternoon session commenced with a meeting, which was inaugurated by Dr. P.C. Reghu Raj, the Head of the Computer Science department. During his inaugural speech, he shared some of his memories and appreciated them for their remarkable achievements. He concluded by wishing them good luck in all their future endeavors. After that, the senior faculty members Asst. Prof. C. Nazeer and Asst. Prof. R. Binu, followed by other staff members from the CS department, shared their memories with the first batch. The faculty members also thanked them for all the assistance they had provided while conducting various events. There was a mixture of emotions among the audience when each one from the senior batch started sharing their experiences. Everyone thanked the HOD and other faculty members for their endearing efforts in helping them during the course. It was a memorable and eventful day for SIMPLE Groups, with great enthusiasm and a lot of nostalgia.


M.TECH Computational Linguistics 2011-2013

Ancy K Sunny
Athira P M
Ayisha Noori V K
Christopher Augustine
Divya Das
Divya S
Manu Madhavan
Mujeeb Rehman O
Pragisha K
Radhika K T
Rechitha C R
Renuka Babu T
Rinju O R
Robert Jesuraj K
Saani H
Sibi S
Sumi S Nair


M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com

SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering

Article Invitation for CLEAR Dec 2013

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in Dec 2013. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th Dec, 2013 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Magazine
Representative, SIMPLE Groups


Hello World,

This 5th issue of CLEAR brings some coincidences with it. This September issue is a BIRTHDAY issue, celebrating CLEAR's birthday. It is also an edition which witnessed both the admission of one batch and the farewell of another in Computational Linguistics. The CLEAR team wishes them the very best in their life ahead.

Now, let me share one of my experiences from a workshop at Sri Krishna College. The majority of the participants were research scholars, most of them in the area of Web Content Mining and the Semantic Web. This was an eye-opening incident that showed how many people are interested in this area. The tremendous development in computer technology and the rapid growth of information on the web have made that information difficult to handle. The need for a methodology to analyze and filter the data is therefore at its peak. The Semantic Web, using ontologies, helps to overcome this hindrance, and so offers a lot of opportunities for exploration. SIMPLE Groups welcomes more aspirants in this area.

Wishing you all a Happy and Prosperous ONAM.

Reshma O.K.
