CLEAR September 2013, Volume 2, Issue 3
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering,
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad
Editors: Reshma O K, Sreejith C, Gopalakrishnan G, Neethu Johnson
Cover page and Layout: Sreejith C

Contents

Editorial ... 5
SIMPLE News & Updates ... 6
PRISM - a Language for Statistical Modelling and Learning (Ms. Prajitha U) ... 6
Statistical NLP Toolkits for Various Computational Linguistics Problems (Mr. Sreejith C) ... 12
Automatic Headline Generation Using Context Free Grammars (Mr. Krishnaprasad P, Mr. Vinayachandran K K) ... 16
Automatic NLP and Semantic Web for Competitive Intelligence (Mr. Manu Madhavan) ... 24
ScalaNLP (Mr. Robert Jesuraj) ... 28
Adieu to First Batch of SIMPLE Groups (Ms. Reshma O K) ... 31
M.Tech Computational Linguistics Batch 2011-2013 ... 35
CLEAR Dec 2013 Invitation ... 37
Last word ... 38
Greetings! It is with immense pleasure that we inform the readers that CLEAR turns one year old, and this September issue coincides with its birthday. This fifth edition also coincides with the commencement of the third batch of M.Tech students in Computational Linguistics. This edition of CLEAR consists of short articles on PRISM (a language for statistical modeling and learning), statistical NLP toolkits, ScalaNLP, the semantic web, and a method for automatic headline generation. This edition also bids farewell to the first outgoing batch on this platform. We are extremely happy that our multi-pronged approach to making this program more visible in the industrial and research landscapes has started yielding results, in terms of placements, publications, and collaborative research. We also focus on Malayalam computing, with the design of a Malayalam WordNet and a morphological analyzer, which will soon be made publicly available. The readers are invited to give frank opinions about the content and style of CLEAR, and we promise to do our best to improve further. CLEAR wishes a happy ONAM to all.

With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES

Adieu to First Batch of SIMPLE Groups

The pioneer batch (2011-2013) of SIMPLE Groups successfully completed their course in Computational Linguistics. The students of the 2012-14 batch organized a farewell party for their seniors on 31st July 2013. The function kicked off in the forenoon session with a bundle of games and entertainment. Later, the faculty members, including the HOD, joined the students for a sumptuous lunch. The formal meeting in the afternoon session was inaugurated by Dr. P.C. Reghu Raj, the Head of the Computer Science department, with an inaugural speech. To read more, go to page 31.
Publications
" Dysarthric Speech Enhancement using Formant Trajectory Refinement" , Divya Das, Dr. C. Santhosh Kumar, Dr. P. C. Reghu Raj, International Journal of Latest Trends in Engineering and Technology (IJLTET), Volume 2, Issue 4 , July 2013, ISSN 2278-621X
“Rule-Based Grapheme to Phoneme Converter for Malayalam“, Rechitha C R, Sumi S Nair, C Santhosh Kumar , International Journal of Computational Linguistics and Natural Language Processing (IJCLNLP), Vol 2, Issue 7, July 2013, ISSN 2279 – 0756
SIMPLE Groups Congratulates Divya Das, Rechitha C R, Sumi S Nair for their achievement!!!
PRISM - a Language for Statistical Modelling and Learning

Prajitha U
M.Tech Computational Linguistics, GEC Sreekrishnapuram

PRISM is a general programming language intended for symbolic-statistical modeling. It is a new and unprecedented programming language with a learning ability for the statistical parameters embedded in programs. Its programming system, called the "PRISM system" here for short, is a powerful tool for building complex statistical models. The theoretical background of the PRISM system is the distribution semantics for parameterized logic programs and EM learning of their parameters from observations. The PRISM system comprises two subsystems, one for learning and the other for execution, just as a human being has an artery and a vein. The execution subsystem supports various probabilistic built-in predicates conforming to the distribution semantics and makes it possible to use a program as a random sampler.
I. Introduction
Speech and language processing comprises the fields of Computational Linguistics, Natural Language Processing, and Speech Recognition and Synthesis. The main concern of these fields is to solve the problems in the interaction between a machine and a human via language. Speech and language processing has been split into two paradigms, namely symbolic and stochastic. The most popular statistical models are HMMs (Hidden Markov Models), PCFGs (Probabilistic Context Free Grammars) and Bayesian networks. They are statistical in the sense that they involve the notion of probability or other concepts from statistical theory.
HMM is the popular statistical model used in speech recognition systems. An HMM is nothing more than a probabilistic function of a Markov process, whereas a Markov chain is a weighted automaton in which the input sequence uniquely determines the states. In an HMM the state sequence is unknown; we only have a probabilistic function of it. One widespread use of HMMs is in tagging. They are one of a class of models for which training is possible through the EM algorithm. An HMM is specified by a five-tuple (S, K, Π, A, B), where S and K are the set of states and the output alphabet, and Π, A, and B are the probabilities for the initial state, state transitions, and symbol emissions, respectively.
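As a toy illustration of the five-tuple (S, K, Π, A, B), the following minimal Python sketch (all probabilities invented for illustration) computes the likelihood of an observation sequence with the standard forward algorithm:

    # Toy HMM: states S = {Rain, Sun}, output alphabet K = {walk, shop}.
    Pi = {"Rain": 0.6, "Sun": 0.4}                 # initial state probabilities
    A = {"Rain": {"Rain": 0.7, "Sun": 0.3},        # state transition probabilities
         "Sun":  {"Rain": 0.4, "Sun": 0.6}}
    B = {"Rain": {"walk": 0.1, "shop": 0.9},       # symbol emission probabilities
         "Sun":  {"walk": 0.8, "shop": 0.2}}

    def forward(observations):
        """P(observations), summed over all hidden state sequences."""
        alpha = {s: Pi[s] * B[s][observations[0]] for s in Pi}
        for o in observations[1:]:
            alpha = {s: sum(alpha[r] * A[r][s] for r in alpha) * B[s][o]
                     for s in Pi}
        return sum(alpha.values())

    print(forward(["walk", "shop"]))   # 0.195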
PCFG, a Probabilistic Context Free Grammar, is the simplest probabilistic grammar: it is simply a CFG with probabilities added to the rules, indicating how likely different rewritings are. A PCFG is a 5-tuple (N, ∑, P, S, D), where N is a set of non-terminal symbols, ∑ is a set of terminal symbols, S is the start symbol, and D is a function assigning probabilities to each rule in P. A PCFG can be used to estimate a number of useful probabilities concerning a sentence and a parse tree, which helps in disambiguation. Another attribute is that it assigns a probability to the string of words constituting a sentence, which is important in language modelling.

Bayesian reasoning provides a probabilistic approach to inference. Bayes' theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h).

Bayes' Theorem:

    P(h|D) = P(D|h) P(h) / P(D)

A Bayesian belief network (or Bayesian network) describes the probability distribution governing a set of variables by specifying a set of conditional independence assumptions along with a set of conditional probabilities. In contrast to the naive Bayes classifier, which assumes that all the variables are conditionally independent given the value of the target variable, Bayesian belief networks allow stating conditional independence assumptions that apply to subsets of the variables. Thus, Bayesian belief networks provide an intermediate approach that is less constraining than the global assumption of conditional independence made by the naive Bayes classifier. They are an active focus of current research, and a variety of algorithms have been proposed for learning them and for using them for inference.
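To make the theorem concrete, here is a minimal Python sketch (with invented numbers) that computes the posterior for a toy hypothesis h ("message is spam") given data D ("message contains the word 'free'"):

    p_h = 0.2              # prior P(h)
    p_d_given_h = 0.6      # likelihood P(D|h)
    p_d_given_not_h = 0.1  # likelihood P(D|not h)

    # Total probability of the data: P(D) = P(D|h)P(h) + P(D|not h)P(not h)
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

    # Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
    p_h_given_d = p_d_given_h * p_h / p_d
    print(p_h_given_d)     # 0.12 / 0.20 = 0.6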
II. PRISM

PRISM is an acronym of Programming In Statistical Modelling. PRISM was developed by T. Sato and Y. Kameya. It is a programming language for symbolic-statistical modelling. PRISM employs a
proof-theoretic approach to learning. It conducts learning in two phases: the first phase searches for all the explanations for the observed data, and the second phase estimates the probability distributions by using the EM algorithm. PRISM is a probabilistic extension of Prolog, and PRISM programs are executed in a top-down, left-to-right manner just like Prolog. It is clear that PRISM = Prolog + probability + parameter learning. The most characteristic feature of PRISM is that it provides random switches to make probabilistic choices. A random switch has a name, a space of possible outcomes, and a probability distribution. The example given below uses just one random switch:

Example:

    target(direction/1).
    values(coin,[head,tail]).
    direction(D):-
        msw(coin,Face),
        (Face==head -> D=left ; D=right).

The predicate direction(D) indicates that a person decides the direction to go as D. The decision is made by tossing a coin: D is bound to left if the head is shown, and to right if the tail is shown. In this sense, we can say the predicate direction/1 is probabilistic. It is allowed to use disjunctions (;), the cut symbol (!) and if-then (->) statements as far as they work as expected according to the execution mechanism of the programming system. Besides the definitions of probabilistic predicates, we need to make some declarations. The clause values(coin,[head,tail]) declares the outcome space of a switch named coin, and the call msw(coin,Face) makes a probabilistic choice (Face will be bound to the result), just like a coin toss. On the other hand, the clause target(direction/1) declares that the observable event is represented by the predicate direction/1. This means that we can observe the direction a person goes.

PRISM is a powerful tool for building complex statistical models. Its programs are also able to learn from examples with the help of the EM learning algorithm. The most popular probabilistic modelling formalisms, such as HMMs, PCFGs and Bayesian networks, can be described using PRISM, so PRISM offers a common vehicle for these diverse research fields. We can train arbitrarily large programs, as we desire, in PRISM.

A PRISM program DB can be written as a set of definite clauses, DB = F ∪ R, where F is a set of facts (unit clauses) and R denotes a set of rules (non-unit clauses). What makes PRISM programs differ from usual logic programs is a basic joint probability distribution Pf given to F. One of the significant achievements of PRISM is the elimination of the need for deriving new EM algorithms for new applications.
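As a rough illustration of what the program above denotes, here is a Python sketch of the same generative story (Python rather than PRISM, with an assumed head probability of 0.7 that is not part of the article). Because direction is a deterministic function of the coin face, parameter learning here reduces to relative-frequency counting; PRISM's EM machinery matters when the probabilistic choices are hidden:

    import random

    def direction(p_head=0.7):
        """Mirror msw(coin,Face) followed by the if-then choice."""
        face = "head" if random.random() < p_head else "tail"
        return "left" if face == "head" else "right"

    # Use the model as a random sampler, as the execution subsystem does.
    observations = [direction() for _ in range(1000)]

    # Estimate the switch parameter from the observed directions.
    p_head_hat = observations.count("left") / len(observations)
    print(p_head_hat)   # close to 0.7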
III. PRISM programming system:
PRISM programming consists of three phases:
1. Programming 2. Learning 3. Execution
The following Figure 1 shows a PRISM programming system. The original program must be treated differently in the learning phase and the execution phase; a translator translates it into two specialized programs. A PRISM program comprises three parts: a model, a utility program and control declarations. The purpose of the model part, which is a logic program, is to generate possible proof trees of a target atom. The utility part also contains a logic program, one that makes use of a probability distribution PDB (an extension of Pf). The control declarations give the data needed for the learning phase. The structure of a PRISM program is shown in Figure 2. The structure of the language is simple; most instructions are processed independently. The power lies in a very rich set of instructions, functions and pre-defined variables. Data types are both simple and very powerful.
IV. Installing PRISM:

Windows: to install PRISM on Windows, the following steps are needed:
1. Download the package prism111_win.zip.
2. Unzip the downloaded package under C:\.
3. Append C:\prism\bin to the environment variable PATH so PRISM can be started in every working folder.

Linux:
1. Download the package prism111_linux.tar.gz.
2. Unpack the downloaded package using the tar command.
3. Append $HOME/prism/bin to the environment variable PATH so PRISM can be started in every working directory.

The package contains binaries for both 32-bit and 64-bit systems. The start-up commands prism, upprism and mpprism automatically choose a binary.

As far as PRISM is concerned, NLP is a promising area because of the obvious need for describing statistical correlations between syntactic structures and semantic structures. A cross-fertilization of computational power and learning power will give a new dimension to programming, which will be enabled through PRISM. Using probabilities can sometimes be more effective than using hard rules for handling many NLP problems. We hope that these statistical models, together with PRISM, will pave the way for success in many areas of NLP.

References:
1. Tom M. Mitchell, "Machine Learning".
2. Daniel Jurafsky and James H. Martin, "Speech and Language Processing".
3. Taisuke Sato and Yoshitaka Kameya, "PRISM: A Language for Symbolic-Statistical Modeling".
4. Taisuke Sato and Neng-Fa Zhou, "A New Perspective of Statistical Modeling by PRISM".
5. Henning Christiansen, "Logical-Statistical Models and Parameter Learning in the PRISM System".
Technology Development for Indian Languages (TDIL)

The Technology Development for Indian Languages (TDIL) Programme, initiated by the Department of Electronics & Information Technology (DeitY), Ministry of Communication & Information Technology (MC&IT), Govt. of India, has the objective of developing information processing tools and techniques to facilitate human-machine interaction without language barriers; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services. The Programme also promotes language technology standardization through active participation in international and national standardization bodies such as ISO, UNICODE, the World Wide Web Consortium (W3C) and BIS (Bureau of Indian Standards) to ensure adequate representation of Indian languages in existing and future language technology standards. Visit: http://tdil.mit.gov.in/
Statistical NLP Toolkits for Various Computational Linguistics Problems

Sreejith C
M.Tech Computational Linguistics, GEC Sreekrishnapuram
Hi! In this article I would like to introduce some of the useful and widely used toolkits available for various computational linguistics problems. Most of them are open source and can be freely downloaded from the internet. The scope of applications of natural language processing is enormous, spanning several areas such as:

- Semantic Search
- Morphology, Syntax, Named Entity Recognition
- Opinion, Emotions, Textual Entailment
- Text and Speech Generation
- Machine Translation
- Information Retrieval and Text Clustering
- Educational Applications

In the CLEAR March 2013 edition Mr. Manu Madhavan introduced various ontology-based tools such as Protégé, Jena, Swoop etc. In this section I would like to introduce some other basic tools which will be useful for various language processing tasks, especially for text mining.

I. Hidden Markov Model Toolkit (HTK)

The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research, although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions, and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples.
HTK is available for free download, but you must first agree to its license. You must then register for a username and password which will allow you to download the HTK Book and source code. Registration is free but does require a valid email address; your password for site access will be sent to this address.

Ref: http://htk.eng.cam.ac.uk/

II. The Stanford Natural Language Processing Group

The Stanford NLP Group makes parts of its Natural Language Processing software available to everyone. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs. All these software distributions are open source, licensed under the GNU General Public License (v2 or later). Note that this is the full GPL, which allows many free uses, but does not allow its incorporation into any type of distributed proprietary software, even in part or in translation.

1. Stanford CoreNLP: an integrated suite of natural language processing tools for English in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference.

2. Stanford Parser: implementations of probabilistic natural language parsers in Java, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser.

3. Stanford POS Tagger: a maximum-entropy (CMM) part-of-speech (POS) tagger for English, Arabic, Chinese, French, and German, in Java.

4. Stanford Named Entity Recognizer: a Conditional Random Field sequence model, together with well-engineered features for Named Entity Recognition in English and German.

5. Stanford Word Segmenter: a CRF-based word segmenter in Java. Supports Arabic and Chinese.

III. Weka: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a
dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License.

Ref: http://www.cs.waikato.ac.nz/ml/weka/
IV. WordFreak

WordFreak is a Java-based linguistic annotation tool designed to support human and automatic annotation of linguistic data, as well as to employ active learning for human correction of automatically annotated data. For the latest news about WordFreak and to participate in discussions, check out WordFreak's Sourceforge project page.

Ref: http://wordfreak.sourceforge.net/
V. NLTK - Natural Language Toolkit

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. NLTK is a free, open source, community-driven project. NLTK has been called "a wonderful tool for teaching, and working in, computational linguistics using Python," and "an amazing library to play with natural language."

Ref: http://nltk.org/
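As a quick taste of NLTK (a sketch assuming a standard installation; the tokenizer and tagger models must be fetched once via nltk.download(), and exact model names vary across NLTK versions):

    import nltk

    text = "CLEAR magazine turns one year old this September."
    tokens = nltk.word_tokenize(text)   # ['CLEAR', 'magazine', 'turns', ...]
    tagged = nltk.pos_tag(tokens)       # [('CLEAR', 'NNP'), ('magazine', 'NN'), ...]
    print(tagged)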
VI. Ngram Statistics Package (NSP)

NSP allows you to identify word and character N-grams that appear in large corpora using standard tests of association such as Fisher's exact test, the log likelihood ratio, Pearson's chi-squared test, the Dice Coefficient, etc. NSP has been designed to allow a user to add their own tests with minimal effort.

Ref: http://ngram.sourceforge.net/
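NSP itself is a Perl package, but the association measures it implements are easy to express. A minimal Python sketch of the Dice coefficient for a candidate bigram, with invented counts:

    def dice(count_xy, count_x, count_y):
        """Dice coefficient: 2*c(xy) / (c(x) + c(y))."""
        return 2.0 * count_xy / (count_x + count_y)

    # Toy counts: "New" occurs 150 times, "York" 80 times, "New York" 60 times.
    print(dice(60, 150, 80))   # about 0.52 -> a strongly associated pair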
VII. MALLET

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics.

Ref: http://mallet.cs.umass.edu/
VIII. Emdros

Emdros is a text database engine for analyzed or annotated text. It is applicable especially in linguistics, corpus linguistics, and computational linguistics. Emdros is Open Source and Free Software (Libre Software).

Ref: http://emdros.org/

IX. Ellogon

Ellogon is an effort that tries to keep all the excitement and reduce all the complexity. Ellogon is different from other similar software. First of all, it respects the user's time by offering a simple and user-friendly graphical interface. But beneath this simple appearance a powerful engine is hidden, one that has proved able to support a wide range of uses, from simple research prototypes to commercial applications. Ellogon is licensed under the GNU LGPL license, is easy to install and administer, and is reliable. Running under all major operating systems, Ellogon offers a comfortable environment for computational linguists, language engineers or plain users.

Ref: http://www.ellogon.org/

These are only some of the natural language processing toolkits available over the internet. There are several other tools; have a look at all of them and get familiarized. I hope that this article will help you to start your "experimentation" with languages. Happy coding...
Automatic Headline Generation Using Context Free Grammars

Krishnaprasad P, Student, B.Tech (IT), Govt. Engineering College Sreekrishnapuram
Vinayachandran K K, Assistant Professor (IT), Govt. Engineering College Sreekrishnapuram
This article presents a novel way of generating headlines by exploiting the benefits of context free grammars. The system is based on summarizing the given document in order to get a condensed form of the text, from which the content words are identified. The input for sentence generation is obtained by separating named entities, nouns, verbs etc. from the content words. The system generates the summary of the text document using a word-frequency based scoring technique. For generating the title, the article presents a context free grammar which produces a suitable sentence from the given content words. Experiments showed that the titles generated are effective, and the suggested titles are really helpful in identifying the important documents.
I. Introduction
Natural language processing has received a great deal of attention in recent research because of its wide applicability. Research on automatic text summarization provides the basis for research on headline generation. The rapid growth of the Internet has resulted in enormous amounts of information that has become increasingly more difficult to access efficiently. The ability to summarize information automatically and present results to the end user in a compressed, yet complete form would help to solve this problem. A headline of a text, especially an article, is a succinct representation of the relevant points of the input text. It differs from the task of producing abstracts in the size of the generated text, and focuses on compressing the output. Headlines are terse while abstracts are expressed using relatively more words. While headlines focus on pointing out the most relevant theme expressed in the input text, abstracts summarize the important points. This makes both headline generation and summarization extensively valuable.
Early approaches to the problem of generating headlines for documents and to text summarization were purely statistical and extraction based. Most of the summarization work done to date is based on extraction of sentences from the original document. The sentence extraction techniques compute a score for each sentence based on features such as the position of the sentence in the document, word or phrase frequency, and key phrases (terms which indicate the importance of the sentence towards the summary). There have been some attempts to use machine learning (to identify important features) and natural language processing (to identify key passages, or to use relationships between words rather than a bag of words). The vector representation model is also used in text summarization techniques.

Headlines are commonly associated with news articles but have a wide range of applications. Application areas of headline generation range from generating the table of contents for a document to providing support for interactive query refinement in search engines. Headlines extracted from search result web pages can be used to augment a user search query; the resultant query can be used to further re-rank and improve upon the search results. This approach of augmenting a user query with key words extracted from text is being increasingly used in contextual text search and Information Retrieval. Automatic headline generation tries to automate the process of providing more relevant or reflective insight into the input text rather than producing catchy lines. Automating in this context has to involve some form of learning rather than an algorithmic approach, given the potentially infinite stretch of natural language text. Many machine learning techniques have been explored, involving varying degrees of use of natural language understanding techniques.

Context Free Grammars are relatively recent techniques used in natural language processing. In this work we are trying to make a natural headline using a context free grammar. The advantage of using CFGs is that they do not require any training data. We also present a summarization technique based on a sentence scoring method as a part of content word extraction.

II. Context Free Grammar Model for Headline Generation

Here we present a model for generating a headline for a text document using context free grammars. Our model is composed of two parts: the first part is the
summarization part, and the second part is headline synthesis. The headline synthesis depends purely on the input text document. Natural language sentence generation is the heart of the algorithm.

A. Summarization

The summarization system has both a text analysis component and a summary generation component. The text analysis component is used to identify the features associated with each sentence. Before the extraction process, text normalization is performed (text normalization involves splitting the text into sentences). After text normalization, the normalized text is passed through a feature extraction module. Feature extraction includes extracting the features associated with the sentences and the features associated with words, such as named entities, word frequency, characters per word etc. Later, in order to summarize, the system calculates a score for each sentence based on the features identified in the previous step. Sentence refinement is done on the sentences with high scores, and the resulting sentences are selected for the summary in the same order as they were found in the input text document. The various steps in summarization can be summarized as follows:

1. Sentence marking: this module divides the document into sentences. It appears that using end-of-sentence punctuation marks, such as periods, question marks, and exclamation points, is sufficient for marking sentence boundaries. It should be noted that the exclamation point and the question mark are somewhat less ambiguous, whereas periods can appear in non-standard words like web URLs, emails etc. (a small sketch of this step is given after these steps).

2. Feature extraction: the system extracts both sentence level and word level features. We are actually interested only in word level features because we do not require a high quality summary for the title generation process; our aim is to supply a brief summary as input to the headline synthesis part. For picking out the best sentences from the given document we follow a sentence scoring technique based on word frequency and the average number of characters per sentence.
3. Summary generation: summary generation includes tasks such as calculating the score for each sentence, selecting the sentences with high scores, and refining the selected collection of sentences.

4. Sentence ranking: the summarization system follows a simple but efficient sentence ranking technique based on the frequency of each word in the sentence and the average number of characters in each word. The mathematical model of sentence ranking is discussed later in this paper.

5. Sentence selection: after the sentences are scored, we need to select the sentences that make up a good summary. One strategy is to pick the top N sentences for the summary, but this creates a problem of coherence. The selection of sentences depends upon the type of summary requested: the sentences are selected based on the percentage of output text required with respect to the input document.
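A minimal Python sketch of the sentence marking step (step 1 above), using a deliberately naive pattern; a real system needs extra rules for the periods that appear in URLs, emails and abbreviations:

    import re

    def mark_sentences(text):
        # Split after '.', '?' or '!' when followed by whitespace.
        return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]

    doc = "NLP is fun. Visit www.example.com for more! Really? Yes."
    print(mark_sentences(doc))
    # ['NLP is fun.', 'Visit www.example.com for more!', 'Really?', 'Yes.']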
B. Headline synthesis

Headline synthesis involves generating a suitable headline for the given input text file, based on the content words extracted from the document. It is comprised of three components:

1. Extracting content words: content words are the words which represent the overall text. Identification of content words has special importance, because the quality of the title generated depends on the exact identification of the content words. We can follow any prominent method for extracting the content words.

2. Identification of elements for title generation: analyzing headlines, it is noted that a headline is formed by named entities and/or frequent nouns or verbs. Out of the selected content words we are actually interested only in the nouns, verbs, and named entities.

3. Generation of headline:
The heart of this paper lies in the fact that context free grammars are used for generating the title from the identified elements. The effectiveness of the title depends on how well a natural language sentence is generated from the identified headline elements and the refinement that follows. It is worth noting that the model depends entirely on the given input text file. The key advantage of this headline synthesis model is that the system doesn't need any separate learning process.

III. Mathematical Modeling

A. HS Algorithm

Through this we implement a new algorithm named the HS (Headline Synthesis) algorithm. We have already mentioned that the context free grammar based algorithm is the basis for headline synthesis. The algorithm can be summarized as: summarize the given text, then generate the headline from it. The abstract view of the algorithm is given below.

A.1 SSS Algorithm

Following the discussion of the summarization system, we have to implement the Sentence Scoring Summarization algorithm as the mathematical basis for the sentence scoring summarization method. Since the key part of summarization lies in sentence scoring and sentence selection followed by coherence, the quality of the generated summary can be ensured by the scoring function. The sentence scoring can be performed using:

Term frequency. It takes into account only the frequency of a term inside the document:

    TF(i,j) = number of occurrences of term i in document j.

Document length. It is logical to assume that terms appear more frequently in bigger files, so if a term is relatively more frequent in a short file than in a big one, then it is more important. To incorporate document length into the weighting formula we define:

    DL(j) = total number of term occurrences in document j.

This can be generalized using the average length of a document:

    NDL(j) = DL(j) / (average DL over all documents).

A.2 CWE Algorithm

The CWE (Content Word Extraction) algorithm extracts the content words of the text document. Once the content words are separated, we make dictionaries of the nouns, verbs and named entities.
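Before moving on to the headline grammar, here is a minimal Python sketch of the frequency-based sentence scoring described in A.1 (an illustrative reading of the scheme, not the authors' exact formula):

    from collections import Counter

    def score_sentences(sentences):
        """Score sentences by document-level term frequency of their words,
        normalized by sentence length."""
        words = [w.lower() for s in sentences for w in s.split()]
        tf = Counter(words)   # TF: occurrences of each term in the document
        return [sum(tf[w.lower()] for w in s.split()) / len(s.split())
                for s in sentences]

    sents = ["the cat sat", "the cat ran far", "dogs bark"]
    print(score_sentences(sents))   # higher score = more representative sentence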
A.3 HS-CFGs Algorithm

The heart of the Headline Synthesis algorithm lies in the HS-CFGs (Headline Synthesis using Context Free Grammar) algorithm.

Input: let l1, l2 and l3 be the lengths of the dictionaries representing nouns, verbs and named entities respectively. Let the nouns, verbs and named entities be already extracted from the content words, and let them be stored in the dictionaries noun, verb and performers respectively.

Output: a suitable headline for the text input.

Method:
[1] If l2 > 0, add production S -> NP, VP
[2] else add production S -> NP.
[3] Set NP -> 'the', N.
[4] If l3 > 0, do steps 5 to 9:
[5] repeat steps 6 to 9:
[6] add production N -> item
[7] in performers list
[9] until l3 > 0;
[10] else do steps 11 to 14:
[11] repeat steps 12 to 14:
[12] add production N -> item
[13] in noun
[14] until l1 > 0.
[15] If l2 > 0, do step 16:
[16] add production VP -> V, 'the' N.
[17] Repeat steps 18 to 19:
[18] add production V -> item in verb
[19] until l3 > 0.
[20] Initialize rules.
[21] Set expansion list; do steps 22 to 25:
[22] if the starting rule is in the set of rules, then
[23] grab one possible expansion;
[24] for each element in the random expansion, do
[25] expand the element;
[26] else do steps 27 to 28:
[27] if the rule wasn't found, then
[28] it is a terminal: simply append the string to the expansion.
[29] For every word in the expansion,
[30] if the word is repeating, then
[31] eliminate the repeated word.
[32] Output the headline.

IV. Manual Evaluation Technique

Manual evaluation is a simple set-up in which the machine generated headline is evaluated manually. For a number of documents the machine generated headline is compared against a human generated headline. A suitable score (mark) is assigned to both the human generated and the machine generated headlines. The quality of the headline generation system can then be analyzed using a suitable graphical method (bar graph or line graph).
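Returning to the HS-CFGs procedure, the following minimal Python sketch illustrates the same expansion idea on a hypothetical toy grammar and word lists (not the authors' implementation):

    import random

    nouns = ["summit", "minister"]        # stand-ins for extracted content words
    verbs = ["opens", "visits"]

    grammar = {
        "S":  [["NP", "VP"]],
        "NP": [["the", "N"]],
        "VP": [["V", "the", "N"]],
        "N":  [[n] for n in nouns],
        "V":  [[v] for v in verbs],
    }

    def expand(symbol):
        """Recursively expand a symbol; symbols not in the grammar are terminals."""
        if symbol not in grammar:
            return [symbol]
        expansion = random.choice(grammar[symbol])   # grab one possible expansion
        return [word for child in expansion for word in expand(child)]

    print(" ".join(expand("S")))   # e.g. "the minister visits the summit"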
V. Evaluation Using Vector Space Analysis

A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term; if a document contains that term, then the value within the vector is greater than zero. In this method both the machine generated and the human generated headlines are converted into document vectors. A plot (Figure 1: Cosine Similarity) is performed using:

    cos A = (t1 . t2) / (||t1|| ||t2||)

where t1 corresponds to the document vector of the machine generated headline and t2 is that of the human synthesized headline. The graph obtained is as shown in Figure 1.

The size of the document also has an impact on the generated headline. The machine generated title also depends on how effectively the document is summarized. The extraction of content words can influence the headline as well, and the grammar used for sentence generation can influence the accuracy of the headline. The advantage of this headline generation system is that it does not require any machine learning, and the generated title depends entirely on the input text file. Improved methods for keyword extraction and a novel way of generating accurate sentences from input words will make the system more powerful.
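A small Python sketch of this cosine comparison (toy term-count vectors, invented for illustration):

    import math

    def cosine(t1, t2):
        """cos A = (t1 . t2) / (||t1|| ||t2||) for term-count vectors."""
        dot = sum(a * b for a, b in zip(t1, t2))
        norms = math.sqrt(sum(a * a for a in t1)) * math.sqrt(sum(b * b for b in t2))
        return dot / norms

    machine = [1, 1, 0, 2]   # term counts of the machine generated headline
    human = [1, 0, 1, 2]     # term counts of the human written headline
    print(round(cosine(machine, human), 3))   # 0.833 -> fairly similar headlines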
VI. Conclusion

The headline generation system is very successful for scientific and technical documents, but less powerful for poetic language. The reason is that poetic language contains fewer content words than a scientific document.

References

[1] Akshay Kishore Gattani, "Automated Natural Language Headline Generation Using Discriminative Machine Learning Models", B.E. (Honors), Birla Institute of Technology and Science, Pilani (India), 2004.
[2] Joel Larocca Neto, Alex A. Freitas, Celso A. A. Kaestner, "Automatic Text Summarization using a Machine Learning Approach", Pontifical Catholic University of Parana (PUCPR), Rua Imaculada Conceicao, 1155.
[3] Kamal Sarkar, "Bengali Text Summarization by Sentence Extraction", Computer Science and Engineering Department, Jadavpur University, Kolkata 700 032, India.
[4] Oi Mean Foong, Alan Oxley, Suziah Sulaiman, "Challenges and Trends of Automatic Text Summarization", Universiti Teknologi Petronas, Malaysia.
[5] Amit Kumar Mondal and Dipak Kumar Maji, "Improved Algorithms for Keyword Extraction and Headline Generation from Unstructured Text", Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India 208016.
[6] Steven Bird, Ewan Klein, and Edward Loper, "Natural Language Processing with Python".
[7] Farshad Kyoomarsi, Hamid Khosravi, Esfandiar Eslami, and Pooya Khosravyan Dehkordy, "Optimizing Machine Learning Approach Based on Fuzzy Logic in Text Summarization", Islamic Azad University (Shahrekord branch); International Center for Science, High Technology and Environmental Sciences; Shahid Bahonar University of Kerman; Center of Excellence for Fuzzy Systems and Applications.
[8] Jagadeesh J, Prasad Pingali, Vasudeva Varma, "Sentence Extraction Based Single Document Summarization", Workshop on Document Summarization, 19th and 20th March, 2005, IIIT Allahabad, Report No: IIIT/TR/2008/97.
[9] Kamal Sarkar, Mita Nasipuri, Suranjan Ghose, "Using Machine Learning for Medical Document Summarization", Computer Science and Engineering Department, Jadavpur University, Kolkata 700 032, India.
NLP and Semantic Web for Competitive Intelligence

Manu Madhavan
Asst. Professor, SIMAT, Vavannur

Although Natural Language Processing and Semantic Web technologies are both "Semantic Technologies", they are, in a way, opposites. NLP tools focus on unstructured information, such as long-form documents, emails, and web pages, while Semantic Web tools typically deal with more structured information on a much more granular level. However, there are many important problems that span the two worlds of structured and unstructured information, where the combination of NLP and Semantic Web tools is highly complementary. In fact, the flexibility of the Semantic Web's data model is a particularly good fit for problems involving lots of unstructured information, making the combination particularly powerful.
I. Competitive Intelligence

Competitive Intelligence (CI) is the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors and any aspect of the environment needed to support executives and managers in making strategic decisions for an organization. Experts also call this process early signal analysis. This definition focuses attention on the difference between the dissemination of widely available factual information (such as market statistics, financial reports, newspaper clippings) performed by functions such as libraries and information centers, and competitive intelligence, which is a perspective on developments and events aimed at yielding a competitive edge [1].
II. The Scenario

There has been huge computer technology development and an accelerated growth in the quantity of information produced in the last two decades of the 20th century. But how do companies use these published data, mainly in the digital media? What do they use to increase their competitive advantages? It is true that most companies recognize information as an asset and believe in its value for strategic planning. The big difficulty, however, is to deal with information in a changing environment. The temporal character of information is becoming more and more critical: information valid today may not be valid tomorrow anymore. Data are not static blocks to become a building block of a temporal reality. Information analysis is no longer an action; it has become a process [2].

Due to the lack of structure in news clippings, it is very difficult for a pharmaceutical competitive intelligence officer to get answers to questions such as, "Which companies have published information in the last 6 months referencing compounds that target a specific pathway that we're targeting this year?" At the moment, the most common approach to this problem is for certain people to read thousands of articles and keep this information in their heads, or in workbooks like Excel, or, more likely, nowhere at all. To make a defensible and actionable strategy it is useful to perform Influence Analysis and Network Analysis, which can form the kernel of a competitive intelligence analysis strategy. The data required for analysis must be obtained by identifying and extracting target attribute values in unstructured and often very large (multi-terabyte or petabyte) data stores. This necessitates a scalable infrastructure, distributed parallel computing capability, and fit-for-use natural language processing algorithms. Taking advantage of the large amount of information on the World Wide Web, a methodology is proposed to develop applications that gather, filter and analyze web data and turn it into usable intelligence (WeCIM). In order to enhance information search quality, the use of ontologies is proposed, which allow computers to "understand" particular knowledge domains.

III. Combining NLP and Semantic Web Functionalities

Sean Martin, CTO of Cambridge Semantics, commented that, although Natural Language Processing and Semantic Web technologies are both "Semantic Technologies", they are, in a way, opposites: "NLP tools focus on unstructured information, such as long-form documents, emails, and web pages, while Semantic Web tools typically deal with more structured information on a much more granular level."

So how can NLP technologies realistically be used in conjunction with the Semantic Web? The answer is that the combination can be utilized in any application where you are contending with a large amount of unstructured information, particularly if you are also dealing with related, structured information stored in conventional databases. Clearly, then, the primary pattern is to use NLP to extract structured data from text-based documents. These data are then linked via Semantic technologies to pre-existing data located in databases and elsewhere, thus bridging the gap between documents and formal, structured data [2].

IV. Text Mining Functionalities for CI

The first supporting functionality to the analyst comes from the limited daily reading capacity of a human being [3]. The Filtering functionality has been created with the purpose of allowing pre-selection of reading contents, assuming that the important information is, very likely, within the filtered subset. The technological Event Alert functionality was developed with the objective of advising the analyst as soon as possible of some pre-specified events important to his business. The third and last functionality refers to a Semantic Search tool that becomes necessary for ad-hoc demanded information. This demand comes from the fact that both Filtering and Event Alert are planned and predefined. The objective of this tool is, therefore, to allow the analyst to reach the information required in a particular instance, as soon as possible.

V. Conclusion

The combination of NLP and Semantic Web technology enables the competitive intelligence officer to ask complicated questions and actually get reasonable answers in return. By their very nature, NLP technologies can extract a wide variety of information, and Semantic Web technologies are by their very nature created to store such varied and changing data. In cases such as this, a fixed relational model of data storage is clearly inadequate.

A web of meaningful data

"The Web was designed as an information space, with the goal that it should be useful not only for human-human communication, but also that machines would be able to participate and help. One of the major obstacles to this has been the fact that most information on the Web is designed for human consumption, and even if it was derived from a database with well defined meanings (in at least some terms) for its columns, the structure of the data is not evident to a robot browsing the web. Leaving aside the artificial intelligence problem of training machines to behave like people, the Semantic Web approach instead develops languages for expressing information in a machine processable form." [Tim Berners-Lee, Semantic Web Road Map, Sept. 1998]

References

1. http://www.cambridgesemantics.com/semantic-university/nlp-and-the-semantic-web
2. Christian Aranha, Emmanuel Passos, "Automatic NLP for Competitive Intelligence", Pontifical Catholic University of Rio de Janeiro, Brazil, 2008.
3. Juan Antonio, "Semantic Web meets Competitive Intelligence", Master Thesis, University of Granada, 2009.
ScalaNLP

Robert Jesuraj
M.Tech Computational Linguistics

Breeze is the new scientific computing library for Scala, supporting linear algebra, numerics, statistics, machine learning, and natural language processing. Breeze merges two formerly separate projects: ScalaNLP and Scalala. Scalala provides linear algebra and numerics, while ScalaNLP provides the rest. The Scalala portions are largely rewritten, with high performance being the priority.
I. An Introduction to the Scala Programming Language

Scala is an object-functional programming and scripting language for general software applications, statically typed, designed to concisely express solutions in an elegant, type-safe and lightweight (low-ceremony) manner. Scala includes full support for functional programming (including currying, pattern matching, algebraic data types, lazy evaluation, tail recursion, immutability, etc.). It cleans up what are often considered to have been poor design decisions in Java (e.g. type erasure, checked exceptions, the non-unified type system) and adds a number of other features designed to allow cleaner, more concise and more expressive code to be written.

Scala is intended to be compiled to Java bytecode (executable on the JVM) or .NET. Like Java, Scala is statically typed and object-oriented, uses curly-brace syntax reminiscent of C, and compiles code into Java bytecode, allowing Scala code to be run on the JVM and permitting Java libraries to be freely called from Scala (and vice versa) without the need for a glue layer in between. Compared with Java, Scala adds many features of functional programming languages like Scheme, Standard ML and Haskell, including anonymous functions, type inference, list comprehensions (known in Scala as "for-comprehensions"), lazy initialization, extensive language and library support for side-effect-free code, pattern matching, case classes, delimited continuations, higher-order types, and much better support for covariance and contravariance than in Java.

The name Scala is a blend of "scalable" and "language", signifying that it is designed to grow with the demands of its users. James Strachan, the creator of Groovy, described Scala as a possible successor to Java.

II. ScalaNLP: Natural Language Processing and Machine Learning

ScalaNLP is a suite of machine learning and numerical computing libraries. ScalaNLP is the umbrella project for Breeze and Epic. Breeze is a set of libraries for machine learning and numerical computing, and consists of five parts:

- breeze-math contains high-performance linear algebra and numerics.
- breeze-process contains tokenization and other NLP-related routines.
- breeze-learn contains machine learning and optimization routines.
- breeze-viz contains plotting and visualization routines.
- breeze-core contains some basic data structures and configuration.

Breeze also provides a fairly large number of built-in probability distributions. These come with access to probability mass functions (for discrete distributions) or probability density functions (for continuous distributions). Many distributions also have methods giving the mean and the variance.

Breeze's optimization package includes several convex optimization routines and a simple linear program solver. Convex optimization routines typically take a DiffFunction[T], which is a Function1 extended to have a gradient method that returns the gradient at a particular point. Most routines will require a breeze.linalg-enabled type: something like a Vector or a Counter.

Breeze data: most of the classifiers in breeze-learn expect Examples for training, which are simply traits that have a label, some features, and an id. Example is generic about what those types are. Observation is Example's parent type, and it differs only in not having a label. There are also some routines for reading in different standard formats of datasets.

Breeze classify: Breeze provides four classifiers: a standard Naive Bayes classifier, an SVM, a Perceptron, and a Logistic Classifier (also known as a softmax or maximum entropy classifier). All classifiers (except Naive Bayes) have a Trainer class in their companion object, which can be used to train up a classifier from breeze.data.Examples. This classifier can then be applied to new observations.

breeze.text.tokenize: this package provides methods for tokenizing text and turning it into more useful forms. For example, it provides routines for segmenting sentences, tokenizing text into a form expected by most parsers, and stemming words.

Breeze util: util contains a number of useful things. Most notable is Index, which specifies a mapping from objects to integers. In NLP, we often need to map structured objects (strings, say) to unique integers for efficiency's sake. This class, along with Encoder, allows for that.

References
[1] https://github.com/scalanlp/breeze
[2] http://www.scala-lang.org/
Sandhan

'Sandhan' (http://www.tdil-dc.in/sandhan), a monolingual search engine for five Indian languages (Bangla, Hindi, Marathi, Tamil and Telugu), was released on September 20, 2012, at Electronics Niketan, New Delhi. The project Sandhan was taken up under the umbrella of the Technology Development in Indian Languages (TDIL) programme of the Ministry of Communication and Information Technology, and executed by institutions such as AUKBC, AUCEG, CDAC Noida (co-coordination), CDAC Pune, DA-IICT Gandhinagar, Gauhati University, IIT Bombay (coordination), IIT Kharagpur, IIIT Bhubaneshwar, IIIT Hyderabad, ISI Kolkata, and Jadavpur University.
Adieu to First Batch of SIMPLE Groups

Reshma O.K.
M.Tech Computational Linguistics, GEC Sreekrishnapuram

"We will never forget them nor the last time we saw them this morning as they prepared for their journey and waved goodbye and 'slipped the surly bonds of earth to touch the face of God.'" - Ronald Reagan
The pioneer batch (2011-2013) of SIMPLE Groups successfully completed their course in Computational Linguistics. The students of the 2012-14 batch organized a farewell party for them on 31st July 2013. They were a real role model for all, in all aspects. It is with pride we say that the SIMPLE Groups is the result of their intense hard work, determination and enthusiasm. Though it is sad that they are not with us when the CLEAR Birthday issue is presented, it is a privilege to sustain their fruitful effort. Apart from their team effort, each one of them put in their own effort to gain attention for the SIMPLE Groups and M.Tech CL at GEC Sreekrishnapuram. They made it a point to participate in various conferences, seminars and competitions, to share and showcase their understanding and insights in the area of Computational Linguistics. We are glad to say that Robert Jesuraj from the M.Tech CL (2011-2013) batch was awarded the esteemed Garuda Challenge 2013 award by CDAC for GRID enthusiasts.

Robert Jesuraj of SIMPLE Groups won the first prize in GARUDA Challenge 2012
Some of them attended the National Seminar on the Relevance of Malayalam in the Field of Information Technology, jointly conducted by the Dept. of Linguistics, Kerala University and KSCSTE on 1-2 November 2012. There they presented papers on Malayalam computing, thus taking part in promoting the use of Malayalam for the dissemination of information to the common people. Christopher, Sibi, Divya, Rinju, Sumi, Divya Das, Radhika, Ayisha, Renuka, Saani, Athira, Ancy and Pragisha presented their papers.

Pragisha presenting at the National Seminar @ Kerala University

Manu Madhavan, Mujeeb Rehman, Rechitha and Robert Jesuraj of M.Tech Computational Linguistics attended the 4th IASNLP workshop at IIIT-H from 5th to 14th July 2012. This gave a good chance to meet a lot of researchers, academicians and industrialists in NLP, and to introduce the course to them.

Christopher Augustine and Manu Madhavan presented their papers on collocation generation and Karaka relations respectively at the National Conference on Indian Language Computing (NCILC 2012), conducted by CUSAT on February 19 and 20. The NCILC 2013, conducted by the Dept. of Computer Application, CUSAT, was held on 19-20 January 2013. Manu Madhavan, Mujeeb Rehman and Robert Jesuraj represented SIMPLE Groups and presented their papers related to Malayalam prosodic patterns and POS
tagging respectively.

A one day workshop on Malayalam Computing, jointly conducted by Thunchath Ezhuthachan Malayalam University and the Kerala State IT Mission, was held on February 8th 2013. Radhika K.T. represented SIMPLE Groups and presented a paper on Conceptual Indexing and a Compound Word Splitter in Malayalam.

The Networking department of Amrita Viswa Vidhyapeedom organized a one week workshop on Computational Linguistics and Machine Translation (CLMT) from English to Indian Languages. Manu Madhavan, Nibeesh, Robert Jesuraj and Sreejith C represented SIMPLE Groups at CLMT.

The International Conference on Mathematical Modelling in Computer Management and Medical Sciences 2013 (ICMCMM 2013) was held on 13-15 June 2013 at Thiruvalla. Divya S from SIMPLE Groups presented a paper on News Summarization.

A paper on a Grapheme to Phoneme Converter for Malayalam by Rechitha and Sumi was published in the July issue of the International Journal on Computational Linguistics and Natural Language Processing (IJCLNLP). The paper on Dysarthric Speech Enhancement by Divya Das of SIMPLE Groups was published in Volume 2, Issue 4 (July 2013) of the International Journal of Latest Trends in Engineering and Technology (IJLTET). Thus each one of them made an immense effort in making the SIMPLE Groups popular.

The farewell function commenced in the forenoon session with a bundle of games and entertainment. Later, the faculty members, including the HOD, joined the students for a sumptuous lunch. The afternoon session commenced with a meeting, which was inaugurated by Dr. P.C. Reghu Raj, the Head of the Computer Science department, with an inaugural speech. During his speech, he shared some of his memories and appreciated them for their remarkable achievements. He concluded by wishing them good luck in all their future endeavors. After that, the senior faculty members Asst. Prof. C. Nazeer and Asst. Prof.
R. Binu, followed by other staff members from the CS department, shared their memories of the first batch. The faculty also thanked them for all the assistance they had provided while conducting various events. There was a mixture of emotions among the audience when each one from the senior batch started sharing their experiences. Everyone thanked the HOD and the other faculty members for their endearing efforts in helping them during the course. It was a memorable and eventful day for the SIMPLE Groups, with great enthusiasm and a lot of nostalgia.
M.Tech Computational Linguistics 2011-2013

Ancy K Sunny, Athira P M, Christopher Augustine, Manu Madhavan, Divya Das, Mujeeb Rehman O, Ayisha Noori V K, Divya S, Pragisha K, Radhika K T, Rechitha C R, Rinju O R, Robert Jesuraj K, Sibi S, Renuka Babu T, Saani H, Sumi S Nair
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups - Students' Innovations in Morphology, Phonology and Language Engineering
Article Invitation for CLEAR Dec 2013

We are inviting thought-provoking articles, interesting dialogues and healthy debates on the multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in Dec 2013. The articles may be sent to the Editor on or before 10th Dec, 2013 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
Representative, SIMPLE Groups
Hello World,

While bringing out this 5th issue of CLEAR, there are some coincidences worth noting. This September issue is a BIRTHDAY issue, celebrating CLEAR's first birthday. It is also an edition which witnessed both the admission and the farewell of two batches in Computational Linguistics; the CLEAR team wishes them the very best in their life ahead. Now, let me share one of my experiences from a workshop at Sri Krishna College. The majority of the participants were research scholars, mostly in the areas of Web Content Mining and the Semantic Web. This was an eye-opening incident that showed the umpteen number of people interested in this area. The tremendous development in computer technology and the rapid growth of information on the web have made handling information difficult, so the need for methodologies to analyze and filter data is at its peak. The Semantic Web, using ontologies, helps to overcome this hindrance, and it opens up a lot of opportunities for exploration in this area. SIMPLE Groups welcomes more aspirants to this area.
Wishing you all a Happy and Prosperous ONAM!
Reshma O.K.