CLEAR September 2015, Volume 4, Issue 3
CLEAR Journal (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in | simplequest.in@gmail.com

Chief Editor: Dr. Ajeesh Ramanujan, Asst. Professor, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad
Editors: Raseek C, Pelja Paul N, Sini G M, Revathy P
Cover page and Layout: Anoop R, Soorya K

Contents
Editorial
News & Updates
Events
News Headline Generation Based on Human Intelligence Learning (Krishnapriya P S)
Sentiment Prediction of Movie Performance (Rahul Kothanath)
Smart Meter (Divya Jose)
Deep Learning in NLP (Athira M, Bhavya K)
Information Extraction from Big Data (Soorya K, Sreetha S)
M.Tech Computational Linguistics Batch Project Abstracts (2013-2015)
CLEAR Dec 2015 Invitation
Last word
Dear Readers!

Greetings! We are extremely happy to bring out the September edition of CLEAR, the on-line magazine of the SIMPLE association, an initiative of the MTech students of the Department of Computer Science and Engineering, Government Engineering College, Sreekrishnapuram. We will try to include articles from other departments also in the future editions, keeping in mind the broad objectives of the journal. In this issue, we are pleased to present five papers, including three application related papers from NLP and two survey papers, one on deep learning in NLP and another on information extraction from Big Data. These research papers appeared in peer-reviewed journals or were presented in national or international conferences during this period. The Editorial Board appreciates the time and effort that have been devoted by the different authors and would like to thank them all. Finally, we hope you enjoy this edition too. We hope that our magazine will have a long and successful life with the help of our readers and contributors. As always, suggestions and criticisms towards improving the magazine content are welcome.
With Best Wishes, Ajeesh Ramanujan (Chief Editor)
Valediction to the Third Batch of SIMPLE Group

It was a moment of great pleasure when the third batch of Computational Linguistics (2013-2015) successfully stepped out from the world of astounding knowledge. Our college has the great privilege of hosting this unique M.Tech batch, which adds more distinction to the outgoing students. Research projects, publications, workshops and professional talks have all paved their way into the technological world.

The valediction function, organized by the junior batch (2014-2016), was an official meet of all M.Tech students and faculty. Dr. Ajeesh Ramanujan (Asst. Prof., Department of CSE) graced the function and shared memories along with future endeavours. Other faculty members recollected their memories and gave their blessings to the outgoing students. When the students shared their admirable moments with the junior batch, it was a moving moment which made the juniors think about the various aspects of the field of Computational Linguistics. This is not an end, but a warm beginning for our great seniors: "You all listen to your teachers when they tell you what to do, give more importance to thinking about it later, and ask yourself why they told you to do it." ………………….Great Blessings…………………… (Content prepared by Sini G M)
What is the Right Direction for Research on Indian Languages? We can't imagine a day without Google, or more generally without the Internet. The term “googling” was even added to our dictionaries recently because of its prominent and continued usage. We are not actually searching the web but surfing it, since the search engine gives us only the most related answers to our query and we have to search further through all these related links for the required information. The problem is that the search engines fail to give us exact answers to our search, offering only an approximate set of answers based on some heuristics rather than actually understanding the content in the web page. In order to revolutionize the information retrieval area, we need a system that can understand the information content queried in natural language, which is one of the fundamental goals of natural language processing. The research on natural language processing has seen substantial progress in respect of the English language. In the case of Indian languages the direction of research appears to follow that of English. Following the same procedure for Indian languages as for English fails to exploit the unique properties of Indian languages. The Indian languages are based on the concept of “Akshara”, where each syllable has a unique pronunciation. While Unicode has come to be accepted as a standard for representing text, there are many difficulties in getting uniformity of text processing across applications, since Unicode represents only the vowels, consonants and matras. Consequently text representation involves variable length codes for each akshara, which implies that text processing gets unnecessarily complex. What one looks for is a simple extension of regular expression matching to work with aksharas as opposed to basic vowels and consonants. Such an approach is indeed viable. It will also help provide a uniform base for text processing across all Indian languages.
(Reference: Talk by Kalyana Krishnan R, Professor (Retired), Dept. of Computer Science and Engg., IIT Madras, on 19 Aug 2015)
The Nation Mourns its Missile Man
"My message, especially to young people is to have the courage to think differently, courage to invent, to travel the unexplored path, the courage to discover the impossible and to conquer the problems and succeed. These are great qualities that they must work towards. This is my message to the young people" --Dr.APJ. ABDULKALAM
On remembering Dr. Kalam, our dreams are also taking wings to soar high. We are fortunate to have such a great leader, an ideally dedicated person who ignited dreams of developed India in our minds. Despite all the achievements, Dr. Kalam made still remained humane and humble throughout life. He gained the greatest admiration in the contributions to rocket technology in India and the weaponisation of strategic missile systems. Besides being a great scientist, he was a remarkable leader, to every Indian and many more across the world. Former Finance Minister P Chidambaram remarkably quoted about Kalam that "in recent history only a few people had endeared themselves to the young and old, to the poor and rich, to the educated and the unlettered and to the people belonging to different faiths and speaking different languages". Dr. Kalam used to constantly interact with and motivate the students, engineers and scientists and that made him a mentor for all Indians. Even at the age of 84, he had a rigorous schedule full of what he is extremely passionate about- Inspiring and igniting young for a hard-work and to follow dreams, passion. In one of the quotes, he says, "Excellence is a continuous process and not an accident."To keep his dreams alive We the young India, should walk in the way enlightened by him and so let us pass his messages through ages. As Kalam epitomized sheer selfless dedication towards the betterment of this world, we may also follow his footprints. (Content prepared by Archana S .M, Naima Vahab )
Headlines in News Articles Based on Human Intelligence Learning

Krishnapriya P S, Assistant Professor, Department of CSE, IHRD College of Applied Science, Agaly, Palakkad, Kerala

ABSTRACT— This paper presents an experimental intelligent presentation system to automatically generate headlines for news stories or articles. Headlines are useful for users who only need information on the main topics of a story. The system consists of two parts. The first part, content selection, identifies what the image and accompanying article are about. The second part, surface realization, determines how to verbalize the chosen content. The model learns to create headlines from a database of news articles and the pictures embedded in them. The system postulates that images and their textual descriptions are generated by a shared set of latent variables (topics) and is trained on a weakly labelled dataset. Inspired by recent work in summarization, it uses a clustering algorithm and a keyword-mapping method.
I. INTRODUCTION

This paper introduces a headline generation system that selects headline words from every part of the whole text and the captioned image with nearly equal status: candidate words are chosen from the entire article together with the image, and are then composed by searching for phrase clusters located near the beginning of the text. Natural language, whether spoken, written, or typed, accounts for much of human communication. Connecting visual imagery with visually descriptive language is a challenge for computer vision that is becoming more relevant as recognition and detection methods are beginning to work. Humans can prepare a concise description in the form of a sentence comparatively easily. Such descriptions have the ability to identify the most interesting objects, what they are doing, and where this is occurring. These descriptions are plentiful because they are in sentence form. They are accurate, with good agreement between annotators. They are brief: much is neglected, because humans tend not to mention objects or events that they judge to be less significant. The standard approach to image description generation adopts a two-stage framework consisting of content selection and surface realization.

The first stage determines the content of the image and identifies “what to say”, whereas the second stage determines “how to say it”. Each of the two stages is generally developed by hand. Content selection typically makes use of dictionaries that explicitly state a correspondence between words and image regions or features, and surface realization uses human-written templates or grammars for producing textual output. This approach can produce sentences of very high quality that are both meaningful and accurate. This paper deals with the related problem of generating captions automatically for news articles. It focuses on automatic captioning without requiring expensive manual annotation. At training time, the model learns from images and their associated documents, while at test time it is given an image and the document it is embedded in, and generates the caption. The innovation here is to exploit this entangled information and treat the surrounding document and caption words as signals for the image, thus reducing the need for human supervision. This work argues that the redundancy inherent in such a multimodal dataset allows the development of a fully unsupervised caption generation system.
II. RELATED WORKS

Most previous work on summarization focused on extractive tasks, examining issues such as cue phrases (Luhn, 1958), positional indicators (Edmundson, 1964), lexical occurrence statistics (Mathis et al., 1973), probabilistic measures for token salience (Salton et al., 1997), and the use of implicit discourse structure.

A. Integrating Words and Pictures
Early effort on joining words and images concentrated on associating each and every word with image regions for tasks like clustering, auto-annotation or auto-illustration. A second line of work has made use of text as a source of noisy labels for predicting the content of an image. This works extremely well in constrained recognition scenarios, such as tagging faces in news photographs using the associated captions.

B. Learning Models of Categories or Relationships
Some recent work has attempted to learn models of the world from language or images. Saenko and Darrell learn visual sense models for polysemous words (e.g., “jaguar”) from textual descriptions on Wikipedia. Yanai and Barnard directly predict the visualness of concepts (e.g., “pink” is a more visual concept than “thoughtful”) from web image search results. A related body of work on image parsing and object detection learns the spatial relationships between labelled objects.

III. THE SYSTEM
As in any language generation task, summarization can be conceptually modelled as consisting of two major sub-tasks: content selection and surface realization. The target documents – the summaries that the system needed to learn to map to – were the headlines accompanying the news stories. The documents were preprocessed before training: formatting and mark-up information, such as font changes and SGML/HTML tags, was removed; punctuation, except apostrophes, was also removed.

A. Content Selection
Content selection requires that the system acquire a model of association between the presence of some features in a document and the appearance of corresponding features in the summary. This can be modelled by calculating the probability of a token appearing in a summary given that it appeared in the document to be summarized. The simplest model for this relationship treats the chosen content and the associated image keywords as the training data. Once the parameters of a content selection model have been estimated from a suitable document corpus, the model can be used to compute selection scores for candidate summary terms, given the terms occurring in a specific source document. In this case, the probability of any particular summary content candidate can be calculated simply as the product of the probabilities of the terms in the candidate set. Based on these assumptions, the overall probability of a headline candidate can be computed as the product of the likelihoods of (i) the terms selected for the summary, (ii) the length of the resulting summary, and (iii) the most likely sequencing of the terms in the content set.
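As a rough, hedged illustration of this content-selection scoring, the sketch below estimates from (article, headline) pairs the probability that a term appearing in an article also appears in its headline, and scores a candidate content set as the product of those per-term probabilities. The function names, the toy corpus and the smoothing floor are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

def train_content_model(pairs):
    """Estimate P(term appears in the headline | term appears in the article)
    from (article_tokens, headline_tokens) training pairs."""
    doc_counts, headline_counts = Counter(), Counter()
    for article, headline in pairs:
        article_terms, headline_terms = set(article), set(headline)
        for term in article_terms:
            doc_counts[term] += 1
            if term in headline_terms:
                headline_counts[term] += 1
    return {t: headline_counts[t] / doc_counts[t] for t in doc_counts}

def candidate_score(model, candidate_terms, floor=1e-6):
    """Selection score of a candidate content set: product of per-term probabilities."""
    score = 1.0
    for term in candidate_terms:
        score *= model.get(term, floor)   # unseen terms get a small floor probability
    return score

# Toy usage with made-up article/headline pairs.
pairs = [
    (["kerala", "floods", "relief", "fund", "announced"], ["kerala", "floods", "relief"]),
    (["floods", "damage", "crops", "reported"], ["floods", "damage"]),
]
model = train_content_model(pairs)
print(candidate_score(model, ["kerala", "floods"]))
```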
B. Surface Realization
The likelihood of a word sequence is approximated by the product of the probabilities of seeing each term given its immediate left context. Probabilities for sequences that have not been seen in the training data are estimated using back-off weights (Katz, 1987). As mentioned earlier, in principle, surface linearization calculations can be carried out with respect to any textual spans, from characters on up, and could take into account additional information at the phrase level. They could also, of course, be extended to use higher-order n-grams, provided that adequate numbers of training headlines were available to estimate the probabilities.

The BBC News database was downloaded. The full coverage texts were retrieved based on a snapshot of the links contained in Google Full Coverage at that time. A spider crawled the top eight categories: U.S., World, Business, Technology, Science, Health, Entertainment, and Sports. All of the news links in every group were saved in an index page that contained the headline and its full text URL.
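The surface-realization scoring described above (the product of bigram probabilities with back-off for unseen sequences) can be sketched roughly as follows. This is a minimal sketch that uses a crude discounted unigram back-off rather than true Katz back-off, and the training headlines are invented for illustration.

```python
from collections import defaultdict, Counter

class BigramHeadlineScorer:
    """Minimal bigram model over training headlines with a naive back-off."""
    def __init__(self, headlines):
        self.bigrams = defaultdict(Counter)
        self.unigrams = Counter()
        for words in headlines:
            padded = ["<s>"] + words
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[prev][cur] += 1
                self.unigrams[cur] += 1
        self.total = sum(self.unigrams.values())

    def prob(self, prev, cur):
        seen = self.bigrams[prev]
        if cur in seen:
            return seen[cur] / sum(seen.values())
        # back off to a discounted, add-one unigram estimate for unseen bigrams
        return 0.4 * (self.unigrams[cur] + 1) / (self.total + len(self.unigrams) + 1)

    def score(self, candidate):
        """Likelihood of a candidate headline as a product of bigram probabilities."""
        padded = ["<s>"] + candidate
        p = 1.0
        for prev, cur in zip(padded, padded[1:]):
            p *= self.prob(prev, cur)
        return p

scorer = BigramHeadlineScorer([["floods", "hit", "kerala"], ["kerala", "announces", "relief"]])
print(scorer.score(["floods", "hit", "kerala"]))
```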
IV. HEADLINE GENERATION

A. Word Selection
The system has to compose words into readable headlines. The top-scoring words over the whole story are selected and highlighted. Fig. 1 shows an example of the placement of the top-scored words in the beginning of the original text. From the given news story the frequent words are selected, and they are further checked against the keywords of the images. The system then determines which words are needed for the generation task.
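A minimal sketch of this word-selection step follows: non-stopword tokens are ranked by frequency over the story, and words that also appear among the image keywords are promoted. The stopword list, the scoring and the helper names are illustrative assumptions only.

```python
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "on", "for", "his", "by"}

def top_story_words(text, k=10):
    """Rank non-stopword tokens by frequency over the whole story."""
    tokens = [t.lower().strip(".,!?\"'()") for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]

def order_by_image_keywords(words, image_keywords):
    """Put words that also occur among the image keywords first."""
    image_keywords = set(image_keywords)
    matched = [w for w in words if w in image_keywords]
    rest = [w for w in words if w not in image_keywords]
    return matched + rest

story = ("Arvind Kejriwal said his minority government is determined to "
         "introduce and pass a Jan Lokpal Bill in the next 15 days")
print(order_by_image_keywords(top_story_words(story), ["kejriwal", "lokpal"]))
```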
B. Clustering
Unfortunately these words taken together do not satisfy the requirement of grammaticality. The idea is to pull out the largest window of words to form the headline. Top-scoring words appearing in the text are queued for clustering. After drawing bigrams centred on the top-scoring words, one can clearly see clusters of words forming. It is allowed to insert more than one topic for each headline, as described by the task definition, and the generated headline length is kept under the limit (10 words). The headlines are generated based on the keywords and their occurrences.

Example news from Fig. 1: “Describing the menace of corruption as a very big issue for the Aam Aadmi Party (AAP), Delhi chief minister Arvind Kejriwal said on Wednesday that his minority government, which is being given outside support by the Congress, is determined to introduce and pass a Jan Lokpal Bill in the next 15 days.”

V. CONCLUSION
This paper presented a new large-scale database of images and captions for the automatic headline generation task. It is designed to be representative of the challenges in learning automatically from realistic image and caption pairs mined freely from the Internet. Extensive experiments on the whole database and on various subsets show the merits of the database for the novel task of automatic caption generation for news images. The process fuses insights from computer vision and natural language processing and holds promise for various multimedia applications, such as image and video retrieval, the development of tools supporting news media management, and support for individuals with visual impairment.
Fig. 1. Example news with keywords highlighting
Fig. 2. Example images, sentences and their associated keywords

REFERENCES
[1] Y. Feng and M. Lapata, “Automatic Caption Generation for News Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 797-812, 2013.
[2] A. Kojima, M. Takaya, S. Aoki, T. Miyamoto, and K. Fukunaga, “Recognition and Textual Description of Human Activities by Mobile Robot,” Proc. Third Int'l Conf. Innovative Computing Information and Control, pp. 53-56, 2008.
[3] P. Hède, P.A. Moëllic, J. Bourgeoys, M. Joint, and C. Thomas, “Automatic Generation of Natural Language Descriptions for Images,” Proc. Recherche d'Information Assistée par Ordinateur, 2004.
[4] B. Yao, X. Yang, L. Lin, M.W. Lee, and S. Chun Zhu, “I2T: Image Parsing to Text Description,” Proc. IEEE, vol. 98, no. 8, pp. 1485-1508, 2009.
[5] A. Kojima, T. Tamura, and K. Fukunaga, “Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions,” Int'l J. Computer Vision, vol. 50, no. 2, pp. 171-184, 2002.
[6] Y. Wang and G. Mori, “A discriminative latent model of image region and object tag correspondence,” in NIPS, 2010.
[7] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, “Automatic face naming with caption-based supervision,” in CVPR, 2008.
[8] T. Cour, B. Sapp, C. Jordan, and B. Taskar, “Learning from ambiguously labeled images,” in CVPR, 2009.
[9] T. Berg, A. Berg, J. Edwards, and D. Forsyth, “Who's in the picture,” in NIPS, 2004.
[10] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan, “Matching Words and Pictures,” J. Machine Learning Research, vol. 3, pp. 1107-1135, 2003.
[11] P. Duygulu, K. Barnard, N. de Freitas, and D. Forsyth, “Object Recognition as Machine Translation,” Proc. European Conf. Computer Vision, 2002.
Sentiment Prediction of Movie Performance Using Social Media

Rahul Kothanath, School of Computing Science and Engg., VIT University, Chennai, India
Christy Jackson, Prof., School of Computing Science and Engg., VIT University, Chennai, India

ABSTRACT: Nowadays sentiment analysis on social media is widely used to analyze movies, business products, election results, etc. This analysis can also be used to predict the outcome of future events. In this study, we analyse a set of upcoming movies and try to predict their box office performance. We use data collected from IMDb, YouTube and Twitter for the analysis. The prediction is based on factors derived from the IMDb database, Twitter, YouTube, etc. We also use a calculated success value of the director and actor to make the prediction more reliable. The K-means algorithm is used to group the movies according to their similarities, and random forest ensemble learning is used for creating a predictive model.
I. INTRODUCTION

Social media platforms such as YouTube, Facebook and Twitter are networks which bring people onto a common platform. The data generated from these platforms has great business value. It can be used for different analytics which can bring huge change to the business world. The film industry is one of the popular industries that people like to talk about. The data collected from social media can be used for the betterment of the film industry; for example, film data analysis by geographical location can be used for better future decisions.

The topic of movies is an interesting subject among online users nowadays. Producers are now seeking a prediction of the success of their movie to validate their investment. Usually the success or failure of a movie depends on the quality of the movie; nowadays the result also depends on the competition between movies, the release date, the credibility of the people who worked behind the movie, etc. Research has been done to generate models for predicting the revenues of movies. Most of the analysis on films is done only after the release of the movie. Analysis of movies before release can actually create better decision input for the producers. Plenty of platforms are now available to get prior opinions on an upcoming movie. Today online users give feedback on movies starting from the release of the trailer/teaser of the movie. The users mostly use Twitter as the medium for expressing
their opinion. On Twitter there is already information related to the popularity of the Actor, Director, Scriptwriter and Actress. The information regarding the popularity of the people behind the movie helps greatly in predicting the outcome of the movie. The Internet Movie Database (IMDb) is a great repository of films. It contains a huge amount of information related to movies, and the repository provides much other information as well, including data related to actors, directors, actresses, etc.; if we want to collect the history of a specific director, it can be done through IMDb. The data collected from the IMDb database can be used for making better analytics on upcoming movies. YouTube is one of the popular video-sharing platforms; millions of people use YouTube to upload, share and talk about videos. The data related to a movie is thus usually available from different repositories. Our aim is to make a prediction on an upcoming movie considering many reliable factors. Usually we humans form our views considering many factors such as the director, actor, actress, scriptwriter, production house, etc.; nowadays machines make suggestions by analyzing users' opinions. Combining both can yield a more reliable opinion. People nowadays don't like prescriptive diagnostic analysis; they prefer predictive and recommendation models. So we need to keep track of the history of the director and analyze the probability of him being successful. Similarly the actor and script writer need to be watched. Since we are trying to form an opinion on an upcoming movie, only analyzing trailer comments and Twitter follower counts may not work properly. A prediction should not be made on the basis of trailer/teaser comments alone. Nowadays a trailer may become popular, but when it comes to the cinema the movie may fail. Studies suggest that the trailer does not have that much decision power in judging a movie's performance, even though it sometimes reflects the popularity of the entire team behind the movie. So here we consider the response to the trailer/teaser as only one factor. To make a sentiment classifier, we have different methodologies: the basic method of counting the number of positive/negative words, the dictionary-based approach, sentence-level classification, etc. More sophisticated methods also exist, but since we are not giving much weight to the trailer results, a dictionary-based approach is sufficient. The general method of sentiment classification is given in Fig. 1. Once we are able to combine this classification result with our other parameters, it can make a better prediction in advance. We have different machine learning approaches to solve the problem; since we don't have much training data to be analyzed, the selection of our approach plays a great role. We have taken the K-means algorithm here. Even though we are only grouping related movies, the method should be able to draw complex lines of separation. To validate the result, usually we would have to wait till the movie is released; to avoid that, we have taken an already released movie set and applied the algorithm.
Fig1. Sentiment Classifier
II. METHODOLOGY

Twitter has attracted many researchers and business people nowadays. Twitter data can be analyzed for predicting the future of business products, movie outcomes, etc. Due to its different features, we are able to extract much additional information about a particular person or movie. Sentiment analysis is the process of determining the kind of opinion expressed by a person. There are many methods used for this analysis; we have document-level and sentence-level classification of reviews. At the document level, the analysis is based on the complete document, whereas at the sentence level the analysis is performed sentence by sentence. Nowadays there are many other methods which use machine learning algorithms to analyze the text precisely. As part of this study, Twitter information is collected. Usually we can collect data from Twitter using the Twitter API, but here we need only a few pieces of information. The data collected from Twitter are the follower counts of the Director, Actor, Actress and Script Writer. Usually this count indicates the popularity of a personality and is a helpful element in judging the success of a movie; great film stars and popular actresses will have a large number of followers. Another piece of information we need is the historical data of the Director, Actor and Script Writer. The IMDb repository provides information related to the profiles of cinema workers, and researchers are using the IMDb database for various projects related to the movie industry. The data collected are listed below.

A. Popularity of Actor, Actress, Director
To predict the box office success of a movie, we also need the popularity of the people who worked behind the movie. To get this, we collect details such as the number of followers of the Actor, Director and Actress; generally this gives information about those people. Even if an actor does not have much experience, his follower count will reflect his popularity among people; whether he is a familiar face or a new upcoming star, it helps our analysis. Similarly, an actress who became successful in a short time period will have a huge follower count. So this type of information is needed for our analysis.

B. Sequel Movie
The movie considered may be the second or third part of a sequel. In this situation we have to add a weight for this element also. Usually a successful movie will have a second part, so it is an unavoidable parameter.
Fig. 2. Popularity Measure
Fig. 3. Confidence Measure
C. Confidence Value of Director, Actor
Usually, judging a Director or Actor based on popularity alone doesn't give good results. The success value of these people should be analyzed by tracking their previous history. So we take the IMDb ratings of the director's previous movies and average them to get a final rating on the director; the same is done for the script writer. Similarly, the ratings of the actor's past 4 movies are taken and averaged. If the director, actor or script writer is new, we add a fixed value as their rating.
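As a small hedged illustration of this averaging scheme, the sketch below computes such a confidence value; the default rating for newcomers and the 4-movie window are illustrative parameter choices that only mirror the description above.

```python
def confidence_value(past_imdb_ratings, default=5.0, window=4):
    """Average the IMDb ratings of a person's last few movies.
    Newcomers with no history get a fixed default rating."""
    if not past_imdb_ratings:
        return default
    recent = past_imdb_ratings[-window:]
    return sum(recent) / len(recent)

print(confidence_value([7.1, 6.4, 8.0, 5.9, 7.5]))  # director with a history
print(confidence_value([]))                          # debut director
```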
The comments collected for the trailer/teaser are classified using some techniques. Each comment is divided into a set of tokens, and the sentiment of each token is analyzed and classified as good, bad or neutral. A positive sentiment has a value of +1, a negative one -1 and a neutral one 0. The total integer sum over a comment decides the class to which the comment belongs. The aggregated result of all comments related to the trailer and teaser classifies the movie into one class; an example of the expected output of this classification is given in Fig. 4. The sentiment of each movie trailer/teaser is found in this way. The class value obtained is categorical, and to integrate this data with the other attributes of the film we map the Hit/Flop/Neutral classes onto numerical values, which are then used with the other parameters for making the decision.

K-means is a clustering algorithm which groups data items based on their mean value. Here it clusters similar movies, and the cluster labels can be used for making our decisions. Once the clustering results are created, the label is assigned to each data tuple and used for the creation of a predictive model. The clustering result with K-means is given in Fig. 5. The labelled data is then given to the random forest algorithm; since movie classification is a difficult process, an ensemble learner is preferred.

Fig. 4. Movie Teaser/Trailer Classification

Movie               Opinion
Iron Man 3          Hit
The Iceman          Flop
The Great Gatsby    Neutral
Peeples             Flop
Black Rock          Flop
Fast & Furious 6    Hit
Epic                Neutral
After Earth         Flop
Man of Steel        Neutral

Fig. 5. Output of K-means

The statistics of the evaluation are given below.

Total Number of Instances: 10
Kappa statistic: 1
Mean absolute error: 0.06
Root mean squared error: 0.1125
Relative absolute error: 14.2683 %
Root relative squared error: 24.7074 %

=== Confusion Matrix ===
 a b c   <-- classified as
 2 0 0 | a = Hit
 0 5 0 | b = Flop
 0 0 3 | c = Neutral
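A minimal sketch of the dictionary-based comment scoring described above is given next. The positive and negative word lists and the decision thresholds are illustrative assumptions, not the authors' actual lexicon.

```python
# Illustrative lexicons; the real system would use a full sentiment dictionary.
POSITIVE = {"awesome", "great", "hit", "superb", "love", "amazing"}
NEGATIVE = {"boring", "flop", "bad", "worst", "disappointing"}

def comment_score(comment):
    """Sum of +1 / -1 / 0 token scores for one trailer comment."""
    score = 0
    for token in comment.lower().split():
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

def classify_trailer(comments):
    """Aggregate comment scores into a Hit / Flop / Neutral label."""
    total = sum(comment_score(c) for c in comments)
    if total > 0:
        return "Hit"
    if total < 0:
        return "Flop"
    return "Neutral"

print(classify_trailer(["awesome trailer, looks great", "worst teaser ever"]))
```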
III. CONCLUSION

In this paper, we predict the box office outcome of a set of movies. We collect various data related to the set of movies from Twitter and create a predictive model for classifying them. According to our model, we identify different patterns such as: (1) the opinion of users on the movie teaser/trailer; (2) the importance of the Actor, Actress, Director and Scriptwriter; (3) the sequel attribute of the movie; and (4) the success value of the Actor/Director/Script Writer. The analysis helps to group movies according to their success value, which will be useful for the team behind a film when deciding to change the release date or make some modifications. As future work it can be extended to incorporate the release date, the past 10-year history of movies, audience targeting by movie, etc., to get better and more accurate details regarding movies.
REFERENCES
[1] A., & A. F. (2013). “Sentiment Analysis of Hollywood Movies on Twitter”. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. doi:10.1145/1871437.1871741
[2] Agarwal, A., Xie, B., Vovsha, I., Rambow, O., & Passonneau, R. (2011). “Sentiment Analysis of Twitter Data”.
[3] Aman, S., & Szpakowicz, S. (2007). “Identifying Expressions of Emotion in Text”. doi:10.1007/978-3-540-74628-7_27
[4] Asur, S., & Huberman, B. A. (2010). “Predicting the Future with Social Media”. doi:10.1109/WI-IAT.2010.63
[5] Barlett, K. (2008). “Application of Unsupervised Learning Methods”.
[6] Hetal Bhavasar, F., & Amit Ganatar, A. A. (2012). “A Comparative Study of Training Algorithms for Supervised Machine Learning”. International Journal of Soft Computing and Engineering (IJSCE) ISSN.
[7] doi:10.1109/CNSR.2007.22
[8] Sharda, R., & Delen, D. (2006). “Predicting box-office success of motion pictures with neural networks”. Expert Systems With Applications. doi:10.1016/j.eswa.2005.07.018
ADVANCED BIG DATA TOOLS COULD ENHANCE INDUSTRY PERFORMANCE USA-based management consulting firm Bain and Company reported that the oil and gas producers could improve their performance by six to eight per cent using advanced big data analytic tools. According to Bain and Company, seismic software, data visualisation and a new generation of pervasive computing devices – sensors that collect and transmit data – would continue to open new possibilities in the market. With these new tools and advanced analytic capabilities, oil and gas producers can capture more detailed data in real time at lower costs, added the company. “Our recent survey of more than 400 executives in oil and gas sector revealed that companies with better analytics capabilities were twice as likely to be in the top quartile of financial performance in their industry, five times more likely to make decisions faster than their peers and three times more likely to execute decisions as planned,” said a company source. http://www.oilandgasbigdata.com/
Home Appliance Recognition and Future Usage Prediction Using Smart Meter

Divya Jose, PG Scholar, Department of CSE, Jyothi Engineering College, Cheruthuruthy, Thrissur. Email: divyajos001@gmail.com
ABSTRACT— Recognition of electrical appliance usage from meter power readings has become an area of study. There are currently two main methods to deal with the worldwide growth in energy consumption: the first is to increase energy production capabilities, and the second is to move toward more efficient energy consumption. Power management for homes and offices requires both recognition and prediction of the future usages or service requests of the different appliances present in the buildings. From the grid point of view, the work requires the division of the total load into its constituent components. The aim of this work is to identify residential appliances from the aggregate reading at the smart meter and to predict their states in order to minimize their energy consumption. For this work, our purpose is divided into two distinct modules: appliance recognition and future usage prediction. Both are based on multi-label learners which take inter-appliance correlation into account.
I. INTRODUCTION

Knowledge of appliance usage in buildings is needed for control, the issue being to define “relevant knowledge”. Monitoring and control of appliance consumption and usage has two different purposes, whose business model is in the development stage. First, from the point of view of end users, having information on appliance-level usage can lead to reduced cost through adapting the consumption of energy or offering possible ancillary services. Second, from a grid point of view, the control of more loads represents more stability for the grid, i.e.
more flexibility and reliability. These services represent elementary bricks of an energy management system whose privacy limits have still to be defined. They are based firstly on load identification and secondly on prediction of load energy consumption. Fig. 1 shows the different actors in a smart residential building or smart home system.

A. Load Identification
The first approach to load separation is based on identifying the state transitions of appliances, which in most cases is done by ON/OFF transition recognition. The work in load separation proposed methods to identify individual appliances from their ON/OFF transitions [2]. Appliance transitions result in corresponding changes in the overall power consumption monitored at the power meter.

Fig. 1. Potential components in a smart residential building

In the last two decades there has been a sizable amount of work to this effect [3], [4]. Each new method proposes to reduce the limitations of the previous ones, both in terms of signatures and in applying state-of-the-art pattern recognition techniques. The identified features are called appliance signatures [5]. The disadvantage of these approaches is mainly the hardware requirement due to high sampling rates and the impracticality of the process being totally non-intrusive. The load separation at a high sampling rate of all the appliances also raises privacy concerns [6], as user activity can be easily detected and monitored.

B. Prediction of Energy Consumption
The anticipation of problematic situations by the energy management system also requires prediction capabilities. Even if it is easier to guess overall usage, it is important to be able to predict the usage of each appliance because, regarding dynamic demand-side management, it is also important to evaluate how much energy can be saved in response to specific requests to the customers, like unbalancing requests, load shedding or power variations. The energy savings depend on the appliances: some can be unbalanced, some can be postponed and some cannot be used. These load controls need at least anticipation and reaction. Considering these elementary algorithmic bricks, Fig. 2 proposes a synoptic of a three-layer energy management system.

Fig. 2. A three-layer architecture for energy management of residential houses
II. DATA SETS

The IRISE database is obtained from Residential Monitoring to Decrease Energy Use and Carbon Emissions in Europe (REMODECE), which is a European database on residential usage, including Central and Eastern European countries. This database stores the characterization of residential electricity consumption by end-user and by country. As part of the REMODECE dataset, the IRISE dataset deals only with houses in France, grouping the energy consumption of 100 houses. In each database concerning one house, information is recorded every 10 minutes and every 1 hour for each appliance in the house, over a year. This information represents the energy consumed by each appliance, together with its date and time. At the 1-hour sampling rate, weather and humidity information is also included.

III. TEMPORAL CLASSIFIER
A. Temporal Sliding Window
Temporal data mining encompasses time series analysis, both in the type of data and in scope. Temporal data can be time series or events, and the field includes, among other topics, classification and prediction. The classifier system, both for load identification/recognition and for future usage prediction, is based on temporal classification using standard propositional machine learning algorithms. In order to capture the time dependency, it creates copies of the target field that are shifted in time and generates the sub-sequences. Instances containing these sub-sequences and the current target value are presented as standard propositional instances to the underlying classifying algorithm. This process effectively removes the time dependency in the original target, since it is captured by the shifted attributes, which is essentially a sliding window.
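A rough sketch of this sliding-window transformation follows, under the assumption of a simple list-based representation; the window size and variable names are illustrative only.

```python
def sliding_window_instances(series, targets, window=10):
    """Turn an aggregate-power series and per-time-step targets into
    (features, label) pairs usable by a standard propositional classifier:
    the features are the previous `window` readings, the label is the
    current target value."""
    instances = []
    for t in range(window, len(series)):
        features = list(series[t - window:t])
        instances.append((features, targets[t]))
    return instances

# Toy aggregate power readings and a per-step ON/OFF target for one appliance.
power = [120, 130, 900, 950, 140, 135, 980, 970, 130, 125, 940, 150]
heater_on = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
for feats, label in sliding_window_instances(power, heater_on, window=4)[:2]:
    print(feats, "->", label)
```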
B. Multi-label Classifiers
For multi-label classification learners [21], [22], the problem of load identification is not a simple case of temporal classification, as there is a possibility to take into account correlations between appliances whenever they exist. The classification learner approximates a function mapping a vector into labels, rather than a scalar output, by looking at input-output examples of this function. There are two broad approaches to handling multi-label classification. One is by way of problem transformation, where a multi-label problem is transformed into one or more single-label problems.
1) Binary Relevance (BR): a problem-transformation method that learns a separate single-label binary model for each label class [21]. It transforms the original data into single-label data-sets that contain all the features of the original data-set. The BR algorithm will extract as many new tables as there are labels, each of them grouping all attributes and only one label.
2) Label Powerset (LP): considers each different set of labels that exists in the multi-label data-set as a single label [21]. Unlike the BR classifier, the LP algorithm trains only one single classifier, whose classes are the distinct label sets.

C. Evaluation Measures
Precision: the fraction of the positive states (ON) of the appliances correctly predicted. Recall: the ratio between the predicted positive states (ON) of the appliances and the total number of correct positive states of the appliances. Accuracy: the percentage of cases where the predicted energy state (ON or OFF) is correct for an appliance.
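A minimal Binary Relevance sketch in the spirit of Sec. III-B above is shown below: one binary classifier is trained per appliance label. The choice of scikit-learn's DecisionTreeClassifier and the toy meta-feature matrix are illustrative assumptions; the paper does not prescribe this particular base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class BinaryRelevance:
    """One independent binary model per label column (appliance ON/OFF)."""
    def __init__(self, base=DecisionTreeClassifier):
        self.base = base
        self.models = []

    def fit(self, X, Y):
        # Y has one column per appliance: 1 = ON, 0 = OFF
        self.models = [self.base().fit(X, Y[:, j]) for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models])

# Toy meta-features and labels (columns: water heater, washing machine).
X = np.array([[0.2, 1.1], [2.5, 0.9], [0.3, 1.0], [2.4, 1.2]])
Y = np.array([[0, 1], [1, 0], [0, 1], [1, 1]])
print(BinaryRelevance().fit(X, Y).predict(X))
```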
IV. LOAD IDENTIFICATION

A. Synoptic
The synoptic of the load identification proposed in this work is shown in Fig. 4. The identification uses a specific set of meta-features which take into account the different characteristics of appliances (loads), such as time of use, duration of use, trend of loads, sequence of loads, spikes in loads and correlation among appliances. Some of the generated meta-features are summed up in Fig. 5.

Fig. 4. Synoptic of load identification

In this section, the detailed implementation of the load identification is discussed. A stepwise outline of the load identification implementation is shown below.
1) The aggregate energy readings from the IRISE dataset (Sec. II) at the sampling rate of 1 hour are generated.
2) Subsequences are generated using the temporal sliding window (Sec. III-A) with a window size of 10 units (the unit being the sampling rate of the dataset).
3) Meta-features (Sec. III-B) for the subsequences are generated.
4) The multi-label classifier (Sec. III-C) with the generated features as input and the high-energy appliances as output classes is trained on 10 percent of the dataset.
5) The model is evaluated (Sec. III-D) on the other 90 percent of the dataset.
- hour of the day,
- local maximum and minimum in the sliding window,
- distance from the current state to the local max. and min.,
- energy change from the current state,
- gradient and Laplacian at each time interval.
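The meta-features listed above can be sketched roughly as follows for a single window of aggregate readings. The exact definitions used in the paper may differ, so the formulas here (e.g., the mean gradient and mean Laplacian) are assumptions for illustration.

```python
def window_meta_features(window, hour_of_day):
    """Compute simple meta-features over one sliding window of readings."""
    cur = window[-1]
    local_max, local_min = max(window), min(window)
    idx_max, idx_min = window.index(local_max), window.index(local_min)
    gradient = [b - a for a, b in zip(window, window[1:])]
    laplacian = [b - a for a, b in zip(gradient, gradient[1:])]
    return {
        "hour_of_day": hour_of_day,
        "local_max": local_max,
        "local_min": local_min,
        "dist_to_max": len(window) - 1 - idx_max,   # steps back to the maximum
        "dist_to_min": len(window) - 1 - idx_min,   # steps back to the minimum
        "energy_change": cur - window[0],
        "mean_gradient": sum(gradient) / len(gradient),
        "mean_laplacian": sum(laplacian) / len(laplacian),
    }

print(window_meta_features([120, 900, 950, 140, 980, 130], hour_of_day=18))
```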
B. Identification Results
The results indicate that the water heater and washing machine are identified with higher performance than other high-energy-consuming appliances in the house. These two loads are interesting for deferring, and it is therefore favourable to propose an algorithm with good efficiency in identifying them. Though not presented in this work, appliance identification based on algorithms which take inter-appliance dependence into account performs in general better than identification without it.

V. APPLIANCE USAGE PREDICTION
Real-time demand response can be complemented if the future usage of deferrable loads can be predicted with reasonable accuracy. The proposed model tries to take into account all the possible
information based on the identified appliance state, the time of the event and meteorological information. A stepwise outline of the future usage prediction implementation is shown below.
1) The identified states (ON/OFF) from the load identification results at 1-hour sampling are obtained.
2) Subsequences are generated using the temporal sliding window with a window size of 24 units.
3) Features for the subsequences are generated.
4) The multi-label classifier with the generated features as input and the high-energy appliances as output classes is trained and tested iteratively.
5) The model is learned iteratively and is tested in an online learning procedure.

Fig. 6 describes the principle of the proposed model. At every sampled time instance it is predicted whether an appliance will start in the following hour or not. This idea will be extended to more than one hour in future works. In the following subsections the feature space and the results are discussed.

Fig. 5. Different meta-features concerning the aggregated power measurements in a sliding window
Fig. 6. Proposed method at a given time instance

The results have been validated on a dataset of 100 houses monitored over 1 year, and a result for 1 house is presented. The results suggest that certain high-consuming appliances can be identified and predicted even at a low sampling rate of 1 hour. This result is important in the context of energy management, and specifically for better non-intrusive monitoring of loads and future usage prediction of deferrable loads.
REFERENCES
[1] C. Clastres, T. H. Pham, F. Wurtz, and S. Bacha, “Ancillary services and optimal household energy management with PV production,” Energy, Elsevier, vol. 35, no. 1, pp. 55-64, January 2010.
[2] G. Hart, “Nonintrusive appliance load monitoring,” Proceedings of the IEEE, vol. 80, no. 12, pp. 1870-1891, Dec. 1992.
[3] M. E. Berges, E. Goldman, H. S. Matthews, and L. Soibelman, “Enhancing electricity audits in residential buildings with nonintrusive load monitoring,” Journal of Industrial Ecology, vol. 14, no. 5, pp. 844-858, 2010.
[4] W. K. Lee, G. S. K. Fung, H. Y. Lam, F. H. Y. Chan, and M. Lucente, “Exploration on load signatures,” Electrical Engineering, no. 725, pp. 1-5, 2004.
[5] M. B. Figueiredo, A. de Almeida, and B. Ribeiro, “An experimental study on electrical signature identification of non-intrusive load monitoring (NILM) systems,” in Proceedings of the 10th International Conference on Adaptive and Natural Computing Algorithms – Volume Part II, ser. ICANNGA'11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 31-40.
[6] G. Kalogridis, C. Efthymiou, S. Denic, T. Lewis, and R. Cepeda, “Privacy for smart meters: Towards undetectable appliance load signatures,” in Smart Grid Communications (SmartGridComm), 2010 First IEEE International Conference on, Oct. 2010, pp. 232-237.
[7] A. Prudenzi, “A neuron nets based procedure for identifying domestic appliances pattern-of-use from energy recordings at meter panel,” in Power Engineering Society Winter Meeting, 2002, IEEE, vol. 2, 2002, pp. 941-946.
What professional designers think of Google’s new logo Google last week, in case you missed it, introduced a brand new logo design. While not a monumental change, the new typeface offers up a subtle variation to the logo the search giant had been using for the last five years. The most prominent change is that Google’s new typeface completely does away with the serifs that helped define its logo for the past 16 years.
(http://bgr.com)
Deep Learning in NLP

Athira M, M.Tech Computational Linguistics, University of Calicut, athirakalengottil@gmail.com
Bhavya K, M.Tech Computational Linguistics, University of Calicut, aamibhavya@gmail.com

ABSTRACT: Deep learning has become a new field of machine learning and has gained extensive interest in different research areas. It tries to mimic the human brain, which is capable of processing and learning from complex input data and of solving different kinds of complicated tasks well. It has been successfully applied to several fields such as images, sound, text and motion. The techniques developed from deep learning research have already been impacting research in natural language processing.
I. INTRODUCTION

Deep learning is a branch of machine learning based on learning representations of data. Representation learning is a set of techniques that learn features: transformations of the raw data input to a representation that can be effectively exploited in machine learning tasks. Deep learning algorithms are based on learning multiple levels of features or representations of the data: higher-level features are derived from lower-level features to form a hierarchical representation. Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor measurements are usually complex, redundant, and highly variable. Thus, it is necessary to
discover useful features or representations from raw data. The feature learning process can be purely unsupervised, which can take advantage of massive unlabelled data. Feature learning tries to learn, at each level, a new transformation of the previously learned features that is able to reconstruct the original data. For supervised learning tasks where label information is readily available in training, deep learning promotes a principle which is very different from the traditional methods of machine learning. That is, rather than focusing on feature engineering, which is often labour-intensive and varies from one task to another, deep learning methods are focused on end-to-end learning based on raw features. In other words, deep learning moves away from feature engineering to the maximal extent possible.
II. APPLICATIONS

Deep learning techniques have already been impacting a wide range of machine learning and artificial intelligence work, and are thought to be moving machine learning closer to one of its original goals: Artificial Intelligence. They have been successfully applied to several fields such as images, sound, text and motion. The rapid increase in scientific activity on deep learning has been motivated by its empirical successes both in academia and in industry.

Object Recognition
Object recognition is thought to be a nontrivial task for a computer. The MNIST digit image classification problem has been used as a benchmark for many machine learning algorithms. In the last few years, deep learning has extended from digits to object recognition in natural images.

Speech Recognition and Signal Processing
Deep learning has yielded breakthrough results, obtained by several academics as well as researchers at industrial labs, even bringing these algorithms to a larger scale and into products. For example, Microsoft released a new version of its MAVIS (Microsoft Audio Video Indexing Service) speech system based on deep learning in 2012. This work reduced the word error rate on four major benchmarks by about 30% compared to state-of-the-art models based on Gaussian mixtures for the acoustic modelling and trained on the same amount of data.

The standard deep neural network is a static classifier with input vectors having a fixed dimensionality. However, many practical pattern recognition and information processing problems, including speech recognition, machine translation, natural language understanding, video processing and bio-information processing, require sequence recognition. In sequence recognition, sometimes called classification with structured input/output, the dimensionality of both inputs and outputs is variable. One way to solve this problem is through the HMM.

Natural Language Processing
Besides speech recognition, deep learning has been applied to many other Natural Language Processing applications. One important application is word embedding: the idea that symbolic data can be represented via distributed representations, where the learning of a distributed representation for each word is called a word embedding.
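As a toy illustration of the word-embedding idea (not a trained model), the sketch below maps words to made-up dense vectors and compares them with cosine similarity; the vectors and vocabulary are assumptions for illustration only.

```python
import numpy as np

# Illustrative, hand-made embedding vectors; a real system would learn these.
embeddings = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.70, 0.12]),
    "apple": np.array([0.10, 0.05, 0.90]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```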
III. CONCLUSION

Deep learning is an emerging field in Natural Language Processing which has a wide range of applications. It has shown some advantages over traditional machine learning methods in some fields. Although deep learning works well in many machine learning tasks, in some areas it performs just as poorly as the other learning methods.
REFERENCES
1. Xiaodong He, Jianfeng Gao, and Li Deng, “Deep Learning for Natural Language Processing: Theory and Practice”, Deep Learning Technology Center, Microsoft Research, Redmond, WA.
2. Tianchuan Du, Vijay K. Shanker, “Deep Learning for Natural Language Processing”.
3. Auli, M., Galley, M., Quirk, C., and Zweig, G., 2013, “Joint Language and Translation Modelling using Recurrent Neural Networks”. In EMNLP.
FACEBOOK'S NEWEST DEEP LEARNING SYSTEM MAKES SAMPLES OF IMAGES THAT HUMANS THINK ARE REAL 40% OF THE TIME Researchers at Facebook have taken a step closer to a holy grail of artificial intelligence known as unsupervised learning. They've come up with a way to generate samples of photographs that don't look all that fake. In fact, the computer-generated samples (of scenes featuring planes, cars, birds, and other objects) looked real to volunteers who took a look at them 40 percent of the time, according to a new paper on the research posted online yesterday. Facebook has submitted the paper for consideration in the upcoming Neural Information Processing Systems (NIPS) conference in Montreal. The research goes beyond the scope of supervised learning, which many startups and large companies, including Facebook, use for a wide variety of purposes. Supervised deep learning traditionally involves training artificial neural networks on a large pile of data that come with labels (for instance, "these 100 pictures show geese") and then throwing them a new piece of data, like a picture of an ostrich, to receive an educated guess about whether the new picture depicts a goose.
http://www.venturebeat.com//
Information Extraction from Big Data

Soorya K, M.Tech Computational Linguistics, University of Calicut, suryavpz@gmail.com
Sreetha S, M.Tech Computational Linguistics, University of Calicut, s.sreetha.1@gmail.com

ABSTRACT: Big data is a phrase used to describe a volume of both structured and unstructured data that is so massive that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, or it moves too fast, or it exceeds current processing capacity. The biggest challenge is to extract information from these kinds of unstructured data. Despite these problems, big data has the potential to help companies improve operations and make faster, more intelligent decisions.
I. INTRODUCTION

Every day we create quintillions of bytes of data, and most of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it. This data is big data.

A primary goal of NLP is to derive meaning from text. Natural Language Processing generally makes use of linguistic concepts such as grammatical structures and parts of speech to extract information. Often, the idea behind this type of analytics is to determine who did what to whom, when, where, how, and why. In general, text analytics solutions for big data use a combination of statistical and Natural Language Processing (NLP) techniques to extract information from unstructured data. NLP is a broad and complex field that has developed over the last 20 years. NLP performs analysis on text at different levels.

Lexical/morphological analysis examines the characteristics of an individual word, including prefixes, suffixes, roots, and parts of speech (noun, verb, adjective, and so on), information that will contribute to understanding what the word means in the context of the text provided. Lexical analysis depends on a dictionary, thesaurus, or any list of words that provides information about those words.

Syntactic analysis uses grammatical structure to dissect the text and
put individual words into context. Here you are widening your gaze from a single word to the phrase or the full sentence. This step might diagram the relationship between words (the grammar) or look for sequences of words that form correct sentences, or for sequences of numbers that represent dates or monetary values.

Semantic analysis determines the possible meanings of a sentence. This can include examining word order and sentence structure and disambiguating words by relating the syntax found in the phrases, sentences, and paragraphs.

Discourse-level analysis attempts to determine the meaning of text beyond the sentence level.

II. WHERE DOES BIG DATA COME FROM?

Big data is often boiled down to a few varieties including social data, machine data, and transactional data. Social media data is providing remarkable insights to companies on consumer behaviour and sentiment that can be integrated with CRM data for analysis, with 230 million tweets posted on Twitter per day, 2.7 billion Likes and comments added to Facebook every day, and 60 hours of video uploaded to YouTube every minute (this is what we mean by the velocity of data). Machine data consists of information generated from industrial equipment, real-time data from sensors that track parts and monitor machinery (often also called the Internet of Things), and even web logs that track user behaviour online. Regarding transactional data, large retailers and even B2B companies can generate multitudes of data on a regular basis, considering that their transactions consist of one or many items, product IDs, prices, payment information, manufacturer and distributor data, and much more. Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants like US pizza chain Domino's, which serves over 1 million customers per day, are generating petabytes of transactional big data. The thing to note is that big data can resemble traditional structured data or unstructured, high-frequency information. The big data trend promises that harnessing the wealth and volume of information in your enterprise leads to better customer insight, operational efficiency, and competitive advantage.

III. HOW TO EXTRACT INSIGHT FROM UNSTRUCTURED DATA?

As a result, organizations have to study both structured and unstructured data to arrive at meaningful business decisions, including determining customer sentiment, cooperating with e-discovery requirements and personalizing their product for their customers. Not only do they have to analyze information provided by consumers and other organizations, but information collected from devices must also be scrutinized. This must be done not only to ensure that the organization is on top of any network security threats, but also to ensure the proper functioning of embedded devices. While sifting through vast amounts of information can look like a lot of work, there are rewards. By reading large, disparate sets of unstructured data, one can identify
29
connections from unrelated data sources and find patterns. What makes this method of analysis extremely effective is that it enables the discovery of trends; traditional methods only work with what is already quantifiable, while looking through unstructured data can cause revelations.
are important for choosing the data storage and data retrieval depend often on the scalability, volume, variety and velocity requirements. A potential technology stack should be well evaluated against the final requirements, after which the information architecture of the project is set.
x
A few likely influential requirements are that the results of the analysis must be available in real-time, have high availability for access while still functioning in a real-time multitenant environment. Real-time access is crucial, as it has become important for ecommerce companies to provide real-time quotes. This requires tracking real-time activities, and providing offerings based on the results of a predictive analytic engine. Technologies that can provide this include Storm, Flume and Lambda. High availability is crucial for ingesting information from social media. The technology platform used must ensure that no loss of data occurs in a real-time stream. It is a good idea to use a messaging queue to hold incoming information as part of a data redundancy plan, such as Apache Kafka. The ability to function in real-time multi-tenancy environments is required if the results are required to avoid state changes and continue to be mutable data.
Make sense of the disparate data sources Before the actual process of extraction, one needs to know what sources of data are important for the analysis. One information channel is log files from devices, but that source won't be of much help when searching for user trends. If the information being analyzed is only tangentially related to the topic at hand, it should be set aside. Instead, only use information sources that are absolutely relevant. x
Sign off on the method of analytics and find a clear way to present the results The analysis is useless if it is not clear what the end result should be. One must understand what sort of answer is needed - is it a quantity, a trend or something else? In addition, one must provide a roadmap for what to do with the results so that they can be used in a predictive analytics engine before undergoing segmentation and integration into the business's information store. x Decide the technology stack for data ingestion and storage Even though the raw data can come from a wide variety of sources, the results of the analysis must be placed in a technology stack or cloud-connected information store so that the results can be easily utilized. Factors that
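As an editorial aside (not taken from the cited sources), the following minimal Python sketch shows how incoming social-media records could be buffered in Apache Kafka using the third-party kafka-python client. The broker address, topic name and record fields are illustrative assumptions, and a running Kafka broker is required.

import json
from kafka import KafkaProducer   # pip install kafka-python

# Hypothetical local broker and topic; adjust for a real deployment.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

def buffer_social_record(record):
    """Queue one incoming social-media record so a downstream consumer can
    ingest it later without the risk of losing part of the real-time stream."""
    producer.send("social_media_raw", value=record)

buffer_social_record({"user": "u123", "text": "Great service today!", "ts": 1441100000})
producer.flush()   # make sure queued records are actually sent before exiting

The queue decouples collection from processing: if the analysis layer slows down or restarts, the records simply wait in the topic instead of being dropped.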
Keep information in a data lake until it has to be stored in a data warehouse
Traditionally, an organization obtained or generated information, sanitized it and stored it away. For example, if the information source was an HTML file, the text might be stripped out and the rest discarded, so that information was lost during storage in the data warehouse. Anything useful that was discarded in the initial data load was lost as a result, and the only thing one could do with the data was whatever remained possible after the extraneous information had been stripped away. The appeal of this strategy was that the data was kept in a pristine, mutable format that could be used whenever needed. However, with the advent of big data, it has become common practice to do the opposite. With a data lake, information is stored in its native format until it is actually deemed useful and needed for a specific purpose, preserving metadata or anything else that might assist in the analysis.

Prepare the data for storage
While keeping the original file, if one needs to make use of the data, it is best to clean up a copy. In a text file there can be a lot of noise or shorthand that can obscure valuable information. It is good practice to cleanse noise such as whitespace and stray symbols, and to convert informal text into formal language. If the spoken language can be detected, the text should be categorized accordingly. Duplicate results should be removed, the dataset treated for missing values, and off-topic information removed from the dataset.

Retrieve useful information
Through the use of natural language processing and semantic analysis, one can use part-of-speech tagging and named entity recognition to extract common entities, such as "person", "organization" and "location", and their relationships. From this, one can create a term frequency matrix to understand the word patterns and flow in the text. A minimal sketch of this step is given below.
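The short sketch below is an editor-added illustration of the "retrieve useful information" step, not part of the cited sources: named entities are pulled out with NLTK and a simple term frequency matrix is built with scikit-learn. The sample documents are invented, and the sketch assumes the NLTK tagger and named-entity chunker models are already downloaded.

import nltk
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Acme Corp opened a new plant in Pune last year.",
    "Customers praised Acme Corp for its fast delivery in Pune.",
]

# Named entity extraction: chunk POS-tagged tokens into PERSON/ORGANIZATION/GPE subtrees.
def named_entities(text):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    return [
        (" ".join(tok for tok, _ in subtree.leaves()), subtree.label())
        for subtree in tree.subtrees()
        if subtree.label() in {"PERSON", "ORGANIZATION", "GPE"}
    ]

for d in docs:
    print(named_entities(d))   # e.g. [('Acme Corp', 'ORGANIZATION'), ('Pune', 'GPE')]

# Term frequency matrix: one row per document, one column per vocabulary term.
vectorizer = CountVectorizer(stop_words="english")
tf_matrix = vectorizer.fit_transform(docs)
print(vectorizer.vocabulary_)   # term -> column index
print(tf_matrix.toarray())      # raw term counts per document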
Ontology evaluation
Through analysis, one can then create the relationships among the sources and the extracted entities, so that a structured database can be designed to specification. This can take time, but the insights provided can be worth it for an organization.

Statistical modeling and execution
Once the database has been created, the data must be classified and segmented. It can save time to make use of supervised and unsupervised machine learning, such as the K-means, Logistic Regression, Naïve Bayes and Support Vector Machine algorithms. These tools can be used to find similarities in customer behavior, to target a campaign and for overall document classification. The disposition of customers can be determined with sentiment analysis of reviews and feedback, which helps to understand future product recommendations and overall trends, and to guide the introduction of new products and services. The most relevant topics discussed by customers can be analyzed through temporal modeling techniques, which can extract the topics or events that customers are sharing via social media, feedback forms or any other platform. A minimal classification sketch is given after these steps.

Obtain insight from the analysis and visualize it
From all the above steps, it comes down to the end result, whatever it might be. It is crucial that the answers produced by the analysis are provided in a tabular and graphical format, giving actionable insights to the end user of the resultant information. To ensure that the information can be used and accessed by the intended parties, it should be rendered in a way that it can be reviewed through a handheld device or web-based tool, so that the recipient can take the recommended actions on a real-time or near-real-time basis.
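As an editor-added illustration of the statistical modeling step (not taken from the cited sources), the sketch below trains a Naïve Bayes sentiment classifier on a handful of invented customer comments and then segments the same comments with K-means; in practice the documents, labels and parameters would come from the organization's own data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans

# Tiny invented training set: customer comments with sentiment labels.
comments = [
    "excellent product, fast delivery",
    "terrible support, very slow response",
    "love the new features",
    "broken on arrival, asking for refund",
]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(comments)

# Supervised step: Naive Bayes classifier for customer sentiment.
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["slow delivery and poor support"])))

# Unsupervised step: K-means clustering to segment the same comments.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # cluster id per comment

Either output can then feed the visualization step: predicted labels and cluster memberships are exactly the kind of tabular result that a dashboard or report would present.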
IV. ISSUES AND CHALLENGES
Big data analysis is the process of applying advanced analytics and visualization techniques to large data sets to uncover hidden patterns and unknown correlations for effective decision making. The analysis of big data involves multiple distinct phases, which include data acquisition and recording, information extraction and cleaning, data integration, aggregation and representation, query processing, data modeling and analysis, and interpretation. Each of these phases introduces challenges. Heterogeneity, scale, timeliness, complexity and privacy are among the main challenges of big data mining.

a) Heterogeneity and Incompleteness
The difficulties of big data analysis derive from its large scale as well as from the presence of mixed data based on different patterns or rules (heterogeneous mixture data) in the collected and stored data. In the case of complicated heterogeneous mixture data, the data has several patterns and rules, and the properties of the patterns vary greatly. Data can be both structured and unstructured. About 80% of the data generated by organizations is unstructured. It is highly dynamic and does not have a particular format. It may exist in the form of email attachments, images, PDF documents, medical records, X-rays, voice mails, graphics, video, audio, etc., and it cannot be stored in row/column format as structured data. Transforming such data to a structured format for later analysis is a major challenge in big data mining, so new technologies have to be adopted for dealing with it.

Incomplete data creates uncertainties during data analysis and must be managed; doing this correctly is also a challenge. Incomplete data refers to missing data field values for some samples. The missing values can have different causes, such as the malfunction of a sensor node, or systematic policies that intentionally skip some values. While most modern data mining algorithms have built-in ways of handling missing values (such as ignoring data fields with missing values), data imputation is an established research field which seeks to impute missing values in order to produce improved models (compared to the ones built from the original data). A small imputation sketch is given below.
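As a simple, editor-added illustration of data imputation (not part of the original article), the sketch below fills in missing sensor readings with scikit-learn's SimpleImputer; the toy readings are invented.

import numpy as np
from sklearn.impute import SimpleImputer

# Toy sensor readings; np.nan marks values lost to a malfunctioning node.
readings = np.array([
    [21.5, 48.0],
    [np.nan, 51.0],
    [22.1, np.nan],
    [20.9, 47.5],
])

# Replace each missing value with its column mean (one of several possible strategies).
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(readings)
print(filled)

More elaborate imputation methods model each missing value from the other fields of the sample, which is exactly the research direction the paragraph above refers to.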
b) Scale and Complexity
Managing large and rapidly increasing volumes of data is a challenging issue. Traditional software tools are not enough for managing these increasing volumes of data. Data analysis, organization, retrieval and modeling are also challenges, due to the scalability and complexity of the data that needs to be analyzed.

c) Timeliness
As the size of the data sets to be processed increases, it takes more time to analyze them. In some situations the results of the analysis are required immediately. For example, if a fraudulent credit card transaction is suspected, it should ideally be flagged before the transaction is completed, preventing the transaction from taking place at all. Obviously a full analysis of a user's purchase history is not likely to be feasible in real time, so we need to develop partial results in advance, so that a small amount of incremental computation with new data can be used to arrive at a quick determination. A minimal sketch of this idea is given below.
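To illustrate precomputed partial results plus a small incremental update (an editor-added sketch, not from the original article), the code below keeps a running count and mean of a cardholder's transaction amounts, so that a new transaction can be scored against the history in constant time; the threshold, card identifier and amounts are invented.

from collections import defaultdict

# Precomputed partial result per card: (transaction count, running mean amount).
profile = defaultdict(lambda: (0, 0.0))

def record_transaction(card_id, amount):
    """Incrementally update the running mean instead of re-reading the full history."""
    count, mean = profile[card_id]
    count += 1
    mean += (amount - mean) / count
    profile[card_id] = (count, mean)

def looks_fraudulent(card_id, amount, factor=5.0):
    """Flag a transaction that is far above this card's typical spend."""
    count, mean = profile[card_id]
    return count >= 10 and amount > factor * mean

for amt in [12.0, 30.5, 25.0, 18.0, 22.0, 15.0, 27.5, 19.0, 24.0, 21.0]:
    record_transaction("card-42", amt)

print(looks_fraudulent("card-42", 450.0))   # True: far above the running mean
print(looks_fraudulent("card-42", 23.0))    # False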
V. CONCLUSION AND FUTURE WORK
New information forms such as social media and machine logs have made themselves crucial to organizations for their ability to provide unique content and diagnostic intelligence once they are properly analyzed. Traditional or conventional data scientists will have to acquire new skill sets to analyze unstructured data. While enterprises develop content intelligence capabilities, the real power lies in fusing different data formats and overlaying structured data with semi-structured and unstructured data sources, for insights into the mind of a user or the life of a device.

REFERENCES
[1] Jaseena K. U. and Julie M. David, "Issues, Challenges, and Solutions: Big Data Mining."
[2] http://www.01.ibm.com/software/in/data/bigdata/
[3] http://www.datamation.com/applications/big-data-9-steps-to-extract-insight-from-unstructured-data.html
[4] http://www.datamation.com/applications/big-data-9-steps-to-extract-insight-from-unstructured-data.html
[5] http://www.cioupdate.com/technology-trends/how-to-extract-information-from-the-sea-of-big-data-part-i.html
[6] http://www.dummies.com/how-to/content/analysis-and-extraction-techniques-for-big-data.html
M.Tech Computational Linguistics 2013-2015 Batch
Alen Jacob
Amal Babu
Anagha M
Anjaly V
Devisree V
Kavitha Raju
Manu V Nair
Nisha M
Rajitha K
M.Tech Computational Linguistics 2013-2015 Batch
Raveena R Kumar
Reji Rahmath K
Rekha Raj C T
Sarath K S
Sreerekha T V
Sreetha K
Sruthi Sankar K P
Vidya P V
M.Tech Computational Linguistics
Department of Computer Science and Engineering
2013-2015 Batch
Details of Master Research Projects
Title: Large Scale Social Network Analysis for Hypergraphs using Apache Spark
Name of Student: Alen Jacob
Abstract: Today, most data and information are presented in the form of graphs. The graph structure of data makes it readable and easy to perform analysis measures on. This project aims to develop a large scale social network analysis system capable of extracting hypergraphs from input files and converting them to a sparse multigraph format. This multigraph file is given as input to distributed graph processing frameworks such as Apache Spark or Apache Giraph to perform the analysis on the graph data. The graph data extracted from the sequential data is represented in GraphML format. GraphML is a comprehensive and easy-to-use file format used for the modelling, analysis and visualization of data represented as a graph or network. JUNG is a Java library used to extract the hypergraph from the GraphML file and to convert the hypergraph to multigraph format. Two types of multigraph files are generated by the system: one, a JSON-format .txt file for Apache Giraph, and two, an adjacency list format for Apache Spark GraphX.
Title: A Domain Independent Public Opinion Mining System Using Multiple Statistical Approaches
Name of Student: Anagha M
Abstract: Sentiment analysis is an emerging area of research which aims at extracting the subjective information in source materials by applying Natural Language Processing, Computational Linguistics and Text Analytics, and classifying the polarity of the opinion stated. In this paper, a cross-domain opinion mining system is introduced, which collects a keyword, belonging to one of the predefined classes, as user input and returns the overall public opinion about that keyword as output. The predefined classes are: Person, Organization, Product, and Movie. The public opinion about the user input is determined by collecting reviews/news about the particular input available online and extracting the positive and negative sentiments contained in them. A hybrid approach for Sentiment Analysis is proposed in this work, in which multiple machine learning methods are used for opinion mining and certain rules are also incorporated to handle special cases such as negation, intensifiers, dilators, etc. The system performed well, giving a considerable precision rate.
Title: Table Retrieval from Scanned Documents using Deep Belief Network
Name of Student: Amal Babu
Abstract: The proposed table retrieval system takes as input scanned copies of documents and provides the user with functionalities that enable searching for and retrieving the tables present in the document. This is accomplished by optical character recognition and feature extraction from the scanned documents, which involves the problem of detecting and recognizing document constructs such as tables, words, lines and paragraphs. The system makes use of character segmentation results and various layout features such as spacing, alignment and font size to detect and recognize document features. A five-layered Deep Belief Network (DBN) trained for 64 character classes serves the purpose of character recognition. Once the pages in the document that contain potential candidates for tables are recognized, the system creates a bag-of-words vector representation for each such table. A user table query is then matched with these table vectors using Google's word2vec. Table detection and recognition in large scanned documents has long been one of the major day-to-day issues faced by accounting and auditing firms. Such firms invest a lot of man-hours in finding and converting tables that contain relevant auditing information into computer-processable data. Using the proposed system, such firms can automate the process of data collection, saving a lot of man-hours.
Title: A Hybrid Approach for Temporal Information Extraction from Web Documents in Malayalam
Name of Student: Anjaly V
Abstract: Time expressions are fundamental entities in a temporally aware system. The importance of representing time and reasoning about time indicates that time has a significant role in information processing. Temporal information processing is in increasing demand in many Natural Language Processing (NLP) applications, such as question answering, text summarization, machine translation, and information retrieval. Many of them demand a proper ordering of events, which tends to be a difficult task. Temporal information extraction aims to extract time expressions and temporal relations from natural language text. The goal of the proposed work is to extract temporal information from Malayalam web documents. Malayalam is one of the 22 scheduled languages of India, and its complex linguistic structure makes its processing a challenging task. The system generates a temporally annotated version and a brief summary of the source document. The main tasks involved are preprocessing of the input document, temporal expression recognition, normalization of the identified expressions, resolving event-anchored temporal expressions through online search, and generation of a brief summary of the document. A supervised machine learning approach based on Conditional Random Fields (CRF) is employed for temporal expression recognition, which forms the major part of the proposed work. An extractive summarization method is adopted for generating the summary, where important sentences are selected from the original document by considering their statistical and linguistic features.
Title: A Hybrid Approach to Relationship Extraction from Stories
Name of Student: Devisree V
Abstract: A story may be analyzed to identify the main characters and to extract the relationships between them. Relation extraction problems are generally solved either through supervised or unsupervised learning algorithms. In the former, there should be a text corpus for which the entities and their relation types are already known. Such algorithms typically learn to classify new entity pairs into any of the relation types already seen, based on recurring patterns. On the other hand, the unsupervised learning approach is used when there is no such marked-up corpus. Such algorithms typically identify patterns relevant to the relation extraction task occurring within the corpus and then use these patterns to group entities such that the entities within a group share similar relationships. The proposed method is a hybrid approach which combines the features of unsupervised and supervised learning methods. It also uses some rules to extract relationships. The method identifies the main characters and collects the sentences related to them. These sentences are then analyzed and classified to extract relationships. The main applications are story summarization and analysis of the major characters in stories.
Title: Information Extraction from Online News Articles
Name of Student: Kavitha Raju
Abstract: The amount of information available to us on the web is vast. It would be valuable if a computer could automatically read through these numerous pages, extract the relevant information of interest to the problem at hand, and present only those facts, so that we could go through them, analyze them and make decisions faster and more efficiently. The main barrier to this is the inability of computers to read and understand natural language text. The way information is represented in natural language text is highly inconsistent, which makes it extremely complicated, and practically impossible, to explicitly tell a computer all the ways in which a particular piece of information can be expressed in natural language, and thereby enable it to read and understand it. The system presented here tries to bridge this gap and automatically extract information from English text in web pages and represent it in a structured format so that it can be analyzed efficiently. The domain chosen for the experiment is online news articles, from which the details of various company mergers and acquisitions that have taken place are extracted. Further, a simple querying system is built on the extracted data, stored in a machine-readable structured format, in order to demonstrate its efficient analysability. In realizing the system, natural language processing tasks such as tokenisation, POS tagging, lemmatisation, named entity recognition and semantic role labeling are performed. For these, both statistical and rule-based techniques are employed wherever appropriate. The system exhibits satisfactory performance, which confirms the validity of the extraction techniques employed.
Title: Deep Belief Network based Parsing and Word Meaning Visualization in Scanned Documents
Name of Student: Manu V Nair
Abstract: The proposed system attempts to produce parsed, editable text from a scanned printed text document and to provide the user with an option for checking the meaning of each content word by displaying it using a web application. The system mainly uses techniques such as Optical Character Recognition and Deep Belief Networks (a type of deep learning algorithm) to perform the task. Optical Character Recognition (OCR) is one of the challenging tasks in the field of Artificial Intelligence (AI): recognition is a trivial task for humans, but not for computers. Deep learning is one of the latest machine learning techniques and can be used for speech recognition, character recognition, sentiment analysis, etc. A DBN is a generative deep architecture which uses Restricted Boltzmann Machines as the building blocks of each layer. The proposed system combines OCR and DBN to convert a scanned document into an editable text format, and then into a parsed file which is also capable of visualizing the meaning of each content word. The system parses a scanned document for its structure, such as headings, paragraphs, tables, sections, words, etc. Characters are recognized using the DBN and the meanings are visualized using WordNet. The whole document is converted into HTML format. The experiment was carried out on 150 scanned documents, in which the document feature parsing achieved an accuracy of 90%. The error rate of character recognition using the deep belief network is 2.5%.
Title: Malayalam Morphological Analyzer Using MBLP Approach
Name of Student: Nisha M
Abstract: Morphological analysis is a process that identifies the grammatical features of a word based on its suffix. It is a basic tool for building any Natural Language Processing application and is used in many practical applications such as machine translation, text mining, spell checkers, etc. One of the near-term goals is to integrate the morphology learning algorithms into a language-independent Text-to-Speech (TTS) system for improving grapheme-to-phoneme rules, stress prediction and tone assignment. Malayalam is morphologically rich and highly agglutinative in nature, and hence machine analysis of the morphological features of a Malayalam word is a complex task. The aim of this project is to develop a morphological analyzer for Malayalam using Memory Based Language Processing (MBLP). In this approach, morphological analysis of Malayalam words is treated as a classification problem. MBLP is an approach to language processing based on exemplar storage during learning and analogical reasoning during processing. Training instances are created from words and are manually annotated for their segmentation. The system can be trained using TiMBL (Tilburg Memory Based Learner).
Title: Event based Information Organisation for Topic Detection and Tracking in Malayalam
Name of Student: Rajitha K
Abstract: On the Internet, news articles and related news information spread rapidly from a variety of sources in many languages. The objective of Topic Detection and Tracking (TDT) in Malayalam is to develop technologies to search, organize and structure Malayalam textual news stories from a variety of news websites. In TDT, a topic is defined to be a set of news stories that are strongly related to some seminal real-world event. News stories from various online newspapers are considered as the corpus for the TDT. The project mainly focuses on four tasks, namely Topic Detection, Topic Tracking, First Story Detection and Link Detection. Each task is viewed as a component technology whose solution will help to address the problem of event-based news organization. Topic detection involves detecting the occurrence of a new event. Topic tracking is the process of monitoring a stream of news stories to find those that discuss or track the same event as one specified by a user. For example, a user might provide the system with one or more sample stories about a specific event, such as an earthquake, and request notification when news stories about the same event are broadcast. First story detection is detecting the occurrence of a new event in a stream of news stories. Link detection is the task of detecting whether two stories given by a user discuss the same event. The underlying task is accurate similarity assessment. The proposed detection and tracking system involves topic identification, clustering of news stories, cluster ranking for tracking events and similarity checking. The proposed approach is capable of handling event-based news organization for topic detection and tracking in Malayalam.
Title: A Statistical Approach to Anaphora Resolution in Malayalam Using Conditional Random Field
Name of Student: Raveena R Kumar
Abstract: Anaphora Resolution (AR) is the process of resolving what a pronoun or a noun phrase refers to. This thesis aims at resolving the pronouns in the Malayalam language. The data required for both training and testing are collected from Malayalam story blogs. The machine learning methods used for AR are the Trigram n Tagger (TnT) and Conditional Random Fields (CRF). TnT is used for Parts Of Speech (POS) tagging and CRF is used to identify the class of nouns and pronouns. To resolve an anaphora, the features of the pronoun are matched with the features of the nouns, and only a matching noun is considered as the antecedent. Along with the feature matching, some rules are used to identify the correct antecedent. The performance of the system is measured using precision and recall. Anaphora resolution has great importance in fields like Information Extraction, Question Answering, Summarization and Dialogue interpretation systems.

Title: Offline Handwritten Character Recognition for Malayalam Using Multiclass SVM with Hybrid Features
Name of Student: Sreerekha T V
Abstract: Handwritten character recognition is the process of converting handwritten text into a form that is machine readable and editable. With improvements in the area of localization, handwritten character recognition has been receiving more attention in India. The task of recognition becomes more complicated for a language like Malayalam due to its large character set and the presence of compound characters and vowel modifiers. Besides these, the various character scripts available and, more importantly, the differences in the ways in which the characters are written have to be considered. This paper describes a system which can recognize handwritten Malayalam characters. The method involves segmenting the text document into individual characters and classifying them. The features used for uniquely identifying each character are Hu moments and texture features. Support Vector Machines (SVMs) have successfully been used for the character recognition.
Title: Morphological Generator for Malayalam using MBLP approach
Name of Student: Reji Rahmath K
Abstract: Words are the important building blocks of every language. Morphological generation and analysis are necessary for developing computational grammars as well as machine translation systems. The proposed system is a morphological generator for Malayalam using the Memory Based Language Processing (MBLP) approach. The basic principle of morphological generation is to get the inflected form of a word, given its root word and a set of properties such as lexical category and morphological properties. MBLP is an approach to language processing based on exemplar storage during learning and analogical reasoning during processing. A training corpus containing root words and basic features is an important part of the system. The feature set is a set of selected morphological properties. For nouns, the feature set contains number, case and the last syllable of the input word. For verbs, it contains a tense, mode or aspect label and one or more syllables from the end of the input word. For training the system, the Tilburg Memory Based Learner (TiMBL) is used. As there is no standard corpus available for Malayalam, a major task was to create a corpus containing nouns, verbs and their features for training and testing. Evaluation metrics such as recall, precision and F-measure show that the system performs well, giving a satisfactory accuracy of 89.65%. A full-fledged morphological generator can be developed by extending the system to handle other word categories like adjectives, adverbs, etc.
Title: Dependency Parsing and Semantic Role Labeling of Malayalam Sentences using MBLP Approach
Name of Student: Rekha Raj C T
Abstract: Dependency parsing is an approach to the automatic syntactic analysis of natural language, inspired by the theoretical linguistic tradition of dependency grammar. Given an input sentence, the task of dependency parsing is to identify the syntactic head of each word in the sentence and classify the relation between the head and its dependent. The task of semantic role labeling is to detect the semantic arguments associated with the predicate or verb of a sentence and to classify them into their specific roles. This thesis addresses the problem of dependency parsing and semantic role labeling of Malayalam sentences using the Memory-Based Language Processing (MBLP) approach, in which dependency parsing is treated as a classification problem. The system has been developed using the Memory-Based Tagger (MBT). The theoretical model adopted for the sentence analysis is the Paninian grammatical model, which provides a level of syntactico-semantic analysis. The tokens in a sentence are annotated with dependency relations. From the dependency relations, the karaka relations are extracted to identify the semantic roles of the predicates in the sentence. The arguments of predicates are found by identifying the clause boundaries. Given a sentence, the system will generate a dependency tree depicting the dependency relations. The system demonstrated good results, with an accuracy of 96.68% for clause boundary identification and 79.70% for dependency parsing, with considerable savings in training time when compared with CRF.
Title: Patent Document Similarity Measure: A Machine Learning Approach to Plagiarism Detection
Name of Student: Sarath K S
Abstract: Document similarity measurement is a key task in Natural Language Processing (NLP) that has many practical applications in the real world. One such use case is plagiarism detection in patent documents. Plagiarism is the copying of someone else's works or ideas and passing them off as one's own. There are methods which measure similarity in terms of the common words between documents and their frequencies. A lexicon-based approach can be incorporated to get word-to-word similarity, but it is limited by its coverage. The proposed system focuses only on semantic similarity in the text descriptions within patents, using the machine learning library Gensim, which includes an open-source implementation of Google's word2vec in Python. Patents are represented by high-dimensional word vectors, where the words are selected using CRP (Chinese Restaurant Process) based clustering. The similarity between patent documents is calculated in terms of the cosine similarity metric. The proposed system was developed and experimented on 9941 patents from a specialized medical domain, dentistry.
Title: Optimal Web Page Ranking Using Multilingual Information Search Algorithm
Name of Student: Vidya P V
Abstract: The multilingual search algorithm proposed in this thesis aims at refining the task of finding documents of an unstructured nature which satisfy an information need from within a large collection. This is done by including more relevant results, by enabling the information retrieval system to retrieve documents in a language different from that of the query. It also proposes an optimal method to rank the resultant web pages based on their information content, irrespective of the language. It currently works for three languages, namely English, Malayalam and Hindi. The preprocessing tasks include language detection, stopword elimination, stemming, translation and searching. Translation is performed using the Google Translate API. The system works for both single-word and phrase-based queries. It then searches the web for the requested information and collects the resultant URLs. Full-text indexing of the web pages corresponding to these URLs is performed. The pages are then ranked according to relevance (based on the frequency of keywords) and the search result is passed on to the user. Multilingual information retrieval is important for countries like India, where the majority of the people are not conversant with English and thus do not have access to the vast store of information on the web.
Title: News Sentiment Analysis and Trend Forecasting Using Multiple Statistical Approaches
Name of Student: Sreetha K
Abstract: Sentiment analysis focuses on identifying whether a given piece of text is subjective or objective and, if it is subjective, whether it is negative or positive. This is useful for business organizations to determine whether they are viewed positively or negatively by the public. Nowadays, a large amount of news is disseminated through various websites. These articles can be used to extract the general sentiment about a particular entity, such as a company. The extracted sentiments can be used for forecasting the trend of the company and analyzing demand. The work presented in this thesis focuses on business news articles. A large number of companies use news analysis to help them make better business decisions. The proposed system crawls various online news websites, checks whether a news item is of a corporate nature, and then identifies the sentiment in it by incorporating different machine learning approaches. The system also helps in trend forecasting by analyzing the sentiment about the company and the frequency at which the news appears. Test results show that incorporating multiple machine learning methods for extracting the sentiment has helped to improve the prediction accuracy of the system.
Title: Unsupervised Approach to Word Sense Disambiguation in Malayalam
Name of Student: Sruthi Sankar K P
Abstract: Word Sense Disambiguation (WSD) is the task of identifying the correct sense of a word in a specific context when the word has multiple meanings. WSD is very important as an intermediate step in many Natural Language Processing (NLP) tasks, especially in Information Extraction (IE), Machine Translation (MT) and Question Answering systems. Word sense ambiguity arises when a particular word has more than one possible sense, and the peculiarity of any language is that it includes a lot of ambiguous words. Since the sense of a word depends on its context of use, the disambiguation process requires an understanding of word knowledge. Automatic WSD systems are available for languages such as English and Chinese, but Indian languages are morphologically rich and thus the processing task is very complex. The aim of this work is to develop a WSD system for Malayalam, a language spoken in India, predominantly in the state of Kerala. The proposed system uses a corpus collected from various Malayalam web documents. For each possible sense of the ambiguous word, a relatively small set of training examples (seed sets) representing that sense is identified. Collocations and the most frequently co-occurring words are considered as training examples. A seed set expansion module extends the seed set by adding the words most similar to the seed set elements. These extended sets act as sense clusters. The sense cluster most similar to the input text context is taken as the sense of the target word. The WSD system performs very well and gives an accuracy of 72%.
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR Dec 2015
We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of the CLEAR (Computational Linguistics in Engineering And Research) Journal, to be published in December 2015. The suggested areas of discussion are:
The articles may be sent to the Editor on or before 10th Dec, 2015 by email to simplequest.in@gmail.com. For more details visit: www.simplegroups.in
Editor, CLEAR Journal
Representative, SIMPLE Groups
Hello World,
This is a painful moment for all of us, with the demise of our former President Dr. A. P. J. Abdul Kalam. As a part of the editorial team, I express my deepest condolences. With this release of the CLEAR Journal, we bring out an edition on new directions in linguistics. This will provide a forum for students to enhance their background and get exposed to challenging research areas in this field. The exponential growth of data, coupled with the increase in computing power, has led to the increasing popularity of deep learning. This edition also includes some of the research work conducted in our college. I would like to wholeheartedly thank the contributing authors for their considerable endeavour to bring their perceptions and perspectives on the latest technologies in the computational linguistics field. The pace of innovation continues. Innovation is taking two things that already exist and putting them together in a new way. SIMPLE Groups welcomes more aspirants in this area. Wish you all the best!!!
Revathy P
revathyravindranp@gmail.com