Clear Journal December 2016 Edition


CLEAR DECEMBER 2016



CLEAR Journal (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
simplequest.in@gmail.com

Chief Editor
Dr. Ajeesh Ramanujan, Assistant Professor, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633

Editors
Ayishathahira C H
Manjusha P D
Rahul M
Sreelakshmi K

Cover page and Layout
Rahul M

Editorial
News & Updates
CLEAR March 2017 Invitation
Last word

An Efficient Framework for Semantic Similar Short Texts Retrieval
Ayishathahira C H, Manjusha P D, Nasreen K V

Student Course Feedback Summarization Using ILP Framework
Fathima Riswana K, Fathima Shabana K, Fathima Shirin A, Shabna Naser, Sruthy K G

Approaches for the Detection of Opinion/Review Spams
Sandhini S, Sreelakshmi K, Varsha E

Entity Focused Summary Generation
Rahul M



Dear Readers!

Greetings! This edition of CLEAR Journal contains papers and articles on trending topics in the field of Information Retrieval, such as Semantically Similar Short Text Retrieval, Entity-Centric Summarization, Feedback Summarization Using the ILP Framework and Detection of Review Spams. In our last edition, we focused mainly on research in Natural Language Processing, Deep Learning and Big Data Analytics, Semantic Web and Block Chain Technology, while this edition consists of papers and articles related to the hot areas of Information Retrieval. Our readers include a group of people who have shown a keen interest in natural language engineering. They have continuously encouraged and criticized all our efforts, and this has served as a catalyst for the entire CLEAR team. With this hopeful prospect, I proudly present this edition of CLEAR to the readers and look forward to your opinions and criticism.

Best Regards,
Dr. Ajeesh Ramanujan (Chief Editor)



Workshop on Algorithm Design Techniques and Complexity Theory

The five-day workshop on Algorithm Design Techniques and Complexity Theory was held at GEC Sreekrishnapuram from 26th to 30th September 2016. Eminent personalities from NIT Calicut, IIT Palakkad and GEC Idukki gave sessions on data structures, algorithms and complexity. The workshop was inaugurated by Dr. P. C. Reghu Raj, Principal of GEC Sreekrishnapuram. The sessions were conducted by Dr. Muralikrishnan K (NIT Calicut), Dr. Jasine Babu (IIT Palakkad), Dr. Sudeep S (NIT Calicut) and Mr. Anilkumar S (GEC Idukki).

Workshop on Graph Algorithms and Complexity

The four-day workshop on Graph Algorithms and Complexity Theory was held at GEC Sreekrishnapuram from 17th to 20th October 2016. The workshop covered graph algorithms and analysis, data structures, and applications of graph theory. On the first and second days, D. Venkatesh Raman from IMSC Chennai discussed graph algorithms and analysis. The third day's session on data structures was handled by Mr. Anilkumar S from GEC Idukki. The fourth day's session, by Dr. Narayanan N from IIT Madras, focused mainly on applications of graph theory.



Workshop on Computational Linguistics: An Industrial Perspective

A one-day workshop was held at GEC Sreekrishnapuram on October 1st 2016. The sessions were handled by Mr. Manu Madhavan, Mr. Sreejith, Mr. Gopalakrishnan and Mr. Robert Jesuraj, alumni of GEC Sreekrishnapuram. In the forenoon, Mr. Sreejith discussed data science. This interesting session was followed by one from Mr. Gopalakrishnan. The next session, on Python, was conducted by Mr. Manu Madhavan and Mr. Robert Jesuraj and was followed by hands-on training in Python. In the afternoon, Mr. Sreejith gave a session on big data analytics.



An Efficient Framework for Semantic Similar Short Texts Retrieval

Ayishathahira C H (ayishathahira007@gmail.com), Manjusha P D (manjushapda@gmail.com), Nasreen K V (nasrinnazar094@gmail.com)
M.Tech Computational Linguistics, Government Engineering College, Sreekrishnapuram

Short text semantic similarity is a metric defined over a set of terms, where the distance between them is based on the likeness of their meaning or semantic content, as opposed to similarity estimated from their syntactic representation. Short texts appear in the form of search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Unlike documents, short texts have some unique characteristics that make them difficult to handle. First, short texts do not always observe the syntax of a written language. Second, they contain limited context. Fast retrieval of short texts is important to many applications such as web search, ad matching and question-answer systems. The basic approaches return the top-k short texts by sorting them with regard to the similarity score. After surveying these approaches, we find that almost all of them concentrate on the precision of the retrieved texts (the effectiveness issue) and use small data collections. After reading this article, you should have a clear idea of this efficient approach for fast retrieval of semantically similar short texts.

In the FAST approach (efficient FrAmework for semantic similar Short Texts retrieval), we address the efficiency issue while keeping high precision. The approach is evaluated on a large data collection in order to overcome the problems of the basic approaches. Unlike long texts, short texts do not always observe the syntax of a written language and usually do not possess sufficient information to support statistical text processing techniques. The FAST approach focuses on the top-k issue because users commonly do not care about the individual similarity scores but only about the sorted results. It tackles the efficiency problem of retrieving the top-k semantically similar short texts by minimizing the number of candidates to be evaluated in the framework. The approach starts with a pre-processing procedure, which includes creating appropriate indices. We consider two representative word similarity measurement strategies, which obtain the best performance compared with human judges: the knowledge-based strategy and the corpus-based strategy. In the knowledge-based strategy, the semantic similarity of two words is determined by measuring the shortest path between them in a predefined taxonomy, whereas in the corpus-based strategy only statistical information can be used to determine the similarity.
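As a rough illustration of the knowledge-based strategy, the sketch below measures the shortest path between two words in a taxonomy using a breadth-first search. The taxonomy is a hypothetical toy example invented for this illustration (it is not WordNet), and the function names are assumptions rather than part of the FAST framework.

```python
from collections import deque

# Hypothetical toy taxonomy (child -> parent links), invented for illustration only.
TOY_TAXONOMY = {
    "lunch": "meal",
    "dinner": "meal",
    "meal": "food",
    "sushi": "dish",
    "dish": "food",
    "food": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}


def build_graph(child_to_parent):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in child_to_parent.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


graph = build_graph(TOY_TAXONOMY)
# A shorter path means the two words are treated as semantically closer.
for word in ("dinner", "sushi", "car"):
    print("lunch", word, shortest_path_length(graph, "lunch", word))
```

In this toy taxonomy, "dinner" is closer to "lunch" than "sushi" is, and "car" is the farthest of the three.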

Figure 1 shows an example to illustrate the FAST approach. Let the query be “Delicious lunch in Japan” and k be 1. To retrieve the top-1 short text from the whole data collection, create one ranked list for knowledge-based similarity and one for corpus-based similarity. Then apply the threshold algorithm to efficiently retrieve the top-k semantically similar short texts, using either an equal-weight or a tuned-weight strategy. However, we cannot know such rankings directly, because these two lists are at the text layer and each list has a word layer as a sub-layer. For that, we apply two kinds of similarity metrics. Therefore, there are two assembling tasks: assembling knowledge-based and corpus-based similarities, and assembling words into texts.

The rank list for the knowledge-based strategy is created using WordNet by the following lemma. Let $q$ be the query. Let $P$ and $S$ be two candidates that exist in the same taxonomy as $q$, i.e., $T_P$ and $T_S$. The shortest path between $q$ and $P$ (or $S$) is $L_P$ (or $L_S$). The maximum depth of $T_P$ is $D_P$ (or $D_S$ for $T_S$). If $P$ is more similar to $q$ than $S$, then $D_P / L_P > D_S / L_S$.

The rank list for the corpus-based strategy is created from the Wiki encyclopedia by mapping Wiki texts to appropriate topics. A short text is represented as a vector over topics. The topics can be generated either by ESA (Explicit Semantic Analysis) or by LDA (Latent Dirichlet Allocation). First, during pre-processing, calculate the similarity scores between each word in Wiki and the topics in the data collection to obtain a set of lists. Then build a weighted inverted list, which represents a word together with its corresponding short texts sorted by similarity score.

According to the evaluation, four different settings have been proposed to improve the effectiveness: (1) FASTE applies the ESA topic strategy; (2) FASTL employs the LDA topic strategy in corpus-based similarity with equal weight; and (3) FASTEw and (4) FASTLw are based on the former two, respectively, with tuned combination weights. The above-mentioned evaluations demonstrate the efficiency of the proposed techniques while keeping high precision. Incorporating new methods to tackle the efficiency issue, and more effective semantic similarity strategies to obtain higher performance, is left as future work.
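The threshold algorithm step described above can be sketched as follows. This is a minimal, generic version of the threshold algorithm for merging ranked lists, not the authors' implementation. The short-text IDs, the scores and the name `threshold_top_k` are made up for illustration, and every text is assumed to appear in both lists.

```python
import heapq


def threshold_top_k(ranked_lists, k, weights=None):
    """Return the top-k (item, combined score) pairs.

    ranked_lists: lists of (item, score) pairs, each sorted by score, descending.
    Assumption: every item appears in every list and all lists have equal length.
    """
    m = len(ranked_lists)
    weights = weights or [1.0 / m] * m          # equal weights by default
    lookup = [dict(lst) for lst in ranked_lists]  # random access to scores
    combined = {}
    for row in range(len(ranked_lists[0])):
        threshold = 0.0
        for w, lst in zip(weights, ranked_lists):
            item, score = lst[row]                # sorted access
            threshold += w * score
            if item not in combined:
                combined[item] = sum(
                    wj * table[item] for wj, table in zip(weights, lookup)
                )
        top = heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])
        # No unseen item can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])


# Made-up similarity scores of four short texts against one query.
knowledge = [("t1", 0.9), ("t2", 0.8), ("t3", 0.3), ("t4", 0.1)]
corpus = [("t2", 0.9), ("t1", 0.6), ("t4", 0.5), ("t3", 0.2)]
print(threshold_top_k([knowledge, corpus], k=1))
```

With these scores the loop stops after reading two rows of each list, which is the source of the efficiency gain: the remaining candidates are never scored.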



Student Course Feedback Summarization Using ILP Framework

Fathima Riswana K (fathimariswana024@gmail.com), Fathima Shabana K (fathimamkd1993@gmail.com), Fathima Shirin A (fathimashirin94@gmail.com), Shabana Nasser (shabnanasser@gmail.com), Sruthy K G (sruthydas.abhi@gmail.com)
M Tech Computational Linguistics, Government Engineering College, Sreekrishnapuram

ABSTRACT: Instructors collect feedback from students. Student course feedback is generated daily in both classrooms and online course discussion forums. Traditionally, instructors analyse these responses manually, which is costly. In this work, we propose a new approach for summarizing student course feedback based on the integer linear programming (ILP) framework. The proposed approach allows different student responses to share co-occurrence statistics and alleviates sparsity issues. Experimental results on a student feedback corpus show that the proposed approach outperforms a range of baselines in terms of both ROUGE scores and human evaluation.

I. INTRODUCTION

Student course feedback is generated daily in both classrooms and online course discussion forums. Rich information from student responses can reveal complex teaching problems, help teachers adjust their teaching strategies, and create more effective teaching and learning experiences. Traditionally, instructors analyze these responses manually, which is costly. In this work, we propose a new approach to summarizing student course feedback based on the integer linear programming (ILP) framework. Automatic summarization systems are typically extractive or abstractive. Since abstraction is quite hard, the most successful systems tested at the Text Analysis Conference (TAC) and Document Understanding Conference (DUC) have been extractive.

In this work, student responses are collected from an introductory materials science and engineering course, taught in a classroom setting. Students are presented with prompts after each lecture and asked to provide feedback. These prompts solicit “reflective feedback” from the students. An example is presented in Table 1. Summarization is one of the greatest challenges in the field of NLP, so it is taken as the core idea of this paper. The aim is to have the system summarize the student responses automatically. This is formulated as an extractive summarization task, where a set of representative sentences is extracted from student responses to form a textual summary. Another challenge in summarizing student feedback is its lexical variety. To tackle this challenge, we propose a new approach for summarizing student feedback, which extends the standard ILP framework by approximating the co-occurrence matrix using a low-rank alternative. The resulting system allows sentences authored by different students to share co-occurrence statistics.

Table 1: Example student responses and a reference summary created by the teaching assistant. ‘S1’–‘S8’ are student IDs.

II. RELATED WORK

Previous work proposes to summarize student responses by extracting phrases rather than sentences, in order to meet the need of aggregating and displaying student responses in a mobile application. It adopts a clustering paradigm to address the lexical variety issue. In this work, we implement a technique that leverages matrix imputation to solve this problem and summarizes student responses at the sentence level. CourseMIRROR [2] (Mobile In-situ Reflections and Review with Optimized Rubrics) is an application developed to enhance instructor-student and student-student interactions with mobile interfaces and summarization. It collects and shares learners' in-situ reflections in large classrooms. The set of phrases in a summary and the associated student coverage estimates are presented to both the instructors and the students to help them understand the difficulties and misunderstandings encountered in lectures [7]. The Integer Linear Programming [4] framework has demonstrated substantial success in summarizing news documents. Previous studies try to improve this line of work by generating better estimates of concept weights [5]. One study proposed a support vector regression model to estimate bigram frequency in the summary. Another explored a supervised approach to learn parameters using a cost-augmentative SVM. Different from the above approaches, we focus on the co-occurrence matrix instead of concept weights, which is another important component of the ILP framework.


Another line of work proposes a model in which individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using Integer Linear Programming. The ILP framework makes it possible to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that this model achieves state-of-the-art performance using ROUGE and significantly improves the informativeness of the summaries.

Most summarization work focuses on summarizing news documents, as driven by the DUC/TAC conferences [3]. Notable systems include maximal marginal relevance, submodular functions, joint extraction and compression of sentences, optimization of content selection and surface realization, minimization of reconstruction error, and dual decomposition. Despite the encouraging performance of the proposed approach on summarizing student responses, when it is applied to the DUC 2004 dataset and evaluated using ROUGE, only comparable or marginal improvement over the ILP baseline is observed. However, this is not surprising: the lexical variety of the DUC data is low (20 percent of bigrams appear more than twice, compared to 3 percent in student responses), so there is less data sparsity and the DUC data cannot benefit much from imputation.

III. METHODOLOGY

The method used for automatic summarization of student course feedback is an Integer Linear Programming formulation. The terminology used in the ILP formulation is as follows.

Let $D$ be the set of student responses, consisting of $M$ sentences in total. For $j \in \{1, \dots, M\}$, let

$$y_j = \begin{cases} 1 & \text{if sentence } j \text{ is selected for the summary} \\ 0 & \text{otherwise} \end{cases}$$

Similarly, let $N$ be the number of unique concepts in $D$. For $i \in \{1, \dots, N\}$, let $z_i \in \{0, 1\}$ indicate the appearance of concept $i$ in the summary, and let $w_i$ be the weight assigned to concept $i$. The weight is often measured by the number of sentences or documents that contain the concept. The ILP-based summarization approach searches for an optimal assignment to the sentence and concept variables so that the selected summary sentences maximize coverage of important concepts.

Let $A \in \mathbb{R}^{N \times M}$ be the co-occurrence matrix, which captures the relationship between concepts and sentences:

$$A_{ij} = \begin{cases} 1 & \text{if concept } i \text{ appears in sentence } j \\ 0 & \text{otherwise} \end{cases}$$

The objective is

$$\max_{y, z} \; \sum_{i=1}^{N} w_i z_i$$

Two sets of linear constraints are specified to ensure the ILP validity:

1. A concept is selected if and only if at least one sentence carrying it has been selected: $\sum_{j=1}^{M} A_{ij} y_j \ge z_i$.
2. All concepts in a sentence will be selected if that sentence is selected: $A_{ij} y_j \le z_i$.

Finally, the selected summary sentences are allowed to contain a total of $L$ words or less: $\sum_{j=1}^{M} l_j y_j \le L$, where $l_j$ denotes the number of words in sentence $j$.
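To make the formulation concrete, the sketch below solves a very small instance by exhaustive search. It is an illustration under stated assumptions, not the paper's solver: a real system would hand the same objective and constraints to an ILP solver, and the matrix, weights and sentence lengths are invented toy values.

```python
from itertools import product


def solve_tiny_ilp(A, weights, lengths, max_words):
    """Exhaustively solve the concept-coverage problem for a tiny instance.

    A[i][j] is 1 if concept i occurs in sentence j, else 0.
    Returns (best objective value, selected-sentence indicator tuple y).
    """
    n_concepts, n_sentences = len(A), len(A[0])
    best_value, best_y = -1, None
    for y in product((0, 1), repeat=n_sentences):
        # Length constraint: the selected sentences may use at most max_words.
        if sum(l * yj for l, yj in zip(lengths, y)) > max_words:
            continue
        # The two constraint sets force z_i = 1 exactly when some selected
        # sentence carries concept i.
        z = [
            int(any(A[i][j] and y[j] for j in range(n_sentences)))
            for i in range(n_concepts)
        ]
        value = sum(w * zi for w, zi in zip(weights, z))
        if value > best_value:
            best_value, best_y = value, y
    return best_value, best_y


# Toy data: 4 concepts (rows) x 3 sentences (columns).
A = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
weights = [sum(row) for row in A]  # sentence frequency: [2, 1, 2, 1]
lengths = [6, 5, 8]                # words per sentence
value, chosen = solve_tiny_ilp(A, weights, lengths, max_words=12)
print(value, chosen)
```

Here sentences 1 and 2 fit within the 12-word budget and together cover three of the four concepts. Exhaustive search is only practical for a handful of sentences, which is why the problem is normally given to an ILP solver.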

IV. PROPOSED APPROACH

Because of the lexical diversity in student responses, we suspect that the co-occurrence matrix A may not establish a faithful correspondence between sentences and concepts. A concept may be conveyed using multiple bigram expressions; however, the current co-occurrence matrix only captures a binary relationship between sentences and bigrams. For example, the matrix should give partial credit to “bicycle parts” when a similar expression, “bike elements”, appears in the sentence (both expressions come from student responses). Domain-specific synonyms may be captured as well. For example, the sentence “I tried to follow along but I couldn’t grasp the concepts” is expected to partially contain the concept “understand the”, although the latter does not appear in the sentence.


The existing matrix A is highly sparse: only 2.7% of the entries are non-zero in our dataset. We therefore propose to impute the co-occurrence matrix by filling in missing values. This is accomplished by approximating the original co-occurrence matrix using a low-rank matrix. The low-rankness encourages similar concepts to be shared across sentences. The data imputation process makes two notable changes to the existing ILP framework:

1. It extends the domain of $A_{ij}$ from binary to a continuous scale [0, 1], which offers a better sentence-level semantic representation.
2. The binary concept variables ($z_i$) are also relaxed to the continuous domain [0, 1], which allows the concepts to be “partially” included in the summary.

Let $A \in \mathbb{R}^{N \times M}$ be the co-occurrence matrix and $B \in \mathbb{R}^{N \times M}$ be the low-rank matrix whose values are close to $A$ at the observed positions. The objective function is

$$\min_{B \in \mathbb{R}^{N \times M}} \; \frac{1}{2} \sum_{(i,j) \in \Omega} (A_{ij} - B_{ij})^2 + \lambda \|B\|_*$$

where $\Omega$ represents the set of observed value positions, $\|B\|_*$ denotes the trace norm of $B$, i.e. $\|B\|_* = \sum_{i=1}^{r} \sigma_i$, where $r$ is the rank of $B$ and $\sigma_i$ are the singular values, and $\lambda$ is the hyper-parameter. Define the projection operator $P_\Omega$ as

$$[P_\Omega(B)]_{ij} = \begin{cases} B_{ij} & (i, j) \in \Omega \\ 0 & (i, j) \notin \Omega \end{cases}$$

The objective function can then be represented as

$$\min_{B} \; \frac{1}{2} \|P_\Omega(A) - P_\Omega(B)\|_F^2 + \lambda \|B\|_*$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The above equation is optimized using the proximal gradient descent algorithm. The update rule is

$$B^{(k+1)} = \mathrm{prox}_{\lambda \rho_k}\left( B^{(k)} + \rho_k \left( P_\Omega(A) - P_\Omega(B^{(k)}) \right) \right)$$

where $\rho_k$ is the step size at iteration $k$ and the proximal function $\mathrm{prox}_t(B)$ is defined as the singular value soft-thresholding operator, $\mathrm{prox}_t(B) = U \cdot \mathrm{diag}((\sigma_i - t)_+) \cdot V^\top$, where $B = U \, \mathrm{diag}(\sigma_1, \dots, \sigma_r) \, V^\top$ is the singular value decomposition (SVD) of $B$ and $(x)_+ = \max(x, 0)$. Since the gradient of $\frac{1}{2} \|P_\Omega(A) - P_\Omega(B)\|_F^2$ is Lipschitz continuous with $L = 1$ ($L$ is the Lipschitz constant), a fixed step size $\rho_k = 1$ is chosen, which has a provable convergence rate of $O(1/k)$, where $k$ is the number of iterations.
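The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration of proximal gradient descent with singular value soft-thresholding, not the authors' code. The function names, the iteration count and the toy matrix are assumptions, and which entries count as observed is left to the caller through a boolean mask.

```python
import numpy as np


def soft_threshold_svd(B, t):
    """prox_t(B): shrink every singular value of B by t, flooring at zero."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt


def soft_impute(A, observed, lam, n_iters=100, step=1.0):
    """Low-rank approximation of A that is fitted only at observed positions.

    A: (N x M) concept-sentence matrix.
    observed: boolean mask of the same shape (the set Omega).
    lam: trace-norm weight (lambda); step: fixed step size (rho_k).
    """
    A = np.asarray(A, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    B = np.zeros_like(A)
    for _ in range(n_iters):
        # Gradient step on the squared error over observed entries only.
        gradient_step = B + step * np.where(observed, A - B, 0.0)
        # Proximal step: singular value soft-thresholding.
        B = soft_threshold_svd(gradient_step, lam * step)
    return B


# Toy 2 x 3 matrix. Treating the non-zero cells as the observed ones is an
# assumption made for this example only.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
B = soft_impute(A, observed=A > 0, lam=0.1)
print(np.round(B, 2))
```

A larger `lam` pushes more singular values to zero and so yields a lower-rank, smoother matrix, while `lam = 0` with every entry observed simply reproduces A.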

V. PRELIMINARIES

A. ANALYSIS

The dataset must be collected at an early stage. It is collected from an introductory materials science and engineering class. The class has 25 lectures and 53 enrolled undergraduate students. The students are asked to provide feedback after each lecture based on three prompts: 1) “describe what you found most interesting in today’s class,” 2) “describe what was confusing or needed more detail,” and 3) “describe what you learned about how you learn.” These open-ended prompts are carefully designed to encourage students to self-reflect, allowing them to “recapture experience, think about it and evaluate it”. The average response length is 10±8.3 words. If all the responses to each lecture and prompt are concatenated into a “pseudo-document”, the document contains 378 words on average. The reference summaries are created by a teaching assistant, who is allowed to create abstract summaries using their own words in addition to selecting phrases directly from the responses. Because summary annotation is costly and recruiting annotators with the proper background is nontrivial, 12 out of the 25 lectures are annotated with reference summaries. There is one gold-standard summary per lecture and question prompt, yielding 36 document-summary pairs. On average, a reference summary contains 30 words, corresponding to 7.9% of the total words in student responses. 43.5% of the bigrams in human summaries appear in the responses.

VI. EXPERIMENTS

The proposed approach is compared against a range of baselines:

1) MEAD: a centroid-based summarization system that scores sentences based on length, centroid, and position
2) LEXRANK: a graph-based summarization approach based on eigenvector centrality
3) SUMBASIC: an approach that assumes words occurring frequently in a document cluster have a higher chance of being included in the summary
4) BASELINE-ILP: a baseline ILP framework without data imputation

For the ILP-based approaches, we use bigrams as concepts (bigrams consisting of only stop words are removed) and sentence frequency as concept weights. All the sentences in the 25 lectures are used to construct the concept-sentence co-occurrence matrix and perform data imputation, which makes it possible to leverage the co-occurrence statistics both within and across lectures. For the soft-impute algorithm, we perform a grid search (on a scale of [0, 5] with step size 0.5) to tune the hyper-parameter λ. To make the most use of the annotated lectures, we split them into three folds. In each round, λ is tuned on two folds and tested on the other fold. Finally, the averaged results are reported. In all experiments, summary length is set to 30 words or less, corresponding to the average number of words in human summaries.
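The concept extraction described above can be sketched as follows: bigrams serve as concepts, bigrams made up only of stop words are dropped, and each concept is weighted by the number of sentences containing it. The stop-word list, the whitespace tokenizer and the example sentences are simplifying assumptions made for illustration.

```python
# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "was", "is", "and", "i"}


def bigrams(sentence):
    """Lower-cased word bigrams, skipping those made of stop words only."""
    tokens = sentence.lower().split()
    return [
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if not (a in STOP_WORDS and b in STOP_WORDS)
    ]


def build_cooccurrence(sentences):
    """Return (concepts, A, weights), where A[i][j] = 1 if concept i is in sentence j."""
    per_sentence = [set(bigrams(s)) for s in sentences]
    concepts = sorted(set().union(*per_sentence))
    A = [[1 if c in found else 0 for found in per_sentence] for c in concepts]
    weights = [sum(row) for row in A]  # sentence frequency of each concept
    return concepts, A, weights


# Made-up responses in the style of student feedback.
sentences = [
    "the phase diagram was confusing",
    "phase diagram of the alloy",
]
concepts, A, weights = build_cooccurrence(sentences)
for concept, row, weight in zip(concepts, A, weights):
    print(concept, row, weight)
```

The bigram ("phase", "diagram") occurs in both sentences and so receives weight 2, while ("of", "the") is discarded because both of its words are stop words.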

Table 2: Summarization results evaluated by ROUGE and human judges.

Table 2 presents the summarization results evaluated by ROUGE and human judges. A shaded area indicates that the performance difference from the proposed approach is statistically significant (p < 0.05) using a two-tailed paired t-test on the 36 document-summary pairs. ROUGE is a standard evaluation metric that compares system and reference summaries based on n-gram overlaps. The proposed approach outperforms all the baselines on three standard ROUGE metrics. Examining the imputed sentence-concept co-occurrence matrix reveals some interesting examples that indicate the effectiveness of the proposed approach, shown in Table 3.
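As a rough illustration of the n-gram overlap behind ROUGE, the sketch below computes a simplified ROUGE-N recall for a single reference. It is not the official ROUGE package: it omits options such as stemming and stop-word handling, the tokenizer is plain whitespace splitting, and the example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(system, reference, n=2):
    """Share of reference n-grams that also appear in the system summary."""
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Clipped overlap: an n-gram is credited at most as often as the system has it.
    overlap = sum(min(count, sys_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total


reference = "the phase diagram was confusing"
system = "phase diagram was hard"
print(rouge_n_recall(system, reference, n=1))
print(rouge_n_recall(system, reference, n=2))
```

In this example the system summary recovers three of the five reference unigrams and two of the four reference bigrams.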

Table 3: Associated bigrams do not appear in the sentence, but after matrix imputation they yield a decent correlation (cell value greater than 0.9) with the corresponding sentence.

Because ROUGE cannot thoroughly capture the semantic similarity between system and reference summaries, we further perform a human evaluation. For each lecture and prompt, we present the prompt, a pair of system outputs in random order, and the human summary to five Amazon turkers. The turkers are asked to indicate their preference for system A or B based on the semantic resemblance to the human summary on a 5-point Likert scale (‘Strongly preferred A’, ‘Slightly preferred A’, ‘No preference’, ‘Slightly preferred B’, ‘Strongly preferred B’). They are rewarded $0.08 per task. Two strategies are used to control the quality of the human evaluation. First, the turkers are required to have a Human Intelligence Task (HIT) approval rate of 90% or above. Second, some quality checkpoints are inserted by asking the turkers to compare two summaries with the same text content but different sentence orders. Turkers who do not pass these tests are filtered out. Due to budget constraints, we conduct pairwise comparisons for three systems. The total number of comparisons is 3 system-system pairs × 12 lectures × 3 prompts × 5 turkers = 540 total pairs. We then calculate the percentage of “wins” (strong or slight preference) for each system among all comparisons with its counterparts. Results are reported in the last column of Table 2. The proposed approach is preferred significantly more often than the other two systems. Regarding inter-annotator agreement, 74.3% of the individual judgments agree with the majority votes when using a 3-point Likert scale (‘preferred A’, ‘no preference’, ‘preferred B’).

A. RESULTS

Table 4 presents example system outputs, which offer an intuitive understanding of the proposed approach.


Table 4: Example reference and system summaries.

VII. CONCLUSION

This work summarizes student feedback using an integer linear programming framework with data imputation. The proposed approach allows sentences to share co-occurrence statistics and alleviates the sparsity issue. Experiments show that the proposed approach performs competitively against a range of baselines and shows promise for future automation of student feedback analysis. In future work, we plan to take advantage of the high-quality student responses and explore helpfulness-guided summarization to improve the summarization performance. We will also investigate whether the proposed approach benefits other informal text such as product reviews, social media discussions or spontaneous speech conversations, in which the same sparsity issue occurs and the language expression is diverse. We hope to extend the ILP model to consider discourse coherence, sentence aggregation, and referring expression generation.


REFERENCES

[1] Wencan Luo, Fei Liu, Zitao Liu, and Diane Litman, "Automatic Summarization of Student Course Feedback", In Proceedings of NAACL-HLT, 2016.
[2] Xiangmin Fan, Wencan Luo, Muhsin Menekse, Diane Litman, and Jingtao Wang, "CourseMIRROR: Enhancing large classroom instructor-student interactions via mobile interfaces and natural language processing", In Works-In-Progress of ACM Conference on Human Factors in Computing Systems, 2015.
[3] Dimitrios Galanis, Gerasimos Lampouras, and Ion Androutsopoulos, "Extractive multi-document summarization with integer linear programming and support vector regression", In Proceedings of COLING, 2012.
[4] Kai Hong, John Conroy, Benoit Favre, Alex Kulesza, Hui Lin, and Ani Nenkova, "A repository of state of the art and competitive baseline summaries for generic news summarization", In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014.
[5] Chin-Yew Lin, "ROUGE: a package for automatic evaluation of summaries", In Proceedings of the Workshop on Text Summarization Branches Out, 2004.
[6] Wencan Luo, Xiangmin Fan, Muhsin Menekse, Jingtao Wang, and Diane Litman, "Enhancing instructor-student and student-student interactions with mobile interfaces and summarization", In Proceedings of NAACL (Demo), 2015.
[7] Dragomir R. Radev, Hongyan Jing, Małgorzata Styś, and Daniel Tam, "Centroid-based summarization of multiple documents", Information Processing and Management, 40(6):919–938, 2004.
[8] Lucy Vanderwende, Hisami Suzuki, Chris Brockett, and Ani Nenkova, "Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion", Information Processing and Management, 43(6):1606–1618, 2007.
[9] Kristian Woodsend and Mirella Lapata, "Multiple aspect summarization using integer linear programming", In Proceedings of EMNLP, 2012.

Machines aren’t generally too great at describing pictures in human terms, but Google has now released ‘Show and Tell’, a free software algorithm capable of automatically captioning images with an extremely high degree of accuracy. Point Show and Tell at any photographic image and the algorithm will generate a caption in full English sentences. The results could have many uses, including assisting in searches for specific image content or providing automatic audio captioning of photos for blind users.

Now, Google has released an updated version of Show and Tell, which claims faster performance and up to 93.9% captioning accuracy. This should help to avoid some of the incorrect, and in some cases amusing, captions generated by previous technologies. This latest version of the algorithm has also been made freely available as part of the TensorFlow open source software library, which is free to download and use.



Approaches for the Detection of Opinion/Review Spams

Sandini S (sandinisukumar@gmail.com), Sreelakshmi K (sreelakshmiknarayanan@gmail.com), Varsha E (varshaedakkat23@gmail.com)
M.Tech Computational Linguistics, Government Engineering College, Sreekrishnapuram

With the advancements in the field of information technology, a new era of online marketing and advertisement has evolved. Nowadays, before purchasing a product, customers look through the reviews and ratings of the product on different websites. Leveraging this fact, spammers try to mislead customers by posting fake reviews about products and brands. In hyper spam, fabricated positive reviews or opinions act as a promotion for the product. In defaming spam, fabricated negative reviews affect the reputation of the product and the demand for it. In this context, the detection of opinion/review spam plays a crucial role and has attracted a lot of attention. Various techniques have been proposed for the detection of opinion spam. Here we discuss some effective and relevant methodologies: opinion spam detection based on sentiment classification and feature base-opinion mining, supervised spam detection and unsupervised spam detection.


1. Opinion spam detection based on sentiment classification and feature base-opinion mining

This methodology consists of two major steps. First, sentiment classification is performed to gain knowledge about the sentiment contained in the review. It determines whether the review is positive, negative or neutral. Second, feature base-opinion mining is performed to obtain the opinion of the reviewer about individual features of the reviewed product and to get the spam-free content (base-opinion).

For the detection of spam, important linguistic features such as words, POS tags and n-grams are considered. They help in discovering the deceptions and lies in the content of the review, which is extracted from the input review query. To identify and categorize the many spam features, the customer reviews need to be analysed. This phase recognizes the metadata of the review and information about the reviewed entity. Metadata about the review includes the awarded star rating, user ID of the reviewer, host IP address, MAC address of the machine used, geographical location of the reviewer, time of publication, time taken to write the review, etc. Analyzing these features helps to identify irregular behavior patterns of reviews and reviewers.

Before the text mining operation, the text document should be pre-processed into small units of text called features of the product. Text mining analyzes the review text in natural language and derives information from it. It is followed by text categorization, which helps in finding text duplication and similarity between different reviews. There are also non-review spams, which include advertisements, irrelevant questions and answers, random text, etc. Machine learning techniques such as linear classifiers are used to identify such content.

2. Supervised spam detection

This approach requires a labeled review spam data set. Various features of the review text are used for labeling. There are four major approaches to obtaining features from the review text. First, linguistic characteristics such as the quantity and complexity of words, average word length and the number of digits are used as features. Fake reviews have a greater quantity and complexity of words than truthful ones, whereas average word length and the number of digits are higher in truthful reviews. Second, POS tag classes such as nouns, prepositions and adjectives are used as features. Original reviews contain more nouns, prepositions and adjectives, while fake ones have more pronouns, verbs, adverbs and connectors. Third, n-grams are used as features to model the content and context of a review using a text categorization method. Fourth, sentiment is considered as a feature. Fake negative reviews contain words that show a high degree of negativity, and vice versa in the case of fake positive opinions.

The above-mentioned features are used to train classifiers such as Naive Bayes, Decision Tree and Support Vector Machines. The classifiers decide whether the review content is fake or not. A Naive Bayes model is easy to build; based on Bayes' theorem, it assumes independence among the different predictors. It classifies quickly and its performance is not degraded by irrelevant attributes. Support Vector Machines use a hyperplane to divide the whole set of training points into two disjoint sets. The decision tree model builds a decision tree through an incremental process, by splitting the training data into smaller sets. The resulting tree structure, built as a classification or regression model, contains the decision for the problem statement.

3. Unsupervised spam detection

Manually labeling the training set is a difficult and expensive task. Unsupervised spam detection can be used as an alternative to the supervised technique. This approach uses an unlabelled data set and models opinion spam detection as an instance of unsupervised Bayesian clustering. The two clusters, spam and non-spam, are modeled as latent variables. The review text is represented by these latent variables, which are conditioned on a set of behavioral and linguistic features of the review. This is achieved by using inference techniques such as Markov chains, and the clusters for the spam and non-spam categories are generated by using the stationary distribution. The features considered can be classified into two classes: author features and review features. Author features include review content similarity, the maximum number of reviews and reviewing activity. Review features include extreme rating (positive or negative), rating deviation from the average rating and earliness of the time of review. Atypical behavior of the author is an important parameter that can be used to identify authors who write opinion spam. For example, consider a scenario where an author writes negative reviews for a particular brand only when the other users write positive opinions. Then this author is suspicious and the reviews by him may be

fake ones. Different kinds of abnormal behavior patterns can be modeled to show spamming. These models can assign a score to each reviewer based upon the similarity of user behavior to spamming behavior represented by the model. All the scores are added up to obtain the spamming score for an author and classified to the set of spammer or non-spammer. REFERENCES [1] https://books.google.co.in/books?id=Gt8g72 e6MuEC [Available Online: Accessed on 20th December 2016] [2] http://www.ijritcc.org/download/browse/Vol ume_4_Issues/May_16_Volume_4_Issue_5/1463 375082_16-05-2016.pdf [Available Online: Accessed on 20th December 2016] [3] http://www.vladsandulescu.com/opinionspam -detection-literature-review/ [Available Online: Accessed on 19th December 2016]
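Illustrative sketch: the reviewer scoring described at the end of Section 3 above (each behavior model assigns a score, and the scores are added up into a spamming score) could look roughly like the Python below. The two behavior models, the 1 to 5 star scale, the sample reviews and the threshold of 1.5 are all assumptions made for this illustration; the article does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Review:
    author: str
    product: str
    rating: float  # star rating on an assumed 1-5 scale


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def deviation_model(author, reviews):
    """Score how far the author's ratings sit from what other users gave."""
    scores = []
    for r in reviews:
        if r.author != author:
            continue
        others = [o.rating for o in reviews
                  if o.product == r.product and o.author != author]
        if others:
            # Divide by 4, the largest possible gap on a 1-5 scale.
            scores.append(abs(r.rating - mean(others)) / 4.0)
    return mean(scores)


def extreme_model(author, reviews):
    """Fraction of the author's reviews that use an extreme rating."""
    own = [r for r in reviews if r.author == author]
    return mean([1.0 if r.rating in (1.0, 5.0) else 0.0 for r in own])


# Hypothetical list of behavior models; each returns a score in [0, 1].
MODELS = [deviation_model, extreme_model]


def spamming_score(author, reviews):
    """Add up the scores of all behavior models for one author."""
    return sum(model(author, reviews) for model in MODELS)


def is_spammer(author, reviews, threshold=1.5):
    """Classify the author using an assumed, hand-picked threshold."""
    return spamming_score(author, reviews) >= threshold


reviews = [
    Review("a", "p1", 1), Review("b", "p1", 5), Review("c", "p1", 5),
    Review("d", "p1", 4), Review("e", "p1", 5),
]
for name in ("a", "b", "d"):
    print(name, spamming_score(name, reviews), is_spammer(name, reviews))
```

In this toy data, author "a" gives the only negative rating while everyone else rates the product highly, which is the suspicious pattern from the example above, so "a" receives the highest score.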

Google Allo is an instant messaging application developed by Google. It is a smart messaging app that helps users say more and do more. It includes a virtual assistant and provides a "smart reply" function that allows users to reply to messages with automatic suggestions instead of typing. Allo was announced at Google I/O on May 18, 2016 and launched on September 21, 2016. It is presented as a step up from intelligent personal assistants like Cortana and Siri.



Entity focused sentence generation
Rahul M
M.Tech Computational Linguistics
Government Engineering College, Sreekrishnapuram
Rahulgullan9@gmail.com

The amount of information added to the World Wide Web increases day by day. As a result, the Web is a huge source of information and may in future become the largest knowledge base. The concept of the Semantic Web was introduced as part of this transition, and it enables information retrieval based on semantics as well. To support this, documents contain not only free text but also semantically enriched markup, and search engines have been modified to operate on semantically rich documents. Semantics are added to the Web by organizing it into large entity-relationship graphs. DBpedia, Freebase, YAGO, etc. express semantic relationships by representing entities as nodes and relations as arcs between nodes. This representation can be used for applications that require exploration of complex relations among entities. Multi-document summarization is one of the applications that makes use of the entity focused approach to generate descriptions from multiple documents. Multi-document summarization allows extraction of information from multiple documents which are all based on the same topic. The generated summary should be both concise and unambiguous. Summarization from multiple documents is more complex than summarization from a single document.


When generating a summary for a named entity such as a company, summarization approaches that are knowledge-base driven, data driven, or sometimes both, are used. The resulting summaries have a format similar to Wikipedia descriptions. The sentences are generated from RDF triples found in DBpedia and Freebase. RDF stands for Resource Description Framework, a set of specifications by the W3C. It is a metadata model for describing information held in web resources. Triples are expressions that contain a subject, an object, and the relation between them (the predicate). The subject denotes the resource, and the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object.

The primary process of sentence generation uses the method of targeted generation (TD). The TD system generates templates for the target sentences. For that, information about a particular type of entity (e.g. companies) is taken from Wikipedia articles and DBpedia entries. For each RDF relation in which the entity acts as the subject, all sentences in the corresponding article that contain the entities in the relation are identified. The template for that relation is generated by replacing the entities with slot tags. For example, 'Facebook was founded by Mark Zuckerberg' is converted into '<company> was founded by <founder>'. If more than one founder occurs, they are represented as a single entity. Sometimes more than one entity occurs in the same sentence (e.g. entities such as company, founder and location occur together). In this manner, the possible templates are generated.

At generation time, the template slots are filled with information about the target entity taken from its RDF entries. If a slot has multiple entities, conjunctions are inserted. This method fails when the data for filling the slots are not available. This problem can be handled by removing the phrases that could not be filled from the sentence; otherwise, the entire sentence can be discarded.
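As a rough illustration of the template step described above, the following Python sketch turns a sentence into a template using RDF-style triples and then fills the template for another entity. The triples, the function names and the slot syntax handling are assumptions made for this example, not the actual TD system. Here a template whose slot cannot be filled is discarded entirely, which is one of the two fallbacks mentioned above.

```python
import re

# Hypothetical (subject, predicate, object) triples used only for illustration.
TRIPLES = [
    ("Facebook", "founder", "Mark Zuckerberg"),
    ("Google", "founder", "Larry Page"),
    ("Google", "founder", "Sergey Brin"),
]


def make_template(sentence, subject, subject_type, triples):
    """Replace the subject and any of its known objects with slot tags."""
    template = sentence.replace(subject, f"<{subject_type}>")
    for s, predicate, obj in triples:
        if s == subject and obj in template:
            template = template.replace(obj, f"<{predicate}>")
    return template


def join_values(values):
    """Join several slot values with a conjunction."""
    if len(values) == 1:
        return values[0]
    return ", ".join(values[:-1]) + " and " + values[-1]


def fill_template(template, subject, subject_type, triples):
    """Fill every slot for the target entity; return None if a slot is empty."""
    values = {subject_type: [subject]}
    for s, predicate, obj in triples:
        if s == subject:
            values.setdefault(predicate, []).append(obj)

    def substitute(match):
        slot = match.group(1)
        if slot not in values:
            raise KeyError(slot)  # no data available for this slot
        return join_values(values[slot])

    try:
        return re.sub(r"<(\w+)>", substitute, template)
    except KeyError:
        return None  # discard the whole sentence


template = make_template(
    "Facebook was founded by Mark Zuckerberg", "Facebook", "company", TRIPLES
)
print(template)
print(fill_template(template, "Google", "company", TRIPLES))
print(fill_template("<company> is based in <location>", "Google", "company", TRIPLES))
```

The last call returns None because the sample triples contain no location for the target entity.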

The TD system can generate many possible sentences for each relation. To choose the most suitable one, the newly generated sentences are clustered by relation. Since the number of replaced relations indicates the amount of information about the target entity, the sentences are scored by counting the replaced relations. Shorter sentences are also given weight. A sentence that requires more post-processing is scored lower. The highest-scoring sentence for each relation is then chosen for the description.

Data-driven (DD) generation is somewhat similar to the TD system, except that the DD method generates the description using sentences taken from the web. The underlying concept is that it produces sentences which realize the relation between the input entity and other entities. Realization is based on a method called the bootstrapping approach. This approach starts with a set of entity pairs, called the seed set, which represents a small subset of the desired relations, and keeps on generating additional relations. The patterns are generated by reading web text based on the seed set and retrieving the matching sentences. The entities of the pairs are then replaced with placeholder tags. The placeholder tag indicates the entity type, and the pattern consists of the words around the tags. The pattern has the form < L >< T1 >< M >< T2 >< R >, where L, M and R are the words to the left of, between, and to the right of the tags, and T1 and T2 are the entity types. This is also a template based approach; the difference is that the templates are not aligned to a relation between the entity and the input entity, and only the type of entity (organization, person, location etc.) is mentioned using the tag. By matching the patterns against web text, new entity pairs are generated. A sentence is said to match a pattern if the order of the entity types matches.

The entities are therefore assumed to be related because they are expressed in a way similar to the seed pair. In this manner, the patterns are learned and new entity pairs are generated. Bing search engine results can be used to match against the learned patterns. The selected sentences are ranked by the number of matches and added to the description.
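The pattern form < L >< T1 >< M >< T2 >< R > described above can be illustrated with a small Python sketch. The entity dictionary below is a hypothetical stand-in for a real named-entity tagger, and the sample sentences are invented. The article states that a sentence matches when the order of the entity types agrees; this sketch additionally requires the middle words to be identical, which is an assumption added to keep the toy example meaningful.

```python
import re

# Hypothetical gazetteer standing in for a named-entity tagger.
ENTITY_TYPES = {
    "Facebook": "ORG",
    "Mark Zuckerberg": "PER",
    "Microsoft": "ORG",
    "Bill Gates": "PER",
    "Seattle": "LOC",
}


def find_entities(sentence):
    """Return (start, end, name, type) for each known entity, in text order."""
    found = []
    for name, etype in ENTITY_TYPES.items():
        for m in re.finditer(re.escape(name), sentence):
            found.append((m.start(), m.end(), name, etype))
    return sorted(found)


def split_around(sentence, first, second):
    """Split a sentence into the words left of, between and right of two entities."""
    left = sentence[:first[0]].strip()
    middle = sentence[first[1]:second[0]].strip()
    right = sentence[second[1]:].strip()
    return left, middle, right


def learn_pattern(sentence, seed_pair):
    """Build (L, T1, M, T2, R) from a sentence that contains the seed pair."""
    ents = [e for e in find_entities(sentence) if e[2] in seed_pair]
    if len(ents) != 2:
        return None
    first, second = ents
    left, middle, right = split_around(sentence, first, second)
    return (left, first[3], middle, second[3], right)


def match_pattern(sentence, pattern):
    """Return a new entity pair if the type order and middle words agree."""
    _, t1, middle, t2, _ = pattern
    ents = find_entities(sentence)
    for first, second in zip(ents, ents[1:]):
        if (first[3], second[3]) == (t1, t2):
            if split_around(sentence, first, second)[1] == middle:
                return (first[2], second[2])
    return None


pattern = learn_pattern(
    "Facebook was founded by Mark Zuckerberg in 2004.",
    ("Facebook", "Mark Zuckerberg"),
)
print(pattern)
print(match_pattern("Microsoft was founded by Bill Gates.", pattern))
print(match_pattern("Bill Gates moved to Seattle.", pattern))
```

A fuller version would repeat the loop, adding each newly found pair to the seed set and learning further patterns from it.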


Here too, a post-processing step can be added to remove the noise and redundancy present in the generated description. Redundancy occurs when two or more sentences convey the same information. It can be removed by checking whether a sentence is equal to, or a subset of, any other sentence in the same description; if so, that sentence is removed. Noise refers to incomplete sentences, for example those ending with '..' or 'etc.', as well as news stories. Both can be removed by making use of regular expressions. If the two methods (TD and DD) are combined, the summarization becomes more efficient and accurate. This is called the hybrid approach.
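A minimal Python sketch of the post-processing step described above is given here. It assumes that "subset" means the words of one sentence are contained in the words of another, and that noise is limited to sentences ending with '..' or 'etc.'; the sample sentences are invented. News-story filtering is left out because the article does not say which patterns identify it.

```python
import re

# Assumed noise marker: a sentence that ends with '..' or 'etc.'
INCOMPLETE = re.compile(r"(\.\.|\betc\.)\s*$")


def tokens(sentence):
    """Lower-cased set of words, used for the equality and subset checks."""
    return set(re.findall(r"\w+", sentence.lower()))


def clean_description(sentences):
    """Drop incomplete sentences, then drop duplicates and word-subsets."""
    kept = [s for s in sentences if not INCOMPLETE.search(s)]
    result = []
    for i, sentence in enumerate(kept):
        words = tokens(sentence)
        redundant = False
        for j, other in enumerate(kept):
            if i == j:
                continue
            other_words = tokens(other)
            # Proper subset, or an exact repeat of an earlier sentence.
            if words < other_words or (words == other_words and j < i):
                redundant = True
                break
        if not redundant:
            result.append(sentence)
    return result


description = [
    "Facebook was founded by Mark Zuckerberg.",
    "Facebook was founded by Mark Zuckerberg in 2004.",
    "Facebook was founded by Mark Zuckerberg.",
    "Facebook makes products such as Messenger, Instagram etc.",
    "Read more about the company ..",
]
print(clean_description(description))
```

Only the second sentence survives in this example: the first and third are word-subsets of it, and the last two are treated as incomplete.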



M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com

SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering

Article Invitation for CLEAR March 2017

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics, for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) Journal, to be published in March 2017. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th March, 2017 through the email simplequest.in@gmail.com. For more details visit: www.simplegroups.in

Editor, CLEAR Journal
Representative, SIMPLE Groups



Hello world, The amount of information stored across heterogeneous platforms increases exponentially, so techniques for retrieving the relevant information are of growing concern. Retrieving semantically similar short texts is an issue of critical importance for many applications like question answering systems and web search. Currently, the focus of the information retrieval community has shifted to techniques that utilize the semantics in the text. Entity focused summary generation captures the complex relationships between entities. An integer linear programming framework with data imputation is an effective approach for applications like student feedback systems. A major issue encountered in the e-commerce field is review spam, so techniques for the detection of review spam need immediate attention.

This issue of CLEAR focuses on research on Semantically Similar Short Text Retrieval, Entity-centric Summarization, Summarization of Feedback Using ILP Framework and Detection of Review Spams. The articles are penned with the hope of shedding some light on these trending fields of information retrieval. CLEAR is grateful to all who have given their valuable time and effort to introduce fresh ideas. SIMPLE Groups invites more strivers in this field. Wish you all the success in your future endeavors!

Sreelakshmi K


