Clear Journal March 2019 Edition

Page 1



DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING VISION To become a Centre of Excellence in Computing and allied disciplines.

MISSION To impart high quality education in Computer Science and Engineering that prepares the students for rewarding, enthusiastic and enjoyable careers in the industry, academia and other organizations.

COURSES: M.Tech in Computational Linguistics B.Tech in Computer Science & Engineering

ACTIVITIES: Machine Learning club FOSS Clubs, IEEE SB


CLEAR Journal (Computational Linguistics in Engineering And Research) M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad 678633 www.simplegroups.in simplequest.in@gmail.com geccl1820@googlegroups.com

Chief Editor Shibily Joseph Assistant Professor Dept. of Computer Science and Engineering Govt. Engineering College, Sreekrishnapuram, Palakkad-678633 Editors Aiswarya K Surendran Divya Visakh Sreeja V Cover page Zeenath M T

Editorial ........................................ 3
Invitation ....................................... 29
Last word ........................................ 30

Personalized Image Captioning with Context Sequence Memory Networks ............................................. 7 Anusha K Sasindran

MetaSoundex Algorithm for Privacy Preserving Record Linkage........... 10 Dincy Davis

Guess me if you can: Acronym Disambiguation.............................14 Haritha Kadimuttath

Global Encoding for Abstractive Summarization ................................ 18 Jannya V

Emotion Based Priority Prediction for Bug reports ............................... 21 Rengitha R

An Approach to Feature Extraction for identification of Suicidal Ideation in Tweets ............................................25 Vishnu Prasad

Layout Divya Visakh



Dear Readers, Here is the latest edition of the CLEAR Journal, which comes with new articles on trending topics such as Personalized Image Captioning, the MetaSoundex Algorithm, Acronym Disambiguation, Abstractive Summarization, Emotion-based Priority Prediction, and Feature Extraction for Identification of Suicidal Ideation in Tweets. The previous edition covered articles on Word Sense Disambiguation, Sense-Aware Neural Models, Text Categorization using Disconnected Recurrent Neural Networks, and Intelligent Question Answering Systems. We are very happy to have gained new readers, which gives us great motivation to keep improving. As always, we are working on the journal based on your valuable feedback, and we expect more of it. On this hopeful note, I proudly present this edition of the CLEAR Journal to our faithful readers and look forward to your opinions and criticisms.

BEST REGARDS, Shibily Joseph (Chief Editor)



PLACEMENTS • Resmi P of the M.Tech Computational Linguistics 2017-2019 batch got placed at "Litmus7 Systems Consulting Pvt Limited" as Associate Engineer Trainee

UGC NET QUALIFIERS

RESMI P of M.Tech Computational Linguistics (2017-2019) qualified UGC NET, December 2018 Exam

SANDEEP NITHYANANDAN of M.Tech Computational Linguistics (2017-2019) qualified UGC NET December 2018 Exam

Simple Group Congratulates All for their Achievements!!!


DEPARTMENT NEWS • MACHINE LEARNING FORUM

M.Tech first year students (2018-20) attended a Machine Learning Forum organized by RED TEAM ACADEMY on 23rd February 2019 at Marina Residency, Calicut. • INTRODUCTION TO DEEP LEARNING CLASS

As a part of SPECTRA 2019, the CSE Department organized a workshop on Introduction to Deep Learning by Chandrasekhar Lakshmi Narayanan (IIT Palakkad).


• LeCun's Trap: M.Tech students organised an event named "LeCun's Trap" for

B.Tech students as a part of CSE Association SPECTRA'19 on February 19, 2019


Personalized Image Captioning with Context Sequence Memory Networks Anusha K Sasindran B.Tech S8 CSE Government Engineering College, Sreekrishnapuram

anushaksasindran@gmail.com

Image captioning is the task of automatically generating a descriptive sentence for an image. It is regarded as one of the frontier AI problems. It requires an algorithm not only to understand the image content in depth, beyond category or attribute levels, but also to connect its interpretation with a language model to create a natural sentence. Personalization is one of the main issues of normal image captioning: a descriptive sentence must be generated for an image while accounting for prior knowledge such as the user's active vocabulary or writing style in previous documents. Personalized image captioning is applicable to a wide range of automation services in photo-sharing social networks. The main challenges arise in two post automation tasks: hashtag prediction and post generation. To achieve personalized image captioning, a memory network model named the Context Sequence Memory Network (CSMN) is proposed. The unique updates of CSMN include:

• exploiting memory as a repository for multiple types of context information;

• appending previously generated words into memory to capture long-term information;

• adopting a CNN memory structure to jointly represent nearby ordered memory slots.

A novel Instagram dataset is used for the evaluation of personalized image captioning.

Separately collected Instagram posts are used for post completion and hashtag prediction. Image posts from Instagram, one of the fastest growing photo-sharing social networks, are used, and Pinterest categories are used to obtain image posts from diverse users. Next, a series of filters is applied. First, language filtering is applied to include only English posts. Then filtering rules are applied to the lengths of captions and hashtags. A separate vocabulary dictionary needs to be built for each of the two tasks by choosing the most frequent V words in the dataset. For instance, the dictionary for hashtag prediction includes only the most frequent hashtags as vocabulary. V is set to 40K for post completion and 60K for hashtag prediction after thorough tests. Before building the dictionary, URLs, unicode characters except emojis, and special characters are removed.
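As a rough sketch of this filtering and vocabulary-building step (not the authors' code), the snippet below strips URLs and special characters and keeps the V most frequent tokens; the regular expressions, sample captions and the small V are simplifying assumptions, and emojis are not preserved here.

```python
# Illustrative vocabulary building: clean captions, count tokens, keep top V.
import re
from collections import Counter

def clean(text: str) -> list[str]:
    text = re.sub(r"http\S+", " ", text)                # drop URLs
    text = re.sub(r"[^a-z0-9#@ ]", " ", text.lower())   # drop special characters
    return text.split()

captions = ["Sunset at the beach #nofilter http://t.co/x",
            "Beach day with friends #sun #beach"]
V = 10                                                  # 40K/60K in the article
counts = Counter(tok for cap in captions for tok in clean(cap))
vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common(V))}
print(vocab)
```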


The input is a query image I_q of a specific user, and the output is a sequence of words y_1, ..., y_T, which corresponds to a list of hashtags in hashtag prediction and to a post sentence in post generation. An optional input is the context information to be added to memory, such as the active vocabulary of a given user. Hashtag prediction is treated as sequence prediction instead of prediction of a bag of orderless tag words; since hashtags in a post tend to have strong co-occurrence relations, it is better to take previous hashtags into account when predicting the next one.

Construction of Context Memory: Memory is constructed to store three types of context information:

• Image memory for the representation of a query image;

• User context memory for the TF-IDF weighted D frequent words from the query user's previous posts;

• Word output memory for previously generated words.

Figure 1 represents the memory setup.

State-Based Sequence Generation: This approach does not involve an RNN or its variants; instead, all previously generated words are sequentially stored in the memory. This enables predicting each output word by selectively attending over combinations of all previous words, image regions, and user context. To predict a word y_t at time step t based on the memory state, let y_{t-1} be the one-hot vector of the previous word; an input vector q_t at time t is first generated for the memory network as

q_t = ReLU(W_q x_t + b_q), where x_t = W_eb y_{t-1},

with W_eb ∈ R^(512 x V) and W_q ∈ R^(1024 x 512). Next, q_t is fed into the attention model of the context memory.

Figure 2: Prediction step and word output memory update
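The sketch below works through the input-vector computation just described with NumPy; the random weights, the toy vocabulary size and the word index are placeholders, not trained CSMN parameters.

```python
# q_t = ReLU(W_q x_t + b_q), with x_t = W_eb y_{t-1} (one-hot previous word).
import numpy as np

V = 1000                                    # toy vocabulary size (40K in the article)
W_eb = np.random.randn(512, V) * 0.01       # word embedding matrix, R^(512 x V)
W_q = np.random.randn(1024, 512) * 0.01     # query projection, R^(1024 x 512)
b_q = np.zeros(1024)

def query_vector(prev_word_id: int) -> np.ndarray:
    y_prev = np.zeros(V)
    y_prev[prev_word_id] = 1.0              # one-hot vector of y_{t-1}
    x_t = W_eb @ y_prev                     # x_t = W_eb y_{t-1}
    return np.maximum(0.0, W_q @ x_t + b_q) # q_t, fed to the memory attention

q_t = query_vector(prev_word_id=123)
print(q_t.shape)                            # (1024,)
```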



It indicates which part of the input memory is important for the input q_t at the current time step. Then a CNN is applied to the attended output of the memory. Using a CNN significantly boosts the captioning performance, mainly because the CNN allows obtaining a set of powerful representations by fusing multiple heterogeneous cells with different filters. Finally, the word that attains the highest probability is selected. Unless the output word y_t is the EOS token, the model repeats generating the next word by feeding y_t into the word output memory and the input at time step t + 1. This approach is greedy in the sense that the model creates the best sequence by a sequential search for the best word at each time step. A teacher-forced learning method is used, which provides the correct memory state to predict the next words.

References
[1] "User Conditional Hashtag Prediction for Images", http://www.thespermwhale.com/jaseweston/papers/imagetags.pdf, Accessed online on 29 March 2019.
[2] "Long-term Recurrent Convolutional Networks for Visual Recognition and Description", https://arxiv.org/pdf/1411.4389.pdf, Accessed online on 29 March 2019.
[3] "From Captions to Visual Concepts and Back", https://arxiv.org/pdf/1411.4952.pdf, Accessed online on 29 March 2019.



Neurodegenerative diseases identified using artificial intelligence:

Researchers have developed an artificial intelligence platform to detect a range of neurodegenerative diseases in human brain tissue samples, including Alzheimer's disease and chronic traumatic encephalopathy, according to a study conducted at the Icahn School of Medicine at Mount Sinai and published in the Nature family journal Laboratory Investigation. Their discovery will help scientists develop targeted biomarkers and therapeutics, resulting in more accurate diagnosis of complex brain diseases and improved patient outcomes.


MetaSoundex Algorithm for Privacy Preserving Record Linkage Dincy Davis M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram dincydavis17@gmail.com

Record linkage aims at linking records that refer to the same real-world entity, such as persons. However, in many cases, data owners are only willing or allowed to provide their data for such data integration if there is sufficient protection of sensitive information to ensure the privacy of persons. Privacy Preserving Record Linkage (PPRL) addresses this problem by providing techniques to match records while preserving their privacy, allowing the combination of data from different sources for improved data analysis and research. For this purpose, phonetic encoding techniques are used, which convert words into an encoded form based on their pronunciation. This paper presents a hybrid protocol, known as MetaSoundex, that allows the linking of databases between organizations. Experiments showed that the proposed technique has higher accuracy in record linkage than the widely known phonetic matching technique, Soundex. One of the novel techniques to enforce privacy preserving record linkage, which might have to handle misspelled data, is the use of phonetic encoding. Phonetic encoding involves the conversion of words into an encoded form based on their pronunciation. In phonetic encoding, each encoded word can map to more than one word; as a result, it is hard for intruders to extract the original information. Hence


the privacy of data is preserved while performing the record linkage process.

Figure 1: Privacy preserving record linkage problem

Figure 1 illustrates an example of real-world cases where privacy preserving record linkage is needed. In this figure, each of the two hospitals holds its own database of patient information. They would like to find out the common patients they share (e.g., Angel Smith and Divine Scavo) in order to perform collaborative research on the shared data. However, due to the requirements of HIPAA (Health Insurance Portability and Accountability Act of 1996), they cannot


exchange data in clear text. Moreover, notice that for the patient Divine Scavo all attributes are the same at both hospitals except the height (one is 162.5 cm and the other is 162.6 cm). If traditional schemes are used, the records belonging to the same patient will be labeled as a mismatch, leading to an erroneous result. In order to avoid the mismatch for records belonging to the same patient, new software is needed to enable privacy preserving record linkage for error-prone data. This paper presents a comprehensive record linkage method which meets privacy preserving policies. Soundex and Metaphone are the most commonly used phonetic encodings for record linkage. However, the two techniques have their own disadvantages of low precision and low accuracy, respectively. Hence, a hybrid phonetic encoding technique, namely MetaSoundex, is developed. Based on the experimental analysis of the performance of various widely used phonetic matching algorithms by Koneru et al., it is observed that Soundex has higher accuracy than the other existing algorithms; from experiments, it is further observed that MetaSoundex has higher accuracy in record linkage than Soundex. The methodology of the phonetic algorithms that convert a word into an encoded format based on its pronunciation, Soundex and MetaSoundex, as used to perform privacy preserved record linkage between two databases, is described below. The functionality of these two phonetic matching algorithms is illustrated below, and the illustration also includes the functionality of the Metaphone


algorithm, which is used in the formulation of MetaSoundex.

Soundex Algorithm (Robert C. Russell, 1922): Soundex is the most widely known of all phonetic algorithms. The Soundex code for a name consists of a letter followed by three numerical digits: the letter is the first letter of the name, and the digits encode the remaining consonants. The steps for generating a phonetic code using the Soundex algorithm are as below:

1. Retain the first letter of the word.
2. Change letters to digits as follows:
• A, E, I, O, U, H, W, Y -> 0
• B, F, P, V -> 1
• C, G, J, K, Q, S, X, Z -> 2
• D, T -> 3
• L -> 4
• M, N -> 5
• R -> 6
3. Remove all pairs of consecutive identical digits.
4. Remove all zeros from the resulting string.
5. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter><digit><digit><digit>.

Example: A265 = Azuron, Ackermann, ...

Disadvantages:
• Dependence on the initial letter
• Silent consonants

Soundex has high recall compared to other algorithms, but because of its low precision the algorithm is not very efficient.

Metaphone Phonetic Algorithm (L. Philips, 1990): Metaphone is an algorithm which considers sets of letters, as an alternative to letter-by-letter encoding (as in


Soundex), to identify the phonetic variations in words. To obtain the Metaphone code, the following steps should be followed:

I. Remove all repeating neighboring letters except the letter C.
II. Vowels are kept only when they are the first letter.
III. The beginning of the word should be transformed using the following rules:
• KN -> N; GN -> N; PN -> N; AE -> E; WR -> R; X -> S; WH -> W
IV. Apply the rules for each letter as follows:
• B -> B; remove B at the end if it comes after M.
• C -> with X: CIA -> XIA, CH -> XH; with S: CI -> SI, CE -> SE, CY -> SY; with K: C -> K, SCH -> SKH
• D -> with J: DGE -> JGE, DGY -> JGY, DGI -> JGI; with T: D -> T otherwise
• F -> F
• G -> GH -> H, except at the end or before a vowel; GN -> N and GNED -> NED if they are at the end; with J: GI -> JI, GE -> JE, GY -> JY; with K: G -> K otherwise
• H -> remove all H after a vowel but not before a vowel
• J -> J
• K -> K; CK -> K
• L -> L
• M -> M
• N -> N
• P -> P; PH -> F
• Q -> K
• R -> R
• S -> with X: SH -> XH, SIO -> XIO, SIA -> XIA; S otherwise
• T -> with X: TIA -> XIA, TIO -> XIO; with 0: TH -> 0; TCH -> CH; T otherwise
• V -> F
• W -> silent if not followed by a vowel; W if followed by a vowel
• X -> KS
• Y -> silent if not followed by a vowel; Y if followed by a vowel
• Z -> S

Example: HWRT = Howard, Haward, ...

Disadvantages: Less accuracy in obtaining proper matches for misspelled words (for example, Clemons has the Metaphone code KLMNS but Clemon has KLMN).

MetaSoundex (Koneru et al., 2017): To overcome the limitations of both algorithms, a new algorithm is proposed, namely MetaSoundex, which has the high precision of Metaphone and the high accuracy of Soundex. MetaSoundex is a hybrid of Soundex and Metaphone.

Figure 2: Proposed MetaSoundex algorithm


The steps for generating a phonetic code using the MetaSoundex algorithm are as shown below:

1. Find the Metaphone of the word W: M = Metaphone(W).
2. Find the Soundex of M: S = Soundex(M).
3. Change the letters of S to digits as follows:
• A, E, I, O, U -> 0
• J, Y -> 1
• D, T -> 3
• S, Z, C -> 4
• X, G, H, K, Q -> 5
• N, M -> 6
• B, F, V, P, W -> 7
• L -> 8
• R -> 9

MetaSoundex has the highest accuracy in linking records compared to the existing Soundex algorithm used in PPRL. It overcomes the limitations of both the Soundex and Metaphone algorithms. It is able to obtain the required matches even when deletion errors occur, unlike Metaphone, and the first-letter dependency of Soundex is also removed in MetaSoundex. This algorithm can be used to prevent crucial information from being exposed to other people when the communication lines are being attacked.
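A minimal sketch of the MetaSoundex pipeline, following the Soundex steps and the digit mapping listed above, is given below; it is not the reference implementation, and the Metaphone step is assumed to come from the third-party jellyfish library.

```python
import jellyfish  # assumed dependency providing a Metaphone implementation

SOUNDEX_MAP = {
    **dict.fromkeys("AEIOUHWY", "0"), **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"), **dict.fromkeys("DT", "3"),
    "L": "4", **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(word: str) -> str:
    word = "".join(ch for ch in word.upper() if ch in SOUNDEX_MAP)
    if not word:
        return ""
    digits = [SOUNDEX_MAP[ch] for ch in word]
    collapsed = [digits[0]]
    for d in digits[1:]:                    # collapse consecutive identical digits
        if d != collapsed[-1]:
            collapsed.append(d)
    tail = [d for d in collapsed[1:] if d != "0"]   # drop zeros, keep first letter
    return (word[0] + "".join(tail) + "000")[:4]

# MetaSoundex digit mapping from the article, applied to the Soundex code letters.
METASOUNDEX_MAP = {
    **dict.fromkeys("AEIOU", "0"), **dict.fromkeys("JY", "1"),
    **dict.fromkeys("DT", "3"), **dict.fromkeys("SZC", "4"),
    **dict.fromkeys("XGHKQ", "5"), **dict.fromkeys("NM", "6"),
    **dict.fromkeys("BFVPW", "7"), "L": "8", "R": "9",
}

def metasoundex(word: str) -> str:
    m = jellyfish.metaphone(word)           # step 1: Metaphone of the word
    s = soundex(m)                          # step 2: Soundex of the Metaphone code
    return "".join(METASOUNDEX_MAP.get(ch, ch) for ch in s)  # step 3: remap letters

print(soundex("Ackermann"))                 # A265, matching the example above
print(metasoundex("Clemons"), metasoundex("Clemon"))
```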

References
[1] "Performance evaluation of phonetic matching algorithms on English words and street names", https://www.scitepress.org/Papers/2016/59263/59263.pdf, Accessed online on 13 October 2018.
[2] "Privacy preserving record linkage using MetaSoundex algorithm", https://ieeexplore.ieee.org/document/8260671, Accessed online on 13 October 2018.
[3] "A taxonomy of privacy-preserving record linkage techniques", https://www.academia.edu/33761785/A_taxonomy_of_privacypreserving_record_linkage_techniques, Accessed online on 13 October 2018.
[4] "Challenges for privacy preservation in data integration", https://dl.acm.org/citation.cfm?id=2629604, Accessed online on 13 October 2018.

Artificial intelligence cuts lung cancer screening false positives: Lung cancer is the leading cause of cancer deaths worldwide. Screening is key for early detection and increased survival, but the current method has a 96 percent false positive rate. Using machine learning, researchers at the University of Pittsburgh and UPMC Hillman Cancer Center have found a way to substantially reduce false positives without missing a single case of cancer. Comparing the model's assessment against the actual diagnoses of these patients, the researchers found that they would have been able to save 30 percent of the people with benign nodules from undergoing additional testing, without missing a single case of cancer. The three factors that were most important to the model are the number of blood vessels surrounding the nodule, the number of nodules and the number of years since the patient quit smoking.


Guess me if you can: Acronym Disambiguation Haritha Kadimuttath M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram harithakadimuttath@gmail.com

Acronyms are abbreviations formed from the initial components of words or phrases. Acronyms are efficient for communication, but they may cause difficulty for people who are not familiar with the subject matter. To avoid this, a framework is introduced which automatically reduces the ambiguities. Acronym disambiguation in an enterprise is challenging for several reasons. First, acronyms may be highly ambiguous, since an acronym used in an enterprise could have multiple internal and external meanings. Second, there is no comprehensive knowledge base such as Wikipedia available in an enterprise. The system should be generic enough to work for any enterprise. The framework takes an enterprise corpus as input and produces a high quality acronym disambiguation system as output. Disambiguating concepts and entities in a context-sensitive way is a fundamental problem in natural language processing. Acronym disambiguation is a subset of the more general problem of Word Sense Disambiguation (WSD), which is to decide the sense of words in context. Acronyms are abbreviations formed from the initial components of words or phrases (e.g., AI from Artificial Intelligence). As acronyms can shorten long names, they are widely used in natural language to make

communications more efficient. As they make communication efficient, acronyms are used almost everywhere in enterprises, including notifications, emails, reports and social network posts. The figure below shows a sample enterprise social network post; as we can see, acronyms are frequently used there. There are no universal standards for making abbreviations; therefore, they can sometimes be difficult to understand, especially for people who are not familiar with the specific areas, such as new employees and patent lawyers. The enterprise acronym disambiguation task is challenging due to the high ambiguity of acronyms; e.g., SP could stand for Service Pack, SharePoint or Surface Pro in Microsoft. And there is one additional challenge compared with previous disambiguation tasks: in an enterprise document, an acronym could refer to either an internal meaning (concepts created by the enterprise that may or may not be found outside) or an external meaning (all concepts that are not internal). For example, regarding the acronym AI, Asset Intelligence is an internal meaning mainly used only in Microsoft, while Artificial Intelligence is an external meaning widely used in public. A good acronym disambiguation system


should be able to handle both internal and external meanings.

According to the authors, in the mining module the candidate meaning mining is done using a technique called Hybrid Generation. In this method, a phrase is treated as a candidate for an acronym when the initial letters of the phrase match the acronym and the phrase and acronym co-occur in at least one document. In addition, a phrase that is a valid candidate for the acronym in public knowledge bases is also included. For each candidate, its popularity score is calculated. Two types of popularity score are calculated: Marginal Popularity (MP) and Conditional Popularity (CP).
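A toy sketch of this mining idea follows; the exact popularity formulas used by the authors are not given here, so marginal popularity is approximated as a raw phrase count and conditional popularity as a document co-occurrence count, and the substring matching is a deliberate simplification.

```python
# Candidate mining: initials must spell the acronym; MP/CP are simple counts.
def is_candidate(acronym: str, phrase: str) -> bool:
    initials = "".join(word[0] for word in phrase.split())
    return initials.lower() == acronym.lower()

def popularity_scores(acronym: str, phrase: str, documents: list[str]):
    phrase_l, acro_l = phrase.lower(), acronym.lower()
    mp = sum(doc.lower().count(phrase_l) for doc in documents)       # marginal
    cp = sum(1 for doc in documents
             if phrase_l in doc.lower() and acro_l in doc.lower())   # conditional
    return mp, cp

docs = ["The SP (Service Pack) was released.", "Install the latest service pack."]
print(is_candidate("SP", "Service Pack"),
      popularity_scores("SP", "Service Pack", docs))
```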

Figure 1: Acronyms in enterprise

The system is particularly useful for automatically resolving the true meanings of acronyms in enterprise documents. It could be run online as a querying tool to handle any ad-hoc document, or run offline to annotate acronyms with their true meanings in a large corpus. In the offline mode, the true meanings can be further indexed by an enterprise search engine, so that when users search for the true meaning, documents containing the acronym can also be found. In this approach, the acronym and its correct meaning are output by checking both the internal and external meanings. The method has three modules: the mining module, the training module and the testing module. The system mines acronym-meaning pairs from plain text and then ranks them. The ranked data is then used for training, after which the confidence of each pair is tested and, based on that, the final selection is made.


Figure 2: Framework

Conditional Popularity can more reasonably reveal how often the acronym is used to represent each meaning candidate. However, due to the data sparsity issue in enterprises, many valid candidates may get a zero value for Conditional Popularity, since they may never co-occur with the acronym in the enterprise corpus. Marginal Popularity does not have this problem, since it is calculated from the raw counts of the candidates. People often create many variants (including abbreviations, plurals or even misspellings) for the same meaning,


and therefore many mined meaning candidates are actually equivalent. It is important to deduplicate these variants before sending them to the disambiguation module; this process is called candidate deduplication. In context harvesting, context words are harvested for each meaning candidate. These context words can be used to calculate context similarity with the query context.

Figure 3: Distant supervision example

For meaning candidate generation, a candidate ranking model is first trained to rank candidates with respect to the likelihood of being the genuine meaning for the target acronym. Training data is automatically generated via distant supervision. The LambdaMART algorithm is used to train the model.

After getting the ranking results, a confidence estimation step is applied to decide whether to trust the top-ranked answer. There are two motivations behind this. First, the candidate generation approach is not perfect; therefore there are cases in which the genuine meaning is not among the candidates, and for such cases the top-ranked answer is obviously incorrect. Second, the training data is biased towards the internal meanings, since external meanings may rarely appear with full names. In this step, a confidence estimation model is trained, which estimates the confidence of the top result. Similar to the ranker training, the training data here is also automatically generated: the learned ranker is run on some distantly labeled data (generated from a different corpus), and it is then checked whether the top-ranked answer is correct or not. Any classification algorithm can be used here; this system utilizes the MART boosted tree algorithm to train the model. The confidence estimation features and the candidate ranking features are summarized in the figures below.

Figure 4: Confidence estimation features

Figure 5: Candidate ranking features
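As a rough, hedged illustration of this two-stage setup (not the authors' implementation), the sketch below uses LightGBM's LambdaMART-style ranker for candidate ranking and a gradient-boosted tree classifier as a stand-in for the MART confidence estimator; the features, labels and group sizes are toy placeholders rather than distant-supervision data.

```python
import numpy as np
from lightgbm import LGBMRanker
from sklearn.ensemble import GradientBoostingClassifier

# Each row is one (acronym occurrence, candidate meaning) pair.
X = np.random.rand(8, 5)                  # e.g. popularity + context-similarity features
y = np.array([2, 0, 1, 0, 2, 1, 0, 0])    # graded relevance of each candidate
group = [4, 4]                            # 4 candidates for each of 2 acronym queries

ranker = LGBMRanker(n_estimators=20, min_child_samples=1).fit(X, y, group=group)
scores = ranker.predict(X[:4])            # rank the candidates of the first query
top = int(np.argmax(scores))

# Confidence estimation on the top-ranked answer (trust / do not trust).
conf_X = np.random.rand(6, 3)             # e.g. score margin, popularity of the winner
conf_y = np.array([1, 0, 1, 1, 0, 1])
confidence = GradientBoostingClassifier().fit(conf_X, conf_y)
print(top, confidence.predict_proba(conf_X[:1]))
```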

The proposed system is a novel, end-to-end framework to solve the acronym disambiguation problem. It takes the enterprise corpus as


input and produces a high-quality acronym disambiguation system as output. The disambiguation models are trained via distant supervised learning, without requiring any manually labeled training examples. Different from all previous acronym disambiguation approaches, the system is capable of accurately resolving acronyms to both enterprise-specific meanings and public meanings. The system can be easily deployed to any enterprise without requiring any domain knowledge.

References
[1] "Guess me if you can: Acronym Disambiguation for Enterprise", https://aclanthology.info/papers/P18-2021/p18-2021, Accessed online on 5 July 2018.
[2] "ALICE: an algorithm to extract abbreviations from MEDLINE", https://www.ncbi.nlm.nih.gov/pubmed/15905486, Accessed online on 19 May 2005.
[3] "Mining acronym meanings and their expansions using query click log", https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/fp099-taneva.pdf, Accessed online on 1 May 2013.
[4] "Acronym Disambiguation: A domain independent approach", https://arxiv.org/abs/1711.09271, Accessed online on 25 November 2017.

Being able to buy anything you want with the touch of a finger may have seemed like a fantasy a few years ago, but it's now a reality. Merging touch screen technology with one-click shopping, touch commerce allows consumers to buy products easily from their phones. After linking their payment information to a general account and enabling the feature, customers are able to buy everything from clothes to furniture with just a fingerprint.



Global Encoding for Abstractive Summarization Jannya V M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram wejannya@gmail.com

Conventional sequence-to-sequence (seq2seq) models have been used in neural abstractive summarization, but they suffer from semantic irrelevance and repetition. To improve the system, a global encoding framework is proposed, which controls the information flow from the encoder to the decoder. The framework consists of a convolutional gated unit that performs global encoding to improve the representations. Evaluations are based on the LCSTS and English Gigaword datasets. Analysis shows that the system is capable of generating summaries of higher quality and reducing repetition. Nowadays users often read journals, stories and articles, which are lengthy and repetitive, based only on a summary; therefore, abstractive summarization is important. Abstractive summarization is a sequence mapping task in which the source text is mapped to the target summary. Therefore, sequence-to-sequence learning can be applied to neural abstractive summarization, where the model consists of an encoder and a decoder. An attention mechanism has been used in seq2seq models, where the decoder extracts information from the encoder based on the attention scores over the source-side information.


Figure 1.1: Conventional attention-based seq2seq model on Gigaword

Many attention-based seq2seq models have been proposed for abstractive summarization which outperform conventional statistical methods. Recent studies show that there are many problems in the attention mechanism. Zhou pointed out that there is no obvious alignment relationship between the source text and the target summary, and that the encoder outputs contain noise for the attention. For example, in the summary generated by the seq2seq model in Figure 1.1, "officially" is followed by the same word, as the attention mechanism still attends to the word with a high attention score. An attention-based seq2seq model for abstractive summarization can suffer from repetition and semantic irrelevance, causing grammatical errors and insufficient reflection of the main idea of the source text.


To tackle this problem, the system proposes a model of global encoding for abstractive summarization. The model sets a convolutional gated unit to perform global encoding on the source context. The gate, based on a convolutional neural network (CNN), filters each encoder output based on the global context owing to the parameter sharing, so that the representations at each time step are refined with consideration of the global context. The experiments are conducted on LCSTS and Gigaword, two benchmark datasets for sentence summarization.

Figure 1.2: Structure of our proposed Convolutional Gated Unit

Attention-based seq2seq: The model is based on the seq2seq model with attention. For the encoder, a convolutional gated unit is used for global encoding. Based on the outputs from the RNN encoder, the global encoding renews the representation of the source context with a CNN to improve the connection of the word representation with the global context. The RNN encoder receives the word embedding of each word from the source text sequentially. The final hidden state, containing the information of the whole source text, becomes the initial hidden state of the decoder. The encoder is a bidirectional LSTM encoder, where the encoder outputs from both directions at each time step are concatenated.

A unidirectional LSTM decoder is implemented to read the input words and generate the summary, with a fixed target vocabulary embedded in a high-dimensional space.

Global Encoding: The model is based on the seq2seq model with attention. For the encoder, a convolutional gated unit is set for global encoding. Based on the outputs from the RNN encoder, the global encoding refines the representation of the source context with a CNN to improve the connection of the word representation with the global context.
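The PyTorch sketch below is a rough illustration of such a convolutional gated unit (detailed further below), not the authors' exact architecture: a 1-D convolution builds local features over the encoder outputs, scaled dot-product self-attention mixes in global context, and a sigmoid gate filters the original annotations; the hidden size and kernel width are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGatedUnit(nn.Module):
    def __init__(self, hidden: int = 512, kernel: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden) encoder outputs
        g = F.relu(self.conv(h.transpose(1, 2))).transpose(1, 2)   # local features
        scores = g @ g.transpose(1, 2) / math.sqrt(g.size(-1))     # self-attention
        g = torch.softmax(scores, dim=-1) @ g
        return h * torch.sigmoid(g)       # gate filters each encoder annotation

enc_out = torch.randn(2, 20, 512)         # fake batch of encoder states
print(ConvGatedUnit()(enc_out).shape)     # torch.Size([2, 20, 512])
```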


At each time step t, the decoder generates a summary word y_t from a distribution P_vocab over the target vocabulary, until the sampled token represents the end of the sentence. The global attention weight, written alpha_{t,i}, is computed from the decoder hidden state s_t, the encoder output h_i and a weight matrix W_a, and the context vector c_t is the attention-weighted sum of the encoder outputs.

Convolutional Gated Unit: A gated unit is implemented on top of the encoder outputs. Convolutional kernels are used, since convolutional units can extract features in the sentence, and, to strengthen the global information, self-attention is implemented. The gated unit sets a gate to filter the source annotations from the RNN encoder, and this gate is responsible for the connection between the annotation at each time step and the global information. In the gated unit, C refers to the cell state in the LSTM and g(.) refers to a non-linear function. Figure 1.2 shows the structure of the proposed Convolutional Gated Unit.

On top of the new representations generated by the convolution block of the CNN module, self-attention is further implemented upon these representations so as to dig out the global correlations, where the representations are computed through the attention mechanism with itself and packed into a matrix. Self-attention encourages the model to learn long-term dependencies and does not add much computational complexity, so its scaled dot-product form is used.

Finally, the system is able to reduce repetition in the generated summaries, and it is more robust to inputs of different lengths compared with the conventional seq2seq model.

References
[1] "Global Encoding for Abstractive Summarization", https://arxiv.org/abs/1805.03989, Accessed online on 28 March 2019.


[2] "Deep Recurrent Generative Decoder for Abstractive Text Summarization", https://aclweb.org/anthology/D17-1222, Accessed online on 28 March 2019.
[3] "Incorporating Copying Mechanism in Sequence-to-Sequence Learning", https://www.aclweb.org/anthology/P16-1154, Accessed online on 28 March 2019.
[4] "LCSTS: A Large Scale Chinese Short Text Summarization Dataset", https://arxiv.org/abs/1506.05865, Accessed online on 28 March 2019.


Emotion Based Priority Prediction for Bug Reports Rengitha R M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram rengi23.r@gmail.com

An open bug repository, also known as an issue tracking system, has been widely adopted by software projects to support software development. Such bug reporting systems include Bugzilla, Mantis, Google Code Issue Tracker, GitHub Issue Tracker, and JIRA. These bug tracking systems often rely on unstructured natural language bug descriptions. Bug reports contain the product name, product component, description, and severity. Based on such information, triagers or developers often manually prioritize the bug reports for investigation. The manual prioritization of bug reports is a time-consuming task. In order to automate priority prediction, automated approaches that predict the priority of bug reports can be used. To facilitate the automated task, emotion analysis of bug reports is used. To automate the priority of bug reports, the system uses natural language processing techniques and machine learning.


This system presents an emotion-based approach to predict the priority of bug reports. To facilitate the automated task, emotion analysis of bug reports is employed. The emotion words in the summary of a bug report can explain the positive and negative feelings of the reporter. The history data of bug reports is extracted and natural language processing techniques are applied to pre-process each bug report. From the pre-processed bug reports, feature modelling is performed to identify the useful features of each bug report for training. As hidden emotion may affect the priority of bug reports, the system performs an emotion analysis to identify emotion words in each bug report using an emotion-based corpus and assigns it an emotion value. Finally, a classifier is trained and tested with the resulting features. This approach categorizes the bug reports into five classes: P1, P2, P3, P4 or P5, where P1 has the top priority and P5 has the least priority.


Lemmatization converts all filtered words into their root form; for example, there is no difference between write and writes. The preprocessing of a bug report can be formalized as

br = <d', p>, d' = <t1, t2, ..., tn>,

where d' is the preprocessed textual description of bug report br, p represents its priority, and t1, t2, ..., tn represent the n terms or tokens involved in d'.

Figure 1: Overview of the system.

Data Acquisition: Software bugs are generally reported into issue tracking systems, which help in continuous monitoring of reported bugs. A bug report br from a set of bug reports BR can be formalized as br = <d, p>, where d represents the textual information and p represents the associated priority of the bug report.

Preprocessing: The preprocessing techniques include tokenization, Parts of Speech (POS) tagging, stop-word removal, and lemmatization. Tokenization breaks a sequence of text in a document into words; each word is called a token and is tagged with a POS tag after misspelling correction. Stop-word removal: frequently used words like the, in, am, are, is, I, he and that, which carry little meaning on their own, are known as stop-words. Such words do not carry much information in the context of a bug report.
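A minimal sketch of this preprocessing chain using NLTK is shown below; the bug-report summary is a toy placeholder, and the NLTK resource names may differ slightly across versions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for pkg in ("punkt", "averaged_perceptron_tagger", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

summary = "The editor crashes and writes corrupted files when saving"
tokens = nltk.word_tokenize(summary.lower())            # tokenization
tagged = nltk.pos_tag(tokens)                           # POS tagging
stop = set(stopwords.words("english"))
filtered = [w for w, _ in tagged if w.isalpha() and w not in stop]  # stop-word removal
lemmatizer = WordNetLemmatizer()
d_prime = [lemmatizer.lemmatize(w) for w in filtered]   # lemmatization -> d' tokens
print(d_prime)
```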


Figure 2: Overview of preprocessing.

Emotion Value Calculation: In order to calculate the emotion value of each bug report, the words from each preprocessed bug report are taken into consideration and compared with the list of words from the emotion word corpus. The emotion corpus contains positive and negative scores for


each word. The sum of the positive and negative scores of the emotion words of each bug report is used to calculate its emotion score. To assign an emotion value to each bug report, the words of each bug report are filtered as emotion words if they are in the corpus, and ignored otherwise.

Feature extraction: In order to build a model to predict the priority of a new bug report, bug reports are selected from Bugzilla, analyzed and examined to identify feature words from the summary of each bug report. Additionally, the emotion value of each bug report is calculated using the emotion words in the summary of the bug report. Finally, a high-dimensional matrix is created where each bug report is a row of the matrix and its emotion value and feature words are the columns of the matrix. A feature vector can be defined as br = <emo, f1, f2, ..., fn>, where br represents the bug report, emo is the calculated emotion value of the bug report and f1, f2, ..., fn represent the feature words.

Training and Prediction: Training and prediction are the two main steps of the classification module. The feature vectors produced by the feature extraction module are used for training and testing. The proposed approach utilizes a Support Vector Machine (SVM) to capture the relationship between the extracted features and the priority levels of bug reports. The support vector machine is used for classification for the following reasons. First, it scales relatively well to high-


dimensional data sets using the kernel trick. Second, the trade-off between classifier complexity and error can be controlled explicitly. Third, a flexible threshold can be applied during the selection of priority levels for the imbalanced dataset. In training, a model is built capturing the relationship between an explanatory variable and a dependent value: the set of bug reports is the explanatory variable and the priority levels are the dependent value. To train the support vector classifier, a decision surface is used for the priority prediction of each bug report, dividing the reports into five categories P1, P2, P3, P4, and P5. The decision plane, called a hyper-plane, is used for the training of this approach. Once the support vectors are defined with the training set, each bug report from the testing dataset may be prioritized by comparing the results of the decision function for each priority class defined in training; finally, the results of all priority classes are compared to pick the best one. The system helps users and developers by assigning an appropriate priority level to future bug reports in an automated way and saves the valuable time of developers.
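The sketch below illustrates the pipeline described above with scikit-learn: an emotion score computed from a small positive/negative word list is appended to bag-of-words features, and an SVM predicts the priority class; the word list, reports and labels are toy placeholders, not the emotion corpus or Bugzilla data used in the article.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

EMOTION_SCORES = {"crash": -0.8, "annoying": -0.5, "great": 0.6, "fails": -0.7}

def emotion_value(summary: str) -> float:
    # Sum the positive and negative scores of the emotion words in the report.
    return sum(EMOTION_SCORES.get(w, 0.0) for w in summary.lower().split())

reports = ["App crash on startup, really annoying", "Minor typo in help text",
           "Payment fails for all users", "Great feature but slow to load"]
priorities = ["P1", "P4", "P1", "P3"]

vec = CountVectorizer()
bow = vec.fit_transform(reports).toarray()
emo = np.array([[emotion_value(r)] for r in reports])
X = np.hstack([emo, bow])                                # br = <emo, f1, ..., fn>

clf = SVC(kernel="rbf").fit(X, priorities)
test = "Login crash makes the app unusable"
x = np.hstack([[emotion_value(test)], vec.transform([test]).toarray()[0]])
print(clf.predict([x]))
```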

References
[1] "Emotion Based Automated Priority Prediction for Bug Reports", https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8401501, Accessed online on 28 March 2019.


[2] "Bug Reports Prioritization: Which Features and Classifier to Use", https://ieeexplore.ieee.org/document/6786091, Accessed online on 28 March 2019.
[3] "A Novel Way of Assessing Software Bug Severity Using Dictionary of Critical Terms", https://www.sciencedirect.com/science/article/pii/S1877050915032238, Accessed online on 28 March 2019.

Robotic 'gray goo': Researchers have demonstrated for the first time a way to make a robot composed of many loosely coupled components, or 'particles.' Unlike swarm or modular robots, each component is simple, and has no individual address or identity. In their system, which the researchers call a 'particle robot,' each particle can perform only uniform volumetric oscillations (slightly expanding and contracting), but cannot move independently. The concept of "gray goo," a robot comprised of billions of nanoparticles, has fascinated science fiction fans for decades. But most researchers have dismissed it as just a wild theory.

Ultra-low power chips help make small robots more capable: An ultra-low power hybrid chip inspired by the brain could help give palm-sized robots the ability to collaborate and learn from their experiences. Combined with new generations of low-power motors and sensors, the new application-specific integrated circuit (ASIC), which operates on milliwatts of power, could help intelligent swarm robots operate for hours instead of minutes.

Mini cheetah is the first four-legged robot to do a backflip: MIT's new mini cheetah robot is springy and light on its feet, with a range of motion that rivals a champion gymnast. The four-legged powerpack can bend and swing its legs wide, enabling it to walk either right-side up or upside down. The robot can also trot over uneven terrain about twice as fast as an average person's walking speed. The researchers will present the mini cheetah's design at the International Conference on Robotics and Automation, in May. They are currently building more of the four-legged machines, aiming for a set of 10, each of which they hope to loan out to other labs.



An Approach to Feature Extraction for Identification of Suicidal Ideation in Tweets Vishnu Prasad S6 Computer Science and Engineering Government Engineering College, Sreekrishnapuram vishnu1998prasad@gmail.com

According to the World Health Organization, suicide is the second leading cause of death among 15-29 year-olds across the world. In fact, close to 800,000 people die due to suicide each year, and the number of people who attempt suicide is much higher. While an individual suicide is often a solitary act, it can have a devastating impact on families. Many suicide deaths are preventable, and it is important to understand the ways in which individuals communicate their depression and thoughts in order to prevent such deaths. Suicide prevention mainly hinges on surveillance and monitoring of suicide attempts and self-harm tendencies. The younger generation has started to turn to the Internet (Chan and Fang, 2007) for seeking help, discussing depression and suicide related information and offering support. The availability of suicide related material on the Internet plays an important role in the process of suicide ideation. Due to this increasing availability of content on social media websites (such as Twitter, Facebook and Reddit) and blogs (Yates et al., 2017), there is an urgent need to identify affected individuals and offer help. Suicidal ideation refers to thoughts of killing oneself or planning


suicide, while suicidal behavior is often defined to include all possible acts of self-harm with the intention of causing death (Costello et al., 2002). Although Twitter provides a unique opportunity to identify at-risk individuals (Jashinsky et al., 2014) and a possible avenue for intervention at both the individual and social level, there exist no best practices for suicide prevention using social media. While there is a developing body of literature on the topic of identifying patterns in the language used on social media that expresses suicidal ideation (De Choudhury et al., 2016), very few attempts have been made to employ feature extraction methods for binary classifiers that separate text related to suicide from text that clearly indicates the author exhibiting suicidal intent. A number of successful models (Yates et al., 2017) have been used for sentence level classification; however, models that can learn to separate suicidal ideation from depression, as well as from less worrying content such as reporting of a suicide, memorials, campaigning, and support, require greater analysis to select more specific features and methods in order to build an accurate and robust model. The drastic impact that


suicide has on the surrounding community, coupled with the lack of specific feature extraction and classification models for the identification of suicidal ideation on social media, motivates this work. Suicide prevention by suicide detection (Zung, 1979) is one of the most effective ways to drastically reduce suicide rates. The major practical application of this work lies in its easy adaptability to any social media forum (Robinson et al., 2016), wherein it can be used directly for analyzing text-based content posted by users and flagging content that is concerning. The main contributions of the system are:

• The creation of a labeled dataset, built by manual annotation, for learning the patterns in tweets exhibiting suicidal ideation.

• A proposed set of features to be fed into classifiers to improve performance.

• Four binary classifiers employed with the proposed set of features and compared against baselines utilizing varied approaches to validate the proposed methodology.

Suicidal intent present:

• The text conveys a serious display of suicidal ideation; e.g., "I want to die", "I want to kill myself" or "I wish my last suicide attempt was successful".

• Care was taken to classify only those posts as suicidal where the suicide risk is not conditional, unless the conditioning event is a clear risk factor, e.g., depression, bullying or substance use.

• Posts where a suicide plan and/or previous attempts are discussed; e.g., "The fact that I tried to kill myself and it didn't work makes me more depressed."

• The tone of the text is sombre and not flippant; e.g., "This makes me want to kill myself, lol" and "This day is horrible, I want to kill myself hahaha" are not included in this category.

Suicidal intent absent:

• The default category for all posts.

• Posts emphasizing suicide-related news or information; e.g., "Two female suicide bombers hit crowded market in Maiduguri."

• Posts such as "Suicide squad sounds like a good option", where there is no reasonable evidence to suggest that the risk of suicide is present; posts containing song lyrics, etc., were also marked within this category.

• Posts pertaining to condolence and suicide awareness; e.g., "5 suicide prevention helplines in India you need to know about", "Politician accused of driving his wife to suicide".

The overall methodology is divided into three phases. The initial phase consists of preprocessing the text within a tweet, the second phase involves feature extraction from preprocessed tweets for the training and


testing of binary classifiers for suicidal ideation identification, and the final phase actually classifies and identifies tweets exhibiting suicidal ideation.

Preprocessing: Preprocessing is achieved by applying a series of filters, in the order given below, to the raw tweets:

1. Identification and elimination of user mentions in tweet bodies having the format @username, of URLs, as well as of retweets in the format RT.
2. Removal of all hashtags with length > 10, due to a great volume of hashtags being concatenated words, which tends to amplify the vocabulary size inadvertently.
3. Stopword removal.

Feature Extraction: Tweets exhibiting suicidal ideation lack a semi-rigid, pre-defined lexico-syntactic pattern. Hence, they warrant the use of hand engineering and analysis of a set of features (Wang et al., 2016), in contrast to sentence and word embeddings in a supervised setting using deep learning models such as Convolutional Neural Networks (CNN) (Kim, 2014). The proposed methodology utilizes the following set of features for classification:

• Statistical features.

• LIWC features.

• Part-of-speech counts.

Classification: Suicidal ideation detection is formulated as a supervised binary classification problem. For every tweet ti ∈ D, the document set, a binary-valued variable yi ∈ {0, 1} is introduced, where yi = 1 denotes that the tweet ti exhibits suicidal ideation. To learn this, the classifiers must determine whether any sentence in ti possesses a certain structure or keywords that mark the existence of any possible suicidal thoughts. The features presented above are then used to train classification models to identify tweets exhibiting suicidal ideation. Linear classifiers such as Logistic Regression as well as ensemble classifiers including Random Forest (Liaw et al., 2002), Gradient Boosting Decision Tree (Friedman, 2002) and XGBoost (Chen and Guestrin, 2016) are employed for classification.
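A toy sketch of this classification stage is given below; a few hand-crafted statistical features stand in for the article's full feature set, and the tweets and labels are invented placeholders rather than the annotated dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def statistical_features(tweet: str) -> list:
    tokens = tweet.lower().split()
    return [len(tokens),                                            # tweet length
            sum(t in {"i", "me", "my", "myself"} for t in tokens),  # first-person use
            tweet.count("!")]                                       # exclamation marks

tweets = ["I want to kill myself, nothing matters anymore",
          "Two suicide bombers hit a crowded market",
          "I wish my last attempt had worked, I am so tired",
          "5 suicide prevention helplines you need to know about"]
labels = np.array([1, 0, 1, 0])                  # 1 = suicidal ideation present

X = np.array([statistical_features(t) for t in tweets])
for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=50)):
    print(type(clf).__name__, cross_val_score(clf, X, labels, cv=2).mean())
```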


This model helps to analyze tweets by developing a set of features to be fed into classifiers for the identification of suicidal ideation using machine learning. When annotated by humans, 15.76% of the total dataset of 5213 tweets was found to be suicidal. Both linear and ensemble classifiers were employed to validate the selection of features proposed for suicidal ideation detection. Comparisons with baseline models employing various strategies such as negation resolution, LSTMs and rule-based methods were also performed. The major contribution of this work is the improved performance of the Random Forest classifier as compared to the other classifiers as well as the baselines. This indicates the promise of the proposed set of features with a bagging-


based approach with minimal correlation, as compared to other classifiers. In the future, there is scope for larger amounts of data to be scraped from more social media websites, as well as to investigate the performance with deep learning models such as CNNs, LSTM-CNNs, etc.

References
[1] "A Computational Approach to Feature Extraction for Identification of Suicidal Ideation in Tweets", https://aclweb.org/anthology/P18-3013, Accessed online on 29 March 2019.
[2] "Use of the internet and traditional media among young people", https://repository.hkbu.edu.hk/cgi/viewcontent.cgi?article=1099&context=coms_ja, Accessed online on 29 March 2019.
[3] "Convolutional neural networks for sentence classification", https://arxiv.org/pdf/1408.5882.pdf, Accessed online on 29 March 2019.

Cognitive technology is in the same vein as machine learning and virtual reality, except that it's a broader concept. For example, the cognitive technology umbrella includes things like natural language processing and speech recognition. Combined, these different technologies are able to automate and optimize a lot of tasks that were previously done by people, including certain aspects of accounting and analytics.

Faster robots demoralize co-workers: A Cornell University-led team has found that when robots are beating humans in contests for cash prizes, people consider themselves less competent and expend slightly less effort, and they tend to dislike the robots. The study, "Monetary-Incentive Competition Between Humans and Robots: Experimental Results," brought together behavioral economists and roboticists to explore, for the first time, how a robot's performance affects humans' behavior and reactions when they're competing against each other simultaneously.


M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com geccl1820@googlegroups.com

SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering

Article Invitation for CLEAR - JUNE 2019

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of the CLEAR (Computational Linguistics in Engineering and Research) Journal, publishing in June 2019. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th June 2019 at the following email: geccl1820@googlegroups.com. For more details visit: www.simplegroups.in

CLEAR Journal


SIMPLE Groups


Hello Everyone, This latest edition of the CLEAR journal comes with some trending topics: Personalized Image Captioning, the MetaSoundex Algorithm, Acronym Disambiguation, Abstractive Summarization, Emotion-based Priority Prediction, and Feature Extraction for Identification of Suicidal Ideation in Tweets. The first article deals with image captioning with Context Sequence Memory Networks; image captioning is the task of automatically generating a descriptive sentence for an image. The second article deals with Privacy Preserving Record Linkage, which aims at linking records that refer to the same real-world entity, such as persons. The third article is about Acronym Disambiguation; it deals with a framework which automatically reduces the ambiguities. In the next article, a global encoding technique is introduced for abstractive summarization to address semantic irrelevance and repetition; the approach consists of a convolutional gated unit that performs global encoding to improve representations. Emotion-based Priority Prediction presents an emotion-based approach to predict the priority of bug reports. The final article is about Identification of Suicidal Ideation in Tweets; there is a developing body of literature on identifying patterns in the language used on social media that expresses suicidal ideation, and the article deals with exactly that concept. These articles are based on recent works and research in the field of computational linguistics. CLEAR is thankful to all who have given their valuable time and effort to contribute their thoughts and ideas. The Simple group invites more aspirants in this field. Wish you all success in your future endeavours!!!

-AISWARYA K SURENDRAN


