CLEAR JUNE 2019
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

VISION
To become a Centre of Excellence in Computing and allied disciplines.

MISSION
To impart high quality education in Computer Science and Engineering that prepares the students for rewarding, enthusiastic and enjoyable careers in industry, academia and other organizations.
COURSES:
M.Tech in Computational Linguistics
B.Tech in Computer Science & Engineering

ACTIVITIES:
Machine Learning Club
FOSS Clubs, IEEE SB
CLEAR Journal (Computational Linguistics in Engineering And Research) M. Tech Computational Linguistics, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad-678633 www.simplegroups.in simplequest.in@gmail.com
CONTENTS:
Editorial
Invitation
Last word
geccl1820@googlegroups.com
Chief Editor
Shibily Joseph, Assistant Professor, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad-678633

Editors
Aiswarya K Surendran
Divya Visakh
Sreeja V

Cover page: Divya Visakh
Layout: Divya Visakh
➢ A User-Centric Machine Learning Framework for Cyber Security Operations Center (Anu P. C)
➢ Commonsense LocatedNear Knowledge: Automatic Extraction (Sari H)
➢ CNN for Text-Based Multiple Choice Question Answering (Shameeha N)
➢ Detection of Ironic Tweets for Sentiment Analysis (Thasneema Mullappally)
Dear Readers,

Here is the latest edition of CLEAR Journal, which comes with new articles on trending topics: A User-Centric Machine Learning Framework for Cyber Security Operations Center; Commonsense LocatedNear Knowledge: Automatic Extraction; CNN for Text-Based Multiple Choice Question Answering; and Detection of Ironic Tweets for Sentiment Analysis. The previous edition covered articles on topics such as Word Sense Disambiguation, Sense-Aware Neural Models, Text Categorization using Disconnected Recurrent Neural Networks, and Intelligent Question Answering Systems. We are very happy to have gained new readers, which gives us great motivation to keep improving. As always, we are working on the journal based on your valuable feedback, and we look forward to more. On this hopeful note, I proudly present this edition of CLEAR Journal to our faithful readers and look forward to your opinions and criticisms.
BEST REGARDS, Shibily Joseph (Chief Editor)
Workshop on Deep Learning
A one-day workshop on "Deep Learning and MATLAB" was held at GEC Sreekrishnapuram on 27th March 2019, conducted by Mr. Mani Sankar. The workshop took place in the Edusat room for M.Tech students. The basics of deep learning and MATLAB were introduced to the students using the MATLAB toolkit and sample programs.
Results
100% result for the third-semester M.Tech Computational Linguistics 2017-2019 batch. Remarkable result for the first-semester M.Tech Computational Linguistics 2018-2020 batch.
Simple Group Congratulates All for their Achievements!!!
A User-Centric Machine Learning Framework for Cyber Security Operations Center Anu P C M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram anuanupc9@gmail.com
Cyber security incidents can cause significant financial and reputational impacts on enterprises. In order to detect malicious activities, a SIEM (Security Information and Event Management) system is built in companies and government organizations. The SIEM system is used to normalize security events from different preventive technologies and to generate alerts. Analysts in the security operations center (SOC) investigate the alerts to decide whether each is truly malicious. Generally, the volume of alerts is overwhelming, the majority are false positives, and the total exceeds the SOC's capacity to handle them all. Because of this, potential malicious attacks and compromised hosts may be missed. Machine learning is a powerful approach to reduce the false positive rate and improve the productivity of SOC analysts. The whole machine learning system is implemented in a production environment and fully automated, from data acquisition and daily model refreshing to real-time scoring, which greatly improves SOC analysts' efficiency and enhances enterprise risk detection and management. The technique has four broad phases:
• Raw Data and Data Preprocessing
• User Feature Engineering and Label Generation
• Machine Learning Algorithms and Implementations
• Model Validation, Implementation and Active Learning
Phase 1: The raw data is collected from Symantec internal security logs. It consists of alerts from the SIEM system, notes from analysts' investigations, and logs from different sources, including firewall, IDS/IPS, HTTP/FTP/DNS traffic, DHCP, vulnerability scanning, Windows security events, VPN and so on. For network traffic data and other data sources with dynamic IP addresses, entity resolution is required: in large enterprises, many internal IP addresses are dynamically assigned, so the IP address of a user changes over time. Without accurate dynamic-IP-to-user mapping, correlating activities across different network logs is challenging and inaccurate. IP-to-user mapping is applied to the network traffic data to make sure that the user ID is appended as the primary key for data summarization, data correlation and feature engineering.
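The article gives no code for this entity resolution step; the sketch below is one minimal way to build an interval-based IP-to-user map from DHCP lease records. The record layout and all names here are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

def build_ip_map(dhcp_leases):
    """Index DHCP leases (ip, start_ts, end_ts, user) so that lookups
    by (ip, timestamp) are fast."""
    ip_map = defaultdict(list)
    for ip, start_ts, end_ts, user in dhcp_leases:
        ip_map[ip].append((start_ts, end_ts, user))
    for ip in ip_map:
        ip_map[ip].sort()  # sort leases by start time
    return ip_map

def resolve_user(ip_map, ip, ts):
    """Return the user a dynamic IP was assigned to at time ts, or None."""
    leases = ip_map.get(ip, [])
    # find the last lease that started at or before ts
    i = bisect_right(leases, (ts, float("inf"), "")) - 1
    if i >= 0:
        start_ts, end_ts, user = leases[i]
        if start_ts <= ts <= end_ts:
            return user
    return None

# Usage: tag each network-log event with a user ID before correlation.
leases = [("10.0.0.5", 100, 200, "alice"), ("10.0.0.5", 250, 400, "bob")]
ip_map = build_ip_map(leases)
print(resolve_user(ip_map, "10.0.0.5", 300))  # -> "bob"
```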
Finally, analysts' investigation notes are usually stored in a ticketing system as free-form text. The notes typically include the reason why an alert was triggered, supporting information from internal system logs and external resources (such as VirusTotal and IPVoid), and the investigation conclusion on whether the alert is a true positive.

Phase 2: Features are created at the individual user level, as the main goal is to predict each user's risk. Over 100 features describe a user's behavior, including summary features created from statistical summaries (number of alerts per day), temporal features generated from time series analysis (event arrival rate), and relational features derived from social graph analysis (user centrality from the user-event graph). After all features are generated, a target or "label" must be attached for the machine learning models. The initial labels are created by mining analysts' investigation notes: text mining techniques, such as keyword/topic extraction and sentiment analysis, are used to extract each user's label from the notes. Of the users with annotations, generally very few (<2%) are marked as "risky" after text mining. There are two concerns if only these users are used for machine learning:
• The majority of users, who have no annotations, are left out of the model, but they may carry valuable information.
• Many machine learning models do not work well on highly unbalanced classification problems.
To alleviate these two issues, label propagation techniques are used to derive more labels. The main idea is that, given knowledge about certain risky users, other users with "similar" behaviour can also be labeled as risky.
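The label propagation step is only described at the idea level; below is a minimal sketch of one possible instantiation, using cosine similarity over the per-user feature vectors to spread the "risky" label. The threshold and the data layout are assumptions, not from the article.

```python
import numpy as np

def propagate_labels(features, risky_idx, threshold=0.9):
    """Label as risky every user whose behaviour vector has cosine
    similarity >= threshold to some already-known risky user.
    (The seed users trivially match themselves and are returned too.)"""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = X @ X[risky_idx].T          # cosine similarity to each seed user
    return np.where(sims.max(axis=1) >= threshold)[0]

# Rows are per-user feature vectors; users 0 and 3 were marked risky by
# text-mining the analysts' notes.
X = np.array([[5.0, 1.0], [4.9, 1.1], [0.1, 3.0], [4.0, 4.0]])
print(propagate_labels(X, risky_idx=[0, 3]))   # -> array([0, 1, 3])
```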
Phase 3: A Multi-layer Neural Network (MNN) with two hidden layers, a Random Forest (RF) with 100 Gini-split trees, a Support Vector Machine (SVM) with a radial basis function kernel, and Logistic Regression (LR) are the machine learning methods used here. For model performance measures, as is common practice, the modeling data is randomly split into training and testing sets, and the different models are evaluated on held-out test data. In contrast to the AUC, which evaluates a model on the whole test data, detection rate and lift reflect how good the model is at discovering risky users among different portions of its predictions. To calculate these two metrics, the results are first sorted by model score (here, the probability of a user being risky) in descending order. Detection rate measures the effectiveness of a classification model as the ratio between the results obtained with and without the model. For example, suppose there are 60 risky users in the test data and the model captures 30 of them in its top 10% of predictions; the detection rate is 30/60 = 50%. Lift measures how many times better it is to use the model than not to use it. Using the same example, if the test data has 5,000 users, the lift is (30/500)/(60/5000) = 5. Higher lift implies better model performance on those top predictions.
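These two definitions translate directly into code. A minimal sketch, with the article's own worked example in the comments:

```python
def detection_rate_and_lift(scores, labels, top_frac=0.10):
    """Compute detection rate and lift for the top fraction of predictions,
    as defined above. `scores` are model risk scores; `labels` are 1 for
    truly risky users and 0 otherwise."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    k = int(len(ranked) * top_frac)
    caught = sum(label for _, label in ranked[:k])   # risky users in top k
    total_risky = sum(labels)
    detection_rate = caught / total_risky
    lift = (caught / k) / (total_risky / len(labels))
    return detection_rate, lift

# The article's worked example: 5,000 users, 60 risky, 30 of them in the
# top 10% (500 users) -> detection rate 30/60 = 0.5, lift 0.06/0.012 = 5.0.
```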
Phase 4: To validate the effectiveness of the machine learning system, one month of model running results is taken and the performance measures are calculated, splitting the data randomly into training (75% of the samples) and testing (the remaining 25%) sets. It is promising that the Random Forest is able to detect 80% of the truly risky cases within its top 20% of predictions. Finally, the model lift is also evaluated on the top 5% to 20% of predictions. For the top 5% of predictions, the Multi-layer Neural Network achieves a lift of 6.82, meaning it is almost 7 times better than the current rule-based system. Looking at the average lifts on the top 5% to 20% of predictions, the Multi-layer Neural Network is the highest, which is very encouraging. Currently the machine learning system is running in a real enterprise production environment. The features and labels are updated daily from historical data. The machine learning model is then refreshed and deployed to the scoring engine daily to make sure it captures the latest patterns in the data. After that, risk scores are generated in real time when new alerts are triggered, so SOC analysts can act right away on high-risk users. Finally, SOC analysts' notes are collected and fed back into the historical data for future model refinement. The whole process is streamlined automatically from data integration to score generation, and the system actively learns from new insights generated by analysts' investigations.

References
[1] "A novel intrusion detection method using deep neural network for in-vehicle network security", Vehicular Technology Conference, 2016.
[2] "Reducing false positives in intrusion detection systems using data-mining techniques utilizing support vector machines, decision trees, and naive Bayes for off-line analysis", SoutheastCon, 2016.
[3] "A comparative analysis of SVM and its stacking with other classification algorithm for intrusion detection", Advances in Computing, Communication, & Automation (ICACCA), 2016.
[4] “Comparative analysis of machine learning algorithms along with classifiers for network intrusion detection”, Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), 2015.
Killer robots are not science fiction – they have been part of military defence for a while
Humans will always make the final decision on whether armed robots can shoot, according to a statement by the US Department of Defense. The clarification comes amid fears about a new advanced targeting system, known as ATLAS, that will use artificial intelligence in combat vehicles to target and engage threats. While the public may feel uneasy about so-called "killer robots", the concept is nothing new: machine-gun-wielding "SWORDS" robots, for instance, have already seen military deployment. "Robot" can mean any technology with some form of autonomous element that allows it to perform a task without the need for direct human intervention. Autonomous systems have long been embedded in the military, and we should prepare ourselves for the consequences.
Commonsense LocatedNear Knowledge: Automatic Extraction Sari H M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram sariskp@gmail.com
LocatedNear is a kind of commonsense relation describing two physical objects that are typically found near each other in real life. Systems work by incorporating commonsense knowledge as background, such as ice is cold (HasProperty) or chair and table are typically found near each other (LocatedNear). These kinds of commonsense facts are used in many tasks, such as textual entailment and visual recognition. The main aim of the system is to automatically extract the commonsense LocatedNear relation between physical objects from textual corpora. The LocatedNear relation is emphasized for these reasons:
• LocatedNear facts provide helpful prior knowledge to object detection tasks in complex image scenes.
• This commonsense knowledge can benefit reasoning about spatial facts and physical scenes in reading comprehension, question answering, etc.
• Existing knowledge bases have very few facts for this relation.
Two novel tasks for extracting the LocatedNear relation from textual corpora are carried out. One is a sentence-level relation classification problem, which judges whether or not a sentence describes two objects (mentioned in the sentence) being physically close by. The other task is to produce a ranked list of LocatedNear facts from the classification results over a large number of sentences. Both tasks are used to automatically populate existing commonsense knowledge bases. Given a sentence s mentioning a pair of physical objects <ei, ej>, the pair <s, ei, ej> is called an instance. For each instance, the problem is to determine whether ei and ej are located near each other in the physical scene described in the sentence s. For example, suppose ei is 'dog', ej is 'cat' and s = "The King puts his dog and cat on the table". As the two objects are indeed located near each other in this sentence, a successful classification model is expected to label this instance as True. However, if s2 = "My dog is older than her cat", then the label of the instance <s2, ei, ej> is False, because s2 just makes a comparison of ages. The baseline methods used for the binary classification task are feature-based methods and LSTM-based neural architectures.
In the feature-based method, an SVM classifier is used over hand-crafted features. The feature sets used are:
• Bag of Words (BW): the set of words that appear in the sentence.
• Bag of Path Words (BPW): the set of words that appear on the shortest dependency path between the two objects in the dependency tree of the sentence.
• Bag of Adverbs and Prepositions (BAP): the existence of adverbs and prepositions in the sentence.
• Global Features (GF): the length of the sentence and the number of nouns, verbs, etc. in the sentence.
• Shortest Dependency Path features (SDP): the same features as GF, but computed over the dependency parse tree of the sentence and the shortest path between the two objects.
• Semantic Similarity features (SS): the cosine similarity between the pre-trained GloVe word embeddings of the two object words (illustrated in the sketch below).
The working of the LSTM-based neural architecture is described next. The existence of the LocatedNear relation in an instance <s, ei, ej> depends on two major information sources: the semantic and syntactic features of sentence s, and the object pair <ei, ej>. The design of the two-part LSTM-based model is shown in the lower part of Figure 1: the left part encodes the syntactic and semantic information of the sentence s, while the right part encodes the semantic similarity between the pre-trained word embeddings of ei and ej.
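Both the SS feature and the right branch of the LSTM model rely on cosine similarity between pre-trained GloVe vectors of the two object words. A minimal illustration of the BW and SS features follows; the toy embedding table stands in for real pre-trained GloVe vectors.

```python
import numpy as np

def bag_of_words(tokens, vocab):
    """BW feature: binary indicator over a fixed vocabulary."""
    present = set(tokens)
    return np.array([1.0 if w in present else 0.0 for w in vocab])

def semantic_similarity(glove, e1, e2):
    """SS feature: cosine similarity of the two objects' word embeddings."""
    v1, v2 = glove[e1], glove[e2]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Toy stand-in for pre-trained GloVe embeddings (illustration only).
glove = {"dog": np.array([0.9, 0.1, 0.3]), "cat": np.array([0.8, 0.2, 0.4])}
print(semantic_similarity(glove, "dog", "cat"))   # close to 1.0
```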
Two problems with the LSTM-based neural architecture are that irrelevant words in the sentence can introduce noise into the model, and that the large vocabulary of the original sentences induces too many parameters, which may cause over-fitting.
Figure 1: LSTM-based neural architecture.
To address these issues, a normalized sentence representation method is proposed, merging the three most important and relevant kinds of information about each instance: lemmatized forms, POS (part-of-speech) tags, and dependency roles.
For the sentence normalization, the two nouns in the object pair are first replaced with "E1" and
"E2", and the lemmatized forms of the original words are kept for all the verbs, adverbs and prepositions, which are highly relevant to describing physical scenes. Then, the subjects and direct objects of the verbs and prepositions are replaced with special tokens indicating their dependency roles. The remaining words are simply replaced by their POS tags. Figure 1 shows a real example of the normalized sentence representation, where the object pair is <dog, garden>. To capture more structural information about the sentence, the distance from each token to E1 and E2 is also encoded. An LSTM then encodes the whole sequence of tokens of the normalized representation plus the position embeddings. Meanwhile, the pre-trained GloVe word embeddings of the two physical object words are fed into a hidden dense layer. Finally, both outputs are concatenated and a sigmoid activation function produces the final prediction. Binary cross-entropy is used as the loss function and RMSProp as the optimizer; a dropout rate of 0.5 is used in the LSTM and the embedding layer to prevent over-fitting. The upper part of Figure 1 shows the overall workflow of the automatic framework that mines LocatedNear relations from raw text. For this purpose, a vocabulary of physical objects is constructed and all candidate instances are generated. For each sentence in the corpus, if a pair of physical objects ei and ej appears as nouns in a sentence s, the sentence-level relation classifier is applied to this instance. The relation classifier yields a
probabilistic score indicating the confidence that the LocatedNear relation holds for the instance. Finally, all instance scores from the corpus are grouped by object pair and aggregated, so each object pair is associated with a final score. These mined physical-object pairs with scores can easily be integrated into existing commonsense knowledge bases.
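Below is a minimal sketch of the classifier just described, written in Keras (the article does not specify a framework, and all layer sizes here are hypothetical): normalized tokens plus position encodings go through an LSTM, the object pair's GloVe vectors go through a dense layer, and the concatenated outputs feed a sigmoid unit trained with binary cross-entropy, RMSProp and dropout 0.5.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical sizes: the article fixes the structure, not these numbers.
VOCAB, MAXLEN, EMB, GLOVE_DIM = 500, 40, 64, 100

tokens = keras.Input(shape=(MAXLEN,), dtype="int32")   # normalized tokens
positions = keras.Input(shape=(MAXLEN, 2))             # distances to E1/E2
objects = keras.Input(shape=(2 * GLOVE_DIM,))          # GloVe of e1 and e2

x = layers.Embedding(VOCAB, EMB)(tokens)
x = layers.Dropout(0.5)(x)                     # dropout on the embeddings
x = layers.Concatenate()([x, positions])       # append position encodings
x = layers.LSTM(64, dropout=0.5)(x)            # left branch: sentence

o = layers.Dense(64, activation="relu")(objects)  # right branch: object pair

out = layers.Dense(1, activation="sigmoid")(layers.Concatenate()([x, o]))
model = keras.Model([tokens, positions, objects], out)
model.compile(optimizer="rmsprop", loss="binary_crossentropy")
```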
The LSTM model with multi-level sentence normalization turns out to be more effective than the feature-based methods and an LSTM over raw sentences. In the future, the system can better leverage distant supervision to reduce human effort, incorporate knowledge graph embedding techniques, and apply the LocatedNear knowledge to downstream applications in computer vision and natural language processing.
References
[1] "Automatic Extraction of Commonsense LocatedNear Knowledge", https://aclweb.org/anthology/P18-2016, Accessed online on 31 May 2019.
[2] "Commonsense Knowledge Base Completion", https://aclweb.org/anthology/P16-1137, Accessed online on 31 May 2019.
[3] "GloVe: Global Vectors for Word Representation", https://nlp.stanford.edu/pubs/glove.pdf, Accessed online on 31 May 2019.
CNN for Text-Based Multiple Choice Question Answering Shameeha N M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram shameeha99@gmail.com
The task of question answering is at the very core of machine comprehension. Answering questions based on a particular text requires a diverse skill set: look-up ability, the ability to deduce, the ability to perform simple mathematical operations (e.g. to answer questions like "how many times did the following word occur?"), and the ability to merge information contained in multiple sentences. This diverse skill set makes question answering a challenging task. A new approach for text-based question answering uses a Convolutional Neural Network (CNN) model for multiple choice question answering, where the questions are based on a particular article. Given an article and a multiple choice question, the model assigns a score to each question-option tuple and chooses the final option accordingly. The model is tested on the Textbook Question Answering (TQA) and SciQ datasets, where it outperforms several LSTM-based baseline models. The working of this model is described below. Given a question based on an article, usually only a small portion of the article is needed to answer it.
Hence it is not fruitful to give the entire article as input to the neural network. To select the most relevant paragraph in the article, this model takes both the question and the options into consideration, instead of just the question. The rationale behind this approach is to find the most relevant paragraph even when the question is very general in nature. For example, consider an article about carbon and the question "Which of the following statements is true about carbon?". In such a scenario, it is not possible to choose the most relevant paragraph by looking at the question alone. The model selects the most relevant paragraph by word2vec-based query expansion followed by tf-idf scoring, as sketched below.
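A minimal sketch of this selection step, assuming scikit-learn for the tf-idf part; the word2vec expansion is left as a pluggable hook since the article does not detail it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_relevant_paragraph(paragraphs, question, options, expand=None):
    """Rank paragraphs by tf-idf similarity to the question plus all
    options. `expand` is an optional word2vec-style hook that returns
    extra similar words for query expansion."""
    query = " ".join([question] + options)
    if expand is not None:
        query += " " + " ".join(expand(query))   # query expansion step
    vec = TfidfVectorizer().fit(paragraphs)
    sims = cosine_similarity(vec.transform([query]), vec.transform(paragraphs))
    return paragraphs[sims.argmax()]

# Usage with two toy paragraphs:
paras = ["Carbon has several allotropes such as graphite and diamond.",
         "The water cycle describes evaporation and rainfall."]
q = "Which of the following statements is true about carbon?"
print(most_relevant_paragraph(paras, q, ["graphite", "rainfall"]))
```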
Neural Network Architecture: Word embeddings are used to encode the words present in the question, the options and the most relevant paragraph, so each word is assigned a fixed d-dimensional representation. The proposed model architecture is shown in Figure 1. Let q and oi denote the word embeddings of the words present in the question and the ith option respectively. Thus q ∈ R^(d×lq) and oi ∈ R^(d×lo), where lq and lo represent the number of words in the question and option respectively. The question-option tuple (q, oi) is embedded using a Convolutional Neural Network (CNN) with a convolution layer followed by average pooling. The convolution layer has three types of filters of sizes fj × d ∀ j = 1, 2, 3, each with output channel size k. Each filter type j produces a feature map of shape (lq + lo − fj + 1) × k, which is average-pooled to generate a k-dimensional vector. The three k-dimensional vectors are concatenated to form a 3k-dimensional vector. Average pooling is used to ensure different embeddings for different question-option tuples. Hence hi = CNN([q; oi]) ∀ i = 1, 2, .., nq, where nq is the number of options, hi is the output of the CNN and [q; oi] denotes the concatenation of q and oi, i.e. [q; oi] ∈ R^(d×(lq+lo)). The sentences in the most relevant paragraph are embedded using the same CNN. Let sj denote the word embeddings of the words in the jth sentence, i.e. sj ∈ R^(d×ls) where ls is the number of words in the sentence. Then dj = CNN(sj) ∀ j = 1, 2, .., nsent, where nsent is the number of sentences in the most relevant paragraph and dj is the output of the CNN. The rationale behind using the same CNN for embedding the question-option tuples and the sentences is to ensure similar embeddings for similar inputs. Next, the model uses hi to attend over the sentence embeddings. Formally,

aij = (hi · dj) / (||hi|| ||dj||)
rij = exp(aij) / Σj exp(aij)
mi = Σj rij dj

where ||·|| signifies the l2 norm, exp(x) = e^x, and hi · dj is the dot product of the two vectors. Since aij is the cosine similarity between hi and dj, the attention weights rij give more weight to the sentences that are more relevant to the question. The attended vector mi can be thought of as the evidence in favor of the ith option. Hence, to score the ith option, we take the cosine similarity between hi and mi:

scorei = (hi · mi) / (||hi|| ||mi||)
Finally, the scores are normalized using a softmax to obtain the final probability distribution:

pi = exp(scorei) / Σi exp(scorei)

where pi denotes the probability of the ith option.
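These scoring equations can be reproduced directly. The sketch below assumes the hi and dj vectors have already been produced by the CNN described above and implements only the attention, scoring and softmax steps.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def option_probabilities(H, D):
    """Score each option embedding h_i against sentence embeddings d_j as
    in the equations above: cosine attention over sentences, attended
    evidence m_i, cosine score, then a softmax over options."""
    scores = []
    for h in H:                                   # one h_i per option
        a = np.array([cosine(h, d) for d in D])   # a_ij
        r = np.exp(a) / np.exp(a).sum()           # attention weights r_ij
        m = (r[:, None] * D).sum(axis=0)          # evidence vector m_i
        scores.append(cosine(h, m))               # score_i
    scores = np.array(scores)
    return np.exp(scores) / np.exp(scores).sum()  # softmax over options

# Usage: 4 options, 6 sentences, 3k-dimensional CNN embeddings (k = 32).
rng = np.random.default_rng(0)
H, D = rng.normal(size=(4, 96)), rng.normal(size=(6, 96))
print(option_probabilities(H, D))                 # probabilities sum to 1
```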
References
[1] "CNN for Text-Based Multiple Choice Question Answering", https://aclanthology.info/papers/P18-2044/p18-2044, Accessed online on 11 May 2019.
[2] "Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension", https://aclanthology.info/papers/CVPR17-TQA/cvpr17-tqa, Accessed online on 11 May 2019.
[3] "Crowdsourcing Multiple Choice Science Questions", https://arxiv.org/abs/1707.06209, Accessed online on 11 May 2019.
[4] "Dynamic Coattention Networks For Question Answering", https://arxiv.org/abs/1611.01604, Accessed online on 11 May 2019.
Teaching language models grammar really does make them smarter
Researchers put deep learning models through a set of psychology tests to see which ones grasp key linguistic rules. To give computers some of our innate feel for language, researchers have started training deep learning models on the grammatical rules that most of us grasp intuitively, even if we never learned how to diagram a sentence in school. Grammatical constraints seem to help the models learn faster and perform better, but because neural networks reveal very little about their decision-making process, researchers have struggled to confirm that the gains are due to the grammar, and not to the models' expert ability at finding patterns in sequences of words. To peer inside the models, researchers have taken psycholinguistic tests originally developed to study human language understanding and adapted them to probe what neural networks know about language.
Detection of Ironic Tweets for Sentiment Analysis Thasneema Mullappally M.Tech Computational Linguistics Government Engineering College, Sreekrishnapuram thasneema93@gmail.com
Irony is a common phenomenon in human communication and is widely used in social networks and microblogging websites. It is generally identified when the literal meaning differs from the intended meaning; that is, the speaker expresses a negative sentiment in a positive way. Regular sentiment analysis, which categorizes the opinions expressed in a statement into positive, negative and neutral, will fail when irony is present. Therefore it is important to detect irony in a statement in order to improve such sentiment analysis systems. Below are some examples of ironic tweets.
1. I feel so blessed to get ocular migraines.
2. Oh how I love being ignored.
3. Thoroughly enjoyed shoveling the driveway today!
4. Absolutely adore it when my bus is late.
5. Oh Battery low..... What a wonderful time.
For human readers, it is clear that the author of example 1 does not feel blessed at all, which can be inferred from the contrast between the positive sentiment expression "I feel so blessed" and the negative connotation associated with getting ocular migraines. Although such connotative information is easily understood by most people, it is difficult for machines to access. There are various techniques for irony detection. Recent approaches can roughly be classified as either rule-based or (supervised and unsupervised) machine-learning-based. While rule-based approaches mostly rely upon lexical information and require no training, machine learning invariably makes use of training data and exploits different types of information sources (or features), such as bags of words, syntactic patterns, sentiment information or semantic relatedness. Figure 1 shows the overall workflow of the proposed irony detection system for English tweets. The main processes in the proposed system are dataset collection, data pre-processing, feature extraction and model training.
Figure 1: Architecture of the Irony Detection System
The dataset is a collection of English tweets. The first task is to pre-process the selected dataset. The pre-processed dataset is then used in the feature extraction phase, where various features are extracted from it. The extracted features are used to train the system, which then becomes able to predict whether a particular tweet is ironic or not. In the prediction phase, features are likewise extracted from the input tweet and given to the trained classification model to predict the correct label: ironic or not. Four types of features are extracted from tweets for classification: sentiment-related features, punctuation-related features, syntactic features, and frequency-related features. A very popular type of irony, widely used both in regular conversations and in short messages such as tweets, is when an emotionally positive expression is used in a negative context. A similar way to express irony is to use expressions with contradictory sentiments, that is, positive and negative sentiment in a single tweet, as in "I just love when you test my patience!!". To extract the sentiment-related feature, the tweet is split into two halves and the polarity difference between them is used as one feature for detecting ironic tweets. Sentiment-related features are not enough to detect all kinds of irony that might be present, and they do not make use of all the components of the tweet. Irony also exploits behavioural aspects such as low tones, facial gestures or exaggeration; these aspects are translated into a particular use of punctuation or repetition of vowels when the message is written. To detect such aspects, a set of punctuation-
related features is extracted. Some ironic tweets contain irregular use of punctuation marks, as in "OHH.... Wonderful DAY!!!!!!!!". Along with the punctuation-related features, some common expressions are usually used in an ironic context, and it is possible to correlate these expressions with the punctuation to decide whether what is said is ironic or not. Sometimes people use unusual syntactic patterns of words to express their feelings in an ironic context: people tend to make complicated sentences or use uncommon words to make it ambiguous for the listener or reader to get a clear answer. This is common when a person's purpose is to hide their real feeling or opinion by using irony. So the frequency of a word's usage is also an important frequency-related feature. The extracted features are given to a supervised machine learning algorithm, the Random Forest classifier. Random forest builds multiple decision trees and merges their outputs to get a more accurate and stable prediction. It depends on the random selection of variables and data to develop a large number of decision trees, and can be considered a special case of bagging. The aggregation across trees leads to accurate prediction by averaging out the errors of the individual trees. The random forest classifier predicts whether a tweet is ironic or not based on the extracted features, as sketched below.
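As an illustration, the sketch below extracts two punctuation-related cues and the polarity-difference feature described above and feeds them to a scikit-learn random forest. The tiny polarity lexicon and training pairs are toy stand-ins, not the system's actual data.

```python
import re
from sklearn.ensemble import RandomForestClassifier

def punctuation_features(tweet):
    """Two punctuation-related cues: repeated marks and stretched vowels."""
    repeated_marks = len(re.findall(r"!{2,}|\?{2,}|\.{3,}", tweet))
    vowel_runs = len(re.findall(r"([aeiou])\1{2,}", tweet.lower()))
    return [repeated_marks, vowel_runs]

def polarity_split_feature(tweet, polarity):
    """Sentiment contrast: polarity difference between the two halves."""
    words = re.findall(r"[a-z']+", tweet.lower())
    half = len(words) // 2
    first = sum(polarity.get(w, 0.0) for w in words[:half])
    second = sum(polarity.get(w, 0.0) for w in words[half:])
    return [first - second]

# Toy polarity lexicon; a real system would use a full sentiment lexicon.
polarity = {"adore": 1.0, "wonderful": 1.0, "late": -1.0, "low": -1.0}
tweets = ["Absolutely adore it when my bus is late!!!",
          "Meeting at noon, see you there."]
X = [punctuation_features(t) + polarity_split_feature(t, polarity)
     for t in tweets]
y = [1, 0]  # 1 = ironic, 0 = not ironic (toy labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))
```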
In the future, in order to improve the accuracy of the system, more sophisticated features with high feature importance need to be found. The proposed irony detection system is a binary classifier: it predicts whether a tweet is ironic or not. A future extension is a multiclass irony classification task that decides whether a tweet contains a specific type of irony (verbal irony, situational irony, or another type) or is not ironic. The proposed system works with a dataset that does not contain emoticons, while usual tweets may contain them, so the system needs to be upgraded to work with a corpus containing emoticons.

References
[1] "Dissecting tweets in search of irony", https://www.aclweb.org/anthology/S18-1090, Accessed online on 23 May 2019.
[2] "SemEval-2018 Task 3: Irony Detection in English Tweets", https://www.aclweb.org/anthology/S18-1005, Accessed online on 23 May 2019.
[3] "A Pattern Based Approach for Sarcasm Detection", https://ieeexplore.ieee.org/document/7549041/, Accessed online on 23 May 2019.
Bringing human-like reasoning to driverless car navigation
An autonomous control system 'learns' to use simple maps and image data to navigate new, complex routes. With the aim of bringing more human-like reasoning to autonomous vehicles, researchers have created a system that uses only simple maps and visual data to enable driverless cars to navigate routes in new, complex environments.
Gauging language proficiency through eye movement
A study by MIT researchers has uncovered a new way of telling how well people are learning English as a foreign language: tracking their eyes. Using data generated by cameras trained on readers' eyes, the research team has found that patterns of eye movement, particularly how long people's eyes rest on certain words, correlate strongly with performance on standardized tests of English as a second language. Our eyes do not move continuously along a string of text, but instead fix on particular words for up to 200 to 250 milliseconds, and take leaps from one word to another that may last about 1/20 of a second. If you are learning a new language, however, your eyes may dwell on particular words for longer periods of time as you try to comprehend the text. The particular pattern of eye movement can therefore reveal a lot about comprehension, at least when analyzed in a clearly defined context.
M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com geccl1820@googlegroups.com
SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering
Article Invitation for CLEAR - September 2019
We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering and Research) Journal, publishing in September 2019. The suggested areas of discussion are:
Articles may be sent to the Editor on or before 10th September 2019 at geccl1820@googlegroups.com. For more details visit: www.simplegroups.in
Last Word
Hello everyone! This latest edition of the CLEAR Journal comes with articles on trending topics, including:
A User-Centric Machine Learning Framework for Cyber Security Operations Center; Commonsense LocatedNear Knowledge: Automatic Extraction; CNN for Text-Based Multiple Choice Question Answering; and Detection of Ironic Tweets for Sentiment Analysis. The SIEM (Security Information and Event Management) system is used to normalize security events from different preventive technologies and alerts. The security operations center (SOC) investigates the alerts to decide whether they are truly malicious. Machine learning is a powerful approach to reduce the false positive rate and improve the productivity of the SOC; the system is implemented in a production environment and is fully automated, from data acquisition and daily model refreshing to real-time scoring. LocatedNear is a kind of commonsense relation describing two physical objects that are typically found near each other in real life; systems incorporate commonsense knowledge such as 'HasProperty' and 'LocatedNear' facts as background. In text-based question answering, a Convolutional Neural Network (CNN) model is used for multiple choice question answering where the questions are based on a particular article. Irony is a common phenomenon in human communication, generally identified when the literal meaning differs from the intended meaning. Regular sentiment analysis, which categorizes the opinions expressed in a statement into positive, negative and neutral, will fail when irony is present; therefore it is important to detect irony in statements to improve such sentiment analysis systems.
These articles are based on recent works and research in the field of computational linguistics. CLEAR is thankful to all who have given their valuable time and effort to contribute their thoughts and ideas. Simple Groups invites more aspirants in this field.
Wish you all the success in your future endeavours…!!!
DIVYA VISAKH