
International Journal of Modern Research in Engineering & Management (IJMREM), Volume 2, Issue 2, Pages 01-05, February 2019, ISSN: 2581-4540

MIM (Mobile Instant Messaging) Classification using Term Frequency-Inverse Document Frequency (TF-IDF) and Bayesian Algorithm

1 Kashaf-u-Duja, 2 Muhammad Bux Alvi, 3 Tariq Jameel Saifullah Khanzada, 4 Nisha Kumari

1,4 Institute of Information and Communication Technology, Mehran University of Engineering and Technology Jamshoro
2 Department of Computer Systems Engineering, The Islamia University of Bahawalpur
3 Department of Computer Systems Engineering, Mehran University of Engineering and Technology Jamshoro

ABSTRACT

This study develops a hybrid framework for binary, aspect-level sentiment classification of WhatsApp mobile instant messages (MIMs). The work is carried out in two phases: training and testing. In the training phase, 75% of the data is used as the training set. Preprocessing techniques (tokenization, stop-word removal, case normalization, punctuation removal and stemming) are applied to obtain a cleaner dataset, and the output is passed to the classifier after TF-IDF feature weighting. In the testing phase, the classifier is evaluated on the remaining 25% of the data. A Bernoulli Naïve Bayesian classifier, a modified form of the traditional Naïve Bayesian classifier, is used to classify sentiments. The dataset contains 417 messages, of which 244 are labeled positive and 173 negative. The proposed model achieves an accuracy of 81.73%, about 12.5 percentage points above the 69.23% of the baseline classification model.

KEYWORDS: Mobile Instant Messages (MIMs), Naïve Bayesian, Sentiment classification, TF-IDF, WhatsApp

Date of Submission: 30 January 2019
Date of Acceptance: 03 February 2019

I. INTRODUCTION

Web development has substantially changed human interaction and communication and has prompted huge and rapid growth in user-generated data [4]. It is estimated that 95% of available data is unstructured. To extract information and create knowledge from such raw resources, the data needs to be processed and analyzed properly, because the knowledge present in text data is not directly accessible to computers [1]. With the striking development of social media platforms such as Facebook, Twitter, WhatsApp and WeChat, more and more people post online texts to express their opinions on social issues and share their reviews [5]. Significant attention has focused on examining this data in terms of the sentiment it conveys, which has resulted in the emergence of the sentiment analysis research field. Sentiment analysis involves the computational analysis of user-generated data, such as reviews, to determine its orientation (positive, negative or neutral). There are two main reasons to automate sentiment analysis: first, the abundance of online data is beyond human analysis; and second, public opinion is a significant consideration when governments, institutions, and individuals are making decisions [4].

WhatsApp text data introduces additional problems such as word shortening, neologisms, and spelling variations. Traditional machine learning methods have proved inadequate for the task. To address this problem, we propose a methodology based on binary sentiment classification at the aspect level. This work focuses on developing a hybrid sentiment classification framework for WhatsApp MIMs that combines recursive preprocessing with machine learning to achieve higher accuracy on a closed-domain dataset of 417 messages obtained from a WhatsApp group. The dataset is labeled manually and consists of 244 positive and 173 negative opinions. Preprocessing is used to obtain cleaner data for better accuracy, and the Naïve Bayesian machine learning algorithm is used to develop the model and test its suitability.

www.ijmrem.com

IJMREM

Page 1

II. LITERATURE REVIEW

The authors of [1] proposed a novel hybrid method with a recursive preprocessing approach for sentiment analysis of Twitter data consisting of 6090 tweets. The dataset was labeled manually as 3111 positive, 1114 negative and 1865 neutral tweets. Multinomial Naïve Bayesian, linear SVM and neural network algorithms were used to develop different hybrid models and test their suitability, with bag-of-words, TF-IDF and N-gram as feature engineering models. Hold-out splitting was used to evaluate accuracy, with 80% of the data used for training and 20% for testing. The model achieved 86.18% overall accuracy against an 82% baseline.

Reference [2] compares six commonly used preprocessing techniques on two Twitter datasets for sentiment analysis. The recommended techniques are lemmatization, replacing repetitions of punctuation, replacing contractions, and removing numbers. For classic machine learning sentiment analysis, the winning combination consists of five techniques: replacing URLs and user mentions, replacing contractions, removing numbers, replacing repetitions of punctuation, and lemmatization.

The authors of [3] applied preprocessing techniques and merged 10 existing sentiment lexicons into a high-coverage lexical resource (HCLr). Seven classifiers were evaluated, of which SVM performed best with 34.16%; the second-best classifier was boosted Naïve Bayesian with an overall accuracy of 30.61%.

A two-phase hybrid method is proposed in [4]. The first phase, contextual analysis, consists of preprocessing techniques, while the second phase, ensemble clustering, consists of feature extraction and unsupervised machine learning algorithms. The sentiment lexicon SentiWordNet 3.0 is used to measure the strength of each term’s polarity. The proposed method increased the accuracy rate by an average of 3.0% when contextual analysis procedures were applied. Feature weighting schemes, including TF-IDF, enhance the performance from (520) %.

III. METHODOLOGY

Fig. 1 shows the methodology used in this paper, which comprises 12 steps that are explained below.

Figure 1 MIMs classification model

Data Collection: We created a WhatsApp group named “Internet; Good or Bad” consisting of 15 members. A total of 417 messages were collected and manually labeled as 244 “Favor” and 173 “Against”. A copy of the group chat history was extracted in “.txt” format using the email chat feature and then converted into a “.csv” file for use [8].

Tokenization: The process of breaking down the corpus into individual elements [6]. It is also termed word segmentation [1].
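As a minimal sketch of the tokenization step, the following standard-library Python splits a message into word and punctuation tokens. The regular expression, the function name `tokenize` and the sample message are assumptions made for illustration; the paper does not state which tokenizer was used.

```python
import re

# Hypothetical pattern: a run of letters/digits (optionally with one inner
# apostrophe part, e.g. "isn't"), or any single non-space, non-alphanumeric character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")


def tokenize(message: str) -> list[str]:
    """Split a message into word and punctuation tokens, keeping original case."""
    return TOKEN_PATTERN.findall(message)


print(tokenize("Internet is good, isn't it?"))
# ['Internet', 'is', 'good', ',', "isn't", 'it', '?']
```

Punctuation is kept as separate tokens here so that a later step can remove it explicitly.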


Figure 2 MIM after tokenization

Removing Stop Words: Stop words are unnecessary words that commonly appear in text, such as so, and, or, the … [2]. There are 153 English language stop words that need to be removed because they carry little significance in most datasets [1].
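Stop-word removal can be sketched as a simple set lookup. The list below is an illustrative subset only (the first four words come from the text above, the rest are assumed); it is not the 153-word list referred to in [1]. Because case normalization comes later in the pipeline, this sketch compares the lowercase form of each token, which is a design assumption rather than something the paper specifies.

```python
# Illustrative subset; the full 153-word list is not reproduced here.
STOP_WORDS = {"so", "and", "or", "the", "is", "it", "a", "of", "to"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens whose lowercase form is a stop word; kept tokens retain their case."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words(["The", "Internet", "is", "good", "and", "useful"]))
# ['Internet', 'good', 'useful']
```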

Figure 3 MIM after removing stop words

Case Normalization: An irreversible process that converts terms to lower case [1].

Figure 4 MIM after case normalization

Removing Punctuation: A classic technique in information retrieval and data mining that removes punctuation marks from the text [2].
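Case normalization and punctuation removal can be sketched together in a few lines of standard-library Python. The function names and sample tokens are hypothetical, and only ASCII punctuation (`string.punctuation`) is handled, which is an assumption; emoji or non-ASCII marks in real WhatsApp messages would need extra handling.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_case(tokens: list[str]) -> list[str]:
    """Convert every token to lower case (the original casing is lost)."""
    return [t.lower() for t in tokens]


def remove_punctuation(tokens: list[str]) -> list[str]:
    """Strip punctuation characters from each token and drop tokens left empty."""
    stripped = (t.translate(_PUNCT_TABLE) for t in tokens)
    return [t for t in stripped if t]


tokens = ["Internet", "Good", ",", "isn't", "BAD", "!"]
print(remove_punctuation(normalize_case(tokens)))
# ['internet', 'good', 'isnt', 'bad']
```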

Figure 5 MIM after removing punctuation

Stemming: Converts words into their root forms; it is effective for polarity detection [1] and generally yields good results [2].
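To illustrate the idea of stemming, the toy function below strips a few common English suffixes. It is deliberately much simpler than a real stemmer such as the Porter algorithm (available in NLTK as `PorterStemmer`), and the paper does not say which stemmer was used; the suffix list and the minimum stem length of three characters are assumptions.

```python
# Toy suffix list, checked in this order; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def simple_stem(word: str) -> str:
    """Strip the first matching suffix, provided at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([simple_stem(w) for w in ["connected", "connecting", "connects", "good"]])
# ['connect', 'connect', 'connect', 'good']
```

Mapping several surface forms to one stem shrinks the vocabulary, which matters for a dataset of only a few hundred messages.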

Figure 6 MIM after stemming

Term Frequency-Inverse Document Frequency (TF-IDF): A commonly used scoring scheme that evaluates the importance of a token in a document and ultimately in the given dataset. It can be used to remove stop words, punctuation, and the most frequent and least frequent tokens successfully [1]. Term frequency measures how frequently a term occurs in a document. The inverse document frequency factor decreases the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely [7]. Mathematically [4],

$$w_{i,j} = tf_{i,j} \times idf_i \qquad (1)$$

where
▪ $w_{i,j}$ is the TF-IDF weight of term i in document j,
▪ $tf_{i,j}$ is the normalized term frequency of term i in document j,
▪ $idf_i$ is the inverse document frequency of term i.

Bernoulli’s Naïve Bayesian: Naïve Bayesian is a probabilistic classifier, widely used in sentiment classification, that applies the Bayesian theorem to calculate the probability of a data sample belonging to a specific class. The classifier makes the naïve assumption that all features are completely independent of each other [3]. The probability of a sample belonging to a class can be computed using the following formula.


$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \qquad (2)$$

where
▪ P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes),
▪ P(c) is the prior probability of the class,
▪ P(x|c) is the likelihood, i.e. the probability of the predictor given the class,
▪ P(x) is the prior probability of the predictor.

The Bernoulli Naïve Bayesian algorithm is a modified form of traditional Naïve Bayesian, where the weight of each term is equal to 1 if it exists in the sentence and 0 if not [2].
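As a minimal sketch of Equations (1) and (2), the following standard-library Python computes TF-IDF weights and trains a small Bernoulli Naïve Bayesian classifier on term presence. It is not the authors' implementation: the idf form log(N/df), the add-one smoothing, the function names and the four toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def train_bernoulli_nb(docs, labels):
    """Estimate class priors and smoothed P(term present | class) for every term."""
    vocab = sorted({t for doc in docs for t in doc})
    model = {"vocab": vocab, "prior": {}, "p_term": {}}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = len(class_docs) / len(docs)
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        model["p_term"][c] = {
            t: (sum(t in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for t in vocab
        }
    return model


def predict_bernoulli_nb(model, doc):
    """Return the class with the highest log posterior; P(x) is constant and omitted."""
    present = set(doc)
    scores = {}
    for c, prior in model["prior"].items():
        log_p = math.log(prior)
        for t in model["vocab"]:
            p = model["p_term"][c][t]
            # Each vocabulary term contributes whether it is present (1) or absent (0).
            log_p += math.log(p if t in present else 1.0 - p)
        scores[c] = log_p
    return max(scores, key=scores.get)


# Toy, already preprocessed messages (hypothetical data).
docs = [["internet", "good", "help"], ["internet", "useful", "good"],
        ["internet", "bad", "waste"], ["internet", "harm", "bad"]]
labels = ["favor", "favor", "against", "against"]

weights = tfidf(docs)
model = train_bernoulli_nb(docs, labels)
print(weights[0])
print(predict_bernoulli_nb(model, ["internet", "good"]))  # favor
```

In this toy corpus "internet" occurs in every message, so its idf, and hence its weight, is zero, which shows how TF-IDF discounts very frequent terms. The classifier itself only looks at whether a term is present, matching the 1/0 weighting described above.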

IV. TOOLS AND TECHNOLOGIES

Python 3.0 (Anaconda Python Distribution) is used to obtain the results of the model. The Python libraries NumPy (Numerical Python), NLTK (Natural Language Toolkit), scikit-learn and Matplotlib are used for scientific computing (arrays, mathematical calculations), preprocessing, machine learning and plotting (graphs etc.), respectively.

Figure 7 Tools and technologies used

V. RESULTS AND DISCUSSION

Hold-out splitting is used to evaluate the accuracy of the proposed model: 75% of the data is used for training and 25% for testing the classifier. The model attained an accuracy of 81.73% against a baseline accuracy of 69.23%. The results show that the proposed hybrid binary sentiment classification model with preprocessing techniques has achieved satisfactory results, with an accuracy 12.5 percentage points higher than the baseline.
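The reported accuracies can be checked with simple arithmetic. The sketch below assumes a test set of 104 messages with 85 (proposed) and 72 (baseline) correct predictions; these counts are inferred from the percentages because they reproduce them exactly, and they are not stated in the paper.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


TEST_SIZE = 104  # assumption: roughly 25% of the dataset, not given in the paper

proposed = accuracy_percent(85, TEST_SIZE)  # assumed count of correct predictions
baseline = accuracy_percent(72, TEST_SIZE)  # assumed count of correct predictions
gain = round(proposed - baseline, 2)

print(proposed, baseline, gain)
# 81.73 69.23 12.5
```

Under these assumptions the improvement corresponds to 13 additional messages classified correctly, i.e. 12.5 percentage points.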

In Fig. 8, Graph 1(a) shows the message count: there were 417 messages in total, of which 244 and 173 are labeled as favor and against, respectively. Graph 1(b) shows the counts after elimination of 4 repeated messages, which left 240 favor messages.
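Eliminating repeated messages can be sketched as an order-preserving de-duplication. The sketch assumes that a "repeated" message is an exact duplicate of an earlier (text, label) pair; the paper does not define the criterion, and the sample data is hypothetical.

```python
def drop_repeated(messages):
    """Keep the first occurrence of each (text, label) pair, preserving order."""
    seen = set()
    unique = []
    for item in messages:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


sample = [
    ("internet is good", "favor"),
    ("internet is bad", "against"),
    ("internet is good", "favor"),  # exact repeat of the first message
]
print(drop_repeated(sample))
# [('internet is good', 'favor'), ('internet is bad', 'against')]
```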

Figure 8 Impact of preprocessing at message level


In Fig. 9, Graph 2(a) shows the results before preprocessing, while Graph 2(b) shows the results after preprocessing. It can be concluded that the preprocessing techniques trim lengthy and verbose messages into important, useful tokens, yielding a cleaner dataset and better results.

Figure 9 Impact of preprocessing at token level

VI. CONCLUSION AND FUTURE WORK

The proposed model performs binary sentiment classification at the aspect level, using a hybrid sentiment classification framework with preprocessing techniques to process a WhatsApp MIM dataset. A machine learning technique is used to develop the sentiment classification model with a TF-IDF feature weighting scheme. The model attains satisfactory results compared to the baseline accuracy. For future work, it is suggested to enlarge the dataset, as more data is expected to yield better accuracy. Furthermore, more preprocessing techniques could be applied in a well-ordered winning combination to extract significant features for sentiment classification.

REFERENCES

[1] Alvi, M.B., Mahoto, A.N., Alvi, M., Unar, M.A., & Shaikh, M.A., Hybrid Classification Model for Twitter Data - A Recursive Preprocessing Approach, 5th International Multi-topic ICT Conference (IMTIC), 2018, 1-6.
[2] Symeonidis, S., Effrosynidis, D., & Arampatzis, A., A comparative evaluation of pre-processing techniques and their interactions for Twitter sentiment analysis, Expert Systems with Applications, 110, 2018, 298-310.
[3] Abdi, A., Shamsuddin, S. M., Hasan, S., & MD, J. P., Machine learning-based multi-documents sentiment-oriented summarization using linguistic treatment, Expert Systems with Applications, 109, 2018, 66-85.
[4] Al-Sharuee, M. T., Liu, F., & Pratama, M., Sentiment analysis: An automatic contextual analysis and ensemble clustering approach and comparison, Data and Knowledge Engineering, 115, 2018, 194-213.
[5] Liu, Y., Bi, J.W., & Fan, Z.P., Multi-class sentiment classification: The experimental comparisons of feature selection and machine learning algorithms, Expert Systems with Applications, 80, 2017, 323-339.
[6] Faraz, A., An elaboration of text categorization and automatic text classification through mathematical and graphical modeling, An International Journal (CSEIJ), 5(2), 2015, 239-248.
[7] Ahmed, I., Guan, D., & Chung, C.T., SMS Classification Based on Naïve Bayes Classifier and Apriori Algorithm Frequent Itemset, International Journal of Machine Learning and Computing, 4(2), 2014.
[8] Patil, S., WhatsApp Group Data Analysis with R, International Journal of Computer Applications, 154(4), 2016, 31-36.
[9] Tang, Y., & Hew, K.F., Is mobile instant messaging (MIM) useful in education? Examining its technological, pedagogical, and social affordances, Educational Research Review, 21, 2017, 85-104.
[10] Appel, O., Chiclana, F., Carter, J., & Fujita, H., A hybrid approach to the sentiment analysis problem at the sentence level, Knowledge-Based Systems, 108, 2016, 110-124.
[11] Katz, G., Ofek, N., & Shapira, B., ConSent: Context-based sentiment analysis, Knowledge-Based Systems, 84, 2015, 162-178.
[12] Fersini, E., Messina, E., & Pozzi, F. A., Sentiment analysis: Bayesian Ensemble Learning, Decision Support Systems, 68, 2014, 26-38.


Published on Mar 11, 2019
