Imperial Journal of Interdisciplinary Research (IJIR) Vol-3, Issue-2, 2017 ISSN: 2454-1362, http://www.onlinejournal.in
Survey on Spam Detection Techniques in Online Review Systems
Reshma P R
M.Tech, Dept. of CSE, Thejus Engineering College, Vellarakkad

Abstract: It is now common practice for e-commerce Web sites to let their customers write reviews of products they have purchased. Online reviews have become the most important source of customer opinions. These reviews are increasingly used by individuals and organizations to make purchase and business decisions. Unfortunately, driven by the desire for profit or publicity, fraudsters have produced deceptive (spam) reviews. In this paper, we attempt to identify whether a review is spam or non-spam, in order to provide trusted reviews that help the customer make the right purchasing decision.

Keywords: Review Writing, Spam detection, Spam, Non-spam, Product reviews, Fake reviews
1. Introduction
In the era of Web 2.0, people increasingly use e-commerce and opinion-sharing Web sites. These sites allow people to share their own experiences, feelings, attitudes and opinions about products and services, as well as about political and economic issues in society. As a result, the volume of user-contributed consumer reviews posted on such sites has grown dramatically in recent years. Such opinions, which originate from customers' experiences with particular products or topics, clearly influence future purchase decisions. In other words, opinionated postings in social media lead prospective buyers to make or reverse purchase decisions. Customer reviews are therefore important for individuals. A large proportion of positive reviews attracts more customers to a particular product or brand, and positive reviews can bring significant financial gains. Likewise, negative reviews often cause sales losses. Consequently, there is a growing trend of merchants relying on the general public's opinions to reshape their businesses by improving their products, services, and marketing.
For instance, when many customers who have purchased a specific model of Asus laptop post reviews complaining about its screen resolution, the manufacturer will be prompted to modify the product to achieve customer satisfaction and, consequently, greater market success. Given the abundance of information about products and sellers on various opinion-sharing sites, and the importance of these customer opinions for individuals and organizations, opinion mining techniques have been proposed to help businesses and individuals gather and analyze the large volume of customer reviews.

The widespread sharing and use of customer opinion has created a spam attack problem on sites containing customer reviews. Since anyone can easily write reviews and post them on social media without any constraints, certain sellers or product providers abuse this situation to promote their products, brand and store, or to unfairly defame their competitors. For instance, suppose several customers using a particular digital camera post negative opinions about its picture quality. These reviews give potential customers an unfavorable impression of the camera. In response, the camera manufacturer may hire a person or group to post fake positive reviews about the camera's picture quality. Similarly, the manufacturer may ask the hired people to write negative reviews of competitors' products. Reviews written by people who have not actually experienced the subject of the review are called spam reviews; they may also be called fake reviews, non-genuine reviews, or false reviews. Correspondingly, a person employed to write spam reviews is an individual spammer. If a spammer works with other spammers to achieve certain objectives, they are called group spammers.

The growth of individual and group spammers severely degrades the accuracy of opinion mining and sentiment analysis results and, in turn, raises concerns about the trustworthiness of opinion postings in social media.
Review spam falls into three main types:
(1) Untruthful reviews, which are further divided into two kinds:
• Positive spam review: gives an undeserved positive opinion of a product
• Negative spam review: gives a negative opinion in order to steer customers towards another product
(2) Reviews on brands only
(3) Non-reviews, such as advertisements
2. Survey
This section surveys three categories of review spam detection techniques in detail. Other approaches are also under study, and broad work in this area may lead to more effective and efficient review spam detection methods. The related works and papers were consulted extensively, and a comparative study was carried out to identify the pros and cons of the different detection methods, which can inform future implementations of such systems. The three techniques are explained in detail below and in the rest of the paper. The survey concentrates on three papers: Analyzing and Detecting Review Spam; Conceptual Level Similarity Measure based Review Spam Detection; and High-Order Concept Associations Mining and Inferential Language Modeling for Online Review Spam Detection.
2.1. Analyzing and Detecting Review Spam
[1] makes the following two main contributions:
Review spam classification: It presents a categorization of review spam. [1] identifies three main types of spam reviews.
Review spam analysis and detection: It proposes some novel techniques to study review spam and to detect spam.
An examination of the reviews shows that they definitely contain type 2 and type 3 spam reviews. Three types of duplicate reviews are considered:
Duplicate type 1: Duplicates from different userids on the same product.
Duplicate type 2: Duplicates from the same userid on different products.
Duplicate type 3: Duplicates from different userids on different products.
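The three duplicate types can be illustrated with a short Python sketch. It is only an illustration: the `Review` record, the word-bigram Jaccard similarity and the 0.9 cut-off are assumptions made here for the example and are not details taken from the survey text.

```python
from typing import NamedTuple, Optional


class Review(NamedTuple):
    # Hypothetical record layout, assumed for this example only.
    userid: str
    product: str
    text: str


def bigrams(text: str) -> set:
    """Set of lower-cased word bigrams of a review text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def jaccard_bigrams(a: str, b: str) -> float:
    """Jaccard similarity of the two bigram sets (0.0 if both are empty)."""
    sa, sb = bigrams(a), bigrams(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def is_near_duplicate(a: str, b: str, cutoff: float = 0.9) -> bool:
    # The 0.9 cut-off is an assumed value, not one stated in the text.
    return jaccard_bigrams(a, b) >= cutoff


def duplicate_type(r1: Review, r2: Review) -> Optional[int]:
    """Return duplicate type 1, 2 or 3, or None if the pair does not qualify."""
    if not is_near_duplicate(r1.text, r2.text):
        return None
    same_user = r1.userid == r2.userid
    same_product = r1.product == r2.product
    if same_user and same_product:
        return None  # not one of the three listed duplicate types
    if same_product:
        return 1     # different userids, same product
    if same_user:
        return 2     # same userid, different products
    return 3         # different userids, different products
```

For instance, two identical texts posted by users `u1` and `u2` on the same product would be reported as duplicate type 1.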
The spam detection strategy of [1] consists of three steps:
(1) Detect duplicates and near-duplicates.
(2) Detect spam reviews of type 2 and type 3 using supervised learning with manually labeled training examples.
(3) Detect type 1 spam by exploiting the three types of duplicates above and other relevant information.
Detection is mainly feature based. [1] uses three types of features:
(1) Review centric features: characteristics of reviews.
(2) Reviewer centric features: characteristics of reviewers.
(3) Product centric features: characteristics of products.
For some features, products and reviews need to be divided into three types based on their average ratings (rating scale: 1-5): Good (rating >= 4), Bad (rating <= 2.5) and Average (otherwise).
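The rating-based grouping described above can be written as a small Python helper. This is a minimal sketch; the function name `rating_bucket` is an assumption chosen for illustration, while the thresholds are the ones given in the text.

```python
def rating_bucket(avg_rating: float) -> str:
    """Map an average rating on the 1-5 scale to Good / Bad / Average."""
    if avg_rating >= 4:
        return "good"
    if avg_rating <= 2.5:
        return "bad"
    return "average"  # strictly between 2.5 and 4
```

A product with an average rating of 3.2, for example, falls into the Average group.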
2.2. Conceptual Level Similarity Measure Spam Detection
The conceptual level similarity measure detects spam reviews based on the product features that are commented on in the reviews. It mainly considers three review formats:
Format 1: pros and cons. The reviewer lists pros and cons separately.
Format 2: pros, cons and detailed review. The reviewer is asked for a detailed review in addition to pros and cons.
Format 3: free format. There is no separation of pros and cons in the review.
Spam reviews are categorized into:
Duplicated review: the sets of features of the two reviews are exactly identical.
Near duplicated review: the number of matching features of the two reviews lies between the threshold and less than 100%.
Non-spam reviews are of two types:
Partially related review: the number of matching features in the two reviews is below the threshold.
Unique review: the number of matching features between the two reviews is zero.
The technique proposed in [2] consists of the following components:
(1) Review Data Store
(2) Conceptual Level Similarity Measure
(3) Human Perception
(4) Spam/Non-Spam Review Assessment
2.2.1. Review Data Store: The review data store contains two main parts.
Review region extractor: It identifies and extracts only the relevant review region of a given Web page, leaving out the other, irrelevant information.
Review extractor: It extracts the individual reviews from the review region produced by the review region extractor (identifying pros and cons as separate reviews) and stores them in two raw review databases, one for pros and another for cons.
2.2.2. Conceptual Level Similarity Measure:
Phase I: Features are extracted from the reviews stored in the raw review database and stored in the feature database.
Phase II: The feature matrix is constructed and the Spam_Detect_Conceptual component detects spam and non-spam reviews. A review feature matrix 'M' of order m x n is constructed.
2.2.3. Spam Detection Based on Concepts: This module accepts the feature matrix as input and finds the number of matching features, including equivalent synonyms, between the reviews; this is used to detect spam and non-spam reviews. The conceptual level similarity measure between two review documents Ri and Rk is defined as:
Sim(Ri, Rk) = …… (1)
The measured similarity is used to classify the reviews as spam or non-spam based on a threshold value T. If sim(Ri, Rk) is in [T, NC], then Ri and Rk are spam: if sim(Ri, Rk) = NC, the reviews are considered duplicates, otherwise near duplicates. If sim(Ri, Rk) = 0, then Ri and Rk are unique; otherwise they are partially related.
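The threshold rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the method of [2]: the similarity here is taken to be the fraction of matching features over the union of the two feature sets, so a complete match plays the role of NC, and the default threshold of 0.5 and the `synonyms` mapping are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def canonical(features: Iterable[str],
              synonyms: Optional[Dict[str, str]] = None) -> Set[str]:
    """Lower-case the features and map synonyms to one representative term."""
    synonyms = synonyms or {}
    return {synonyms.get(f.lower(), f.lower()) for f in features}


def classify_pair(fi: Iterable[str], fk: Iterable[str],
                  threshold: float = 0.5,
                  synonyms: Optional[Dict[str, str]] = None) -> str:
    """Classify two reviews from their product-feature sets."""
    a, b = canonical(fi, synonyms), canonical(fk, synonyms)
    matches = len(a & b)
    if matches == 0:
        return "unique"              # non-spam
    ratio = matches / len(a | b)     # assumed similarity: share of matching features
    if ratio == 1.0:
        return "duplicate"           # spam: feature sets are identical
    if ratio >= threshold:
        return "near-duplicate"      # spam: between the threshold and a full match
    return "partially related"       # non-spam: below the threshold
```

With the hypothetical mapping `{"display": "screen"}`, a review mentioning "display" and "battery" and one mentioning "screen" and "battery" would be classified as duplicates.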
2.3. High-order Concept Association Mining and Inferential Language Modeling for Online Review Spam Detection
One of the main contributions reported in [3] is the design and development of a novel methodology for automatic detection of untruthful reviews. [3] proposes a probabilistic language model (LM) based method to estimate the content overlap between any pair of reviews. For untruthful review spam detection, a pair of reviews d1 and d2 with |d1| ≤ |d2| is compared each time. In other words, the shorter review is denoted d1 and can be treated as a long query. The aim in [3] is to apply −KL(d1 ‖ d2), that is, the negative Kullback-Leibler divergence, as a quantitative measure of the content overlap between the two reviews d1 and d2. In particular, even if a word from d1 is not found in d2, this does not necessarily mean that there is no overlap between d1 and d2. When spammers create untruthful reviews, some of the words may be deliberately “translated” to other
similar words so as to give readers the impression that the reviews are not copied from other reviews. One main contribution of [3] is therefore a text mining method for discovering high-order concept associations such as “fabulous” → “fantastic” for review spam detection. The proposed high-order concept discovery method is underpinned by a context-sensitive text mining approach. It consists of three main phases: (1) concept extraction; (2) concept pruning; (3) association extraction.
2.3.1. Concept Extraction
Mutual Information (MI) is an information theoretic measure of the dependency between two entities, defined by
$$MI(t_i, t_j) = \log_2 \frac{\Pr(t_i, t_j)}{\Pr(t_i)\,\Pr(t_j)}$$
By applying the concept extraction procedure to a review collection, high-level concepts are discovered and represented as concept vectors.
2.3.2. Concept Pruning
The relevance score of a concept is computed as
$$Rel(c_i, D_j) = \frac{Dom(c_i, D_j)}{\sum_{k=1}^{n} Dom(c_i, D_k)}$$
The higher the value of Rel(ci, Dj), the more significant the concept ci is in domain (context) Dj. Only concepts with a relevance score greater than the threshold are selected.
2.3.3. Association Extraction
Association relations are extracted based on the notion of “subsumption”. Spec(cx, cy) denotes that concept cx is a specialization of another concept cy. The degree of a subsumption relation is derived by
$$Spec(c_x, c_y) = \frac{\sum_{t \in c_x \cap c_y} \min\big(w_x(t),\, w_y(t)\big)}{\sum_{t \in c_x} w_x(t)}$$
where w_x(t) and w_y(t) denote the weights of term t in cx and cy.
The degree of subsumption (specificity) of cx to cy is the ratio of the sum of the minimal association weights of the terms common to the two concepts to the sum of the term weights of concept cx. Only the subsumption relations with the highest association values are selected, according to a filtering threshold “δ”.
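The quantities used in this section can be sketched in Python. This is a minimal illustration and not the implementation of [3]: the add-alpha smoothing over the joint vocabulary in `neg_kl`, the representation of a concept as a term-to-weight dictionary, and all function names are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def neg_kl(d1: List[str], d2: List[str], alpha: float = 1.0) -> float:
    """-KL(d1 || d2) between unigram models of two tokenized reviews.

    Add-alpha smoothing (an assumption here) keeps every probability
    non-zero, so a word of d1 that is missing from d2 lowers the score
    without making it undefined.
    """
    vocab = set(d1) | set(d2)
    c1, c2 = Counter(d1), Counter(d2)
    n1 = len(d1) + alpha * len(vocab)
    n2 = len(d2) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (c1[w] + alpha) / n1
        q = (c2[w] + alpha) / n2
        kl += p * math.log2(p / q)
    return -kl  # values closer to 0 indicate more content overlap


def mutual_information(p_ij: float, p_i: float, p_j: float) -> float:
    """MI(ti, tj) = log2( Pr(ti, tj) / (Pr(ti) * Pr(tj)) )."""
    return math.log2(p_ij / (p_i * p_j))


def spec(cx: Dict[str, float], cy: Dict[str, float]) -> float:
    """Degree to which concept cx is subsumed by concept cy."""
    common = set(cx) & set(cy)
    numerator = sum(min(cx[t], cy[t]) for t in common)
    denominator = sum(cx.values())
    return numerator / denominator if denominator else 0.0
```

Under these assumptions, a review compared with itself scores 0, and a review that shares most of its words with another scores higher (closer to 0) than one that shares none.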
3. Comparison

Table 1: Comparison of three review spam detection techniques

TECHNIQUE | METHOD | PROS | TYPE
Analyzing and Detecting Review Spam | Supervised learning method | Finds LR (logistic regression) most suitable; identifies 3 types of duplicates | Type 1, Type 2, Type 3
Conceptual Level Similarity Measure Based Review Spam Detection | Conceptual level similarity measure | Larger number of duplicated spam reviews identified | Type 1
High-order Concept Associations Mining and Inferential Language Modeling for Online Review Spam Detection | Conceptual level similarity measure | Larger number of duplicated spam reviews identified | Type 1
4. Conclusion
In this paper, a brief description of three review spam detection techniques is given. The broad study of the topic helped in understanding the methods used at each step of the techniques, and a comparative study of the three methods on various aspects helped in understanding each of them thoroughly. The language model based spam detection method does not rely on review features for spam identification. In the conceptual level similarity measure, comparing the results with human perception makes for an unrealistic approach to detecting spam reviews. Analyzing and detecting review spam mainly discusses the classification of spam and non-spam reviews. The conceptual level similarity measure categorizes spam and non-spam reviews by duplication in reviews. The language model is used for a detailed study of duplicated reviews.

5. References
[1] Nitin Jindal and Bing Liu, "Analyzing and Detecting Review Spam," Seventh IEEE International Conference on Data Mining, 2007.
[2] Siddu P. Algur, Amit P. Patil, P. S. Hiremath, and S. Shivashankar, "Conceptual level Similarity Measure based Review Spam Detection," in 2010 IEEE conference.
[3] Raymond Y. K. Lau, S. Y. Liao, Ron Chi-Wai Kwok, Kaiquan Xu, Yunqing Xia, and Yuefeng Li, "Text Mining and Probabilistic Language Modeling for Online Review Spam Detection," ACM Trans. Manag. Inform. Syst. 2, 4, Article 25 (December 2011).
[4] M. Brennan, S. Wrazien, and R. Greenstadt, "Using machine learning to augment collaborative filtering of community discussions," in Proc. 9th Int. Joint Conf. Auton. Agents Multiagent Syst. (AAMAS), Toronto, ON, Canada, 2010, pp. 1569–1570.
[5] A. Whitby, A. Jøsang, and J. Indulska, "Filtering out unfair ratings in Bayesian reputation systems," Icfain J. Manage. Res., vol. 4, no. 2, pp. 48–64, 2005.