IJSRD - International Journal for Scientific Research & Development| Vol. 4, Issue 05, 2016 | ISSN (online): 2321-0613
Recommendation based on Sentiment Analysis in Big-Data Environment Yamini V. Nikam1 M.B.Vaidya2 Department of Computer Engineering 1,2 Amruthvahini College of Enggineering., Sangmner, India 1,2
Abstract— With the rapid development of Internet, the number of users and items of recommender system is growing exponentially. As a result, the single-node machine implementing these algorithms is time-consuming and unable to meet the computing needs of large data sets. Therefore, implementing the algorithm distributed will reduce the required computing time. To improve the performance, we proposed a distributed collaborative filtering recommendation algorithm combining Sentimental Analysis on Hadoop. Key words: Hadoop, Big-Data
Apache Hadoop is an open-source distributed computing framework that can be composed of a large number of low-cost hardware to run the application on a cluster. It provides applications with a set of stable and reliable interface designed to build a high reliability and good scalability of distributed systems. Hadoop implements Google's MapReduce programming model. MapReduce is a distributed computing model and is also the core of Hadoop. II. SYSTEM ARCHITECTURE
I. INTRODUCTION Collaborative Filtering (CF) is a promising technique in recommender systems. It provides personalized recommendations to users based on a database of user preferences, from which users having similar tastes are identified. It then recommends to a target user items liked by other, similar users. CF-based recommender systems can be classified into two major types depending on how they collect user preferences: user-log based and ratings based. User-log based CF obtains user preferences from implicit votes captured through users’ interactions with the system (e.g. purchase histories as in Amazon.com ). Ratings based CF makes use of explicit ratings users have given items (e.g. 5-star rating scale as in MovieLens ). Such ratings are usually in or can easily be transformed into numerical values (e.g. A to E). Some review hubs, such as the Internet Movie Database (IMDb), allow users to provide comments in free text format, referred to as user reviews. User reviews can also be considered as type of “user ratings”, although they are usually natural language texts rather than numerical values. While research on mining user preferences from reviews, a problem known as sentiment analysis or sentiment classification ), is becoming increasingly popular in the text mining literature, its integration with CF has only received little research attention. A. Hybrid Recommendation Algorithm Based On Hadoop: The collaborative filtering is the most successful technique in use of recommendation and is the most widely used recommendation algorithm in recommender system. Compared with the traditional user-based, item-based collaborative filtering algorithms, Slope One algorithm is simpler and more efficient. However, the algorithm relies on the targeting predicting item ratings which are already rated by a large number of users. If users have no similar interests with the predicting user, the rating of the predicting user will encounter a lot of interference. With the rapid development of Internet, the number of users and items of recommender system is growing exponentially. As a result, the single-node machine implementing these algorithms is time-consuming and unable to meet the computing needs of large data sets. Therefore, implementing the algorithm distributed will reduce the required computing time.
Fig. 1: Integrating Collaborative Filtering and Sentiment Analysis The steps are as follows1) Data Preparation POS Tagging Negation Tagging Feature generalization 2) Review Analysis 3) Opinion Dictionary Construction MapReduce Job-1 : Work of the first MapReduce job is to collect all the user rated both the items. MapReduce Job-2 : second MapReduce job will find the similarity between items using correlation formula. III. ALGORITHM A. Algorithm-1: Map-1 Emit the user_id as key and (item and rating) as value Job-1 Input:key-line offset value- row of input file contains (item_id,user_id,rating) Output:Key- user_id Value-(item_id,rating) pair Require: Input dataset containing User_id, Item_id, rating fields
All rights reserved by www.ijsrd.com
1422
Recommendation based on Sentiment Analysis in Big-Data Environment (IJSRD/Vol. 4/Issue 05/2016/348)
B. Procedure:
H. Procedure
user_id, item_id, rating = line.split('\t') key=user_id value=item_id value.append(rating) emit(key,value)
C. Algorithm-2: Reducer-1 For each user, emit a row containing their "postings"(item , rating pairs) Input:key- user_id value- Sequence of (user_id , rating) Output:Key- user_id Value-row contain all posting of user (item_id , rating) D. Procedure:
item_count = 0 item_sum = 0 final = [] for item_id, rating in values { item_count += 1 item_sum += rating final.append((item_id, rating)) Key=user_id Value= item_count, item_sum, final Emit (key,value) }
E. Algorithm-3: Map-2: The output drops the user from the key entirely, instead it emits the pair of items as the key Job-2: Require Output of job-1 as input to the job-2. Input:key-user_id value- row containing all the posting of user (user_id , rating) Output:Key- (item_id , item_id) Value-(rating , rating) F. Procedure:
item_count , item_sum, ratings = values for item1, item2 in combinations(ratings, 2) { key=(item1[0], item2[0]) value=( item1[1], item2[1]) Emit=(key , value) }
G. Algorithm-4: Reduce-2: Sum components of each co rating pair across all users who rated both item x and item y, then calculate correlation similarity. Input:key- (item_id , item_id) value- sequence of rating pair(rating , rating) Output:Key- (item_id , item_id) Value-(similarity , n)
sum_xx, sum_xy, sum_yy, sum_x, sum_y, n = (0.0, 0.0, 0.0, 0.0, 0.0, 0) sum_x += item_x n += 1 item_pair, co_ratings = pair_key, lines item_xname, item_yname = item_pair for item_x, item_y in lines: { sum_xx += item_x * item_x sum_yy += item_y * item_y sum_xy += item_x * item_y sum_y += item_y similarity = normalized_correlation(n, sum_xy, sum_x, sum_y,sum_xx, sum_yy) Key=(item_xname,item_yname) Value= (similarity) Emit(key,value) } IV. PERFORMANCE ANALYSIS
In this chapter, presenting the results of performance evaluation of practical implementation. Performance of system is measured by comparing with previous work, algorithm and time complexity. In this project, IMDB’s MovieLense dataset is used. First system performance of the data set analyzed and then algorithm. Our goal in the performance analysis is to optimize algorithm and compare results with other algorithms. A. Data Set: MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Members of the GroupLens Research Project are involved in many research projects related to the fields of information filtering, collaborative filtering, and recommender systems. The project is lead by professors John Riedl and Joseph Konstan. The project began to explore automated collaborative filtering in 1992, but is most well known for its world wide trial of an automated collaborative filtering system for Usenet news in 1996. The technology developed in the Usenet trial formed the base for the formation of Net Perceptions, Inc., which was founded by members of GroupLens Research. Since then the project has expanded its scope to research overall information filtering solutions, integrating in content‐based methods as well as improving current collaborative filtering technology. This data set consists of: 100,000 ratings (1‐5) from 943 users on 1682 movies. Each user has rated at least 20 movies. Simple demographic info for the users (age, gender, occupation, zip) The data was collected through the MovieLens web site (movielens.umn.edu) during the seven‐month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up ‐ users who had less than 20 ratings or did not have complete demographic information were removed from this data set.
All rights reserved by www.ijsrd.com
1423
Recommendation based on Sentiment Analysis in Big-Data Environment (IJSRD/Vol. 4/Issue 05/2016/348)
V. RESULTS ANALYSIS 1) For dataset size 12500 reviews
Table 1: 2) For dataset size 25,000 reviews
Table 2: 3) For dataset size 50,0000 reviews
Table 3: 4) For dataset size 1,00,000 reviews
Table 4: VI. CONCLUSION With the scale of recommender system continues to expand, the number of users and items of recommender system is growing exponentially. Many algorithm need to meet the computing needs of large data sets. In this paper, the parallel hybrid recommendation algorithm can run on a number of low-cost computers. It improves speed and solves high recommendation performance under large data sets. Experiments show that the improved parallel hybrid recommendation algorithm compared with the former one improves the speed and reduces the time consumption. We proposed a rating inference approach to integrating sentiment analysis and CF. Such approach transforms user preferences expressed as unstructured, natural language texts into numerical scales that can be understood by existing CF algorithms.
VII. FUTURE WORK As noted, our rating inference approach transforms textual reviews into ratings to enable easy integration of sentiment analysis and CF. We nonetheless recognize the possibility to perform text-based CF directly from a collection of user reviews. A possible solution is to model text-based CF as an information retrieval (IR) problem, having reviews written by a target user as the “query” and those written by other similar users as the “relevant documents”, from which recommendations for the target user can be generated. This remains as an interesting research direction for future work. VIII. REFERENCES [1] A David Goldberg, David Nichols, Brian M. Oki and Douglas Terry, “Using collaborative filtering to weave an information Tapestry.”, Communications of the ACM Dec 1992 v35 n12 p61(10) [2] Yehuda Koren, Yahoo Research, Robert Bell and Chris Volinsky, AT&T Labs—Research, “Learning Collaborative Information Filters” Published by the IEEE Computer Society 0018- 9162/09 2009 IEEE [3] Hao Ma, Irwin King, Michael Rung-Tsong Lyu, “Matrix Factorization Technique for Recommendation System ”, IEEE Transaction on Knowledge and Data Engineering, Vol. 24, No. 6 June 2012. [4] Manos Papagelisa, Dimitris Plexousakis"Qualitative analysis of user based and item based prediction algorithms for recommendation agents". Elsevier Engineering Applications of Artificial Intelligence 18 (2005) 781–789 2005 [5] Kai Yu, Xiaowei Xu, Jianhua Tao, Martin Ester,HansPeter Kriegel “Instance selection techniques for memory based collaborative filtering. “ Proc. Second SIAM International Conference on Data Mining (SDM'02) [6] Prem Melville and Vikas Sindhwani, Recommender Systems, Encyclopedia of Machine Learning, 2010. [7] Montaner, M.; Lopez, B.; de la Rosa, J. L. (June 2003). "A Taxonomy of Recommender Agents on the Internet". Artificial Intelligence Review 19(4):285–330. [8] Adomavicius, G.; Tuzhilin, A. (June 2005). "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions". IEEE Transactions on Knowledge and Data Engineering 17 (6): 734–749. doi:10.1109/TKDE.2005.99. [9] Herlocker, J. L.; Konstan, J. A.; Terveen, L. G.; Riedl, J. T. (January 2004). "Evaluating collaborative filtering recommender systems". ACM Trans. Inf. Syst. 22 (1): 5– 53. doi:10.1145/963770.963772. [10] John S. Breese, David Heckerman, and Carl Kadie (1998). "Empirical analysis of predictive algorithms for collaborative filtering". In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence (UAI'98). [11] B. Pang, L. Lee, and S. Vaithyanathan, ‘Thumbs up? Sentiment classification using machine learning techniques’, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 79–86, (2002). [12] P.D. Turney, ‘Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews’, in Proceedings of the 40th Annual Meeting of
All rights reserved by www.ijsrd.com
1424
Recommendation based on Sentiment Analysis in Big-Data Environment (IJSRD/Vol. 4/Issue 05/2016/348)
the Association for Computational Linguistics, pp. 417– 424, (2002). [13] M. Hu and B. Liu, ‘Mining and summarizing customer reviews’, in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177,(2004). [14] B. Pang and L. Lee, ‘Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales’, in Proceedings of the 43rd Annual Meeting of the Association for Computation Linguistics, pp. 115–124, (2005). [15] L. Terveen, W. Hill, B. Amento, D. McDonald, and J. Creter, ‘Phoaks: A system for sharing recommendations’, Communications of the ACM, 40(3), 59–62, (1997). [16] Schelter, S., Boden, C., et aI., "Scalable similarity-based neighborhood methods with Map Reduce". Proc. 6th ACM conference on Recommender systems, Dublin, Ireland, pp. 163- 170,2012. [17] Hadoop: Open source implementation of MapReduce,http : //lucene.apache.org/hadoop/. [18] Lammel, R.: Google's MapReduce Programming ModelRevisited. Science of Computer Programming 70, 1-30, 2008. [19] Zhao, W., Ma, H., et al.: Parallel K-Means Clustering Based on MapReduce. In: CloudCom 2009. LNCS, vol. 5931, pp. 674- 679,2009. [20] K. Shvachko, H. Kuang, S. Radia, and R. Chans1er, "The Hadoop distributed file system," Mass Storage Systems and Technologies, IEEE/ NASA Goddard Conference, vol. 0, pp. 1- 10,2010.
All rights reserved by www.ijsrd.com
1425