Movie review scoring based on text mining techniques
Uriel Chareca
LINKÖPING UNIVERSITET

ABSTRACT
In this paper, I introduce a model to score an abstract, or a full review, based on text mining techniques. Since the lexicon used in movie reviews varies according to their final classification, a segmentation of common words by grade can be defined, and therefore a score can be predicted. Specifically, in my work I used an inverted index in order to identify the most probable score based on ranked retrieval using tf-idf weighting.

INTRODUCTION
With the broad access to information nowadays, we can access different opinions on every single topic and try to summarize them into a single conclusion. In previous years, movie lovers used to read the local newspaper in order to access a qualified review of an interesting new release. Nowadays, of the several sites that combine many of the world's most respected critics, Metacritic.com and Rottentomatoes.com are the most recognized. While Rottentomatoes presents this type of information, it targets an audience more eager for the latest movie business scoops and rumors; Metacritic is recognized for its more professional approach and a broader scope of analysis, including TV, music and games reviews as well. The latter also presents a sleeker and simpler site; therefore it was used as the basis for this report. Metacritic curates a large number of recognized critic reviews and applies a weighted average (based on internal profile criteria, considering the prestige of the critic and/or their publication) that summarizes the range of their scores into a single "Metascore" on a 0-100 scale (fig 1). This task is not as simple as it looks, as not all reviews present a proper score: many critics still believe a text speaks louder than a synthetic grade, or use different scales, e.g. 1-10 numerical scales, 0-4 (or 5) stars, or letter grades (A-F). Therefore a scoring algorithm that considers the lexicon used, converting a text into a 0-100 point score, would be both useful and interesting for analysing score criteria. (fig 1)
(fig 2)
RETRIEVAL OF REVIEWS
(the code used, commented, is attached in the appendix)
Along with their internally converted scores, the site provides a small abstract of each review, usually taken from the summary provided on the respective review sites. In order to standardize the retrieval algorithm, these abstracts were used as the basis of this paper's analysis (fig 2). In the home page source, I identified the significant string segments that point to each particular movie page. From each individual movie page I was able to store all the available reviews, along with their grades, into lists and individual .txt files (to avoid retrieving them every time we run the program). The first step was then to group the individual reviews into slots of 10 points, in order to have a sizeable corpus in each slot and avoid over-fitting to specific scores. Figure 3 below shows the histogram of the 2,190 reviews extracted. It is worth noting that no review was retrieved with a score below 20. (fig 3)
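The actual retrieval code lives in the appendix; purely as an illustration of this step, the sketch below assumes the movie links and the grade/abstract pairs can be located with simple regular expressions. The URL and the patterns here are placeholders, not the real site markup.

import re
import urllib.request

BASE_URL = "https://www.metacritic.com"                       # site used as the basis for this report
MOVIE_LINK_RE = re.compile(r'href="(/movie/[^"]+)"')          # placeholder pattern, not the real markup
REVIEW_RE = re.compile(r'data-grade="(\d+)"[^>]*>([^<]+)<')   # placeholder pattern, not the real markup

def fetch(url):
    # Download a page and return its text content
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def retrieve_reviews(home_url=BASE_URL):
    # Collect (grade, abstract) pairs from every movie page linked on the home page
    reviews = []
    for path in set(MOVIE_LINK_RE.findall(fetch(home_url))):
        page = fetch(BASE_URL + path)
        reviews.extend((int(g), txt.strip()) for g, txt in REVIEW_RE.findall(page))
    return reviews

def save_by_slot(reviews):
    # Store reviews in one .txt file per 10-point slot, so they are not re-fetched every run
    for grade, text in reviews:
        slot = (grade // 10) * 10
        with open("reviews_%d.txt" % slot, "a", encoding="utf-8") as f:
            f.write(text + "\n")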
Once the reviews are extracted and classified, the next step is to preprocess them, aiming to separate them into relevant tokens. In this process, I started by separating them into sentences, then filtering out the usual stop words, short connectors and punctuation signs. Now we can recognize the most common set of words by score (fig 4), but this exercise alone won't allow us to efficiently classify a new review, as many scores share words or include non-significant terms. A minimal sketch of this preprocessing step follows the word lists below. (fig 5)
Most frequent tokens by grade slot:
20: film, enough, like, movie, even, little, make, makes, never, something, thriller, almost, also, comedy, feels, gain, love, making, much, nothing
30: film, even, much, like, movie, story, thriller, close, could, drama, feels, funny, least, never, portrait, sense, theres, things, anything, artist
40: film, movie, like, much, story, even, feels, narrative, tale, arthur, films, isnt, life, little, make, style, would, cast, comedy, everything
50: film, movie, like, much, even, story, less, better, enough, good, many, still, director, feels, never, actors, another, comedy, doesnt, make
60: film, movie, story, much, even, like, enough, characters, good, also, director, drama, never, often, less, plot, tale, work, could, documentary
70: movie, film, like, story, even, much, make, time, never, drama, plot, well, work, good, director, enough, star, also, keep, little
80: film, movie, like, much, even, exhilarating, films, characters, funny, sense, story, would, best, darkness, never, something, cinema, drama, feels, make
90: film, like, movie, films, characters, much, place, cinema, leave, looking, love, often, tale, things, though, work, years, actors, almost, along
100: film, like, movie, shining, also, could, even, room, director, documentary, everything, feature, greatest, nichols, people, place, plot, something, star, appeal
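Here is the preprocessing sketch referred to above, assuming a simple regular-expression tokenizer and a hand-written stop-word list (the actual list used in the appendix code is longer):

import re
from collections import Counter

# Small illustrative stop-word list; the list used in the appendix code is larger
STOP_WORDS = {"the", "and", "that", "with", "this", "from", "have", "there",
              "into", "about", "their", "more", "than", "when", "what"}

def tokenize(review):
    # Split a review into sentences, lowercase the words, drop punctuation,
    # short connectors and stop words
    tokens = []
    for sentence in re.split(r"[.!?]+", review):
        for word in re.findall(r"[a-z']+", sentence.lower()):
            word = word.replace("'", "")          # e.g. "doesn't" becomes "doesnt", as in the lists above
            if len(word) > 3 and word not in STOP_WORDS:
                tokens.append(word)
    return tokens

def most_common_by_slot(slot_reviews, n=20):
    # slot_reviews: {grade_slot: [review text, ...]}
    # Returns the n most frequent tokens of each slot, as listed above
    top = {}
    for slot, reviews in slot_reviews.items():
        counts = Counter()
        for review in reviews:
            counts.update(tokenize(review))
        top[slot] = [word for word, _ in counts.most_common(n)]
    return top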
In addition to calculating the most frequent terms (which, as we have seen, is not the only relevant factor), we need to include the rare terms in the analysis as well. Therefore an inverted index dictionary was created, recording for each token in our set of words the number of documents in which it appears, along with the number of times it shows up in each document. For example: 'film':[9, {2: 8, 3: 15, 4: 23, 5: 39, 6: 52, 7: 43, 8: 51, 9: 15, 10: 6}] appears in 9 documents (all 9 of the 10-point slots), most often (52 times) in the 60-70 slot. This information is then used to create a weight matrix based on the log frequency (tf) weight of each term t in document d, defined as: (fig 6)
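A sketch of how such an inverted index dictionary can be built from the per-slot token lists; the slot keys 2 to 10 stand for the 20 to 100 grade slots, matching the 'film' example above.

from collections import Counter

def build_inverted_index(slot_tokens):
    # slot_tokens: {slot_id: [token, ...]}, slot_id 2..10 mapping to the 20..100 grade slots
    # Returns {token: [document_frequency, {slot_id: term_frequency}]},
    # e.g. 'film': [9, {2: 8, 3: 15, ..., 10: 6}]
    index = {}
    for slot_id, tokens in slot_tokens.items():
        for token, count in Counter(tokens).items():
            entry = index.setdefault(token, [0, {}])
            entry[0] += 1                # one more document (grade slot) contains this token
            entry[1][slot_id] = count    # raw term frequency of the token in this slot
    return index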
But to also consider the document frequency of rare terms, we calculate the tf-idf weight by multiplying this weight w by the idf (inverse document frequency): (fig 6)
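The weighting follows the standard scheme from Manning et al. (see references): w_t,d = 1 + log10(tf_t,d) when tf_t,d > 0 and 0 otherwise, multiplied by idf_t = log10(N/df_t), with N the number of documents (here the 9 grade slots). A short sketch:

import math

def tf_idf_weight(tf, df, n_docs=9):
    # Standard tf-idf weight: (1 + log10 tf) * log10(N / df)
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log10(tf)) * math.log10(n_docs / df)

# With the 'film' entry above (tf = 52 in the 60 slot, df = 9): idf = log10(9/9) = 0,
# so 'film' carries no weight; a term present in every grade slot cannot discriminate.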
Now we can produce a vector for each query and for each document (the group of reviews under a specific grade), based on the tf-idf weight of each individual token. In both cases a length-normalization step is included, and only the nonzero terms are considered, to avoid very sparse vectors and to simplify calculations. Once we have the query vector, we aim to find the grade with the highest similarity to it, defined by the ranking order of cosine(query, document): (fig 7)
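A sketch of this similarity step, representing queries and grade documents as sparse {token: weight} dictionaries with length normalization as described above:

import math

def normalize(vec):
    # Length-normalize a sparse vector given as {token: tf-idf weight}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(query_vec, doc_vec):
    # Cosine similarity between two length-normalized sparse vectors,
    # iterating only over the nonzero query terms
    q, d = normalize(query_vec), normalize(doc_vec)
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def rank_grades(query_vec, grade_vecs):
    # grade_vecs: {grade: tf-idf vector of the reviews under that grade}
    # Returns (similarity, grade) pairs, most similar grade first
    return sorted(((cosine(query_vec, v), g) for g, v in grade_vecs.items()), reverse=True)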
Example. Review: "Yes, this one is even better: funnier, brawnier and ingeniously constructed for appeal to both devoted fans and reluctant converts." (real score: 100). Chart of tf-idf weights by word, along with the cosine value for each possible grade match: (fig 8)
Token        tf-idf   20     30     40     50     60     70     80     90     100
even         0,000    0,060  0,061  0,049  0,042  0,040  0,038  0,040  0,046  0,069
devoted      0,305    0,037  0,000  0,000  0,020  0,000  0,000  0,000  0,000  0,056
constructed  0,225    0,000  0,000  0,000  0,026  0,000  0,022  0,019  0,000  0,056
better       0,033    0,048  0,000  0,040  0,040  0,032  0,027  0,028  0,036  0,056
fans         0,070    0,000  0,031  0,033  0,026  0,023  0,022  0,019  0,000  0,056
converts     0,417    0,000  0,000  0,000  0,020  0,000  0,000  0,000  0,000  0,056
brawnier     0,609    0,000  0,000  0,000  0,000  0,000  0,000  0,000  0,000  0,056
ingeniously  0,417    0,000  0,000  0,000  0,000  0,018  0,000  0,000  0,000  0,056
appeal       0,163    0,000  0,000  0,000  0,020  0,018  0,022  0,019  0,000  0,056
funnier      0,305    0,000  0,000  0,025  0,000  0,000  0,000  0,025  0,000  0,056
reluctant    0,112    0,000  0,031  0,000  0,020  0,026  0,017  0,025  0,000  0,056
cosine                0,013  0,006  0,011  0,029  0,016  0,013  0,020  0,001  0,148
Therefore, the most similar grade would be 100, but in order to average results I included the option, in different models, to rank the top 1, top 3 or top 9 grades by a normalized weighted average. Below is an example of the output with 9 grades considered:

key words: even/devoted/constructed/better/fans/converts/brawnier/ingeniously/appeal/funnier/reluctant
----------------------------
10 Score - Normalized Similarity 57%
5 Score - Normalized Similarity 11%
8 Score - Normalized Similarity 7%
6 Score - Normalized Similarity 6%
2 Score - Normalized Similarity 5%
7 Score - Normalized Similarity 4%
4 Score - Normalized Similarity 4%
3 Score - Normalized Similarity 2%
9 Score - Normalized Similarity 0%
----------------------------
FINAL AVERAGE SCORE 80
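A sketch of the normalized weighted average over the top-k matches (the Top 1 model simply takes the single best grade; Top 3 and Top 9 blend the best three or nine grade slots by their normalized similarities, as in the output above):

def weighted_score(ranked, k):
    # ranked: (similarity, grade_slot) pairs sorted by similarity, descending,
    # with grade_slot in 2..10 standing for grades 20..100
    top = ranked[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return 0
    return round(sum(sim / total * slot * 10 for sim, slot in top))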
Finally, I validated the models against our training data and against a new test set taken from the latest DVD releases (2,340 new reviews). Here are the main findings, along with the histograms of grade distributions (real data in red, the output of the algorithm described above in blue) (fig 9)
Training set   TOP1   TOP3   TOP9
MSE             407    312    302
RMSE             20     18     17
(fig 10)
(fig 11)
Test set       TOP1   TOP3   TOP9
MSE             632    434    403
RMSE             25     21     20
(fig 12)
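The error statistics in the tables above follow the usual definitions; a minimal sketch, assuming the real and predicted scores are given as two parallel lists:

import math

def mse_rmse(real, predicted):
    # Mean squared error and root mean squared error between real and predicted scores
    errors = [(r - p) ** 2 for r, p in zip(real, predicted)]
    mse = sum(errors) / len(errors)
    return mse, math.sqrt(mse)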
CONCLUSION
In this paper I presented a model to assign a score to movie reviews, based on text mining techniques. The analysis of over 2000 review abstracts allowed the creation of a dictionary of common terms, which is used to classify new reviews according to a similarity coefficient. Three models were presented: a single best fit (Top 1), plus weighted averages of the best Top 3 and Top 9 matched grades. As expected, when tested on new data the model shows a worse fit, but one still not too far from the initial statistics, ruling out over-fitting. Even if, intuitively, the above histograms suggest that the model considering only a single match is closer to the real data distribution, a statistic such as the Root Mean Squared Error rejects that idea and confirms it should not be selected. The Top 3 model appears to be the best choice: even if it provides a worse visual fit, it presents a broader distribution of scores (heavier tails) and still has the flexibility of a weighted average, like the Top 9 model (with a similar level of RMSE).

REFERENCES
www.metacritic.com
C. D. Manning, P. Raghavan, H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.