Academy Awards prediction based on association analysis


Uriel Chareca, Linköpings universitet

ABSTRACT

In this paper, I introduce a model to predict the Academy Awards (Oscars®) based on association analysis of previously received awards. Using this information, it is possible to rank a set of prediction rules by filtering frequent combinations of awards while considering a minimum support and a minimum confidence.

INTRODUCTION

As we get close to the end of the year, top lists of movies are published by recognized critics, and many articles are written about which actor, actress, director or movie is front-running for the long-desired Academy Award, the Oscar®. The Oscars are the most recognized awards, but not the only prize of the so-called "Awards Season", a period between early January and the night of the Academy Awards ceremony (in 2013 held on Feb. 24th). It covers several ceremonies, given by different organizations (or guilds), that recognize the best of the year, such as the Critics' Choice Awards (this year held on Jan. 10th), the Golden Globes (foreign press association, Jan. 13th), the Producers Guild (Jan. 26th), the Directors Guild (Feb. 2nd) and the Writers Guild (Feb. 17th), among many others. From Oscar buzz and rumors to complex algorithms that consider review grades or box-office performance, everyone has their own recipe to predict who is going to end up successful on the big night. As mentioned, the Oscars are the highlight and the last of the many awards given; therefore, the aim of this paper is to focus on this information, identifying the most related awards and how they can be used to predict based on historical results.

RETRIEVAL OF AWARDS DATABASE

Even if the Academy Awards have been given since 1929, not all the other awards have the same longevity; for instance, the Critics' Choice Awards (as a guild award; many states have their own) started in 1996. Therefore, the scope of this study was focused only on the last 20 years, creating training data as homogeneous as possible. It is worth mentioning that some categories have also been added or discarded through the years; for instance, Best Animated Movie was only separated as a single category in 2001 (with 'Shrek' winning the award), while the first animated film to be nominated for Best Movie was "Beauty and the Beast" in 1991. There is no general database available that covers the whole scope considered, so a manual extraction of each award, year by year, was required. The result was a database of awards for 379 movies and 70 different awards from 7 different ceremonies (nomination information was left out of the scope of this paper, to reduce the number of associations). The 7 considered galas were the Oscars, the Golden Globes (GG), the Critics' Choice Awards (CCA), the Producers Guild (PROD), the Screen Actors Guild (SAG), the Directors Guild (DIR) and the Writers Guild (WRI). From the major ceremonies, only the Spirit Awards and the BAFTAs were excluded, as they focus more on independent and British films respectively. Some extra considerations: one distinction included in the analysis was the Oscars Top Nominated movie, which is not an award per se, but it is commonly considered significant information, and since it is available before the ceremony it can be used as an explanatory variable as well. On the other hand, out of the 24 Oscars given last year, the short-feature categories (Best Documentary Short, Live-Action Short, Animated Short) were excluded from the analysis, as no related prize is awarded in the other ceremonies considered.
Also, some awards had shared prizes in the past, such as the 2003 CCA Best Actor award shared by Daniel Day-Lewis for "Gangs of New York" and Jack Nicholson for "About Schmidt". Last, the analysis was made by movie and not by specific actor name: even if a movie may have more than one actor nominated in the same category, usually only one is the front-runner, so it is extremely rare that one actor wins in one ceremony but a different one wins in another.
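To make this layout concrete, the following is a minimal sketch (in Python, the language used later for the analysis) of how such a database can be encoded: each movie maps to the set of labels of the awards it won, named CEREMONY_Category. The two entries are illustrative examples drawn from the 2013 results discussed below, not the full database:

# Each movie is a "basket" whose items are the awards it won.
# Illustrative entries only; the real database covers 379 movies.
dataset = {
    "Argo": {"PROD_Movie", "GG_MovieDRAMA", "DIR_Movie", "CCA_Director",
             "OSCARS_Movie", "OSCARS_Editing"},
    "Lincoln": {"OSC_TopNominated", "SAG_ActorLeading",
                "OSCARS_ActorLeading"},
}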


ASSOCIATION ANALYSIS

As the number of available combinations of awards is extremely large (2,853 different itemsets were found), the first step is to filter the significant ones, mining from the data only the frequent sets of items. Defining the support of an itemset as the number of records in the data set that contain that specific combination of items (in this case I use an absolute count as threshold, instead of a ratio), we can identify as frequent all itemsets that comply with a minimum support of 6 cases. From this set of frequent itemsets (881 were found), the goal is to identify all rules x → y that have a minimum confidence. Confidence can be defined as the conditional probability that an itemset containing x also contains y: conf(x → y) = P(x, y) / P(x). Minimum support is an anti-monotone constraint: any subset of a frequent itemset complies with the constraint as well, and if an itemset does not comply, neither will any superset of it. (Fig 1-2)
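As a minimal sketch of these two definitions, assuming the movie-to-award-set encoding shown earlier (the function names are mine, not from the original script):

def support_count(itemset, transactions):
    # Number of movies whose set of won awards contains every item in `itemset`.
    return sum(1 for t in transactions if itemset <= t)

def confidence(x, y, transactions):
    # conf(x -> y) = support(x union y) / support(x)
    return support_count(x | y, transactions) / support_count(x, transactions)

# e.g. confidence({'DIR_Movie'}, {'OSCARS_Movie'}, list(dataset.values()))
# should land near the 80% reported in the conclusion.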

The Apriori algorithm used:

1. Scan the database once to get the frequent 1-itemsets.
2. Generate candidate (k+1)-itemsets from the frequent k-itemsets:
a. first, by self-joining the candidates in Lk;
b. then, by pruning candidates whose subsets are not in Lk.
3. Test the candidates against the database (minimum support) and keep the frequent itemsets, to then create the itemsets of length k+1.
4. Terminate when no frequent or candidate itemsets can be generated; otherwise go back to step 2 and generate further candidates.

From our list of frequent itemsets, we can now focus on identifying the association rules beneath them: one subset of a frequent itemset implies another item (the consequence). By the Apriori principle, we do not need to check support again, as any subset of a frequent itemset is frequent as well. Nevertheless, even if both A and B, as parts of AB, are frequent, it does not mean that both A → B and B → A are statistically significant. At this point we apply the threshold of minimum confidence: from every frequent itemset, we first identify all the available association rules and calculate their respective confidence, then trim the list to the significant ones only. The work for this paper was done in Python, which allowed further flexibility than Weka (the software used in class) to filter rules based on their output and apply the thresholds. Next, an example of the association rules created from the frequent itemset of winners of the Directors Guild Best Movie, the Oscars Best Movie and the Oscars Best Director:

freqSet: ['DIR_Movie', 'OSCARS_Movie', 'OSCARS_Director']
H1: [['DIR_Movie'], ['OSCARS_Movie'], ['OSCARS_Director']]
['OSCARS_Director', 'OSCARS_Movie'] ---> ['DIR_Movie'] conf: 0.875
['DIR_Movie', 'OSCARS_Director'] ---> ['OSCARS_Movie'] conf: 0.875
['DIR_Movie', 'OSCARS_Movie'] ---> ['OSCARS_Director'] conf: 0.875
['OSCARS_Director'] ---> ['DIR_Movie', 'OSCARS_Movie'] conf: 0.7
['OSCARS_Movie'] ---> ['DIR_Movie', 'OSCARS_Director'] conf: 0.7
['DIR_Movie'] ---> ['OSCARS_Director', 'OSCARS_Movie'] conf: 0.7
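Below is a compact Python sketch of these steps. It is not the actual script used for this paper, and the function names and structure are my own assumptions, but it implements the self-join, prune, and rule-generation logic just described:

from itertools import combinations

def apriori(transactions, min_support=6):
    # Frequent itemsets (award combinations) with absolute support >= min_support.
    transactions = list(transactions)
    # Step 1: one scan of the database for the candidate 1-itemsets.
    candidates = list({frozenset([item]) for t in transactions for item in t})
    frequent = {}
    k = 1
    while candidates:
        # Step 3: count support of every candidate in one pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Step 2: self-join the level-k survivors into (k+1)-candidates,
        # then prune any candidate with an infrequent k-subset (Apriori principle).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent  # Step 4: the loop ends once no candidates survive.

def gen_rules(frequent, min_conf=0.5):
    # All rules premise -> consequence with confidence >= min_conf.
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for premise in map(frozenset, combinations(itemset, r)):
                # The premise is frequent too (anti-monotonicity), so its
                # support is already in the table; no second database scan.
                conf = support / frequent[premise]
                if conf >= min_conf:
                    rules.append((premise, itemset - premise, conf))
    return sorted(rules, key=lambda rule: -rule[2])

With the thresholds used in this paper, the pipeline would be called as gen_rules(apriori(dataset.values(), min_support=6), min_conf=0.5).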


From this example, we can see how the algorithm works, ranking all available combinations into significant association rules. Nevertheless, not all of them are useful for this project: given the timing of the ceremonies, the goal is only to predict the Oscar winners based on previously given awards. Therefore, it was decided to filter only the rules with a single Oscar as consequence, in order to properly identify and rank them into a prediction model. Using a minimum support of 6 and a minimum confidence of 50%, from the over 11 thousand rules created, only approximately 2,700 identify a single award; keeping only the ones whose consequence is an Oscar award and that do not use an Oscar award as a premise, the list is then trimmed to roughly 300 rules.

MODEL

From the available response variables, the 21 Oscar awards taken as the outcome of our study, only 14 awards have a significant rule: from Best Animated Film, which has a single rule (CCA Animated, with 64% confidence), to Best Director, which has 78 rules. Fig. 5 details the top 5 rules for every category. Once the rules are created, a model can be defined to predict the Oscar winners based on the order of confidence of each rule. As many rules consider a combination of results, the first rule that yields a single secure option should be the one selected. For example, for the category Best Supporting Actor (Fig. 3), the first single solution is given by the 5th rule, 'Golden Globe Best Supporting Actor' → 'Oscar Best Supporting Actor', with a confidence of 65%: 'Django Unchained'. It is a clear example of an even race, with no front-runner, as all articles claimed in early 2013.

Fig 3. Top 5 rules for consequence ['OSCARS_ActorSupporting'] (2013 winner: Django)

FreqSet | Conf. | 2013 Prediction
['GG_ActorSupporting','CCA_ActorSupporting'] | 88% | Django, Master
['SAG_ActorSupporting','GG_ActorSupporting'] | 88% | Lincoln, Master
['SAG_ActorSupporting','GG_ActorSupporting','CCA_ActorSupporting'] | 86% | Lincoln, Django, Master
['SAG_ActorSupporting','CCA_ActorSupporting'] | 78% | Lincoln, Master
['GG_ActorSupporting'] | 65% | Django
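The selection step can be sketched as follows; this is my reading of the "first single secure option" strategy, with illustrative rule and winner data for the 2013 Supporting Actor race taken from Fig. 3 (only three of the five rules are shown):

def predict(rules, winners):
    # rules   -- list of (premise_set, confidence), sorted by confidence desc.
    # winners -- dict mapping each award label to the movie that won it.
    # Returns the first ranked rule whose premise points at exactly one movie.
    for premise, conf in rules:
        movies = {winners.get(award) for award in premise}
        movies.discard(None)          # award not given or unknown
        if len(movies) == 1:          # a single "secure" candidate
            return movies.pop(), conf
    return None, None                 # no single secure winner (N/A)

rules = [
    ({'GG_ActorSupporting', 'CCA_ActorSupporting'}, 0.88),
    ({'SAG_ActorSupporting', 'GG_ActorSupporting'}, 0.88),
    ({'GG_ActorSupporting'}, 0.65),
]
winners_2013 = {'GG_ActorSupporting': 'Django Unchained',
                'CCA_ActorSupporting': 'The Master',
                'SAG_ActorSupporting': 'Lincoln'}
print(predict(rules, winners_2013))   # ('Django Unchained', 0.65)

The combination rules fall through because their premises mix several movies, so the model settles on the single-award Golden Globe rule, exactly as in the Fig. 3 walk-through.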

If we test the model on the last, 85th Oscar edition (2013): of the 14 categories covered, 12 produced a prediction, as for Art Direction and Costume Design no single secure winner could be identified. 7 categories successfully selected the winner, including the top 5: Best Movie, Actor, Actress, Supporting Actor and Supporting Actress. A special case is the Director category, which needed to go down to the 74th rule, as most of the rules gave Ben Affleck ('Argo') as winner, when he was not even nominated.

Fig 4. Model predictions for the 2013 awards (a dash marks cells where no single secure rule applied)

Consequence | FreqSet | Conf. | 2013 Prediction | Winner | Result
['OSCARS_ActorLeading'] | ['OSC_TopNominated','SAG_ActorLeading'] | 100% | Lincoln | Lincoln | WIN
['OSCARS_ActorSupporting'] | ['GG_ActorSupporting'] | 65% | Django | Django | WIN
['OSCARS_ActressLeading'] | ['SAG_ActressLeading'] | 68% | Silver | Silver | WIN
['OSCARS_ActressSupporting'] | ['GG_ActressSupporting','SAG_ActressSupporting','CCA_ActressSupporting'] | 100% | Les Miz | Les Miz | WIN
['OSCARS_AdaptedScreenplay'] | ['GG_MovieDRAMA','WRI_AdaptedScreenplay','GG_Director'] | 100% | Argo | Argo | WIN
['OSCARS_Animated'] | ['CCA_Animated'] | 64% | Wreck It Ralph | Brave | LOSS
['OSCARS_ArtDirection'] | - | 55% | N/A | Lincoln | N/A
['OSCARS_Cinematography'] | ['PROD_Movie','GG_MovieDRAMA','CCA_Director'] | 67% | Argo | Life of Pi | LOSS
['OSCARS_CostumeDesign'] | - | - | N/A | Anna Karenina | N/A
['OSCARS_Director'] | ['OSC_TopNominated'] | 58% | Lincoln | Life of Pi | LOSS
['OSCARS_Editing'] | ['WRI_AdaptedScreenplay','GG_Director'] | 86% | Argo | Argo | WIN
['OSCARS_Movie'] | ['PROD_Movie','GG_MovieDRAMA','DIR_Movie','CCA_Director'] | 100% | Argo | Argo | WIN
['OSCARS_OriginalScore'] | ['PROD_Movie','GG_MovieDRAMA','CCA_Movie','DIR_Movie','CCA_Director'] | 75% | Argo | Life of Pi | LOSS
['OSCARS_OriginalScreenplay'] | ['WRI_OriginalScreenplay'] | 65% | Zero Dark Thirty | Django | LOSS


Fig 5. Top rules per Oscar category (FreqSet → Consequence, with confidence)

Consequence ['OSCARS_ActorLeading']:
['OSC_TopNominated','SAG_ActorLeading'] | 100%
['SAG_ActorLeading','GG_ActorLeadingDRAMA'] | 90%
['SAG_ActorLeading','CCA_ActorLeading','GG_ActorLeadingDRAMA'] | 88%
['CCA_ActorLeading','SAG_ActorLeading'] | 83%
['SAG_ActorLeading'] | 79%

Consequence ['OSCARS_ActorSupporting']:
['GG_ActorSupporting','CCA_ActorSupporting'] | 88%
['SAG_ActorSupporting','GG_ActorSupporting'] | 88%
['SAG_ActorSupporting','GG_ActorSupporting','CCA_ActorSupporting'] | 86%
['SAG_ActorSupporting','CCA_ActorSupporting'] | 78%
['GG_ActorSupporting'] | 65%

Consequence ['OSCARS_ActressLeading']:
['CCA_ActressLeading','GG_ActressLeadingDRAMA','SAG_ActressLeading'] | 86%
['SAG_ActressLeading','GG_ActressLeadingDRAMA'] | 86%
['CCA_ActressLeading','SAG_ActressLeading'] | 73%
['SAG_ActressLeading'] | 68%
['CCA_ActressLeading','GG_ActressLeadingDRAMA'] | 64%

Consequence ['OSCARS_ActressSupporting']:
['GG_ActressSupporting','SAG_ActressSupporting','CCA_ActressSupporting'] | 100%
['GG_ActressSupporting','CCA_ActressSupporting'] | 90%
['SAG_ActressSupporting','CCA_ActressSupporting'] | 89%
['GG_ActressSupporting','SAG_ActressSupporting'] | 83%
['SAG_ActressSupporting'] | 63%

Consequence ['OSCARS_AdaptedScreenplay']:
['GG_MovieDRAMA','WRI_AdaptedScreenplay','GG_Director'] | 100%
['WRI_AdaptedScreenplay','CCA_Director','DIR_Movie'] | 100%
['PROD_Movie','WRI_AdaptedScreenplay','DIR_Movie'] | 100%
['CCA_Movie','WRI_AdaptedScreenplay','CCA_Director'] | 100%
['WRI_AdaptedScreenplay','GG_Director','DIR_Movie'] | 100%

Consequence ['OSCARS_Animated']:
['CCA_Animated'] | 64%

Consequence ['OSCARS_ArtDirection']:
['OSC_TopNominated','GG_MovieDRAMA'] | 55%

Consequence ['OSCARS_Cinematography']:
['OSC_TopNominated','GG_Director'] | 70%
['PROD_Movie','OSC_TopNominated','CCA_Movie','DIR_Movie','CCA_Director'] | 67%
['PROD_Movie','GG_MovieDRAMA','CCA_Director'] | 67%
['OSC_TopNominated','GG_MovieDRAMA'] | 64%
['PROD_Movie','GG_MovieDRAMA','OSC_TopNominated'] | 60%

Consequence ['OSCARS_CostumeDesign']:
['PROD_Movie','GG_OriginalScore'] | 88%
['PROD_Movie','OSC_TopNominated','GG_OriginalScore'] | 86%
['OSC_TopNominated','GG_OriginalScore'] | 86%
['PROD_Movie','CCA_Director','GG_OriginalScore'] | 86%
['CCA_Director','GG_OriginalScore'] | 75%

Consequence ['OSCARS_Director']:
['OSC_TopNominated','DIR_Movie','GG_Director'] | 100%
['PROD_Movie','GG_MovieDRAMA','CCA_Movie','DIR_Movie','GG_Director'] | 100%
['PROD_Movie','OSC_TopNominated','GG_Director'] | 100%
['GG_Screenplay','DIR_Movie'] | 100%
['GG_MovieDRAMA','DIR_Movie','OSC_TopNominated'] | 100%

Consequence ['OSCARS_Editing']:
['GG_MovieDRAMA','CCA_Director','GG_OriginalScore'] | 100%
['WRI_AdaptedScreenplay','GG_Director'] | 86%
['PROD_Movie','CCA_Movie','CCA_Director','GG_Director'] | 86%
['CCA_Director','GG_OriginalScore'] | 75%
['GG_MovieDRAMA','DIR_Movie','OSC_TopNominated'] | 75%

Consequence ['OSCARS_Movie']:
['OSC_TopNominated','SAG_Cast'] | 100%
['PROD_Movie','GG_MovieDRAMA','DIR_Movie','CCA_Director'] | 100%
['PROD_Movie','OSC_TopNominated','SAG_Cast'] | 100%
['CCA_Movie','CCA_Director','GG_Director','DIR_Movie'] | 100%
['OSC_TopNominated','DIR_Movie','SAG_Cast'] | 100%

Consequence ['OSCARS_OriginalScore']:
['PROD_Movie','GG_MovieDRAMA','DIR_Movie','OSC_TopNominated'] | 75%
['PROD_Movie','GG_MovieDRAMA','CCA_Movie','DIR_Movie','CCA_Director'] | 75%
['GG_MovieDRAMA','DIR_Movie','OSC_TopNominated'] | 75%
['GG_MovieDRAMA','GG_OriginalScore'] | 75%
['GG_OriginalScore','CCA_Director'] | 75%

Consequence ['OSCARS_OriginalScreenplay']:
['WRI_OriginalScreenplay','CCA_OriginalScreenplay'] | 100%
['WRI_OriginalScreenplay'] | 65%
['CCA_OriginalScreenplay'] | 56%


CONCLUSION

In this paper I presented a model to assign a confidence to each Oscar award, based on association analysis techniques. Out of the 21 awards considered, only 14 could be covered by a significant rule according to the thresholds set. Reducing the minimum support or the confidence level increases this number, but also yields worse results, as the data considered is less reliable. Nevertheless, the model presents some interesting findings. As expected, a combination of previous results provides more confidence than a single award, but the lack of influence of the Golden Globes is remarkable, as they are commonly assumed to be the main indicator of trends before the Oscars; only in the Supporting Actor category are the Golden Globes the top significant award for predicting an Oscar. The strongest single-award rules are:

FreqSet | Consequence | Confidence
['DIR_Movie'] | ['OSCARS_Movie'] | 80%
['DIR_Movie'] | ['OSCARS_Director'] | 80%
['SAG_ActorLeading'] | ['OSCARS_ActorLeading'] | 79%
['WRI_AdaptedScreenplay'] | ['OSCARS_AdaptedScreenplay'] | 75%
['SAG_ActressLeading'] | ['OSCARS_ActressLeading'] | 68%
['GG_ActorSupporting'] | ['OSCARS_ActorSupporting'] | 65%
['WRI_OriginalScreenplay'] | ['OSCARS_OriginalScreenplay'] | 65%
['CCA_Animated'] | ['OSCARS_Animated'] | 64%
['SAG_ActressSupporting'] | ['OSCARS_ActressSupporting'] | 63%

As Best Movie is always the last award given, we can also do the exercise of predicting that single award considering all historical data, including the Oscars given earlier the same night. 267 significant rules were found, 63 of them with 100% confidence; we can therefore state that at that moment of the night it would be really rare to have a surprise. From this list, the most frequent explanatory variables are the Oscar for Best Editing and the Directors Guild award for Best Director, appearing in over 30 of the 63 cases. Last, what better way to prove trust in the model than to put some money on it: if we had had this model last February and simply followed the available rules, we could have made approximately a 25% profit. Not a bad return for a single day's investment!

REFERENCES

http://www.imdb.com/oscars/
http://aimotion.blogspot.se/2013/01/machine-learning-and-data-mining.html
C.D. Manning, P. Raghavan, H. Schütze, "Introduction to Information Retrieval", Cambridge University Press, 2008.
http://www.oddsshark.com/entertainment/academy-awards-2013-oscar-odds

