
Motivation – Cultural Stereotypes Data Ratings Positivity of Terms Positivity of Reviews Conclusion

What is the Meaning of 5 *’s? An Investigation of the Expression and Rating of Sentiment

Daniel Hardt, Julie Wulff

Copenhagen Business School

May 2013


Outline

1 Motivation – Cultural Stereotypes
2 Data
3 Ratings
4 Positivity of Terms
5 Positivity of Reviews
6 Conclusion


Expressing Sentiment – Scandinavian Style

The talk was OK

excellent
best of the semester
impressive
...


Expressing Sentiment – American Style

your ideas are wonderful

not bad
possibly worth pursuing
could be interesting
...


Investigation

Stereotype: For a given sentiment, a Scandinavian uses a less positive expression than an American.

Is it possible to find evidence for (or against) this stereotype? This question is investigated by analyzing rated Danish and U.S. film and restaurant reviews.


Investigation

Distributional differences in the data?

1 Ratings: Are there relatively fewer high ratings?
2 Text: Are there relatively fewer highly positive terms?
3 Ratings vs. Text: Are there fewer high ratings for texts of a given positivity?


Danish Film Reviews

Reviews collected from a Danish movie website, www.scope.dk
Film reviews rated 1-6 *’s
Downloaded November 2011
829 films
18,681 reviews
1,624,049 words


U.S. Film Reviews

Reviews collected from The Internet Movie Database, www.imdb.com
Film reviews rated 1-10 *’s
Downloaded January 2012
678 films
143,620 reviews
34,599,486 words
Only reviews for films that were also reviewed in the Scope data were selected


Danish Restaurant Reviews

Reviews collected from the review site Yelp, www.yelp.dk
Restaurant reviews rated 1-5 *’s
Downloaded January 2013
Reviews for restaurants in Copenhagen were selected
3,851 reviews
581,713 words


U.S. Restaurant Reviews

Reviews collected from the review site Yelp, www.yelp.com
Restaurant reviews rated 1-5 *’s
Downloaded August 2012
Reviews for restaurants in Philadelphia were selected
109,129 reviews
384,689,609 words


Distribution of ratings: Reviews per category

U.S. Film data: Category 10 (out of 10) largest
Danish Film data: Category 4 (out of 6) largest

Figure: Two bar charts of Number of reviews per Category, one for the U.S. Film data (categories 1-10) and one for the Danish Film data (categories 1-6).



Distribution of ratings: Reviews per category

U.S. Restaurant data: Category 4 and 5 (out of 5) largest
Danish Restaurant data: Category 3 and 4 (out of 5) largest

Figure: Two bar charts of Number of reviews per Category (categories 1-5), one for the U.S. Yelp data and one for the Danish Yelp data.



Terms and categories

Terms: Short sequences of 1-3 words (unigram, bigram, trigram)
Collected for all four datasets, with some domain-specific restrictions to avoid bias from the unbalanced rating distribution
A term has to occur in the data for more than one film or restaurant

Categories: The three different rating scales are normalized to one
The rating values are assumed to be continuous and rescaled to run from -0.5 to 0.5
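
The two steps on this slide can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names (extract_terms, filter_terms, rescale_rating), the pre-tokenized input, and the linear mapping of the lowest rating to -0.5 and the highest to 0.5 are assumptions. The slides do not spell out the domain-specific restrictions beyond the rule that a term must occur for more than one film or restaurant, so only that rule is shown.

```python
from collections import defaultdict


def extract_terms(tokens, max_n=3):
    """Return every word sequence of length 1..max_n from a token list."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def filter_terms(reviews, min_items=2):
    """Keep terms that occur in reviews of at least `min_items` distinct items.

    `reviews` is an iterable of (item_id, tokens) pairs, where item_id
    identifies the film or restaurant being reviewed (hypothetical layout).
    """
    items_per_term = defaultdict(set)
    for item_id, tokens in reviews:
        for term in extract_terms(tokens):
            items_per_term[term].add(item_id)
    return {t for t, items in items_per_term.items() if len(items) >= min_items}


def rescale_rating(rating, lowest, highest):
    """Map a rating linearly so that `lowest` -> -0.5 and `highest` -> 0.5.

    Assumed linear mapping; the slides only state the target range.
    """
    return (rating - lowest) / (highest - lowest) - 0.5
```

Under this assumed mapping, 1 *’s maps to -0.5 on every scale, while the top rating (6 on Scope, 10 on IMDb, 5 on Yelp) maps to 0.5, which is what makes the three scales comparable.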


Measuring Term “Positivity”

A very positive term appears very frequently in the highest categories and very rarely in the lowest categories
The frequency of a term per category is measured and used to find a value reflecting the positivity of the term
Based on the frequency data, expected category calculations are used as positivity measurements


Expected Category (EC)

The expected category is a weighted average of the normalized frequencies and provides the best guess of a category for a given term
The measure delivers a sentiment scale with continuous values for the data
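
One way to compute such an expected category is sketched below. It assumes that "normalized frequencies" means the term's count in each category divided by the total term count of that category, renormalized to sum to one across categories. The slides do not give the exact formula, so the function expected_category and its two arguments are illustrative only.

```python
def expected_category(term_counts, category_totals):
    """Expected category of a single term (assumed formulation).

    term_counts[c]     -- occurrences of the term in reviews of category c
    category_totals[c] -- total number of term occurrences in category c
    Category keys are the rescaled rating values in [-0.5, 0.5].
    """
    # Relative frequency per category; dividing by the category total
    # compensates for categories that simply contain more text.
    rel = {
        c: term_counts.get(c, 0) / total
        for c, total in category_totals.items()
    }
    norm = sum(rel.values())
    if norm == 0:
        raise ValueError("term does not occur in any category")
    # Weighted average of the category values, weighted by the
    # normalized relative frequencies.
    return sum(c * f / norm for c, f in rel.items())
```

A term that occurs only in the highest category gets the value 0.5, and a term that is equally frequent (relative to category size) at both extremes gets 0, which matches the range of the values shown in the term tables.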


U.S. data: most negative terms

Table: 25 most negative terms (IMDb)

Negativity: -0.469524750825392 -0.467175327464537 -0.465672946715069 -0.46519234965676 -0.458765600043119 -0.458337140445303 -0.458164409069076 -0.457194555033583 -0.456786654079481 -0.456686440739873 -0.456664807460289 -0.454396895268457 -0.453161483368971 -0.453161483368971 -0.450235331734901 -0.450235331734901 -0.450235331734901 -0.450183973779262 -0.449657123831635 -0.449531688381114 -0.449165025128125 -0.448774862116633 -0.448056977475604 -0.446696672003026

Term: awful movie this the worst piece 10 worst absolutely no redeeming 1 of 10 horrible piece of horrible piece describe how bad worst piece of awful movie ! worst piece crap !!! <s> awful ! avoid ! </s> this garbage , piece of dreck money back after ever walked out this worthless this laughable <s> !!! </s> money back ! 0 stars ... the worst



U.S. data: most positive terms

Table: 25 most positive terms (IMDb)

Positivity: 0.494353290061915 0.493275677450123 0.48397433114752 0.475429294357278 0.474277213937623 0.472867926252183 0.47206559901472 0.469294782093069 0.46791864438272 0.467887873542822 0.466895402323092 0.466813826094778 0.464727383846514 0.464040786449856 0.463489633194074 0.463392598375243 0.463392598375243 0.463392598375243 0.462852295765803 0.462852295765803 0.462671036583064 0.462135374863534 0.46076665585606 0.46061432793814

Term: ! 10 ! 10 / . 10 out masterpiece !!! <s> perfection a ++ is my absolute superb script than his father 2nd favorite 11 out of 11 out has changed my ... 10 / changed my then i strongly 5 best movies love , death this incredible movie than a 10 yes yes yes loved everything about moves me brings tears to , great writing



Danish data: most negative terms

Table: 25 most negative terms (Scope)

Negativity: -0.476804770742346 -0.476650569326183 -0.475533663540781 -0.474578452519502 -0.467763812112007 -0.465244027806669 -0.460441034641355 -0.460441034641355 -0.460093822212954 -0.459693407317838 -0.459163290061738 -0.456941914193066 -0.456558184652091 -0.456433744513287 -0.453465199574301 -0.452874260625158 -0.450820743699781 -0.450728805815475 -0.449352829956901 -0.449319563149709 -0.449086621357668 -0.449086621357668 -0.446722141981055 -0.445016601147398

Term: dårligste film jeg (worst movie I) , plat (, lame) lorte film (shitty movie) dårlig en (bad one) ret elendig (pretty poor) ringeste (the worst) ikke er værd (not worth) ringe , at (poor , to) dine penge (your money) at spilde (to waste) gang lort (some crap) makværk . </s> (mess) makværk . (mess) noget bras (some junk) ligegyldig film (meaningless film ) , dårligt (bad) dragen (dragon) makværk de dårligste film (worst film) talentløs (talentless) dum film (stupid movie) intet fungerer (nothing works) en <s> spild (a waste) min tid (my time)



Danish data: most positive terms

Table: 25 most positive terms (Scope)

Positivity: 0.484165554298609 0.478172084131965 0.476915060831457 0.473845784730026 0.473280978584772 0.465729836191364 0.463060344605433 0.461324301788628 0.460915590776126 0.460873246723046 0.460873246723046 0.46039438514752 0.460061678804873 0.458909257039041 0.458909257039041 0.456749020482633 0.45590688994634 0.454846351795861 0.454846351795861 0.452168357378795 0.451588098388318 0.45077477870325 0.45077477870325 0.450310678130041

Term: elsk (love) verdens bedste film (the world's best movie) ret den bedste! (the best!) go og (go and) ret mesterværk (masterpiece) mesterværk anonym (masterpiece anonymous) ret kanon (pretty great) bedste film (best film) bedste tegnefilm (best cartoon) får 6 (get 6) elsker bare (just love) ses ! </s> (watch) ses ! (watch) kan se igen (can see again) go ! </s> (good) største film (biggest film) så smuk (so beautiful) skal se ! (must see) får 6 stjerner (get 6 stars) bedste film jeg (best film I) utrolig rørende (unbelievably touching) jeg elsker bare (I simply love) den er super (it is super) eneste film (only film)



Positivity of text

The positivity of a text is calculated as the average of the positivity values of the terms in the text. From these values, a distribution of positivity over categories can be obtained.
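
The averaging described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names text_positivity and positivity_by_category are hypothetical, terms missing from the positivity lexicon are skipped, and the "distribution of positivity over categories" is taken to be the mean text positivity per rating category.

```python
from collections import defaultdict


def text_positivity(terms, term_positivity):
    """Average positivity of the terms of one text.

    `term_positivity` maps a term to its positivity value (for example
    its expected category). Terms without a value are ignored (assumption).
    Returns None if no term of the text has a value.
    """
    scores = [term_positivity[t] for t in terms if t in term_positivity]
    if not scores:
        return None
    return sum(scores) / len(scores)


def positivity_by_category(reviews, term_positivity):
    """Mean text positivity per rating category.

    `reviews` is an iterable of (category, terms) pairs, where category
    is the rescaled rating of the review.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, terms in reviews:
        score = text_positivity(terms, term_positivity)
        if score is None:
            continue  # review contains no scored term
        sums[category] += score
        counts[category] += 1
    return {c: sums[c] / counts[c] for c in sorted(sums)}
```

Plotting the returned values against the category, once per dataset, would give curves of the kind compared on the following slides.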


Distribution of positivity over categories

Figure: Two panels, "Positivity of movie reviews" (movie data) and "Positivity of restaurant reviews" (restaurant data). Each panel plots Positivity against Category (-0.5 to 0.5) for the U.S. data and the Danish data.



Conclusion

Distributional differences - evidence for stereotypes?

Ratings: U.S. reviewers are more likely than Danish reviewers to give a high rating. Highly positive reviews are over-represented in the U.S. data compared to the Danish data.

Text: Compared to U.S. reviewers, Danish reviewers use relatively many positive and negative terms when justifying why a film has earned the highest or lowest possible rating. The relation between the distribution of positivity and the distribution of reviews differs between the two datasets.


Conclusion

Ratings vs. Text: The mismatch between high ratings and text across the two datasets indicates that Danes and Americans use rating systems differently. The relation of text positivity to rating differs between the two datasets. The slope of positivity for highly positive and negative reviews is steeper in the Danish data than in the U.S. data.


Considerations

Systematic differences in rating systems due to cultural differences? The old Danish grading system.

Danes are less enthusiastic about the films they see? The U.S. film industry influence.


Future work

Logistic regression
Analyzing user-generated reviews from other domains and countries
Automatic sentiment analysis to reduce the systematic mismatch between text and rating
?
