Magazine recommendations based on social media trends
Steffen Karlsson
Kongens Lyngby 2014 B.Eng-2014
Technical University of Denmark, Department of Applied Mathematics and Computer Science, Matematiktorvet, building 303B, 2800 Kongens Lyngby, Denmark. Phone +45 4525 3351, compute@compute.dtu.dk, www.compute.dtu.dk
Summary (English)
Issuu uses a recommendation engine for predicting what a certain reader will enjoy. It is based on collaborative filtering, such as the reading history of other similar users, and content-based filtering, reflected in the document’s topics etc. So far all of those parameters have been completely isolated from any external (non-Issuu) sources, causing the Matthew Effect. This project, done in collaboration with Issuu, is the first attempt to solve the problem by investigating how to extract trends from social media and incorporate them to improve Issuu’s magazine recommendations. Popular social media networks have been investigated and evaluated, resulting in Twitter being chosen as the data source. A framework for spotting trends in the data has been implemented. To map trends to Issuu, two approaches have been used: the Latent Dirichlet Allocation model and the Apache Solr search engine.
Summary (Danish)
Issuu uses a recommendation system to predict what a given reader will enjoy. It is based on collaborative filtering, such as the reading history of similar users. In addition it is based on content-based filtering, reflected in the document’s topics etc. So far all of these parameters have been completely isolated from external (non-Issuu) sources. This project was carried out in collaboration with Issuu and is the first attempt to solve the problem. This is done by investigating how trends can be extracted from social media and integrated in order to improve Issuu’s magazine recommendations. Popular social media networks have been examined and evaluated, resulting in Twitter being chosen as the data source. A system for spotting trends in the data has been implemented. Two different methods have been used to integrate the trends at Issuu: the Latent Dirichlet Allocation model and the Apache Solr search engine.
Preface
This thesis was prepared at the Department of Applied Mathematics and Computer Science at the Technical University of Denmark (DTU) in fulfillment of the requirements for acquiring a B.Eng. in IT. The work was carried out in the period September 2013 to January 2014. I would like to thank my supervisor Ole Winther from DTU, my external supervisor Andrius Butkus, and Issuu for spending time and resources on having me around.
Lyngby, 10-January-2014
Steffen Karlsson
Contents

Summary (English)
Summary (Danish)
Preface
1 Introduction
  1.1 Problem definition
  1.2 Social media
  1.3 What is a trend?
  1.4 Related work
  1.5 Methodology
  1.6 Expected results
  1.7 Outline
2 Mining Twitter
  2.1 Twitter API
  2.2 Tweet’s location problem
3 Trending framework
  3.1 Raw data
  3.2 Normalizing data
  3.3 Detecting trends
  3.4 Recurring trends
  3.5 Trend score
  3.6 Aggregating trends
4 From trends to magazines
  4.1 LDA
  4.2 Using LDA
    4.2.1 Results using LDA
  4.3 Solr
  4.4 Using Solr
    4.4.1 Results using Solr
5 Conclusion
  5.1 Improvements of the trending framework
  5.2 Improvements of the LDA model
  5.3 LDA vs. Solr
A Dataset statistics
  A.1 Location
  A.2 Hashtag
B Example: #bostonstrong
C Implementation details
  C.1 Flask
  C.2 Peewee
  C.3 Database
  C.4 MySQL
Bibliography
List of Figures

1.1 Typical patterns for slow and fast trends.
1.2 Project flowchart
2.1 Mining Twitter flowchart
2.2 Visualization of the problem with the location
2.3 Visualization of the solution to the location problem
3.1 Total tweets per hour
3.2 Raw tweet count for hashtags
3.3 Weighted tweet count per hour
3.4 Normalized hashtags
3.5 Example sizes of w and r
3.6 w - r, where r = 2 hours.
3.7 w - r, where r = 24 hours.
3.8 Displaying use of threshold in the trending framework.
3.9 E/R Diagram v2
4.1 Plate notation of the LDA model [Ble09]
4.2 LDA topic simplex, with three topics
4.3 Representation of topic distribution using dummy data
4.4 #apple tag cloud
4.5 From trend to magazines flowchart
4.6 Topic distribution for #apple tweets
4.7 Subset of the similar #apple documents using LDA
4.8 Example of tokenizing and stemming
4.9 Subset of the similar #apple documents using Solr
5.1 Supported languages by Issuu
5.2 Translation module to improve the solution.
5.3 Three simultaneously running trending frameworks.
5.4 Top words in the topics.
5.5 LDA per page solution.
A.1 Top and bottom 10 of used locations
A.2 Top and bottom 10 of used hashtags
B.1 Total tweets per hour
B.2 Raw tweet count for hashtags
B.3 Fully processed data
B.4 #bostonstrong tag cloud
B.5 Subset of #bostonstrong LDA documents
B.6 Subset of #bostonstrong Solr documents
C.1 E/R Diagram
Chapter 1
Introduction
Issuu (www.issuu.com) is a leading online publishing platform with more than 15 million publications, a pool that keeps growing by more than 20 thousand new ones each day. The main challenge for the reader then becomes the navigation and discovery of interesting content among the vast number of documents. To solve the problem Issuu uses a recommendation engine for predicting what a certain reader might enjoy. Currently a whole range of parameters are part of Issuu’s recommendation algorithm: the reader’s location and language preferences (context), the reading history of other similar users (collaborative filtering [RIS+94], [SM95]), the document’s topics (content-based filtering [Sal89]) and the document’s overall popularity. There are also editorial and promoted documents. So far all of those parameters have been completely isolated from any external (non-Issuu) sources.
The main problem is that the same magazines constantly get recommended again and again. This highlights the shortcomings of collaborative filtering rather than of the reading habits of Issuu users. Issuu does not allow readers to rate magazines, so the read time is used instead. Naturally, popular magazines gather their read-times very quickly and are then hard to beat by the newly uploaded ones. They get recommended more and thereby only become stronger, a phenomenon known as the Matthew Effect [Jac88]. Incorporating local trends (what is happening around the reader) into the recommendations would address this problem and add a bit more freshness and serendipity.
1.1 Problem definition
How to extract trends from social media and incorporate them to improve Issuu’s magazine recommendations.
1.2 Social media
In this project, social media is the data source from which trends can be extracted. There are many social media platforms that could be used as the data source for this project. Their suitability was evaluated based on these parameters:

Data - Defines the format of the data, the amount of data that is available and how semantically rich it is. This is the most important parameter, since it is all about the quality of the data and will directly impact the ability to extract trends. Text is the preferred data format here. The more data, the better, since more data adds stability to the resulting trends. Semantic richness is about how much meaning can be extracted from the data.
We should not expect any highly organized and semantically rich taxonomies, since Twitter is a crowd-driven social medium rather than an editorially curated and organized one. In social networks we normally see data being organized as folksonomies (a term coined by Thomas Vander Wal, combining the words folk and taxonomy), where "multiple users tag particular content with a variety of terms from a variety of vocabularies, thus creating a greater amount of metadata for that content" [Wal05]. Semantic richness in folksonomies comes from multiple users tagging the data with the same labels, which shows that they agree on what it is about. A folksonomy can be narrow or broad. In narrow ones only the creator of the content is allowed to label it with tags, while in broad ones multiple users can label a piece of content. Broad folksonomies are more stable and informative given that there are enough users to label things, and are the preferred type in this project.

Real-time - Defines the time from something important happening in the world until it appears on the particular social media network. An API supporting real-time streaming of data is naturally preferable, but a small delay is also acceptable.

Accessibility - Defines whether there are any restrictions throughout the API that limit the accessibility of data.
The most popular social networks were evaluated based on these three parameters. The best fit turned out to be Twitter (www.twitter.com); see Table 1.1. Facebook (www.facebook.com) and Google+ (plus.google.com) scored well on data and real-time, but had to be ruled out due to their limited API access and strict privacy settings: "The largest study ever conducted on Facebook on privacy showed that in June 2011 around 53% of the profiles were private, which was an increase of 17% over 15 months." [Sag12]
Data
  Positive: Average of 58 million tweets each day. Over 85% of topics are headline or persistent news in nature [KLPM10].
  Negative: Length of the tweet. Lack of reliability in the location precision of the tweets.
Real-time
  Positive: Pseudo real-time location-based streaming service.
  Negative: 2 hours behind.
Accessibility
  Positive: API easily accessible and usable.
  Negative: Unpaid plan limited to a 1% representative subset of data.

Table 1.1: Evaluation of Twitter’s suitability for the project.
LinkedIn (www.linkedin.com) was discarded because of the nature of its data: industry and career oriented. Instagram (www.instagram.com), Pinterest (www.pinterest.com) and Flickr (www.flickr.com) are all big and interesting, but the data they provide is mostly images and thus hard to interpret; their data is also not that close to trending news. The same goes for YouTube (www.youtube.com) and Vine (vine.twitter.com).

It is worth mentioning that trends can be spotted on Issuu as well. One of the problems is that they would have a huge delay, since Issuu users are not as active as users of other social networks. Also, on Issuu trends would have to be inferred from what people read instead of what they are posting or commenting on.
1.3 What is a trend?
Trend can be understood in many different ways depending on the context: stock market, fashion, music, news, etc. The dictionary defines a trend as "a general direction in which something is developing or changing".
In this project trends will be considered a bit differently. Basically, Issuu is interested in knowing which topic or event is currently hot in which country (or an even smaller area) and recommending magazines similar to it. On Twitter trends can be spotted by looking at the hashtags, so in this project trending hashtags and trends will be considered the same thing. A trend is taken as a “hashtag-driven topic that is immediately popular at a particular time" (www.hashtags.org/platforms/twitter/what-do-twitter-trends-mean/). Trends vary in terms of how unexpected they are. Seasonal holidays like Christmas or Halloween are trends, but very expected ones. On the other hand, Schumacher’s skiing accident is a very unexpected one; both types are equally interesting and valuable for Issuu. Another parameter is how quickly the trend is rising. We can have slow or fast trends (see Figure 1.1); the priority is spotting trends that rise fast. Trend and popularity are not the same thing. If something becomes popular all of a sudden, it is a trend. But if it keeps being popular, it is not a trend anymore.
1.4 Related work
Extracting trends from Twitter is nothing new. The two widely used approaches are parametric and non-parametric. The most popular one is the parametric approach, where a trending hashtag is detected by observing its deviation from some baseline [IHS06], [BNG11], [CDCS10], using a sliding window. It is the simplest of approaches and still quite successful, based on the assumption that different trends will behave similarly to one another. It is known that this is not the case in the real world: there are many types of trends, with all kinds of patterns.

Figure 1.1: Typical patterns for slow and fast trends.

To address that problem, non-parametric methods have been used as well [Nik12]. In those the parameters are not set in advance, but are learned from the data instead. Many patterns were observed and grouped into the ones that became trends and the ones that did not. New hashtag patterns can then be compared to the observed ones using the Euclidean distance, and the similarity can be used to determine whether a hashtag is trending or not.

The requirements for spotting trends at Issuu are not that strict: there is no need to capture all the trends from a certain day, just the most significant ones. This makes things simpler, which is why the parametric model was chosen for this project. It is the first time Issuu is doing a project like this, so the idea was to try the simpler things first to see if they work. If not, the heavier non-parametric models could be applied.
1.5 Methodology
Figure 1.2 illustrates the methodology of the project: data is collected and fed through the trending framework, producing trends that are mapped to Issuu documents.

Figure 1.2: Project flowchart
It is important to note early how this trend data will be used by Issuu, because it sets requirements on other parts of the project. Issuu is using Latent Dirichlet Allocation (LDA) [BNJ03] to extract topics from its documents, using the Gensim implementation [ŘS10]. Using the Jensen-Shannon distance (JSD) it is possible to compare documents to one another through their LDA topic distributions. This allows Issuu to find documents similar to the one that is being read, for example. If we can capture trends from social media and express them as text (a "virtual document"), we can calculate LDA for the trend (one text file per trend) and use JSD to find similar documents. Issuu is also using the Apache Solr search engine (lucene.apache.org/solr/), which takes text as input and can give similar documents as output. This is another approach, and it will be investigated whether it may be used as an alternative or complement to LDA. With all that in mind, the plan is this:
• Access the Twitter API (dev.twitter.com) and retrieve tweets from a given country at a given time, storing them in a database.
• Calculate trends from the tweets; the output of this step is a list of trending hashtags per given time window.
• Find out how to feed those trends into both the LDA topic model and the Solr search engine.
• Get documents as the final result and evaluate them.
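The LDA-based comparison in the plan relies on the Jensen-Shannon distance between topic distributions. A minimal pure-Python sketch (not Issuu's implementation; the topic distributions below are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions (base 2)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence: the average KL
    divergence of p and q to their midpoint distribution m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(jsd)

# Hypothetical distributions over four LDA topics:
trend_topics = [0.70, 0.20, 0.05, 0.05]
doc_a = [0.65, 0.25, 0.05, 0.05]   # similar to the trend
doc_b = [0.05, 0.05, 0.20, 0.70]   # dissimilar

# The most similar document is the one with the smallest distance.
assert jensen_shannon_distance(trend_topics, doc_a) < jensen_shannon_distance(trend_topics, doc_b)
```

Ranking all Issuu documents by this distance to a trend's "virtual document" would yield the recommendation candidates.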
1.6 Expected results
• Analysis of potential resources for mining data from social media networks, to be used at Issuu as a basis for recommendations.
• Data mining algorithms (Python, www.python.org) to retrieve all the necessary data.
• An algorithm for extracting trends from tweets.
• A method of feeding trends into the LDA model and Solr.
• Evaluation of the results and final recommendations on the end-to-end solution for incorporating social media data into Issuu’s recommendation engine.
1.7 Outline
Chapter 2 explains how to retrieve tweets from the Twitter API service; these tweets are processed and analyzed in Chapter 3. In Chapter 4 the trends are fed into the LDA model and the Solr search engine, resulting in similar documents. The final recommendations on the end-to-end solution are evaluated in Chapter 5.
Chapter 2
Mining Twitter
This chapter is about retrieving tweets from Twitter and storing them for trend extraction later. The USA was chosen as the country for this project for several reasons. First of all, most Issuu readers are from the USA. Secondly, more than half of Twitter users are from the USA too [Bee12]. Finally, having tweets in English makes things simpler, because Issuu’s LDA model was trained on the English Wikipedia, and sticking to English tweets means that no translation is needed.
Figure 2.1: Mining Twitter flowchart. All tweets in the world pass through a location filter [USA]; tweets in the USA are stored in a database; trend-related tweets are fed to Issuu’s LDA, and the resulting topic distribution is used to find similar documents.
The data on Twitter are 140-character-long messages called tweets. Often they contain some additional meta-data:

# - Groups tweets together by type or topic, known as a hashtag. Example: "Wow, Mac OS X Mavericks is free and will be available for machines going back as far as 2007? #Apple #Keynote"

@ - Used to reference, mention or reply to another user. Example: "@alastormspotter: iOS7 will release at around noon central time on Wednesday."

RT - Symbolizes a retweet (reposting an existing tweet from another user). Example: "RT @ThomasCDec: 50 days to #ElectionDay"

Table 2.1: Additional meta-data used in tweets.
2.1 Twitter API
The Twitter API provides two different calls which may be suitable for this purpose:

GET search/tweets: Part of the ordinary REST API, i.e. with a rate limit of 450 requests per 15 minutes and a continuation URL, which means that there is a finite number of tweets per request before the next chunk has to be requested.

POST statuses/filter: Part of the streaming API, as mentioned in Table 1.1, which for the unpaid plan has the limitation of only providing a 1% representative subset of the full dataset.
Twitter uses a three-step heuristic to determine whether a given tweet falls within the specified location, defined as a bounding box (two pairs of longitude and latitude coordinates: the south-west (SW) and north-east (NE) corners of a rectangle):

1. If the tweet is geo-location tagged, this location is used for comparison with the bounding box.
2. A user on Twitter can specify a location in the account settings, which the API refers to as place; this is used for comparison if the tweet is not geo-tagged.
3. If neither of the rules listed above matches, the tweet is ignored by the streaming API.

The streaming API was chosen because it takes all three heuristics into account, whereas the search API only includes the second. Additionally, with the search API it is difficult to know how frequently to execute the call in order to stay up-to-date, due to the rate limitations.
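As an illustration, the heuristic can be sketched as a small pure function. The field names and the point-based place are simplifications (in the real API payload, place is a region and the structure is richer):

```python
def in_bbox(point, bbox):
    """Check whether a (lon, lat) point lies inside a bounding box
    given as ((sw_lon, sw_lat), (ne_lon, ne_lat))."""
    (sw_lon, sw_lat), (ne_lon, ne_lat) = bbox
    lon, lat = point
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat

def tweet_matches_location(tweet, bbox):
    """Apply the three-step location heuristic to a simplified tweet dict."""
    if tweet.get("coordinates") is not None:   # 1. geo-tagged tweet
        return in_bbox(tweet["coordinates"], bbox)
    if tweet.get("place") is not None:         # 2. user-declared place
        return in_bbox(tweet["place"], bbox)
    return False                               # 3. otherwise the tweet is ignored

# Rough, illustrative bounding box around the continental USA (SW and NE corners):
usa = ((-125.0, 24.0), (-66.0, 50.0))
assert tweet_matches_location({"coordinates": (-74.0, 40.7)}, usa)      # New York
assert not tweet_matches_location({"coordinates": (2.35, 48.85)}, usa)  # Paris
assert not tweet_matches_location({}, usa)                              # no location at all
```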
2.2 Tweet’s location problem
A couple of problems were spotted with the location accuracy:

1. Twitter’s API supports streaming by location, but only with coordinates given as SW and NE corners, defining each country by a rectangle. Figure 2.2 shows the tweets streamed from the USA (tweets that are actually from the USA have been filtered out, to provide a better overview).
2. Although the selected bounding box covers the USA and even more, tweets from Guatemala and Honduras are still present (see Figure 2.2).
Figure 2.2: Visualization of the problem with the Twitter service, where each red dot represents a tweet. Duration is 1 hour and the number of tweets with a wrong location is 7,240.
Figure 2.3: Visualization of the solution to the Twitter service problem, where each red dot represents a tweet. The duration is 1 hour and the number of tweets is 112,851, which means an error rate of approximately 7%.
Applying the location filter to the streaming API means that the bounding box needs to be known. GeoNames (www.geonames.org) solves the problem by providing the coordinates needed for all countries. The two problems spotted regarding the location accuracy turned out to have the same solution: Algorithm 1 checks whether the currently received tweet is from the desired country. The ones that are will be stored in the MySQL database for further analysis. Appendix C contains information about the database choice and implementation details, including the E/R diagram. A problem occurred: some of the tweets were missing the country code, which meant that they could not be processed. To solve this, OpenStreetMap’s reverse geocoding API (Nominatim, wiki.openstreetmap.org/wiki/Nominatim/) was used, which can convert longitude and latitude pairs to a country code.

Algorithm 1: Parse tweet from data
    if coordinates in data then
        if place not in data then
            country_code = reverse geocode coordinates
        if tweet.country_code is chosen country_code then
            # Parse the rest of the tweet
            add tweet to database
    else
        raise LocationNotAvailableException
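A Python rendering of Algorithm 1 might look as follows. This is a sketch rather than the thesis code: reverse_geocode and save_tweet are hypothetical stand-ins for the Nominatim call and the MySQL insert, and the dict layout is assumed for illustration.

```python
class LocationNotAvailableException(Exception):
    """Raised when a tweet carries no usable location information."""

def parse_tweet(data, chosen_country, reverse_geocode, save_tweet):
    """Store the tweet only if it originates from the chosen country.

    reverse_geocode: callable (lon, lat) -> ISO country code, used when the
        tweet has coordinates but no 'place' entry (and thus no country code).
    save_tweet: callable persisting the tweet (e.g. a MySQL insert).
    Returns True if the tweet was stored, False if it was from another country.
    """
    if "coordinates" not in data:
        raise LocationNotAvailableException(data)
    if "place" in data:
        country_code = data["place"]["country_code"]   # assumed payload layout
    else:
        country_code = reverse_geocode(*data["coordinates"])
    if country_code == chosen_country:
        save_tweet(data)  # parse the rest of the tweet and store it
        return True
    return False

# Usage with a stubbed geocoder standing in for Nominatim:
stored = []
fake_geocode = lambda lon, lat: "US" if lon < -60 else "DK"
assert parse_tweet({"coordinates": (-74.0, 40.7)}, "US", fake_geocode, stored.append)
assert not parse_tweet({"coordinates": (12.6, 55.7)}, "US", fake_geocode, stored.append)
assert len(stored) == 1
```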
For debugging purposes an interactive tweet-map, a graphical, interactive way of visualizing tweets, has been created (used for Figures 2.2 and 2.3). It is a JavaScript/HTML5 based website hosted locally in Python with the Flask module (flask.pocoo.org); implementation details are available in Section C.1.
2. www.geonames.org - licensed under a Creative Commons attribution license, giving free access to share (copy, distribute and transmit) and remix (adapt) the work, including for commercial use.
3. www.mysql.com
4. wiki.openstreetmap.org/wiki/Nominatim/
5. flask.pocoo.org
Chapter 3
Trending framework
In the previous chapter it was described how the tweets were collected, their location accuracy ensured, and how they were stored in the MySQL database. This chapter focuses on how to turn those tweets into trends. "Fast" trends were chosen for this project because they have the most impact compared to the "slow" trends. Eventually most trends will appear on Issuu, but with a huge delay, since Issuu users are not as active as users of other social networks. Therefore the challenge is to reduce this delay. To illustrate the idea, a time period of three days was chosen, knowing in advance that there were several trends in it, to test whether the algorithm can find them. On October 22nd Apple held its annual event where it presented the updated product line (new iPads, MacBooks and of course the new OS X Mavericks). This event was chosen as one of the examples to start with.
3.1 Raw data
First, we will take a look at the raw data from the database, along with a subset of 3 consecutive days which will be used as the example to describe the trending framework:

Type                         | Full dataset | Example
Duration (hours)             | 1,462        | 72
Tweets                       | 127,930,378  | 4,103,273
Hashtags                     | 25,502,269   | 770,453
Unique hashtags              | 3,180,466    | 206,850
Avg. tweets per day          | 2,099,858    | 1,367,757
Avg. length per tweet (char) | 56           | 54
Avg. words per tweet         | 9.5          | 9.1

Table 3.1: Facts about the dataset collected. More statistics about the dataset are presented in Appendix A.

The plot of total tweets per hour in Figure 3.1 (the x-axis represents the 3 days, i.e. 72 hours, where 0, 24 and 72 are midnight; this also applies to the other plots in this chapter) clearly shows that the frequency/fluctuation of tweets reflects the same day/night rhythm as humans, which was as expected.
Figure 3.1: Total tweets per hour
Hashtags are used to categorize/label a tweet with one word or phrase, and can be used to spot trends in the tweets. The full text of the tweets could also have been used; this option was tested and found to be generally too vague.
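Pulling hashtags out of a tweet's text is a simple pattern match. A sketch (the regex is a simplification for illustration, not Twitter's exact tokenization rules):

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")

def extract_hashtags(text):
    """Return the lowercased hashtags found in a tweet's text."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]

tweet = "Wow, Mac OS X Mavericks is free? #Apple #Keynote"
assert extract_hashtags(tweet) == ["apple", "keynote"]
```

Lowercasing folds variants like #Apple and #apple into one counter key, which is what the per-hashtag counts below operate on.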
Figure 3.2 shows the total amount of tweets for the chosen hashtags. As described before, Apple is one of them; the other two are the TV show "Pretty Little Liars", which aired the same day, and the hashtag "jobs", which is a way companies identify a job opening on Twitter. These three hashtags represent different kinds of trends: one-time events, weekly recurring and daily recurring, which will be described later in this chapter.

Figure 3.2: Raw tweet count for hashtags
Due to the quite big fluctuation in the total amount of tweets during a day, a weight function has been created with the purpose of reducing the importance of tweets posted during the night, expressed as a sigmoid function (en.wikipedia.org/wiki/Sigmoid_function); another mathematical function such as the hyperbolic tangent could also have been used:

wt = 1 / (1 + exp(-m × (|tweets ∈ t| - X)))    (3.1)

where t defines a time period and m is the slope of the curve, which can be interpreted as how "expensive" it is to have a tweet count below the preferred amount X (see Figure 3.3).
Figure 3.3: Weighted tweet count per hour
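Equation 3.1 translates directly into Python. The parameter values below are arbitrary illustrations, not the ones used in the thesis:

```python
import math

def weight(tweet_count, m=0.001, X=10000):
    """Sigmoid weight w_t (Equation 3.1): close to 1 when the time period
    has at least X tweets, approaching 0 for quiet (night) hours.
    m controls how steeply the penalty sets in below X."""
    return 1.0 / (1.0 + math.exp(-m * (tweet_count - X)))

assert weight(30000) > 0.99              # busy daytime hour: nearly full weight
assert weight(1000) < 0.01               # quiet night hour: heavily down-weighted
assert abs(weight(10000) - 0.5) < 1e-9   # exactly at X the weight is 0.5
```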
3.2 Normalizing data
To get reasonable results from data which varies in amount (in this case, the total amount of tweets per hour), it is highly recommended to normalize the data. In this project every hashtag will be normalized by the total amount of tweets in the time period t, resulting in a normalized value for each hashtag:

ft = |tweets ∋ the hashtag| / |tweets ∈ t|    (3.2)

Figure 3.4 shows the result of applying Equation 3.2 to the data in Figure 3.2. The hashtags seem to follow the same pattern, although there are some differences during the night.
Figure 3.4: Normalized hashtags
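Equation 3.2 is a plain ratio; a sketch with made-up counts showing why the normalization matters:

```python
def normalize(hashtag_count, total_tweets):
    """f_t (Equation 3.2): fraction of the period's tweets containing the hashtag."""
    return hashtag_count / total_tweets if total_tweets else 0.0

# With hypothetical counts: a night-time burst of 90 tweets out of 2,000
# is relatively larger than 300 tweets out of 30,000 during the day,
# even though its raw count is smaller.
assert normalize(90, 2000) > normalize(300, 30000)
```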
3.3 Detecting trends
To detect a trend, it is important to know how the hashtag behaved in the previous time window (the reference window r) before the current time window w. The sizes of w and r are parameters in the framework and are tunable. Figure 3.5 shows an example of the two windows, where the reference window is 2 hours and the current window 1 hour.
Figure 3.5: Example sizes of w and r.
To know whether a term is a trend, the normalized reference window (ft_ref) is subtracted from the current window, to find out whether the interest has increased:

ft_ref = Σ |tweets ∋ the hashtag| / Σ |tweets ∈ t|,  summing over the |r| reference windows    (3.3)

where r is the list of reference windows. The outcome of this step can be seen in Figure 3.6. It clearly shows that this has a huge impact on the "jobs" hashtag, whose influence has dropped.
Figure 3.6: w - r, where r = 2 hours.

3.4 Recurring trends
The sizes of w (current window) and r (reference window) create problems regarding interfering recurring trends. Hashtags like "jobs" turn out as a trend each day, despite the fact that they follow the same pattern each day (see Figure 3.2). These types can be daily, weekly or yearly recurring, defined as:

Day - An example of a daily recurring trend is the hashtag "jobs". It recurs each day, though not necessarily at the same time, and tests show that the amount tends to be a bit lower during the weekend.

Week - TV shows are a great example of trends which recur each week on the same day and time as long as they are shown. Another type of weekly recurring trend is the natural difference between weekdays with work and weekends.

Year - New Year's Eve, Christmas or Halloween are all examples of yearly recurring trends.
In this project the focus is on getting rid of the daily recurring trends. The issue is solved by subtracting the maximal value of the hashtag from the day before the time period t:

max(fi), i ∈ {t − 24; t}    (3.4)

This makes the framework sensitive to outliers. But in this project outliers are in fact what is being looked for: trends. Subtracting the maximum does not mean that a hashtag cannot be trending two days in a row; the amount of tweets containing it just needs to rise.

The weekly and yearly recurring trends were not implemented, but the principle is the same.
3.5 Trend score
Combining Equations 3.1, 3.2, 3.3 and 3.4 gives the complete Equation 3.5 for calculating the trend_score of a hashtag at a given time t:

trend_score = (ft − ft_ref − max(fi)) × wt,  i ∈ {t − 24; t}    (3.5)

Figure 3.7 shows the final result of the trending framework after applying Equation 3.5 to the data:
Figure 3.7: w - r, where r = 24 hours. The x-axis represents time (as in the other plots) and the y-axis represents the trend_score.
12
24
36
48
60
72
In this plot the importance of the hashtag "jobs" is reduced to a point where it is insignificant (a trend score below 0), and the other two turn out to be trending, which is exactly what we want.
Sometimes the framework produces more trends than needed by Issuu. Because of this, a threshold has been set as a limit: if and only if the trend_score is above the threshold will the hashtag be accepted as a trend. The threshold is a tunable parameter.
Figure 3.8: Trend scores for all trends in the time period of the 22nd of October. The red line displays the chosen threshold.
Figure 3.8 is a visual representation of the trending scores of the Apple event day, the 22nd of October. The red line represents the threshold.
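The scoring and thresholding steps above can be sketched in Python. This is a minimal illustration of Equation 3.5 only: the reference value f_ref and the weight w_t come from Equations 3.1-3.3, which fall outside this chapter excerpt, so they are taken here as given inputs rather than computed.

```python
def trend_score(counts, t, f_ref, w_t=1.0):
    # counts[t] is the number of tweets containing the hashtag at
    # hour t (f_t); f_ref is the reference-window value and w_t the
    # weight factor (both assumed given, see Equations 3.1-3.3).
    daily_max = max(counts[max(0, t - 24):t])  # Equation 3.4
    return (counts[t] - f_ref - daily_max) * w_t

def is_trend(score, threshold):
    # A hashtag is accepted as a trend iff its score exceeds the
    # tunable threshold.
    return score > threshold
```

Subtracting the previous day's maximum suppresses hashtags like "jobs" that spike at the same level every day, while a genuinely rising hashtag still scores above the threshold.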
3.6
Aggregating trends
At Issuu it is not very likely that users come back each hour, but more likely once a day; therefore it would be unnecessary to recommend new documents each hour. A solution where it is possible to aggregate trends over longer, tunable periods would be preferable. The most optimal solution, in terms of computation and extensibility, would be to extend the existing database (explained in Section C.3) to make it able to store trends and references to the corresponding tweets.
Two new tables, trend and tweet_trend_relation, were added to the database, containing the trend and the time when it was trending (see Figure 3.9).
Figure 3.9: E/R Diagram v2
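The two tables could be created with SQL along these lines. The column names are assumptions for illustration only (the E/R diagram does not survive extraction), and SQLite stands in for the MySQL server used in the project:

```python
import sqlite3

# Hypothetical schema for the two new tables in Figure 3.9.
schema = """
CREATE TABLE trend (
    id         INTEGER PRIMARY KEY,
    hashtag    TEXT NOT NULL,
    trend_time TIMESTAMP NOT NULL       -- when the hashtag was trending
);
CREATE TABLE tweet_trend_relation (
    tweet_id INTEGER NOT NULL REFERENCES tweet(id),
    trend_id INTEGER NOT NULL REFERENCES trend(id),
    PRIMARY KEY (tweet_id, trend_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```

The relation table lets a trend point back at the tweets that produced it, which is what the later LDA step needs.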
Chapter
4
From trends to magazines
This chapter is all about mapping the computed trends to Issuu, which can then be presented as similar magazines/documents to the users. Two different approaches will be investigated in order to solve this problem: the Latent Dirichlet Allocation (LDA) model and the Apache Solr search engine.
4.1
LDA
Latent Dirichlet Allocation (LDA) is a generative probabilistic model that allows one to automatically discover the topics in a document. A topic is defined by a probability distribution over the words in a fixed vocabulary, which means that each topic contains a probability for each word. LDA can be expressed as a graphical model, known as the plate notation (Figure 4.1).
[Plate diagram: α (Dirichlet parameter) → θ_d (per-document topic proportion) → Z_{d,n} (per-word topic assignment) → W_{d,n} (observed word) ← β (topic hyperparameter); plates over the N words and the D documents.]
Figure 4.1: Plate notation of the LDA model [Ble09]
where,
Variable - Definition
D - The number of documents.
N - Total number of words in all documents.
W - The observed word n for the document d.
Z - Assigns the topic for the n'th word in the d'th document.
α - K-dimensional per-document topic distribution vector, where K is the number of topics.
β - Y-dimensional per-topic word distribution vector, where Y defines the number of words in the corpus.
θ - Topic proportions for the d'th document.
Table 4.1: Definition of LDA model parameters
Figure 4.2 is an example of a visual representation of an LDA space with three topics. A given document x has a probability of belonging to each topic, and these probabilities all sum up to 1. The corners of the simplex correspond to probability 1 for the given topic.
Figure 4.2: LDA topic simplex, with three dummy topics.

The topic distribution for a document can be visualized using a bar-plot. This describes which topics are present in the document and, by that, its underlying hidden (latent) structure:
Figure 4.3: Representation of a topic distribution using dummy data, where the x-axis represents the x topics and the y-axis represents the probability of belonging to the given topic.
At Issuu, the LDA model is trained on 4.5 million English Wikipedia1 articles. LDA makes the assumption that all words in the same article are somehow related. Every article is unique in the sense that it has a unique distribution of words. 1
www.wikipedia.org
This could be interpreted as a unique topic for each article, resulting in 4.5 million topics. This would be useless, since the goal is to make a model that finds similarities among documents, instead of declaring them all different. One of the main steps in LDA is dimensionality reduction, where the number of topics is reduced (in Issuu's case to 150 topics), forcing similar topics to “merge” and reveal deeper underlying patterns.
4.2
Using LDA
All the tweets containing the trending hashtag will be used as the data source for Issuu's LDA model, instead of only the hashtag itself. A hashtag does not provide enough information and context to give a stable result from Issuu's LDA model. The model is context-dependent and would not be able to differentiate between the fruit and the electronics company based on the hashtag #apple, without any context. To give an idea of the richness of the context behind the tweets from a single hashtag like #apple (see Figure 4.4), two tag clouds were generated. A tag cloud is a visual representation of text, which favors the words that are used most in the text, by either color or size.
Figure 4.4: Left: Tag cloud for all words in the tweets containing #apple. Right: Same tag cloud after removing #apple, #free and #mavericks, to get a deeper understanding.
The flowchart (Figure 4.5) contains four steps, visualized as arrows and denoted with numbers. It shows the overall structure of how to turn trends from Twitter into similar magazines/documents using LDA.

[Flowchart: all tweets in the world → location filter [USA] → tweets → database → trend-related tweets → (1) Issuu's LDA → (2) topic distribution → (3) topic simplex → (4) similar documents]
Figure 4.5: From trend to magazines flowchart
The tweets corresponding to the trending hashtag are fed into the LDA model (step 1), which produces the topic distribution (step 2). Figure 4.6 shows the topic distribution for the #apple tweets, where it is easy to see that the software/electronics topic is dominating, as expected.
Figure 4.6: Topic distribution for #apple tweets
Using this LDA topic distribution, it is possible, in combination with the Jensen-Shannon divergence algorithm (Equation 4.1) (step 3), to find similar magazines to recommend from Issuu (step 4):
JSD(P‖Q) = ½ D(P‖M) + ½ D(Q‖M), where M = ½ (P + Q),
(4.1)
where P and Q are two probability distributions, which in Issuu’s case are two LDA topic distributions.
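Step 3 can be illustrated with a small, self-contained Python version of Equation 4.1. This is a sketch, not Issuu's implementation; a base-2 logarithm is used so the divergence is bounded in [0, 1]:

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence D(P‖Q) for discrete distributions,
    # skipping zero-probability terms of P.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence (Equation 4.1): compare both
    # distributions to their average M, making the measure symmetric.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A lower divergence means the two topic distributions, and hence the trend and the magazine, are more similar.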
4.2.1
Results using LDA
Figure 4.7 shows a subset of the magazines found, using LDA, to be similar to the tweets containing the hashtag #apple.
Figure 4.7: Subset of the similar #apple documents using LDA. NB: The documents are blurred due to copyright issues and the terms of service/privacy policy on Issuu; this applies to all figures which show magazine covers.
The resulting magazines range from learning material such as "... For Dummies" to magazines like "Computer Magazine" and "macworld". The popularity and the date of upload on Issuu differ from magazine to magazine. These are parameters which could also be used to weight the documents.
4.3
Solr
Lucene is a Java-based high-performance text search engine library. A document, in Lucene terms, is not a document as we know it, but merely a collection of fields which describe the document. For any given document, these fields could be information like the title and the number of pages. Lucene uses a text analyzer, which tokenizes the data from a field into a series of words. After that a process called stemming is performed, which reduces all the words to their stem/base. See Figure 4.8 for an example.
"Magazine recommendations based on social media trends" (text to be processed)
→ tokenization → [Magazine, recommendations, based, on, social, media, trends] (tokenized text)
→ stemming → [Magazine, recommend, base, on, social, media, trend] (stemmed words)
Figure 4.8: Example of tokenizing and stemming
Lucene uses term frequency-inverse document frequency (tf-idf) as a part of the scoring model, where tf is the term frequency in a document, a measure of how often a term appears in that document. Idf is a measure of how often a term appears across a collection of documents. Solr is an open-source enterprise search server, used widely by services like Netflix2. It is a web application service built around Lucene, adding useful functionality such as geospatial search, replication and a web administration interface for configuration.
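A toy version of the tf-idf idea could look as follows. This is not Lucene's exact scoring formula (which adds normalization and boosts); the smoothing in the idf term is also a common variant chosen here for illustration:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    # tf: how often the term appears in `doc`, relative to its length.
    tf = Counter(doc)[term] / len(doc)
    # idf: how rare the term is across `corpus` (a list of tokenized
    # documents); the +1 smooths against division by zero.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf
```

A term that appears in almost every document (like "apple" in a tech corpus) scores near zero, while a rare term scores high in the few documents that contain it.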
2
www.netflix.com
4.4
Using Solr
In this project an integration with Issuu's Solr server has been created; the search engine is used through HTTP requests and JSON3 responses. JSON is an easily human-readable open standard text format, which is mostly used to transfer data between servers and web applications, as in this project. An example of a request could be:
<base url>?q=apple+mavericks+free+new&wt=json&debug=true&start=0&rows=50

where,

Parameter - Description
base url - Address to access the Solr search engine.
debug - If true, the response will contain additional information, including scores and the reason for each document.
q - Main text to be queried in the request.
rows - Maximum number of results; used to paginate.
start - Used to define the current position, in combination with rows, to paginate.
wt - The format of the response, e.g. json.

Table 4.2: Explanation of parameters to Solr.
The q parameter is constructed using the x most occurring words in the tweets corresponding to the trending hashtag, where x is tunable. 3
json.org
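Constructing the request from the top-x words could be sketched as follows. The base URL is a placeholder, since Issuu's actual Solr endpoint is internal, and the word filtering is a simplification for illustration:

```python
from collections import Counter
from urllib.parse import urlencode

def build_solr_query(tweets, x=4, base_url="https://solr.example.com/select"):
    # Count words across all trend tweets, skipping hashtags themselves.
    words = Counter(w for t in tweets for w in t.lower().split()
                    if not w.startswith("#"))
    # q is built from the x most occurring words, joined with '+'.
    q = "+".join(w for w, _ in words.most_common(x))
    params = {"wt": "json", "debug": "true", "start": 0, "rows": 50}
    return f"{base_url}?q={q}&{urlencode(params)}"
```

For the #apple tweets this yields a query like the q=apple+mavericks+free+new example above.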
4.4.1
Results using Solr
The similar magazines produced by Solr (Figure 4.9) are more diverse than the similar documents from the LDA model (Figure 4.7). They range from Apple magazines and learning material to magazines about the surf spot "Mavericks" in California and the NBA (National Basketball Association) team Dallas Mavericks.
Figure 4.9: Subset of the similar #apple documents using Solr
Appendix B contains a complete example, using the hashtag #bostonstrong displaying the complete process from tweets to similar documents using both LDA and Solr.
Chapter
5
Conclusion
A prototype of an end-to-end solution, with the purpose of spotting location-based trends on a social media network and mapping them to Issuu, has been developed. Twitter was selected as the data source, because it suited the requirements the best, as described in section 1.2. Both of the results (LDA: Figure 4.7 and Solr: Figure 4.9) suggest that improvements could be made. Four improvements for the trending framework and three for the LDA were found useful:

• Trending Framework:
1. Support the existing 28 languages supported1 by Issuu.
2. Capture "slow" trends.
3. Non-parametric model.
4. Recurring weekly and yearly trends.
1
Magazines written in other languages do not have an LDA topic distribution, because only those languages are incorporated into the translation framework and can therefore be translated to English.
• LDA:
1. Limited by Issuu's LDA model.
2. Wikipedia lacks certain topics.
3. Big magazines result in many topics.
5.1
Improvements of the trending framework
USA was chosen as the location in this prototype, because most Issuu readers are from the USA and more than half of Twitter's users are from the country too. In the future the goal will be to support the existing 28 languages supported by Issuu (Figure 5.1).
Figure 5.1: List of languages supported by Issuu, including a colored world map of countries speaking those languages.

A solution is to extend the existing end-to-end solution with Issuu's translation framework (Figure 5.2), which will be able to translate all non-English tweets to English (1st improvement of the trending framework). A tunable trending framework has been developed, which is capable of spotting "fast" trends on Twitter and reducing the importance of daily recurring trends. The hashtags #apple and #pll were found to be trending on the 22nd of October, among a vast number of unimportant recurring trends. It was the day Apple presented their updated product line, including the new OSX Mavericks2, and the Halloween episode of the popular TV-show "Pretty Little Liars" (pll) was shown.
2
Apple Special Event: http://www.apple.com/apple-events/october-2013/
Figure 5.2: Translation module to improve the solution.
Hashtags like #happyhalloween, described in Appendix B along with #bostonstrong, were possible to spot during the 31st of October, but the ability to spot "slow" trends (described in Section 1.3) would need to be improved (2nd improvement of the trending framework). A possible solution is to run multiple instances of the framework simultaneously, with various sizes of the current window (w) and the reference window (r) (Figure 5.3).
Figure 5.3: Three simultaneously running trending frameworks.

Creating a new trending framework which is built on a non-parametric model (3rd improvement of the trending framework) [Nik12] would make the system more robust and faster at spotting trends. Robust because parameters like the threshold are not defined from the beginning, but observed from training data, which is used to decide whether a new dataset is a trend or not.
A last improvement (4th) would be to implement the weekly and yearly recurring trends (as described in Section 3.4), which follow basically the same principle as the daily recurring trends.
5.2
Improvements of the LDA model
This project is limited by Issuu's LDA model (1st improvement of the LDA model): the similar magazines found using the Apple tweets (Figure 4.7) include magazines about Microsoft too, because Apple is part of the overall Software topic, which also includes Google, Microsoft etc. (see Figure 5.4).
[Figure content: the top words for four topics about technology, computers and software (#98, #70, #1 and #66), shown next to the top words in the #apple tweets.]
Figure 5.4: Top words in the topics.
Wikipedia is a great resource to use as a text corpus for training the LDA model: it is free, it is broad enough to cover multiple topics, and it is very clean and focused in the sense that each article is about one topic. For example, a Wikipedia article about Italian food is very unlikely to write about technology or cars. The main disadvantage of using Wikipedia at Issuu is that Wikipedia does not cover all possible themes evenly (2nd improvement of the LDA model) - it is overloaded with business and technology, while lacking in entertainment topics. That is one of the reasons why certain topics in Issuu's LDA model are combined together (for example American football and baseball), making it impossible for us to distinguish between them. This will be addressed when Issuu launches their new LDA model with more topics.

Magazines are often broad and big and therefore include many topics. This is of course a problem, because a single topic distribution is computed from the whole magazine. To address this problem (3rd improvement of the LDA model), a solution would be to compute LDA per page instead (Figure 5.5), which makes it possible to recommend single pages within a magazine.
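The per-page idea could be sketched as follows. This is an illustration with hypothetical per-page distributions, using total-variation distance as a simple stand-in for the Jensen-Shannon divergence of Equation 4.1:

```python
def closest_pages(trend_dist, page_dists, k=3):
    # Rank a magazine's pages by similarity of their per-page topic
    # distributions to the trend's distribution; return the k best.
    def tv(p, q):
        # Total-variation distance between two discrete distributions
        # (stand-in for the JSD of Equation 4.1).
        return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
    ranked = sorted(page_dists.items(), key=lambda item: tv(trend_dist, item[1]))
    return [page for page, _ in ranked[:k]]
```

Instead of one distribution per magazine, each page gets its own, so a single relevant article inside a broad magazine can still be surfaced.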
Figure 5.5: LDA per page solution.
5.3
LDA vs. Solr
The two approaches - LDA (Section 4.1) and Solr (Section 4.3) - differ in the results they provide. One could argue that the quality of the LDA approach is better than that of Solr. Once both systems are launched live, A/B testing3 could be used to see which one fits Issuu better. 3
en.wikipedia.org/wiki/A/B_testing
Appendix
A
Dataset statistics
A.1
Location
Top 10 locations:
Los Angeles - 2,051,577
Texas - 1,678,012
Georgia - 1,539,281
New York - 1,503,027
Manhattan - 1,435,282
Chicago - 1,280,852
Florida - 1,273,262
Philadelphia - 1,178,112
Ohio - 1,116,043
South Carolina - 1,078,361

Bottom 10 locations (1 each): Oberon, Macy's, Woodberry, Juntura, Conconully, German Valley, Unionville Center, Cedarbend, Funkley, Alicia
Figure A.1: Top and bottom 10 of used locations
A.2
Hashtag
Top 10 hashtags:
job - 472,043
jobs - 412,111
tweetmyjobs - 230,950
oomf - 133,661
wcw - 84,900
pdx - 76,820
veteranjob - 68,357
mcm - 61,772
coupon - 56,949
nursing - 55,968

Hashtags analyzed in this project:
happyhalloween - 13,274
bostonstrong - 9,913
apple - 7,388
pll - 6,070

Bottom 10 hashtags (1 each): wrongconversation, billycorgan, justwantcheesecake, stalkerwife, travelquestions, uneedheadandshoulders, hdbros, thoughtweweregonnamove, quietdownpeople, xodb
Figure A.2: Top and bottom 10 of used hashtags, including the four hashtags analyzed in this project: happyhalloween, bostonstrong, apple and pll.
Appendix
B
Example: #bostonstrong
The fluctuation of the total amount of tweets per hour follows the same pattern as described in section 3.1, as expected:
This is a full example of how the trending framework works (part 1) and the transformation from trends to magazines/documents on Issuu (part 2). Short facts about the subset of the dataset: 5,465,644 tweets containing 1,195,695 hashtags, of which 280,855 are unique.
Figure B.1: Total tweets per hour
The three different hashtags used in this example, #bostonstrong, #happyhalloween and #jobs, differ from the ones in the explanation of the trending framework in chapter 3. The hashtag #bostonstrong - used by the fans of the baseball team Boston Red Sox - was an unexpected event, because the Red Sox became the champions of the World Series1. #happyhalloween, on the other hand, is a hashtag used to celebrate Halloween, a yearly recurring event.
Figure B.2: Raw tweet count for hashtags
Applying the knowledge learned and the combined trend_score equation, 3.5, results in the expected trend #bostonstrong:
Figure B.3: Fully processed data
As might be noted in figure B.3, #happyhalloween is actually spotted as a trend, despite the fact that the settings are those for spotting fast trends - which are in focus at the moment, as described in chapter 3. Nevertheless, #happyhalloween satisfies the requirements for being considered a slow trend, as described in section 1.3. 1
en.wikipedia.org/wiki/World_Series
Two tag clouds with the top 100 words in all the tweets containing the hashtag #bostonstrong have been created, to give an overall perspective of the semantic richness of the data.
Figure B.4: Top: Tag cloud for all words in the tweets containing #bostonstrong. Bottom: Tag cloud for the rest of the words, after removing #bostonstrong, to get a deeper understanding.

The documents similar to those tweets, computed using LDA and Solr, can be seen in figures B.5 and B.6. Although not all documents are about baseball or the Boston Red Sox, the overall impression of the documents is positive. The documents computed using Solr are clearly different from the ones found using LDA; magazines/documents about the city of Boston are among the selected too.
Figure B.5: Subset of the similar #bostonstrong documents using LDA
Figure B.6: Subset of the similar #bostonstrong documents using Solr
Appendix
C
Implementation details
C.1
Flask
The existing Python code will be used as data provider for the website, to reuse as much of the code as possible. It could probably have been optimized by writing it all in JavaScript, but this is just a debug tool. A web application framework supporting execution of Python code is therefore preferable. Flask is a lightweight Python web application micro-framework, which means that the core is kept simple but extensible. Flask supports both a local server and broadcasting, plus the opportunity for a custom port number (default is 5000). In addition it supports the decorator app.route(), which is a URL trigger: as soon as the URL defined in the route decorator matches, the attached function is executed.
The interactive tweet-map needs three different app.route()'s:
/ : The website's index page loads the HTML which displays the dialog shown in Figure ?? and discussed in the associated section.

/get_tweets/... : To receive the tweets from the database matching the settings specified in the initial dialog of the website, this URL needs three parameters: country_code, timestamp and duration. After receiving the tweets, they will be plotted on the map. If no tweets are available for the time period and country, an error message will be displayed and a new time period can be selected.

/related_documents : When the tweets are loaded and displayed on the map, the sidebar appears, where it is possible to see the documents related to the tweets on the map. This URL is triggered by the sidebar and will calculate all the documents, using the trending framework in combination with LDA and Solr, which is discussed later in the report.
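The three routes could be declared along these lines; this is a minimal sketch in which the handler bodies are placeholders, since the real handlers query the database and run the trending framework:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    # Real version: render the HTML page with the initial dialog.
    return "tweet-map index page"

@app.route("/get_tweets/<country_code>/<timestamp>/<duration>")
def get_tweets(country_code, timestamp, duration):
    # Real version: fetch tweets from the database matching the
    # country and time period; here an empty list stands in.
    return jsonify(country=country_code, tweets=[])

@app.route("/related_documents")
def related_documents():
    # Real version: run the trending framework plus LDA/Solr.
    return jsonify(documents=[])
```

Running app.run() starts the local server on the default port 5000.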
C.2
Peewee
Peewee1, an ORM (object-relational mapping) module for Python which supports the database choice, was chosen.
ORM: A programming technique for converting objects between incompatible type systems in object-oriented programming languages. For database use, it creates a "virtual object database" which synchronizes the state of the objects in the programming language with the database tables. Usually ORM modules/libraries provide a simple abstraction of the SQL language on top too. 1
peewee.readthedocs.org - open-source, MIT (Massachusetts Institute of Technology) license, which means that all code is free to use and free to change/modify however it fits the integration implementation.
Besides providing an extended abstraction of the SQL language on top of the database connection, Peewee also provides a Python script which makes it possible to grab a database from a running MySQL server and auto-generate the Python classes; this includes tables, foreign keys etc.
C.3
Database
There are a ton of different databases to choose from, and each has its own advantages. What was preferred in this project is a regular relational database, which is easy and fast to set up using predefined scripts. The database is not the main focus of this project and is mainly used as a placeholder for the data to be saved, instead of keeping it in memory.
C.4
MySQL
By examining the JSON result callback from the Twitter service and investigating the different analysis models and frameworks for achieving the goal of the project (described in section 1.1), a solution for the table(s) of the database was found. It turned out that all information needed was able to fit into a single table, named tweet (see figure C.1).

Figure C.1: E/R Diagram

The desired information from the tweet is attributes such as the text, time stamp and longitude/latitude, in order to analyze the tweets over time and location.
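A sketch of the single tweet table follows. The exact column names are assumptions (the real MySQL schema is not shown in this appendix), and SQLite is used here in place of MySQL for illustration:

```python
import sqlite3

# Hypothetical columns for the tweet table of Figure C.1, inferred
# from the attributes named above (text, time stamp, location).
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tweet (
    id         INTEGER PRIMARY KEY,
    text       TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    longitude  REAL,
    latitude   REAL
)
""")
```

Keeping everything in one table matches the project's use of the database as simple persistent storage rather than a normalized model.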
Bibliography
[Bee12]
Beevolve. An exhaustive study of twitter users across the world. http://www.beevolve.com/twitter-statistics/, October 2012. Online; last read 6. January 2014.
[Ble09]
David M. Blei. Topic models. http://videolectures.net/ mlss09uk_blei_tm/, November 2009. Online Video Lecture; last viewed 26. December 2013.
[BNG11]
H. Becker, M. Naaman, and L. Gravano. Beyond trending topics: Real-world event identification on twitter. In Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[BNJ03]
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
[CDCS10]
Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on twitter based on temporal and social terms evaluation. In Proceedings of the Tenth International Workshop on Multimedia Data Mining, MDMKDD ’10, pages 4:1–4:10, New York, NY, USA, 2010. ACM.
[IHS06]
Alexander Ihler, Jon Hutchins, and Padhraic Smyth. Adaptive event detection with time-varying poisson processes. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, pages 207–216, New York, NY, USA, 2006. ACM.
[Jac88]
R. Jackson. The matthew effect in science. INTERNATIONAL JOURNAL OF DERMATOLOGY, 27(1):16–16, 1988.
[KLPM10]
Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, pages 591–600, New York, NY, USA, 2010. ACM.
[Nik12]
Stanislav Nikolov. Trend or no trend: A novel nonparametric method for classifying time series. Master’s thesis, Massachusetts Institute of Technology, Massachusetts, USA, September 2012.
[RIS+ 94]
Paul Resnick, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. Grouplens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, CSCW ’94, pages 175–186, New York, NY, USA, 1994. ACM.
[ŘS10]
Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http: //is.muni.cz/publication/884893/en.
[Sag12]
Jeff Saginor. Study finds facebook users more private than ever. http://www.digitaltrends.com/web/ study-finds-facebook-users-more-private-than-ever/, February 2012. Online; last read 23. August 2013.
[Sal89]
Gerard Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989.
[SM95]
Upendra Shardanand and Pattie Maes. Social information filtering: Algorithms for automating “word of mouth”. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’95, pages 210–217, New York, NY, USA, 1995. ACM Press/Addison-Wesley Publishing Co.
[Wal05]
Thomas Vander Wal. Explaining and showing broad and narrow folksonomies, 2005.