Revealing Connections: Geotagged Genoa
Columbia University GSAPP
Nick Kunz - Master of Science in Urban Planning, 2019
Sunghoon Lee - Master of Architecture, 2020
Douglas Woodward - Adjunct Professor, Urban Planning
Megan Marini - Adjunct Assistant Professor, Urban Planning
Richard Plunz - Professor, Architecture
April 25th, 2018
This study is also available online at: https://tinyurl.com/y89kfn8g
Preface

Marco Bucci, the newly elected mayor of Genoa, Italy, has a strategic objective to attract hundreds of thousands of people back to the city. Like many other mayors, a major demographic he would like to attract is young, educated professionals with a penchant for entrepreneurship and innovation. Understanding the changing perceptions of this demographic, as they relate to the varying degrees of privacy in urban space, is where our proposition begins.

The way people use urban space is changing. The traditional separation of quotidian uses such as housing, the workplace, and leisure is beginning to blur, and this overlap will likely continue to progress. Technology has allowed us to assess this notion with the added insight afforded by heavy social media usage. Utilizing this new layer of information, we conducted a geospatial analysis of social media activity in Genoa as a way to inform our architecture and planning interventions beyond physically obvious urban morphology.

The underlying premise of our intervention was to create rich, multipurpose public space that serves the wider range of uses and tendencies demanded by our target demographic. The physical manifestation of this theory allows users to more easily overcome the aggressive topographic constraints and helps to amalgamate the discontinuity of urban spaces found within the Quarto neighborhood. Our intervention aims to achieve a much greater degree of physical connectivity and richness of urban public space in a relatively simple way. It is our hope that applying this model within the larger narrative of urban and economic development will help guide Genoa in its path to future development.
Genoa, Italy
Demographic

The purpose of this section is to brief the issues facing Genoa and the Quarto neighborhood by more explicitly defining the demographic around which Mayor Bucci has established his strategic objective of attraction. Understanding the profile and general tendencies of this demographic was critical in order to proceed appropriately in assessing possible architecture and planning interventions in Quarto. Our assessment of existing works revealed a few critically important themes and characteristics, exhibited in the following outline.

Demographic:
● 21-35 Years Old
● Ethnically and Racially Diverse
● Highly Educated (Bachelor’s Degree or Higher)
● Higher Rates of Participation in the Sharing Economy, Lower Rates of Ownership
● Preferential Tendency for Urban Environments
● Wider Spectrum of Transportation Demands
● Postponing Marriage and Childrearing
● Early Adopters of Technology
● Heavy Users of Social Media
#millennials

Circella, A.G., Fulton, L., Alemi, F., Berliner, R.M., Tiedeman, K., Mokhtarian, P.L., Handy, S. (2016). What Affects Millennials’ Mobility? Part I: Investigating the Environmental Concerns, Lifestyles, Mobility-Related Attitudes and Adoption of Technology of Young Adults in California. The National Center for Sustainable Transportation. Issue CA16-2825.

Couture, V., Handbury, J. (2015). Urban Revival in America, 2000 to 2010. National Bureau of Economic Research. NBER Working Paper No. 24084; JEL No. R23. DOI: 10.3386/w24084.

Glaeser, E.L., Rosenthal, S.S., Strange, W.C. (2010). Urban Economics and Entrepreneurship. Journal of Urban Economics. Volume 67, Issue 1, January 2010, Pg 1-14. DOI: 10.1016/j.jue.2009.10.005.

Prayitno, K. (2017). Moving Millennials: The Transit Experiences of Young Adults Living in High-Rise Suburbs of Toronto. UWSpace. URI: 10012/12695.

Taylor, P. (2010). Millennials: A Portrait of Generation Next: Confident, Connected, Open to Change. Pew Research Center. OCLC 535504509.
Theory

Workplaces are becoming more like public places, and public places are becoming more like workplaces. Homes are becoming more like workplaces, and workplaces are becoming more like homes. In other words, we are beginning to see a convergence and overlapping of social and economic activity occurring in urban space. Goods and services that were once used privately are now being shared and used more publicly. Because of this phenomenon, the demand for multipurpose urban space that can provide a platform for the widening range of public-private interactivity is becoming increasingly evident.

A common example can be exhibited through ride sharing services, such as Uber and Lyft, where private vehicles are used for shared ridership. Another common example is housing. Airbnb, flexible lease agreements, and other forms of temporary stays and shared quarters are becoming more commonplace in urban life. Transportation and housing are not the only urban apparatuses subject to this phenomenon. Workplaces, too, are becoming more shared. A common example is the shared office space and flexible commercial lease agreements made popular by WeWork and other similar 'office as a service' platforms. It stands to reason that WeWork and others like it have made this demand evident for small businesses. More recently, however, large multinational corporations are finding the need for similar shared workplaces. As of 2018, IBM has downsized internally leased office space, opting for the greater flexibility offered by WeWork.

Paradoxically, transportation, housing, the workplace, and other social and economic activities are also becoming less shared in urban space.
As the economic gap between the wealthy and the poor increases, and as the working class endures higher degrees of economic pressure (especially pronounced in shrinking cities like Genoa), urban space is also becoming more privately reserved for those with the ability to pay. This divergence of urban space becoming more private while simultaneously becoming more public is creating an environment in which the spectrum of urban space demanded by the larger narrative of society is widening. The range demanded by recent social and economic interactivity occurring within urban space is evident in the behavioral tendencies exhibited by the target demographic; in the previous examples, millennials are perhaps the most pronounced subset of the population to exhibit them. The point at which social and economic interactivity collides within the public-private spectrum is the theoretical foundation on which this study is predicated. Assessing this notion is as exciting as it is challenging; for this we turn to social media.
Methodology

This methodology was designed to assess general behaviors of the target demographic as they relate to the location and use of urban space in Genoa, and how multiple uses might converge. Because it is generally accepted that the target demographic are heavy users of social media, it may be possible to begin drawing inferences to inform an architecture and planning intervention from data collected through Twitter, underpinned by lived experience.

Data & Information
● Twitter
○ Public API (Limited Use)
○ 30 Days of Data (3/20/18 - 4/20/18)
○ 8-10mi Radius from Genoa’s Geographic Center
○ 220,838 Total Observations
○ 2,176 Geotagged Observations (1% of Total Observations)
○ 16 Variables (6 Utilized)
■ Text (Tweet Content)
■ Timestamp (YYYY-MM-DD HH:MM:SS)
■ Geo-Coordinates (WGS84)
■ Device (iPhone, Android, Web, Etc.)
■ Public Engagement Metrics (Favorites, Retweets)
■ URLs (Links, Photos)
● OpenStreetMap (OSM)
○ Basemaps & Satellite Imagery (WGS84)
○ Tertiary & Exploratory Information

Geospatial Analysis
● Quantitative Data: Twitter API, OSM
○ Geolocation: Content Location; 2,176 Points (WGS84)
○ Kernel Density Estimation: Spatial Concentration (Heat Map); 400m / 5min Walk
○ k-Means: Unsupervised Point Clustering; 2 min - 35 max
○ k-Nearest Neighbor (k-NN): Supervised Point Clustering; k = 32 (Informed by k-Means)
○ Voronoi Tessellation: Thiessen Polygons (Interpolated Polygonal Bounds / Territories)
○ Attribute Arithmetic: Popularity Analysis; n Favorites + n Retweets
● Qualitative Data: Twitter API, OSM
○ Satellite Imagery: Observations, Informal Exploration (WGS84)
○ Site Visit: Field Verification, Observations, Genovese Insight & Lived Experience

Content Analysis
● Quantitative Data: Twitter API
○ Algorithmic Arithmetic: Term Frequency Analysis
○ Latent Dirichlet Allocation (LDA): Natural Language Processing - Topic Modeling; k = 25, n = 5
○ Pearson Correlation Coefficient: Term Association Modeling
● Qualitative Data: Twitter API, OSM
○ Tweets & Photos: Content Review, Verification, Observations
○ Site Visit: Field Verification, Observations, Genovese Insight & Lived Experience
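Several steps in the geospatial analysis above depend on great-circle distances between geotagged points; the appendix computes them with geosphere's distHaversine in R. For readers outside R, a minimal Python sketch of the haversine computation (the function name and the example coordinates are ours):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    R = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# e.g. distance from Genoa's geographic center to a point near Quarto
d = haversine_m(44.407382, 8.918684, 44.3968, 8.9900)
```

Any pairwise distance matrix over the 2,176 geotagged points can then be built by applying this function to every pair, which is what the R script's geo.dist helper does.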
Data

Twitter
● Public API (Limited Use)
● 30 Days of Data (3/20/18 - 4/20/18)
● 8-10mi Radius from Genoa’s Geographic Center
● 220,838 Total Observations
● 2,176 Geotagged Observations (1% of Total Observations)

Variables - Total (16): text, favorited, favoriteCount, replyToSN, created, truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude, latitude

Variables - Utilized (6): text, favoriteCount, created, retweetCount, longitude, latitude
Descriptions
1. Text: contains tweet content and their respective URLs (string)
2. Favorite Count: the number of favorites the tweet had received at the time of collection (integer)
3. Created: time the tweet was published, YYYY-MM-DD HH:MM:SS (date and time)
4. Retweet Count: the number of retweets the tweet had received at the time of collection (integer)
5. Longitude: x geocoordinate (WGS84) - user-determined geolocation and public sharing (vector)
6. Latitude: y geocoordinate (WGS84) - user-determined geolocation and public sharing (vector)
OpenStreetMap (OSM)
● Basemaps & Satellite Imagery (WGS84)
● Tertiary & Exploratory Information
Geospatial Analysis

The purpose of this section is to visually illustrate the results so that they may be further interpreted. The details of the geospatial analysis are explained by the R script, which can be found in the appendix. This study can be found online at https://tinyurl.com/y89kfn8g.

Geolocation: Content Location; 2,176 Points (WGS84)
Kernel Density Estimation:​ Spatial Concentration (Heat Map); 400m / 5min Walk
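The heat map above weights each tweet by a kernel that falls off with distance, using a 400 m bandwidth (a 5-minute walk). A compact Python sketch of a Gaussian kernel density estimate at one query location, under that bandwidth (an illustrative approximation using a flat-earth projection at city scale, not the leaflet.extras heat map implementation used in the appendix):

```python
import math

def kde_intensity(points, query, bandwidth_m=400.0):
    """Gaussian kernel density at `query`; `points` are (lat, lon) pairs.
    Uses an equirectangular approximation, adequate at city scale."""
    qlat, qlon = query
    m_per_deg = 111195.0  # meters per degree of latitude
    total = 0.0
    for lat, lon in points:
        dy = (lat - qlat) * m_per_deg
        dx = (lon - qlon) * m_per_deg * math.cos(math.radians(qlat))
        d2 = dx * dx + dy * dy
        total += math.exp(-d2 / (2 * bandwidth_m ** 2))
    return total / (len(points) * 2 * math.pi * bandwidth_m ** 2)
```

Evaluating this on a grid over the study area and coloring by intensity reproduces the heat-map idea: density is highest where many tweets fall within roughly one bandwidth of the query point.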
k-Means: Unsupervised Point Clustering; 2 min - 35 max

The purpose of this section is to visually illustrate the results so that they may be better understood. The details of the k-means analysis are explained by the R script, which can be found in the appendix. It is also important to mention that the training data used in this study excluded the last day of data collection (April 20th, 2018).
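The appendix selects the number of clusters with NbClust over 2-35 candidates; the underlying k-means procedure (Lloyd's algorithm) can be sketched in plain Python as follows (a simplified illustration on planar coordinates, not the R implementation):

```python
import random

def kmeans(points, k, iters=50, seed=222):
    """Plain Lloyd's algorithm on (x, y) pairs; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: (x - centers[c][0]) ** 2
                                        + (y - centers[c][1]) ** 2)
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels
```

On the geotagged tweets, the same idea runs over the haversine distance matrix rather than raw Euclidean coordinates.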
k-Nearest Neighbor (k-NN):​ Supervised Point Clustering; k = 32 (Informed by k-Means)
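For context, k-NN classification assigns a point the majority label among its k nearest neighbors. A minimal Python sketch on planar coordinates (the function name and data layout are ours):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs.
    Returns the majority label among the k nearest points to `query`."""
    nearest = sorted(train,
                     key=lambda t: (t[0][0] - query[0]) ** 2
                                 + (t[0][1] - query[1]) ** 2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

In the study, k = 32 was chosen from the k-means results, so each geotagged tweet is grouped with its spatially nearest peers.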
Voronoi Tessellation:​ Thiessen Polygons (Interpolated Polygonal Bounds / Territories)
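A Thiessen (Voronoi) polygon is simply the region of the plane closer to its seed point than to any other seed; membership can be tested directly, as in this small Python sketch (names are ours):

```python
def voronoi_cell(seeds, point):
    """Index of the seed whose Voronoi cell contains `point`
    (i.e., the nearest seed by squared Euclidean distance)."""
    return min(range(len(seeds)),
               key=lambda i: (seeds[i][0] - point[0]) ** 2
                           + (seeds[i][1] - point[1]) ** 2)
```

The appendix's deldir-based function computes the explicit polygon boundaries for mapping; the interpolated "territory" of each tweet location is the same nearest-seed region shown here.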
Attribute Arithmetic: ​Popularity Analysis; n Favorites + n Retweets
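The popularity attribute is simple arithmetic: the sum of favorites and retweets per tweet, then binned for display. A Python sketch mirroring the R cut(popularity, c(0, 5, 10, 15)) bins used in the appendix (half-open intervals; scores of 0 fall outside the lowest bin, as in R):

```python
def popularity(favorite_count, retweet_count):
    """Popularity score used in the report: n favorites + n retweets."""
    return favorite_count + retweet_count

def popularity_bin(score, edges=(0, 5, 10, 15)):
    """Half-open bins (lo, hi], mirroring R's cut(); returns None outside range."""
    for lo, hi in zip(edges, edges[1:]):
        if lo < score <= hi:
            return f"({lo},{hi}]"
    return None
```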
Satellite Imagery:​ Observations, Informal Exploration (WGS84)
Content Analysis

The purpose of this section is to visually illustrate the results so that they may be further interpreted. The details of the content analysis are explained by the R script, which can be found in the appendix.

Algorithmic Arithmetic: Term Frequency Analysis
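Term frequency analysis reduces to cleaning the text (lowercasing, stripping URLs, punctuation, and numbers, then removing stop words) and counting what remains. A self-contained Python sketch of that pipeline (an approximation of the tm package steps used in the appendix, not the same implementation):

```python
import re
from collections import Counter

def term_frequencies(tweets, stopwords=frozenset(), min_count=1):
    """Lowercase, strip URLs/punctuation/numbers, then count terms."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"http\S+", " ", text.lower())  # remove URLs
        words = re.findall(r"[a-zà-ÿ]+", text)        # keep letters only
        counts.update(w for w in words if w not in stopwords)
    return {t: c for t, c in counts.items() if c >= min_count}
```

With an Italian stop-word list and a cutoff of 25 occurrences, this yields the frequent-term table plotted in the report.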
Latent Dirichlet Allocation (LDA): Natural Language Processing - Topic Modeling; k = 25, n = 5

Hello: "bomdia, bonjour, buenosdias, buomgyorny, goodmorning"
Photography: "foto, appena, pubblicata, genova, porto"
Weather: "genoa, kmh, humidity, wind, current"
Museum: "mare, genova, pasqua, museo, buona"
Japanese Martial Arts: "hanbojutsu, ninja, tecniche, and, samurai"
Fitness: "sexyfitnessgirls, club, anitaherbert, michellelewin, sexyabdominals"
Sports: "lanterna, sampdoria, instagood, blu, derby"
Gardens: "euroflora, oggi, villa, giorno, forte"
Travel: "italy, genova, italia, hotel, igitaly"
Relax: "genova, sun, sea, love, relax"
Restaurant: "san, ristorante, giorgio, amp, you"
Artist: "musante, buongiorno, francescomusante, francescomusanteart, danilovigofotografo"
Aquarium: "genova, acquario, titoacerbo, pescara, viaggiodistruzione"
Boccadasse: "genova, boccadasse, just, chiosco, with"
Sunset / Sunrise: "sunset, genoa, sunrise, will, city"
Sports: "nervi, quando, stadio, luigi, sempre"
Liguria Region: "liguria, genoa, area, lomellini, noc"
Travel: "genova, grazie, bar, aeroporto, colombo"
City Center: "genova, centro, piazza, zena, ferrari"
Spring: "via, primavera, msc, meraviglia, pegli"
Genoa: "genovamorethanthis, igersgenova, genovacity, genovagram, lamialiguria"
Theatre: "felice, photooftheday, finalmente, teatro, carlo"
Easter: "the, camogli, easter, and, its"
Pearson Correlation Coefficient: Term Association Modeling (Randomly Chosen from Topics and Terms)

“Hotel”
“Home”
“Work”
“Relax”
“Art”
“Photography”
“City”
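The association modeling above (tm's findAssocs) reports terms whose per-document occurrence correlates with the search term, measured by the Pearson correlation coefficient. The coefficient itself, sketched in Python for reference:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In the term-association setting, xs and ys are the counts of two terms across all tweets; the appendix keeps associations with a coefficient of at least 0.25.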
Postulation

Considering Mayor Bucci’s strategic objectives for Genoa, the ways in which millennials appear to be redefining use cases for urban space, the spatial concentrations and clustering of social media activity revealed by the analysis (which provide insight into how this is materializing in ways not necessarily defined by designers and planners), and the activities the target demographic might be engaged in, it appears prudent to consider multipurpose public space as an important proposition for the Quarto neighborhood within the context of broader economic development.

The physical architecture and planning interventions are continued in the next phase of this study. The materialization of this study will demonstrate how users can more easily overcome the aggressive topographic constraints and help to amalgamate the discontinuity of urban spaces found within the Quarto neighborhood, while taking full account of the study conducted here.
Appendix

#### REVEALING CONNECTIONS ####

#### LIBRARIES ####

# initiate libraries
library(dplyr)
library(tidyverse)
library(qdapRegex)
library(tm)
library(sp)
library(twitteR)
library(leaflet)
library(leaflet.extras)
library(geosphere)
library(NbClust)
library(ggplot2)
library(topicmodels)
library(wordcloud)

#### DATA HANDLING ####

# twitter api credentials & authorization
api_key <- "API KEY PLACEHOLDER"
api_secret <- "API SECRET KEY PLACEHOLDER"
token <- "TOKEN KEY PLACEHOLDER"
token_secret <- "TOKEN SECRET KEY PLACEHOLDER"
setup_twitter_oauth(api_key, api_secret, token, token_secret)

# search parameters
tweets <- searchTwitter(' ',                 # search keywords and/or hashtags
  n = 999999,                                # max number of tweets to return
  since = '2018-03-20',                      # search start date (YYYY-MM-DD)
  until = '2018-04-20',                      # search end date (YYYY-MM-DD)
  geocode = '44.407382,8.918684,10mi')       # search area (lat,lon,radius) no spaces!

# store search results in data frame
tweets.df <- twListToDF(tweets)

# print data frame as csv to specified file path for record (.csv)
write.csv(tweets.df, "FILE/PATH/PLACEHOLDER/data.csv")

# load data
filenames <- list.files(path = "FILE/PATH/PLACEHOLDER", pattern = "*.csv")  # lists csv files in file path
fullpath = file.path("FILE/PATH/PLACEHOLDER/", filenames)  # defines file path
tweetsCombined <- do.call("rbind", lapply(fullpath, read.csv, header = TRUE))  # data frame combining all csv files in file path

# subset data
tweetsCombined_geolocatedSubset <- tweetsCombined[!is.na(tweetsCombined$longitude), ]  # removes observations without geolocations

# clean data
tweetsCombined_geolocatedSubset$text <- sapply(tweetsCombined_geolocatedSubset$text,
  function(row) iconv(row, "latin1", "ASCII", sub = ""))  # removes emojis
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"  # url detection in text string
tweetsCombined_geolocatedSubset$url <- str_extract(tweetsCombined_geolocatedSubset$text, url_pattern)  # return url to data frame in new column
tweetsCombined_geolocatedSubset$text <- rm_twitter_url(tweetsCombined_geolocatedSubset$text)  # remove url from text string
tweetsCombined_geolocatedSubset$url <- str_c(" <a href='", tweetsCombined_geolocatedSubset$url, "'>",
  tweetsCombined_geolocatedSubset$url, "</a>")  # create active link
tweetsCombined_geolocatedSubset <- unite(tweetsCombined_geolocatedSubset, popup, c(text, url))  # combine text string and url for popup
tweetsCombined_geolocatedSubset$popup <- sub("_", "", tweetsCombined_geolocatedSubset$popup)  # remove underscore generated from uniting text and url
tweetsCombined_geolocatedSubset$popup <- sub("NA", "", tweetsCombined_geolocatedSubset$popup)  # remove NA at end of strings without active links

#### GEOSPATIAL ANALYSIS ####

# haversine distance matrix (used by both k-means and k-nn)
geo.dist = function(tweetsCombined_geolocatedSubset){
  require(geosphere)
  distMatrix <- function(i, z){
    dist <- rep(0, nrow(z))
    dist[i:nrow(z)] <- distHaversine(z[i:nrow(z), 16:17], z[i, 16:17])
    return(dist)
  }
  dm <- do.call(cbind, lapply(1:nrow(tweetsCombined_geolocatedSubset), distMatrix,
    tweetsCombined_geolocatedSubset))
  return(as.dist(dm))
}

# k-means clusters
set.seed(222)  # random number seed
nc <- NbClust(tweetsCombined_geolocatedSubset[16:17],  # data
  distance = "maximum",  # distance type
  min.nc = 2,            # min number of clusters
  max.nc = 35,           # max number of clusters
  method = "kmeans")

# k-means cluster plot
table(nc$Best.n[1,])
barplot(table(nc$Best.n[1,]),
  names.arg = nc$Best,
  xlab = "Number of Clusters",
  ylab = "Quantity of Indices",
  main = "Geotagged Tweets Genoa k-Means")

# k-nn clustering analysis
K = 32  # number of k clusters (informed from k-means)
distMatrix <- geo.dist(tweetsCombined_geolocatedSubset)  # distance matrix
cluster <- hclust(distMatrix)  # hierarchical clustering method
kMeansCluster <- kmeans(geo.dist(tweetsCombined_geolocatedSubset), centers = K)  # k-means clustering, number of clusters
tweetsCombined_geolocatedSubset$clusterID <- cutree(cluster, k = K)  # create cluster variable column

# voronoi tessellation algorithm
voronoipolygons = function(layer) {
  require(deldir)
  require(sp)
  crds = layer@coords
  z = deldir(crds[,1], crds[,2])
  w = tile.list(z)
  polys = vector(mode = 'list', length = length(w))
  for (i in seq(along = polys)) {
    pcrds = cbind(w[[i]]$x, w[[i]]$y)
    pcrds = rbind(pcrds, pcrds[1,])
    polys[[i]] = Polygons(list(Polygon(pcrds)), ID = as.character(i))
  }
  SP = SpatialPolygons(polys)
  voronoi = SpatialPolygonsDataFrame(SP,
    data = data.frame(x = crds[,1], y = crds[,2],
      row.names = sapply(slot(SP, 'polygons'), function(x) slot(x, 'ID'))))
}

# voronoi tessellation analysis
tweetsCombined_geolocatedSubset_voronoi <- unique(tweetsCombined_geolocatedSubset[16:17])  # remove geolocation duplicates (duplicates will not work through voronoi function)
voronoiSPDF <- SpatialPointsDataFrame(cbind(tweetsCombined_geolocatedSubset_voronoi$longitude,
  tweetsCombined_geolocatedSubset_voronoi$latitude),
  tweetsCombined_geolocatedSubset_voronoi, match.ID = TRUE)  # create spatial data frame
voronoiSPDF <- voronoipolygons(voronoiSPDF)  # create voronoi polygons from function

#### MAPPING ####

# create map marker pop up content
popupContent <- tweetsCombined_geolocatedSubset$popup  # text string of tweets above map marker

# create a color palette by k-nn clustering
colorsmap = colors()[1:length(unique(tweetsCombined_geolocatedSubset$clusterID))]
clusterPalette = colorNumeric(palette = "Set2",
  domain = tweetsCombined_geolocatedSubset$clusterID)

# create popularity df (n favorites + n retweets)
tweetsCombined_geolocatedSubset$popularity <- (tweetsCombined_geolocatedSubset$favoriteCount +
  tweetsCombined_geolocatedSubset$retweetCount)  # create new column from sum of favorites + retweets
tweetsCombined_geolocatedSubset$popularityBins = cut(tweetsCombined_geolocatedSubset$popularity, c(0, 5, 10, 15))
# create map
map <- leaflet(data = tweetsCombined_geolocatedSubset) %>%  # run leaflet

  ### basemap group ###
  addProviderTiles(providers$CartoDB.Positron, group = "Basemap Model") %>%  # default basemap style 0
  addProviderTiles(providers$Stamen.Toner, group = "Hybridized Imagery") %>%  # basemap style option 1
  addProviderTiles(providers$Esri.WorldImagery, group = "Hybridized Imagery",
    options = providerTileOptions(opacity = 0.5)) %>%
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite Imagery") %>%  # basemap style option 2

  # default display location
  setView(lng = 8.918684,  # longitude coord
    lat = 44.407382,       # latitude coord
    zoom = 13) %>%         # scale

  ### overlay group ###
  # twitter markers
  addCircleMarkers(~longitude, ~latitude,  # geolocation marker
    fill = TRUE,                           # marker fill option
    radius = 5,                            # marker size
    fillColor = "#1DA1F2",                 # marker fill color
    fillOpacity = 0.75,                    # marker fill transparency
    weight = 0,                            # marker stroke weight (outline)
    opacity = 0.75,                        # marker stroke transparency
    color = "#1DA1F2",                     # marker stroke color (outline)
    popup = ~as.character(popupContent),   # load pop up content
    group = "Tweets") %>%                  # group category for toggle

  # twitter color markers (cluster classified)
  addCircleMarkers(~longitude, ~latitude,
    fill = TRUE,
    radius = 5,
    fillColor = ~clusterPalette(clusterID),  # marker fill color by cluster
    fillOpacity = 1,
    weight = 0,
    opacity = 1,
    color = ~clusterPalette(clusterID),      # marker stroke color by cluster
    popup = ~as.character(popupContent),
    group = "Tweet Clusters") %>%

  # twitter spatial concentration
  addWebGLHeatmap(~longitude, ~latitude,  # spatial concentration overlay
    intensity = 0.2,                      # concentration
    size = "400",                         # search radius (400m = 5 min walk)
    units = "m",                          # meters
    opacity = 0.5,                        # graphic transparency
    alphaRange = 1,                       # intensity
    gradientTexture = NULL,
    group = "Tweet Concentrations") %>%   # group category for toggle

  # thiessen polygons - voronoi tessellation (boundary)
  addPolygons(data = voronoiSPDF,
    fill = FALSE,
    fillColor = NULL,
    fillOpacity = NULL,
    weight = 0.25,                          # polygon stroke weight (outline)
    color = "black",                        # polygon stroke color
    opacity = 0.5,                          # polygon stroke transparency
    group = "Tweet Thiessen Polygons") %>%  # group category for toggle

  # twitter popularity (favorites + retweets)
  addCircleMarkers(~longitude, ~latitude,
    fill = FALSE,
    radius = ~popularity,                 # marker size scaled by popularity
    fillColor = NULL,
    fillOpacity = NULL,
    weight = 1.00,
    opacity = 0.5,
    color = "black",
    popup = ~as.character(popupContent),
    group = "Tweet Popularity") %>%

  # site outline
  addRectangles(lng1 = 8.9832, lat1 = 44.4018,  # assigned coord 1
    lng2 = 8.9969, lat2 = 44.3918,              # assigned coord 2
    weight = 2.33,                              # stroke weight (outline)
    color = "purple",                           # stroke color
    opacity = 0.75,                             # stroke transparency
    fill = FALSE,
    fillColor = "transparent",
    fillOpacity = NULL) %>%

  ### layer control toggle ###
  addLayersControl(baseGroups = c("Basemap Model",  # default basemap style 0
      "Hybridized Imagery",
      "Satellite Imagery"),
    overlayGroups = c("Tweets",
      "Tweet Clusters",
      "Tweet Concentrations",
      "Tweet Thiessen Polygons",
      "Tweet Popularity"),
    options = layersControlOptions(collapsed = FALSE)) %>%

  # default view unselect layers
  hideGroup("Tweets") %>%
  hideGroup("Tweet Clusters") %>%
  hideGroup("Tweet Concentrations") %>%
  hideGroup("Tweet Thiessen Polygons") %>%
  hideGroup("Tweet Popularity") %>%

  # scale bar
  addScaleBar(position = "bottomleft")

# display map
map
#### TOPIC MODELING ####

## functions ##

# simplify tweet source (iPhone, Android, Web, etc.) rather than HTML tagged description
encodeSource <- function(x){
  if(x == "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>"){
    "iphone"
  } else if(x == "<a href=\"http://twitter.com/#!/download/ipad\" rel=\"nofollow\">Twitter for iPad</a>"){
    "ipad"
  } else if(x == "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"){
    "android"
  } else if(x == "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>"){
    "Web"
  } else if(x == "<a href=\"http://www.twitter.com\" rel=\"nofollow\">Twitter for Windows Phone</a>"){
    "windows phone"
  } else if(x == "<a href=\"http://dlvr.it\" rel=\"nofollow\">dlvr.it</a>"){
    "dlvr.it"
  } else if(x == "<a href=\"http://ifttt.com\" rel=\"nofollow\">IFTTT</a>"){
    "ifttt"
  } else if(x == "<a href=\"http://earthquaketrack.com\" rel=\"nofollow\">EarthquakeTrack.com</a>"){
    "earthquaketrack"
  } else if(x == "<a href=\"http://www.didyoufeel.it/\" rel=\"nofollow\">Did You Feel It</a>"){
    "did_you_feel_it"
  } else if(x == "<a href=\"http://www.mobeezio.com/apps/earthquake\" rel=\"nofollow\">Earthquake Mobile</a>"){
    "earthquake_mobile"
  } else if(x == "<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook</a>"){
    "facebook"
  } else {
    "others"
  }
}

## data ##

# load data
tweets.df <- tweetsCombined_geolocatedSubset  # load from above

# clean data I
tweets.df$text <- sapply(tweets.df$text, function(row) iconv(row, "latin1", "ASCII", sub = ""))  # removes emojis
tweets.df$text <- sapply(tweets.df$text, function(x) iconv(x, to = 'UTF-8'))  # allows handling of different grammatical languages
tweets.df$tweetSource = sapply(tweets.df$statusSource, function(sourceSystem) encodeSource(sourceSystem))  # simplifies text string of tweet source (iPhone, Android, Web, etc.)

# clean data II
tweetCorpus <- VCorpus(VectorSource(tweets.df$text))  # specify text column
tweetCorpus <- tm_map(tweetCorpus, removePunctuation)  # remove punctuation
tweetCorpus <- tm_map(tweetCorpus, tolower)  # transform all text to lower case
tweetCorpus <- tm_map(tweetCorpus, removeNumbers)  # remove numbers
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)  # remove URLs
tweetCorpus <- tm_map(tweetCorpus, removeURL)  # return to data frame
twitterStopWords <- c(stopwords("italian"))  # stop words
tweetCorpus <- tm_map(tweetCorpus, removeWords, twitterStopWords)  # remove stop words
tweetCorpus <- tm_map(tweetCorpus, PlainTextDocument)  # transform to plain text

## term frequency ##

# create term doc matrix
twitterTermDocMatrix <- TermDocumentMatrix(tweetCorpus,  # create data frame
  control = list(removePunctuation = TRUE,  # removes punctuation
    stopwords = TRUE,                       # removes stop words
    removeNumbers = TRUE,                   # removes numbers
    tolower = TRUE,                         # transforms all text to lower case
    minWordLength = 1))                     # controls for word length

# word frequency parameters
searchTermFreq <- 25  # order results by k number of occurrences

# word frequency analysis
which(apply(twitterTermDocMatrix, 1, sum) >= searchTermFreq)  # calculate term occurrences
(frequentTerms <- findFreqTerms(twitterTermDocMatrix, lowfreq = searchTermFreq))  # print the frequent terms from twitterTermDocMatrix
term.freq <- rowSums(as.matrix(twitterTermDocMatrix))  # calculate frequency of each term
subsetterm.freq <- subset(term.freq, term.freq >= searchTermFreq)  # select subset by specified number of occurrences
frequentTermsSubsetDF <- data.frame(term = names(subsetterm.freq), freq = subsetterm.freq)  # create data frame from subset of terms
frequentTermsDF <- data.frame(term = names(term.freq), freq = term.freq)  # create data frame with all terms
frequentTermsSubsetDF <- frequentTermsSubsetDF[with(frequentTermsSubsetDF, order(-frequentTermsSubsetDF$freq)), ]  # sort subset data frame by frequency
frequentTermsDF <- frequentTermsDF[with(frequentTermsDF, order(-frequentTermsDF$freq)), ]  # sort complete data frame by frequency
frequentTermsDF  # table results

# plot results - histogram
ggplot(frequentTermsSubsetDF, aes(x = reorder(term, freq), y = freq)) +
  geom_bar(stat = "identity") +
  xlab("Terms") +
  ylab("Frequency") +
  coord_flip()

# plot results - word cloud
wordcloud(words = frequentTermsDF$term,
          freq = frequentTermsDF$freq,
          colors = frequentTermsDF$freq,
          random.order = FALSE)

## device analysis ##

# subset to tweets with at least 1,000 retweets
tweetsSubset.df <- subset(tweets.df, tweets.df$retweetCount >= 1000)

# plot results - histogram
ggplot(tweetsSubset.df, aes(x = tweetSource, y = retweetCount / 1000)) +
  geom_bar(stat = "identity") +
  xlab("Device") +
  ylab("Quantity (Thousands)")

## topic modeling ##

# create document term matrix (not to be confused with the previous term document matrix)
twitterDTM <- DocumentTermMatrix(tweetCorpus, control = list(minWordLength = 1))

# clean data
rowTotals <- apply(twitterDTM, 1, sum)
twitterDTM <- twitterDTM[rowTotals > 0, ] # remove documents with no remaining terms

# latent dirichlet allocation algorithm
ldaTopics <- LDA(twitterDTM, k = 25) # k number of topics to find
ldaTerms <- terms(ldaTopics, 5) # n number of terms for every topic
ldaTermsPlot <- (ldaTerms <- apply(ldaTerms, MARGIN = 2, paste, collapse = ", ")) # concatenate terms and print results
ldaTermsPlot # table results
## association modeling ##

# term association helper: tabulate and plot terms correlated with a search word
plotAssociations <- function(searchTerm, corThreshold = 0.25) {
  (associations <- findAssocs(twitterTermDocMatrix, searchTerm, corThreshold)) # correlation threshold
  associations.freq <- rowSums(as.matrix(associations[[searchTerm]]))
  associationsDF <- data.frame(term = names(associations.freq), freq = associations.freq)
  associationsDF <- associationsDF[order(-associationsDF$freq), ]
  print(associationsDF) # table results
  # plot results - histogram
  print(ggplot(associationsDF, aes(x = reorder(term, freq), y = freq)) +
          geom_bar(stat = "identity") +
          xlab("Terms") +
          ylab("Associations") +
          coord_flip())
}

# term search association parameters: one search word per call
plotAssociations("hotel")
plotAssociations("home")
plotAssociations("work")
plotAssociations("relax")
plotAssociations("art")
plotAssociations("photography")
plotAssociations("city")