TOWARDS AN EPIDEMIOLOGY OF GENTRIFICATION: Modeling Urban Change as a Probabilistic Process using k-Means Clustering and Markov Models
EMILY BINET ROYALL MASTERS THESIS | MIT DUSP | 2015
This thesis creates an algorithm that learns urban change in neighborhoods.
Urban change is defined by so many things.
Economies
Politics
Ideals
Communities
How can we incorporate all these pieces into our understanding of the process? How do we distinguish between symptoms and mechanisms to design nuanced policy?
Contents
Problem Space: Gentrification
1. Overview
2. The Gentrification Debate
3. Model Divide
Solution Space: Towards an Epidemiology of Gentrification
1. Framework
2. Machine Learning: Definition & Origins
Methods: Data & Descriptives
1. Selected Methods
2. Feature Selection
3. ACS: Advantages & Limitations
4. Inside the Clusters
Results
1. Summary
2. Visualizing States
3. Identifying & Predicting State Transitions
Discussion
1. Study Limitations
2. Future Research Opportunities
1. Problem Space: Gentrification
Problem Space_Overview
Symptom-treating policies:
1. Rent Control: insulate neighborhoods from change
2. Stabilization Vouchers: ameliorate displacement
3. Property Tax Freezes: insulate longtime residents
Problem Space_The Gentrification Debate
Perspectives
Nature: Natural Phenomenon (Ball, 2015; Veneradi, 2014)
Artifice: Political Construct & Capital Flows (Slater, 2015; Smith, 1979; Lees, 1994; Harvey, 1973)
Methods
Space (camp 1): Simulation & Quantitative Modeling (Ball, 2015; Veneradi, 2014)
Place (camp 2): Case-based Interviews & Displacement Studies (Fullilove, 2004)
Problem Space_The Model Divide
Ecology (Burgess, 1923; Hoyt, 1939)
Biology (Forrester, 1969; Birch, 1971)
Physics (Schelling, 1971; Batty, 2005; Torrens & Nara, 2006)
Problem Space_The Model Divide
Economics (Smith, 1979; Ley, 1986; Slater, 2006)
Political Theory (Harvey, 1973; Lees, 1994)
Psychology (Beauregard, 1986; Hamnett, 1991; Fullilove, 2004)
2. Solution Space: Towards an Epidemiology of Gentrification
“…if we understand the principles behind the behavior of cities, we can build on potential assets and strengths, instead of acting at crosspurposes to them.” -Jane Jacobs
Solution Space_Framework
We need to:
1. Bridge the philosophical gap
2. Take a data-driven approach that captures complexity
3. Find a mechanism to treat the disease, not the symptom
Solution Space_Machine Learning
• Subfield of computer science & signal processing
• Evolved from the study of pattern recognition and computational learning theory in artificial intelligence
• Explores the study and construction of algorithms that can learn from and make predictions on data
• Comprises a feature extractor, a classifier, and a criterion function
• Data can be labeled or unlabeled; the algorithm is correspondingly supervised or unsupervised
[Diagram: training data → ML algorithm → hypothesis; test data → hypothesis → performance]
3. Data & Methods
Methods & Data_ACS Preprocessing
• American Community Survey Data (5-year estimates, Block Group): 2009-10, 2010-11, 2011-12, 2012-13
• Four Boroughs: Bronx, Queens, Kings, New York
• Five Features:
1. Income (Brookings Institution, 2011; Smith, 1979)
2. Education level (Clay, 1979; Freeman, 2005; Ley, 1986)
3. Density and percentage family households (Hoover & Vernon, 1959; Birch, 1979; Marcuse, 1985; Kolko, 2010)
4. Structure age (Hidalgo, 2015; Birch, 1979)
5. Race (Beauregard, 1986; Hamnett, 1991; Slater, 2006; Schelling, 1971)
• 5,401 observations per year
$$\mathrm{Education} = \frac{c_i}{t}\,x_i + \frac{c_{i+1}}{t}\,x_{i+1} + \cdots$$
where $c_i$ is the count of households of education bracket $i$, $x_i$ is a weighted value for the education bracket $i$, ranging from 1 (less than high school) to 6 (PhD), and $t$ is the total number of households within the census block group for which the index is computed.
$$\mathrm{Income} = \frac{c_i}{t}\,x_i + \frac{c_{i+1}}{t}\,x_{i+1} + \cdots$$
where $c_i$ is the count of households of income bracket $i$, $x_i$ is the maximum income value of income bracket $i$, and $t$ is the total number of households within the census block group for which the index is computed.
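The two index definitions above share one form: a sum over brackets of each bracket's share of households times that bracket's value. A minimal sketch of that calculation follows. The function name `bracket_index`, the sample counts, and the income bracket maxima are hypothetical, and the sketch assumes the bracket counts sum to the block group's total households t.

```python
def bracket_index(counts, values):
    """Weighted index: sum over brackets of (c_i / t) * x_i."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    total = sum(counts)  # t: assumed equal to the sum of the bracket counts
    if total == 0:
        return 0.0  # assumption: an empty block group gets an index of 0
    return sum(c / total * x for c, x in zip(counts, values))


# Hypothetical block group: household counts per education bracket,
# weighted from 1 (less than high school) to 6 (PhD).
education_counts = [10, 20, 30, 20, 15, 5]
education_weights = [1, 2, 3, 4, 5, 6]
education_index = bracket_index(education_counts, education_weights)

# Hypothetical income brackets, using each bracket's maximum value as x_i.
income_counts = [25, 50, 25]
income_maxima = [10000, 50000, 100000]
income_index = bracket_index(income_counts, income_maxima)

print(round(education_index, 2), round(income_index, 2))
```

Because each bracket is weighted by its share of households, the education index stays within the range of the weights (1 to 6) and the income index stays within the range of the bracket maxima.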
Methods & Data_Selected Methods
“signal” = multidimensional data changing over time
“states” = categories of signal patterns
[Diagram: signal → machine learning → states S1, S2, S3]
Methods & Data_Selected Methods
K-means clustering: finding states (each state is summarized by a cluster centroid)
Markov Process: learning how census blocks transition between states (S1, S2, S3), each transition having a transition probability
[Diagram: predictions are compared with observed states]
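The Markov step described above can be illustrated by counting observed moves between states and normalizing each row. This is a sketch only, not the thesis's own code. The function name `transition_matrix` and the state paths are hypothetical, and a first-order process is assumed, meaning the next state depends only on the current one.

```python
def transition_matrix(sequences, states):
    """Estimate first-order transition probabilities from state sequences."""
    counts = {s: {t: 0 for t in states} for s in states}
    for seq in sequences:
        # Count each consecutive (current, next) pair along the path.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    matrix = {}
    for s in states:
        row_total = sum(counts[s].values())
        # Normalize counts so each observed row sums to 1.
        matrix[s] = {
            t: (counts[s][t] / row_total if row_total else 0.0) for t in states
        }
    return matrix


# Hypothetical state paths for five block groups over four years.
paths = [
    [1, 1, 1, 1],
    [1, 1, 2, 2],
    [2, 2, 2, 1],
    [3, 3, 3, 3],
    [3, 1, 1, 1],
]
P = transition_matrix(paths, states=[1, 2, 3])

for origin, row in P.items():
    print(origin, {dest: round(p, 2) for dest, p in row.items()})
```

Each row of `P` holds the estimated probabilities of moving from one state to every other state, so large diagonal entries indicate block groups that tend to stay in the same state.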
Methods & Data_Inside the clusters
• How do you choose k?
• Silhouette value: for each observation, compares its similarity to its own cluster with its similarity to the nearest other cluster (Rousseeuw, 1987)
• Values closer to 1 = well separated. Similarity is measured by Euclidean distance between observations.
Borough    k=2    k=3    k=4    k=5    k=6    k=10
Bronx      0.53   0.39   0.39   0.34   0.31   0.30
Kings      0.40   0.39   0.31   0.34   0.32   0.29
New York   0.59   0.49   0.52   0.45   0.36   0.34
Queens     0.34   0.30   0.32   0.30   0.31   0.29
Table 1. Silhouette values for all boroughs across selections of k.
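To make the silhouette comparison across values of k concrete, the sketch below runs a small k-means (Lloyd's algorithm) and computes the mean silhouette value for several choices of k. It is an illustration under stated assumptions rather than the procedure used to produce Table 1. The toy points, the function names `kmeans` and `mean_silhouette`, the fixed random seed, and the convention of scoring singleton clusters as 0 are all hypothetical choices.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real block-group features the variables sit on very different scales (income in dollars, shares between 0 and 1), so some form of rescaling before computing Euclidean distances would usually be worth considering.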
Methods & Data_Inside the clusters
Silhouette Plots: Bronx
Figure 1. Silhouette plots for Bronx County for k=2 (left) and k=3 (right). Left: average s(i) = 0.53; the classification appears to fit most observations into Cluster 2. Right: average s(i) = 0.39; a portion of the observations in Clusters 2 and 3 appear to be misclassified, though the average silhouette value is comparable to k=2, and better than the values for k=5 through 10.
Methods & Data_Inside the clusters
Descriptive Statistics: Bronx

k=3 (1)    %White   Households   %Family   Income       Education   YrStruct
Max        0.82     1817.00      1.00      115290.26    3.74        1967
Min        0.00     7.00         0.00      10000.00     0.00        1939
Mean       0.15     410.11       0.69      48594.36     2.25        1943
StDev      0.11     181.63       0.13      15818.30     0.36        6.21
Kurtosis   5.51     6.45         3.95      3.85         3.77        3.08
Skew       1.29     1.19         -0.42     0.83         0.08        1.14

k=3 (2)    %White   Households   %Family   Income       Education   YrStruct
Max        1.00     2430.00      1.00      200000.00    5.41        1996
Min        0.00     5.00         0.00      19896.99     1.74        1939
Mean       0.64     421.68       0.61      85235.33     3.04        1952
StDev      0.22     212.29       0.17      20709.45     0.58        9.58
Kurtosis   2.97     14.91        3.12      4.19         3.90        4.15
Skew       -0.61    2.10         -0.22     0.34         1.00        0.62

k=3 (3)    %White   Households   %Family   Income       Education   YrStruct
Max        0.82     11266.00     1.00      123647.65    3.51        2005
Min        0.00     4.00         0.08      11874.81     1.13        1946
Mean       0.18     533.00       0.66      40725.87     2.13        1969
StDev      0.13     513.75       0.15      15932.34     0.38        12.70
Kurtosis   4.34     177.93       3.47      4.08         2.78        3.04
Skew       1.09     10.12        -0.40     1.03         0.35        0.89
Methods & Data_Inside the clusters
• State 1 (K1): Middle Class
- 15% white
- mean income: $48,594
- 68% family households
- education index: 2.25
- structural age: 1943
• State 2 (K2): Affluent
- 64% white
- mean income: $85,235
- 61% family households
- education index: 3.04
- structural age: 1952
• State 3 (K3): Low Income
- 18% white
- mean income: $40,725
- 66% family households
- education index: 2.13
- structural age: 1969
Results_Visualizing Change
[Maps: Bronx County, New York, 2010, 2011, 2012, and 2013; census block groups shaded by State 1, State 2, and State 3; scale bar 0 to 4 miles]
Results_Visualizing Change
Hunts Point Ave, BRONX
Results_Identifying & Predicting State Transitions
• State Transitions: probability of transitioning from one state to another
• Reflects lack of variation in data
• Prediction Rates:
- 79% (Bronx)
- 63% (Kings)
- 79% (Queens)
- 83% (New York)
- overall prediction accuracy of 76%
                  to low income   to affluent   to mid income
from low income   0.92            0.06          0.02
from affluent     0.17            0.82          0.01
from mid income   0.05            0.02          0.92

Transition   2009-10   2010-11   2011-12   2012-13
1 to 1       0.84      0.94      0.94      0.93
1 to 2       0.13      0.05      0.04      0.05
1 to 3       0.03      0.01      0.02      0.02
2 to 1       0.30      0.16      0.12      0.14
2 to 2       0.69      0.83      0.86      0.85
2 to 3       0.01      0.01      0.02      0.02
3 to 1       0.08      0.03      0.04      0.07
3 to 2       0.05      0.02      0.01      0.02
3 to 3       0.86      0.94      0.95      0.91

Table 3. Transition probabilities for state transitions (represented here as States “1 to 1” etc.): (above) overall transition probabilities for all four years and (below) probabilities by year from 2009-2013 for Bronx County. Transition probabilities are highlighted according to relative size, where large probabilities are green and low probabilities are red. Transitions from 1-2 are second most likely after the diagonal.
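One plausible way to turn transition probabilities into predictions, and then into a prediction rate, is to predict the most probable next state for each block group and count how often the prediction matches what was observed. The sketch below uses the overall Bronx probabilities as read from Table 3. The row and column ordering is an assumption, as are the function names and the observed transitions, and the thesis may have computed its prediction rates differently.

```python
# Overall Bronx transition probabilities as read from Table 3.
# Assumption: rows are the origin state; columns follow the order in STATES.
STATES = ["low income", "affluent", "mid income"]
P = {
    "low income": [0.92, 0.06, 0.02],
    "affluent": [0.17, 0.82, 0.01],
    "mid income": [0.05, 0.02, 0.92],
}


def predict_next(state):
    """Predict the most probable next state for a given current state."""
    row = P[state]
    return STATES[row.index(max(row))]


def prediction_rate(pairs):
    """Share of (current, observed_next) pairs predicted correctly."""
    hits = sum(1 for current, observed in pairs if predict_next(current) == observed)
    return hits / len(pairs)


# Hypothetical observed transitions for five block groups.
observed = [
    ("low income", "low income"),
    ("low income", "affluent"),
    ("affluent", "affluent"),
    ("mid income", "mid income"),
    ("affluent", "low income"),
]
rate = prediction_rate(observed)
print(rate)
```

Because every row is dominated by its diagonal entry, this rule always predicts that a block group stays in its current state, which is one way to see why a high prediction rate can reflect a lack of variation in the data.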
Results_Identifying & Predicting State Transitions
[Bar charts: top state paths by count for Bronx (k=3), Queens (k=4), Kings (k=3), and New York (k=4); x-axis: State Sequence; y-axis: Count]
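Counting the most common state paths, as in the charts above, amounts to tallying identical sequences. A minimal sketch follows, assuming each block group's path is written as a string with one state digit per year. The paths listed here are hypothetical and are not the thesis's counts.

```python
from collections import Counter

# Hypothetical state paths, one string per block group.
paths = [
    "11111", "11111", "33333", "22222",
    "21111", "11111", "33333", "12222",
]

# Tally identical paths and keep the three most frequent.
top_paths = Counter(paths).most_common(3)

for sequence, count in top_paths:
    print(sequence, count)
```

Paths made of a single repeated digit are block groups that never changed state, so the paths of interest for studying change are the ones further down the ranking.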
Results_Visualizing Change
[Maps: New York County (Manhattan), New York, 2010, 2011, 2012, and 2013; census block groups shaded by State 1, State 2, State 3, and State 4; scale bar 0 to 4 miles]
110 W 123rd St, Harlem NEW YORK
Results_Visualizing Change
[Maps: Queens County, New York, 2010, 2011, 2012, and 2013; census block groups shaded by State 1, State 2, State 3, and State 4; scale bar 0 to 6 miles]
Results_Visualizing Change
11th St. Hunters Point, QUEENS
Results_What have we done?
• Organized multidimensional data into states that represent shared conditions across census block groups
• Observed how census block groups change their condition (states), and predicted unknown ones
• Checked to make sure our predictions were accurate
• Visualized change, and confirmed it in Street View
• Identified likelihood of state transitions
• Identified common sequences of change
Limitations of the study_ACS oversampling
• ACS oversampling method produces lack of variation in data
• Test that values vary statistically significantly between years (Compass for Using & Understanding American Community Survey Data: User Guide, 2008):

Years Compared   White%   Households   Family%   Income   Education   YrStruct
2009-10          0.240    1.319        0.330     0.317    0.277       0.240
2010-11          0.047    0.018        0.024     0.054    0.002       0.047
2011-12          0.047    0.018        0.024     0.054    0.002       0.047
2012-13          0.031    0.048        0.032     0.009    0.032       0.031
Limitations of the study_other limitations
• Indices further reduce variation
• Census block group administrative boundaries changed after 2009
• Observations of variation may be artifacts of the 2008 Recession
• k-means clustering forces data into “buckets”, but we prefer to observe streams of continuous variables
• Possible endogeneity: race variable
• Prediction accuracy of 76% can be improved
Discussion_future research
• Explore new feature extraction techniques: probabilistic feature extraction motivates latent class study
• Machine learning is a fast-growing field with evolving possibilities: deep learning, Bayesian Program Learning
• Tie to ethnographic studies and ground-truthing in gentrifying areas
• Explore opportunities for integration of local knowledge into algorithm design
• Explore design-tool opportunities (learn and project community perceptions!)