Fitting Real-Estate Data Using K-Means Filtering
Joshuah Touyz
December 20, 2013
Abstract

Typically in spatial data analysis a central limitation is fitting a model when there are many observations, since this requires inverting large and frequently intractable matrices. Common fixes either partition the data or analyse a single plot of interest. Partitioning poses a problem, since cutting through sections where high degrees of correlation exist affects the subsequent estimator, especially at the edges, where it induces higher variability. In the following paper a method is proposed to create a real-estate price map when the number of data points is large. Recognizing that real estate is frequently separated by neighbourhood or region, a K-means filtering method is implemented to partition the data set. The main result is that the partitions have smaller intra-class variability than random sampling. The theoretical foundations of the K-means algorithm are sound and are not discussed further within this paper; the methodology, however, presents a quick and dirty way to localize neighbourhoods without inducing additional variance, which most likely leads to increased pricing accuracy.
1 Introduction
Property prices exhibit a high degree of spatial dependence. The price of any individual house is significantly influenced by that of its neighbours, yet the dependence decreases as distance increases. While appraisers value property based on a prescribed set of rules, most valuation models are sufficiently reductive that they fit into either nearest-neighbour or localized linear regression methods, which frequently ignore spatial and correlation effects. Ultimately the appraiser's experience helps determine the value of the house of interest. This lack of exactitude among appraisers is central among the reasons why most buyers and sellers are encouraged to obtain at least three valuations prior to purchase or going on market, respectively. Spatial statistics offers a possible solution to pricing methodology, as it accounts for correlation amongst houses and provides confidence intervals for estimates.

Fitting a spatial model over an entire region is, however, impractical for two reasons. The first is that correlation decays as the distance between properties increases; it eventually becomes negligible for sufficiently large distances. Accordingly, in a city with distinct neighbourhoods that are far enough apart, neighbourhoods at opposite ends of the city are unlikely to affect one another's prices. This allows for increased model flexibility when pricing houses in a neighbourhood; the feature space and coefficients can be better tuned at the local level. It follows that modelling local neighbourhoods rather than a globally linked topography decreases overall model complexity. The second impracticality derives from inverting large matrices in a reasonable amount of time. In the best case, run-time is polynomially bounded in $\Omega(n^2 \log(n))$ [2]; so, for example, scaling the number of points by a factor of 10 leads to a minimal run-time increase on the order of $10^2 \log(10) \approx 230$. Accordingly, fitting an entire spatial model to, say, 10,000 points is not only impractical but effectively impossible given the current computational budget of home computers.

To address the issues above, one asks whether the inherent structure of neighbourhoods within a city can be exploited. The answer is in the positive, and two empirical observations obtain: 1. similarly priced houses (on a normalized basis) cluster, and 2. suburban creep occurs along a gradient away from the city centre as technology improves, which results in the development of new neighbourhoods. To this end it is possible, in the absence of aerial data, to designate specific areas as neighbourhoods through an unsupervised learning task. This occurs solely using locational information rather than price points or other quantitative or qualitative measures. Subsequent to finding neighbourhoods, relevant feature information is incorporated to construct a pricing model. Arguably, simultaneously including all information to find clusters may be more accurate, but additional complicating assumptions would be needed. Accordingly, a simple approach is taken.
2 Methodology
Below is a brief summary of K-means along with considerations for feature selection.
2.1 Neighbourhood Selection Through Iterative Clustering
K-means is a conceptually simple deterministic algorithm for cluster analysis. Its goal, for a given number of clusters K and points $\{X_i\}_{i=1}^n \in \mathbb{R}^p$, is to minimize intra-cluster variation. In Euclidean space this is tantamount to grouping points around a given centre ($c_k$, $k = 1, \ldots, K$) that minimizes overall $L_2$ distance. Mathematically, for K clusters and points $\{X_i\}_{i=1}^n \in \mathbb{R}^p$ we seek a clustering $C(i)$ around centres $c_k$ which minimizes the within-cluster scatter:

$$ \sum_{k=1}^{K} \sum_{i:\, C(i)=k} \| X_i - c_k \|_2^2. \qquad (1) $$
For a fixed cluster partition $C(i)$, $c_k = \bar{X}_k$ minimizes the objective function (1) in terms of the available data, and consequently reduces the complexity of the problem to finding an appropriate partitioning $C(i)$ of the data. Once a set of cluster centres has been identified along with their corresponding points, a Voronoi tessellation of $\mathbb{R}^p$ is defined:

$$ V_k = \{ x \in \mathbb{R}^p : \| x - c_k \|_2^2 \le \| x - c_j \|_2^2,\ j = 1, \ldots, K \}, \qquad k = 1, \ldots, K. \qquad (2) $$
In the context of real-estate data, equation (2) can be viewed as a series of neighbourhoods. The process may be iterated in larger neighbourhoods where smaller boroughs are likely to exist. Alternate cluster analysis methods exist, such as k-medoids and the probabilistic EM algorithm; neither is considered here, and both may be tenable areas of future research.
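As a concrete illustration of this step, the short R sketch below clusters sale locations into K neighbourhoods with the base kmeans function and assigns an arbitrary new location to its nearest centre, as in equation (2). It is only a sketch: the data frame houses and its coordinate columns long and lat are assumed names, not objects defined in the paper.

set.seed(1)
K  <- 10
xy <- houses[, c("long", "lat")]

# K-means on location only; nstart restarts guard against poor initializations.
km <- kmeans(xy, centers = K, nstart = 25)

# The cluster labels induce the Voronoi-style neighbourhoods of equation (2):
# every point is attached to its nearest centre in squared Euclidean distance.
houses$neighbourhood <- km$cluster

# Assigning a new location to a neighbourhood is a nearest-centre lookup.
nearest_centre <- function(p, centres) {
  which.min(colSums((t(centres) - p)^2))
}
nearest_centre(c(500000, 210000), km$centers)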
2.2 Feature Selection
When pricing a house, a gamut of factors is integrated into the final valuation. Frequently these factors are correlated and are consequently less informative on an individual basis than when viewed in composite. Two prime examples, illustrated in the sketch below, are:

1. Total living area (TLA) and number of bathrooms. As TLA increases, the number of bathrooms is likely to increase.

2. Age of house and price per square foot. Older houses tended to be smaller due to budgetary constraints and construction restrictions; consequently, age and price per square foot tend to be highly correlated.

Accordingly, selecting a small set of features from a group that is highly correlated leads to a more parsimonious and informative model than considering all variables simultaneously.
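Both examples can be checked directly once the Lucas County data of section 3 are loaded; the two-line check below assumes the data sit in a data frame called houses with the column names listed there.

# Size-related features move together.
cor(houses$TLA, houses$baths)

# Year built is associated with price per square foot.
cor(houses$yrbuilt, houses$price / houses$TLA)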
3 Data
Data from 25,357 single-family homes sold in Lucas County, Ohio, between 1993 and 1998, compiled by the county auditor, are used in the following analysis. The data were obtained from the sp package in R. The feature set includes:

1. Coordinates
2. Price
3. Year built
4. Stories
5. TLA
6. Wall
7. Beds
8. Baths
9. Halfbaths
10. Frontage
11. Depth
12. Garage
13. Garage square feet
14. Rooms
15. Lot size
16. Appraised value
17. Sdate
18. Date appraised
19. Year sold
20. Age
The features Appraised Value, Sdate, Date Appraised, Year Sold and Age were removed from the data set since they are not important for this analysis. In addition, only the years 1997 and 1998 were considered because additional years would have required normalizing prices to reflect changes in inflation and interest rates.
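A sketch of the data-preparation step is given below. It assumes the Lucas County sales are available as the house object in the spData package (the paper obtained them through sp), and the column names avalue, sdate, syear, s1997 and s1998 are assumptions about how the stored data set encodes the removed features and sale years.

library(spData)                   # Lucas County, OH single-family home sales
data(house)

houses <- as.data.frame(house)    # coordinates become ordinary columns (long, lat)

# Keep only sales from 1997 and 1998, using the sale-year dummy variables.
houses <- houses[houses$s1997 == 1 | houses$s1998 == 1, ]

# Drop appraised value, sale dates, year sold and age, as described above.
drop_cols <- c("avalue", "sdate", "syear", "age")
houses    <- houses[, setdiff(names(houses), drop_cols)]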
4 Implementation
All analysis is conducted using the statistical software R. The data are read into R and filtered to the years 1997 and 1998. The pricing model is implemented in three stages. In the first stage, neighbourhoods are fit based on longitude and latitude using the K-means algorithm; the resulting partitioning can be shown to decrease intra-cluster variability. In the second stage, features are selected. In the third stage, models are fit using regression, kriging and nearest-neighbour methods.
4.1 Identifying Neighbourhoods
[Figure 1: Finding an Optimal K. (a) Within sum of squared error / total sum of squared errors for different Ks. (b) Within-group sum of squares for different Ks.]
To find the optimal number of clusters, Everitt and Hothorn's [1] elbow method is used. It seeks the K beyond which the marginal return is insufficient to justify increasing model complexity, i.e. the smallest K that explains most of the variance in the model. The elbow criterion, however, cannot always be defined unambiguously; accordingly, K-medoids and affinity propagation methods are used to verify that the selected number of clusters is reasonable. Since different initializations lead to different clusterings, the K-means algorithm is run 100 times and the ratio of between-cluster sum of squares to total sum of squares is plotted in figure 1a. The within-group sum of squares for a single run is also plotted in figure 1b. At K = 10 roughly 90% of the variance is explained by the clustering. The elbow method is in a sense greedy, since it does not consider subsequent borough clustering within the neighbourhoods; finding a general algorithm that can detect embedded clusters is, however, NP-hard. K-medoids and affinity propagation suggest 5 and 15 clusters, respectively; K = 10 is a compromise between the two (results not included). Figures 5a-5d in appendix section 7.1.2 illustrate possible Voronoi tessellations for K = 5, 7, 8 and 10 clusters. Table 1 summarizes the number of data points allocated to each cluster after partitioning. The denser clusters are of particular interest since they indicate additional structure may exist within the neighbourhoods.
Cluster                  1     2    3    4   5    6     7     8     9    10
Number of Data Points  252  1222  726  446  50  556  1721  1908  1311  1218

Table 1: Number of Data Points in Clusters
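The elbow diagnostic behind figure 1 can be reproduced in a few lines of R; as in the earlier sketch, houses, long and lat are assumed names.

xy <- houses[, c("long", "lat")]

ks    <- 2:15
ratio <- sapply(ks, function(k) {
  km <- kmeans(xy, centers = k, nstart = 20)
  km$betweenss / km$totss          # share of locational variance explained
})

plot(ks, ratio, type = "b",
     xlab = "Number of Clusters", ylab = "Between SS / Total SS")
abline(v = 10, lty = 2)            # roughly 90% explained around K = 10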
4.2 Selecting a Neighbourhood and Borough
[Figure 2: Finding an Optimal K. (a) Within sum of squared error / total sum of squared errors for different Ks. (b) Voronoi plot for k = 4 (long vs. lat).]
Cluster 9 is selected. Through an iterative process it is further partitioned into 4 boroughs using the K-means algorithm; figures 2a and 2b summarize the results. A similar iterative process can be used to identify boroughs in other clusters. The number of points in each borough is given in table 2. We'll restrict our focus to borough 3.
Borough            1    2    3    4
Number of Points  198  280  498  246

Table 2: Number of Data Points in Boroughs for Cluster 9
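The borough step is simply the same clustering applied within the selected neighbourhood. A minimal sketch, reusing the assumed labels from the earlier K-means pass:

cl9 <- houses[houses$neighbourhood == 9, ]   # the selected cluster

set.seed(1)
boroughs    <- kmeans(cl9[, c("long", "lat")], centers = 4, nstart = 25)
cl9$borough <- boroughs$cluster

table(cl9$borough)                           # borough counts as in Table 2
borough3 <- cl9[cl9$borough == 3, ]          # the subset analysed from here on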
4.3 Feature Selection
Variables are filtered based on three methods: principal component analysis, random decision trees and step-wise selection. The first two are presented here, whereas the third (step-wise selection) is implemented in section 4.4.2 when fitting a linear model. Feature selection occurs over the entire data set, under the assumption that the selected variables are among the main factors responsible for house prices regardless of neighbourhood.
Feature selection occurs in two stages. In the first stage the features are filtered using principal component analysis (PCA, from the FactoMineR package in R), which scores the components responsible for variance in the data; subsets of these components are likely correlated with one another (at least from a real-estate point of view). The candidate variables include yrbuilt, TLA, beds, baths, halfbaths, frontage, depth, garage, garagesqft, rooms and lotsize. In the second stage random forests are generated to corroborate the selected candidates. The variables with the highest eigenvalues are TLA, baths, yrbuilt, rooms, beds and halfbaths. Note, however, that TLA is highly correlated with rooms, beds and halfbaths: their correlations are all above 0.5 (0.74, 0.64 and 0.52, respectively), so some of these variables are likely redundant. On the other hand, the remaining variables all have correlation coefficients below 0.5 with the variable yrbuilt (0.23, 0.45 and 0.45), and the correlation between yrbuilt and TLA is 0.41. In comparison, the random forests indicate that the two largest contributors to the % increase in mean squared error are TLA and yrbuilt, and when combined with longitude and latitude this variable set contributes over 80% of the MSE. Accordingly, TLA and yrbuilt are the two features selected for the fitting process.
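A sketch of the two-stage screen is given below. PCA comes from FactoMineR, as stated above; the random-forest step uses the randomForest package, which is an assumption about the toolchain, and feats and houses are assumed object names.

library(FactoMineR)
library(randomForest)

feats <- houses[, c("yrbuilt", "TLA", "beds", "baths", "halfbaths",
                    "frontage", "depth", "garage", "garagesqft",
                    "rooms", "lotsize", "long", "lat")]

# Stage 1: principal components on standardised features.
pca <- PCA(feats, scale.unit = TRUE, graph = FALSE)
pca$eig                                       # eigenvalues / variance explained

# Pairwise correlations flag the redundancy among the size-related variables.
round(cor(feats[, c("TLA", "rooms", "beds", "halfbaths", "yrbuilt")]), 2)

# Stage 2: permutation importance from a 100-tree random forest.
set.seed(1)
rf <- randomForest(x = feats, y = houses$price, ntree = 100, importance = TRUE)
sort(importance(rf)[, "%IncMSE"], decreasing = TRUE)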
4.4 Model Fitting
Different models are fitted to the spatial topography. The four considered here are:

1. Regression-based predictors
2. Ordinary kriging
3. Universal kriging
4. Nearest neighbour
4.4.1 Data
Relevant figures are included in the appendix under section 7.1.3. The price distribution of houses in cluster 8, borough 3 is unimodal and right-skewed. There is a parabolic price trend in the East-West direction. The directional variograms further reveal spatial correlation in the 90 and 135 degree directions, whereas the variogram map indicates two areas with high spatial correlation. When a linear trend is fitted, the variograms are bounded, which implies stationarity of the residuals.
4.4.2 Linear Model
Several linear models were generated through step-wise regression and compared based on their AIC. The selected model's parametric form is given by:

$$ f(w, x, y, z) = \beta_0 + \beta_1 (x - \bar{x})^2 + \beta_2 (z - \bar{z})^2 + \beta_3 (w - \bar{w}) \qquad (3) $$
where w is the year built and z is the TLA. The residuals of the fitted linear model (3) are approximately normal. Summary figures are included in appendix section 7.1.4.
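A compact sketch of this step is shown below: centred candidate terms are offered to step, which compares submodels by AIC. The object borough3 and the candidate set are assumptions carried over from the earlier sketches.

b3 <- borough3
b3$x2 <- (b3$long    - mean(b3$long))^2     # east-west quadratic term
b3$y2 <- (b3$lat     - mean(b3$lat))^2      # north-south quadratic term
b3$z2 <- (b3$TLA     - mean(b3$TLA))^2      # centred squared living area
b3$w  <-  b3$yrbuilt - mean(b3$yrbuilt)     # centred year built

# AIC-based stepwise selection over the candidate terms.
cand   <- lm(price ~ x2 + y2 + z2 + w, data = b3)
model0 <- step(cand, direction = "both", trace = 0)
summary(model0)                             # compare retained terms with equation (3)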
4.5 Kriging Models
Subsequent to fitting a linear model, ordinary kriging (OK) and universal kriging (UK) models were fit. The spatial trend used in the UK model was derived from a reduced functional form of equation (3):

$$ f(x, y) = \beta_0 + \beta_1 (x - \bar{x})^2. \qquad (4) $$
Removing yrbuilt and TLA from the trend model reduces the accuracy of the mean effect but simplifies the model fitting procedure. It is possible to incorporate yrbuilt and TLA by considering them as deterministic components of the spatial model; figures 7 and 8 summarize the changes to the residuals and the semivariogram when using equation (4) rather than equation (3). Despite dropping TLA and yrbuilt, the residuals along the longitudinal and latitudinal axes do not exhibit additional trends. In addition, most of the semivariances fall within the outlined semivariogram envelope. A maximum likelihood estimate is subsequently fit to the semivariogram (see figure 9 in section 7.1.5) and used to fit universal kriging models with exponential, Gaussian, spherical, circular and cubic covariance functions. The results for the ordinary and the exponential universal kriging models are compared in figure 3. The inclusion of a spatially trended term smooths the predicted surface compared with the ordinary kriging surface, which is more hilly.
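The kriging fits can be sketched with the gstat package; the paper does not name its geostatistics toolchain, so the package choice, the starting variogram values and the object names below are all assumptions.

library(sp)
library(gstat)

b3sp <- borough3
b3sp$x2 <- (b3sp$long - mean(b3sp$long))^2     # trend term of equation (4)
coordinates(b3sp) <- ~ long + lat              # promote to SpatialPointsDataFrame

# Regular prediction grid over the borough, carrying the same trend covariate.
grid <- spsample(b3sp, n = 2000, type = "regular")
grid <- SpatialPointsDataFrame(grid,
          data.frame(x2 = (coordinates(grid)[, 1] -
                           mean(coordinates(b3sp)[, 1]))^2))

# Empirical semivariogram of the detrended prices and an exponential model
# (rough starting values for partial sill, range and nugget).
v_emp <- variogram(price ~ x2, b3sp)
m_exp <- fit.variogram(v_emp, vgm(psill = 3e8, model = "Exp",
                                  range = 1000, nugget = 1e8))

ok <- krige(price ~ 1,  b3sp, newdata = grid, model = m_exp)   # ordinary kriging
uk <- krige(price ~ x2, b3sp, newdata = grid, model = m_exp)   # universal kriging
spplot(uk["var1.pred"])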
[Figure 3: Universal vs. Ordinary Kriging. Panels: UK Prediction and OK Prediction surfaces plotted over X Coord and Y Coord.]
4.6 Nearest Neighbour Regression
Nearest neighbour regression is a non-parametric method that predicts values from the k closest members. In its unweighted form all k members are treated equally regardless of distance. The predicted value for a point located at (x, y) is then given by

$$ \hat{z} = \frac{1}{k} \sum_{i \in I} z_i, $$

where I is the index set of the k sale locations $(x_i, y_i)$ closest to $(x, y)$ in Euclidean distance. A variant is to use the Mahalanobis distance as a measure of nearness. Typically the optimal number of neighbours is found through an iterative search given some test set.
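A hand-rolled sketch of the unweighted predictor and of the iterative search for k (via 10-fold cross-validation, as in figure 4) is shown below; FNN::knn.reg would be a drop-in replacement for the predictor, and the object names follow the earlier sketches.

knn_predict <- function(train_xy, train_z, test_xy, k) {
  apply(test_xy, 1, function(p) {
    d <- colSums((t(train_xy) - p)^2)     # squared Euclidean distances
    mean(train_z[order(d)[1:k]])          # unweighted mean of the k nearest
  })
}

set.seed(1)
xy    <- as.matrix(borough3[, c("long", "lat")])
z     <- borough3$price
folds <- sample(rep(1:10, length.out = nrow(xy)))

cv_mse <- sapply(1:50, function(k) {
  mean(sapply(1:10, function(f) {
    pred <- knn_predict(xy[folds != f, ], z[folds != f],
                        xy[folds == f, , drop = FALSE], k)
    mean((z[folds == f] - pred)^2)
  }))
})
which.min(cv_mse)                         # the paper reports a minimum near k = 11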
[Figure 4: Optimal Number of Neighbours Using 10-Fold Cross Validation (mean squared error vs. number of neighbours).]
5 Results and Analysis
Ten-fold cross-validation is conducted on each of the models; roughly 49 points are removed from the plot at each fold for testing and the predictive mean squared error (PMSE) is calculated. Ordinary kriging is the standard against which the models are compared. Looking at figure 4, nearest neighbour regression attains its minimum predictive mean squared error at k = 11. Including the longitudinal trend in an exponential universal kriging model reduces the average PMSE of 10-fold CV by 17.37% relative to ordinary kriging. The other universal kriging methods display marginal increases (rather than decreases) relative to ordinary kriging. In contrast, nearest neighbour searches and linear regression models produce much poorer results, indicating that spatial correlation is important to consider when fitting such models. The results are summarized in table 3.
Model                              Average PMSE of 10-Fold CV    % Decrease Relative to OK
Ordinary Kriging                   27736540                      -
Universal Kriging - Exponential    22917646                      17.37
Universal Kriging - Gaussian       58561662                      -111.13
Universal Kriging - Spherical      34897992                      -25.82
Universal Kriging - Circular       33119719                      -19.408
Universal Kriging - Cubic          38309144                      -38.11
Linear Regression                  217900594                     -685.60
Nearest Neighbour k = 11           207500901                     -648.1139

Table 3: Comparing Models
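For the kriging models, the cross-validated PMSE can be computed with gstat's built-in cross-validation routine; as before, the package and the object names are assumptions rather than the paper's actual code.

set.seed(1)
cv_ok <- krige.cv(price ~ 1,  b3sp, model = m_exp, nfold = 10)
cv_uk <- krige.cv(price ~ x2, b3sp, model = m_exp, nfold = 10)

pmse <- function(cv) mean(cv$residual^2)
pmse(cv_ok)
pmse(cv_uk)

# Percentage decrease relative to ordinary kriging, as tabulated above.
100 * (pmse(cv_ok) - pmse(cv_uk)) / pmse(cv_ok)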
6 Conclusion
This project covered the methodological steps for partitioning a large spatially correlated data set into manageable plots without inducing increases in intra-cluster variability. The iterative process led to a model that fit the data well and gives credence to using spatial structure when pricing real estate rather than relying on linear regression and nearest neighbour methods. While the theoretical underpinnings behind clustering can be explored further, especially from a spatial statistics point of view, time limitations precluded fitting an entire topographic map. The next step in such a project would be to compare the average decrease in error between random and systematic sampling of the entire topography. A number of points were made throughout this project with respect to future research. Some ideas include: using alternate clustering and cluster selection algorithms, implementing Mahalanobis rather than Euclidean distance in the nearest neighbour search, and including the linearly trended term as a deterministic part of the spatial model.
References

[1] Brian S. Everitt and Torsten Hothorn. A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC, Boca Raton, Florida, 2nd edition, 2009.

[2] Ran Raz. On the Complexity of Matrix Product. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, ACM Press, 2002.
7 Appendix

7.1 Implementation

7.1.1 Feature Selection
Variable     %IncMSE      IncNodePurity
yrbuilt      25.900402    4.865431e+12
stories      10.904804    3.622430e+11
TLA          19.695670    1.024634e+13
wall          6.695135    6.985975e+11
beds          5.013221    4.304930e+11
baths        10.366785    4.265309e+12
halfbaths     7.587152    9.277709e+11
frontage     10.434635    1.873431e+12
depth         4.377202    3.788216e+11
garage       10.059274    2.638164e+12
garagesqft    9.623783    1.319486e+12
rooms         5.731910    1.053844e+12
lotsize      11.377334    3.749352e+12
s1997         1.715203    7.123309e+10
s1998         3.619741    6.540196e+10
long         17.413305    1.942364e+12
lat          20.406973    1.107115e+12

Table 4: Random Forest With 100 Trees
7.1.2 Selecting Neighbourhoods / Voronoi Tessellations For Different Neighbourhoods

7.1.3 Model Fitting / Data
A grid is constructed on which the points will be interpolated. Notice that near the edges of the variogram map there are some houses priced differently.
7.1.4 Model Fitting / Fitting a Linear Model

7.1.5 Model Fitting / Spatially Trended Model
[Figure 5: Fitted Neighbourhoods for Different Numbers of Clusters. (a) Voronoi with 5 centers. (b) Voronoi with 7 centers. (c) Voronoi with 8 centers. (d) Voronoi with 10 centers. Axes: vlong vs. vlat.]
[Figures for section 7.1.3: (a) Histogram of prices for cluster 8, borough 3. (b) East-West vs. prices. (c) North-South vs. prices. (d) Directional variogram (0, 45, 90, 135 degrees). Also shown: variogram map (dx, dy) and the interpolation grid with sample points.]
[Figure 6: Fitting a Linear Model to the Data. Panels: residuals vs. long, residuals vs. lat, histogram of residuals, normal Q-Q plot.]
[Figure 7: Fitting a Spatial Trend Model to the Data. Panels: residuals vs. long, residuals vs. lat, histogram of residuals, normal Q-Q plot.]
[Figure 8: Semivariance with Envelope (semivariance vs. distance).]
[Figure 9: Weighted Least Squares Estimate (Dotted Line) vs. Maximum Likelihood Estimate (Solid Line); semivariance vs. distance.]