Fitting Real-Estate Data Using K-Means Filtering
Joshuah Touyz
December 20, 2013
Abstract

Typically in spatial data analysis a central limitation is fitting a model when there are many observations, since this requires inverting large and frequently intractable matrices. Common fixes either partition the data or analyse a single plot of interest. Partitioning poses a problem, since cutting through sections where high degrees of correlation exist affects the subsequent estimator, especially at the edges, where it induces higher variability. In the following paper a method is proposed to create a real-estate price map when the number of data points is large. Recognizing that real estate is frequently separated by neighbourhood or region, a K-means filtering method is implemented to partition the data set. The main result is that the partitions have smaller intra-class variability than random sampling. The theoretical foundations of the K-means algorithm are sound and are not discussed further within this paper; the methodology, however, presents a quick and dirty way to localize neighbourhoods without inducing additional variance, which most likely leads to increased pricing accuracy.
1 Introduction
Property prices exhibit a high degree of spatial dependence. The price of any individual house is significantly influenced by that of its neighbours, yet the dependence decreases as distance increases. While appraisers value property based on a prescribed set of rules, most valuation models are sufficiently reductive that they fit into either nearest-neighbour or localized linear regression methods, which frequently ignore spatial and correlation effects. Ultimately the appraiser's experience helps determine the value of the house of interest. This lack of exactitude among appraisers is central among the reasons why most buyers and sellers are encouraged to obtain at least three valuations prior to purchase or going on market, respectively. Spatial statistics offers a possible solution to pricing methodology, as it accounts for correlation amongst houses and provides confidence intervals for estimates.

Fitting a spatial model over an entire region is, however, impractical for two reasons. The first is that correlation decays as the distance between properties increases; it eventually becomes negligible for sufficiently large distances. Accordingly, in a city with distinct neighbourhoods that are far enough apart, neighbourhoods at opposite ends of the city are unlikely to affect one another's prices. This allows for increased model flexibility when pricing houses in a neighbourhood; the feature space and coefficients can be better tuned at the local level. It follows that modelling local neighbourhoods rather than a globally linked topography decreases overall model complexity. The second impracticality derives from inverting large matrices in a reasonable amount of time. In the best case, run-time is polynomially bounded in $\Omega(n^2 \log(n))$ [2]; so, for example, scaling the number of points by a factor of 10 leads to a minimal run-time increase on the order of $10^2 \log(10) \approx 230$. Accordingly, fitting an entire spatial model to, say, 10,000 points is not only impractical but effectively impossible given the current computational budget of home computers.

To address the issues above, one asks whether the inherent structure of neighbourhoods within a city can be exploited. The answer is in the positive, and two empirical observations obtain: 1. similarly priced houses (on a normalized basis) cluster, and 2. suburban creep occurs along a gradient away from the city centre as technology improves, which results in the development of new neighbourhoods. To this end it is possible, in the absence of aerial data, to designate specific areas as neighbourhoods through an unsupervised learning task. This occurs solely using locational information rather than price points or other quantitative or qualitative measures. Subsequent to finding neighbourhoods, relevant feature information is incorporated to construct a pricing model. Arguably, simultaneously including all information to find clusters may be more accurate, but additional complicating assumptions would be needed. Accordingly, a simple approach is taken.
2 Methodology
Below is a brief summary of K-means along with considerations for feature selection.
2.1 Neighbourhood Selection Through Iterative Clustering
K-means is a conceptually simple deterministic algorithm for cluster analysis. Its goal, for a given number of clusters K and points $\{X_i\}_{i=1}^n \in \mathbb{R}^p$, is to minimize intra-cluster variation. In Euclidean space this is tantamount to grouping points around a given centre ($c_k$, $k = 1, \ldots, K$) that minimizes overall $L_2$ distance. Mathematically, for K clusters and points $\{X_i\}_{i=1}^n \in \mathbb{R}^p$ we seek a clustering $C(i)$ around centres $c_k$ which minimizes the within-cluster scatter:

$$ \sum_{k=1}^{K} \sum_{i:\, C(i)=k} \| X_i - c_k \|_2^2. \qquad (1) $$
For a fixed cluster partition $C(i)$, $c_k = \bar{X}_k$ minimizes the objective function (1) in terms of the available data, and consequently reduces the complexity of the problem to finding an appropriate partitioning $C(i)$ of the data. Once a set of cluster centres has been identified along with their corresponding points, a Voronoi tessellation of $\mathbb{R}^p$ is defined:

$$ V_k = \{ x \in \mathbb{R}^p : \| x - c_k \|_2^2 \le \| x - c_j \|_2^2,\ j = 1, \ldots, K \}, \qquad k = 1, \ldots, K. \qquad (2) $$
In the context of real-estate data, equation (2) can be viewed as a series of neighbourhoods. The process may be iterated in larger neighbourhoods where smaller boroughs are likely to exist. Alternate cluster analysis methods exist, such as k-medoids and the probabilistic EM algorithm; neither is considered here, and both may be tenable areas of future research.
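As a concrete illustration of this step, the short R sketch below clusters sale locations into K neighbourhoods with the base kmeans function and assigns an arbitrary new location to its nearest centre, as in equation (2). It is only a sketch: the data frame houses and its coordinate columns long and lat are assumed names, not objects defined in the paper.

set.seed(1)
K  <- 10
xy <- houses[, c("long", "lat")]

# K-means on location only; nstart restarts guard against poor initializations.
km <- kmeans(xy, centers = K, nstart = 25)

# The cluster labels induce the Voronoi-style neighbourhoods of equation (2):
# every point is attached to its nearest centre in squared Euclidean distance.
houses$neighbourhood <- km$cluster

# Assigning a new location to a neighbourhood is a nearest-centre lookup.
nearest_centre <- function(p, centres) {
  which.min(colSums((t(centres) - p)^2))
}
nearest_centre(c(500000, 210000), km$centers)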
2.2 Feature Selection
When pricing a house, a gamut of factors is integrated into the final valuation. Frequently these factors are correlated and are consequently less informative on an individual basis than when viewed in composite. Two prime examples, illustrated in the sketch below, are:

1. Total living area (TLA) and number of bathrooms. As TLA increases, the number of bathrooms is likely to increase.

2. Age of house and price per square foot. Older houses tended to be smaller due to budgetary constraints and construction restrictions; consequently, age and price per square foot tend to be highly correlated.

Accordingly, selecting a small set of features from a group that is highly correlated leads to a more parsimonious and informative model than considering all variables simultaneously.
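Both examples can be checked directly once the Lucas County data of section 3 are loaded; the two-line check below assumes the data sit in a data frame called houses with the column names listed there.

# Size-related features move together.
cor(houses$TLA, houses$baths)

# Year built is associated with price per square foot.
cor(houses$yrbuilt, houses$price / houses$TLA)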
3 Data
Data from 25,357 single-family homes sold in Lucas County, Ohio, between 1993 and 1998, compiled by the county auditor, are used in the following analysis. The data were obtained from the sp package in R. The feature set includes:

1. Coordinates
2. Price
3. Year built
4. Stories
5. TLA
6. Wall
7. Beds
8. Baths
9. Halfbaths
10. Frontage
11. Depth
12. Garage
13. Garage square feet
14. Rooms
15. Lot size
16. Appraised value
17. Sdate
18. Date appraised
19. Year sold
20. Age
The features Appraised Value, Sdate, Date Appraised, Year Sold and Age were removed from the data set since they are not important for this analysis. In addition, only the years 1997 and 1998 were considered because additional years would have required normalizing prices to reflect changes in inflation and interest rates.
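A sketch of the data-preparation step is given below. It assumes the Lucas County sales are available as the house object in the spData package (the paper obtained them through sp), and the column names avalue, sdate, syear, s1997 and s1998 are assumptions about how the stored data set encodes the removed features and sale years.

library(spData)                   # Lucas County, OH single-family home sales
data(house)

houses <- as.data.frame(house)    # coordinates become ordinary columns (long, lat)

# Keep only sales from 1997 and 1998, using the sale-year dummy variables.
houses <- houses[houses$s1997 == 1 | houses$s1998 == 1, ]

# Drop appraised value, sale dates, year sold and age, as described above.
drop_cols <- c("avalue", "sdate", "syear", "age")
houses    <- houses[, setdiff(names(houses), drop_cols)]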
4 Implementation
All analysis is conducted using the statistical software R. The data are read into R and filtered to the years 1997 and 1998. The pricing model is implemented in three stages. In the first stage, neighbourhoods are fit based on longitude and latitude using the K-means algorithm; the resulting partitioning can be shown to decrease intra-cluster variability. In the second stage, features are selected. In the third stage, models are fit using regression, kriging and nearest-neighbour methods.
4.1 Identifying Neighbourhoods
[Figure 1: Finding an Optimal K. (a) Within sum of squared error / total sum of squared errors for different Ks. (b) Within-group sum of squares for different Ks.]
To find the optimal number of clusters, Everitt and Hothorn's [1] elbow method is used. It seeks the K beyond which the marginal return is insufficient to justify increasing model complexity, i.e. the smallest K that explains most of the variance in the model. The elbow criterion, however, cannot always be defined unambiguously; accordingly, K-medoids and affinity propagation methods are used to verify that the selected number of clusters is reasonable. Since different initializations lead to different clusterings, the K-means algorithm is run 100 times and the ratio of between-cluster sum of squares to total sum of squares is plotted in figure 1a. The within-group sum of squares for a single run is also plotted in figure 1b. At K = 10 roughly 90% of the variance is explained by the clustering. The elbow method is in a sense greedy, since it does not consider subsequent borough clustering within the neighbourhoods; finding a general algorithm that can detect embedded clusters is, however, NP-hard. K-medoids and affinity propagation suggest 5 and 15 clusters, respectively; K = 10 is a compromise between the two (results not included). Figures 5a-5d in appendix section 7.1.2 illustrate possible Voronoi tessellations for K = 5, 7, 8 and 10 clusters. Table 1 summarizes the number of data points allocated to each cluster after partitioning. The denser clusters are of particular interest since they indicate additional structure may exist within the neighbourhoods.
Cluster                  1     2    3    4   5    6     7     8     9    10
Number of Data Points  252  1222  726  446  50  556  1721  1908  1311  1218

Table 1: Number of Data Points in Clusters
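The elbow diagnostic behind figure 1 can be reproduced in a few lines of R; as in the earlier sketch, houses, long and lat are assumed names.

xy <- houses[, c("long", "lat")]

ks    <- 2:15
ratio <- sapply(ks, function(k) {
  km <- kmeans(xy, centers = k, nstart = 20)
  km$betweenss / km$totss          # share of locational variance explained
})

plot(ks, ratio, type = "b",
     xlab = "Number of Clusters", ylab = "Between SS / Total SS")
abline(v = 10, lty = 2)            # roughly 90% explained around K = 10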
4.2 Selecting a Neighbourhood and Borough
[Figure 2: Finding an Optimal K. (a) Within sum of squared error / total sum of squared errors for different Ks. (b) Voronoi plot for k = 4 (long vs. lat).]
Cluster 9 is selected. Through an iterative process it is further partitioned into 4 boroughs using the K-means algorithm; figures 2a and 2b summarize the results. A similar iterative process can be used to identify boroughs in other clusters. The number of points in each borough is given in table 2. We'll restrict our focus to borough 3.
Borough            1    2    3    4
Number of Points  198  280  498  246

Table 2: Number of Data Points in Boroughs for Cluster 9
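The borough step is simply the same clustering applied within the selected neighbourhood. A minimal sketch, reusing the assumed labels from the earlier K-means pass:

cl9 <- houses[houses$neighbourhood == 9, ]   # the selected cluster

set.seed(1)
boroughs    <- kmeans(cl9[, c("long", "lat")], centers = 4, nstart = 25)
cl9$borough <- boroughs$cluster

table(cl9$borough)                           # borough counts as in Table 2
borough3 <- cl9[cl9$borough == 3, ]          # the subset analysed from here on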
4.3 Feature Selection
Variables are filtered based on three methods: principal component analysis, random decision trees and step-wise selection. The first two are presented here, whereas the third (step-wise selection) is implemented in section 4.4.2 when fitting a linear model. Feature selection occurs over the entire data set, under the assumption that the selected variables are among the main factors responsible for house prices regardless of neighbourhood.
Feature selection occurs in two stages. In the first stage the features are filtered using principal component analysis (PCA, from the FactoMineR package in R), which scores the components responsible for variance in the data; subsets of these components are likely correlated with one another (at least from a real-estate point of view). The candidate variables include yrbuilt, TLA, beds, baths, halfbaths, frontage, depth, garage, garagesqft, rooms and lotsize. In the second stage random forests are generated to corroborate the selected candidates. The variables with the highest eigenvalues are TLA, baths, yrbuilt, rooms, beds and halfbaths. Note, however, that TLA is highly correlated with rooms, beds and halfbaths: their correlations are all above 0.5 (0.74, 0.64 and 0.52, respectively), so some of these variables are likely redundant. On the other hand, the remaining variables all have correlation coefficients below 0.5 with the variable yrbuilt (0.23, 0.45 and 0.45), and the correlation between yrbuilt and TLA is 0.41. In comparison, the random forests indicate that the two largest contributors to the % increase in mean squared error are TLA and yrbuilt, and when combined with longitude and latitude this variable set contributes over 80% of the MSE. Accordingly, TLA and yrbuilt are the two features selected for the fitting process.
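A sketch of the two-stage screen is given below. PCA comes from FactoMineR, as stated above; the random-forest step uses the randomForest package, which is an assumption about the toolchain, and feats and houses are assumed object names.

library(FactoMineR)
library(randomForest)

feats <- houses[, c("yrbuilt", "TLA", "beds", "baths", "halfbaths",
                    "frontage", "depth", "garage", "garagesqft",
                    "rooms", "lotsize", "long", "lat")]

# Stage 1: principal components on standardised features.
pca <- PCA(feats, scale.unit = TRUE, graph = FALSE)
pca$eig                                       # eigenvalues / variance explained

# Pairwise correlations flag the redundancy among the size-related variables.
round(cor(feats[, c("TLA", "rooms", "beds", "halfbaths", "yrbuilt")]), 2)

# Stage 2: permutation importance from a 100-tree random forest.
set.seed(1)
rf <- randomForest(x = feats, y = houses$price, ntree = 100, importance = TRUE)
sort(importance(rf)[, "%IncMSE"], decreasing = TRUE)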
4.4 Model Fitting
Different models are fitted to the spatial topography. The four considered here are:

1. Regression-based predictors
2. Ordinary kriging
3. Universal kriging
4. Nearest neighbour
4.4.1 Data
Relevant figures are included in the appendix under section 7.1.3. The price distribution of houses in cluster 8, borough 3 is unimodal and right-skewed. There is a parabolic price trend in the East-West direction. The directional variograms further reveal spatial correlation in the 90 and 135 degree directions, whereas the variogram map indicates two areas with high spatial correlation. When a linear trend is fitted, the variograms are bounded, which implies stationarity of the residuals.
4.4.2 Linear Model
Several linear models were generated through step-wise regression and compared based on their AIC. The selected model's parametric form is given by:

$$ f(w, x, y, z) = \beta_0 + \beta_1 (x - \bar{x})^2 + \beta_2 (z - \bar{z})^2 + \beta_3 (w - \bar{w}) \qquad (3) $$
where w is the year built and z is the TLA. The residuals of the fitted linear model (3) are approximately normal. Summary figures are included in appendix section 7.1.4.
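A compact sketch of this step is shown below: centred candidate terms are offered to step, which compares submodels by AIC. The object borough3 and the candidate set are assumptions carried over from the earlier sketches.

b3 <- borough3
b3$x2 <- (b3$long    - mean(b3$long))^2     # east-west quadratic term
b3$y2 <- (b3$lat     - mean(b3$lat))^2      # north-south quadratic term
b3$z2 <- (b3$TLA     - mean(b3$TLA))^2      # centred squared living area
b3$w  <-  b3$yrbuilt - mean(b3$yrbuilt)     # centred year built

# AIC-based stepwise selection over the candidate terms.
cand   <- lm(price ~ x2 + y2 + z2 + w, data = b3)
model0 <- step(cand, direction = "both", trace = 0)
summary(model0)                             # compare retained terms with equation (3)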
4.5 Kriging Models
Subsequent to fitting a linear model, ordinary kriging (OK) and universal kriging (UK) models were fit. The spatial trend used in the UK model was derived from a reduced functional form of equation (3):

$$ f(x, y) = \beta_0 + \beta_1 (x - \bar{x})^2. \qquad (4) $$
Removing yrbuilt and TLA from the trend model reduces the accuracy of the mean effect but simplifies the model fitting procedure. It is possible to incorporate yrbuilt and TLA by considering them as deterministic components of the spatial model; figures 7 and 8 summarize the changes to the residuals and the semivariogram when using equation (4) rather than equation (3). Despite dropping TLA and yrbuilt, the residuals along the longitudinal and latitudinal axes do not exhibit additional trends. In addition, most of the semivariances fall within the outlined semivariogram envelope. A maximum likelihood estimate is subsequently fit to the semivariogram (see figure 9 in section 7.1.5) and used to fit universal kriging models with exponential, Gaussian, spherical, circular and cubic covariance functions. The results for the ordinary and the exponential universal kriging models are compared in figure 3. The inclusion of a spatially trended term smooths the predicted surface compared with the ordinary kriging surface, which is more hilly.
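The kriging fits can be sketched with the gstat package; the paper does not name its geostatistics toolchain, so the package choice, the starting variogram values and the object names below are all assumptions.

library(sp)
library(gstat)

b3sp <- borough3
b3sp$x2 <- (b3sp$long - mean(b3sp$long))^2     # trend term of equation (4)
coordinates(b3sp) <- ~ long + lat              # promote to SpatialPointsDataFrame

# Regular prediction grid over the borough, carrying the same trend covariate.
grid <- spsample(b3sp, n = 2000, type = "regular")
grid <- SpatialPointsDataFrame(grid,
          data.frame(x2 = (coordinates(grid)[, 1] -
                           mean(coordinates(b3sp)[, 1]))^2))

# Empirical semivariogram of the detrended prices and an exponential model
# (rough starting values for partial sill, range and nugget).
v_emp <- variogram(price ~ x2, b3sp)
m_exp <- fit.variogram(v_emp, vgm(psill = 3e8, model = "Exp",
                                  range = 1000, nugget = 1e8))

ok <- krige(price ~ 1,  b3sp, newdata = grid, model = m_exp)   # ordinary kriging
uk <- krige(price ~ x2, b3sp, newdata = grid, model = m_exp)   # universal kriging
spplot(uk["var1.pred"])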
[Figure 3: Universal vs. Ordinary Kriging. Panels: UK Prediction and OK Prediction surfaces plotted over X Coord and Y Coord.]
4.6 Nearest Neighbour Regression
Nearest neighbour regression is a non-parametric method that predicts values from the k closest members. In its unweighted form all k members are treated equally regardless of distance. The predicted value for a point located at (x, y) is then given by

$$ \hat{z} = \frac{1}{k} \sum_{i \in I} z_i, $$

where I is the index set of the k sale locations $(x_i, y_i)$ closest to $(x, y)$ in Euclidean distance. A variant is to use the Mahalanobis distance as a measure of nearness. Typically the optimal number of neighbours is found through an iterative search given some test set.
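A hand-rolled sketch of the unweighted predictor and of the iterative search for k (via 10-fold cross-validation, as in figure 4) is shown below; FNN::knn.reg would be a drop-in replacement for the predictor, and the object names follow the earlier sketches.

knn_predict <- function(train_xy, train_z, test_xy, k) {
  apply(test_xy, 1, function(p) {
    d <- colSums((t(train_xy) - p)^2)     # squared Euclidean distances
    mean(train_z[order(d)[1:k]])          # unweighted mean of the k nearest
  })
}

set.seed(1)
xy    <- as.matrix(borough3[, c("long", "lat")])
z     <- borough3$price
folds <- sample(rep(1:10, length.out = nrow(xy)))

cv_mse <- sapply(1:50, function(k) {
  mean(sapply(1:10, function(f) {
    pred <- knn_predict(xy[folds != f, ], z[folds != f],
                        xy[folds == f, , drop = FALSE], k)
    mean((z[folds == f] - pred)^2)
  }))
})
which.min(cv_mse)                         # the paper reports a minimum near k = 11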
[Figure 4: Optimal Number of Neighbours Using 10-Fold Cross Validation (mean squared error vs. number of neighbours).]
5 Results and Analysis
Ten-fold cross-validation is conducted on each of the models; roughly 49 points are removed from the plot at each fold for testing and the predictive mean squared error (PMSE) is calculated. Ordinary kriging is the standard against which the models are compared. Looking at figure 4, nearest neighbour regression attains its minimum predictive mean squared error at k = 11. Including the longitudinal trend in an exponential universal kriging model reduces the average PMSE of 10-fold CV by 17.37% relative to ordinary kriging. The other universal kriging methods display marginal increases (rather than decreases) relative to ordinary kriging. In contrast, nearest neighbour searches and linear regression models produce much poorer results, indicating that spatial correlation is important to consider when fitting such models. The results are summarized in table 3.
Model                              Average PMSE of 10-Fold CV    % Decrease Relative to OK
Ordinary Kriging                   27736540                      -
Universal Kriging - Exponential    22917646                      17.37
Universal Kriging - Gaussian       58561662                      -111.13
Universal Kriging - Spherical      34897992                      -25.82
Universal Kriging - Circular       33119719                      -19.408
Universal Kriging - Cubic          38309144                      -38.11
Linear Regression                  217900594                     -685.60
Nearest Neighbour k = 11           207500901                     -648.1139

Table 3: Comparing Models
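For the kriging models, the cross-validated PMSE can be computed with gstat's built-in cross-validation routine; as before, the package and the object names are assumptions rather than the paper's actual code.

set.seed(1)
cv_ok <- krige.cv(price ~ 1,  b3sp, model = m_exp, nfold = 10)
cv_uk <- krige.cv(price ~ x2, b3sp, model = m_exp, nfold = 10)

pmse <- function(cv) mean(cv$residual^2)
pmse(cv_ok)
pmse(cv_uk)

# Percentage decrease relative to ordinary kriging, as tabulated above.
100 * (pmse(cv_ok) - pmse(cv_uk)) / pmse(cv_ok)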
6 Conclusion
This project covered the methodological steps for partitioning a large spatially correlated data set into manageable plots without inducing increases in intra-cluster variability. The iterative process led to a model that fit the data well and gives credence to using spatial structure when pricing real estate rather than relying on linear regression and nearest neighbour methods. While the theoretical underpinnings behind clustering can be explored further, especially from a spatial statistics point of view, time limitations precluded fitting an entire topographic map. The next step in such a project would be to compare the average decrease in error between random and systematic sampling of the entire topography. A number of points were made throughout this project with respect to future research. Some ideas include: using alternate clustering and cluster selection algorithms, implementing Mahalanobis rather than Euclidean distance in the nearest neighbour search, and including the linearly trended term as a deterministic part of the spatial model.
References

[1] Brian S. Everitt and Torsten Hothorn. A Handbook of Statistical Analyses Using R. Chapman & Hall/CRC, Boca Raton, Florida, 2nd edition, 2009.

[2] Ran Raz. On the Complexity of Matrix Product. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, ACM Press, 2002.
7 Appendix

7.1 Implementation

7.1.1 Feature Selection
Variable     %IncMSE      IncNodePurity
yrbuilt      25.900402    4.865431e+12
stories      10.904804    3.622430e+11
TLA          19.695670    1.024634e+13
wall          6.695135    6.985975e+11
beds          5.013221    4.304930e+11
baths        10.366785    4.265309e+12
halfbaths     7.587152    9.277709e+11
frontage     10.434635    1.873431e+12
depth         4.377202    3.788216e+11
garage       10.059274    2.638164e+12
garagesqft    9.623783    1.319486e+12
rooms         5.731910    1.053844e+12
lotsize      11.377334    3.749352e+12
s1997         1.715203    7.123309e+10
s1998         3.619741    6.540196e+10
long         17.413305    1.942364e+12
lat          20.406973    1.107115e+12

Table 4: Random Forest With 100 Trees
7.1.2 Selecting Neighbourhoods / Voronoi Tessellations For Different Neighbourhoods

7.1.3 Model Fitting / Data
A grid is constructed on which the points will be interpolated. Notice that near the edges of the variogram map there are some houses priced differently.
7.1.4 Model Fitting / Fitting a Linear Model

7.1.5 Model Fitting / Spatially Trended Model
[Figure 5: Fitted Neighbourhoods for Different Numbers of Clusters. (a) Voronoi with 5 centers. (b) Voronoi with 7 centers. (c) Voronoi with 8 centers. (d) Voronoi with 10 centers. Axes: vlong vs. vlat.]
[Figures for section 7.1.3: (a) Histogram of prices for cluster 8, borough 3. (b) East-West vs. prices. (c) North-South vs. prices. (d) Directional variogram (0, 45, 90, 135 degrees). Also shown: variogram map (dx, dy) and the interpolation grid with sample points.]
[Figure 6: Fitting a Linear Model to the Data. Panels: residuals vs. long, residuals vs. lat, histogram of residuals, normal Q-Q plot.]
[Figure 7: Fitting a Spatial Trend Model to the Data. Panels: residuals vs. long, residuals vs. lat, histogram of residuals, normal Q-Q plot.]
[Figure 8: Semivariance with Envelope (semivariance vs. distance).]
[Figure 9: Weighted Least Squares Estimate (Dotted Line) vs. Maximum Likelihood Estimate (Solid Line); semivariance vs. distance.]