Urban Spatial Analysis by Yun Shi

YUN SHI | Portfolio

Urban Spatial Analysis Selected Works 2016

Computer Skills
· Statistics: RStudio, SPSS, NLogit, Excel
· Spatial Analysis: ArcGIS, GeoDa, Vis-Stamp, Ecotect
· Programming: VisualBasic.NET
· 2D Graphic: AutoCAD, Photoshop, Illustrator, InDesign
· 3D Modelling: SketchUp, Rhino
· 3D Rendering: VRay

YUN SHI
Address: 2930 Chestnut Street, Apartment 2012A, Philadelphia, Pennsylvania, USA
Phone: +1 215 316 4951
E-mail: yunshi@upenn.edu
Portfolio Link: http://issuu.com/portfolio_yunshi/docs/portfolio

Education
Sept. 2016 ~ June 2017 · Master in Urban Spatial Analytics, University of Pennsylvania, PennDesign
Sept. 2011 ~ June 2016 · B.Eng in Urban Planning, Tongji University, College of Architecture and Urban Planning, GPA: 4.71 / 5.0

Architectural Works

Awards & Scholarships
October 2015
· Second Class of Learning Scholarship, Tongji University
· Third Prize in Competition of Urban Transportation Innovative Practice, National Steering Committee of Urban and Rural Planning Education in China
· Third Prize in Urban and Rural Social Comprehensive Practice Research Report, National Steering Committee of Urban and Rural Planning Education in China
March 2015
· Excellent Students Awards, Tongji University
October 2014
· First Class of Learning Scholarship, Tongji University
· NITORI Scholarship, Tongji University
May 2014
· First Prize in National English Competition, College English Teaching and Research Association of China (CETRAC)
October 2013
· First Class of Learning Scholarship, Tongji University
· Intertek Scholarship, Tongji University
September 2013
· Second Prize in Summer Social Practice of the Youth Cup, College of Architecture and Urban Planning, Tongji University
October 2012
· First Class of Learning Scholarship, Tongji University

Experiences
2015 ~ Present · Mentor, Columbia Business School (USA) Enterprise Program, Shanghai
August 2015 · Research Assistant, Institute for Advanced Study in Tongji University
July 2015 · Research Assistant, Green Architecture and New Energy Research Center, Tongji University
November 2014 · Volunteer, Pujiang Innovation Forum
July 2012 · Volunteer, Shanghai International Youth Science and Technology Expo
2011 ~ 2012 · Deputy Manager, Student Welfare Protection Center of Student Union, Tongji University

1- Home Price Prediction: Hedonic Home Price Prediction, San Francisco, California

Introduction

Purpose of the Project
The purpose of the project is to build a better predictive model of home prices, which can be used in many analyses such as segregation, the housing market and the distribution of public resources. Zillow has long been trying to do so, but some of its predictions are not as accurate as expected. The reasons may vary from place to place, but in general the driving factors differ between places, and some important factors are neglected because they seem "unrelated" at first sight. Thus, in this project, we try to dig out more unexpected variables that help explain home values.

Modeling Strategy
Here, we try to build a model to predict home sale prices in San Francisco. The biggest handicap is that each neighborhood in the area has different characteristics, which might weaken the predictive power of the model. Thus, we first gather data that describe each neighborhood well (i.e., set a basic price line), and then look for "special" factors that influence specific areas of San Francisco.

1- Home Values Prediction, completed by Yun Shi, 11/10/2016

Process the Potential Predictors #1: Add

Add the Predictors
To develop the model, we gathered data from every possible source, ranging from the SF open data website to Yelp scores. When exploring the potential predictors, we applied some basic criteria:
1) "Useful": highly similar in distribution to house sale prices, such as per capita income.
2) "Necessary": things one must consider when buying a house, such as distance to severe crime.
3) "Unrelated": factors that influence the price beyond everyone's expectation, such as elevation.

[Figures: "Useful" predictor (per capita income); "Necessary" predictor (distance to severe crime); "Unrelated" predictor (elevation)]


Process the Potential Predictors #2: Filter

Filter the Predictors
Next, we plot the correlation matrix (left) and bar graphs between variables (bottom right) to see whether these potential predictors influence the sale price and to check that the independent variables are not correlated with each other. We then handpick the ones that are more influential on the sale price and independent of the others to enter into the model.


Regression Results

Overall Performance
In general, the model fits well, with an R-squared value of 0.74. The MAPE (mean absolute percentage error) is 24.7%, just under 25%, and the F-statistic indicates that the model as a whole is statistically significant. However, we still need to check the residuals to see whether any basic regression assumptions are violated.

Spatial Autocorrelation
To check whether the residuals are spatially autocorrelated (i.e., predictions in some areas are more accurate than in others), we calculated the Global Moran's I, which ranges from -1 (dispersion) to 1 (clustering). The index is 0.02559, which is very close to 0, so there is essentially no spatial autocorrelation in the residuals. The reported p-value of 0 indicates that this result is statistically significant.
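As an illustration of what the Global Moran's I measures, here is a minimal pure-Python sketch. It is not the tool used in the project; the function name `morans_i`, the toy residuals and the binary neighbour matrix are all assumptions made for the example.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Cross-products of deviations between every pair of neighbours.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)


# Hypothetical residuals at four locations along a line (not project data).
# Each location is a neighbour of the ones directly next to it.
toy_residuals = [1.0, 2.0, 3.0, 4.0]
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i(toy_residuals, toy_weights))
```

Similar values sitting next to each other push the index above 0, alternating values push it below 0, and a value near 0 (such as the 0.02559 reported here) means neighbouring residuals are almost unrelated.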

Residuals Map


Model Validation #1: Residual Plots

Residual
The residual is what is left when you subtract the predicted value from the observed value: Residual = Observed - Predicted. Residual plots show how the residuals relate to the observed and predicted values. Inspecting them is important because it lets us detect whether the observations are correlated in time or space (which violates a regression assumption), spot outliers that may be typos and, most of all, see whether the model needs to be improved.

In fact, we repeated this check every time we tested a new model. Along the way we removed some outliers that were not typos but clearly had uncommon sale prices, which improved our model.

Observed vs Residuals
We expect the scatterplot of observed values against residuals to show a clear pattern (i.e., a strong correlation), which would indicate that the model is accurate. In this graph, although some of the residuals are clustered in one place, overall there is an obvious linear relationship, which means that the model is relatively accurate but has room for improvement.

[Figure: Observed vs Residuals. x-axis: Observed; y-axis: Residuals]

Predicted vs Residuals
A good scatterplot of predicted values against residuals should be:
1) Symmetrically distributed, tending to cluster towards the middle of the plot.
2) Clustered around the lower single digits of the y-axis.
3) Free of clear patterns.
In this graph, the points are clustered but show no clear pattern, so there is no serious problem, although the model can be improved.

[Figure: Predicted vs Residuals. x-axis: Predicted; y-axis: Residuals]
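The residual definition used in this section (Residual = Observed - Predicted) can be sketched in a few lines of Python. The outlier rule below (flagging residuals more than two standard deviations from the mean residual) is an assumption for illustration, not the rule used in the project, and the prices are made up.

```python
from statistics import mean, pstdev


def compute_residuals(observed, predicted):
    """Residual = Observed - Predicted, element by element."""
    return [o - p for o, p in zip(observed, predicted)]


def flag_outliers(res, z_cutoff=2.0):
    """Indices whose residual lies more than z_cutoff standard deviations from the mean."""
    mu = mean(res)
    sd = pstdev(res)
    if sd == 0:
        return []
    return [i for i, r in enumerate(res) if abs(r - mu) / sd > z_cutoff]


# Hypothetical sale prices and predictions (not project data).
observed = [500000, 620000, 710000, 3000000, 450000, 800000]
predicted = [520000, 600000, 700000, 1200000, 470000, 790000]

res = compute_residuals(observed, predicted)
print(res)
print(flag_outliers(res))
```

A flagged record is only a candidate: as the text notes, it may be a typo or a genuine but uncommon sale, and deciding which still requires looking at the record itself.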


Model Validation #2: Cross Validation & MAPE by Neighborhood

Cross Validation

After obtaining a model with satisfactory predictors, we compare it with the spatial lag model. Under cross validation, the MAE and MAPE of our regression model are slightly lower than those of the spatial lag model, which means that our model performs better.
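To make the comparison procedure concrete, the following is a small k-fold cross-validation sketch that scores two models by MAE and MAPE. The two models here (a one-variable least-squares line and a mean-only baseline) are stand-ins chosen to keep the example self-contained; they are not the project's regression or the spatial lag model, and the data are invented.

```python
def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mape(obs, pred):
    """Mean absolute percentage error, as a fraction (0.25 means 25%)."""
    return sum(abs(o - p) / abs(o) for o, p in zip(obs, pred)) / len(obs)


def fit_mean(xs, ys):
    """Baseline stand-in: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """One-variable ordinary least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cross_validate(fit, xs, ys, k=3):
    """Hold out every k-th record in turn, fit on the rest, score the held-out predictions."""
    obs, pred = [], []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        obs += [ys[i] for i in test]
        pred += [model(xs[i]) for i in test]
    return mae(obs, pred), mape(obs, pred)


# Hypothetical predictor and price values (not project data).
xs = [1, 2, 3, 4, 5, 6]
ys = [110, 190, 310, 390, 510, 590]
print("line:", cross_validate(fit_line, xs, ys))
print("mean:", cross_validate(fit_mean, xs, ys))
```

The model with the lower out-of-sample MAE and MAPE is preferred, which is the same criterion the text applies.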

MAPE by Neighborhood
By mapping the MAPE value by neighborhood, we can observe whether the model has the same predictive power in different neighborhoods. In other words, if the MAPE values of all neighborhoods are similarly small and near 0, the model predicts every neighborhood well. In the figure, darker colours cover areas of higher MAPE, where the predictions are less accurate than elsewhere. The poorly predicted neighborhoods are more likely to be in the northern or southern part of San Francisco. The largest MAPE occurs in the Downtown/Civic Center neighborhood, where the value is 0.74. Based on our knowledge of San Francisco, these are the more gentrified neighborhoods. The model might therefore make more mistakes when assessing houses in "rich" neighborhoods.
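Computing MAPE per neighborhood amounts to grouping the absolute percentage errors by neighborhood and averaging each group. A minimal sketch, with hypothetical neighborhood labels and prices (the function name `mape_by_group` is also an assumption):

```python
from collections import defaultdict


def mape_by_group(records):
    """records: iterable of (group, observed, predicted). Returns {group: MAPE as a fraction}."""
    errors = defaultdict(list)
    for group, obs, pred in records:
        errors[group].append(abs(obs - pred) / abs(obs))
    return {g: sum(e) / len(e) for g, e in errors.items()}


# Hypothetical sales: (neighborhood, observed price, predicted price).
sales = [
    ("A", 1000000, 900000),
    ("A", 500000, 550000),
    ("B", 800000, 400000),
]

result = mape_by_group(sales)
print(result)
print("worst predicted:", max(result, key=result.get))
```

Joining the resulting dictionary to neighborhood polygons is what produces a choropleth like the one described above.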


Discussion & Conclusion

Discussion: Effectiveness and Efficiency
In general, we think our model is effective but not efficient. To achieve greater predictive power, we put all kinds of predictors into the model. When filtering the predictors, we kept some insignificant ones based on our knowledge of city planning, so the model contains many more predictors than expected. As a result, the model is more powerful than before but carries redundant independent variables.

In the process, we removed some of the outliers and created some variables by extracting values from generated heatmaps (inspired by the spatial lag model). As mentioned above, our residuals are therefore more evenly distributed, which is supported by the Global Moran's I.

However, the model can still be improved, especially when confronted with "rich" neighborhoods. The first model we built used only common data from the ACS and Decennial Census joined to census tracts in San Francisco. It predicted well in "poor" neighborhoods but had grave defects in generating high sale prices. We then addressed the problem by adding variables such as distance to fancy bars, restaurants and golf courses. Even so, the model still cannot predict prices in a "rich" neighborhood as well as it does in a "poor" one.

Conclusion
Overall, we are confident enough in our model to recommend it to Zillow, not only for use in future predictions but also as a pointer to variables that are not traditionally associated with the housing market yet proved powerful in predicting housing values. We have to admit that, because it uses only San Francisco data, the model might be weak if used in other places. In other words, the model needs some modification to become generalizable to housing markets other than San Francisco's.

MUSA 5

2- Transit Proximity: Willingness to Pay for Transit Access, Philadelphia, Pennsylvania

Influential Factors to Home Sales Price

As shown above, the farther a house is from aggravated assaults, the higher its price in general. Within a 200m range, the trend is even more obvious.

As a house gets farther from the central business district, the effect on price depends on the distance range. Generally, within 200m the price goes higher; beyond 200m it goes lower instead.

The relationship between these two is positive, and it is stronger within a radius of 200m of the home sold.

As clearly illustrated, people are willing to pay more for walkability.

2- Transit Proximity, completed by Yun Shi, 10/27/2016

Regression Results

Choice of Final Model

*A variable with the prefix 'LN' has been log-transformed.

Step One
First, I check the correlations between the predictors and the variable of interest using the Pearson and Spearman methods. Then I choose the predictors whose absolute correlation is above 0.4 under both tests, indicating a strong relationship:
LNd_septa / LNpct_non_wh / LNd_crime / LNd_off_site / LNd_abate / LNd_vacant
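The screening rule in Step One (keep a predictor only if both its Pearson and Spearman correlations with the target exceed 0.4 in absolute value) can be sketched as follows. This is a pure-Python illustration with made-up data; the predictor names `x_strong` and `x_weak` and the helper names are assumptions, not the project's variables or code.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def select_predictors(predictors, target, threshold=0.4):
    """Keep predictors whose |Pearson| and |Spearman| with the target both exceed threshold."""
    return [
        name
        for name, xs in predictors.items()
        if abs(pearson(xs, target)) > threshold and abs(spearman(xs, target)) > threshold
    ]


# Hypothetical target and predictors (not project data).
target = [1, 2, 3, 4, 5, 6]
predictors = {
    "x_strong": [2, 4, 5, 8, 10, 13],
    "x_weak": [3, 1, 3, 1, 3, 1],
}
print(select_predictors(predictors, target))
```

Requiring both tests to pass guards against a predictor that looks strong only because of a few extreme values (high Pearson, low Spearman) or one whose relationship is monotonic but far from linear.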

Step Two
I build a kitchen-sink model with all six predictors that I think are influential.

Step Three
From the previous correlation test results, I know that LNd_crime and LNd_vacant are strongly correlated with other variables, which means I have to get rid of one of them.

To compare their effects, I built two models, each omitting one of the two. I then chose the one with the higher R-squared value (the model without LNd_vacant):
LNd_septa / LNpct_non_wh / LNd_crime / LNd_off_site / LNd_abate

Results Interpretation
1) One foot change in distance to Septa stations leads to a $-1/sqft change in home sales price.
2) One percentage change in distance to Septa stations leads to a $-100/sqft change in home sales price.
3) One foot change in distance to off site owners leads to a $35,676,400/sqft change in home sales price.
4) One foot change in distance to aggravated assaults leads to a $1,345,457,579/sqft change in home sales price.
5) One foot change in distance to a house that got a ten year tax abatement leads to a $-1/sqft change in home sales price.

Spatial Autocorrelation
From the Global Moran's I summary, given the z-score of 212.36, there is a less than 1% likelihood that this clustered pattern could be the result of random chance. In other words, there is spatial autocorrelation in the residuals, which violates a basic assumption of an OLS model. It also indicates that explicit consideration of this spatial relationship is needed when modeling home prices.

Model Performance
In general, the model is not very useful for estimating the willingness to pay for transit. The reasons are:
1) The R-squared of the final model is only 0.4, which tells us that most of the variation in price is not explained by these independent variables.
2) There is severe spatial autocorrelation, violating the model's assumption.

Price Map

Residuals Map

Willingness to Pay for the Transit #1: Sales Prices Analysis

- Homes sold within 0.25 mile
- Homes sold between 0.25 mile and 0.5 mile

Similarity
Judging from the scatterplot below, the overall relationship between home sales prices and the distance to transit is the same when the two zones are treated separately: the nearer a home is to transit, the lower its price.

Difference
However, when the two zones are combined, there are some slight differences and interesting facts about sales prices between homes within 0.25 mile and those at 0.25-0.5 mile.
1) Unexpectedly, in the 0.25-mile buffer zone, the price goes down faster when approaching the transit stops than it does outside the 0.25-mile zone.
2) Theoretically there should be an easy-to-follow linear relationship. However, real sales prices vary greatly even when the distance to transit stops is held constant. That is to say, for the vast majority of buyers, distance to the nearest transit stop might be just a minor factor when buying a home.
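One simple way to compare how fast price changes with distance in the two zones is to fit a separate least-squares line to each distance band and compare the slopes. The sketch below assumes this approach for illustration; the original analysis may have been done differently, and the distances and prices are invented.

```python
def ols_slope(xs, ys):
    """Slope of the one-variable least-squares line of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def slope_by_band(distances, prices, cutoff=0.25):
    """Return (slope within cutoff, slope beyond cutoff); distances in miles."""
    inner = [(d, p) for d, p in zip(distances, prices) if d <= cutoff]
    outer = [(d, p) for d, p in zip(distances, prices) if d > cutoff]
    inner_slope = ols_slope([d for d, _ in inner], [p for _, p in inner])
    outer_slope = ols_slope([d for d, _ in outer], [p for _, p in outer])
    return inner_slope, outer_slope


# Hypothetical distances to the nearest stop (miles) and prices ($/sqft).
distances = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45]
prices = [100, 120, 140, 160, 170, 175, 180, 185]

inner_slope, outer_slope = slope_by_band(distances, prices)
print(inner_slope, outer_slope)
```

In this toy data the inner-band slope is steeper than the outer-band slope, which is the pattern described in point 1: prices fall faster on approach to a stop inside the 0.25-mile zone.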

There are some local neighborhood effects influencing the sales price. For example, the City Hall area and South Philadelphia both have large orange circles, which means homes near transit there are generally of higher value.

Willingness to Pay for the Transit #2: Price Premium Estimate
From my model (the one without station fixed effects), I detect no obvious special preference when people choose a house. So I trust the judgement from Model 1 that, if we separate people into two groups, those who live inside the 0.25-mile zone are likely to pay more, and to pay more often, than those who do not. The p-value shows that this is statistically significant.

General Discussion
I think the research mostly identifies the home sales price. The home sales price, to me, largely depends on each family's own wealth and has no direct relationship with the willingness to pay for transit. Furthermore, the home sales price itself is flexible. It moves with the local economic trend and CPI, and sometimes the house itself determines its own value, drawing little from nearby amenities.

The model also assumes that residents within the 0.25-mile buffer take advantage of the transit stops more often. Though that might be true, transit use can be more closely related to the income level of the residents. For example, students and the elderly are more likely to take public transportation.


3- Smart Growth Planning: Urban Growth VS. Development Suitability, Pennsylvania, U.S.

Environmental Sensitivity Index Map

Basemap Source: OpenStreetMap

3- Smart Growth Planning, completed by Yun Shi, 09/22/2016

Urban Development Likelihood Map

Basemap Source: OpenStreetMap


Notable Sites Map

Basemap Source: OpenStreetMap


Notable Counties

Top 5 Pennsylvania Counties to Be Focused On by DEP
1) Allegheny
2) Westmoreland
3) Chester
4) Montgomery
5) York

Data Process & Analysis

Map 1 Environmental Sensitivity Index Map

Weight Value
+5: Federal Lands; 100-Meter Buffer Surrounding All Rivers and Streams; Wetlands; Forests; Slope >15%
+4: Slope 5% - 15%
+3: Parks; Pastures
+2: Major Highways; Public Conservancy; Farms; Slope 2% - 5%
+1: Slope <2%

Decision Factors
- Current ecological value of the land type (federal lands, lands surrounding rivers and streams, wetlands and forests)
- Future ecological value if the type is changed (parks to residential buildings, pastures to golf courses, etc.)
- Possibility of that change (federal lands, slope >15%, etc.)

Spatial Pattern
- In general, the northwest part of Pennsylvania is more ecologically sensitive than the southeast part.
- Specifically, in Forest, Warren, McKean and Elk counties, there is a large sensitive area highlighted in red on the map. It turns out to be State Game Lands Number 86, which supports the need for conservation in another way.

Map 2 Urban Development Likelihood Map

Weight Value
+5: Potentially Urbanizable Sites Near Recently Urbanized Sites
+3: Potentially Urbanizable Sites Near Major Highways
+1: Scaled Marginal Density Measure*

*Marginal Density Measure is scaled based on MIN-MAX normalization.

Decision Factors
- Recently urbanized sites may maintain a steady momentum of economic development, which can be attractive to developers, and the land demand in that area will be the economic driver for nearby lands.
- Land transportation is still an indispensable part of cities' logistics systems. Sometimes, construction of major highways may also demonstrate the government's willingness to develop that part of the country.
- Compared to the other two factors, the density factor, to me, is more like the result of urban development. It is a conclusion rather than a projection, and density may change in a very short time for various reasons. For that reason, its weight value is only +1.

Spatial Pattern
- Generally, the lands that may potentially be urbanized are in the counties in the middle part of Pennsylvania.
- The scattered spatial pattern may indicate that various economic or social reasons lie behind it, whereas a clustered pattern may indicate a huge change in that area over 10 years.
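The Map 2 scoring can be sketched as a weighted sum, with the marginal density measure first rescaled by min-max normalization. Note the assumptions: the source gives the weights (+5, +3, +1) but does not state how they are combined, so the additive form below is a guess for illustration, and all function names and site values are hypothetical.

```python
def min_max(values):
    """Min-max normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def development_likelihood(near_urbanized, near_highway, density_scaled):
    """Assumed additive score: +5 near recently urbanized sites,
    +3 near major highways, +1 times the scaled marginal density."""
    return 5 * near_urbanized + 3 * near_highway + 1 * density_scaled


# Hypothetical sites: (near recently urbanized site?, near major highway?, raw density measure).
sites = [
    (True, False, 10),
    (False, True, 20),
    (True, True, 40),
]

scaled = min_max([density for _, _, density in sites])
scores = [
    development_likelihood(urb, hwy, dens)
    for (urb, hwy, _), dens in zip(sites, scaled)
]
print(scores)
```

Because the scaled density never exceeds 1, it can shift a site's score by at most one point, which matches the stated intent of giving density the smallest influence.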

Map 3 Notable Sites Map

Decision Factors
- After normalizing the indexes of both environmental sensitivity and urban development likelihood, the sites scoring 5.0 or greater in each index are compared.
- Finally, sites that score high on both indexes are calculated and presented in Map 3 Notable Sites Map.

Spatial Pattern
- The map indicates that sites near major highways are to be noted. Besides that linear pattern, there are also some clusters of points.
- The size of the dots shows that, within these clusters, sites tend to be larger than average.

*For clarity, the area of each site is also represented by the size of its dot on the map.

Table 1 Notable Counties

Decision Factors
Score 1 - Number of patches (environmentally sensitive sites most at risk from urbanization) and the comparison to all counties in Pennsylvania (calculated by location quotient) - Weight Value = 1
Score 2 - Total patch area and the comparison - Weight Value = 2 (To be focused on by DEP, the county may have some difficulties in the management or development of these patches, and a large number of patches might be the reason)
Score 3 - Total patch area and the comparison - Weight Value = 1
Score 4 - Largest patch area and the comparison - Weight Value = 1

Total Score = Score 1*1 + Score 2*2 + Score 3*1 + Score 4*1
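The total score formula above is a plain weighted sum, so ranking counties reduces to a sort. A minimal sketch with hypothetical county labels and score values (the real per-county scores are not given in the text, and the helper names are assumptions):

```python
# Weights for Score 1..Score 4, as stated in the Total Score formula.
WEIGHTS = (1, 2, 1, 1)


def total_score(scores, weights=WEIGHTS):
    """Total Score = Score 1*1 + Score 2*2 + Score 3*1 + Score 4*1."""
    return sum(s * w for s, w in zip(scores, weights))


def top_counties(county_scores, n=5):
    """Names of the n counties with the highest total score."""
    ranked = sorted(
        county_scores.items(),
        key=lambda item: total_score(item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:n]]


# Hypothetical counties with (Score 1, Score 2, Score 3, Score 4).
county_scores = {
    "County A": (1, 1, 1, 1),
    "County B": (2, 2, 2, 2),
    "County C": (0, 3, 0, 0),
}

print(top_counties(county_scores, n=2))
```

Doubling the weight on Score 2 lets a county with a single high Score 2 outrank one with uniformly moderate scores, as County C does over County A here.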

4- TOD Planning: Urban Growth VS. Development Suitability, Philadelphia, Pennsylvania

4- TOD Planning