Advanced Spatial Statistics - Final Project

Mini-Project 1

Introduction: For the first mini-project I am using a dataset built into R, “LifeCycleSavings”. This dataset contains information regarding how much of their income people in 50 selected countries saved each year, as a percentage of total net income. The dataset also contains variables for the percentage of the population under 15 (“pop15”) and over 75 (“pop75”) years of age, among others, but for this analysis I will use only the “sr” variable, which gives the savings rate for the population as a whole. The dataset has a good spatial distribution, as it contains countries from every continent except Antarctica.[1] It is important to note that the data was collected from 1960 to 1970, and therefore the results may be very different from the results one would get today. Also, communist countries (except for China) are not included. Various forms of analysis are below.

Analysis: The first form of analysis I did was a simple scatterplot (below). Although it provides a good picture of how countries relate to each other, it has drawbacks. Most notably, the points are unlabeled, so we do not know which country is which.

[1] Due to the isolationist nature of North Korea it is assumed that “Korea” is South Korea.


The next form of analysis is a barplot, with the countries ranked in descending order. Unfortunately the labels would not line up properly in this format, so a second barplot of the countries in alphabetical order is included as well.
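The label problem described above typically arises when the bar heights are sorted but the labels are not. A minimal sketch of one way to keep the two together, assuming the country names are taken from the dataset's row names (`sr` and `sr_sorted` are names introduced here for illustration):

```r
# Attach the country names to the savings rates so they travel together
sr <- setNames(LifeCycleSavings$sr, rownames(LifeCycleSavings))

# Sorting a named vector reorders the names along with the values
sr_sorted <- sort(sr, decreasing = TRUE)

barplot(sr_sorted, names.arg = names(sr_sorted), cex.names = 0.4, las = 3,
        main = "Percent Savings by Country", ylab = "Percent Savings",
        col = "orange")
```

Because the names are stored on the vector itself, any reordering of the values carries the labels with it.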


A histogram shows the general trends:

A boxplot provides further numerical analysis. Note that there are no outliers.
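The absence of outliers can also be checked numerically. The sketch below assumes R's default boxplot rule, which flags points lying more than 1.5 times the hinge spread beyond the hinges; `stats_sr` is a name introduced here for illustration.

```r
sr <- LifeCycleSavings$sr

# boxplot.stats() returns the five numbers drawn by boxplot() and, in $out,
# any points that fall beyond the whiskers
stats_sr <- boxplot.stats(sr, coef = 1.5)

stats_sr$stats        # lower whisker, lower hinge, median, upper hinge, upper whisker
length(stats_sr$out)  # number of points flagged as outliers
```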


The information contained in the boxplot is by and large the same information that is provided by R’s “summary” function. Since it would be useful to have this information for each continent, the data set has been divided up accordingly. What follows is a series of boxplots that show all of this information graphically, save the mean.

                    Minimum               1st Quartile  Median  Mean    3rd Quartile  Maximum
All Countries (50)  0.6 (Chile)           6.970         10.510  9.671   12.620        21.1 (Japan)
Africa (5)          2.81 (Tunisia)        8.89          11.14   10.94   13.30         18.56 (Zambia)
Asia (6)            3.98 (South Korea)    5.78          10.45   10.58   12.56         21.1 (Japan)
Australia (2)       10.67 (New Zealand)   10.86         11.05   11.05   11.24         11.43 (Australia)
Europe (20)         1.27 (Iceland)        10.32         11.92   11.25   13.41         16.85 (Denmark)
North America (2)   7.56 (United States)  7.87          8.18    8.18    8.48          8.79 (Canada)
South America (15)  0.60 (Chile)          4.02          7.30    6.80    9.23          12.88 (Brazil)

- Of the six continents, North and South America tend to have the lowest savings rates, as indicated by the 1st Quartile, Median, Mean, and 3rd Quartile.
- Collectively, Europe is the continent with the highest savings rate. This may be because many European governments provide services such as health care that individuals in other countries have to pay for themselves, which lowers the savings rate in those other countries. Europe is also the only continent with any statistical outliers. Under the usual 1.5 x IQR boxplot rule these are the two lowest European values, Iceland (1.27) and Turkey (5.13).
- The results for Asia are very interesting in that they tend to be towards the low end, with the notable exception of Japan (which is not a statistical outlier). This is most likely because at the time of the study Japan was the only truly developed country in the region that had citizens who could afford to save their income.
- The results for Australia and North America are of limited value, since each is based on only two countries. This is relatively unavoidable; what is of more interest is the relatively low number of countries for Africa and Asia. This is most likely because these continents have many undeveloped countries where it may be difficult to get data. Nevertheless the lack of information is a detriment to the data set.
- The overall IQR was 5.65. The lowest continent IQR was Europe (3.085) and the highest was Asia (6.78). Australia and North America were not included since they only had two countries apiece.
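The per-continent figures above can be reproduced in one pass rather than one vector at a time. The following is a sketch only: the row indices are copied from the project's R script, while `groups`, `continent`, `continent_summary` and `continent_iqr` are names introduced here for illustration.

```r
sr <- setNames(LifeCycleSavings$sr, rownames(LifeCycleSavings))

# Row indices for each continent, as used in the project's R script
groups <- list(
  "Africa"        = c(36, 37, 42, 46, 49),
  "Asia"          = c(8, 20, 23, 24, 34, 50),
  "Australia"     = c(1, 29),
  "Europe"        = c(2:3, 11, 13:16, 19, 21:22, 25:28, 35, 38:41, 43),
  "North America" = c(6, 44),
  "South America" = c(4:5, 7, 9:10, 12, 17:18, 30:33, 45, 47:48)
)

# Build one continent label per country from the index lists
continent <- character(length(sr))
for (g in names(groups)) continent[groups[[g]]] <- g
stopifnot(all(continent != ""))  # every country received a label

# Six-number summary and IQR for each continent
continent_summary <- t(sapply(split(sr, continent), summary))
continent_iqr     <- tapply(sr, continent, IQR)

round(continent_summary, 2)
round(continent_iqr, 3)
```

Keeping the labels in a single vector also makes it possible to draw all six boxplots side by side with `boxplot(sr ~ continent)`.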




Conclusion/Summary: This mini-project analyzed savings rate data from the 1960s. Ultimately it showed that Europe tended to have higher savings rates, while North and South America had lower savings rates. The data this mini-project was based on is below, as is the R script.


List of Countries

Country          sr
Australia        11.43
Austria          12.07
Belgium          13.17
Bolivia          5.75
Brazil           12.88
Canada           8.79
Chile            0.60
China            11.90
Colombia         4.98
Costa Rica       10.78
Denmark          16.85
Ecuador          3.59
Finland          11.24
France           12.64
Germany          12.55
Greece           10.67
Guatamala        3.01
Honduras         7.70
Iceland          1.27
India            9.00
Ireland          11.34
Italy            14.28
Japan            21.10
Korea            3.98
Luxembourg       10.35
Malta            15.48
Norway           10.25
Netherlands      14.65
New Zealand      10.67
Nicaragua        7.30
Panama           4.44
Paraguay         2.02
Peru             12.70
Philippines      12.78
Portugal         12.49
South Africa     11.14
South Rhodesia   13.30
Spain            11.77
Sweden           6.86
Switzerland      14.13
Turkey           5.13
Tunisia          2.81
United Kingdom   7.81
United States    7.56
Venezuela        9.22
Zambia           18.56
Jamaica          7.72
Uruguay          9.24
Libya            8.89
Malaysia         4.71


R script

require("datasets")
?LifeCycleSavings
LifeCycleSavings
data(LifeCycleSavings)
x <- LifeCycleSavings$sr

# Scatterplot of the savings rate against row index
plot(x, main = "Percent Savings by Country", xlab = "Total Number of Countries",
     ylab = "Percent Savings", col = "blue", pch = 16)

# Store the country names (the row names) as a column so they can be used as
# bar labels; this has to run before the barplots that refer to it
LifeCycleSavings[, 1]
row.names(LifeCycleSavings)
LifeCycleSavings$countries <- row.names(LifeCycleSavings)
LifeCycleSavings

# Barplot in descending order. names.arg is still in dataset order here,
# which is why the labels do not line up with the sorted bars.
barplot(x[order(x, decreasing = TRUE)], names.arg = LifeCycleSavings$countries,
        cex.names = 0.4, las = 3, main = "Percent Savings by Country",
        ylab = "Percent Savings", xlab = "Countries", col = "orange")

# Barplot in dataset order, where bars and labels match
barplot(x, names.arg = LifeCycleSavings$countries, cex.names = 0.4, las = 3,
        main = "Percent Savings by Country", ylab = "Percent Savings",
        xlab = "Countries", col = "orange")

hist(x, main = "Percent Savings by Country", xlab = "Percent Savings",
     ylab = "Number of Countries", col = "orange")

boxplot(x, main = "Percent Savings by Country", ylab = "Percent Savings",
        las = 1, boxwex = 0.5, whisklty = 1, col = "orange")
summary(x)
IQR(x)

# Split the savings rates by continent using row indices
transposed <- t(x)
africa <- transposed[, c(36, 37, 42, 46, 49)]
asia <- transposed[, c(8, 20, 23, 24, 34, 50)]
australia <- transposed[, c(1, 29)]
europe <- transposed[, c(2:3, 11, 13:16, 19, 21:22, 25:28, 35, 38:41, 43)]
northamerica <- transposed[, c(6, 44)]
southamerica <- transposed[, c(4:5, 7, 9:10, 12, 17:18, 30:33, 45, 47:48)]

summary(africa)
summary(asia)
summary(australia)
summary(europe)
summary(northamerica)
summary(southamerica)

IQR(africa)
IQR(asia)
IQR(australia)
IQR(europe)
IQR(northamerica)
IQR(southamerica)

# One boxplot per continent
boxplot(africa, main = "Africa", ylab = "Percent Savings", las = 1,
        boxwex = 0.5, whisklty = 1, col = "orange")
boxplot(asia, main = "Asia", ylab = "Percent Savings", las = 1,
        boxwex = 0.5, whisklty = 1, col = "orange")
boxplot(australia, main = "Australia", ylab = "Percent Savings", las = 1,
        boxwex = 0.5, whisklty = 1, col = "orange")
boxplot(europe, main = "Europe", ylab = "Percent Savings", las = 1,
        boxwex = 0.5, whisklty = 1, col = "orange")
boxplot(northamerica, main = "North America", ylab = "Percent Savings", las = 1,
        boxwex = 0.5, whisklty = 1, col = "orange")
boxplot(southamerica, main = "South America", ylab = "Percent Savings", las = 1,
        boxwex = 0.5, whisklty = 1, col = "orange")


Mini-Project 2

Introduction: For the second mini-project I am using an external data set called “Lifeboats” that provides information about how the lifeboats on the Titanic were used. This dataset is available at http://vincentarelbundock.github.io/Rdatasets/csv/vcd/Lifeboats.csv. I specifically intend to look at how the lifeboats on the port (left) and starboard (right) sides of the ship were deployed in order to analyze the differences between the two.

Analysis: Once the Lifeboats dataset was brought into R it was subdivided into two additional datasets entitled “Port” and “Starboard”, respectively. R’s “summary” function was then applied to each of these datasets. Of particular interest are the variables “crew”, “men”, “women”, “total”, and “cap”:

“Crew” - Number of crewmembers on each lifeboat
“Men” - Number of men on each lifeboat
“Women” - Number of women on each lifeboat
“Total” - Total number of people on each lifeboat
“Cap” - Maximum capacity of each lifeboat. For all but two of the lifeboats on each side this was 65 people; the exceptions held 40 and 47, respectively.

The results of the summaries are below:

Port
        Minimum  1st Quartile  Median  Mean   3rd Quartile  Maximum
Crew    3        5             7       7.778  9             15
Men     0        2             4       4      6             10
Women   2        25            42      39.44  59            64
Total   12       41            56      51.22  70            71
Cap     40       65            65      60.22  65            65

Starboard
        Minimum  1st Quartile  Median  Mean   3rd Quartile  Maximum
Crew    2        2             4       4.111  5             8
Men     0        0             0       0.778  2             2
Women   21       35            40      38.78  50            53
Total   26       39            42      43.67  55            63
Cap     40       65            65      60.2   65            65


A mosaicplot is a good way to visualize the data. In particular we can see what percentage of the total were “crew”, “men”, and “women”. The bars are scaled in proportion to the total number of people in each lifeboat.


We can also add up the data to get the following information regarding the lifeboats and who got in them:

           Crew  Men  Women  Total  Capacity  Percent Full
Port       70    36   355    461    542       85
Starboard  37    7    349    393    542       73

The following graphs show how full each of the individual lifeboats was upon being launched.
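The “Total” and “Percent Full” columns of the summation table follow directly from the other columns. A small sketch of that arithmetic, using only the totals reported in the table (the data frame `boats` and its column names are introduced here for illustration):

```r
# Totals taken from the summation table in the text
boats <- data.frame(
  side     = c("Port", "Starboard"),
  crew     = c(70, 37),
  men      = c(36, 7),
  women    = c(355, 349),
  capacity = c(542, 542)
)

# Total occupants and the share of the available seats that were filled
boats$total        <- boats$crew + boats$men + boats$women
boats$percent_full <- round(100 * boats$total / boats$capacity)
boats
```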


We can also plot when the boats were launched as bar graphs (since the time data is presented as multiple numbers, we cannot plot it as a scatterplot, histogram, etc.).

Since the labels are somewhat difficult to read, here is the information in a table (each X marks one launch, listed in time order):

Time       12:45, 12:55, 1:00, 1:10, 1:20, 1:25, 1:30, 1:35, 1:40, 1:45, 1:55, 2:05
Port       X X X X X X XX X
Starboard  X X X X X X X X X (the last at 2:05)
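One way around the plotting limitation mentioned above is to turn each clock time into a single number. The sketch below is an illustration under stated assumptions: it uses the twelve times from the table, written as "H:MM" strings, and assumes they are all early-morning times so that 12:45 means 45 minutes after midnight. The format of the `launch` column in the actual dataset may differ.

```r
# Launch times as listed in the table (assumed to be a.m. clock times)
launch <- c("12:45", "12:55", "1:00", "1:10", "1:20", "1:25",
            "1:30", "1:35", "1:40", "1:45", "1:55", "2:05")

# Split each time into its hour and minute parts
parts   <- strsplit(launch, ":", fixed = TRUE)
hours   <- as.integer(sapply(parts, `[`, 1)) %% 12  # 12:xx a.m. becomes hour 0
minutes <- as.integer(sapply(parts, `[`, 2))

# Minutes elapsed since midnight: one number per launch time
since_midnight <- hours * 60 + minutes
since_midnight
```

A numeric vector like `since_midnight` can be passed to `plot()` or `hist()` in the usual way.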


From this analysis we can draw the following conclusions:

- As expected, the number of women who made it into the lifeboats is much higher than the number of men and crewmembers. Unfortunately this dataset does not have data regarding children, who like women were given preference.
- More people escaped from the port side of the boat. This is indicated in both the summary and summation statistics, as well as the “Percent Full” bar graphs. In some cases the boats were filled past capacity.
- The port side launched their boats more quickly than the starboard side. The last boats launched on the starboard side were some of the emptiest.

Conclusion/Summary: This mini-project analyzed the number of lifeboats launched from the port and starboard sides of the Titanic and the particulars of these lifeboats. The analysis showed that more people made it into lifeboats launched on the port side, and that these lifeboats were launched earlier. The R script for this project is below.


R script

# Data source: http://vincentarelbundock.github.io/Rdatasets/csv/vcd/Lifeboats.csv
# Assumed loading step (the original script did not show how the data was read).
# stringsAsFactors = TRUE keeps "launch" a factor, which the plots below rely on.
Lifeboats <- read.csv("http://vincentarelbundock.github.io/Rdatasets/csv/vcd/Lifeboats.csv",
                      stringsAsFactors = TRUE)
Lifeboats

install.packages("RColorBrewer")
require("RColorBrewer")

# Split the data by side of the ship
port <- Lifeboats[1:9, ]
starboard <- Lifeboats[10:18, ]
summary(port)
summary(starboard)

# Mosaic plots of crew, men and women (columns 5 to 7) for each lifeboat
port2 <- port[, c(5:7)]
mosaicplot(port2, main = "Port", col = brewer.pal(3, "Accent"))
starboard2 <- starboard[, c(5:7)]
mosaicplot(starboard2, main = "Starboard", col = brewer.pal(3, "Accent"))

# Port totals and percent full
sum(port$crew)
sum(port$men)
sum(port$women)
sum(port$total)
sum(port$cap)
x <- sum(port$total)
y <- sum(port$cap)
x/y
portpercentage <- (port$total)/(port$cap)
barplot(portpercentage*100, ylim = c(0, 156), main = "Percent Full - Port",
        col = "blue")
# The ylim may seem strange but this was the only way to get it to work

# Starboard totals and percent full
sum(starboard$crew)
sum(starboard$men)
sum(starboard$women)
sum(starboard$total)
sum(starboard$cap)
x2 <- sum(starboard$total)
y2 <- sum(starboard$cap)
x2/y2
starboardpercentage <- (starboard$total)/(starboard$cap)
barplot(starboardpercentage*100, ylim = c(0, 100),
        main = "Percent Full - Starboard", col = "blue")

summary(Lifeboats$launch)

# Stack the two launch-time plots in one figure
v1 <- c(1, 2)
v2 <- c(1, 2)
df <- data.frame(v1, v2)
m <- data.matrix(df)
layout(m)

# plot() on a factor draws a bar graph of counts per launch time
time <- (port$launch)
plot(time, cex.names = 0.5, las = 3, main = "Time of Launch - Port Lifeboats",
     col = "red")
time2 <- (starboard$launch)
plot(time2, cex.names = 0.5, las = 3, ylim = c(0, 1),
     main = "Time of Launch - Starboard Lifeboats", col = "red")


Mini-Project 3

Introduction: For the final mini-project I (to an extent) developed my own dataset regarding the population density of cities. This was done by acquiring population density data for the 50 densest cities in the world from Wikipedia, along with their overall population and area. I then added information regarding latitude and longitude so the data could be plotted in an X-Y plane. The resulting dataset was exported as a .csv file called “popden”. This file served as the basis of the mini-project.

Analysis: The analysis process began with a simple scatterplot in order to better see and understand the data. There is a noticeable difference between the five densest cities (Manila, Titagarh, Baranagar, Serampore, and Pateros) and the rest of the data.


A histogram is another way to represent the data:

What would truly be interesting, however, is to see how the cities relate to each other spatially. In order to do this I made the following graph that shows the cities in space. Most of them are located in Western Europe or Southeast Asia. The outlier is Union City, California, which is the only point in the Western Hemisphere.


We can continue the spatial analysis by performing cluster analysis. A hierarchical cluster analysis using the “complete linkage” method is below. The complete linkage method, otherwise known as CLINK, defines the distance between two clusters as the distance between their two most dissimilar members, and at each step merges the pair of clusters for which that distance is smallest.

Other forms of cluster analysis provide slightly different results. Single linkage (SLINK) is the opposite of CLINK: it defines the distance between two clusters as the distance between their two most similar members. A dendrogram showing the SLINK method is below.


The third notable method of cluster analysis is Ward’s method. At each step this strategy merges the pair of clusters whose union produces the smallest increase in the total within-cluster sum of squares.
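The difference between the linkage methods is easiest to see on a very small example. The sketch below uses three hypothetical points on a line, not the popden data; all object names are introduced here for illustration. It uses `"ward.D2"`, the `hclust()` option that applies Ward's criterion to unsquared distances.

```r
# Three hypothetical points on a line (not taken from the popden data)
pts   <- matrix(c(0, 1, 3), ncol = 1, dimnames = list(c("A", "B", "C"), "x"))
d_toy <- dist(pts)  # pairwise distances: AB = 1, BC = 2, AC = 3

hc_complete <- hclust(d_toy, method = "complete")
hc_single   <- hclust(d_toy, method = "single")
hc_ward     <- hclust(d_toy, method = "ward.D2")

# A and B merge first, at height 1, under complete and single linkage.
# The second merge joins {A, B} with C: complete linkage measures it by the
# farthest pair (AC = 3), single linkage by the nearest pair (BC = 2).
hc_complete$height
hc_single$height

# Ward's method reports heights based on the increase in within-cluster
# variance rather than on any single pairwise distance
hc_ward$height
```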

To better visualize the results we can put boxes around different cluster solutions.
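The boxes drawn on a dendrogram correspond to cutting the tree into a chosen number of groups, and the same groups can be obtained as a vector of labels. A minimal sketch on five hypothetical points (not the popden data; `pts` and `hc` are illustrative names):

```r
# Five hypothetical points on a line (not taken from the popden data)
pts <- matrix(c(0, 1, 3, 10, 11), ncol = 1)
hc  <- hclust(dist(pts), method = "complete")

# Draw the dendrogram, then box the 2-cluster solution.
# rect.hclust() must be given the same tree that is currently plotted.
plot(hc, main = "Toy dendrogram")
rect.hclust(hc, k = 2, border = "blue")

# cutree() returns the cluster label of each point for a given k
cutree(hc, k = 2)
cutree(hc, k = 3)
```

Comparing the `cutree()` output for several values of k shows which points move when a cluster is split.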

We can also perform cluster analysis using k-means clustering, which divides the data into a predetermined number of clusters. When the number of clusters was set to 3, it split this dataset into clusters of 18, 19, and 13 cities.
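As an illustration of how k-means assigns points to a fixed number of groups, the sketch below runs it on a handful of hypothetical longitude/latitude pairs. These are made-up values, not the popden data, and `coords` and `km` are illustrative names. The coordinates are passed to `kmeans()` directly, whereas the project's script passes the distance object `d`. Starting centers are supplied explicitly so the result does not depend on random initialisation.

```r
# Hypothetical longitude/latitude pairs (illustrative, not the popden data)
coords <- matrix(
  c(   2.3, 48.9,   # three points in a "Europe-like" group
       2.2, 48.8,
       4.4, 50.8,
      88.4, 22.7,   # four points in an "Asia-like" group
      88.3, 22.6,
      88.4, 22.8,
     121.0, 14.6,
    -122.0, 37.6),  # one isolated point
  ncol = 2, byrow = TRUE, dimnames = list(NULL, c("lon", "lat"))
)

# Use one point from each group as a starting center
km <- kmeans(coords, centers = coords[c(1, 4, 8), ])

km$size     # number of points in each cluster
km$cluster  # cluster label of each point
```

With real data the starting centers are usually left to `kmeans()`, in which case calling `set.seed()` first and using a larger `nstart` makes the result reproducible.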


Conclusion: The cluster analysis confirms what an astute observer may have already noticed. Specifically, there are two main clusters, one in Asia and one in Europe, with Union City, California serving as an outlier. This makes sense since, unfortunately, many Asian cities (especially those in India and the Philippines) have densely populated slums that add to their overall population density. Furthermore, Europe has many older, denser cities with higher population densities. Due to the abundance of land, American cities are very spread out, and therefore have lower population densities. It would be interesting to see how data from other continents (Africa, South America) compared to these results.


R script

popden <- read.csv(file.choose(), header = TRUE)
squaremiles <- popden$Density..sq.miles.
squaremiles

plot(squaremiles, main = "Population Densities of 50 Densest Cities",
     xlab = "Number of Cities", ylab = "Population Density", col = "orange",
     pch = 16)
hist(squaremiles, main = "Population Densities of 50 Densest Cities",
     xlab = "Population Density", ylab = "Number of Cities", ylim = c(0, 20),
     col = "orange")

# Cities plotted by longitude (x) and latitude (y)
plot(popden$Latitude ~ popden$Longitude, main = "Location of Cities in Space",
     xlab = "Degrees Longitude", ylab = "Degrees Latitude",
     xlim = c(-150, 150), ylim = c(0, 60), col = "orange", pch = 16)

install.packages("classInt")
library(classInt)

# Pairwise Euclidean distances between the latitude/longitude columns
latlong <- popden[, c(9:10)]
d <- dist(latlong)

# Complete linkage
c <- hclust(d, method = "complete")
c
plot(c, labels = popden$City, cex.main = 2,
     main = "Complete Linkage Cluster Analysis Dendrogram", xlab = "Cities")

# Single linkage
c2 <- hclust(d, method = "single")
c2
plot(c2, labels = popden$City, cex.main = 2,
     main = "Single Linkage Cluster Analysis Dendrogram", xlab = "Cities")

# Ward's method; "ward" was renamed "ward.D" in R 3.1.0
c3 <- hclust(d, method = "ward.D")
c3
plot(c3, labels = popden$City, cex.main = 2,
     main = "Ward Cluster Analysis Dendrogram", xlab = "Cities")

# Boxes around the 3-cluster solution; rect.hclust() needs the tree that is
# currently plotted, so c3 is used here (the original passed c)
rect.hclust(c3, k = 3, border = "blue")

# Note: given a dist object, kmeans() converts it to a full distance matrix
# and clusters its rows
latlong_kmeans <- kmeans(d, 3)
latlong_kmeans

