2nd semester, Surveying, Planning and Land Management. Semester project: The scope of Sentinel-2 data for identifying deprived urban areas
THE SCOPE OF SENTINEL-2 DATA FOR IDENTIFYING DEPRIVED URBAN AREAS Group 1 Alfredo Alvarez Curran, Ana Castellanos Alvarez & Tobias Damtoft Nowak Andersen
Title: The scope of Sentinel-2 data for identifying deprived urban areas
Elective: Geoinformatics
Project period: 01-02-2022 to 03-06-2022
Project group: 1
Participants: Alfredo Alvarez Curran, Ana Castellanos Alvarez, Tobias Damtoft Nowak Andersen
Supervisor: Jamal Jokar Arsanjani
Number of pages: 45
Turned in: December 20, 2022
Abstract: Mapping Deprived Urban Areas (DUAs, commonly known as slums) with the use of open-source EO systems is still a field under improvement. The lack of agreement on a common definition of ’deprived urban conditions’, together with the wide variety of forms in which they appear throughout the world, makes this task particularly challenging. The aim of this project is to study the scope of Sentinel-2 data for detecting DUAs by applying machine learning and deep learning techniques, focusing on two separate study areas with different sets of geographical features, Lima and Nairobi. Our findings suggest that, while there is still room for improvement, it is possible to obtain satisfactory results by using Sentinel-2 data in combination with local characteristics of DUAs, and to identify their location over a wide geographical scale.
Contents
1 Introduction
  1.1 Background
  1.2 Literature review
  1.3 Problem statement
    1.3.1 Objectives Research Question
2 Machine Learning & Deep Learning
  2.1 Machine Learning classifiers description
  2.2 Deep Learning models description
3 Study Area and Data
  3.1 Study Area 1: Nairobi
  3.2 Study Area 2: Lima
  3.3 Data
4 Methods
  4.1 Data Pre-processing
    4.1.1 Nairobi
    4.1.2 Lima
  4.2 Classification
    4.2.1 Machine learning
    4.2.2 Deep learning
    4.2.3 Post processing
  4.3 Accuracy assessment
5 Results
  5.1 Choice of parameters
  5.2 Choice of input-data
  5.3 Statistics
  5.4 Classification Maps
    5.4.1 Nairobi
    5.4.2 Lima
6 Discussion
7 Conclusion
Bibliography
List of Figures
Appendix
Introduction
1 Introduction
The number of people living in urban areas continues to increase. According to current urbanization trends, an estimated 60% of the global population will live in cities by 2030, rising to almost 67% by 2050. In developing countries, high rates of urbanization typically lead to stagnant economies and poor planning and governance, which in turn create poor areas within larger cities (UN-Habitat, 2018). The UN estimates that more than a billion people live in deprived conditions, meaning that one in every eight people worldwide lives in these circumstances. Most of them are located in Eastern and South-Eastern Asia (370 million), Sub-Saharan Africa (238 million) and Central and Southern Asia (226 million) (United Nations, n.d.).
Figure 1.0.1: Percentage of population living in DUAs in selected regions, excluding Australia and New Zealand, 2018. Source: (United Nations, n.d.)
The needs and concerns of people who live in deprived urban areas (DUAs) are almost never taken into consideration in urban planning, financing and policymaking, which leaves a large share of the global population behind. The number of people living in these conditions will continue to increase in most developing countries unless important action is taken by governments at all levels, in collaboration with civil society and development partners (Nations, 2021), as shown in the following prediction for 2050.
Figure 1.0.2: Population of DUAs, estimation for 2025. Numbers are in millions. Source: (Nations, 2021)
After the third United Nations Conference on Human Settlements in 2015, the volume of research on the characterisation of deprived urban areas has risen as a consequence of SDG 11 and, more specifically, its target 11.1, which aims to ensure access for all to adequate, safe and affordable housing and basic services and to upgrade slums by 2030 (Abascal et al., 2022) (United Nations, n.d.). Monitoring contributes to better performance assessment and better coordination between central governments and regional and local governments. It allows cities to gather accurate information that can be used to design and implement adequate, evidence-based policies and programs in response to specific issues. Therefore, monitoring and quantifying the percentage of the population that lives in DUAs is an important task in providing decision makers and stakeholders with the information needed to achieve target 11.1 (UN-Habitat, 2018). This paper aims to contribute to this cause by studying the use of open-source data and tools to monitor these areas.
1.1 Background
Terminology
A large number of terms refer to these types of settlements, and the choice among them usually depends on the intended connotation. Terms such as “informal,” “illegal” or “squatter” focus on the tenure status; “unplanned” relates to the planning context; “spontaneous” or “irregular” relate to the growth dynamics; whereas “deprived,” “shantytown” and “sub-standard” are associated with poor physical and socio-economic conditions. The term “slum” is largely linked to the Habitat Agenda and the related development goals (Kuffer et al., 2016). Even though the term slum is commonly used, it has negative and ideological connotations (Gilbert, 2007). The term deprived area carries no such connotation because it is a technical term that simply refers to the lack of something. For that reason, the term deprived urban areas is used in this project as the more adequate way to refer to the areas under study.
Definition of deprived urban areas
The term deprived area is not commonly used in this context, and there is still no global agreement on how it should be conceived and mapped. A definition of the term slum, on the other hand, has been widely discussed. After the Expert Group Meeting (EGM) convened in Nairobi in 2002 by UN-Habitat, the United Nations Statistics Division and the Cities Alliance, an operational definition of slums was agreed upon and recommended for international use. This operational definition describes a slum as an area that combines one or more of the following characteristics: lack of access to safe water, lack of access to improved sanitation, lack of housing durability, lack of sufficient living area, and lack of security of tenure (UN-habitat, 2003). These indicators are explained below:
• Access to safe water: a settlement is considered to have an inadequate drinking water supply when less than 50% of households have an improved water supply. Improved sources include a household connection, access to a public standpipe and rainwater collection, which prevent contamination of the drinking water, especially by faecal matter (Sliuzas et al., 2008).
• Access to improved sanitation: a settlement is considered to have inadequate sanitation when less than 50% of households have an improved sanitation facility, understood as flush/pour-flush toilets or latrines connected to a sewer, septic tank or pit (Sliuzas et al., 2008).
• Durable structures: this refers not only to the structural materials but also to the location. A housing structure is considered durable when the roof, walls and floor are built with resistant materials that protect its inhabitants against weather and climate conditions. Households in deprived areas normally live in non-durable dwellings with inadequate building structures such as earthen floors, mud-and-wattle walls or straw roofs. These dwellings are usually located in hazardous areas that are inappropriate for settlement, such as places at risk of landslides, earthquakes and floods, sites near industrial areas with toxic emissions or waste disposal sites, and high-risk zones around railroads, airports and energy transmission lines (Sliuzas et al., 2008).
• Sufficient living area (overcrowding): this refers to a low number of square meters per person and high occupancy rates, meaning the number of persons sharing the same room. A dwelling is considered overcrowded when more than three persons share the same habitable room, which measures a minimum of four square meters (Sliuzas et al., 2008).
• Access to security of tenure: individuals do not have documentation to prove a secure tenure status that protects them against arbitrary unlawful evictions (Sliuzas et al., 2008).
When mapping deprived areas, the following three spatial levels must be considered: object level, slum settlement level and slum environment level:
Figure 1.1.1: The six general indicators representing concepts at three spatial levels. Source: (Kohli, et al., 2021)
The indicators are described as:
• Building characteristics, such as roof type, footprint, building height, wall and floor materials and number of floors.
• Access network, such as the type, width and layout of roads. Access is normally irregular, with variable street types and widths.
• Density, referring to the percentage of roof coverage, open spaces and vegetation. These areas generally have high roof coverage with no open spaces or vegetation.
• Settlement shape, which tends to follow the shape of features such as roads, railways or drainage channels.
• Location, which is frequently on hazardous sites where other development is not possible, such as flood zones, marshy areas, along railways and on steep slopes.
• Neighborhood characteristics, meaning the relationship with the surrounding areas. The location of these settlements depends on socio-economic factors, and they often form close to opportunities for unskilled or low-skilled jobs (Kohli, et al., 2021).
The 10 m resolution of Sentinel-2 data will determine which of these characteristics are identifiable.
1.2 Literature review
The number of publications on mapping DUAs with remote sensing methods has increased over the last two decades, as more very high-resolution (VHR) data has become available and concern to curb and eradicate the growth of these areas has spread (Kuffer et al., 2016). Previous research shows that very accurate results can be obtained when mapping deprived urban areas with VHR data, which is available at spatial resolutions finer than 0.5 m. One paper that gives a useful overview of working with VHR data and remote sensing for this purpose is “Slums from Space—15 Years of Slum Mapping Using Remote Sensing”, a review that gathers the variety of methodologies used in different papers and explores the potential and limitations of each.
Another paper worth mentioning is “Mapping Informal Settlements in Developing Countries with Multi-resolution, Multi-spectral Data” (2018), which, according to its author, is the first published paper to show satisfactory results in mapping deprived urban areas with Sentinel-2. Since then, many other papers have been published that use Sentinel-2 and different methods to map deprived urban areas.
1.3 Problem statement
Traditionally, data on DUAs has been generated by manually surveying the areas and then digitizing the information. Among several limitations, this field-work approach is costly and time-consuming, and its results are difficult to update and transfer (Abascal et al., 2022). In the past two decades, the growing number of satellite imagery providers has made Remote Sensing (RS) a feasible approach for gathering data on DUAs (Mahabir, 2018). Furthermore, the digital format of RS imagery has enabled the application of several Artificial Intelligence (AI) techniques for feature extraction. This combination of RS and AI has opened up a wide range of possibilities for mapping DUAs. Given the need for spatial data on DUAs, this research project aims to work with accessible open-source data. The tools and techniques used for identifying DUAs depend on the level of detail and aggregation of the spatial data that the user wishes to acquire (Maxwell Owusu, 2021). A review of the most relevant works on the subject shows that the majority are based on costly commercial Very High Resolution (VHR) data (P., 2018). Furthermore, these approaches focus on a local scale, while a more global solution for mapping deprived areas is still under-researched (Kuffer et al., 2016).
Figure 1.3.1: Frequency of methods versus main focus for slum mapping using VHR imagery. Source: (Kuffer et al., 2016)
1.3.1 Objectives Research Question
In this context, the use of Sentinel-2 imagery breaks new ground in the field of mapping DUAs. Furthermore, Machine Learning (ML) classifiers, such as Maximum Likelihood or Random Forest, as well as Deep Learning (DL) architectures, offer the possibility of combining the multi-spectral data from Sentinel-2 with other relevant open-source features in order to map DUAs accurately. To this end, the following objectives are pursued:
• To employ available classification techniques with Sentinel-2 data for mapping DUAs in two separate locations with very different sets of geographical features.
• To evaluate the impact of local deprivation features in combination with spectral data on the classification results.
Research questions: This paper aims to investigate the following research questions:
1. How good is Sentinel-2 data for mapping Deprived Urban Areas using open-source methods?
2. How can the local characteristics of Deprived Urban Areas contribute to the classification process?
Figure 1.3.2: Research Structure
Machine Learning & Deep Learning
2 Machine Learning & Deep Learning
The aim of this paper is to investigate state-of-the-art methods for working with Sentinel-2 data to identify DUAs. For this purpose, two different sets of tools are selected: Machine Learning classifiers and Deep Learning models, both aimed at pixel classification of satellite images.
2.1 Machine Learning classifiers description
The following section provides a theoretical description of the ML classifiers used in this project.
k-nearest neighbors (KNN)
KNN is a supervised machine learning algorithm that assumes that similar things are close to each other. It is a non-parametric algorithm, meaning that it makes no assumption about the underlying data: a new data point is assigned to the class it is most similar to among the available classes. First, the number k of neighbours is selected. Then the Euclidean distances from the new point to the training points are calculated and the k nearest neighbours are identified. Among these k neighbours, the number of data points in each class is counted, and the new point is assigned to the class with the highest count (javatpoint, 2021).
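The KNN steps described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the project's script: the function name `knn_classify` and the two-band pixel values are hypothetical, and in practice a library implementation would be used instead.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=3):
    """Assign new_point to the majority class among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Count the neighbours per class and return the class with the highest count
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-band pixel values (illustrative only, not project data)
train_points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.15),
                (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
train_labels = ["deprived", "deprived", "deprived",
                "non-deprived", "non-deprived", "non-deprived"]

predicted = knn_classify(train_points, train_labels, (0.2, 0.2), k=3)
```

With k = 3, the three closest training points to (0.2, 0.2) all belong to the same class, so the vote is unanimous. Larger values of k smooth the decision but can blur small classes.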
Figure 2.1.1: Illustration of K-nearest neighbours. Source: (javatpoint, 2021)
Support Vector Machine (SVM)
Support vector machine is a supervised, non-parametric statistical learning technique that makes no assumption about the data distribution. The purpose of the training algorithm is to find an optimal hyperplane that divides the dataset into a predefined number of classes in agreement with the values of the training samples. The optimal separation is the decision boundary that minimises the misclassification obtained in the training step. The margin width is defined by the points located on the margin, which are called support vectors (Mountrakis, et al., 2010).
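To make the notions of hyperplane, margin and support vectors concrete, the following sketch evaluates a linear decision boundary that is assumed to have been found already by training. The weights `w`, the offset `b` and the sample points are hypothetical, and the sketch covers only the linear two-class case, not the kernels tested in this project.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def support_vectors(w, b, points, tol=1e-9):
    """Training points lying on a margin boundary, i.e. |w.x + b| equal to 1."""
    return [p for p in points if abs(abs(decision_value(w, b, p)) - 1.0) <= tol]

# Hypothetical, already-trained hyperplane x1 = 2 and toy samples (illustrative only)
w, b = (1.0, 0.0), -2.0
points = [(0.0, 1.0), (1.0, 0.0), (3.0, 2.0), (4.0, 1.0)]

labels = [1 if decision_value(w, b, p) >= 0 else -1 for p in points]
width = margin_width(w)
sv = support_vectors(w, b, points)
```

Only the two points closest to the boundary end up in `sv`; moving any of the other points (without crossing the margin) would leave the hyperplane unchanged, which is why SVM can work with small training sets.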
Figure 2.1.2: Illustration of Support Vector Machine. Source: (Mountrakis, et al., 2010)
SVM is widely used in the remote sensing field because it can produce higher classification accuracy than traditional methods even with a small training data set (Mantero, et al., 2005). However, the outcome depends strongly on user-defined parameters: the kernel used, the parameters chosen for that kernel and the method used to generate the SVM (Tso & Mather, 2009). The learning process of SVM aims to reduce classification mistakes on unseen data without any prior assumption about the probability distribution of the data (Mountrakis, et al., 2010).
Random Forest (RF)
Random forest is a supervised machine learning algorithm used in classification and regression problems. It consists of a combination of decision tree classifiers, each generated using a random vector sampled independently from the input vector, and each tree casts a unit vote for the most popular class to classify an input vector (Pal, 2005), (Breiman, 1999). Two statistical techniques support the classification: bootstrapping and bagging, the latter also known as bootstrap aggregation. Bagging is the ensemble technique used by random forest. It draws random samples from the data set, so each model is generated from samples (bootstrap samples) drawn from the original data with replacement, which is known as row sampling. This step is called bootstrap. Each model is then trained independently and generates its own results. The final output is obtained by combining the results of all models through majority voting, a step known as aggregation (Sruthi, 2021).
Figure 2.1.3: Illustration of Random Forest. Source: (Ampadu, 2021).
One feature of Random Forest is that not all variables are considered when building an individual tree, so each tree is different. Each tree is created independently from different data and attributes (Sruthi, 2021).
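The bootstrap and aggregation steps can be sketched as follows. As a deliberate simplification, each decision tree is replaced by a one-feature "stump" that compares class means; the function names, the toy samples and the choice of 25 models are all hypothetical and serve only to illustrate row sampling with replacement, random feature selection and majority voting.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Row sampling with replacement: same size as the original data set."""
    return [rng.choice(rows) for _ in rows]

def train_stump(sample, n_features, rng):
    """Simplified stand-in for a decision tree: one randomly chosen feature,
    each class represented by its mean value on that feature."""
    feature = rng.randrange(n_features)  # random subset of variables (size 1)
    values = {}
    for features, label in sample:
        values.setdefault(label, []).append(features[feature])
    means = {label: sum(v) / len(v) for label, v in values.items()}
    return feature, means

def predict_stump(stump, features):
    feature, means = stump
    # Predict the class whose mean is closest on the chosen feature
    return min(means, key=lambda label: abs(means[label] - features[feature]))

def majority_vote(votes):
    """Aggregation step: the most frequent class wins."""
    return Counter(votes).most_common(1)[0][0]

def forest_predict(stumps, features):
    return majority_vote(predict_stump(s, features) for s in stumps)

# Hypothetical two-band samples (illustrative only, not project data)
rows = [((0.10, 0.20), "deprived"), ((0.15, 0.25), "deprived"),
        ((0.20, 0.15), "deprived"), ((0.30, 0.10), "deprived"),
        ((0.70, 0.90), "non-deprived"), ((0.85, 0.80), "non-deprived"),
        ((0.90, 0.75), "non-deprived"), ((0.80, 0.70), "non-deprived")]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
stumps = [train_stump(bootstrap_sample(rows, rng), 2, rng) for _ in range(25)]
```

Calling `forest_predict(stumps, (0.2, 0.25))` then collects one vote per model and returns the majority class. Because every model sees a different bootstrap sample and a different feature, individual errors tend to be outvoted.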
2.2 Deep Learning models description
The following section provides a theoretical description of the DL models used in this project. Deep learning is a machine learning technique in which a computer model learns from examples to perform classification tasks. The models are trained with a large set of labelled data and neural network architectures, and they learn features directly from the data. A neural network (NN) is a computational learning system, and a form of deep learning technology, that involves a large number of highly interconnected processors operating in parallel and arranged in layers. It has several hidden layers, and the more hidden layers there are, the more complex the image features that can be learned. Training consists of providing an input and telling the network what the output should be, which allows the model to adjust its internal weights in order to perform better: inputs that contribute to obtaining right answers are weighted higher (Burns, 2018). The TensorFlow library is used to create these neural networks.
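The weight adjustment described above can be illustrated with a single artificial neuron trained by gradient descent. This is a heavily simplified sketch in plain Python: it has no hidden layers, it does not use TensorFlow, and the samples, learning rate and number of epochs are hypothetical values chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, targets, epochs=2000, learning_rate=0.5):
    """Fit one sigmoid neuron by gradient descent on the cross-entropy loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, targets):
            output = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = output - target  # gradient of the loss w.r.t. the weighted sum
            # Weights move in the direction that reduces the error for this sample
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    """Output between 0 and 1; values above 0.5 are read as class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical two-band samples: 0 = non-deprived, 1 = deprived (illustrative only)
samples = [(0.1, 0.2), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8)]
targets = [0, 0, 1, 1]

weights, bias = train_neuron(samples, targets)
```

A multi-layer network repeats the same idea across many neurons and layers, with the error propagated backwards through the layers so that every weight receives its own correction.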
Figure 2.2.1: Neural networks. Source: (IBM, 2020)
Study Area and Data
3 Study Area and Data
As mentioned previously, DUAs vary across the globe and present different features that are recognizable from space (Kohli, et al., 2021). Hence, in order to evaluate the potential of Sentinel-2 data for mapping DUAs, this project uses two separate geographical locations in which DUAs present very different characteristics: Lima (Peru) and Nairobi (Kenya). As both cities are very large (especially Lima, with an area of 2,672 km²), an Area of Interest (AOI) has been delineated for each in order to limit the computational power required. All the classifiers used in this project are tested in the corresponding AOI.
3.1 Study Area 1: Nairobi
Figure 3.1.1: Study Area, Nairobi. Source: Self-produced
Nairobi is the capital of Kenya and one of East Africa’s largest cities. Approximately 60% of Nairobi’s 5 million inhabitants live in deprived areas, and it is estimated that there are around 200 different deprived settlements in the city (www.kibera.org.uk, n.d.). The largest slum in the city (Kibera) is considered by many to be the most populated on the continent. Other deprived areas have existed for many decades and are also well known to the public. However, with a population growth rate of over 4%, the city is one of the fastest-growing in the world, and it is therefore challenging to keep track of new settlements. According to UN-Habitat, 75% of the urban growth is absorbed by deprived areas, which entails a rapid development of these areas (World population review, 2022b). Nairobi was selected as a case study mainly because it has been widely studied in previous literature and a large amount of reference data is available in comparison to other locations.
3.2 Study Area 2: Lima
Figure 3.2.1: Study Area, Lima. Source: Self-produced
Lima has been selected because it is known to one of the project’s authors, so some knowledge of the local context can be provided. It is the capital city of Peru and, with 10 million inhabitants, one of the largest cities in the Americas (World population review, 2022a). It is not known with certainty how many people live in deprived areas in Lima, but the share is estimated to be around 30% of the population. The main problems Lima faces are similar to those of Nairobi. The city experienced rapid population growth as well as uncontrolled and unplanned urban sprawl due to low planning capacity. These factors have led to the development of areas lacking land tenure and basic services (Almaaroufi et al., 2006).
3.3 Data
This study assesses the use of Sentinel-2 data for detecting DUAs. As an open-source option with a spatial resolution of up to 10 m and a temporal resolution of 10 days, it is a very interesting option for detecting and monitoring DUAs on a small scale. The Sentinel-2 data for both case studies was downloaded from USGS EarthExplorer, and it was possible to find clear images with almost no cloud cover. Out of the 13 spectral bands provided by Sentinel-2, bands 2-8 and 11-12 were acquired. Bands 1, 9 and 10 were not chosen because of their primary use for atmospheric corrections and their low resolution (60 m). Regarding the ground-truth data, requests were first sent to several authors of relevant academic papers, but none of them succeeded. Secondly, an in-depth internet search was performed, and the best option found was “The atlas of informality”, a user-based initiative for slum mapping carried out by professors and students from the University of Colorado Boulder. As it is not an official organization, this data set has been used to get a first idea of where to look for DUAs (Atlas of informality, n.d.). The data is accessible on their web page, where it is displayed as shown in figure 3.3.1, but it cannot be downloaded. Because the data is user-generated, its accuracy is unknown and there is no guarantee that all slum areas within the study areas are mapped. The information is therefore treated with caution.
Figure 3.3.1: A screenshot of the webpage "Atlas of informality". Source: Self-produced
In addition to the data provided by the “atlas of informality”, the distinction between deprived and formal settlements is made by assessing Google Maps imagery and using Google Street View. Additional data related to the specific characteristics of deprived areas in the two study areas is also collected and processed. Both processes are described in sections 4.1.1 and 4.1.2. The following table shows the final set of data sets used in this research.
Data                                      Year   Source
Sentinel-2 (Bands 2,3,4,5,6,7,8,11,12)    2022   USGS
Atlas of Informality                      2020   Online Data
High Resolution Orthophoto                2022   Google Earth
Buildings’ Centroid (Nairobi)             2021   Earth Engine Data Catalog
Rivers Shp (Nairobi)                      2020   Open Street Map
DEM (Lima)                                2020   NASA
Methods
4 Methods
The method used in this research paper is split into three main steps (figure 4.0.1): data pre-processing, classification, and result analysis.
Figure 4.0.1: Method Structure Source: Self-Produced
4.1 Data Pre-processing
Before the classification process, the first step is to prepare the data. Firstly, Sentinel-2 bands 5, 6, 7, 11 and 12 are re-sampled to 10 m resolution in QGIS so that each band contains the same number of pixels. This is very important when working with the data in Python. Secondly, a selection of relevant features is extracted for each study area. In doing so, criteria are set for differentiating deprived from non-deprived areas. This process also makes it possible to generate a label data set, which is used in the classification for both training and validation purposes. This approach was taken because no reliable ground-truth data could be acquired from any of the contacted sources. Furthermore, as this paper aims to study the influence of local characteristics when detecting DUAs with Sentinel-2, the relevant features related to DUAs are included as input data for each case study, and their performance in the classification process is assessed.
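The re-sampling step can be pictured with a small sketch. The resampling method used in QGIS is not stated in this paper, so nearest-neighbour resampling is assumed here: every 20 m pixel is simply repeated as a 2 x 2 block of 10 m pixels. The function name and the tiny 2 x 2 band are hypothetical.

```python
def resample_nearest(band, factor=2):
    """Upsample a 2D grid by repeating every pixel factor x factor times."""
    resampled = []
    for row in band:
        # Repeat each value horizontally
        expanded_row = [value for value in row for _ in range(factor)]
        # Repeat the expanded row vertically
        for _ in range(factor):
            resampled.append(list(expanded_row))
    return resampled

# Hypothetical 20 m band of 2 x 2 pixels (illustrative only)
band_20m = [[1, 2],
            [3, 4]]

# After resampling, the band has 4 x 4 pixels of 10 m, matching bands 2, 3, 4 and 8
band_10m = resample_nearest(band_20m, factor=2)
```

The pixel values themselves do not gain any detail; the step only ensures that all bands share the same grid so that they can be stacked into one array per pixel.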
4.1.1 Nairobi
Feature Selection
The features selected as characteristic of DUAs in Nairobi were identified by first examining the areas delineated by “the atlas of informality”, which are distinct from other parts of the city in the following ways:
• The deprived areas occur in larger neighborhoods.
• The buildings have a significantly smaller surface area.
• The density of the buildings is generally higher.
• The patterns of the buildings and roads do not always differ from the rest of the city. In some areas they are as expected, with small unorganized paths and buildings. In other deprived areas, the roads and buildings seem just as organized as in the rest of the city.
• There is generally a spectral difference in true-color imagery. The roofs appear brighter and have few color variations.
• The deprived areas are mostly located along the city’s many small rivers. Beyond that, there is no visual relation between the landscape and the location of those areas.
Figure 4.1.1: Close-up showing difference between Deprived and Non-Deprived (Nairobi) Source: Self-Produced
Making use of Google Earth’s high resolution orthophoto, a manual delineation of the areas that include the characteristics mentioned above is performed. The result is a Shapefile containing the areas which are considered deprived inside the area of interest:
Figure 4.1.2: Manually delineated DUAs (Nairobi) Source: Self-produced
Composite
Once the image features are selected and the label data set is created, further input data is sought that can be added to the composite raster and potentially improve the detection of DUAs. In the case of Nairobi, the following characteristics were found to be the most relevant and could be translated into a feature data set:
Building density: The density of buildings in the deprived areas was probably the most significant visual characteristic. In order to include this feature, building centroids were downloaded from the Earth Engine Data Catalog (Google Research, 2021). The data is derived by Google Research, which has mapped building footprints around the world; it is open source and covers 60% of Africa. The data was exported into GIS and used to make a heat map with the kernel density tool. The search radius was set to 15 m.
Distance to river: DUAs in Nairobi are generally close to rivers. In order to include this characteristic, a shapefile containing the rivers in the area was downloaded from OpenStreetMap. The “Euclidean distance” tool was applied to make a raster layer holding the measured distance to the nearest river. The threshold for the maximum distance was set to 150 m. Reclassify was used to reverse the values of the raster layer, so that cells close to a river have a higher value than cells further away.
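The two feature layers can be sketched per raster cell as follows. This is an approximation of what the GIS tools do, under stated assumptions: the reversal is taken to be "maximum distance minus capped distance", rivers are represented by their vertices only, and the density is a plain count of centroids within the search radius rather than a distance-weighted kernel. All coordinates and function names are hypothetical.

```python
import math

def reversed_river_distance(cell_xy, river_points, max_distance=150.0):
    """Euclidean distance to the nearest river vertex, capped at max_distance
    and reversed so that cells close to a river get high values."""
    nearest = min(math.dist(cell_xy, p) for p in river_points)
    capped = min(nearest, max_distance)
    return max_distance - capped

def building_count_within(cell_xy, centroids, radius=15.0):
    """Simplified density: number of building centroids inside the search radius.
    A kernel density tool additionally weights each point by its distance."""
    return sum(1 for c in centroids if math.dist(cell_xy, c) <= radius)

# Hypothetical coordinates in metres (illustrative only)
river_points = [(0.0, 0.0), (100.0, 0.0)]
centroids = [(5.0, 5.0), (10.0, 0.0), (40.0, 40.0)]

near_river = reversed_river_distance((0.0, 30.0), river_points)   # 30 m from the river
far_away = reversed_river_distance((0.0, 500.0), river_points)    # beyond the 150 m threshold
density = building_count_within((0.0, 0.0), centroids)
```

Reversing the distance means that both added layers point in the same direction: higher values indicate conditions that are more typical of DUAs in Nairobi.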
4.1.2 Lima
Feature Selection
No maps of DUAs in Lima are publicly available, and there is generally little information about these areas. The atlas of informality delineates some areas. However, several other areas that look like DUAs do not appear on its map. Unlike in Nairobi, it is generally difficult to distinguish between these mapped areas and non-deprived areas on an aerial image. They also appear to be less consistent in terms of building density and the structure of roads and buildings. However, the areas considered DUAs by the “atlas of informality” still have some characteristics in common:
• The DUAs are often located on hillsides on the outskirts of the city. They are therefore generally on steeper terrain than non-deprived urban areas.
• The roads are not paved and have a distinct pattern due to the terrain.
• The units are smaller; however, the building density is not necessarily high.
Figure 4.1.3: Close-up showing difference between Deprived and Non-deprived Source: Self-Produced
As described previously, these features are manually mapped inside the area of interest, resulting in the following label data set:
Figure 4.1.4: Manually delineated DUAs for Lima Source: Self-produced
Composite
The image above offers a clear picture of where DUAs are located in Lima: on the threshold between the mountains and the city. Therefore, the main input layer added to the composite raster is a slope raster, which is obtained by geo-processing a Digital Elevation Model of the area.
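How a slope raster follows from a Digital Elevation Model can be sketched with central differences on a small grid. This is an assumption about the principle only: the geo-processing tool used in the project is not specified, and GIS slope tools typically use a weighted 3 x 3 window rather than the two-neighbour difference shown here. The DEM values and cell size are hypothetical.

```python
import math

def slope_degrees(dem, cell_size):
    """Slope in degrees for interior cells using central differences;
    border cells are left as None."""
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Elevation change per metre in the x and y directions
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            # Steepest gradient converted to an angle
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope

# Hypothetical DEM in metres with 30 m cells: terrain rising 30 m per cell eastwards
dem = [[0, 30, 60],
       [0, 30, 60],
       [0, 30, 60]]

slope = slope_degrees(dem, cell_size=30)
```

In this toy grid the terrain rises one metre per metre of horizontal distance, so the centre cell has a slope of about 45 degrees, while a flat grid would yield 0 degrees. The resulting layer would still need to be resampled to the 10 m Sentinel-2 grid before it is added to the composite.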
4.2 Classification
Now that the data and study areas have been carefully studied, the classification process will start by first delineating a training area from which training samples and label-data will be extracted and fed to the classifier, which will finally classify the entire AOI.
Figure 4.2.1: Training sites Source: Self-produced
4.2.1 Machine learning
The three machine learning classifiers used in the project are Support Vector Machine (SVM), K-Nearest Neighbour (KNN), and Random Forest. These methods were chosen because they are accessible both in GIS software and in Python, which corresponds to the general objective of using open-source and widely available methods. Training data for the machine learning methods were created in QGIS as a polygon feature class. Part of the area of interest was chosen for training so that the rest of the area could be used for testing, and training polygons were drawn only within that part. In Nairobi, a total of 6 classes were defined: Deprived, Others (natural surfaces), and 4 different Non-deprived artificial surfaces. Several classes for artificial surfaces were necessary for the models to better distinguish between deprived and non-deprived areas, especially close to rivers, and to avoid noisy results. At least three polygons were drawn for each class. Although deprived areas in Nairobi are not uniform, only one deprived class was usable, since more classes confused the models and gave odd results.
Figure 4.2.2: Training Samples, Nairobi Source: Self-produced
In Lima, a first approach with up to 7 classes failed to identify deprived areas, as these appeared in between non-deprived areas. A second approach with only 3 classes (deprived, non-deprived, and other, mainly used for natural surfaces) proved to be more accurate.
Figure 4.2.3: Training Samples, Lima Source: Self-produced
In both cases, the training data were adjusted iteratively, by assessing the output of the models, to get the best possible results. The actual classification was done in Python using the free library scikit-learn, which gives access to the chosen classifiers. Scikit-learn provides a clear interface that is commonly used in machine learning. Python was chosen for its ability to handle larger datasets as well as for documentation purposes. Ideally, the whole process, from pre-processing of data to visualization, should have been done in Python; however, lack of experience with the language and limited time made it faster and safer to use GIS for other parts of the process, for instance the initial data management and the final visualization. The script is based on a tutorial published by Chris Holden on GitHub (Holden, n.d.). The script first rasterizes the training samples, then rearranges the data (training and label data) to fit the model. Lastly, the model predicts values for the whole area of interest, and the result is exported as a .tif file for further processing in GIS.
For each method, the parameters were assessed by trying out at least three different values to see how they affect the accuracy of the output. If the different values had no impact on the accuracy of the model, the default value was chosen; otherwise, the value with the best accuracy was selected. For the random forest classifier, different numbers of trees and maximum depths were tested; for KNN, different numbers of neighbours; and for SVM, different kernels. The assessment of the different parameters was based on the method described in chapter 4.3.
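The workflow described above (rearranging the labelled pixels, fitting a classifier, predicting the whole area) can be sketched as follows. This is not the project script: the composite raster and the rasterized training polygons are replaced by synthetic arrays, the variable names are hypothetical, and reading and writing of .tif files is left out. The parameter values (100 trees, 3 neighbours, RBF kernel) are the ones reported in chapter 5.1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the composite raster: rows x cols x bands.
rows, cols, bands = 40, 40, 11
image = rng.random((rows, cols, bands))

# Synthetic stand-in for the rasterized training polygons:
# 0 = no training sample, 1..3 = class labels.
labels = np.zeros((rows, cols), dtype=int)
labels[2:8, 2:8] = 1
labels[15:21, 15:21] = 2
labels[30:36, 30:36] = 3
# Make the toy classes separable by shifting the first band per class.
for cls in (1, 2, 3):
    image[labels == cls, 0] += 3 * cls

# Rearrange: one row per labelled pixel, one column per band.
X_train = image[labels > 0]
y_train = labels[labels > 0]

classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "svm": SVC(kernel="rbf"),
}

# Flatten the whole image to (n_pixels, bands), predict, and reshape back
# to the raster layout so the result could be written out as a raster.
flat = image.reshape(-1, bands)
predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions[name] = clf.predict(flat).reshape(rows, cols)
```

Keeping the three classifiers in one dictionary makes it straightforward to run the same accuracy assessment on each output, or to loop over alternative parameter values.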
4.2.2 Deep learning
In order to train the neural network, a new training dataset was required. Using the same training area as shown in figure 4.2.1, a binary (deprived/non-deprived) raster covering the whole area was made. The geometry of the training dataset is based on the self-made ground truth data explained in chapter 4.1.
Figure 4.2.4: Label-data for DL model Source: Self-Produced
The neural network was built in Python using Google’s free library TensorFlow, a widely used library that gives comparatively easy access to the otherwise very complex world of deep learning. The script is based on a tutorial published by Pratyush Tripathy on towardsdatascience.com (source).
The image below (figure 4.2.5) is a screenshot of the architecture of the neural network applied for Nairobi and Lima.
Figure 4.2.5: The image shows a closeup of the predicted areas from random forest. Source: Self-Produced
The model contains an input layer that flattens the data and defines the input shape, followed by the dense layers. For each area of interest, different numbers of dense layers and neurons were tried out in order to test which structure gave the best results. The assessment was done by looking at how changes in the model affected the accuracy calculated at the end of the script. For Nairobi, the best results were obtained with 4 dense layers of 32 neurons and 1 dense layer of 2 neurons, whereas the best results for Lima were reached with 1 layer of 32 neurons and 1 layer of 2 neurons. 100 epochs were specified, but due to early stopping, the model only ran for around 10 of them. The training data were split 70/30, with 70% used for training and 30% for validation; however, the validation in the script only indicates the accuracy of the model within the training area. The final classification of the whole area of interest is assessed in the same way as for the other classifiers (see chapter 4.3 on accuracy assessment).
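The structure described above (four hidden layers of 32 neurons, a 70/30 split, up to 100 epochs with early stopping) can be illustrated with a small sketch. Note that this is a stand-in and not the project's TensorFlow script: scikit-learn's `MLPClassifier` is swapped in to keep the example lightweight, the data are synthetic, and its output layer for a binary problem is a single logistic unit rather than the 2-neuron layer used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic pixels: 11 band values each, binary label (1 = deprived).
X = rng.random((2000, 11))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 70/30 split between training and validation, as described in the text.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Four hidden layers of 32 neurons (the Nairobi configuration).
# With the default 'adam' solver, max_iter is the maximum number of epochs;
# early_stopping halts training once an internal validation score stalls.
model = MLPClassifier(
    hidden_layer_sizes=(32, 32, 32, 32),
    max_iter=100,
    early_stopping=True,
    n_iter_no_change=5,
    random_state=0,
)
model.fit(X_train, y_train)

# Accuracy on the held-out 30 % (still inside the training area).
val_accuracy = model.score(X_val, y_val)
# Probability of the 'deprived' class for each validation pixel.
deprived_probability = model.predict_proba(X_val)[:, 1]
```

As in the project, the validation accuracy computed here only describes performance within the training area; the probabilities are what would later be thresholded into deprived and non-deprived cells.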
4.2.3 Post processing
For all four classifiers, the output tiff file was imported into GIS for further processing. From each file, the class representing the deprived area was extracted. Subsequently, the extracted features were converted to a shapefile in order to filter out smaller polygons. Ideally, the threshold should be as low as possible; however, it has to take into account the resolution of Sentinel data, the aim of the project, and the fact that the average size of a deprived area in the ground truth dataset is 0.4 square kilometres for Nairobi. Therefore, only polygons larger than 0.1 km² (10 ha) were kept.
Figure 4.2.6: Post-Processing diagram Source: Self-Produced
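The size filter in the post-processing step can also be expressed directly on the raster. The sketch below is an assumption-laden raster equivalent of the polygon-based GIS workflow: it labels connected groups of deprived cells with SciPy and drops those below an area threshold. The function name and the toy array are hypothetical; at 10 m resolution, a threshold of 0.1 km² would correspond to 1,000 cells.

```python
import numpy as np
from scipy import ndimage


def drop_small_patches(deprived, min_area_m2, cell_size=10.0):
    """Keep only connected groups of deprived cells of at least min_area_m2."""
    # Label 4-connected groups of True cells; label 0 is the background.
    patches, n_patches = ndimage.label(deprived)
    # Cell count per patch id (index 0 holds the background count).
    counts = np.bincount(patches.ravel(), minlength=n_patches + 1)
    areas = counts * cell_size ** 2
    keep = areas >= min_area_m2
    keep[0] = False  # never keep the background
    # Map the keep/drop decision of each patch back onto the raster.
    return keep[patches]


# Toy raster with a small patch (4 cells) and a larger patch (16 cells).
deprived = np.zeros((10, 10), dtype=bool)
deprived[0:2, 0:2] = True   # 4 cells  = 400 m2
deprived[5:9, 5:9] = True   # 16 cells = 1600 m2
cleaned = drop_small_patches(deprived, min_area_m2=1000.0)
```

In the toy example only the 16-cell patch survives the 1,000 m² threshold; the threshold is deliberately small here so that the effect is visible on a 10 x 10 array.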
4.3 Accuracy assessment
The accuracy of each classification method is assessed by creating a confusion matrix from a number of sampled points. The labelled data obtained in chapter 4.1 are used as reference data to determine the accuracy of the results. Both the reference dataset and the classification schema contain the same value field: Deprived and Non-deprived. The points are sampled over the classified data with a ’stratified random’ strategy, which places random points in numbers proportional to each class’s area. Once the points are sampled, the reference data are used to create the confusion matrix.
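The ’stratified random’ sampling strategy can be sketched in a few lines. This is an illustrative NumPy version, not the GIS tool used in the project: the function name, the seed, and the toy classified raster are hypothetical, and the per-class counts are obtained by simple rounding, so in general they may not sum exactly to the requested total.

```python
import numpy as np


def stratified_random_points(classified, n_points, seed=0):
    """Sample (row, col, class) points in proportion to each class's area."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(classified, return_counts=True)
    # Points per class proportional to its cell count, at least one each.
    share = np.round(n_points * counts / counts.sum()).astype(int)
    per_class = np.maximum(1, share)
    samples = []
    for cls, k in zip(classes, per_class):
        rows, cols = np.nonzero(classified == cls)
        # Draw cells of this class without replacement.
        picked = rng.choice(rows.size, size=min(k, rows.size), replace=False)
        samples.extend((int(rows[i]), int(cols[i]), int(cls)) for i in picked)
    return samples


# Toy classified map: 10 % deprived (1), 90 % non-deprived (0).
classified = np.zeros((100, 100), dtype=int)
classified[:10, :] = 1
points = stratified_random_points(classified, n_points=500)
```

With 10 % of the toy map classified as deprived, 50 of the 500 points fall in the deprived class; each sampled point would then be compared with the reference data at the same location.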
Figure 4.3.1: Confusion matrix illustration Source: (Kohli et al, 2017)
Class value     Deprived   Non-deprived   Total   U Accuracy   Kappa
Deprived              31             15      46         0.67
Non-deprived           7            447     454         0.98
Total                 38            462     500
P Accuracy          0.81           0.97                 0.96
Kappa                                                            0.71

The results obtained from the matrix range from 0 to 1, with 1 representing 100% accuracy. The diagonal shows the correctly classified points (True Positives, TP) for each class, whereas the cells off the diagonal show errors. The confusion matrix table shows the user’s accuracy (U Accuracy column), the producer’s accuracy (P Accuracy row), and the Kappa value.
The user’s accuracy column reflects false positives (FP), that is to say, cases where pixels are incorrectly classified as a given class when they should have been classified as something different. The data to compute this error are read from the rows of the table. The Total row displays the number of points that should, according to the reference data, have been identified as a given class (ESRI, n.d.).
The producer’s accuracy row reflects false negatives (FN). The data to compute this error rate are read from the columns of the table. The Total column shows the number of points that were identified as a given class according to the classified map (ESRI, n.d.).
The Kappa statistic of agreement gives an overall assessment of the accuracy of the classification (ESRI, n.d.).
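To make the relationship between the cell counts and the reported measures explicit, the sketch below recomputes user’s accuracy, producer’s accuracy, overall accuracy, and Cohen’s kappa from the four counts of the example matrix (31, 15, 7, 447). It assumes the layout described above, with rows taken from the classified map and columns from the reference data; the variable names are illustrative.

```python
import numpy as np

# Rows = classified map, columns = reference data,
# in the order (Deprived, Non-deprived).
matrix = np.array([[31, 15],
                   [7, 447]], dtype=float)

total = matrix.sum()
row_totals = matrix.sum(axis=1)   # points per class in the classified map
col_totals = matrix.sum(axis=0)   # points per class in the reference data
diagonal = np.diag(matrix)        # correctly classified points

user_accuracy = diagonal / row_totals       # read along the rows
producer_accuracy = diagonal / col_totals   # read along the columns
overall_accuracy = diagonal.sum() / total

# Cohen's kappa: agreement beyond what chance alone would produce.
expected = (row_totals * col_totals).sum() / total ** 2
kappa = (overall_accuracy - expected) / (1 - expected)
```

Because the non-deprived class dominates the sample, the overall accuracy is high almost by construction, while kappa corrects for chance agreement; this is why the kappa value is used as the main overall measure in the results.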
5 Results
The results are compared visually with the reference data; black corresponds to non-deprived and white to deprived. Furthermore, a confusion matrix obtained from the point-based accuracy assessment provides an evaluation of each classifier.
5.1 Choice of parameters
Essential parameters for each classifier were tested by applying different values. The purpose of the assessment is to determine which value gives the highest accuracy and to assess how sensitive the models are to changes in these parameters.
Random forest
For the random forest classification, 5 different numbers of trees were tried out in order to find the most suitable value: 10, 25, 50, 100, and 150. The values 100 and 25 got the best score, with a kappa value around 0.66, which was about 0.6 higher than the other samples. The test showed that the number of trees does not have a notable effect on the accuracy, and therefore 100 trees were chosen, as it is the default. The same pattern was detected for Lima, and the same value was therefore applied. Changes in the max depth were also tested but did not appear to have any significant effect. Therefore, a max depth of 30 was chosen, which is the default.
K-Nearest Neighbour
For the nearest neighbour classifier, a similar assessment was conducted for the number of neighbours, using the values 2, 3, 4, and 5. The results for Nairobi gave kappa values between 0.51 and 0.56, with 3 and 2 neighbours scoring highest. The value 3 was chosen due to slightly better user and producer accuracies. The same value was applied in Lima.
Support vector machine
For the support vector machine, the kernels “Poly”, “RBF”, and “Sigmoid” were compared. The results showed that RBF has the highest kappa value, 0.49, which was just 0.2 higher than Poly. The Sigmoid kernel, on the other hand, only got a kappa value of 0.12.
Neural network
As mentioned in the method chapter, the output of the neural network is a raster where each cell has a probability value between 0 and 1. In order to classify cells as deprived or non-deprived, an appropriate threshold must be found. Different thresholds were therefore applied in order to find the one that gives the highest accuracy.
For the assessment, the values 0.025, 0.05, 0.1, 0.15, 0.20, 0.30, 0.40, 0.60 and 0.80 were tested. The results for the accuracy and for the kappa value are shown in figure 5.1.1.
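The threshold search can be sketched as a simple sweep: each candidate threshold turns the probability raster into a binary map, which is then scored with kappa against the reference labels. The example below uses the threshold values listed above but entirely synthetic probabilities and reference labels, so the threshold it selects says nothing about Nairobi or Lima; the helper name `kappa_score` is hypothetical.

```python
import numpy as np


def kappa_score(predicted, reference):
    """Cohen's kappa for two binary (0/1) arrays of equal length."""
    observed = np.mean(predicted == reference)
    expected = sum(
        np.mean(predicted == c) * np.mean(reference == c) for c in (0, 1)
    )
    return (observed - expected) / (1 - expected) if expected < 1 else 0.0


thresholds = [0.025, 0.05, 0.1, 0.15, 0.20, 0.30, 0.40, 0.60, 0.80]

# Synthetic data: about 10 % deprived cells, with higher probabilities there.
rng = np.random.default_rng(0)
reference = (rng.random(5000) < 0.1).astype(int)
probability = np.clip(rng.normal(0.1 + 0.3 * reference, 0.1), 0.0, 1.0)

# Kappa for the binary map obtained at each threshold.
scores = {
    t: kappa_score((probability >= t).astype(int), reference)
    for t in thresholds
}
best_threshold = max(scores, key=scores.get)
```

Scoring with kappa rather than overall accuracy matters here because the deprived class is small: a threshold that labels almost nothing as deprived can still reach a high accuracy while having a kappa close to zero.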
Figure 5.1.1: x axis represents the thresholds and y axis the kappa value. Source: Self-Produced
As the figure for Nairobi shows, the different thresholds do not have a significant impact on the accuracy. The kappa value, on the other hand, shows clear fluctuations, and it can be concluded that 0.15 gives the best result. The most suitable threshold for Lima was also 0.15, as illustrated in figure 5.1.2.
Figure 5.1.2: x axis represents the thresholds and y axis the kappa value Source: Self-Produced
5.2 Choice of input-data
Nairobi
In Nairobi, building density and proximity to rivers were the local characteristics considered. Data on those characteristics were obtained and processed as explained in chapter 4.1.1. Classifications with (11 bands) and without (9 bands) these inputs were compared and, as shown in figure 5.2.1, the additional data increased the kappa value as well as the accuracies. There is no substantial difference in their visual appearance; however, in some areas the classification without additional data seems to cover less of the deprived areas, as shown in figure 5.2.2.
Figure 5.2.1: Difference between including local features or not (table), Nairobi.
Figure 5.2.2: Red is with characteristics and blue is without Source: Self-Produced
Lima
The most influential local characteristic of DUAs that we found in Lima, and that can be added to the composite raster, is the slope. As will be seen in chapter 5.3, the method with the best performance is the NN, so the comparison of the classifier’s performance with and without the slope is carried out by feeding the NN model first with the composite containing the slope and then with the composite containing only spectral information.
Figure 5.2.3: Difference between including local features or not (table), Lima Source: Self-Produced
Figure 5.2.4: Difference between including local features or not, Lima Source: Self-Produced
5.3 Statistics
In order to assess the results quantitatively, the results obtained after adjusting the relevant parameters and computing the confusion matrix are summarized by the user accuracy (U acc) and producer accuracy (P acc) for the ’Deprived’ class, and by the Kappa value as an overall assessment of the classifier’s performance.
Figure 5.3.1: Results table Source: Self-Produced
5.4 Classification Maps
Figure 5.4.1: Graphic comparison of employed methods Source: Self-Produced
5.4.1 Nairobi
For Nairobi, the random forest has the highest kappa value, 0.71, compared to the three other classifiers, which have similar kappa values between 0.48 and 0.58. The random forest also has the best user accuracy for the deprived class (0.67) and the best producer accuracy (0.81). In the case of Nairobi, random forest has therefore made the most accurate prediction. K-nearest neighbour was the classifier with the worst performance, with a kappa value of 0.48. Although the accuracy of the models differs, they generally give similar visual results, which can be seen by comparing the predicted polygons to a high-resolution image. The deprived area used for training the models (the southern area) is clearly detected in all cases, as expected. In the deprived neighbourhoods in the central and north-western part of the area of interest, all the models confuse non-deprived areas with deprived areas. The image below (figure 5.4.2) shows the issue in that area and how the model struggles to clearly delineate deprived areas. The spectral signatures of deprived and non-deprived settlements in that area appear very similar, which makes it challenging to separate the two classes. Although the deprived area appears denser and is located close to a river, the methods fail to use those inputs to make a more reliable distinction in this part of the city.
Figure 5.4.2: The area around the river is defined in the ground truth data as deprived area. The area to the left is standard urban settlements. The orange polygons in the image are the prediction from random forest. Basemap: ESRI
In other parts of the city, where the difference is more obvious and the deprived areas better meet the known criteria, the models perform better and give similar results. This is shown in figure 5.4.3.
Figure 5.4.3: The image shows a closeup of the predicted areas from random forest. Source: Self-Produced
5.4.2 Lima
In the case of Lima, the best results were achieved with the Neural Network. As is the case with Nairobi, the rest of the classifiers have trouble in similar areas. Overall, the classified map from the NN seems to be the one that best captures where the DUAs are located in Lima, that is, in the transition between the mountains and the formal urban fabric. The main accuracy errors of the models come from areas with low building density, where the model predicts natural surfaces although the area is still urban.
Figure 5.4.4: The image shows a closeup of the predicted areas classified by NN. Source: Self-Produced
In several cases, we labelled areas in the reference data as deprived although they have a regular and perhaps even planned urban layout. The model does not consider these areas deprived; it seems to have learned that only the areas located in the mountains on steep slopes, with their organic and irregular patterns, are to be considered deprived. The numerical results could therefore presumably be improved by limiting the reference data to the delineation of this kind of area.
Figure 5.4.5: The image shows a closeup of the predicted areas classified by NN. Source: Self-Produced
6 Discussion
In this project, we present a methodology for mapping DUAs in two different study areas, combining Sentinel-2 data with machine learning classifiers and a deep learning model. The reviewed literature shows that VHR data have been widely studied for this purpose, aiming at high accuracy as well as a high level of aggregation in the results. The aim of this research, however, has been to study how far we can go using Sentinel-2 data with only 10 m resolution. This chapter first discusses the positive aspects of the proposed method as well as the limitations to bear in mind, and then several considerations to take into account when taking a more global approach aimed at covering different geographical areas at once.
The main benefit of the proposed method is that, in contrast to studies based on VHR data, both the tools and the data used are open source, enabling more people to access and explore this path. Furthermore, despite the relatively low complexity of the models used, they obtain satisfactory results over large areas, giving an overview of where the DUAs are located. However, highly precise delineations of DUAs should not be expected, mainly for the following reason: the amount of information contained in a 10 m by 10 m pixel makes the model very prone to errors in areas where the boundaries between deprived and non-deprived are not clear. Users should expect a low level of aggregation, which is inadequate for further processing aimed at obtaining population numbers, density, morphology, level of income, etc.; in these cases it is necessary to work with VHR data.
It has been widely noted that there are different degrees of deprivation, and several studies even investigate how they can be identified with EO systems (using VHR data) (Abascal et al., 2022). Therefore, in order to get optimal results at Sentinel-2 resolution, it is necessary to establish a clear threshold between what is considered deprived and what is not. The proposed method has been able to successfully detect those areas that could be regarded as highly deprived and are thus more recognizable from space. In this context, considering the temporal resolution of Sentinel-2 data, it could also be appropriate for monitoring small-scale growth of these highly deprived areas at a national or even international level.
A global model to map DUAs is too challenging and has not been done yet. The issue is the large variety of ontologies of these areas across different parts of the world; for that reason, the amount of training data needed to create such a deep learning model would be extremely large. In addition, further input data would need to be added to the composite raster in order to improve the detection of the DUAs. These inputs have to be related to the characteristics of the DUAs and, since the model is intended to be global, many different inputs would be required to cover all the different features. This in turn requires prior knowledge of the patterns that the DUAs follow. For example, in our case studies, slope is very relevant for Lima but not for Nairobi, whereas in Nairobi building density is an important feature, but this input would not help much in detecting deprived urban areas in Lima, where the whole city has a very high density. That is why two different models have been built to detect deprived urban areas in such different places, where different inputs are needed to improve the detection of these areas.
However, within each continent, the ontologies of deprived urban areas tend to share similar characteristics, so it may be easier to create a model that maps deprived urban areas within a single continent and obtains accurate results across the different countries of that continent (Kohli, et al., 2021). Mapping DUAs at a global scale would be unlikely to succeed with machine learning techniques, because a selection of accurate training samples would be needed for each different place. It could, however, be possible with a deep learning model: a very large number of images would be needed, but with those the DUAs could be detected by running the model, without the need to add further input data to the composite raster.
For large-scale monitoring and modelling, Google Earth Engine (GEE) could be a good option. GEE is an online platform that provides free access to a large amount of satellite imagery and can integrate different datasets, using the cloud for the entire process, including analyses based on machine learning or deep learning techniques. It is possible to open the data directly in the code editor and run the code there to visualize the results; it is also possible to upload your own data to the platform and continue working from there.
7 Conclusion
This project has considered the use of Sentinel-2 imagery to detect deprived areas in two distinct study areas using widely available methods, and has assessed how local characteristics can be used to further improve the outcome. The research has shown that Sentinel-2 can be used to give a general overview of the location of deprived areas, even with limited training data. That said, the ambiguity and lack of common agreement on what is a deprived area and what is not, added to an image resolution of only 10 m, make this method inadequate for precise mapping of these areas. However, considering the short period of time and the low computational power used for this exercise, it is arguably possible to obtain better results by using more accurate datasets and more complex methods. The potential of Sentinel data is therefore presumably larger than presented in this project. In addition, the research found that a preliminary analysis of the local characteristics of DUAs in the given study area is key. Adding these features as input data in the classification process provides considerably higher accuracy than using spectral information alone. This opens up interesting research avenues aimed at achieving global mapping of DUAs.
Bibliography
Abascal et al. (2022). “Identifying degrees of deprivation from space using deep learning and morphological spatial analysis of deprived urban areas”.
Almaaroufi, Samar et al. (2006). “Place-Making through the Creation of Common Spaces in Lima’s Self-Built Settlements: El Ermitaño and Pampa de Cueva as Case Studies for a Regional Urbanization Strategy”.
Ampadu, Hyacinth (2021). Random Forest Understanding. url: https://ai-pool.com/a/s/randomforests-understanding.
Atlas of informality (n.d.). Explore the Atlas of Informality. url: https://www.atlasofinformality.com/.
Breiman, Leo (1999). “Random Forest—Random features”.
Burns, Ed (2018). What is a neural network? Explanation and examples. url: https://www.techtarget.com/searchenterpriseai/definition/neural-network.
ESRI (n.d.). Accuracy Assessment, ESRI. url: https://pro.arcgis.com/en/pro-app/2.8/help/analysis/image-analyst/accuracy-assessment.htm.
Gilbert, A. (2007). The return of the slum: Does language matter? url: https://doi.org/10.1111/j.1468-2427.2007.00754.x.
Google Research (2021). Open Buildings - A dataset of building footprints to support social good applications. url: https://sites.research.google/open-buildings/#download.
UN-Habitat (2018). “SDG Indicator 11.1.1 Training Module: Adequate Housing and Slum Upgrading”.
UN-habitat (2003). “The Challenge of Slums: Global Report on Human Settlements 2003”.
Holden, Chris (n.d.). Chapter 5: Classification of land cover. url: https://ceholden.github.io/opengeo-tutorial/python/chapter_5_classification.html.
IBM (2020). Neural Networks. url: https://www.ibm.com/cloud/learn/neural-networks.
javatpoint (2021). K-Nearest Neighbor(KNN) Algorithm for Machine Learning. url: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning.
Kohli, et al. (2021). “An ontology of slums for image-based classification”.
Kuffer et al. (2016). “Slums from Space—15 Years of Slum Mapping Using Remote Sensing”.
Mahabir, R. (2018). “A Critical Review of High and Very High-Resolution Remote Sensing Approaches for Detecting and Mapping Slums: Trends, Challenges and Emerging Opportunities”.
Mantero, et al. (2005). “Partially Supervised Classification of Remote Sensing Images Through SVM-Based Probability Density Estimation”.
Maxwell Owusu, Monika Kuffer (2021). “Towards user-driven earth observation-based slum mapping”.
Mountrakis, et al. (2010). “Support vector machines in remote sensing: A review”.
Nations, United (2021). The Sustainable Development Goals Report. Statistics division. url: https://unstats.un.org/sdgs/report/2021/goal-11//.
P., Helber (2018). “Mapping Informal Settlements in Developing Countries with Multi-resolution, Multi-spectral Data”.
Pal, M. (2005). Random forest classifier for remote sensing classification. url: https://www.tandfonline.com/doi/full/10.1080/014311604123312696.
Sliuzas et al. (2008). “Report of the Expert Group Meeting on Slum Identification and mapping, 2008”.
Sruthi (2021). Understanding Random Forest. url: https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/.
Tso Mather (2009). “Classification Methods for Remotely Sensed Data”.
United Nations (n.d.). Sustainable Development Goals 11. url: https://sdgs.un.org/goals/goal11/.
World population review (2022a). Lima population 2022. url: https://worldpopulationreview.com/world-cities/nairobi-population.
World population review (2022b). Nairobi population 2022. url: https://worldpopulationreview.com/world-cities/nairobi-population.
www.kibera.org.uk (n.d.). Kibera Facts Information. url: https://www.kibera.org.uk/facts-info/#:~:text=Kibera%5C%20Facts%5C%20%5C%26%5C%20Information,the%5C%20biggest%5C%20in%5C%20the%5C%20world..
List of Figures
1.0.1 Percentage population living in DUA in selected regions, excludes Australia and New Zealand, 2018.
1.0.2 Population of DUAs, estimation for 2025. The numbers are in millions
1.1.1 The six general indicators representing concepts at three spatial levels
1.3.1 Frequency of methods versus main focus for slum mapping using VHR imagery.
1.3.2 Research Structure
2.1.1 Illustration of K-nearest neighbours.
2.1.2 Illustration of Support Vector Machine.
2.1.3 Illustration of Random Forest.
2.2.1 Neural networks.
3.1.1 Study Area, Nairobi.
3.2.1 Study Area, Lima.
3.3.1 A screenshot of the webpage "Atlas of informality".
4.0.1 Method Structure
4.1.1 Close-up showing difference between Deprived and Non-Deprived (Nairobi)
4.1.2 Manually delineated DUAs in Nairobi (Nairobi)
4.1.3 Close-up showing difference between Deprived and Non-deprived
4.1.4 Manually delineated DUAs for Lima
4.2.1 Training sites
4.2.2 Training Samples, Nairobi
4.2.3 Training Samples, Lima
4.2.4 Label-data for DL model
4.2.5 The image shows a closeup of the predicted areas from random forest.
4.2.6 Post-Processing diagram
4.3.1 Confusion matrix illustration
5.1.1 x axis represents the thresholds and y axis the kappa value.
5.1.2 x axis represents the thresholds and y axis the kappa value
5.2.1 Difference between including local features or not (table), Nairobi.
5.2.2 Red is with characteristics and blue is without
5.2.3 Difference between including local features or not (table), Lima
5.2.4 Difference between including local features or not, Lima
5.3.1 Results table
5.4.1 Graphic comparison of employed methods
5.4.2 The area around the river is defined in the ground truth data as deprived area. The area to the left is standard urban settlements. The orange polygons in the image are the prediction from random forest.
5.4.3 The image shows a closeup of the predicted areas from random forest.
5.4.4 The image shows a closeup of the predicted areas classified by NN.
5.4.5 The image shows a closeup of the predicted areas classified by NN.
Appendix
The appendices are to be found in an external document attached to this project report.
Appendices in A4:
A Deprived areas maps for Nairobi and Lima
B Machine learning Python script (random forest, support vector machine, and k-nearest neighbour classification)
C Deep learning Python script (neural network classification)