GRD Journals- Global Research and Development Journal for Engineering | Volume 4 | Issue 5 | April 2019 ISSN: 2455-5703
Survey on Feature Extraction using Neural Networks to Classify Remote Sensing Images

T. Gladima Nisia
Assistant Professor
Department of Information Technology
AAA College of Engg & Tech., Sivakasi, Tamil Nadu

Dr. S. Rajesh
Associate Professor
Department of Information Technology
AAA College of Engg & Tech., Sivakasi, Tamil Nadu
Abstract

Remote Sensing (RS) image classification is one of the key research areas in the image processing field. The most important part of this classification is the efficient extraction of features from the RS image, which is itself a complex process. In earlier days, only a limited set of features, such as spectral features, was extracted. Spectral features dominated the classification area for several years, but the spatial domain of an RS image contains more information than spectral features alone. Many studies were conducted to further improve classification accuracy, which led to the extraction of features using different neural networks, an approach that proved to increase accuracy. This paper surveys and discusses the work carried out by researchers over time to extract features using neural networks. The survey also provides a brief overview of directions for future research and improvement.

Keywords- Remote Sensing, Feature Extraction, Neural Networks, Spatial Feature, Spectral Feature
I. INTRODUCTION

In recent years, the classification of remote sensing images has become a very attractive field for researchers. An RS image carries a great deal of information in every single pixel, and it is used for land-use and land-cover mapping. Land cover is the physical cover of the Earth's surface, consisting of forest, water, bare land, saline land, mountain ranges, etc. Land use is land cover converted into a built environment, such as residential buildings, commercial buildings, transport infrastructure and agricultural land. To better understand land-cover/land-use mapping, consider a remotely sensed image of a geographical location: the land cover and land use in it have to be identified and classified. Land-cover and land-use information is required for many different kinds of spatial planning, from urban planning at the local level up to regional development, and it plays an important role in agricultural policy making. Land-cover data is also important for the proper management of natural resources and is increasingly needed to assess the impact of economic development on the environment. Hence, at various geographical levels, such information is fundamental for guiding decision making. The Earth's surface is changing at local, regional, national and global scales, and land management and land planning need the current status of the landscape. Land management depends on understanding the current land-cover status and its uses and on monitoring changes over time; the reasons for changes in land condition can also be found easily through land-cover mapping. With these applications in mind, the classification of RS images has to be done efficiently. Neural networks are employed to obtain features from RS images, which in turn are used to classify each and every pixel of the image. The remainder of the paper is organised as follows: Section II discusses some of the neural networks used for extracting features from RS images, Section III discusses neural networks that are trained layer-wise, and Section IV concludes the study.
II. DEEP NEURAL NETWORK ARCHITECTURE

A. Deep Belief Network

A deep belief network (DBN) is a class of deep neural network introduced by Geoffrey Hinton and his students in 2006 [3][4]. A DBN is composed of multiple layers of latent variables ("hidden units"); there are connections between the layers, but no connections between units within a layer. When trained on a set of examples without supervision, a DBN can learn to probabilistically reconstruct its inputs, and the layers then act as feature detectors. After this learning step, a DBN can be further trained with supervision to perform classification. DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network's hidden layer serves as the visible layer for the next. This stack of RBMs may end with a softmax layer to create a classifier, or it may simply help cluster unlabelled data in an unsupervised learning scenario. Each hidden layer of a DBN serves as the output of the layer below it and the input of the layer above it.
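To make this stacked structure concrete, the following is a minimal numpy sketch (not code from the surveyed works) of how a DBN-style stack propagates an input through successive hidden layers, with an optional softmax layer on top for classification. The layer sizes, the 64-band pixel vector and the five land-cover classes are illustrative assumptions, and random weights stand in for the weights an actual DBN would learn during pre-training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Layer sizes: a 64-band input pixel vector, then two hidden layers.
# Each hidden layer's output acts as the visible input of the layer
# above it; random weights stand in for pre-trained RBM weights.
layer_sizes = [64, 32, 16]
weights = [0.01 * rng.standard_normal((a, b))
           for a, b in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(b) for b in layer_sizes[1:]]

def dbn_features(x):
    """Propagate a batch of inputs through the stack; the activations
    of the top hidden layer are the extracted features."""
    for W, b in zip(weights, biases):
        x = sigmoid(x @ W + b)
    return x

# An optional softmax layer on top turns the feature extractor into a
# classifier, here over 5 hypothetical land-cover classes.
W_out = 0.01 * rng.standard_normal((16, 5))
pixels = rng.random((8, 64))                   # a batch of 8 pixels
probs = softmax(dbn_features(pixels) @ W_out)  # (8, 5) class probabilities
print(probs.shape)
```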
1) Training

The training process of a DBN can be divided into two stages: pre-training and fine-tuning. In the pre-training stage, unsupervised learning is carried out in the bottom-up direction for feature extraction, while in the fine-tuning stage a supervised, top-down learning algorithm is used. The improved performance of DBNs can be largely attributed to the pre-training stage, in which the initial weights of the network are learned from the structure of the input data. Compared with randomly initialized weights, these weights are closer to the global optimum and can therefore yield better performance.

B. Deep Boltzmann Machine

A Deep Boltzmann Machine (DBM) is an unsupervised, probabilistic, generative model with entirely undirected connections between its layers [5]. It contains visible units and multiple layers of hidden units. Like deep belief networks, DBMs have the potential to learn internal representations that become increasingly complex, which is considered a promising way of solving object and speech recognition problems. A DBM has nonlinear activation functions and can learn complex relationships between features as well as high-level representations of them.

1) Training

The hidden units act as latent variables (features) that allow the Boltzmann machine to model distributions over visible state vectors that cannot be modelled by direct pairwise interactions between the visible units. The learning rule of the Boltzmann machine remains unchanged when hidden units are added, so it is possible to learn binary features that capture higher-order structure in the data.

C. Stacked Autoencoders

A Stacked Autoencoder (SAE) [6] stacks autoencoders into hidden layers using an unsupervised layer-wise learning algorithm and is then fine-tuned by a supervised method. The working of an SAE has three steps. First, the first autoencoder is trained on the input data and the learned feature vector is obtained. Second, the feature vector of the preceding layer is used as the input for the next layer, and this procedure is repeated until training completes. Third, after all the hidden layers are trained, the backpropagation (BP) algorithm is used to minimize the cost function and update the weights with a labelled training set to achieve fine-tuning.

1) Training

Stacked autoencoders use greedy layer-wise training to obtain their parameters. First, the first layer is trained on the raw input to obtain its weight and bias parameters W(1,1), W(1,2), b(1,1), b(1,2). This first layer transforms the raw input into a vector A consisting of the activations of its hidden units. The second layer is then trained on this vector to obtain its weight and bias parameters W(2,1), W(2,2), b(2,1), b(2,2). The same procedure is repeated for the remaining layers, using the output of each layer as the input of the subsequent layer, so the parameters of each layer are trained individually. Afterwards, fine-tuning is done using backpropagation.

D. Stacked Denoising Autoencoders

A Stacked Denoising Autoencoder (SDA) [8] is a stack of denoising autoencoders. It is an autoencoder with multiple layers, except that its training differs from that of a multi-layered neural network: unsupervised pre-training is done layer by layer as the input is fed through. The input, which may contain noise, is passed through the hidden layer; an output is generated, and the loss between the output and the original input is calculated. The process continues until the loss is minimized.
Then the full data set is passed through the network and the data present in the hidden layer is collected; this becomes the new input. Noise is added to this collected input and the same procedure is followed thereafter. After the process is done with the last layer, the data collected in the last hidden layer is the new representation of the data.

1) Training

The network is trained to recover the input from a corrupted version of it. After pre-training has been completed to conduct feature selection and extraction on the input from the preceding layer, a second stage of supervised fine-tuning can follow. Once the first k layers are trained, the (k+1)-th layer can be trained, because it is now possible to compute the code, or latent representation, from the layer below. The entire network is then trained in the same way as a multilayer perceptron, considering only the encoding part of each autoencoder. This stage is supervised.
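The layer-by-layer procedure described above can be sketched as follows. This is a minimal illustration, not code from the surveyed papers: each denoising autoencoder zeroes a random fraction of its input, learns to reconstruct the clean version by gradient descent on the squared reconstruction error, and the clean hidden code then becomes the input of the next layer. The layer sizes, noise level and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_denoising_ae(x_clean, n_hidden, epochs=50, lr=0.1, noise=0.3):
    """Train one denoising autoencoder: reconstruct the *clean* input
    from a corrupted copy, minimizing squared error by gradient descent."""
    n_in = x_clean.shape[1]
    W = 0.01 * rng.standard_normal((n_in, n_hidden))   # encoder weights
    W2 = W.T.copy()                                    # decoder weights
    b_h, b_o = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        # corrupt the input by randomly zeroing a fraction of entries
        mask = rng.random(x_clean.shape) > noise
        h = sigmoid((x_clean * mask) @ W + b_h)
        x_hat = sigmoid(h @ W2 + b_o)
        # backpropagate the reconstruction error
        d_out = (x_hat - x_clean) * x_hat * (1 - x_hat)
        d_hid = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out / len(x_clean)
        b_o -= lr * d_out.mean(axis=0)
        W -= lr * (x_clean * mask).T @ d_hid / len(x_clean)
        b_h -= lr * d_hid.mean(axis=0)
    return W, b_h

def pretrain_sda(data, hidden_sizes):
    """Layer-wise pre-training: the clean hidden code of each trained
    layer becomes the 'new input' that the next layer corrupts and learns."""
    stack, x = [], data
    for n_hidden in hidden_sizes:
        W, b_h = train_denoising_ae(x, n_hidden)
        stack.append((W, b_h))
        x = sigmoid(x @ W + b_h)   # clean pass -> input of the next layer
    return stack

stack = pretrain_sda(rng.random((100, 64)), [32, 16])
```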
III. LAYER-WISE TRAINING FRAMEWORK

A. Restricted Boltzmann Machine

Restricted Boltzmann Machines (RBMs) [7] are two-layer neural networks that constitute the building blocks of deep belief networks. The first layer of an RBM is the input (visible) layer and the second is the hidden layer; for example, an RBM may consist of one input layer (v1, ..., v6), one hidden layer (h1, h2) and the corresponding bias vectors a and b. The absence of an output layer is apparent, but none is needed, since predictions are made differently. The nodes are connected to each other across layers, but no two nodes of the same layer are linked. The RBM is a stochastic neural network.
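A minimal sketch of this structure, and of one round of the Gibbs sampling used in training (discussed next), is given below. It assumes binary units and the six-visible/two-hidden layout mentioned above; the weight scale and starting state are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A small RBM matching the description: 6 visible units, 2 hidden units,
# bias vectors a (visible) and b (hidden), weights only *between* layers.
n_v, n_h = 6, 2
W = 0.1 * rng.standard_normal((n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)

def sample_h_given_v(v):
    """All hidden units update in parallel: with no connections inside a
    layer, the units are conditionally independent given the other layer."""
    p = sigmoid(v @ W + b)
    return (rng.random(n_h) < p).astype(float)

def sample_v_given_h(h):
    p = sigmoid(h @ W.T + a)
    return (rng.random(n_v) < p).astype(float)

# One round of Gibbs sampling, v -> h -> v'; repeating this alternation
# drives the chain towards the equilibrium distribution.
v = (rng.random(n_v) < 0.5).astype(float)   # random starting state
h = sample_h_given_v(v)
v_new = sample_v_given_h(h)
```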
1) Training

Gibbs sampling is used to train an RBM. Starting randomly from either layer, Gibbs sampling is performed to generate data from the RBM: once the states of the units in one layer are given, all the units in the other layer are updated, and this update process carries on until the equilibrium distribution is reached. The weights of the RBM are then obtained by maximizing its likelihood.

B. Autoencoder

An Autoencoder (AE) is a simple three-layer, unsupervised neural network in which the output units are directly connected back to the input units [1]. Here, the number of hidden units is much smaller than the number of visible ones. It applies backpropagation with the target values set equal to the inputs; that is, the AE is trained to copy its input to its output, and the hidden layer is used to represent the input. An AE is a one-hidden-layer feed-forward neural network similar to the MLP. The difference between an MLP and an AE is that the aim of the AE is to reconstruct the input, while the purpose of the MLP is to predict target values for given inputs; consequently, the numbers of nodes in the input layer and the output layer of an AE are identical. In the coding process, the AE first converts the input vector x into a hidden representation h using a weight matrix ω; in the decoding process, it maps h back to the original input space to obtain x̃ using another weight matrix ω′. Theoretically, ω′ should be the transpose of ω. Parameter optimization is performed to minimize the average reconstruction error, and the mean squared error (MSE) is used to measure the accuracy of reconstruction.

1) Training

The training process for an AE can also be divided into two stages: the first learns features using unsupervised learning, and the second fine-tunes the network using supervised learning. Specifically, in the first stage, feed-forward propagation is performed for each input to obtain the output value x̃; the squared error is used to measure the deviation of x̃ from the input value, and the error is backpropagated through the network to update the weights. In the fine-tuning stage, with the network having suitable features at each layer, the standard supervised learning method is adopted and the gradient descent algorithm is used to adjust the parameters of each layer.

C. Convolutional Neural Network

The Convolutional Neural Network (CNN) [9] is a special kind of feed-forward neural network. In a traditional neural network the layers of neurons are one-dimensional, whereas in a CNN the layers have three dimensions: height, width and depth. To reduce the number of parameters, the CNN relies on two important concepts, local connectivity and parameter sharing. The architecture of a CNN is shown in Fig. 1.
Fig. 1: Architecture of CNN
There are three main types of layers used to build CNN architectures: (1) the convolutional layer, (2) the pooling layer, and (3) the fully-connected layer. The fully-connected layer is like that of a regular neural network, the convolutional layer performs convolution many times, and the pooling layer can be thought of as downsampling, for example by taking the maximum of each 2 x 2 block of the previous layer.

D. Locally Connected Network

In image processing, the information of an image lies in its pixels, but a fully connected network over them results in too many parameters: a 512 x 512 RGB image would require 512 x 512 x 3 = 786,432 weights per neuron. Such a large number of parameters makes processing very slow and leads to overfitting. From the study of images and optical systems, it is understood that the features in an image are usually local, and the visual system notices low-level features first. So it is possible to reduce the fully connected network to a locally connected network, which is one of the main ideas of the CNN. As in most image processing, a square block of the image is connected to a neuron; the block size can be, for instance, 3 x 3 or 5 x 5, and the block acts like the feature window used in some image processing tasks.
In this way, the number of parameters is greatly reduced without lowering performance. To extract more features, the same block is connected to another neuron; the depth of a layer is the number of times the same area is connected to different neurons, and the stride is the shifting distance of the window. For example, with stride 1 and window size 3 x 3 on a 7 x 7 x 3 image without zero-padding, there are 5 x 5 x depth neurons in the next layer. If the stride is changed from 1 to 2 and everything else remains the same, there are 3 x 3 x depth neurons in the next layer. In general, with stride s and window size w x w on a W x H image, there are [(W - w)/s + 1] x [(H - w)/s + 1] x depth neurons in the next layer.

E. Parameter Sharing

Suppose, for example, that with stride 1, window size 5 x 5, zero-padding and depth 5, there are 32 x 32 x 5 neurons in the next layer. Each neuron has 5 x 5 x 3 = 75 parameters (weights), so there are 75 x 32 x 32 x 5 = 384,000 parameters in the next layer. The idea is to share the parameters within each depth slice: the 32 x 32 neurons in each depth slice use the same parameters, so there are only 5 x 5 x 3 = 75 parameters per depth slice and 75 x 5 = 375 parameters in total. This greatly decreases the number of parameters, and with sharing in place, the neurons in each depth slice of the next layer simply apply a convolution to the image.

F. Activation Function

In the traditional neuron model, the sigmoid function is often used as the activation function, but other choices are available. One of them is the Rectified Linear Unit (ReLU), whose function is f(x) = max(0, x). Krizhevsky et al. [9] compared the performance of the ReLU and the sigmoid as activation functions in CNNs and found that the ReLU model needs fewer iterations to reach the same training error rate.

G. Pooling Layer

Although locally connected networks and parameter sharing are used, there are still many parameters in the network, which, relative to a small dataset, can cause overfitting. So pooling layers are often inserted into the network; they progressively reduce the number of parameters and hence the computation time. A pooling layer operates on each depth slice of the previous layer independently, so the depth of the next layer is the same as that of the previous layer. As in the convolutional layer, the stride, i.e. the number of pixels by which the window moves, can also be set. Pooling is illustrated in Fig. 2.
Fig. 2: A simple example of the pooling layer
Note that there are two types of pooling layer. If the window size equals the stride, it is traditional pooling; if the window size is larger than the stride, it is overlapping pooling. In practice, a window size of 2 x 2 with stride 2 is used for traditional pooling, and a window size of 3 x 3 with stride 2 for overlapping pooling. In addition to max pooling, other functions can also be used: taking the average of the window to represent the value in the next layer is called average pooling, and using the L2-norm is called L2-norm pooling.
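As a small illustration (assuming a single-channel input; the function name and sizes are ours), the sketch below implements max pooling with a configurable window and stride, using the same output-size formula as the convolutional layer.

```python
import numpy as np

def max_pool(x, w=2, s=2):
    """Max pooling over one channel: slide a w x w window with stride s
    and keep the maximum of each block.  The output side length follows
    the same formula as the convolutional layer: (W - w)/s + 1."""
    H, W = x.shape
    out_h, out_w = (H - w) // s + 1, (W - w) // s + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*s:i*s+w, j*s:j*s+w].max()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(x, w=2, s=2))          # traditional pooling: 3 x 3 output
print(max_pool(x, w=3, s=2).shape)    # overlapping pooling: window > stride
```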
IV. CONCLUSION

This paper has discussed the different neural networks previously used for extracting features from remote sensing images, how the individual networks are trained, and how layer-wise training works for each of them. Every network has its own advantages and disadvantages. After long and tedious effort by various researchers, the CNN has been found to work best for feature extraction, but even the CNN has some practical disadvantages, and these issues remain to be handled and overcome in future work.
REFERENCES
[1] G. E. Hinton and R. S. Zemel, "Autoencoders, minimum description length, and Helmholtz free energy," in Advances in Neural Information Processing Systems, 1994.
[2] G. E. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, July 2006.
[3] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, 2006.
[4] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in Proc. NIPS Workshop, Dec. 2009.
[5] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann machines," in Proc. AISTATS, 2009, pp. 448-455.
[6] Y. Qi, Y. Wang, X. Zheng, and Z. Wu, "Robust feature learning by stacked autoencoder with maximum correntropy criterion," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2014.
[7] G. E. Hinton, "A practical guide to training restricted Boltzmann machines," Dept. of Computer Science, University of Toronto, Tech. Rep. UTML TR 2010-003, 2010.
[8] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion," Journal of Machine Learning Research, vol. 11, pp. 3371-3408, 2010.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, 2012.