11. Appendix III: Methodology for subsurface defect detection
11.1 Xception Convolutional neural networks (CNNs) have been widely used for vision-based tasks. Different networks may share similar sets of feature-extraction layers, which are referred to as the backbone. Frequently used backbones include AlexNet [80] and VGG16/19 [81]. In this study, Xception [82] is selected as the backbone. Xception is a building block for deep networks developed by Google that improves the efficiency of convolutional neural networks by introducing depth-wise separable convolutions. In a depth-wise separable convolution, the convolution kernel is applied to each channel independently to extract spatial information and features. It consists of two steps: point-wise convolution and depth-wise convolution. As shown in Figure 10, a 1x1 convolution is first applied to the input to reduce its dimension, and then an n x n convolution is applied to each channel to perform the depth-wise convolution. The extracted features are stacked and passed to the next layer.
Figure 10. Depth-wise separable convolution
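The two steps above (point-wise, then per-channel depth-wise convolution, in the order the text describes) can be sketched in plain NumPy. This is an illustrative sketch, not the study's implementation: the tensor shapes, 'valid' padding, and loop-based convolution are assumptions chosen for clarity rather than speed.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution: mixes channels at each spatial location.
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def depthwise_conv(x, k):
    """n x n convolution applied to each channel independently.
    x: (H, W, C), k: (n, n, C) -> (H - n + 1, W - n + 1, C), 'valid' padding."""
    H, W, C = x.shape
    n = k.shape[0]
    out = np.zeros((H - n + 1, W - n + 1, C))
    for i in range(H - n + 1):
        for j in range(W - n + 1):
            patch = x[i:i + n, j:j + n, :]               # (n, n, C) window
            out[i, j, :] = (patch * k).sum(axis=(0, 1))  # per-channel sum
    return out

def separable_conv(x, w_point, k_depth):
    """Depth-wise separable convolution in the order described in the text:
    point-wise (1x1) first, then depth-wise (n x n per channel)."""
    return depthwise_conv(pointwise_conv(x, w_point), k_depth)

x = np.random.rand(8, 8, 3)   # toy input: 8x8 image, 3 channels
w = np.random.rand(3, 4)      # 1x1 kernel: 3 input channels -> 4 channels
k = np.random.rand(3, 3, 4)   # one 3x3 spatial kernel per channel
y = separable_conv(x, w, k)
print(y.shape)  # (6, 6, 4)
```

The efficiency gain is that channel mixing (the 1x1 step) and spatial filtering (the per-channel step) are factored apart, so far fewer multiplications are needed than in a full n x n x C_in x C_out convolution.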
11.2 DeepLabV3+ DeepLabV3+ is a powerful semantic segmentation model developed by Google [83]. It uses an encoder-decoder architecture with Atrous spatial pyramid pooling (ASPP).
ASPP is a spatial pyramid pooling scheme based on Atrous convolution, which enables it to encode multi-scale contextual information. The top part of Figure 11 shows the Atrous convolution process, which can be expressed as equation (13).
y[j] = ∑_n x[j + r·n] w[n] (13)
where j is the output location, n ranges over the filter taps, w is the filter weight, and r is the Atrous rate, corresponding to the stride used to sample the input.
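Equation (13) can be checked with a short one-dimensional sketch. This is a minimal illustration of the formula, not code from the study; the 'valid'-style output length and the toy filter are assumptions.

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """1-D atrous (dilated) convolution, equation (13):
    y[j] = sum_n x[j + r*n] * w[n], with Atrous rate r and filter w."""
    n = len(w)
    span = r * (n - 1) + 1  # receptive field of the dilated filter
    return np.array([sum(x[j + r * m] * w[m] for m in range(n))
                     for j in range(len(x) - span + 1)])

x = np.arange(8, dtype=float)   # [0, 1, ..., 7]
w = np.array([1.0, 1.0, 1.0])   # 3-tap summing filter

print(atrous_conv1d(x, w, 1))   # rate 1 = standard convolution: [3. 6. 9. 12. 15. 18.]
print(atrous_conv1d(x, w, 2))   # rate 2 samples every other input: [6. 9. 12. 15.]
```

Note that raising r widens the receptive field (from 3 inputs to 5 here) without adding any filter weights, which is exactly why ASPP uses several rates in parallel.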
By combining the advantages of Atrous convolution and spatial pyramid pooling, ASPP is able to robustly segment objects at multiple scales, as shown in Figure 11. The encoder-decoder architecture of DeepLabV3+ allows location/spatial information to be recovered. In the encoder stage, ASPP is used to extract local features, and the output of the encoder has a much smaller spatial resolution than the input image. In the decoder stage, the encoder features are upsampled and then concatenated with the corresponding low-level features extracted at the beginning of the network. Figure 12 shows an example of the encoder-decoder architecture.
Figure 11. Atrous Spatial Pyramid Pooling
Figure 12. Encoder-Decoder Architecture
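The ASPP and decoder steps described above can be sketched together in NumPy. This is a single-channel toy sketch under stated assumptions, not DeepLabV3+ itself: the rates (1, 6, 12, 18), the shared 3x3 filter, and the nearest-neighbour upsampling stand in for the learned convolutions and bilinear upsampling of the real model.

```python
import numpy as np

def atrous_conv2d(x, w, r):
    """2-D atrous convolution of a single-channel map x with a k x k filter w
    at rate r; zero padding keeps the output the same size as the input."""
    k = w.shape[0]
    pad = r * (k // 2)
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # sample the padded input with stride r around location (i, j)
            patch = xp[i:i + 2 * pad + 1:r, j:j + 2 * pad + 1:r]
            out[i, j] = (patch * w).sum()
    return out

def aspp(x, w, rates=(1, 6, 12, 18)):
    """ASPP sketch: the same filter applied at several Atrous rates,
    with the responses stacked along a channel axis (concatenation)."""
    return np.stack([atrous_conv2d(x, w, r) for r in rates], axis=-1)

def decode(enc_feats, low_level_feats, factor):
    """Decoder sketch: upsample encoder features (nearest neighbour) to the
    low-level resolution, then concatenate along the channel axis."""
    up = enc_feats.repeat(factor, axis=0).repeat(factor, axis=1)
    return np.concatenate([up, low_level_feats], axis=-1)

x = np.random.rand(16, 16)        # encoder-stage feature map (reduced resolution)
w = np.random.rand(3, 3)
enc = aspp(x, w)                  # (16, 16, 4): one channel per Atrous rate
low = np.random.rand(64, 64, 2)   # low-level features at 4x the resolution
out = decode(enc, low, factor=4)
print(out.shape)  # (64, 64, 6)
```

The shapes trace the text directly: the encoder output is much smaller than the input, and the decoder restores resolution by upsampling and concatenating the early, high-resolution features.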