Underground Pipeline Mapping Based on Dirichlet Process Mixture Model


Received May 18, 2020, accepted June 4, 2020, date of publication June 29, 2020, date of current version July 7, 2020. Digital Object Identifier 10.1109/ACCESS.2020.3005420

QINGYUAN WU, XIREN ZHOU, AND HUANHUAN CHEN (Senior Member, IEEE)
UBRI, School of Computer Science and Technology, University of Science and Technology of China (USTC), Hefei 230027, China

Corresponding author: Huanhuan Chen (hchen@ustc.edu.cn)

ABSTRACT Underground pipeline mapping is important in urban construction. Few procedures exist for mapping underground pipelines with ground penetrating radar (GPR) when the number of buried pipelines is unknown. This paper introduces an automatic pipeline mapping model, the Dirichlet Process Pipeline Mapping Model (DPPMM), which takes GPR and Global Positioning System (GPS) data as input. By combining GPR and GPS data, the position, direction, depth and size of the pipelines can be estimated. DPPMM also estimates the number of buried pipelines at the detection site automatically, without any prior knowledge. The model calculates the probability of each survey point belonging to each pipeline and estimates the pipeline directions and locations. The experimental results demonstrate that this model obtains more accurate pipeline maps than other state-of-the-art algorithms in various experimental settings.

INDEX TERMS Ground penetrating radar (GPR), pipeline mapping, clustering, nonparametric Bayesian model.

I. INTRODUCTION

Underground pipeline mapping is an important part of urban construction because it helps avoid inaccurate excavation during pipe maintenance and rehabilitation. Ground Penetrating Radar (GPR) is widely used for detecting underground pipelines because it is non-destructive. Figure 1 shows how buried pipelines are detected when the GPR is moved over a pipeline. As shown in Figure 2, when the GPR is moved perpendicularly to a pipeline,¹ a hyperbola appears in the B-scan image [1]. The depth and radius of buried pipelines can be estimated by fitting the hyperbolas.

FIGURE 1. These figures show how a pipeline is detected by GPR. When the GPR is moved over a pipeline, a hyperbola appears in the B-scan image. The detected pipeline is considered to be perpendicular to the moving direction of the GPR, and the angle between the detected and real pipelines is treated as bias.

FIGURE 2. A B-scan image with a hyperbola in it. Panel (a) is the original data measured by GPR. Panel (b) shows the result after transforming it to a binary image and fitting the hyperbola. The horizontal axis of the image represents the moving direction of the GPR. The vertical axis represents the time of receiving the echo. Since the velocity of the radar wave is considered to be constant, the vertical axis is proportional to depth and the top of the image represents the ground.

There are already several effective algorithms for detecting pipelines from the B-scan images of a GPR [2]–[11]. In our previous work [10]–[12], an automatic GPR B-scan image detection algorithm was proposed, which detects and fits the hyperbolas in B-scan images and obtains the depth and size of the pipes. The B-scan images are transformed to binary images by a thresholding method based on the gradient information, and the discrete noisy points are removed by opening and closing operations. Then the open-scan clustering algorithm (OSCA), the parabolic fitting-based judgment (PFJ) method, and the restricted algebraic-distance-based fitting (RADF) algorithm are adopted to detect and fit the hyperbolas in B-scan images.

By connecting the GPR with a Global Positioning System (GPS) receiver, the positions of the survey points can also be estimated. After a site has been surveyed, the survey points measured by GPR and GPS are combined to map the buried pipelines. The points that belong to the same pipeline should be assigned to the same cluster.² Then the direction and location of each pipeline are estimated, and a pipeline map is generated by combining all the pipelines.

Mapping underground pipelines from these survey points faces two main challenges: lack of prior knowledge and variety of environment. Lack of prior knowledge means that, in most applications, the number of pipelines is not accurately known before detection.³ Variety of environment means that the data collected by GPR and GPS may be noisy. The depth and radius estimated from GPR could be inaccurate due to the system noise, the heterogeneity of the medium, and mutual wave interactions. Among high-rise buildings in a city, GPS signals are often blocked, which can lead to errors in the location estimation.

In this paper, the Dirichlet Process Pipeline Mapping Model (DPPMM) is proposed to map the buried pipelines from GPR and GPS data. It is based on a nonparametric Bayesian model, i.e. the Dirichlet Process Mixture Model (DPMM). Bayesian models consist of parametric and nonparametric models. In the parametric Bayesian models [13]–[15], a prior is imposed on the parameters of the model to encode prior knowledge. As a probability model, DPPMM calculates the probability of each survey point belonging to each cluster and assigns the point to the cluster with the maximum probability. When the probability of a new cluster is larger than that of every existing cluster, the new cluster is automatically added to the pipeline map. As a result, DPPMM can estimate the number of pipelines without any prior knowledge. Besides, each pipeline is defined as a distribution in this model, and the pipeline map is a mixture distribution composed of these pipelines. By changing the parameters of these distributions, e.g. the variances, this model can work in various environments.

There have been several kinds of clustering algorithms based on the Dirichlet Process [16]–[19]. In particular, Kobayashi and Nakano [20] proposed a GPR signal processing algorithm based on DPMM, which detects and fits the hyperbolas in B-scan images with the help of the Dirichlet Process Crescent-signal Mixture Model (DPCMM). In most cases, the base distribution⁴ is used when calculating the probability of a survey point belonging to a new cluster. Since the base distribution is complex, Gibbs sampling is adopted to simulate it in many algorithms [21]–[25]. However, when mapping pipelines, the base distribution is unknown, so Gibbs sampling can hardly work. In this paper, a randomly sampled survey point from the input dataset is used to simulate the base distribution.

The main contributions of this paper are summarized below:
1) Because the model is nonparametric, some parameters, such as the number of pipelines, are estimated automatically during mapping.
2) By changing the parameters of the distributions, e.g. the variances, this model can work in various environments.
3) By randomly sampling survey points from the dataset, this model can simulate the base distribution without any prior knowledge.

¹ Theoretically, a hyperbola appears in the B-scan image only when the GPR is moved exactly perpendicularly to the pipeline. In real operation, this requirement is difficult to meet. In this paper, when the GPR is moved within the valid range of the pipeline (shown in Figure 1), the shape in the B-scan image is roughly considered to be a hyperbola.
² In this paper, each cluster represents a pipeline, which means that the shape of each cluster is a straight line segment.
³ Although most cities have their own records of underground pipelines, it is difficult to ensure the accuracy of these records. It is common for parts of the records to be missing or out of date as the city develops.

The associate editor coordinating the review of this manuscript and approving it for publication was Francesco Benedetto.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

VOLUME 8, 2020

Q. Wu et al.: Underground Pipeline Mapping Based on Dirichlet Process Mixture Model

II. DATASETS AND PROBABILITY MODEL

In this section, the input and output data are reported. Then the probability model, the Dirichlet Process Mixture Model, and how it works with the survey points are introduced.

A. INPUT AND OUTPUT

As shown in Figure 1, when the GPR is moved over a pipeline, a hyperbola appears in the B-scan image. By fitting the equation of the hyperbola, the depth and radius of the pipeline can be estimated. Besides, by connecting the GPR with a GPS receiver, the position and the moving direction can also be measured. Since the GPS signals are received periodically, the GPR is assumed to move in a straight line at constant speed during each time slot. Although the moving direction may not be exactly perpendicular to the real pipeline, the detected pipeline is considered to be perpendicular to the moving direction, and the angle between the detected and real pipelines is treated as bias. In conclusion, the dataset of detected points is described as

$$X = \{X_i \mid X_i = (x_i, y_i, \theta_i, z_i, r_i),\ 1 \le i \le N\}, \tag{1}$$

where $X_i$ is the i-th survey point, $(x, y)$ is the position of the survey point, $\theta$ is the direction, $z$ is the depth, $r$ is the radius, and $N$ is the number of survey points.

⁴ The base distribution is a parameter of DPMM. It contains the prior knowledge of the mixture distribution. In this paper, it means the prior probability of the pipeline map, which is unknown.

When detecting at real-world sites, the GPR data can be noisy, so the depth and radius estimates can be inaccurate. The depth is estimated from the equation of the hyperbola in the B-scan image, which assumes that the ground of the experimental site is flat and horizontal. To make the hyperbolas clear in the B-scan images, the GPR is moved slowly during experiments (1 to 2 m/s). The biases of depth and radius are mainly influenced by the pipeline detection algorithm. When the velocity of the GPR is constant and the hyperbolas are all clear in the B-scan images, the hyperbola fitting algorithm can estimate the depth and radius accurately, and the biases are small (several centimeters). However, the ground at the experimental sites cannot be exactly flat, and the velocity of the GPR cannot be constant all the time. As a result, the biases of depth and radius are slightly larger than expected. Since the exact location and direction of the pipelines are unknown, the GPR might not be moved perpendicularly to the pipeline, which means the direction also contains error. Besides, in real applications, non-pipeline objects such as stones might be mistaken for pipelines, which generates noisy points at the detection site. The location data from GPS can also contain errors due to interference from buildings and trees. In this paper, it is assumed that the biases of location, depth and radius follow Gaussian distributions. The bias of direction is assumed to follow a uniform distribution, since all angles in the valid range are equally probable.

The output of the model is a cluster assignment of all the survey points along with the pipeline equations, which is described as

$$C = \{c_i \mid 1 \le c_i \le L,\ 1 \le i \le N\}, \qquad M = \{(\theta_l, b_l) \mid 1 \le l \le L\}, \tag{2}$$

where $C$ is the cluster assignment, $M$ is the pipeline map, $L$ is the number of clusters, $N$ is the number of survey points in the dataset, $c_i$ is the cluster assignment of the i-th survey point and $(\theta_l, b_l)$ are the parameters of the l-th pipeline. Since pipelines are all considered to be straight and of limited length, the equation of a pipeline is defined as a segment of a straight line,

$$\sin\theta_l \cdot x + \cos\theta_l \cdot y + b_l = 0, \qquad x_{\min} \le x \le x_{\max},\ y_{\min} \le y \le y_{\max}, \tag{3}$$

where $x_{\min}, x_{\max}, y_{\min}, y_{\max}$ are the boundary values of the positions of the survey points in each cluster. The model outputs the pipeline map with the maximum probability.
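The input and output of Equations (1) and (3) can be sketched as two small records. This is only an illustrative data layout: the class names `SurveyPoint` and `PipelineLine`, the method `signed_distance`, and the sample values are assumptions made for this sketch, not names used by the authors.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyPoint:
    """One survey point X_i = (x, y, theta, z, r) from Equation (1)."""
    x: float      # position along the x-axis (m)
    y: float      # position along the y-axis (m)
    theta: float  # direction of the detected pipeline (rad)
    z: float      # estimated depth (m)
    r: float      # estimated radius (m)


@dataclass(frozen=True)
class PipelineLine:
    """Line sin(theta)*x + cos(theta)*y + b = 0 from Equation (3)."""
    theta: float
    b: float

    def signed_distance(self, p: SurveyPoint) -> float:
        # (sin(theta), cos(theta)) is a unit normal, so the left-hand side of
        # Equation (3) is the signed perpendicular distance from p to the line.
        return math.sin(self.theta) * p.x + math.cos(self.theta) * p.y + self.b


# Hypothetical example: the horizontal pipeline y = 100 (theta = 0, b = -100).
line = PipelineLine(theta=0.0, b=-100.0)
point = SurveyPoint(x=12.0, y=101.5, theta=0.0, z=1.0, r=0.3)
offset = line.signed_distance(point)
```

Because the normal vector has unit length, `offset` is directly comparable with the position bias of the model.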

B. DIRICHLET PROCESS MIXTURE MODEL

The Dirichlet Process Mixture Model (DPMM) is adopted as the probability model of the survey points. DPMM is a nonparametric Bayesian model. In this model, the underground pipeline map is considered to be a mixture distribution $G$, and the detection dataset $X$ is considered to be sampled from this distribution. The distribution $G$ is defined as

$$G = \sum_{i=1}^{L} \pi_i \cdot f(\Theta_i), \qquad \sum_{i=1}^{L} \pi_i = 1, \tag{4}$$

where $\pi_i$ represents the weight of the i-th pipeline, $f(\cdot)$ represents the distribution of data sampled from a single pipeline, and $\Theta_i$ is the parameter set of the i-th pipeline. The parameter set is defined as

$$\Theta = (\mu_b, \sigma_b, \mu_\theta, \sigma_\theta, \mu_z, \sigma_z, \mu_r, \sigma_r), \tag{5}$$

where $\mu_b$ and $\mu_\theta$ define the straight line of the pipeline, $\text{line}: \sin\mu_\theta \cdot x + \cos\mu_\theta \cdot y + \mu_b = 0$, $\mu_z$ represents the depth of the pipeline, $\mu_r$ represents the radius of the pipeline, and $\sigma_b, \sigma_\theta, \sigma_z, \sigma_r$ represent the biases determined by the antennas. The survey points are independent of each other. The parameters and the data follow the distributions below.

$$\Theta \sim G, \tag{6}$$

$$X \sim f(\Theta). \tag{7}$$

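In detail,

$$\begin{aligned}
(x, y) &\sim \mathcal{N}(\text{line}, \sigma_b^2), \\
\theta &\sim \mathcal{U}(\mu_\theta - \sigma_\theta,\ \mu_\theta + \sigma_\theta), \\
z &\sim \mathcal{N}(\mu_z, \sigma_z^2), \\
r &\sim \mathcal{N}(\mu_r, \sigma_r^2),
\end{aligned} \tag{8}$$

where $\mathcal{N}(\cdot)$ and $\mathcal{U}(\cdot)$ represent the normal and the uniform distributions. These four distributions are independent of each other. For convenience of calculation, Equation (8) can be transformed into

$$\begin{aligned}
\tilde{d} &= \sin\mu_\theta \cdot x + \cos\mu_\theta \cdot y + \mu_b \sim \mathcal{N}(0, \sigma_b^2), \\
\tilde{\theta} &= |\theta - \mu_\theta| \sim \mathcal{U}(0, \sigma_\theta), \\
\tilde{z} &= z - \mu_z \sim \mathcal{N}(0, \sigma_z^2), \\
\tilde{r} &= r - \mu_r \sim \mathcal{N}(0, \sigma_r^2),
\end{aligned} \tag{9}$$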

where $\tilde{d}, \tilde{\theta}, \tilde{z}, \tilde{r}$ represent the differences between the detected data and the real pipeline data. Since the domain of a normally distributed random variable is the real number field, $\tilde{d}, \tilde{z}, \tilde{r}$ can be negative. The mixture distribution is supposed to follow a Dirichlet Process,

$$G \sim DP(\alpha, H), \tag{10}$$

where $\alpha$ is an implicit parameter that represents the concentration and $H$ is the base distribution of $G$.
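To make Equations (8) and (9) concrete, the following sketch draws one survey point from a single pipeline and then recovers its residuals. It is an illustration under stated assumptions rather than the authors' generator: the function names, the tuple layout, and in particular `t_range` (how far along the pipeline a point may fall) are hypothetical, since the model only bounds each segment by its observed survey points.

```python
import math
import random


def sample_survey_point(mu_theta, mu_b, mu_z, mu_r,
                        sigma_b, sigma_theta, sigma_z, sigma_r,
                        t_range=(0.0, 50.0), rng=random):
    """Draw one survey point (x, y, theta, z, r) from a single pipeline.

    t_range (position along the pipeline, in metres) is an assumption of
    this sketch and is not part of the parameter set of Equation (5).
    """
    # Unit normal and unit direction of the line sin(mu)*x + cos(mu)*y + b = 0.
    nx, ny = math.sin(mu_theta), math.cos(mu_theta)
    dx, dy = math.cos(mu_theta), -math.sin(mu_theta)

    t = rng.uniform(*t_range)        # where along the pipeline the point lies
    d = rng.gauss(0.0, sigma_b)      # perpendicular offset, N(0, sigma_b^2)
    x = -mu_b * nx + t * dx + d * nx
    y = -mu_b * ny + t * dy + d * ny

    theta = rng.uniform(mu_theta - sigma_theta, mu_theta + sigma_theta)
    z = rng.gauss(mu_z, sigma_z)
    r = rng.gauss(mu_r, sigma_r)
    return x, y, theta, z, r


def residuals(point, mu_theta, mu_b, mu_z, mu_r):
    """Residuals (d~, theta~, z~, r~) of Equation (9)."""
    x, y, theta, z, r = point
    d = math.sin(mu_theta) * x + math.cos(mu_theta) * y + mu_b
    return d, abs(theta - mu_theta), z - mu_z, r - mu_r
```

For example, the pipeline $y = 100$ corresponds to `mu_theta=0.0` and `mu_b=-100.0`; sampling with `sigma_b=1.0` scatters the points around that line with a standard deviation of one metre.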



FIGURE 3. This figure shows how the DPMM explains the process of detecting survey points. The mixture distribution G is the pipeline map that we want to obtain. The distribution L represents a pipeline. The survey points are the input of the clustering algorithm.

As shown in Figure 3, the process of detecting a survey point can be regarded as sampling from the mixture distribution. The mixture distribution $G$ is the pipeline map. First, a pipeline is chosen from the pipeline map, which means sampling a parameter set $\Theta$ from the mixture distribution $G$. Then a survey point is detected on that pipeline, which means sampling a survey point from the distribution $L$ with parameter set $\Theta$. By repeating this process to get $N$ different survey points, a dataset is sampled from the mixture distribution. The clustering algorithm in the next section shows how to fit the pipelines from the dataset.

III. DIRICHLET PROCESS PIPELINE MAPPING MODEL CLUSTERING

In this section, the equations used to calculate the probabilities are introduced. Then, the Dirichlet Process Pipeline Mapping Model (DPPMM) clustering algorithm is proposed in Subsection B.

A. FORMULAS FOR PROBABILITY CALCULATION

When detecting pipelines by GPR, the dataset is always in a specific order determined by the detection route. However, when sampling from a mixture distribution, the order of the data should be random. To simulate this, the dataset is randomly shuffled, and the shuffled dataset is the input of the model. Then, the dataset is processed sequentially to calculate the probabilities of each survey point belonging to the different pipelines. For the i-th survey point, suppose there are $k$ pipelines formed by the former $i - 1$ survey points. The probability of it belonging to the l-th ($1 \le l \le k$) pipeline is

$$P(c_i = l \mid c_{1 \ldots i-1}, X_{1 \ldots i}) \propto P(c_i = l \mid c_{1 \ldots i-1})\, P(X_i \mid \Theta_l). \tag{11}$$

The first factor $P(c_i = l \mid c_{1 \ldots i-1})$ is the prior probability of the Dirichlet Process,

$$P(c_i = l \mid c_{1 \ldots i-1}) = \frac{n_l}{\alpha + i - 1}, \tag{12}$$

where $n_l$ represents the number of survey points belonging to the l-th pipeline among the first $i - 1$ points. This factor is the prior probability of the i-th point belonging to the l-th pipeline without considering the data of the points.

The second factor $P(X_i \mid \Theta_l)$ is the likelihood of the i-th point being sampled from the l-th pipeline. $\Theta_l$ denotes the parameters of the l-th pipeline fitted to the former survey points belonging to it. Since $\tilde{d}, \tilde{\theta}, \tilde{z}, \tilde{r}$ are independent of each other, it can be calculated as below.

$$P(X_i \mid \Theta_l) = p(\tilde{\theta} \mid \sigma_\theta)\, p(\tilde{d} \mid \sigma_b)\, p(\tilde{z} \mid \sigma_z)\, p(\tilde{r} \mid \sigma_r), \tag{13}$$

$$p(\tilde{\theta} \mid \sigma_\theta) = \begin{cases} \dfrac{1}{\sigma_\theta} & (\tilde{\theta} \le \sigma_\theta) \\ 0 & (\tilde{\theta} > \sigma_\theta) \end{cases} \tag{14}$$

$$p(\tilde{d} \mid \sigma_b) = \frac{1}{\sqrt{2\pi}\,\sigma_b} \exp\left(-\frac{\tilde{d}^2}{2\sigma_b^2}\right). \tag{15}$$

The last two factors of Equation (13) can be calculated in the same way as Equation (15).

The probability of the i-th survey point belonging to the $(k+1)$-th pipeline, i.e. a new pipeline, is

$$P(c_i = k + 1 \mid c_{1 \ldots i-1}, X_{1 \ldots i}) \propto P(c_i = k + 1 \mid c_{1 \ldots i-1})\, P(X_i \mid \Theta_{k+1}). \tag{16}$$

The first factor $P(c_i = k + 1 \mid c_{1 \ldots i-1})$ is the prior probability. Since $\sum_{l=1}^{k+1} P(c_i = l \mid c_{1 \ldots i-1}) = 1$ and the $k$ existing pipelines together contain the $i - 1$ previous points ($\sum_{l=1}^{k} n_l = i - 1$),

$$P(c_i = k + 1 \mid c_{1 \ldots i-1}) = \frac{\alpha}{\alpha + i - 1}. \tag{17}$$

The second factor is the likelihood. No survey point belongs to the $(k+1)$-th pipeline yet, so $\Theta_{k+1}$ cannot be estimated. As a result, the marginal distribution of $X$ is used to calculate the likelihood,

$$P(X_i \mid \Theta_{k+1}) = \int P(X_i \mid \Theta)\, G(\Theta)\, d\Theta. \tag{18}$$

However, the distribution $G$ is unknown. Thus, Equation (18) cannot be used directly to calculate the likelihood. Since all the survey points can be considered to be generated from $G$, the distribution can be estimated by choosing some survey points randomly,

$$P(X_i \mid \Theta_{k+1}) = \frac{1}{m} \sum_{j=1}^{m} P(X_i \mid \Theta(X_{r_j})), \tag{19}$$

where $X_{r_j}$ is the j-th randomly picked survey point and $\Theta(X_{r_j})$ denotes the parameters of $X_{r_j}$. If $X_i$ does belong to the $(k+1)$-th pipeline, none of the former $i - 1$ survey points belongs to the same pipeline as $X_i$, which also means that for all $1 \le j < i$, $P(X_i \mid \Theta(X_j))$ is equal or very close to 0. As a result, it is better to pick from the unprocessed survey points. To simplify the algorithm, let $m = 1$,

$$P(X_i \mid \Theta_{k+1}) = P(X_i \mid \Theta(X_r)). \tag{20}$$

If $i < N$ (where $N$ is the total number of survey points), $r$ is drawn from $i < r \le N$; if $i = N$, i.e. the last survey point, there are no more unprocessed survey points to pick, therefore $1 \le r < N$.

After calculating all of the probabilities, $X_i$ is assigned to the pipeline with the maximum probability. Then, the pipeline is re-fitted. Since the pipeline is considered to be a straight line with depth and radius, fitting it means fitting the line and estimating $\mu_z$ and $\mu_r$. The least-squares method is adopted to fit the line. Suppose that the equation of the l-th line is $y = a_1 x + a_0$ and $Y_l = \{X_i \mid c_i = l\}$ is the set of survey points belonging to it. The square error is defined as

$$\phi = \sum_{X_i \in Y_l} (y_i - a_1 x_i - a_0)^2. \tag{21}$$

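The aim of the least-squares method is to find the $a_1$ and $a_0$ that minimize the square error. Setting the derivatives of $\phi$ with respect to $a_1$ and $a_0$ to 0 gives

$$\frac{\partial \phi}{\partial a_1} = -\sum_{X_i \in Y_l} 2 x_i (y_i - a_1 x_i - a_0) = 0, \qquad \frac{\partial \phi}{\partial a_0} = -\sum_{X_i \in Y_l} 2 (y_i - a_1 x_i - a_0) = 0. \tag{22}$$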

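It turns out that

$$a_1 = \frac{n_l \sum x_i y_i - \sum x_i \sum y_i}{n_l \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a_0 = \frac{\sum y_i}{n_l} - a_1 \frac{\sum x_i}{n_l}, \qquad \mu_\theta = -\arctan a_1, \qquad \mu_b = -a_0 \cos\mu_\theta, \tag{23}$$

where all sums run over $X_i \in Y_l$.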

When $n_l = 1$, the denominator of $a_1$ is 0. In this case, the direction of the single point is taken as the direction of the line, which means $\mu_\theta = \theta_i$ ($X_i \in Y_l$).

Maximum likelihood estimation is adopted to estimate $\mu_z$ and $\mu_r$. Since the distributions of $z$ and $r$ are independent of each other, the likelihoods are also independent.

$$L(\mu_z \mid Y_l) = \prod_{X_i \in Y_l} p(z_i \mid \mu_z), \qquad L(\mu_r \mid Y_l) = \prod_{X_i \in Y_l} p(r_i \mid \mu_r). \tag{24}$$

Setting the derivatives of the log-likelihoods to 0 gives

$$\begin{aligned}
\frac{d \ln L(\mu_z \mid Y_l)}{d\mu_z} &= \sum_{X_i \in Y_l} \frac{d \ln p(z_i \mid \mu_z)}{d\mu_z} = \frac{1}{\sigma_z^2} \sum_{X_i \in Y_l} (z_i - \mu_z) = 0, \\
\frac{d \ln L(\mu_r \mid Y_l)}{d\mu_r} &= \sum_{X_i \in Y_l} \frac{d \ln p(r_i \mid \mu_r)}{d\mu_r} = \frac{1}{\sigma_r^2} \sum_{X_i \in Y_l} (r_i - \mu_r) = 0.
\end{aligned} \tag{25}$$

It means that $\mu_z$ and $\mu_r$ should be

$$\mu_z = \frac{1}{n_l} \sum_{X_i \in Y_l} z_i, \qquad \mu_r = \frac{1}{n_l} \sum_{X_i \in Y_l} r_i. \tag{26}$$
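The re-fitting step of Equations (23) and (26) can be sketched as a single function. This is a minimal illustration, not the authors' code: the name `fit_pipeline` and the tuple layout `(x, y, theta, z, r)` are hypothetical. Two choices in the degenerate branch are assumptions of this sketch: the text only prescribes $\mu_\theta = \theta_i$ for $n_l = 1$, so reusing that rule for vertical clusters and placing the line through the centroid are added here for completeness.

```python
import math


def fit_pipeline(points):
    """Fit (mu_theta, mu_b, mu_z, mu_r) to a cluster of survey points.

    Each point is a tuple (x, y, theta, z, r).
    """
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sxx = sum(p[0] ** 2 for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    denom = n * sxx - sx ** 2

    if n == 1 or abs(denom) < 1e-12:
        # Degenerate case. For n == 1 the text prescribes mu_theta = theta_i.
        # Reusing that rule when all x are equal, and making the line pass
        # through the centroid, are assumptions of this sketch.
        mu_theta = points[0][2]
        mu_b = -(math.sin(mu_theta) * sx / n + math.cos(mu_theta) * sy / n)
    else:
        a1 = (n * sxy - sx * sy) / denom      # slope, Equation (23)
        a0 = sy / n - a1 * sx / n             # intercept, Equation (23)
        mu_theta = -math.atan(a1)
        mu_b = -a0 * math.cos(mu_theta)

    mu_z = sum(p[3] for p in points) / n      # Equation (26)
    mu_r = sum(p[4] for p in points) / n
    return mu_theta, mu_b, mu_z, mu_r
```

For three points on $y = 2x + 1$ the function returns $\mu_\theta = -\arctan 2$ and a $\mu_b$ for which every point satisfies $\sin\mu_\theta \cdot x + \cos\mu_\theta \cdot y + \mu_b = 0$ up to rounding.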

The maximum likelihood estimates of $\mu_z$ and $\mu_r$ are the average values in the cluster.

B. CLUSTERING ALGORITHM

When detecting pipelines with GPR, the dataset is always in a specific order determined by the detection route. Since all the survey points are considered to be sampled from a mixture distribution, the dataset should be in random order. Consequently, the dataset is randomly shuffled at first. Then, the survey points are processed sequentially. For each survey point, the algorithm calculates the probabilities and chooses the case with the maximum probability. Figure 5 shows the process of DPPMM, and the pseudocode is shown in Algorithm 1 and Algorithm 2.

In the first loop, when processing the i-th survey point, the $(i + 1)$-th to n-th survey points are not clustered yet. After the first loop, also called the initial loop, all the survey points have a cluster assignment. Nevertheless, this assignment is not yet the best one, and more loops are required to improve it. In each loop, the algorithm removes the current survey point from its cluster and calculates the probabilities again. In these loops, all the survey points have assignments. Consequently, the denominators of Equation (12) and Equation (17) should be $\alpha + n - 1$. If the assignments of all the survey points remain unchanged during a loop, the algorithm stops. After the algorithm stops, each cluster represents a pipeline in the pipeline map, except for clusters with very few survey points, which are considered to be noisy points.

FIGURE 4. The process of clustering each survey point. In this figure, both L4 and L5 pass through X. However, considering the direction of X and the prior factor, the probability of L4 is larger.

Algorithm 1 DPPMM Cluster
Input: X[1…n], α    ▷ X[i] = (x_i, y_i, θ_i, z_i, r_i); α is the parameter of the Dirichlet Process
Output: C[1…n]    ▷ C[i] = c_i (1 ≤ c_i ≤ L); L is the number of clusters, and c_i means that the i-th survey point belongs to the c_i-th cluster
C = Initialize(X, α)
C′[1…n] = 0
k = max(C)    ▷ k is the number of clusters
while C′ ≠ C do
    C′ = C
    for i = 1 to n do
        P[1…k+1] = Probability(X, C, α, i)
        t = arg max_j P[j]
        if Σ_{j=1}^{k} P[j] = 0 or t = k + 1 then
            if Σ_{j=1}^{n} 1(C[j] = C[i]) = 1 then    ▷ X[i] is the only one in C[i]
                C[i] remains unchanged
            else
                C[i] = k + 1
                k++
            end if
        else
            C[i] = t
        end if
    end for
end while
return C[1…n]

Since the pipelines are considered to be horizontal straight lines, there are always two angles between two lines. When calculating Equation (14), $\tilde{\theta}$, the angle between the survey point and the pipeline, needs to be calculated. The distribution of $\tilde{\theta}$ shows that it should not be larger than $\pi/2$. As a result, $\tilde{\theta}$ is defined as the smaller one of the two angles, which means $0 \le \tilde{\theta} \le \pi/2$.

When calculating the probabilities, if $\tilde{\theta}$ is smaller than $\sigma_\theta$, the probability is always larger than 0. If the probabilities are stored as floating-point numbers in memory, the smallest positive normal value is finite ($2^{-126}$ in single precision). When the true probability is smaller than this minimum precision, it will be treated as 0. However, in most cases, the precision of the input data is far coarser than $2^{-126}$. As a result, a threshold $\xi$ can be used to determine whether a probability should be treated as zero. The value of $\xi$ depends on the biases of the antennas. When the biases of the antennas are larger, which may be caused by interaction or noise, the threshold should be smaller. It means that when the antennas are not reliable, the probabilities of a survey point may be smaller than with reliable antennas. If the Pauta Criterion is adopted, which means that the probability is treated as 0 when $\tilde{d} > 3\sigma_b$ or $\tilde{r} > 3\sigma_r$ or $\tilde{z} > 3\sigma_z$, the value of $\xi$ is

$$\xi = \max P(X_i \mid \Theta_l), \quad \text{s.t. } \tilde{d} > 3\sigma_b \lor \tilde{r} > 3\sigma_r \lor \tilde{z} > 3\sigma_z. \tag{27}$$
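The likelihood of Equations (13) to (15), the folded angle, and the Pauta cut-off can be sketched together as follows. This is a hedged illustration, not the authors' implementation: all function names and the tuple layouts are hypothetical. Instead of computing $\xi$ explicitly, the sketch applies the three-sigma test directly, and it compares absolute residuals because $\tilde{d}, \tilde{z}, \tilde{r}$ can be negative; both are interpretations made for this example.

```python
import math


def gaussian_pdf(value, sigma):
    """Zero-mean normal density, Equation (15)."""
    return math.exp(-value ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def angle_between(theta_a, theta_b):
    """Smaller of the two angles between two undirected lines, in [0, pi/2]."""
    diff = abs(theta_a - theta_b) % math.pi
    return min(diff, math.pi - diff)


def likelihood(point, params, sigmas):
    """P(X_i | Theta_l) from Equations (13)-(15), with a 3-sigma cut-off."""
    x, y, theta, z, r = point
    mu_theta, mu_b, mu_z, mu_r = params
    sigma_b, sigma_theta, sigma_z, sigma_r = sigmas

    d = math.sin(mu_theta) * x + math.cos(mu_theta) * y + mu_b
    t = angle_between(theta, mu_theta)
    dz, dr = z - mu_z, r - mu_r

    if t > sigma_theta:
        return 0.0                                   # Equation (14)
    if abs(d) > 3 * sigma_b or abs(dz) > 3 * sigma_z or abs(dr) > 3 * sigma_r:
        return 0.0                                   # Pauta (3-sigma) rule
    return (gaussian_pdf(d, sigma_b) * gaussian_pdf(dz, sigma_z)
            * gaussian_pdf(dr, sigma_r) / sigma_theta)


def weighted_probability(count, total, alpha, like):
    """Prior times likelihood with denominator alpha + total - 1.

    Pass count = n_l for an existing cluster, or count = alpha for a new one.
    """
    return count / (alpha + total - 1) * like
```

With `total` set to $i$ the weights follow Equations (12) and (17); with `total` set to $n$ they follow the updating loops described above.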

FIGURE 5. The process of the clustering algorithm. C and C′ in the figure represent the cluster assignments. Calculating and updating means calculating the probabilities for all the survey points again and re-clustering them.
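The control flow of Algorithm 1 and Figure 5 can be sketched with the probability calculation passed in as a function. This is a simplified illustration under stated assumptions: the name `dppmm_cluster` and the `probability(assign, i, k)` interface are hypothetical, and the initial loop is merged into the same loop by giving unclustered points the label 0, whereas Algorithm 1 calls a separate `Initialize` step.

```python
def dppmm_cluster(n, probability, max_loops=101):
    """Iterate cluster assignments until they stop changing.

    probability(assign, i, k) must return a list of k + 1 weights: entries
    0..k-1 for the existing clusters 1..k (computed without point i) and
    entry k for a new cluster. Unassigned points carry label 0.
    """
    assign = [0] * n          # 0 means "not clustered yet" (initial loop)
    k = 0
    for _ in range(max_loops):
        previous = list(assign)
        for i in range(n):
            weights = probability(assign, i, k)
            best = max(range(k + 1), key=weights.__getitem__)
            if sum(weights[:k]) == 0 or best == k:
                # Keep a point that is already alone in its cluster;
                # otherwise open a new cluster for it.
                alone = assign[i] != 0 and assign.count(assign[i]) == 1
                if not alone:
                    k += 1
                    assign[i] = k
            else:
                assign[i] = best + 1
        if assign == previous:
            break
    return assign
```

As in Algorithm 1, emptied clusters keep their labels, so the number of valid pipelines has to be counted from the returned assignment. The default of 101 loops mirrors the limit used in the simulated experiment (1 initial loop plus 100 updating loops).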

When using this model, the GPR and GPS data are preprocessed as the input of the model. Then the initial loop is executed to initialize the cluster assignment. After that, more loops are executed to improve the assignment until there are no changes or the number of loops reaches the limit. Finally, all the pipelines are fitted according to the cluster assignment and the pipeline map is generated.

IV. EXPERIMENTAL STUDIES

In this section, a simulated dataset and three real-world datasets are used to evaluate the proposed model. First, the simulated experiment is introduced. Then the experimental results on the three real-world datasets are reported. After that, DPPMM is compared with three other algorithms on the three real-world datasets.

A. EXPERIMENTAL RESULTS ON SIMULATED DATASET

In this subsection, a simulated dataset is generated to evaluate the precision of the model. The dataset simulates two parallel pipelines with the same depth and radius that are close to each other. Such a situation is difficult to find in the real world. This experiment shows the effect of $\alpha$. As shown in Figure 6, the equations of the two pipelines are $y = 100$ and $y = 97$. The depth and radius of both pipelines are 1 and 0.3, respectively. The biases of direction, depth and radius are set to $\sigma_\theta = \pi/12$, $\sigma_z = 0.1$ and $\sigma_r = 0.1$, which have almost no influence on the result. The bias of position is set to $\sigma_b = 1$, which is 1/3 of the distance between the two pipelines. Each pipeline has 25 survey points generated by the distributions shown in Section II. The ground truth of this experiment is shown in Figure 6(a). The other three panels of Figure 6 show the clustering results for different values of $\alpha$ (1, 10 and 100). Figure 7(a) shows how the number of loops varies with $\alpha$. Some detailed information is shown in Table 1.

Algorithm 2 Probability Calculation
Input: X[1…n], C[1…n], α, i
Output: P[1…k+1]
k = max(C)
for l = 1 to k do
    Y_l = {X[j] | C[j] = l, j ≠ i}
    n_l = |Y_l|
    if n_l = 0 then
        P[l] = 0
    else
        Calculate µθ, µb, µr and µz with (23) and (26)
        P[l] = P(X[i] | Θ_l)
        if P[l] < ξ then
            P[l] = 0
        else
            P[l] = n_l / (α + n − 1) · P[l]
        end if
    end if
end for
if i < n then
    r = random(i + 1, n)
else
    r = random(1, n − 1)
end if
P[k + 1] = α / (α + n − 1) · P(X[i] | Θ(X[r]))
return P[1…k+1]

TABLE 1. Results of simulated experiment.

FIGURE 6. The clustering results on the same dataset with different α. Panel (a) shows the correct clustering result. α is 1 in (b), 10 in (c) and 100 in (d). The accuracy of these three results is 0.46, 0.92, and 0.82.

FIGURE 7. These figures show how the number of loops varies according to α.

"Valid clusters" in Table 1 means the number of clusters with at least 5 survey points. When $\alpha = 1$, there is only one valid cluster in the result, which means that the model fails to distinguish the two pipelines from each other. When $\alpha$ equals 10 or 100, the results show that the two pipelines are clustered and fitted. Nevertheless, when $\alpha = 100$, 9 points are considered to be noisy points, compared with 2 noisy points when $\alpha = 10$. In this experiment, the maximum number of loops is set to 101 (1 initial loop plus 100 updating loops). The larger $\alpha$ is, the more loops it takes for the model to converge.

B. EXPERIMENTAL RESULTS ON REAL-WORLD DATASETS

In this subsection, three different datasets are measured for the experiments. As shown in Figure 8, the B-scan images are acquired by a SIR-30 GPR from GSSI, and the position data is measured by a P206 GPS from UniStrong. When surveying the experimental sites, the principal directions of the pipelines can be estimated from the manholes. As a result, the datasets are measured along parallel scan lines that are perpendicular to the estimated pipelines. Each scan line is about 5 to 10 meters long, and the intervals between scan lines are about 10 meters. The start and end positions of each scan line are recorded. If a hyperbola is detected in the B-scan image of a scan line, the position and direction of the survey point are calculated from the recorded positions, and the survey point is added to the dataset. If there is no pipeline under the scan line or the hyperbola is not clear enough to be detected, there is no survey point on this scan line.

FIGURE 8. This figure shows the antenna of the GPR used in this experiment. Its frequency is 200 MHz.

When measuring positions, the bias of GPS is influenced by the surroundings. When the GPS is surrounded by high-rise buildings or trees, the GPS signal might be blocked, which causes the position to be imprecise or even lost. In an open area, such as the first dataset, the GPS signal is continuous and stable. As a result, the bias of position $\sigma_b$ is influenced by both the antennas and the environment, while $\sigma_\theta$ is determined by the antenna directly. The biases of depth and radius, $\sigma_z$ and $\sigma_r$, are influenced mainly by the pipeline detection algorithm. In this paper, the B-scan images are processed by the algorithm in [10].

FIGURE 9. The first dataset – open space. Panel (a) shows the map of the surroundings. Panel (b) shows the survey points in a Cartesian coordinate system. The positive direction of the x-axis represents east and the positive direction of the y-axis represents north.

As shown in Figure 9(a), the first dataset was collected in an open space. Without interference from buildings or trees, the position data is accurate in this dataset. Since the exact value of the bias cannot be measured and is unnecessary in this experiment, $\sigma_b$ is set to 1 m for this dataset. The bias of position in this paper is estimated by receiving GPS signals while the GPS antenna is fixed. Based on the information about the antennas and the pipeline detection algorithm, the other biases are set to $\sigma_\theta = \pi/12$, $\sigma_z = 0.1$ m and $\sigma_r = 0.1$ m. The number of loops it takes for the model to converge with different $\alpha$ is shown in Figure 7(b). Since the new cluster is chosen randomly, the result varies every time. However, it can still be seen that when $\alpha$ is larger than 20, the number of loops rises sharply.

A proper value of $\alpha$ helps the algorithm cluster the survey points quickly and correctly, which means the value of $\alpha$ for this dataset should be under 20. Comparing the results for different values of $\alpha$, the number of noisy points reaches its minimum when $\alpha = 4$ in this experiment. Since the algorithm can distinguish noisy points from survey points easily, a larger number of noisy points means that more survey points are mistaken for noisy points. As a result, the value of $\alpha$ is set to 4 for this dataset. The cluster assignment is shown in Figure 10. The detailed information is shown in Table 2.

FIGURE 10. The mapping result of dataset – open space. The points with the same color are from the same pipeline. The lines are pipelines fitted to these survey points. The points in the shape of the letter 'x' are considered to be noisy points.

FIGURE 11. The second dataset – lecture buildings. Fig (a) shows the map of surroundings. Fig (b) shows the survey points in Cartesian coordinate system.

As shown in Figure 11, the second dataset was collected around lecture buildings and is noisier than the first one. Most of the pipelines at this site are close to buildings or under trees. With interference from trees and buildings, the GPS signal is not as stable as in the first dataset. As a result, the position bias σb is set to 1.5 m. The other parameters are set to the same values as for the first dataset. As shown in Figure 7(c), the number of loops rises faster than for the first dataset. The clustering result is shown in Figure 12. More detailed information is shown in Table 2.



TABLE 2. Results of real-world experiments.

FIGURE 14. The mapping result of dataset – dormitories.

in this area than the other two. The parameter set is the same as the first one. The clustering result is shown in Figure 14. More detailed information is shown in Table 2.

C. COMPARISONS WITH OTHER ALGORITHMS

FIGURE 12. The mapping result of dataset – lecture buildings.

In this subsection, the result of DPPMM is compared with three other algorithms: DBScan [26], Hough Transform [27] and K-means [28]. DBScan is a density-based clustering algorithm. It can cluster datasets of any shape according to their densities, and it does not need any prior knowledge about the number of clusters. Hough Transform is designed to detect various shapes, including straight lines. Each point is converted to a curve in r − θ space, and the points where the most curves intersect give the parameters of the straight lines. It can discriminate noisy points and identify straight lines without knowing the number of lines. K-means is a classic clustering algorithm. It is fast and effective, and works on many kinds of datasets. However, K-means needs the number of clusters as input and is unable to discriminate noisy points.

TABLE 3. Comparison with other algorithms.
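The voting step of the Hough transform described above can be sketched as follows. This is a minimal, generic illustration rather than the implementation compared in the paper: each point (x, y) votes for every discretised cell satisfying r = x·cos θ + y·sin θ, and cells with many votes correspond to candidate lines. The function names, grid resolution and vote threshold are assumptions.

```python
import math
from collections import Counter


def hough_votes(points, n_theta=180, r_step=0.5):
    """Accumulate votes on a discretised (r, theta) grid."""
    votes = Counter()
    for x, y in points:
        for j in range(n_theta):
            theta = math.pi * j / n_theta
            r = x * math.cos(theta) + y * math.sin(theta)
            # Quantise r so that nearby values fall into the same cell.
            votes[(round(r / r_step), j)] += 1
    return votes


def strongest_lines(points, min_votes=3, n_theta=180, r_step=0.5):
    """Return (r, theta, votes) for every cell with at least min_votes votes."""
    votes = hough_votes(points, n_theta, r_step)
    return [(i * r_step, math.pi * j / n_theta, v)
            for (i, j), v in votes.items() if v >= min_votes]
```

Because the vote uses only x and y, two pipelines at the same horizontal position but different depths fall into the same cell in this sketch.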

FIGURE 13. The third dataset – dormitories. Fig (a) shows the map of surroundings. Fig (b) shows the survey points in Cartesian coordinate system.

As shown in Figure 13, the third dataset was collected around dormitories and has less noise than the second one. Although there are also tall buildings and trees at this site, most of the survey points could be detected without interference from buildings or trees. There are fewer pipelines

The results of these algorithms on the three real-world datasets are shown in Table 3. Since the ground truth of these three real-world datasets cannot be obtained directly, the accuracy in the table is calculated by comparing the results of the algorithms with the real pipeline maps of these three areas. For each pipeline in the map, the positions of several chosen points are measured and the equation of each pipeline is fitted. The depth and radius of the pipelines were obtained from the status records of the pipelines at these experimental sites. As a result, the accuracy is calculated by comparing the data of survey points with the real-world pipelines.

DBScan is unable to distinguish parallel pipelines from each other in this experiment and therefore gets the worst result. Although Hough Transform is good at identifying straight lines, the biases and sparsity of the survey points make it difficult for it to obtain good results. For the second dataset (Lecture Buildings), the biases and noisy points cause all three of these algorithms to fail to obtain a good result. Besides, Hough Transform is unable to use the depth information: when there are two pipelines at the same position but at different depths, it cannot distinguish them from each other. On the other hand, DBScan and K-means can use the depth and radius information when calculating the distances between points. However, K-means needs the number of clusters as a parameter and is strongly influenced by noisy points.

DPPMM does not need any prior knowledge, which makes it more suitable than K-means in most cases. Besides, it can distinguish parallel pipelines and pipelines at different depths from each other. Although DPPMM shows no advantage in computational time, it obtains better results than the other three algorithms. Especially for the Lecture Buildings dataset, DPPMM is the only one that obtains a good result. DPPMM can work with different antennas in various environments by changing the parameters of the distributions, which makes it more useful than the other three algorithms in various situations.
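The accuracy evaluation relies on fitting a line to measured reference points and comparing survey points against it. A minimal sketch of that idea is given below, assuming 2-D Cartesian coordinates in metres, an orthogonal least-squares fit, and a hypothetical distance tolerance; the paper does not state the exact accuracy formula, so the function names and the tolerance-based score are assumptions.

```python
import math


def fit_line(points):
    """Orthogonal least-squares line through 2-D points.

    Returns (cx, cy, theta): a point on the line (the centroid)
    and the direction angle of the line in radians.
    """
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are needed")
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sxx = sum((x - cx) ** 2 for x, _ in points)
    syy = sum((y - cy) ** 2 for _, y in points)
    sxy = sum((x - cx) * (y - cy) for x, y in points)
    # Direction of the principal axis of the scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return cx, cy, theta


def distance_to_line(point, line):
    """Perpendicular distance from a point to a fitted line."""
    cx, cy, theta = line
    dx, dy = point[0] - cx, point[1] - cy
    return abs(dx * math.sin(theta) - dy * math.cos(theta))


def fraction_within(points, line, tol):
    """Share of points lying within tol of the reference line."""
    hits = sum(distance_to_line(p, line) <= tol for p in points)
    return hits / len(points)
```

Depth and radius could be compared against the status records in the same way, with one tolerance per quantity.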

D. ANALYSIS OF EXPERIMENTS

The results of the simulated experiment show the role of α. A proper value of α helps the model map pipelines quickly and precisely. If α is too small, the survey points tend to be grouped into the same clusters, which means there will be fewer clusters. On the other hand, if α is too large, survey points tend to be assigned to different clusters, which means that the number of clusters increases. As shown in Table 1, both α = 10 and α = 100 estimate the two pipelines correctly. However, when α equals 100, more points are considered to be noisy points and more loops are required before the model converges. As can be seen in Figure 7(a), the number of loops increases as α rises. In this simulated situation, α = 10 is the best choice for mapping quickly and precisely.

The time cost of this algorithm is proportional to the number of loops. For the same input dataset, the time cost of each loop depends on the number of clusters. For the first few loops, the number of clusters is larger than in the output because some survey points are not yet clustered correctly, so these loops take more time. When most of the survey points are clustered correctly, especially when the noisy points are distinguished correctly, the number of clusters decreases, so these loops take less time. For the datasets in this paper, the number of clusters changes only slightly during clustering. As a result, the time cost can be considered proportional to the number of loops.

The three real-world datasets show the performance of DPPMM in different surroundings. The first dataset was collected in a flat open area with few noisy points, and its smaller bias makes it easier to achieve high performance. On the other hand, the second dataset was collected in complex surroundings, where the stronger GPS bias makes it more difficult to map the pipelines. The bias of the third dataset is almost the same as that of the first one, but there are fewer pipelines. To help the algorithm work in these different environments, the values of α and σ all need to be adjusted. The values of σ are the upper boundaries of the biases, which help the algorithm to distinguish noisy points. Since the exact values of the biases cannot be estimated in the experiments, the values of σ are set as the upper boundaries of the biases.5 The closer σ is to the real value of the biases, the better the results obtained by the algorithm. If the values of σ are too large, some noisy points that are close to pipelines could be mistaken for survey points. On the other hand, if the values of σ are too small, some survey points that are far away from their pipelines could be mistaken for noisy points. In most cases, noisy points are far away6 from all the pipelines, which means that it is better to set the values of σ slightly larger rather than smaller than the real values.

As shown in Table 2, the second dataset has fewer survey points but more clusters and noisy points than the first one, which makes it more time-consuming. The third dataset has the fewest survey points and takes the fewest loops to obtain a good result. The mapping results are all good enough to identify every pipeline. The results show that this model works well on both high-precision and noisy datasets.

V. CONCLUSION

In this paper, the Dirichlet Process Pipeline Mapping Model (DPPMM) is proposed. It is a probabilistic model based on the Dirichlet Process Mixture Model. It takes the noisy dataset measured by GPR and GPS as input and produces the pipeline map as output. It can automatically assign the detected points to different pipelines and fit their equations on the map even when the pipelines are close to each other. By using the biases of the dataset as parameters, DPPMM can work in various environments. As a nonparametric Bayesian model, it works without the number of clusters being specified. The randomly sampled survey points help DPPMM to simulate the base distribution without any prior knowledge. The value of α helps DPPMM to estimate the number of clusters, and the values of σ limit the upper boundaries of biases, which helps DPPMM to distinguish noisy points. The experimental results on real-world datasets show that it works well on both high-precision and noisy datasets. Compared with other state-of-the-art algorithms, DPPMM shows the best performance on all the datasets. Future work is to incorporate the spatial-temporal relationship [29], [30] into underground pipeline mapping.

5 Since the biases of position, depth and radius are assumed to follow Gaussian distributions, the upper boundaries are set to 1/3 of the maximum values of these biases, following the Pauta criterion. The bias of direction follows a uniform distribution, which means that the upper boundary is the maximum value of the bias.

6 In this paper, the distances between noisy points and pipelines are measured by the values of probabilities.

REFERENCES

[1] D. J. Daniels, Ground Penetrating Radar (Encyclopedia of RF and Microwave Engineering). Stevenage, U.K.: Institution Engineering and Technology, 2005.
[2] J. M. Muggleton, M. J. Brennan, and Y. Gao, “Determining the location of buried plastic water pipes from measurements of ground surface vibration,” J. Appl. Geophys., vol. 75, no. 1, pp. 54–61, Sep. 2011.
[3] A. Royal, P. R. Atkins, and M. Brennan, “Site assessment of multiple sensor approaches for buried utility detection,” Int. J. Geophys., vol. 2011, Jan. 2011, Art. no. 496123.
[4] H. Chen and A. G. Cohn, “Buried utility pipeline mapping based on multiple spatial data sources: A Bayesian data fusion approach,” in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, pp. 2411–2417.
[5] H. Chen and A. G. Cohn, “Buried utility pipeline mapping based on street survey and ground penetrating radar,” in Proc. 19th Eur. Conf. Artif. Intell., 2010, pp. 987–988.
[6] H. Chen and A. G. Cohn, “Probabilistic conic mixture model and its applications to mining spatial ground penetrating radar data,” in Proc. Workshop SIAM Conf. Data Mining, 2010, pp. 1–9.
[7] H. Chen and A. G. Cohn, “Probabilistic robust hyperbola mixture model for interpreting ground penetrating radar data,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2010, pp. 1–8.
[8] S. Chicarella, V. Ferrara, F. Frezza, A. D’Alvano, and L. Pajewski, “Improvement of GPR tracking by using inertial and GPS combined data,” in Proc. 24th Int. Conf. Softw., Telecommun. Comput. Netw. (SoftCOM), Sep. 2016, pp. 1–5.
[9] R. Dutta, A. G. Cohn, and J. M. Muggleton, “3D mapping of buried underworld infrastructure using dynamic Bayesian network based multi-sensory image data fusion,” J. Appl. Geophys., vol. 92, pp. 8–19, May 2013.
[10] X. Zhou, H. Chen, and J. Li, “An automatic GPR B-Scan image interpreting model,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 6, pp. 3398–3412, Jun. 2018.
[11] X. Zhou, H. Chen, and T. Hao, “Efficient detection of buried plastic pipes by combining GPR and electric field methods,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6, pp. 3967–3979, Jun. 2019.
[12] X. Zhou, H. Chen, and J. Li, “Probabilistic mixture model for mapping the underground pipes,” ACM Trans. Knowl. Discovery Data, vol. 13, no. 5, p. 47, 2019.
[13] H. Chen, P. Tino, and X. Yao, “Predictive ensemble pruning by expectation propagation,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 7, pp. 999–1013, Jul. 2009.
[14] H. Chen, P. Tino, and X. Yao, “Probabilistic classification vector machines,” IEEE Trans. Neural Netw., vol. 20, no. 6, pp. 901–914, Jun. 2009.
[15] H. Chen and X. Yao, “Regularized negative correlation learning for neural network ensembles,” IEEE Trans. Neural Netw., vol. 20, no. 12, pp. 1962–1979, Dec. 2009.
[16] R. Granell, C. J. Axon, and D. C. H. Wallom, “Clustering disaggregated load profiles using a Dirichlet process mixture model,” Energy Convers. Manage., vol. 92, pp. 507–516, Mar. 2015.
[17] D. B. Dahl, “Model-based clustering for expression data via a Dirichlet process mixture model,” Bayesian Inference Gene Expression Proteomics, vol. 4, pp. 201–218, Jan. 2006.
[18] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Sharing clusters among related groups: Hierarchical Dirichlet processes,” in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 1385–1392.
[19] S. Kim, M. G. Tadesse, and M. Vannucci, “Variable selection in clustering via Dirichlet process mixture models,” Biometrika, vol. 93, no. 4, pp. 877–893, Dec. 2006.
[20] M. Kobayashi and K. Nakano, “Dirichlet process crescent-signal mixture model for ground-penetrating radar signals,” in Proc. 40th Annu. Conf. IEEE Ind. Electron. Soc., Oct. 2014, pp. 3431–3437.
[21] S. Jain and R. M. Neal, “A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model,” J. Comput. Graph. Statist., vol. 13, no. 1, pp. 158–182, Mar. 2004.
[22] P. De Blasi, S. Favaro, A. Lijoi, R. H. Mena, I. Prunster, and M. Ruggiero, “Are Gibbs-type priors the most natural generalization of the Dirichlet process?” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 2, pp. 212–229, Feb. 2015.
[23] R. M. Neal, “Markov chain sampling methods for Dirichlet process mixture models,” J. Comput. Graph. Statist., vol. 9, no. 2, pp. 249–265, Jun. 2000.
[24] S. G. Walker, “Sampling the Dirichlet mixture model with slices,” Commun. Statist.-Simul. Comput., vol. 36, no. 1, pp. 45–54, Jan. 2007.
[25] D. M. Blei and M. I. Jordan, “Variational methods for the Dirichlet process,” in Proc. 21st Int. Conf. Mach. Learn., 2004, p. 12.
[26] P. Liu, D. Zhou, and N. Wu, “VDBSCAN: Varied density based spatial clustering of applications with noise,” in Proc. Int. Conf. Service Syst. Service Manage., Jun. 2007, pp. 1–4.
[27] A. Simi, S. Bracciali, and G. Manacorda, “Hough transform based automatic pipe detection for array GPR: Algorithm development and on-site tests,” in Proc. IEEE Radar Conf., May 2008, pp. 1–6.
[28] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, Jun. 2010.
[29] H. Chen, P. Tino, A. Rodan, and X. Yao, “Learning in the model space for cognitive fault diagnosis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 1, pp. 124–136, Jan. 2014.
[30] H. Chen, F. Tang, P. Tino, and X. Yao, “Model-based kernel for efficient time series analysis,” in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 11–14.

QINGYUAN WU was born in Suzhou, Jiangsu, China, in 1995. He received the B.S. degree in computer science from the University of Science and Technology of China, in 2017.

XIREN ZHOU was born in Anhui, China, in 1992. He received the B.S. degree in computer science from Shandong University, China, in 2014, and the M.S. and Ph.D. degrees in computer science from the University of Science and Technology of China, in 2019.

HUANHUAN CHEN (Senior Member, IEEE) received the B.S. degree from the University of Science and Technology of China, in 2004, and the Ph.D. degree from the University of Birmingham, in 2008. Dr. Chen has been an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (TNNLS) and the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE (TETCI), since 2016. He has been the IEEE Computational Intelligence Society Student Activities Committee Chair since 2015. He was the IEEE World Congress on Computational Intelligence (IEEE WCCI) Publications Integrity Chair, in 2016.
