Prediction of Methane Inclusion Types Using Support Vector Machine


Scientific Journal of Earth Science, June 2015, Volume 5, Issue 2, pp. 18-27

Guangren Shi

Department of Experts, Research Institute of Petroleum Exploration and Development, PetroChina, P. O. Box 910, Beijing 100083, China. Email: grs@petrochina.com.cn

Abstract: Three classification algorithms and one regression algorithm have been applied to forecast methane inclusion types. The three classification algorithms are the classification of support vector machine (C-SVM), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD), while the regression algorithm is the multiple regression analysis (MRA). Of the four algorithms, only MRA is a linear algorithm; the other three are nonlinear algorithms. In general, when all four algorithms are used to solve a real-world problem, they often produce different solution accuracies. Toward this issue, the solution accuracy is expressed by the total mean absolute relative residual for all samples, R(%), and two criteria are proposed: a) an algorithm is applicable if its R(%) < 10, otherwise it is inapplicable; and b) the R(%) of MRA is employed to measure the nonlinearity degree of a studied problem: weak if R(%) < 10, moderate if 10 ≤ R(%) ≤ 30, and strong if R(%) > 30. A case study at the Puguang Gasfield, a classification problem, has been used to validate the proposed approach. The calculation results indicate that a) this case study is a strongly nonlinear problem, because the R(%) value of MRA is 32; and b) C-SVM is applicable since its R(%) value is 0, whereas NBAY and BAYSD are both inapplicable because their R(%) values are 18 and 38, respectively. For this case study, it is concluded that the preferable algorithm is C-SVM, while BAYSD can serve as a promising dimension-reduction tool.

Keywords: Data Mining; Naïve Bayesian; Bayesian Successive Discrimination; Multiple Regression Analysis; Nonlinearity; Solution Accuracy; Dimensionality Reduction; Puguang Gasfield

1 INTRODUCTION

In recent years, regression and classification algorithms have seen preliminary success in petroleum geology [1, 2], but the application of these algorithms to methane inclusion type prediction has not started yet. Liu et al. (2013) presented the occurrence and genesis of multiple types of high density methane inclusions in the Puguang Gasfield [3]. Using all the samples with complete parameters given by [3], this paper presents a prediction of methane inclusion types using the four algorithms below. The benefit of the proposed approach is to reduce the experimentation needed to determine methane inclusion types.

Three classification algorithms and one regression algorithm have been applied to forecast methane inclusion types. The three classification algorithms are the classification of support vector machine (C-SVM), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD), while the regression algorithm is the multiple regression analysis (MRA). Of the four algorithms, only MRA is a linear algorithm; the other three are nonlinear algorithms. In general, when all four algorithms are used to solve a real-world problem, they often produce different solution accuracies. Toward this issue, the solution accuracy is expressed by the total mean absolute relative residual for all samples, R(%). It is proposed that a) the R(%) of MRA is employed to measure the nonlinearity degree of a studied problem: weak if R(%) < 10, moderate if 10 ≤ R(%) ≤ 30, and strong if R(%) > 30; and b) an algorithm is applicable if its R(%) < 10, otherwise it is inapplicable. A case study at the Puguang Gasfield, a classification problem, has been used to validate the proposed approach.


2 METHODOLOGY

The methodology consists of three major parts: definitions commonly used by regression and classification algorithms; the methods of the four algorithms; and dimensionality reduction.

2.1 Definitions Commonly Used by Regression and Classification Algorithms

The aforementioned classification and regression algorithms share the data of samples. The essential difference between the two types of algorithms is that the output of a regression algorithm is a real-type value that in general differs from the real number given in the corresponding learning sample, whereas the output of a classification algorithm is an integer-type value that must be one of the integers defined in the learning samples. In the view of dataology, the integer-type value is called a discrete attribute, while the real-type value is called a continuous attribute.

The four algorithms (C-SVM, NBAY, BAYSD, MRA) use the same known parameters, and also share the same unknown to be predicted. The only difference between them lies in the approach and the calculation results. Assume that there are n learning samples, each associated with m+1 numbers (x1, x2, …, xm, y*) and a set of observed values (xi1, xi2, …, xim, yi*), with i = 1, 2, …, n. In principle n > m, but in actual practice n >> m. The n samples associated with m+1 numbers are defined as n vectors:

    x_i = (x_{i1}, x_{i2}, \ldots, x_{im}, y_i^{*}) \quad (i = 1, 2, \ldots, n)        (1)

where n is the number of learning samples; m is the number of independent variables in the samples; xi is the ith learning sample vector; xij is the value of the jth independent variable in the ith learning sample, j = 1, 2, …, m; and yi* is the observed value of the ith learning sample. Equation 1 is the expression of learning samples.

Let x0 be the general form of a vector (xi1, xi2, …, xim). The principles of NBAY, BAYSD and MRA are the same, i.e., to construct an expression y = y(x0) such that Eq. 2 is minimized. Certainly, these three algorithms use different approaches and obtain calculation results of differing accuracies.

    \sum_{i=1}^{n} \left[ y(x_{0i}) - y_i^{*} \right]^{2}        (2)

where y(x0i) is the calculation result of the dependent variable in the ith learning sample; the other symbols have been defined in Eq. 1.

However, the principle of the C-SVM algorithm is to construct an expression y = y(x0) that maximizes the margin based on the support vector points, so as to obtain the optimal separating line. This y = y(x0) is called the fitting formula obtained in the learning process. The fitting formulas of different algorithms are different. In this paper, y is defined as a single variable.

The workflow is as follows: the 1st step is the learning process, using the n learning samples to obtain a fitting formula; the 2nd step is the learning validation, substituting the n learning samples (xi1, xi2, …, xim) into the fitting formula to get the prediction values (y1, y2, …, yn), so as to verify the fitness of an algorithm; and the 3rd step is the prediction process, substituting the k prediction samples expressed by Eq. 3 into the fitting formula to get the prediction values (yn+1, yn+2, …, yn+k).

    x_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \quad (i = n+1, n+2, \ldots, n+k)        (3)

where k is the number of prediction samples; xi is the ith prediction sample vector; and the other symbols have been defined in Eq. 1. Equation 3 is the expression of prediction samples.

In the four algorithms, only MRA is a linear algorithm whereas the other three are nonlinear; this is because MRA constructs a linear function whereas the other three construct nonlinear functions.
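To make the three-step workflow concrete, the following is a minimal, hypothetical sketch in Python using scikit-learn; the tiny arrays and the choice of C-SVM as the stand-in algorithm are illustrative assumptions, not the paper's data.

```python
# A minimal sketch of the three-step flow described above, assuming the
# samples are held in NumPy arrays; all values here are placeholders.
import numpy as np
from sklearn.svm import SVC  # any of the four algorithms could stand in here

# Step 1 -- learning process: n learning samples (Eq. 1) yield a fitting formula.
X_learn = np.array([[1.0, 2910.79], [2.0, 2911.00]])  # (x_i1, ..., x_im), here m = 2
y_learn = np.array([1, 2])                            # observed values y_i*

model = SVC(kernel="rbf")          # the "fitting formula" y = y(x0)
model.fit(X_learn, y_learn)

# Step 2 -- learning validation: substitute the learning samples back in.
y_fit = model.predict(X_learn)     # (y_1, ..., y_n), used to verify fitness

# Step 3 -- prediction process: substitute k prediction samples (Eq. 3).
X_pred = np.array([[1.0, 2910.40]])
y_pred = model.predict(X_pred)     # (y_{n+1}, ..., y_{n+k})
```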


To express the calculation accuracies of the prediction variable y for the learning and prediction samples when the four algorithms are used, the following four types of residuals are defined.

The absolute relative residual for each sample, R(%)i (i = 1, 2, …, n, n+1, n+2, …, n+k), is defined as

    R(\%)_i = \left| (y_i - y_i^{*}) / y_i^{*} \right| \times 100        (4)

where yi is the calculation result of the dependent variable in the ith sample; the other symbols have been defined in Eqs. 1 and 3. R(%)i is the fitting residual expressing the fitness of a sample in the learning or prediction process. Note that zero must not be taken as a value of yi*, to avoid floating-point overflow. Therefore, for a regression algorithm, a sample is deleted if its yi* = 0; and for a classification algorithm, positive integers are taken as the values of yi*.

The mean absolute relative residual for all learning samples, R1(%), is defined as

    R_1(\%) = \sum_{i=1}^{n} R(\%)_i / n        (5)

where all symbols have been defined in Eqs. 1 and 4. R1(%) is the fitting residual expressing the fitness of the learning process.

The mean absolute relative residual for all prediction samples, R2(%), is defined as

    R_2(\%) = \sum_{i=n+1}^{n+k} R(\%)_i / k        (6)

where all symbols have been defined in Eqs. 3 and 4. R2(%) is the fitting residual expressing the fitness of the prediction process.

The total mean absolute relative residual for all samples, R(%), is defined as

    R(\%) = \sum_{i=1}^{n+k} R(\%)_i / (n+k)        (7)

where all symbols have been defined in Eqs. 1, 3 and 4. If there are no prediction samples (k = 0), then R(%) = R1(%). R(%) is the fitting residual expressing the fitness of the learning and prediction processes together.

When the four algorithms (C-SVM, NBAY, BAYSD, MRA) are used to solve a real-world problem, they often produce different solution accuracies. Toward this issue, the following two criteria are proposed.

1) Criterion 1: Solution Accuracy of a Given Algorithm Application. Whether for the linear algorithm (MRA) or the nonlinear algorithms (C-SVM, NBAY, BAYSD), the R(%) of a studied problem expresses the accuracy of the y = y(x) obtained by each algorithm, i.e. the solution accuracy of each algorithm for solving the studied problem. This solution accuracy is divided into three classes: high, moderate, and low (Table 1).

2) Criterion 2: Nonlinearity Degree of a Studied Problem. Since MRA is a linear algorithm, its R(%) for a studied problem expresses the nonlinearity degree of the y = y(x) to be solved, i.e. the nonlinearity degree of the studied problem. This nonlinearity degree is divided into three classes: weak, moderate, and strong (Table 1).

TABLE 1 CRITERION 1 (SOLUTION ACCURACY OF A GIVEN ALGORITHM APPLICATION) AND CRITERION 2 (NONLINEARITY DEGREE OF A STUDIED PROBLEM)

Range of R(%)      Criterion 1 (solution accuracy of a given algorithm      Criterion 2 (nonlinearity degree of a studied
                   application), based on R(%) of C-SVM, NBAY, BAYSD, MRA   problem), based on R(%) of MRA
R(%) < 10          High                                                     Weak
10 ≤ R(%) ≤ 30     Moderate                                                 Moderate
R(%) > 30          Low                                                      Strong
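The residual definitions of Eqs. 4-7 and the two criteria of Table 1 are straightforward to compute; the following is a minimal Python sketch, with function and variable names chosen here for illustration.

```python
import numpy as np

def residuals(y_calc, y_obs, n):
    """R(%)_i (Eq. 4), R1(%) (Eq. 5), R2(%) (Eq. 6) and R(%) (Eq. 7).
    y_calc/y_obs hold the n learning samples first, then the k prediction
    samples; y_obs must contain no zeros (see the note under Eq. 4)."""
    y_calc, y_obs = np.asarray(y_calc, float), np.asarray(y_obs, float)
    r_i = np.abs((y_calc - y_obs) / y_obs) * 100.0   # R(%)_i for every sample
    r1 = r_i[:n].mean()                              # learning residual R1(%)
    r2 = r_i[n:].mean() if len(r_i) > n else None    # prediction residual, if k > 0
    r = r_i.mean()                                   # total residual R(%)
    return r_i, r1, r2, r

def criterion(r):
    """The three classes of Table 1 for a given R(%): Criterion 1 / Criterion 2."""
    return "High/Weak" if r < 10 else ("Moderate/Moderate" if r <= 30 else "Low/Strong")
```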

Now a machine-learning scheme can be presented, based on the two proposed criteria. The two criteria proved useful in the results of the case study below. Based on the analyses of this case study, the presented scheme can therefore be constructed from two major rules: a) the R(%) of MRA is used to measure the nonlinearity degree of a given problem, and thus MRA should be run as a first choice; and b) for a classification problem, if its nonlinearity degree is strong, C-SVM can be used, but NBAY and BAYSD cannot. Indeed, if the nonlinearity degree is weak or moderate, BAYSD can be used, while NBAY can sometimes be used [1, 2].

2.2 Methods of the Four Algorithms

Through the learning process, each algorithm constructs its own function y = y(x). It is noted that the y = y(x) created by the four algorithms (C-SVM, NBAY, BAYSD, MRA) are explicit expressions, i.e. they are expressed as usual mathematical formulas. The following describes the methods of the four algorithms.

1) C-SVM. The C-SVM procedure, a binary classifier [1, 2, 4-8], has been gradually applied since the 1990s and widely applied in this century. C-SVM is a machine-learning approach based on statistical learning theory. It is essentially performed by converting a real-world problem (the original space) into a new, higher-dimensional feature space using a kernel function, and then constructing a linear discriminant function in the new space to replace the nonlinear discriminant function of the original space. Theoretically, C-SVM can obtain the global optimal solution and avoid converging to a local optimum. In the case study below, the binary classifier has been employed. Moreover, under strongly nonlinear conditions it is better to take the RBF (radial basis function) as the kernel function than the linear, polynomial or sigmoid functions [1, 2, 6]; thus the kernel function used here is the RBF, and the termination calculation accuracy TCA is fixed to 0.001.

The formula created by this technique is an expression with respect to a vector x, the so-called nonlinear function y = C-SVM(x1, x2, …, xm) [1, 2, 6]:

    y = \sum_{i=1}^{n} y_i \alpha_i \exp(-\gamma \| x - x_i \|^{2}) + b        (8)

where α is the vector of Lagrange multipliers, α = (α1, α2, …, αn), with 0 ≤ αi ≤ C, where C is the penalty factor, subject to the constraint \sum_{i=1}^{n} y_i \alpha_i = 0; \exp(-\gamma \| x - x_i \|^{2}) is the RBF kernel function; γ is the regularization parameter, γ > 0; and b is the offset of the separating hyperplane, which can be calculated using the free vectors xi. These free xi are the vectors corresponding to αi > 0, on which the final C-SVM model depends. The αi, C, and γ can be solved using the dual quadratic optimization:

    \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \exp(-\gamma \| x_i - x_j \|^{2}) \right\}        (9)
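As a practical note, the dual problem of Eq. 9 is solved internally by LIBSVM [6], which scikit-learn's SVC wraps. The following hedged sketch shows how the quantities of Eq. 8 surface in that interface; the data values are placeholders, not the paper's samples.

```python
# A sketch of C-SVM with the RBF kernel via scikit-learn (a LIBSVM wrapper);
# parameter names follow Eqs. 8-9, and the arrays below are placeholders.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2911.4], [1, 2911.5], [2, 2910.4], [2, 2910.0]])
y = np.array([1, 1, 2, 2])

clf = SVC(kernel="rbf",   # K(x, x_i) = exp(-gamma * ||x - x_i||^2)
          C=1.0,          # penalty factor C bounding the multipliers alpha_i
          gamma=1.0,      # regularization parameter gamma > 0
          tol=0.001)      # termination accuracy, cf. the paper's TCA = 0.001
clf.fit(X, y)             # solves the dual quadratic optimization of Eq. 9

print(clf.support_vectors_)        # the "free vectors" x_i with alpha_i > 0
print(clf.dual_coef_)              # the products y_i * alpha_i of Eq. 8
print(clf.intercept_)              # the offset b of Eq. 8
print(clf.predict([[1, 2911.0]]))  # evaluate the fitted y = C-SVM(x)
```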

It is noted that in the case study below the formulas corresponding to Eq. 8 are not written out concretely, owing to their large size.

2) NBAY. The NBAY procedure has been applied since the 1990s and widely applied in this century [1, 2, 9]. The following introduces the NBAY technique, i.e. the naïve Bayesian. The formula created by this technique is a set of nonlinear products with respect to the m parameters (x1, x2, …, xm) [1, 2, 10, 11]:



    N_l(x) = \prod_{j=1}^{m} \frac{1}{\sigma_{jl} \sqrt{2\pi}} \exp\!\left( \frac{-(x_j - \mu_{jl})^{2}}{2\sigma_{jl}^{2}} \right) \quad (l = 1, 2, \ldots, L)        (10)

where l is the class number, L is the number of classes, Nl(x) is the discrimination function of the lth class with respect to x, σjl is the mean square error of xj in Class l, and μjl is the mean of xj in Class l. Eq. 10 is the so-called naïve Bayesian discrimination function.


Once Eq. 10 is created, any sample shown by Eq. 1 or Eq. 3 can be substituted into Eq. 10 to obtain L values: N1, N2, …, NL. If

    N_{l_b} = \max_{1 \le l \le L} \{ N_l \}        (11)

then y = lb for this sample. Eq. 11 defines the so-called nonlinear function y = NBAY(x1, x2, …, xm).

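Eqs. 10-11 translate directly into code. The sketch below assumes the per-class means μjl and standard deviations σjl have already been estimated from the learning samples; all numeric values in the example call are made up for illustration.

```python
# A direct sketch of evaluating Eqs. 10-11 for one sample.
import numpy as np

def nbay_predict(x, mu, sigma):
    """x: (m,) sample; mu, sigma: (L, m) per-class parameters.
    Returns the class l_b maximizing the discrimination function N_l(x)."""
    x, mu, sigma = map(np.asarray, (x, mu, sigma))
    # N_l(x) = prod_j 1/(sigma_jl*sqrt(2*pi)) * exp(-(x_j-mu_jl)^2/(2*sigma_jl^2))
    dens = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    N = dens.prod(axis=1)          # one value N_l per class (Eq. 10)
    return int(np.argmax(N)) + 1   # Eq. 11: y = l_b, with classes numbered from 1

# Illustrative two-class, two-variable call with hypothetical parameters:
print(nbay_predict([2911.0, 1281.3],
                   mu=[[2911.2, 1281.8], [2910.4, 1280.7]],
                   sigma=[[0.5, 0.3], [0.4, 0.4]]))
```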
3) BAYSD.

The BAYSD procedure has been applied since the 1990s and widely applied in this century [1, 2, 12, 13]. The following introduces the BAYSD technique. The formula created by this technique is a set of nonlinear combinations with respect to the m parameters (x1, x2, …, xm), plus two constant terms [1, 2]:

    B_l(x) = \ln(p_l) + c_{0l} + \sum_{j=1}^{m} c_{jl} x_j \quad (l = 1, 2, \ldots, L)        (12)

where l is the class number, L is the number of classes, Bl(x) is the discrimination function of the lth class with respect to x, cjl is the coefficient of xj in the lth discrimination function, and pl and c0l are two constant terms in the lth discrimination function. The constants pl, c0l, c1l, c2l, …, cml are deduced using Bayes' theorem and calculated by the successive Bayesian discrimination of BAYSD. Eq. 12 is the so-called Bayesian discrimination function. In rare cases an introduced xk can be deleted from the Bayesian discrimination function, and in much rarer cases a deleted xk can be introduced again. Therefore, Eq. 12 is usually solved via m iterations.

Once Eq. 12 is created, any sample shown by Eq. 1 or Eq. 3 can be substituted into Eq. 12 to obtain L values: B1, B2, …, BL. If

    B_{l_b} = \max_{1 \le l \le L} \{ B_l \}        (13)

then y = lb for this sample. Eq. 13 defines the so-called nonlinear function y = BAYSD(x1, x2, …, xm).
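The successive Bayesian discrimination that produces the constants is not reproduced here, but once they are available, evaluating Eqs. 12-13 is a one-liner. In the sketch below, all coefficient values are hypothetical placeholders (they are not the fitted Eq. 17 values of the case study).

```python
# A minimal sketch of evaluating Eqs. 12-13, assuming the constants p_l,
# c_0l and c_jl have already been produced by the successive discrimination.
import numpy as np

def baysd_predict(x, p, c0, c):
    """x: (m,); p: (L,) priors; c0: (L,); c: (L, m) coefficients.
    Returns the class l_b maximizing B_l(x) = ln(p_l) + c_0l + sum_j c_jl*x_j."""
    B = np.log(p) + np.asarray(c0) + np.asarray(c) @ np.asarray(x)  # Eq. 12
    return int(np.argmax(B)) + 1                                    # Eq. 13

# Illustrative call with hypothetical two-class, two-variable constants:
print(baysd_predict([1.0, 2910.8], p=[0.6, 0.4],
                    c0=[-3.0, -2.5], c=[[1.2, 0.8], [0.9, 1.1]]))
```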

4) MRA. The MRA procedure has been widely applied since the 1970s, and the successive regression analysis, the most popular MRA technique, is still a very useful tool [1, 2, 14, 15]. The formula created by this technique is a linear combination with respect to the m parameters (x1, x2, …, xm), plus a constant term, the so-called linear function y = MRA(x1, x2, …, xm) [1, 2]:

    y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_m x_m        (14)

where the constants b0, b1, b2, …, bm are deduced using regression criteria and calculated by the successive regression analysis of MRA. Eq. 14 is the so-called regression equation. In rare cases an introduced xk can be deleted from the regression equation, and in much rarer cases a deleted xk can be introduced again. Therefore, Eq. 14 is usually solved via m iterations.
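The form of Eq. 14 can be fitted by ordinary least squares, as in the hedged sketch below; note that the paper's successive regression analysis additionally introduces and removes variables iteratively, which this minimal version does not attempt, and the sample arrays are placeholders.

```python
# Fit the linear form of Eq. 14 by ordinary least squares (a simplification
# of the successive regression analysis; no variable selection is done here).
import numpy as np

X = np.array([[1, 2911.4, 3065.3], [1, 2911.5, 3065.4],
              [2, 2910.4, 3063.5], [2, 2910.0, 3063.0]])  # placeholder samples
y = np.array([1.0, 2.0, 1.0, 2.0])

A = np.column_stack([np.ones(len(X)), X])     # prepend a column for b0
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # (b0, b1, ..., bm) of Eq. 14
y_hat = A @ coef                              # regression estimates y_i
print(coef, y_hat)
```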

2.3 Dimensionality Reduction

Dimensionality reduction means reducing the number of dimensions of a data space as far as possible while leaving the results of the studied problem unchanged. Its benefits are threefold: reducing the amount of data enhances the calculation speed; reducing the number of independent variables extends the applicable ranges; and reducing the misclassification ratio of prediction samples enhances the processing quality.

Among the aforementioned four algorithms, MRA and BAYSD can each serve as a promising dimension-reduction tool, because both can give the dependence of the predicted value (y) on the independent variables (x1, x2, …, xm) in decreasing order. However, because MRA performs data analysis under linear correlation whereas BAYSD does so under nonlinear correlation, in applications the preferable tool is BAYSD; MRA is applicable only when the studied problem is linear. Whether such a "promising tool" succeeds or not needs a high-quality nonlinear tool (e.g., C-SVM) for validation, so as to determine how many independent variables can be removed. For instance, the classification problem in the case study below indicates that the 6-D problem (x1, x2, x3, x4, x5, y) cannot be reduced to the 5-D problem (x1, x3, x4, x5, y). A validation sketch follows.
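The validation step can be sketched as follows: drop the least important variable according to the BAYSD/MRA dependence order, re-run C-SVM, and compare the total residual R(%). The arrays below are placeholders, and the column index to drop is illustrative.

```python
# A sketch of the dimension-reduction validation described above.
import numpy as np
from sklearn.svm import SVC

def r_total(X, y):
    """R(%) of a C-SVM fitted and then re-applied to the same samples."""
    y_hat = SVC(kernel="rbf").fit(X, y).predict(X)
    return float(np.abs((y_hat - y) / y).mean() * 100)

X = np.array([[1, 2911.4, 3065.3], [1, 2911.5, 3065.4],
              [2, 2910.4, 3063.5], [2, 2910.0, 3063.0]])
y = np.array([1, 1, 2, 2])

r_full = r_total(X, y)                        # all m variables kept
r_drop = r_total(np.delete(X, 1, axis=1), y)  # e.g. drop column 1 (here "x2")
print(r_full, r_drop)  # the reduction fails if r_drop is clearly worse
```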

3 CASE STUDY: THE PREDICTION OF METHANE INCLUSION TYPES

The objective of this case study is to conduct methane inclusion classification (MIC) for high density methane inclusions, which has practical value when experimental data are limited. Data of 25 samples from the Puguang Gasfield in south-western China [3] are used; each sample contains 5 independent variables (x1 = major mineral, x2 = CH4, x3 = high wave number, x4 = low wave number, x5 = C6H6) and an experimental result (y* = MIC). Of these, 23 are taken as learning samples and 2 as prediction samples for the prediction of MIC (Table 2) by C-SVM, NBAY and BAYSD.

TABLE 2 INPUT DATA FOR METHANE INCLUSION CLASSIFICATION OF THE PUGUANG GASFIELD [modified from (Liu et al., 2013)]

Sample type         No.  Inclusion No.  x1  x2       x3       x4       x5       MIC b, y*
Learning samples     1   PG5-d          1   2910.79  1281.46  1384.86  3065.3   1
                     2   PG5-e          1   2911.44  1281.96  1384.64  3066.6   1
                     3   PG5-g          1   2910.79  1281.73  1384.33  3064.9   1
                     4   PG5-h          1   2911.69  1282.03  1384.89  3067.8   1
                     5   PG5-u          1   2910.97  1281.46  1384.86  3065.4   1
                     6   PG5-L          1   2911.44  1281.95  1384.85  3064.1   2
                     7   PG5-m          1   2911.51  1281.56  1385     3065.4   2
                     8   PG5-n          1   2911.51  1281.91  1384.95  3065.2   2
                     9   PG3-c          2   2911     1281     1384.5   3063     1
                    10   PG3-d          2   2910.3   1281.5   1384.1   3061.7   1
                    11   PG3-e          2   2911     1280.9   1384.4   3063     1
                    12   PG3-f          2   2911     1281.4   1383.6   3063.03  1
                    13   PG3-g          2   2911     1281.2   1384.1   3064.3   1
                    14   PG3-h          2   2910     1280.1   1383.3   3063     1
                    15   PG3-3          2   2910.4   1280.7   1384     3063.5   1
                    16   PG3-36         2   2910.8   1280.9   1383.3   3060.9   1
                    17   PG3-33         2   2910     1280.7   1383.8   3064.8   1
                    18   PG3-44         2   2909.8   1280.6   1383.7   3064.8   2
                    19   PG3-14         2   2910.4   1280.7   1383.7   3603.6   2
                    20   PG3-16         2   2910.4   1280.1   1383.3   3062.9   2
                    21   PG3-18         2   2910.4   1280.7   1383.2   3065.4   2
                    22   PG3-65         2   2910.4   1280.5   1383.3   3062.5   2
                    23   PG3-66         2   2910.4   1281.4   1384.5   3063.5   2
Prediction samples  24   PG5-a          1   2910.76  1281.23  1384.20  3062.1   (1)
                    25   PG3-17         2   2910.4   1280     1383.5   3065.4   (2)

a x1 = major mineral (1 = calcite, 2 = quartz); x2 = CH4; x3 = high wave number; x4 = low wave number; x5 = C6H6. x2–x5 are displacements of characteristic peaks measured by Raman spectra for high density methane inclusions (cm−1).
b y* = MIC = methane inclusion classification (1 = single phase, 2 = gas/solid) determined by experimentation; a number in parentheses is not input data, but is used for calculating R(%)i.

Table 2 shows that within each of x2, x3, x4, and x5, the values are very close to one another. This is a rare case in data mining, so algorithms with high discriminating ability are required.

3.1 Learning Process

Using the 23 learning samples (Table 2) and C-SVM, NBAY, BAYSD and MRA, the following four functions of MIC (y) with respect to the 5 independent variables (x1, x2, x3, x4, x5) have been constructed.

Using C-SVM, the result is an explicit nonlinear function corresponding to Eq. 8:

    y = \text{C-SVM}(x_1, x_2, x_3, x_4, x_5)        (15)

with C = 8192, γ = 8, 22 free vectors xi, and a cross-validation accuracy CVA = 73.913%.

Using NBAY, the result is an explicit nonlinear discrimination function corresponding to Eq. 10:

    N_l(x) = \prod_{j=1}^{5} \frac{1}{\sigma_{jl} \sqrt{2\pi}} \exp\!\left( \frac{-(x_j - \mu_{jl})^{2}}{2\sigma_{jl}^{2}} \right) \quad (l = 1, 2)        (16)

where for l = 1, σj1 = 0.479, 0.469, 0.517, 0.527, 1.8 and μj1 = 1.64, 2911, 1281, 1384, 3064; for l = 2, σj2 = 0.471, 0.589, 0.631, 0.719, 170 and μj2 = 1.67, 2911, 1281, 1384, 3124.

Using BAYSD, the result is an explicit nonlinear discrimination function corresponding to Eq. 12:

    B_1(x) = \ln(0.609) - 0.301 \times 10^{8} + 16399 x_1 + 20469 x_2 - 3297 x_3 + 3537 x_4 + 1.76 x_5
    B_2(x) = \ln(0.391) - 0.301 \times 10^{8} + 16398 x_1 + 20469 x_2 - 3298 x_3 + 3537 x_4 + 1.77 x_5        (17)

From the successive process, MIC (y) is shown to depend on the 5 independent variables in the decreasing order x5, x3, x1, x4, x2.

Though MRA is a regression algorithm rather than a classification algorithm, it can provide the nonlinearity degree of the studied problem, and thus it is required to run MRA. Using MRA, the result is an explicit linear function corresponding to Eq. 14:

    y = -359 - 0.274 x_1 - 0.000641 x_2 - 0.213 x_3 - 0.0618 x_4 + 0.00113 x_5        (18)

Equation 18 yields a residual variance of 0.894 and a multiple correlation coefficient of 0.325. From the regression process, MIC (y) is shown to depend on the 5 independent variables in the same decreasing order: x5, x3, x1, x4, x2.
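For readers who wish to retrace the C-SVM learning step, the following hedged sketch feeds the 23 learning samples of Table 2 into scikit-learn's LIBSVM wrapper with the reported settings (C = 8192, γ = 8, TCA = 0.001); exact agreement with the paper's LIBSVM run and its cross-validation accuracy is not guaranteed.

```python
import numpy as np
from sklearn.svm import SVC

# The 23 learning samples of Table 2: columns x1, x2, x3, x4, x5, y* (MIC).
learn = np.array([
    [1, 2910.79, 1281.46, 1384.86, 3065.3, 1], [1, 2911.44, 1281.96, 1384.64, 3066.6, 1],
    [1, 2910.79, 1281.73, 1384.33, 3064.9, 1], [1, 2911.69, 1282.03, 1384.89, 3067.8, 1],
    [1, 2910.97, 1281.46, 1384.86, 3065.4, 1], [1, 2911.44, 1281.95, 1384.85, 3064.1, 2],
    [1, 2911.51, 1281.56, 1385.00, 3065.4, 2], [1, 2911.51, 1281.91, 1384.95, 3065.2, 2],
    [2, 2911.00, 1281.00, 1384.50, 3063.0, 1], [2, 2910.30, 1281.50, 1384.10, 3061.7, 1],
    [2, 2911.00, 1280.90, 1384.40, 3063.0, 1], [2, 2911.00, 1281.40, 1383.60, 3063.03, 1],
    [2, 2911.00, 1281.20, 1384.10, 3064.3, 1], [2, 2910.00, 1280.10, 1383.30, 3063.0, 1],
    [2, 2910.40, 1280.70, 1384.00, 3063.5, 1], [2, 2910.80, 1280.90, 1383.30, 3060.9, 1],
    [2, 2910.00, 1280.70, 1383.80, 3064.8, 1], [2, 2909.80, 1280.60, 1383.70, 3064.8, 2],
    [2, 2910.40, 1280.70, 1383.70, 3603.6, 2], [2, 2910.40, 1280.10, 1383.30, 3062.9, 2],
    [2, 2910.40, 1280.70, 1383.20, 3065.4, 2], [2, 2910.40, 1280.50, 1383.30, 3062.5, 2],
    [2, 2910.40, 1281.40, 1384.50, 3063.5, 2]])
X_learn, y_learn = learn[:, :5], learn[:, 5].astype(int)

# The 2 prediction samples (Nos. 24-25 of Table 2).
X_pred = np.array([[1, 2910.76, 1281.23, 1384.20, 3062.1],
                   [2, 2910.40, 1280.00, 1383.50, 3065.4]])

clf = SVC(kernel="rbf", C=8192, gamma=8, tol=0.001)  # the reported settings
clf.fit(X_learn, y_learn)                  # Eq. 15: y = C-SVM(x1, ..., x5)
print(len(clf.support_vectors_))           # the paper reports 22 free vectors
print(clf.predict(X_learn))                # learning validation, cf. Table 3
print(clf.predict(X_pred))                 # prediction; the paper obtains 1 and 2
```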

3.2 Prediction Process

TABLE 3 PREDICTION RESULTS FROM METHANE INCLUSION CLASSIFICATION OF THE PUGUANG GASFIELD

                                      C-SVM        NBAY         BAYSD        MRA
Sample type   No.  MIC a, y*     y   R(%)i     y   R(%)i    y   R(%)i    y     R(%)i
Learning       1       1         1   0         1   0        1   0        1.43  43.5
samples        2       1         1   0         1   0        2   100      1.34  34.3
               3       1         1   0         1   0        1   0        1.41  40.9
               4       1         1   0         1   0        2   100      1.31  31.4
               5       1         1   0         1   0        1   0        1.43  43.5
               6       2         2   0         1   50       1   50       1.33  33.6
               7       2         2   0         1   50       1   50       1.40  29.8
               8       2         2   0         1   50       1   50       1.33  33.4
               9       1         1   0         1   0        1   0        1.28  27.8
              10       1         1   0         1   0        1   0        1.19  19.5
              11       1         1   0         1   0        2   100      1.31  30.5
              12       1         1   0         1   0        2   100      1.25  24.8
              13       1         1   0         1   0        2   100      1.26  26.1
              14       1         1   0         1   0        2   100      1.54  54.4
              15       1         1   0         1   0        2   100      1.37  37.4
              16       1         1   0         1   0        1   0        1.37  37.1
              17       1         1   0         1   0        2   100      1.39  38.8
              18       2         2   0         1   50       2   0        1.42  29.2
              19       2         2   0         2   0        2   0        2.00  0.0364
              20       2         2   0         1   50       2   0        1.54  22.8
              21       2         2   0         1   50       2   0        1.43  28.8
              22       2         2   0         1   50       2   0        1.46  27.1
              23       2         2   0         1   50       2   0        1.19  40.3
Prediction    24       1         1   0         1   0        1   0        1.52  52.1
samples       25       2         2   0         1   50       2   0        1.56  22.2

a MIC = methane inclusion classification (1 = single phase, 2 = gas/solid)


Substituting the values of the 5 independent variables (x1, x2, x3, x4, x5) given by the 23 learning samples and 2 prediction samples (Table 2) into Eqs. 15, 16 (then Eq. 11), 17 (then Eq. 13) and 18, respectively, the MIC (y) of each sample is obtained (Table 3).

Table 4 shows that a) the nonlinearity degree of this studied problem is strong, since the R(%) of MRA is 32; and b) the solution accuracies of C-SVM, NBAY and BAYSD are high, moderate and low, respectively. Therefore, only C-SVM is applicable; though the solution accuracy of NBAY is moderate, it is inapplicable. This might be due to the fact that within each of x2, x3, x4, and x5 the values are very close to one another, showing that C-SVM has high discriminating ability.

TABLE 4 COMPARISON AMONG THE APPLICATIONS OF THE CLASSIFICATION ALGORITHMS (C-SVM, NBAY AND BAYSD) TO METHANE INCLUSION CLASSIFICATION OF THE PUGUANG GASFIELD

           Fitting              Mean absolute relative residual   Dependence of y on (x1, x2, x3,     Time consumed on      Solution
Algorithm  formula              R1(%)   R2(%)   R(%)              x4, x5), in decreasing order        PC (Intel Core 2)     accuracy
C-SVM      Nonlinear, explicit  0       0       0                 N/A                                 5 s                   High
NBAY       Nonlinear, explicit  17.4    25      18                N/A                                 <1 s                  Moderate
BAYSD      Nonlinear, explicit  41.3    0       38                x5, x3, x1, x4, x2                  1 s                   Low
MRA        Linear, explicit     32      37      32                x5, x3, x1, x4, x2                  <1 s                  Strong nonlinearity

3.3 Dimension Reduction Failed

Both BAYSD and MRA give the dependence of the predicted value (y) on the 5 independent variables in the decreasing order x5, x3, x1, x4, x2 (Table 4). According to this dependence order, x2 was deleted first and C-SVM was rerun; it was found that the results of C-SVM changed, i.e., R(%) = 18, which is much greater than the previous R(%) = 0 (Table 4). Thus the 6-D problem (x1, x2, x3, x4, x5, y) cannot become the 5-D problem (x1, x3, x4, x5, y), showing that the expression of y needs all of x1, x2, x3, x4, x5. In general, dimension reduction succeeds for high-dimensional problems [1, 5, 8].

4 CONCLUSIONS

Through the aforementioned case study, five major conclusions can be drawn: 1) the two proposed criteria (solution accuracy of a given algorithm application, nonlinearity degree of a studied problem) are practical; 2) the total mean absolute relative residual R(%) of MRA can be used to measure the nonlinearity degree of a studied problem, and thus MRA should be run first; 3) none of NBAY, BAYSD and MRA can be applied to classification problems with strong nonlinearity, but C-SVM can; 4) if a classification problem has weak or moderate nonlinearity, C-SVM and BAYSD are in general applicable, whereas NBAY is sometimes applicable; and 5) besides being a strongly nonlinear problem, this case study is a rare case in data mining in that within each of x2, x3, x4, and x5 the values are very close to one another, showing that C-SVM has high discriminating ability.

ACKNOWLEDGMENT

This work was supported by the Research Institute of Petroleum Exploration and Development (RIPED) and PetroChina.

REFERENCES

[1] Shi G. “Data Mining and Knowledge Discovery for Geoscientists.” Elsevier Inc, USA, 2013
[2] Shi G. “Optimal prediction in petroleum geology by regression and classification methods.” Sci J Inf Eng 5(2): 14-32, 2015
[3] Liu D, Xiao X, Tian H, Dai J, Wang Y, Yang C, Hu A, Mi J. “Occurrence and genesis of multiple types of high density methane inclusions in Puguang Gas Field.” Sci J Earth Sci 3(1): 1-10, 2013
[4] Shi G. “The use of support vector machine for oil and gas identification in low-porosity and low-permeability reservoirs.” Int J Math Model Numer Optimisa 1(1/2): 75-87, 2009
[5] Shi G, Yang X. “Optimization and data mining for fracture prediction in geosciences.” Procedia Comput Sci 1(1): 1353-1360, 2010
[6] Chang C, Lin C. “LIBSVM: a library for support vector machines, Version 3.1.” Retrieved from www.csie.ntu.edu.tw/~cjlin/libsvm, 2011
[7] Zhu Y, Shi G. “Identification of lithologic characteristics of volcanic rocks by support vector machine.” Acta Petrolei Sinica 34(2): 312-322, 2013
[8] Shi G, Zhu Y, Mi S, Ma J, Wan J. “A big data mining in petroleum exploration and development.” Adv Petrol Expl Devel 7(2): 18, 2014
[9] Ramoni M, Sebastiani P. “Robust Bayes classifiers.” Artificial Intelligence 125(1-2): 207-224, 2001
[10] Tan P, Steinbach M, Kumar V. “Introduction to Data Mining.” Pearson Education, Boston, MA, USA, 2005
[11] Han J, Kamber M. “Data Mining: Concepts and Techniques, 2nd Ed.” Morgan Kaufmann, San Francisco, CA, USA, 2006
[12] Denison DGT, Holmes CC, Mallick BK, Smith AFM. “Bayesian Methods for Nonlinear Classification and Regression.” John Wiley & Sons Inc, Chichester, England, UK, 2002
[13] Shi G. “Four classifiers used in data mining and knowledge discovery for petroleum exploration and development.” Adv Petrol Expl Devel 2(2): 12-23, 2011
[14] Sharma MSR, O'Regan M, Baxter CDP, Moran K, Vaziri H, Narayanasamy R. “Empirical relationship between strength and geophysical properties for weakly cemented formations.” J Petro Sci Eng 72(1-2): 134-142, 2010
[15] Singh J, Shaik B, Singh S, Agrawal VK, Khadikar PV, Deeb O, Supuran CT. “Comparative QSAR study on para-substituted aromatic sulphonamides as CAII inhibitors: information versus topological (distance-based and connectivity) indices.” Chem Biol Drug Design 71: 244-259, 2008

AUTHOR

Guangren Shi was born in Shanghai, China, in February 1940. He is an Expert and Professor with the qualification of directing Ph.D. students. He graduated from Xi'an Jiaotong University, China, in 1963, majoring in applied mathematics (1958–1963). Since 1963, he has been engaged in computer applications for petroleum exploration and development. In the recent 30 years, his research has covered two fields: basin modeling (petroleum systems) and data mining for geosciences; in recent years, he has focused on the latter more than the former.

He has more than 50 years of professional experience, working for the computer center of Daqing Oilfield, Petroleum Ministry, China, as Associate Engineer and Head of the software group (1963–1967); the computer center of Shengli Oilfield, Petroleum Ministry, China, as Engineer and Director (1967–1978); the computer center of the Petroleum Ministry, China, as Engineer and Head of the software group (1978–1985); the Aldridge Laboratory of Applied Geophysics, Columbia University, New York City, U.S.A., as Visiting Scholar (1985–1987); the computer application technology research department, Research Institute of Petroleum Exploration and Development (RIPED), China National Petroleum Corporation (CNPC), China, as Professor and Director (1987–1997); RIPED, PetroChina Company Limited (PetroChina), China, as Professor with the qualification of directing Ph.D. students and Deputy Chief Engineer (1997–2001); and the department of experts in RIPED of PetroChina, China, as Expert and Professor with the qualification of directing Ph.D. students (2001–present).

He has published eight books, of which three are in English: 1) Shi G. R. 2013. Data Mining and Knowledge Discovery for Geoscientists. Elsevier Inc, USA. 367 pp; 2) Shi G. R. 2005. Numerical Methods of Petroliferous Basin Modeling, 3rd edition. Petroleum Industry Press, Beijing, China. 338 pp, which was book-reviewed by Mathematical Geosciences in 2009; and 3) Shi G. R. 2000. Numerical Methods of Petroliferous Basin Modeling, 2nd edition. Petroleum Industry Press, Beijing, China. 233 pp, which was book-reviewed by Mathematical Geology in 2006. He has also published 74 articles, of which 16 are in English, including four indexed by SCI: 1) Shi G. R., Zhang Q. C., Yang X. S., Mi S. Y. 2010. Oil and gas assessment of the Kuqa Depression of Tarim Basin in western China by simple fluid flow models of primary and secondary migrations of hydrocarbons. Journal of Petroleum Science and Engineering, 75(1-2): 77–90; 2) Shi G. R. 2009. A simplified dissolution-precipitation model of the smectite to illite transformation and its application. Journal of Geophysical Research-Solid Earth, 114, B10205, doi:10.1029/2009JB006406; 3) Shi G. R. 2008. Basin modeling in the Kuqa Depression of the Tarim Basin (Western China): A fully temperature-dependent model of overpressure history. Mathematical Geosciences, 40(1): 47–62; and 4) Shi G. R., Zhou X. X., Zhang G. Y., Shi X. F., Li H. H. 2004. The use of artificial neural network analysis and multiple regression for trap quality evaluation: a case study of the Northern Kuqa Depression of Tarim Basin in western China. Marine and Petroleum Geology, 21(3): 411–420.

Prof. Shi is a Member of the Society of Petroleum Engineers (International), a Member of the Chinese Association of Science and Technology, and a Member of the Petroleum Society of China. He is also Regional Editor (Asia) of the International Journal of Mathematical Modelling and Numerical Optimisation, and a Member of the Editorial Board of the Journal of Petroleum Science Research. He has received three honors: 1) A Person Studying Overseas and Returning with Excellent Contribution, appointed by the Ministry of Education of China (1991); 2) Special Government Allowance, awarded by the State Council of China (1994); and 3) Grand Award of Sun Yueqi Energy, awarded by the Ministry of Science-Technology of China (1997). He has also obtained four awards of Science-Technology Progress, of which one is a China National Award and three are from CNPC and PetroChina.


