Scientific Journal of Control Engineering June 2015, Volume 5, Issue 3, PP.37‐50
Optimal Selection of Classification Algorithms for Well Log Interpretation

Guangren Shi†, Jinshan Ma, Dan Ba

Research Institute of Petroleum Exploration and Development, PetroChina, P. O. Box 910, Beijing 100083, China

†Email: grs@petrochina.com.cn
Abstract

Three classification algorithms and one regression algorithm have been applied to well log interpretation. The three classification algorithms are the classification of support vector machine (C-SVM), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD), while the one regression algorithm is the multiple regression analysis (MRA). Of these four algorithms, only MRA is linear; the other three are nonlinear. In general, when all four algorithms are applied to the same real-world problem, they produce different solution accuracies. To address this issue, the solution accuracy is expressed by the total mean absolute relative residual for all samples, R(%), and three criteria are proposed: 1) the nonlinearity degree of a studied problem, based on the R(%) of MRA (weak if R(%)<10, strong if R(%)≥10); 2) the solution accuracy of a given algorithm application, based on its R(%) (high if R(%)<10, low if R(%)≥10); and 3) the results availability of a given algorithm application, based on its R(%) (applicable if R(%)<10, inapplicable if R(%)≥10). Four case studies, each a classification problem, have been used to validate the proposed approach. The calculation results indicate that a) when a case study is a weakly nonlinear problem, C-SVM, NBAY and BAYSD are all applicable, and BAYSD is better than C-SVM and NBAY; b) when a case study is a strongly nonlinear problem, C-SVM is applicable, NBAY is inapplicable, and BAYSD is only sometimes applicable; and c) BAYSD and C-SVM can be applied to dimensionality reduction.

Keywords: Support Vector Machine; Naïve Bayesian; Bayesian Successive Discrimination; Multiple Regression Analysis; Problem Nonlinearity; Solution Accuracy; Results Availability; Dimensionality Reduction
1 INTRODUCTION

In recent years, classification algorithms have seen enormous success in several fields of business and science, but their application to well log interpretation is still at an early stage. This is because well log interpretation differs from those other fields: its data are of miscellaneous types and huge quantity, are measured with differing precision, and carry many uncertainties into the results. Three classification algorithms and one regression algorithm have been applied to well log interpretation. The three classification algorithms are the classification of support vector machine (C-SVM), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD), while the one regression algorithm is the multiple regression analysis (MRA). Of the four algorithms, only MRA is linear; the other three are nonlinear. In general, when all four algorithms are applied to the same real-world problem, they produce different solution accuracies. To address this issue, the solution accuracy is expressed by the total mean absolute relative residual for all samples, R(%). Three criteria are then proposed: 1) the nonlinearity degree of a studied problem, based on the R(%) of MRA (weak if R(%)<10, strong if R(%)≥10); 2) the solution accuracy of a given algorithm application, based on its R(%) (high if R(%)<10, low if R(%)≥10); and 3) the results availability of a given algorithm application, based on its R(%) (applicable if R(%)<10, inapplicable if R(%)≥10). Four case studies have been used to validate the proposed approach.
2 METHODOLOGY

The methodology consists of three major parts: the definitions commonly used by the four algorithms; an introduction to the four algorithms; and dimensionality reduction.
2.1 Definitions Commonly Used by Four Algorithms

The four algorithms (C-SVM, NBAY, BAYSD, MRA) share the same sample data: they use the same known parameters and predict the same unknown. The only differences between them are the approach taken and the calculation results. Assume that there are n learning samples, each associated with m+1 numbers (x1, x2, ..., xm, y*) and a set of observed values (xi1, xi2, ..., xim, yi*), with i=1, 2, ..., n. In principle n>m, but in actual practice n>>m. The n samples, each with its m+1 numbers, are defined as n vectors:

    x_i = (x_{i1}, x_{i2}, \ldots, x_{im}, y_i^*) \quad (i=1, 2, \ldots, n)    (1)

where n is the number of learning samples; m is the number of independent variables in a sample; x_i is the i-th learning sample vector; x_{ij} is the value of the j-th independent variable in the i-th learning sample, j=1, 2, ..., m; and y_i^* is the observed value of the i-th learning sample. Equation 1 is the expression of the learning samples.

Let x0 be the general form of a vector (xi1, xi2, ..., xim). The principles of NBAY, BAYSD and MRA are the same: each tries to construct an expression, y=y(x0), such that Eq. 2 is minimized. Naturally, the three algorithms use different approaches and obtain calculation results of differing accuracies.

    \sum_{i=1}^{n} \left[ y(x_{0i}) - y_i^* \right]^2    (2)
where y(x_{0i}) is the calculated value of the dependent variable for the i-th learning sample; the other symbols have been defined in Eq. 1. The principle of the C-SVM algorithm, by contrast, is to construct an expression y=y(x0) that maximizes the margin based on the support vector points, so as to obtain the optimal separating line. This y=y(x0) is called the fitting formula obtained in the learning process; the fitting formulas of the different algorithms differ. In this paper, y is a single variable.

The workflow is as follows: the first step is the learning process, which uses the n learning samples to obtain a fitting formula; the second step is learning validation, which substitutes the n learning samples (xi1, xi2, ..., xim) into the fitting formula to get the predicted values (y1, y2, ..., yn), so as to verify the fitness of an algorithm; and the third step is the prediction process, which substitutes the k prediction samples expressed by Eq. 3 into the fitting formula to get the predicted values (yn+1, yn+2, ..., yn+k).

    x_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \quad (i=n+1, n+2, \ldots, n+k)    (3)

where k is the number of prediction samples; x_i is the i-th prediction sample vector; and the other symbols have been defined in Eq. 1. Equation 3 is the expression of the prediction samples.

Of the four algorithms, only MRA is linear whereas the other three are nonlinear, because MRA constructs a linear function whereas the other three construct nonlinear functions. To express the calculation accuracies of the prediction variable y for the learning and prediction samples, the following four types of residuals are defined.

The absolute relative residual for each sample, R(%)_i (i=1, 2, ..., n, n+1, n+2, ..., n+k), is defined as

    R(\%)_i = \left| (y_i - y_i^*) / y_i^* \right| \times 100    (4)

where y_i is the calculated value of the dependent variable for the i-th sample; the other symbols have been defined in Eqs. 1 and 3. R(%)_i is the fitting residual expressing the fitness for a sample in the learning or prediction process. Note that y_i^* must never be zero, to avoid floating-point overflow: for the regression algorithm, a sample with y_i^*=0 is deleted; for the classification algorithms, positive integers are taken as the values of y_i^*.

The mean absolute relative residual for all learning samples, R1(%), is defined as

    R_1(\%) = \sum_{i=1}^{n} R(\%)_i / n    (5)
where all symbols have been defined in Eqs. 1 and 4. R1(%) is the fitting residual expressing the fitness of the learning process.

The mean absolute relative residual for all prediction samples, R2(%), is defined as

    R_2(\%) = \sum_{i=n+1}^{n+k} R(\%)_i / k    (6)
where all symbols have been defined in Eqs. 3 and 4. R2(%) is the fitting residual expressing the fitness of the prediction process.

The total mean absolute relative residual for all samples, R(%), is defined as

    R(\%) = \sum_{i=1}^{n+k} R(\%)_i / (n+k)    (7)
where all symbols have been defined in Eqs. 1, 3 and 4. If there are no prediction samples, k=0 and R(%)=R1(%). R(%) is the fitting residual expressing the fitness of the learning and prediction processes together.

When the four algorithms (C-SVM, NBAY, BAYSD, MRA) are used to solve a real-world problem, they often produce different solution accuracies. To address this issue, the following three criteria are proposed (Table 1).

TABLE 1 PROPOSED THREE CRITERIA

| Range of R(%) | Criterion 1: nonlinearity degree of a studied problem, based on R(%) of MRA | Criterion 2: solution accuracy of a given algorithm application, based on its R(%) | Criterion 3: results availability of a given algorithm application, based on its R(%) |
|---|---|---|---|
| R(%)<10 | Weak | High | Applicable |
| R(%)≥10 | Strong | Low | Inapplicable |
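To make Eqs. 4 to 7 and Table 1 concrete, here is a minimal sketch in Python (added here, not from the paper; NumPy is an assumed dependency) that computes the residual measures and applies the R(%)<10 threshold:

```python
# Minimal sketch of the residual measures of Eqs. 4-7 and the
# thresholds of Table 1; not the authors' code.
import numpy as np

def absolute_relative_residuals(y_pred, y_true):
    """Eq. 4: R(%)_i = |(y_i - y_i*) / y_i*| * 100; y_i* must be nonzero."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return np.abs((y_pred - y_true) / y_true) * 100.0

def residual_summary(y_pred_learn, y_learn, y_pred_test=None, y_test=None):
    """Eqs. 5-7: R1(%) over the n learning samples, R2(%) over the k
    prediction samples, and the total R(%) over all n + k samples."""
    r_learn = absolute_relative_residuals(y_pred_learn, y_learn)
    r1 = r_learn.mean()                               # Eq. 5
    if y_pred_test is None:                           # k = 0, so R(%) = R1(%)
        return r1, None, r1
    r_test = absolute_relative_residuals(y_pred_test, y_test)
    r2 = r_test.mean()                                # Eq. 6
    r = np.concatenate([r_learn, r_test]).mean()      # Eq. 7
    return r1, r2, r

def meets_criterion(r_total):
    """Table 1: R(%) < 10 means weak nonlinearity (for MRA), high solution
    accuracy, and applicable results; R(%) >= 10 means the opposite."""
    return r_total < 10.0
```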
2.2 Four Algorithms

Through the learning process, each algorithm constructs its own function y=y(x). Note that the y=y(x) created by each of the four algorithms (C-SVM, NBAY, BAYSD, MRA) is an explicit expression, i.e. one written as a usual mathematical formula. The four algorithms are described below.

1) C-SVM

The C-SVM procedure, a binary classification algorithm[1-7], has been gradually applied since the 1990s and widely applied in this century. C-SVM is a machine-learning approach based on statistical learning theory. It essentially converts a real-world problem (the original space) into a new, higher-dimensional feature space using a kernel function, and then constructs a linear discriminant function in the new space to replace the nonlinear discriminant function in the original space. Theoretically, C-SVM can obtain the global optimal solution and avoid converging to a local optimum. In the four case studies below, the binary classification algorithm has been employed. Moreover, under strongly nonlinear conditions the RBF (radial basis function) performs better as a kernel function than the linear, polynomial and sigmoid functions[1, 2, 5], and thus the kernel function used here is the RBF. The termination calculation accuracy TCA is fixed at 0.001. The formula created by this technique is an expression with respect to a vector x, the so-called nonlinear function y=C-SVM(x1, x2, ..., xm)[1, 2, 5]:

    y = \sum_{i=1}^{n} y_i \alpha_i \exp(-\gamma \|x - x_i\|^2) + b    (8)

where α is the vector of Lagrange multipliers, α=(α1, α2, ..., αn), with 0≤αi≤C, where C is the penalty factor, subject to the constraint \sum_{i=1}^{n} y_i \alpha_i = 0; \exp(-\gamma \|x - x_i\|^2) is the RBF kernel function; γ is the regularization parameter, γ>0; and b is the offset of the separating hyperplane, which can be calculated using the free vectors x_i. These free x_i are the vectors corresponding to αi>0, on which the final C-SVM model depends. For given C and γ, the αi are solved from the dual quadratic optimization:

    \max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \exp(-\gamma \|x_i - x_j\|^2)    (9)

Note that in the four case studies below, the formulas corresponding to Eq. 8 are not written out concretely because of their large size.
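As an illustration of how a C-SVM of the form of Eqs. 8 and 9 can be fitted in practice, the following sketch (added here, not the paper's workflow) uses scikit-learn's SVC, a wrapper around the LIBSVM library cited in [5]; the tiny data arrays are placeholders drawn from Table 2:

```python
# Minimal C-SVM sketch along the lines of Eqs. 8-9, using scikit-learn's
# SVC (LIBSVM-based, consistent with [5]). Data and hyperparameters here
# are illustrative placeholders, not the paper's full case study.
import numpy as np
from sklearn.svm import SVC

X = np.array([[13, 48, 2.73, 6], [12, 47, 2.87, 5], [4, 52, 3.05, 0]], float)
y = np.array([1, 2, 3])  # class labels y*

# kernel='rbf' gives exp(-gamma * ||x - x_i||^2); C is the penalty factor;
# tol=0.001 mirrors the termination calculation accuracy TCA of the paper.
model = SVC(C=32768.0, gamma=0.000031, kernel="rbf", tol=0.001)
model.fit(X, y)

# model.support_vectors_ are the "free vectors" x_i with alpha_i > 0 of
# Eq. 8; for a binary problem, model.dual_coef_ holds y_i * alpha_i and
# model.intercept_ the offset b.
print(model.predict(X), len(model.support_vectors_))
```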
2) NBAY

The NBAY procedure has been widely applied since the 1990s, and even more widely in this century[1, 2, 8]. The following introduces the NBAY technique used here, i.e. the naïve Bayesian. The formula created by this technique is a set of nonlinear products with respect to the m parameters (x1, x2, ..., xm)[1, 2, 9, 10]:

    N_l(x) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma_{jl}} \exp\!\left( -\frac{(x_j - \mu_{jl})^2}{2\sigma_{jl}^2} \right) \quad (l=1, 2, \ldots, L)    (10)

where l is the class number, L is the number of classes, N_l(x) is the discrimination function of the l-th class with respect to x, σ_{jl} is the mean square error of x_j in Class l, and μ_{jl} is the mean of x_j in Class l. Eq. 10 is the so-called naïve Bayesian discrimination function.

Once Eq. 10 is created, any sample given by Eq. 1 or Eq. 3 can be substituted into Eq. 10 to obtain L values: N1, N2, ..., NL. If N_{l_b} = \max_{1 \le l \le L} N_l, then

    y = l_b    (11)

for this sample. Eq. 11 is the so-called nonlinear function y=NBAY(x1, x2, ..., xm).
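The following sketch (added here, not the authors' code) implements Eq. 10 and the class pick of Eq. 11 directly in NumPy; the small variance floor is an assumption added to guard against σ_jl = 0, which occurs in Case study 3:

```python
# Minimal sketch of the naive Bayesian discrimination of Eqs. 10-11.
import numpy as np

def nbay_fit(X, y):
    """Per-class means mu_jl and spreads sigma_jl of each variable x_j."""
    classes = np.unique(y)
    mu = np.array([X[y == l].mean(axis=0) for l in classes])    # mu_jl
    sigma = np.array([X[y == l].std(axis=0) for l in classes])  # sigma_jl
    return classes, mu, np.maximum(sigma, 1e-6)  # floor is an added assumption

def nbay_predict(X, classes, mu, sigma):
    """N_l(x) = prod_j (1/(sqrt(2 pi) sigma_jl)) exp(-(x_j - mu_jl)^2 /
    (2 sigma_jl^2)); Eq. 11 returns the class l_b maximizing N_l."""
    X = X[:, None, :]  # shape (n, 1, m) broadcast against (L, m)
    dens = np.exp(-(X - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return classes[np.argmax(dens.prod(axis=2), axis=1)]        # Eq. 11
```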
3) BAYSD

The BAYSD procedure has been widely applied since the 1990s, and even more widely in this century[1, 2, 11, 12]. The following introduces the BAYSD technique. The formula created by this technique is a set of nonlinear combinations with respect to the m parameters (x1, x2, ..., xm), plus two constant terms[1, 2]:

    B_l(x) = \ln(p_l) + c_{0l} + \sum_{j=1}^{m} c_{jl} x_j \quad (l=1, 2, \ldots, L)    (12)

where l is the class number, L is the number of classes, B_l(x) is the discrimination function of the l-th class with respect to x, c_{jl} is the coefficient of x_j in the l-th discrimination function, and p_l and c_{0l} are the two constant terms in the l-th discrimination function. The constants p_l, c_{0l}, c_{1l}, c_{2l}, ..., c_{ml} are deduced using Bayes' theorem and calculated by the successive Bayesian discrimination of BAYSD. Eq. 12 is the so-called Bayesian discrimination function. In rare cases an introduced x_k can be deleted from the Bayesian discrimination function, and in much rarer cases a deleted x_k can be introduced again. Therefore, Eq. 12 is usually solved in m iterations.

Once Eq. 12 is created, any sample given by Eq. 1 or Eq. 3 can be substituted into Eq. 12 to obtain L values: B1, B2, ..., BL. If B_{l_b} = \max_{1 \le l \le L} B_l, then

    y = l_b    (13)

for this sample. Eq. 13 is the so-called nonlinear function y=BAYSD(x1, x2, ..., xm).
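BAYSD's successive variable-introduction procedure is not spelled out in the paper; the sketch below (an assumption, not the authors' code) computes discriminant scores of exactly the form of Eq. 12 using the classical pooled-covariance Bayesian discriminant, with the successive bookkeeping omitted:

```python
# Sketch of a Bayesian discriminant of the form of Eq. 12:
# B_l(x) = ln(p_l) + c_0l + sum_j c_jl x_j, picked by Eq. 13.
import numpy as np

def baysd_like_fit(X, y):
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                                   # p_l
    mus = np.array([X[y == l].mean(axis=0) for l in classes])  # class means
    # pooled within-class scatter, divided by (n - L)
    n, m = X.shape
    scatter = np.zeros((m, m))
    for l, mu in zip(classes, mus):
        d = X[y == l] - mu
        scatter += d.T @ d
    inv = np.linalg.pinv(scatter / (n - len(classes)))
    coef = mus @ inv                                           # c_jl, one row per class
    c0 = -0.5 * np.einsum("lj,lj->l", coef, mus)               # c_0l
    return classes, np.log(priors) + c0, coef

def baysd_like_predict(X, classes, intercepts, coef):
    B = X @ coef.T + intercepts                                # Eq. 12
    return classes[np.argmax(B, axis=1)]                       # Eq. 13
```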
4) MRA

The MRA procedure has been widely applied since the 1970s, and the successive regression analysis, the most popular MRA technique, is still a very useful tool[1, 2, 13, 14]. The formula created by this technique is a linear combination with respect to the m parameters (x1, x2, ..., xm), plus a constant term, the so-called linear function y=MRA(x1, x2, ..., xm)[1, 2]:

    y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_m x_m    (14)

where the constants b0, b1, b2, ..., bm are deduced using regression criteria and calculated by the successive regression analysis of MRA. Eq. 14 is the so-called regression equation. In rare cases an introduced x_k can be deleted from the regression equation, and in much rarer cases a deleted x_k can be introduced again. Therefore, Eq. 14 is usually solved in m iterations.
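A minimal sketch of fitting Eq. 14 by ordinary least squares follows; the paper's successive (stepwise) variable selection is omitted, which is an assumption of this sketch:

```python
# Minimal least-squares sketch of the regression equation of Eq. 14.
import numpy as np

def mra_fit(X, y):
    """Return (b0, b1, ..., bm) for y = b0 + b1*x1 + ... + bm*xm."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend the constant column
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

def mra_predict(X, b):
    return b[0] + X @ b[1:]
```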
2.3 Dimensionality Reduction

Dimensionality reduction means reducing the number of dimensions of a data space as far as possible while leaving the results of the studied problem unchanged. Its benefits are threefold: reducing the amount of data raises calculation speed, reducing the number of independent variables widens the range of applicability, and reducing the misclassification ratio of prediction samples raises processing quality. Among the four algorithms, MRA and BAYSD can each serve as a promising dimension-reduction tool, because both give the dependence of the predicted value (y) on the independent variables (x1, x2, ..., xm) in decreasing order. However, because MRA performs linear correlation analysis whereas BAYSD performs nonlinear correlation analysis, BAYSD is the preferable tool in applications; MRA is applicable only when the studied problem is linear. "Promising tool" means that whether the reduction succeeds must be validated with a high-class nonlinear tool (e.g., C-SVM), so as to determine how many independent variables can be removed; a sketch of this procedure is given below. For instance, Case study 1 below can be reduced from a 5-D problem (x1, x2, x3, x4, y) to a 2-D problem (x3, y) by using BAYSD and C-SVM, whereas Case study 3 below cannot even be reduced from a 9-D problem (x1, x2, x3, x4, x5, x6, x7, x8, y) to an 8-D problem (x2, x3, x4, x5, x6, x7, x8, y).
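A sketch of this reduce-and-validate loop follows (added here, not the authors' code); run_csvm and total_r are hypothetical helper names standing in for the algorithms of Section 2.2 and the R(%) of Eq. 7:

```python
# Sketch of the dimension-reduction procedure described above: variables
# are dropped one at a time, least important first (from BAYSD's order),
# and a drop is kept only if a high-class nonlinear tool (here C-SVM)
# still reproduces the baseline R(%). run_csvm and total_r are
# hypothetical helpers, not functions from the paper.
def reduce_dimensions(X, y, importance_order, run_csvm, total_r):
    """importance_order: variable indices in decreasing importance."""
    kept = list(importance_order)
    baseline = total_r(run_csvm(X[:, kept], y), y)   # R(%) with all variables
    for var in reversed(importance_order):           # least important first
        trial = [v for v in kept if v != var]
        if not trial:
            break                                    # keep at least one variable
        if total_r(run_csvm(X[:, trial], y), y) <= baseline:
            kept = trial                             # deletion kept: R(%) not worsened
    return kept
```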
3 CASE STUDY 1: THE PREDICTION OF MAJOR LITHOLOGY TYPES

TABLE 2 INPUT DATA FOR MAJOR LITHOLOGY CLASSIFICATION (MLC) OF CHISHUI REGION OF GUIZHOU PROVINCE IN CHINA (Modified from [15])

| Sample type | Sample No. | x1 GR (API) a | x2 AC (μs/m) | x3 DEN (g/cm³) | x4 CNL (%) | y* MLC b |
|---|---|---|---|---|---|---|
| Learning samples | 1 | 13 | 48 | 2.73 | 6 | 1 |
| | 2 | 20 | 47 | 2.76 | 0 | 1 |
| | 3 | 8 | 47 | 2.69 | 1 | 1 |
| | 4 | 12 | 47 | 2.87 | 5 | 2 |
| | 5 | 23 | 50 | 2.85 | 12 | 2 |
| | 6 | 24 | 49 | 2.88 | 6 | 2 |
| | 7 | 4 | 52 | 3.05 | 0 | 3 |
| | 8 | 5 | 55 | 3.02 | 0 | 3 |
| | 9 | 7 | 45 | 2.95 | 1 | 3 |
| Prediction samples | 10 | 16 | 47 | 2.77 | 0 | (1) |
| | 11 | 30 | 48 | 2.89 | 6 | (2) |
| | 12 | 5 | 53 | 2.97 | 0 | (3) |

a x1 = natural gamma ray (GR), x2 = acoustic time (AC), x3 = density (DEN), x4 = compensated neutron log (CNL); the four parameters are related to y.
b y* = MLC = major lithology classification (1–limestone, 2–dolomite, 3–anhydrite) determined by experiment; a number in parentheses is not input data, but is used for calculating R(%)_i.
The objective of this case study is to conduct major lithology classification (MLC) using conventional well-logging data, which has practical value when experimental data are limited. Using data of 12 samples from the Chishui Region of Guizhou Province in China, each containing 4 independent variables (x1 = GR, x2 = AC, x3 = DEN, x4 = CNL) and an experimental result (y* = MLC), Xia et al. (2003) adopted rough sets and a neural network for the prediction of MLC[15]. In this case study, among the 12 samples, 9 are taken as learning samples and 3 as prediction samples for the prediction of MLC (Table 2) by C-SVM, NBAY and BAYSD.
3.1 Learning Process

Using the 9 learning samples (Table 2), the following four functions of MLC (y) with respect to the 4 independent variables (x1, x2, x3, x4) have been constructed by C-SVM, NBAY, BAYSD and MRA.

Using C-SVM, the result is an explicit nonlinear function corresponding to Eq. 8:

    y = C-SVM(x_1, x_2, x_3, x_4)    (15)

with C=32768, γ=0.000031, 7 free vectors x_i, and a cross-validation accuracy CVA=88.9%.

Using NBAY, the result is an explicit nonlinear discrimination function corresponding to Eq. 10:

    N_l(x) = \prod_{j=1}^{4} \frac{1}{\sqrt{2\pi}\,\sigma_{jl}} \exp\!\left( -\frac{(x_j - \mu_{jl})^2}{2\sigma_{jl}^2} \right) \quad (l=1, 2, 3)    (16)

where for l=1, σ_j1=4.92, 0.471, 0.029, 2.63 and μ_j1=13.7, 47.3, 2.73, 2.33; for l=2, σ_j2=5.44, 1.25, 0.012, 3.09 and μ_j2=19.7, 48.7, 2.87, 7.67; for l=3, σ_j3=1.25, 4.19, 0.042, 0.471 and μ_j3=5.33, 50.7, 3.01, 0.333.

Using BAYSD, the result is an explicit nonlinear discrimination function corresponding to Eq. 12:

    B_1(x) = \ln(0.333) - 4733 + 6.75x_1 + 25.5x_2 + 3929x_3 + 20.7x_4
    B_2(x) = \ln(0.333) - 5339 + 7.05x_1 + 27.3x_2 + 4177x_3 + 22.6x_4    (17)
    B_3(x) = \ln(0.333) - 5821 + 7.83x_1 + 28.4x_2 + 6362x_3 + 22.9x_4

From the successive process, MLC (y) is shown to depend on the 4 independent variables in the following decreasing order: x3, x4, x2, x1.

Though MRA is a regression algorithm rather than a classification algorithm, MRA can provide the nonlinearity degree of the studied problem, and thus it is required to run MRA. Using MRA, the result is an explicit linear function corresponding to Eq. 14:

    y = -15.9 - 0.0196x_1 - 0.0475x_2 + 7.11x_3 + 0.0317x_4    (18)

Equation 18 yields a residual variance of 0.0337 and a multiple correlation coefficient of 0.983. From the regression process, MLC (y) is shown to depend on the 4 independent variables in the following decreasing order: x3, x2, x1, x4.
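As a quick sanity check of Eq. 18 (added here, not part of the paper), substituting learning sample 7 of Table 2 reproduces the MRA value reported in Table 3 up to rounding of the printed coefficients:

```python
# Evaluate Eq. 18 at learning sample 7 of Table 2: x = (4, 52, 3.05, 0), y* = 3.
x1, x2, x3, x4 = 4, 52, 3.05, 0
y = -15.9 - 0.0196 * x1 - 0.0475 * x2 + 7.11 * x3 + 0.0317 * x4
print(round(y, 2))  # ~3.24, matching the MRA value 3.22 in Table 3 up to rounding
```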
3.2 Prediction Process

Substituting the values of the 4 independent variables (x1, x2, x3, x4) of the 9 learning samples and the 3 prediction samples (Table 2) into Eq. 15, Eq. 16 (then Eq. 11), Eq. 17 (then Eq. 13) and Eq. 18, respectively, the MLC (y) of each sample is obtained (Table 3).

TABLE 3 PREDICTION RESULTS FROM MAJOR LITHOLOGY CLASSIFICATION (MLC) OF CHISHUI REGION OF GUIZHOU PROVINCE IN CHINA

| Sample type | Sample No. | y* a | C-SVM y | C-SVM R(%) | NBAY y | NBAY R(%) | BAYSD y | BAYSD R(%) | MRA y | MRA R(%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Learning samples | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1.15 | 14.9 |
| | 2 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1.08 | 8.30 |
| | 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0.852 | 14.8 |
| | 4 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2.18 | 9.00 |
| | 5 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 1.90 | 4.91 |
| | 6 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 1.95 | 2.35 |
| | 7 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3.22 | 7.36 |
| | 8 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2.85 | 5.15 |
| | 9 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2.82 | 6.17 |
| Prediction samples | 10 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1.23 | 23.2 |
| | 11 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 1.95 | 2.29 |
| | 12 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2.59 | 13.8 |

a y* = MLC = major lithology classification (1–limestone, 2–dolomite, 3–anhydrite) determined by experiment.
TABLE 4 COMPARISON AMONG THE APPLICATIONS OF CLASSIFICATION ALGORITHMS (C-SVM, NBAY AND BAYSD) TO MAJOR LITHOLOGY CLASSIFICATION (MLC) OF CHISHUI REGION OF GUIZHOU PROVINCE IN CHINA

| Algorithm | Fitting formula | R1(%) | R2(%) | R(%) | Dependence of y on (x1, x2, x3, x4), in decreasing order | Time on PC (Intel Core 2) | Problem nonlinearity | Solution accuracy | Results availability |
|---|---|---|---|---|---|---|---|---|---|
| C-SVM | Nonlinear, explicit | 0 | 0 | 0 | N/A | 5 s | N/A | High | Applicable |
| NBAY | Nonlinear, explicit | 0 | 0 | 0 | N/A | <1 s | N/A | High | Applicable |
| BAYSD | Nonlinear, explicit | 0 | 0 | 0 | x3, x4, x2, x1 | 1 s | N/A | High | Applicable |
| MRA | Linear, explicit | 8.11 | 13.1 | 9.36 | x3, x2, x1, x4 | <1 s | Weak | N/A | N/A |
From Table 4 and based on Table 1, it is shown that a) the nonlinearity degree of this studied problem is weak, since the R(%) of MRA is 9.36; b) the solution accuracies of C-SVM, NBAY and BAYSD are high, since their R(%) values are all 0; and c) the results of C-SVM, NBAY and BAYSD are all applicable, since their R(%) values are all 0.
3.3 Dimension Reduction Succeeded from a 5-D to a 2-D Problem by Using BAYSD and C-SVM

BAYSD gives the dependence of the predicted value (y) on the 4 independent variables in the decreasing order x3, x4, x2, x1 (Table 4). Following this order, x1 was deleted first and C-SVM was rerun; the results of C-SVM were unchanged, i.e., R(%)=0, so the 5-D problem (x1, x2, x3, x4, y) can be reduced to a 4-D problem (x2, x3, x4, y). In the same way, the 4-D problem can be reduced to a 3-D problem by deleting x2, and the 3-D problem to a 2-D problem by deleting x4. Therefore, the 5-D problem (x1, x2, x3, x4, y) can finally be reduced to a 2-D problem (x3, y).
4 CASE STUDY 2: THE PREDICTION OF OIL LAYER

The objective of this case study is to conduct oil layer classification in clayey sandstone using conventional well-logging data, which has practical value when oil test data are limited. From Table 5 and based on Table 1, it is shown that a) the nonlinearity degree of this studied problem is weak, since the R(%) of MRA is 5.04; b) the solution accuracies of C-SVM, NBAY and BAYSD are high, since their R(%) values are all 0; and c) the results of C-SVM, NBAY and BAYSD are all applicable, since their R(%) values are all 0.

TABLE 5 COMPARISON AMONG THE APPLICATIONS OF CLASSIFICATION ALGORITHMS (C-SVM, NBAY AND BAYSD) TO OIL LAYER CLASSIFICATION OF LOWER H3 FORMATION IN XIA'ERMEN OILFIELD OF HENAN OIL PROVINCE IN CENTRAL CHINA (Modified from [2])

| Algorithm | Fitting formula | R1(%) | R2(%) | R(%) | Dependence of y on (x1, x2, ..., x8), in decreasing order | Time on PC (Intel Core 2) | Problem nonlinearity | Solution accuracy | Results availability |
|---|---|---|---|---|---|---|---|---|---|
| C-SVM | Nonlinear, explicit | 0 | 0 | 0 | N/A | 5 s | N/A | High | Applicable |
| NBAY | Nonlinear, explicit | 0 | 0 | 0 | N/A | <1 s | N/A | High | Applicable |
| BAYSD | Nonlinear, explicit | 0 | 0 | 0 | x5, x8, x6, x1, x3, x2, x4, x7 | 1 s | N/A | High | Applicable |
| MRA | Linear, explicit | 1.68 | 15.11 | 5.04 | x5, x8, x6, x1, x4, x2, x3, x7 | <1 s | Weak | N/A | N/A |
TABLE 6 INPUT DATA FOR HYDROCARBON LAYER CLASSIFICATION (HLC) OF TARIM BASIN IN WESTERN CHINA (Modified from [16])

| Sample type | Sample No. | Well No. | Interval (m) | x1 H (m) a | x2 RT (Ω·m) | x3 RXO (Ω·m) | x4 SP (mV) | x5 GR (API) | x6 AC (μs/m) | x7 CNL (%) | x8 DEN (g/cm³) | y* HLC b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Learning samples | 1 | Lu9 | 1031.3–1034 | 2.7 | 8 | 8.6 | −56.9 | 58.9 | 108 | 27 | 2.11 | 1 |
| | 2 | Lu9 | 1037–1042 | 5.1 | 5 | 6.1 | −57.6 | 60 | 105 | 35 | 2.15 | 4 |
| | 3 | Lu9 | 1122.5–1127 | 4.6 | 5 | 7 | −53.7 | 55 | 106 | 32.9 | 2.17 | 4 |
| | 4 | Lu9 | 1186–1192 | 6.1 | 9 | 9.4 | −54.4 | 61.6 | 104.2 | 31.8 | 2.16 | 3 |
| | 5 | Lu9 | 1233–1237 | 4.1 | 7 | 8.3 | −59.1 | 64 | 106 | 34.3 | 2.13 | 3 |
| | 6 | Lu9 | 1295–1299.1 | 4.1 | 5 | 7.2 | −58.4 | 64.9 | 102 | 32.4 | 2.16 | 4 |
| | 7 | Lu9 | 1323.1–1328 | 5 | 7 | 9 | −58 | 67.4 | 104 | 33.1 | 2.14 | 2 |
| | 8 | Lu9 | 1415–1418 | 3.1 | 11 | 8.3 | −58 | 63.9 | 99.9 | 33.1 | 2.15 | 2 |
| | 9 | Lu9 | 1424–1426 | 2.1 | 5 | 8.3 | −58.2 | 67.4 | 102 | 32.3 | 2.14 | 2 |
| | 10 | Lu9 | 1434–1437 | 2.8 | 5 | 7.5 | −60.1 | 67.4 | 102 | 32.3 | 2.14 | 4 |
| | 11 | Lu101 | 1180–1187.7 | 6.8 | 6 | 5.2 | −20.2 | 62 | 106 | 32.2 | 2.12 | 3 |
| | 12 | Lu101 | 1273–1275 | 2.1 | 7 | 11 | −22.6 | 65 | 94.9 | 30.4 | 2.22 | 3 |
| | 13 | Lu101 | 1409–1413 | 4.8 | 7 | 5.9 | −24.5 | 68 | 98 | 30.9 | 2.18 | 2 |
| | 14 | Lu102 | 1216–1219 | 3.1 | 5 | 4.6 | −5.4 | 64.6 | 102 | 34.1 | 2.16 | 3 |
| | 15 | Lu102 | 1400–1403.3 | 3.3 | 5 | 5.4 | −7.8 | 69.3 | 105 | 32.7 | 2.14 | 3 |
| | 16 | Lu103 | 1367–1270 | 3.1 | 9 | 11 | −11.2 | 65.4 | 103 | 33.9 | 2.13 | 3 |
| | 17 | Lu103 | 1352–1356 | 4.1 | 13 | 15 | −16 | 70.5 | 96.9 | 30.9 | 2.2 | 3 |
| | 18 | Lu104 | 1002–1010 | 8 | 5 | 14 | −20.9 | 65.8 | 105 | 33.4 | 2.16 | 4 |
| | 19 | Lu104 | 1192.2–1194 | 1.9 | 5 | 14 | −20.9 | 65.8 | 105 | 33.4 | 2.16 | 4 |
| | 20 | Lu104 | 1202–1208 | 6.6 | 6 | 6 | −17.9 | 82.1 | 98 | 32.1 | 2.3 | 3 |
| | 21 | Lu104 | 1214–1218.9 | 5 | 5 | 5 | −20.3 | 60.9 | 104 | 33.7 | 2.13 | 4 |
| | 22 | Lu108 | 1030–1036 | 6.1 | 5 | 6.7 | −49.5 | 64 | 107 | 35.3 | 2.15 | 4 |
| | 23 | Lu108 | 1122–1126 | 4.1 | 5 | 7 | −49.8 | 69.2 | 108 | 34.4 | 2.16 | 4 |
| | 24 | Lu108 | 1215–1217 | 2.1 | 6 | 7.9 | −52.4 | 77.2 | 103 | 34.8 | 2.17 | 4 |
| | 25 | Lu109 | 1346–1350 | 3.4 | 8 | 8.6 | −25.6 | 65 | 97.5 | 31.9 | 2.18 | 3 |
| | 26 | Lu111 | 1166.8–1172 | 5.2 | 11 | 13 | −74.4 | 61.1 | 111 | 31.4 | 2.14 | 2 |
| | 27 | Lu111 | 1197–1202 | 5.1 | 8 | 9.7 | −72.4 | 64.1 | 120 | 30.7 | 2.12 | 3 |
| | 28 | Lu111 | 1276–1278 | 1.9 | 7 | 9.3 | −71.1 | 64.2 | 111 | 31.1 | 2.12 | 3 |
| | 29 | Lu112 | 1133–1136 | 3.1 | 6 | 6.2 | −14.8 | 77.4 | 102 | 32.3 | 2.2 | 4 |
| | 30 | Lu112 | 1225–1228 | 2.1 | 5 | 6.2 | −18.7 | 69.5 | 104 | 33.5 | 2.14 | 4 |
| | 31 | Lu113 | 1264.5–1266 | 1.5 | 6 | 6.8 | −30.5 | 74.1 | 99.2 | 31.4 | 2.23 | 4 |
| | 32 | Lu113 | 1334–1335 | 1.1 | 7 | 7.2 | −43.1 | 67.2 | 102 | 33.4 | 2.19 | 2 |
| | 33 | Lu113 | 1333.8–1340 | 2.1 | 6 | 7.3 | −48.2 | 64.6 | 97.9 | 30.6 | 2.19 | 4 |
| Prediction samples | 34 | Lu102 | 1164–1168 | 4.1 | 7 | 4.9 | −5.2 | 64.2 | 106 | 33 | 2.12 | (3) |

a x1 = thickness (H), x2 = true formation resistivity (RT), x3 = micro-spherically focused log (RXO), x4 = spontaneous potential (SP), x5 = natural gamma ray (GR), x6 = acoustic time (AC), x7 = compensated neutron log (CNL), x8 = density (DEN); the eight parameters are related to y.
b y* = HLC = hydrocarbon layer classification (1–gas layer, 2–oil layer, 3–water/oil layer, 4–water layer) determined by the well test; a number in parentheses is not input data, but is used for calculating R(%)_i.
5 CASE STUDY 3: THE PREDICTION OF HYDROCARBON LAYER

The objective of this case study is to conduct hydrocarbon layer classification (HLC) using conventional well-logging data, which has practical value when well test data are limited. Using data of 34 samples from the Tarim Basin in western China, each containing 8 independent variables (x1 = H, x2 = RT, x3 = RXO, x4 = SP, x5 = GR, x6 = AC, x7 = CNL, x8 = DEN) and a well test result (y* = HLC), Wang et al. (2007) adopted fuzzy clustering for the prediction of HLC[16]. In this case study, among the 34 samples, 33 are taken as learning samples and one as the prediction sample for the prediction of HLC (Table 6) by C-SVM, NBAY and BAYSD.
5.1 Learning Process

Using the 33 learning samples (Table 6), the following four functions of HLC (y) with respect to the 8 independent variables (x1, x2, ..., x8) have been constructed by C-SVM, NBAY, BAYSD and MRA.

Using C-SVM, the result is an explicit nonlinear function corresponding to Eq. 8:

    y = C-SVM(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8)    (19)

with C=8, γ=2, 33 free vectors x_i, and a cross-validation accuracy CVA=54.5%.

Using NBAY, the result is an explicit nonlinear discrimination function corresponding to Eq. 10:

    N_l(x) = \prod_{j=1}^{8} \frac{1}{\sqrt{2\pi}\,\sigma_{jl}} \exp\!\left( -\frac{(x_j - \mu_{jl})^2}{2\sigma_{jl}^2} \right) \quad (l=1, 2, 3, 4)    (20)

where for l=1, σ_j1=0, 0, 0, 0, 0, 0, 0, 0 and μ_j1=2.7, 8, 8.6, −56.9, 58.9, 108, 27, 2.11; for l=2, σ_j2=1.57, 2.24, 2.2, 15.5, 2.5, 4.11, 0.934, 0.021 and μ_j2=3.55, 8, 8.62, −52.7, 65.8, 103, 32.4, 2.16; for l=3, σ_j3=1.59, 2.1, 2.88, 23.9, 5.31, 6.64, 1.29, 0.052 and μ_j3=4.14, 7.5, 8.63, −32, 66.5, 104, 32.2, 2.17; for l=4, σ_j4=1.81, 0.452, 2.63, 16.8, 6.13, 2.71, 1.32, 0.026 and μ_j4=3.76, 5.29, 7.78, −39.7, 66.8, 104, 33.2, 2.17.

Using BAYSD, the result is an explicit nonlinear discrimination function corresponding to Eq. 12:

    B_1(x) = \ln(0.030) - 7066 + 25.7x_1 + 11.7x_2 + 1.225x_3 + 1.43x_4 + 13.5x_5 + 28.1x_6 + 47.2x_7 + 5064x_8
    B_2(x) = \ln(0.182) - 7315 + 25.6x_1 + 11.8x_2 + 0.833x_3 + 1.34x_4 + 13.4x_5 + 27.8x_6 + 51.5x_7 + 5128x_8    (21)
    B_3(x) = \ln(0.364) - 7372 + 25.6x_1 + 11.8x_2 + 0.826x_3 + 1.43x_4 + 13.5x_5 + 28.1x_6 + 51.2x_7 + 5148x_8
    B_4(x) = \ln(0.424) - 7393 + 25.6x_1 + 10.7x_2 + 0.503x_3 + 1.37x_4 + 13.4x_5 + 28.0x_6 + 51.9x_7 + 5156x_8

From the successive process, HLC (y) is shown to depend on the 8 independent variables in the following decreasing order: x7, x2, x8, x4, x6, x3, x5, x1.

Though MRA is a regression algorithm rather than a classification algorithm, MRA can provide the nonlinearity degree of the studied problem, and thus it is required to run MRA. Using MRA, the result is an explicit linear function corresponding to Eq. 14:

    y = -18.1 - 0.012x_1 - 0.215x_2 + 0.074x_3 + 0.00311x_4 - 0.0106x_5 + 0.0223x_6 + 0.184x_7 + 6.81x_8    (22)

Equation 22 yields a residual variance of 0.52 and a multiple correlation coefficient of 0.693. From the regression process, HLC (y) is shown to depend on the 8 independent variables in the following decreasing order: x2, x7, x3, x8, x6, x4, x5, x1.
5.2 Prediction Process

Substituting the values of the 8 independent variables (x1, x2, ..., x8) of the 33 learning samples and the 1 prediction sample (Table 6) into Eq. 19, Eq. 20 (then Eq. 11), Eq. 21 (then Eq. 13) and Eq. 22, respectively, the HLC (y) of each sample is obtained (Table 7). From Table 8 and based on Table 1, it is shown that a) the nonlinearity degree of this studied problem is strong, since the R(%) of MRA is 18.3; b) the solution accuracies of C-SVM, NBAY and BAYSD are high, low and low, since their R(%) values are 0, 19.4 and 12.3, respectively; and c) the results of C-SVM, NBAY and BAYSD are applicable, inapplicable and inapplicable, since their R(%) values are 0, 19.4 and 12.3, respectively.
5.3 Dimension Reduction Failed by Using BAYSD and C-SVM

BAYSD gives the dependence of the predicted value (y) on the 8 independent variables in the decreasing order x7, x2, x8, x4, x6, x3, x5, x1 (Table 8). Following this order, x1 was deleted first and C-SVM was rerun; the results of C-SVM changed, with R(%)=3.68, which is greater than the previous R(%)=0 (Table 8). Thus the 9-D problem (x1, x2, ..., x8, y) cannot be reduced to an 8-D problem (x2, x3, ..., x8, y), showing that the expression of y needs all of x1, x2, ..., x8. In general, dimension reduction succeeds for high-dimensional problems[1, 4, 7].
TABLE 7 PREDICTION RESULTS FROM HYDROCARBON LAYER CLASSIFICATION (HLC) OF TARIM BASIN IN WESTERN CHINA

| Sample type | Sample No. | y* a | C-SVM y | C-SVM R(%) | NBAY y | NBAY R(%) | BAYSD y | BAYSD R(%) | MRA y | MRA R(%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Learning samples | 1 | 1 | 1 | 0 | 3 | 200 | 1 | 0 | 1.69 | 68.9 |
| | 2 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.79 | 5.36 |
| | 3 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.69 | 7.64 |
| | 4 | 3 | 3 | 0 | 2 | 33.3 | 3 | 0 | 2.61 | 13.0 |
| | 5 | 3 | 3 | 0 | 2 | 33.3 | 4 | 33.3 | 3.24 | 8.01 |
| | 6 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.35 | 16.3 |
| | 7 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 3.05 | 52.6 |
| | 8 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2.18 | 8.80 |
| | 9 | 2 | 2 | 0 | 2 | 0 | 4 | 100 | 3.27 | 63.6 |
| | 10 | 4 | 4 | 0 | 2 | 50 | 4 | 0 | 3.20 | 20.0 |
| | 11 | 3 | 3 | 0 | 4 | 33.3 | 3 | 0 | 2.88 | 3.97 |
| | 12 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3.21 | 7.14 |
| | 13 | 2 | 2 | 0 | 3 | 50 | 3 | 50 | 2.66 | 32.8 |
| | 14 | 3 | 3 | 0 | 4 | 33.3 | 3 | 0 | 3.65 | 21.6 |
| | 15 | 3 | 3 | 0 | 4 | 33.3 | 3 | 0 | 3.32 | 10.7 |
| | 16 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3.02 | 0.520 |
| | 17 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2.16 | 28.0 |
| | 18 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4.16 | 4.03 |
| | 19 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4.23 | 5.86 |
| | 20 | 3 | 3 | 0 | 3 | 0 | 4 | 33.3 | 3.77 | 25.6 |
| | 21 | 4 | 4 | 0 | 4 | 0 | 3 | 25 | 3.41 | 14.7 |
| | 22 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.90 | 2.49 |
| | 23 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.82 | 4.61 |
| | 24 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.63 | 9.28 |
| | 25 | 3 | 3 | 0 | 2 | 33.3 | 3 | 0 | 2.86 | 4.71 |
| | 26 | 2 | 2 | 0 | 3 | 50 | 3 | 50 | 2.34 | 17.2 |
| | 27 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2.66 | 11.5 |
| | 28 | 3 | 3 | 0 | 2 | 33.3 | 3 | 0 | 2.76 | 8.16 |
| | 29 | 4 | 4 | 0 | 4 | 0 | 3 | 25 | 3.33 | 16.8 |
| | 30 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3.48 | 12.9 |
| | 31 | 4 | 4 | 0 | 3 | 25 | 4 | 0 | 3.35 | 16.2 |
| | 32 | 2 | 2 | 0 | 2 | 0 | 4 | 100 | 3.36 | 68.2 |
| | 33 | 4 | 4 | 0 | 2 | 50 | 4 | 0 | 2.98 | 25.5 |
| Prediction samples | 34 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2.85 | 5.11 |

a y* = HLC = hydrocarbon layer classification (1–gas layer, 2–oil layer, 3–water/oil layer, 4–water layer) determined by the well test.
TABLE 8 COMPARISON AMONG THE APPLICATIONS OF CLASSIFICATION ALGORITHMS (C-SVM, NBAY AND BAYSD) TO HYDROCARBON LAYER CLASSIFICATION (HLC) OF TARIM BASIN IN WESTERN CHINA

| Algorithm | Fitting formula | R1(%) | R2(%) | R(%) | Dependence of y on (x1, x2, ..., x8), in decreasing order | Time on PC (Intel Core 2) | Problem nonlinearity | Solution accuracy | Results availability |
|---|---|---|---|---|---|---|---|---|---|
| C-SVM | Nonlinear, explicit | 0 | 0 | 0 | N/A | 5 s | N/A | High | Applicable |
| NBAY | Nonlinear, explicit | 20.0 | 0 | 19.4 | N/A | <1 s | N/A | Low | Inapplicable |
| BAYSD | Nonlinear, explicit | 12.6 | 0 | 12.3 | x7, x2, x8, x4, x6, x3, x5, x1 | 1 s | N/A | Low | Inapplicable |
| MRA | Linear, explicit | 18.7 | 5.11 | 18.3 | x2, x7, x3, x8, x6, x4, x5, x1 | <1 s | Strong | N/A | N/A |
6 CASE STUDY 4: THE PREDICTION OF FRACTURE

The objective of this case study is to predict fractures using conventional well-logging data, which has practical value when imaging log and core sample data are limited. Using data of 29 learning samples and 4 prediction samples from Wells An1 and An2 in the Anpeng Oilfield at the southeast of the Biyang Sag in Nanxiang Basin in central China, each sample containing 7 parameters (x1 = AC, x2 = CND, x3 = CNP, x4 = RXO, x5 = RD, x6 = RS, x7 = absolute difference of RD and RS) and the imaging log result (y* = fracture identification: fracture, non-fracture), see Table 3.10 in [1], [2] adopted C-SVM, NBAY and BAYSD; the results related to the algorithm comparison are listed in Table 9.

TABLE 9 COMPARISON AMONG THE APPLICATIONS OF CLASSIFICATION ALGORITHMS (C-SVM, NBAY AND BAYSD) TO FRACTURE PREDICTION OF WELLS AN1 AND AN2 IN THE ANPENG OILFIELD AT THE SOUTHEAST OF THE BIYANG SAG IN NANXIANG BASIN IN CENTRAL CHINA (Modified from [2])

| Algorithm | Fitting formula | R1(%) | R2(%) | R(%) | Dependence of y on (x1, x2, ..., x7), in decreasing order | Time on PC (Intel Core 2) | Problem nonlinearity | Solution accuracy | Results availability |
|---|---|---|---|---|---|---|---|---|---|
| C-SVM | Nonlinear, explicit | 0 | 0 | 0 | N/A | 5 s | N/A | High | Applicable |
| NBAY | Nonlinear, explicit | 12.07 | 0 | 10.6 | N/A | <1 s | N/A | Low | Inapplicable |
| BAYSD | Nonlinear, explicit | 0 | 0 | 0 | x5, x2, x4, x7, x3, x6, x1 | 1 s | N/A | High | Applicable |
| MRA | Linear, explicit | 16.2 | 22.79 | 17 | x5, x2, x4, x7, x3, x6, x1 | <1 s | Strong | N/A | N/A |
From Table 9 and based on Table 1, it is shown that a) the nonlinearity degree of this studied problem is strong, since the R(%) of MRA is 17; b) the solution accuracies of C-SVM, NBAY and BAYSD are high, low and high, since their R(%) values are 0, 10.6 and 0, respectively; and c) the results of C-SVM, NBAY and BAYSD are applicable, inapplicable and applicable, since their R(%) values are 0, 10.6 and 0, respectively.
7 SUMMARY OF THE FOUR CASE STUDIES

TABLE 10 SUMMARY OF THE FOUR CASE STUDIES

| Case Study No. | Problem nonlinearity | Algorithm | Total mean absolute relative residual R(%) | Dependence of y on (x1, x2, ..., xm), in decreasing order | Time on PC (Intel Core 2) | Solution accuracy | Results availability | Dimensionality reduction |
|---|---|---|---|---|---|---|---|---|
| 1 | Weak | C-SVM | 0 | N/A | 5 s | High | Applicable | Succeeded, from 5-D to 2-D problem |
| | | NBAY | 0 | N/A | <1 s | High | Applicable | |
| | | BAYSD | 0 | x3, x4, x2, x1 | 1 s | High | Applicable | |
| | | MRA | 9.36 | x3, x2, x1, x4 | <1 s | N/A | N/A | |
| 2 | Weak | C-SVM | 0 | N/A | 5 s | High | Applicable | N/A |
| | | NBAY | 0 | N/A | <1 s | High | Applicable | |
| | | BAYSD | 0 | x5, x8, x6, x1, x3, x2, x4, x7 | 1 s | High | Applicable | |
| | | MRA | 5.04 | x5, x8, x6, x1, x4, x2, x3, x7 | <1 s | N/A | N/A | |
| 3 | Strong | C-SVM | 0 | N/A | 5 s | High | Applicable | Failed |
| | | NBAY | 19.4 | N/A | <1 s | Low | Inapplicable | |
| | | BAYSD | 12.3 | x7, x2, x8, x4, x6, x3, x5, x1 | 1 s | Low | Inapplicable | |
| | | MRA | 18.3 | x2, x7, x3, x8, x6, x4, x5, x1 | <1 s | N/A | N/A | |
| 4 | Strong | C-SVM | 0 | N/A | 5 s | High | Applicable | N/A |
| | | NBAY | 10.6 | N/A | <1 s | Low | Inapplicable | |
| | | BAYSD | 0 | x5, x2, x4, x7, x3, x6, x1 | 1 s | High | Applicable | |
| | | MRA | 17 | x5, x2, x4, x7, x3, x6, x1 | <1 s | N/A | N/A | |
From Tables 4, 5, 8 and 9, Table 10 summarizes the important results of each algorithm in the four case studies. Compared with C-SVM, the major advantages of BAYSD are (Table 10): a) BAYSD runs much faster than C-SVM; b) the BAYSD program is easy to code whereas the C-SVM program is very complicated to code; and c) BAYSD can serve as a promising dimension-reduction tool. So BAYSD is better than C-SVM when the nonlinearity of the studied problem is weak.
8 CONCLUSIONS

Through the aforementioned four case studies, five major conclusions can be drawn as follows:

1) the total mean absolute relative residual R(%) of MRA can be used to measure the nonlinearity degree of a studied problem, and thus MRA should be run first;

2) the proposed three criteria [nonlinearity degree of a studied problem based on the R(%) of MRA, solution accuracy of a given algorithm application based on its R(%), and results availability of a given algorithm application based on its R(%)] are practical;

3) if a classification problem has weak nonlinearity, in general, C-SVM, NBAY and BAYSD are all applicable, and BAYSD is better than C-SVM and NBAY;

4) NBAY cannot be applied to classification problems with strong nonlinearity, whereas BAYSD can sometimes be applied and C-SVM can always be applied;

5) BAYSD and C-SVM can be applied to dimensionality reduction.
ACKNOWLEDGMENT This work was supported by the Research Institute of Petroleum Exploration and Development (RIPED) and PetroChina.
REFERENCES

[1] Shi G. "Data Mining and Knowledge Discovery for Geoscientists." Elsevier Inc, USA, 2013
[2] Shi G. "Optimal prediction in petroleum geology by regression and classification methods." Sci J Inf Eng 5(2): 14-32, 2015
[3] Shi G. "The use of support vector machine for oil and gas identification in low-porosity and low-permeability reservoirs." Int J Math Model Numer Optimisa 1(1/2): 75-87, 2009
[4] Shi G, Yang X. "Optimization and data mining for fracture prediction in geosciences." Procedia Comput Sci 1(1): 1353-1360, 2010
[5] Chang C, Lin C. "LIBSVM: a library for support vector machines, Version 3.1." Retrieved from www.csie.ntu.edu.tw/~cjlin/libsvm, 2011
[6] Zhu Y, Shi G. "Identification of lithologic characteristics of volcanic rocks by support vector machine." Acta Petrolei Sinica 34(2): 312-322, 2013
[7] Shi G, Zhu Y, Mi S, Ma J, Wan J. "A big data mining in petroleum exploration and development." Adv Petrol Expl Devel 7(2): 1-8, 2014
[8] Ramoni M, Sebastiani P. "Robust Bayes classifiers." Artificial Intelligence 125(1-2): 207-224, 2001
[9] Tan P, Steinbach M, Kumar V. "Introduction to Data Mining." Pearson Education, Boston, MA, USA, 2005
[10] Han J, Kamber M. "Data Mining: Concepts and Techniques, 2nd Ed." Morgan Kaufmann, San Francisco, CA, USA, 2006
[11] Denison DGT, Holmes CC, Mallick BK, Smith AFM. "Bayesian Methods for Nonlinear Classification and Regression." John Wiley & Sons Inc, Chichester, England, UK, 2002
[12] Shi G. "Four classifiers used in data mining and knowledge discovery for petroleum exploration and development." Adv Petrol Expl Devel 2(2): 12-23, 2011
[13] Sharma MSR, O'Regan M, Baxter CDP, Moran K, Vaziri H, Narayanasamy R. "Empirical relationship between strength and geophysical properties for weakly cemented formations." J Petro Sci Eng 72(1-2): 134-142, 2010
[14] Singh J, Shaik B, Singh S, Agrawal VK, Khadikar PV, Deeb O, Supuran CT. "Comparative QSAR study on para-substituted aromatic sulphonamides as CAII inhibitors: information versus topological (distance-based and connectivity) indices." Chem Biol Drug Design 71: 244-259, 2008
[15] Xia K, Song J, Li C. "An approach to oil log data mining based on rough set & neural network." Information and Control 32(4): 300-303, 2003
[16] Wang Z, Wang X, Qu J, Fang H, Jian H. "Study and application of data mining technology in petroleum exploration and development." Petrol Ind Comp Appl 15(1): 17-20, 2007
AUTHORS

1. Guangren Shi was born in Shanghai, China in February, 1940. He is an Expert and Professor with the qualification of directing Ph.D. students. He graduated from Xi'an Jiaotong University, China in 1963, majoring in applied mathematics (1958–1963). Since 1963, he has been engaged in computer applications for petroleum exploration and development. In the recent 30 years, his research has covered two fields: basin modeling (petroleum system) and data mining for geosciences; in recent years, he has focused on the latter more than the former.

He has more than 51 years of professional career, working for the computer center, Daqing Oilfield, petroleum ministry, China, as Associate Engineer and Head of the software group (1963–1967); for the computer center, Shengli Oilfield, petroleum ministry, China, as Engineer and Director (1967–1978); for the computer center, petroleum ministry, China, as Engineer and Head of the software group (1978–1985); for the Aldridge Laboratory of Applied Geophysics, Columbia University, New York City, U.S.A., as Visiting Scholar (1985–1987); for the computer application technology research department, Research Institute of Petroleum Exploration and Development (RIPED), China National Petroleum Corporation (CNPC), China, as Professor and Director (1987–1997); for RIPED, PetroChina Company Limited (PetroChina), China, as Professor with the qualification of directing Ph.D. students and Deputy Chief Engineer (1997–2001); and for the department of experts in RIPED of PetroChina, China, as Expert and Professor with the qualification of directing Ph.D. students (2001–present).

He has published eight books, three of them in English: 1) Shi G. R. 2013. Data Mining and Knowledge Discovery for Geoscientists. Elsevier Inc, USA. 367 pp; 2) Shi G. R. 2005. Numerical Methods of Petroliferous Basin Modeling, 3rd edition. Petroleum Industry Press, Beijing, China. 338 pp, which was book-reviewed by Mathematical Geosciences in 2009; and 3) Shi G. R. 2000. Numerical Methods of Petroliferous Basin Modeling, 2nd edition. Petroleum Industry Press, Beijing, China. 233 pp, which was book-reviewed by Mathematical Geology in 2006. He has also published 75 articles, 17 of them in English, e.g. four articles indexed by SCI: 1) Shi G. R., Zhang Q. C., Yang X. S., Mi S. Y. 2010. Oil and gas assessment of the Kuqa Depression of Tarim Basin in western China by simple fluid flow models of primary and secondary migrations of hydrocarbons. Journal of Petroleum Science and Engineering, 75 (1-2): 77–90; 2) Shi G. R. 2009. A simplified dissolution-precipitation model of the smectite to illite transformation and its application. Journal of Geophysical Research-Solid Earth, 114, B10205, doi:10.1029/2009JB006406; 3) Shi G. R. 2008. Basin modeling in the Kuqa Depression of the Tarim Basin (Western China): A fully temperature-dependent model of overpressure history. Mathematical Geosciences, 40 (1): 47–62; and 4) Shi G. R., Zhou X. X., Zhang G. Y., Shi X. F., Li H. H. 2004. The use of artificial neural network analysis and multiple regression for trap quality evaluation: a case study of the Northern Kuqa Depression of Tarim Basin in western China. Marine and Petroleum Geology, 21 (3): 411–420.

Prof. Shi is a Member of the Society of Petroleum Engineers (International), a Member of the Chinese Association of Science and Technology, and a Member of the Petroleum Society of China. He is also Regional Editor (Asia) of the International Journal of Mathematical Modelling and Numerical Optimisation, and a Member of the Editorial Board of the Journal of Petroleum Science Research. He has received three honors: 1) A Person Studying Overseas and Returning with Excellent Contribution, appointed by the Ministry of Education of China (1991); 2) Special Government Allowance, awarded by the State Council of China (1994); and 3) Grand Award of Sun Yueqi Energy, awarded by the Ministry of Science-Technology of China (1997). He has also obtained four awards of Science-Technology Progress, of which one is a China National Award and three are from CNPC and PetroChina.

2. Jinshan Ma was born in Shandong Province, China in November, 1974. He graduated from Shandong University, China in 1998, majoring in computer sciences (1994–1998), and received a Bachelor of Science degree in computer and application. From 1998 to 2000, he was an assistant engineer at Shengli Oilfield, China, working for the China Petroleum and Chemical Corporation (Sinopec). He was a graduate student at the Research Institute of Petroleum Exploration and Development (RIPED), PetroChina (2000–2003), and received a Master of Engineering degree in mineral prospecting and exploration. Since 2003, he has been working for RIPED. His experiences with Shengli Oilfield and RIPED are primarily in petroleum data management, and he engages in petroleum data management and petroleum database design for PetroChina. He participated directly in the design of EPDM (exploration and production data model), an oil and gas E&P data model for PetroChina released in 2012 as an enterprise standard. He has participated in the publication of several papers on petroleum data management, database design, and data mining. He is now a Senior Engineer of information management.

3. Dan Ba was born in Heilongjiang Province, China in July, 1988. She graduated from Northeast Petroleum University, China in 2011, majoring in computer & information technology (2007–2011), and received a Bachelor of Engineering degree in computer science & technology. She then went abroad and studied at Queen Mary, University of London, UK; at the end of 2012, she received a Master of Science degree in intelligent web technology with a distinction honor. Her research fields are primarily information retrieval (semantic web) and data interchange. In July 2013, she was employed by the Research Institute of Petroleum Exploration and Development, PetroChina, China. Currently, she is an Assistant Engineer working on establishing a global petroleum resources database system and developing oil and gas resources evaluation software. She participated directly in the design of data mining tools (dedicated to textual data mining), a sub-project belonging to one of the important national science & technology specific projects. She has also participated in the publication of two papers, one in Earth Science Frontiers and another in China Petroleum Exploration.