Scientific Journal of Earth Science June 2015, Volume 5, Issue 2, PP.33-40
Data Mining in Petroleum Upstream —— The Use of Regression and Classification Algorithms

Dawei Li#, Guangren Shi
Research Institute of Petroleum Exploration & Development, PetroChina, Beijing 100083, China
Email: leedw@petrochina.com.cn
Abstract

Data mining (DM) techniques have seen enormous success in some fields, but their application to the petroleum upstream (PUP) is still at an initial stage, because PUP differs from other fields in many respects. The most popular DM algorithms in PUP are regression and classification. Through many DM applications to PUP, we have found that: a) the preferable algorithm for regression is the back-propagation neural network (BPNN), followed by the regression of support vector machine (R-SVM) and multiple regression analysis (MRA), and the preferable algorithm for classification is the classification of support vector machine (C-SVM), followed by Bayesian successive discrimination (BAYSD); b) C-SVM can also be applied in data cleaning; c) both MRA and BAYSD can also be applied in dimension-reduction, with BAYSD the preferable one; and d) R-mode cluster analysis (RCA) can be applied in dimension-reduction, while Q-mode cluster analysis (QCA) can be applied in sample-reduction. A case study in PUP indicates that R-SVM, BPNN and MRA are not applicable for regression in this case, whereas C-SVM is applicable for classification.

Keywords: Data Mining; Regression; Classification; Data Cleaning; Dimension-reduction; Sample-reduction; Oil Layer
1 INTRODUCTION

Petroleum upstream (PUP) entered the "big data" epoch years ago. For example, about 50 large information systems have been built in PetroChina, and just one of them holds about 590 TB of PUP data and the structured data of 280,000 wells. At present, the major applications of PUP data in these information systems are merely information storage and inquiry, which falls far short of fully utilizing the value of these data assets. Data mining (DM) is a preferable solution to this problem.

DM techniques have seen enormous success in some fields of business and science, but their application to PUP is still at an initial stage. This is because PUP differs from other fields in many respects, such as miscellaneous data types, huge data volumes, varying measurement precision, and considerable uncertainty in the mining results. DM can be divided into two classes: small data mining (SDM), which deals with at most several thousand samples, and big data mining (BDM), which deals with at least ten thousand samples. In PUP, seismic, remote sensing and well log data are potential applications of BDM; e.g. Zhu and Shi (2013) and Shi et al. (2014) used BDM on well logs[1][2]. The other data are potential applications of SDM; e.g. Shi (2013) used SDM in 36 case studies[3], and Shi (2015) used SDM in 8 case studies of petroleum geology[4].

In DM, the most popular algorithms are regression and classification. There are mainly three regression and five classification algorithms[3]: multiple regression analysis (MRA), the error back-propagation neural network (BPNN), the regression of support vector machine (R-SVM), the classification of support vector machine (C-SVM), decision trees (DTR), Naïve Bayesian (NBAY), Bayesian discrimination (BAYD), and Bayesian successive discrimination (BAYSD). These eight algorithms use the same known parameters and predict the same unknown; they differ only in method and in calculation results. MRA, BPNN and R-SVM are regression algorithms yielding real-number results, while C-SVM, DTR, NBAY, BAYD and BAYSD are classification algorithms yielding integer results. Among the eight algorithms, only MRA is linear whereas the other seven are nonlinear, because MRA constructs a linear function whereas the other seven construct nonlinear functions. Since the usage of DTR is quite complicated and BAYSD is superior to BAYD, these two classification algorithms (DTR and BAYD) are not introduced in this paper.
2 REGRESSION AND CLASSIFICATION ALGORITHMS

Assume that there are n learning samples, each associated with m+1 numbers (x1, x2, …, xm, y*) and a set of observed values (x1i, x2i, …, xmi, yi*), with i = 1, 2, …, n. In principle n > m, but in actual practice n >> m. The n samples associated with m+1 numbers are defined as n vectors:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}, y_i^*), \quad i = 1, 2, \ldots, n \qquad (1)$$

where n is the number of learning samples; m is the number of independent variables in the samples; xi is the i-th learning sample vector; xij is the value of the j-th independent variable in the i-th learning sample (j = 1, 2, …, m); and yi* is the observed value of the i-th learning sample. Equation (1) is the expression of the learning samples.

Let x0 be the general form of a vector (xi1, xi2, …, xim). The principles of MRA, BPNN, NBAY and BAYSD are the same: try to construct an expression y = y(x0) such that Equation (2) is minimized. Naturally, these algorithms use different approaches and achieve different calculation accuracies.

$$\sum_{i=1}^{n} \left[ y(\mathbf{x}_{0i}) - y_i^* \right]^2 \qquad (2)$$
where y(x0i) is the calculated value of the dependent variable in the i-th learning sample, and the other symbols have been defined in Equation (1).

The principle of the R-SVM and C-SVM algorithms, by contrast, is to construct an expression y = y(x0) that maximizes the margin based on the support vector points, so as to obtain the optimal separating hyperplane. The y = y(x0) obtained in the learning process is called the fitting formula; different algorithms yield different fitting formulas. In this paper, y is defined as a single variable.

The workflow is as follows: the 1st step is the learning process, using the n learning samples to obtain a fitting formula; the 2nd step is the learning validation, substituting the n learning samples into the fitting formula to get predicted values (y1, y2, …, yn), so as to verify the fitness of the algorithm; and the 3rd step is the prediction process, substituting the k prediction samples expressed by Equation (3) into the fitting formula to get predicted values (yn+1, yn+2, …, yn+k).

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}), \quad i = n+1, n+2, \ldots, n+k \qquad (3)$$

where k is the number of prediction samples; xi is the i-th prediction sample vector; and the other symbols have been defined in Equation (1). Equation (3) is the expression of the prediction samples.
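This three-step workflow maps directly onto the fit/predict pattern of common DM libraries. The following minimal sketch is ours, not the paper's code; it assumes scikit-learn, with SVR standing in for any of the six algorithms and toy arrays in place of real samples.

```python
# Minimal sketch of the three-step workflow (learning, learning validation,
# prediction), assuming scikit-learn; SVR stands in for any of the algorithms.
import numpy as np
from sklearn.svm import SVR

# n learning samples with m independent variables and observed y* (Eq. 1).
X_learn = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_star = np.array([3.1, 2.9, 7.2, 6.8])

# k prediction samples with the same m variables, y unknown (Eq. 3).
X_pred = np.array([[2.5, 2.5]])

model = SVR(kernel="rbf")        # step 1: learning process -> fitting formula
model.fit(X_learn, y_star)
y_fit = model.predict(X_learn)   # step 2: learning validation on the n samples
y_new = model.predict(X_pred)    # step 3: prediction for the k samples
```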
2.1 Error Analysis of Calculation Results

To express the calculation accuracy of the prediction variable y for learning and prediction samples when the aforementioned six algorithms (MRA, BPNN, R-SVM, C-SVM, NBAY and BAYSD) are used, the absolute relative residual R(%)i, the mean absolute relative residual R(%), and the total mean absolute relative residual R*(%) are adopted.

The absolute relative residual for each sample, R(%)i, is defined as

$$R(\%)_i = \left| \frac{y_i - y_i^*}{y_i^*} \right| \times 100 \qquad (4)$$

where yi is the calculated value of the dependent variable in the i-th sample; the other symbols have been defined in Equations (1) and (3). Note that zero must not be taken as a value of yi*, to avoid floating-point overflow. Therefore, for regression algorithms, a sample whose yi* = 0 is deleted; for classification algorithms, positive integers are used as the values of yi*.

The mean absolute relative residual over all learning samples or all prediction samples, R(%), is defined as

$$\overline{R}(\%) = \sum_{i=1}^{N_s} R(\%)_i \big/ N_s \qquad (5)$$

where Ns = n for learning samples and Ns = k for prediction samples; the other symbols have been defined in Equations (1) and (3). For learning samples, R(%)i and R(%) are called the fitting residuals and express the fitness of the learning process; here R(%) is designated R1(%). For prediction samples, R(%)i and R(%) are called the prediction residuals and express the accuracy of the prediction process; here R(%) is designated R2(%).

The total mean absolute relative residual over all samples, R*(%), is defined as

$$R^*(\%) = \left[ R_1(\%) + R_2(\%) \right] / 2 \qquad (6)$$

When there are no prediction samples, R*(%) = R1(%).
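As a concrete reading of Equations (4) to (6), the following helper (a sketch of ours, not code from the paper) computes R(%)i, the mean residual, and R*(%) for any pair of calculated and observed values; the numeric inputs below are toy values for illustration only.

```python
import numpy as np

def abs_rel_residuals(y_calc, y_star):
    """R(%)_i of Eq. (4) and their mean of Eq. (5); y_star must contain no zeros."""
    y_calc, y_star = np.asarray(y_calc, float), np.asarray(y_star, float)
    r_i = np.abs((y_calc - y_star) / y_star) * 100.0
    return r_i, r_i.mean()

# R1(%) over learning samples, R2(%) over prediction samples, R*(%) via Eq. (6).
_, r1 = abs_rel_residuals([19.5, 41.7], [19.50, 23.00])  # toy values
_, r2 = abs_rel_residuals([4.82], [4.80])
r_star = (r1 + r2) / 2.0
```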
2.2 Nonlinearity and Solution Accuracy of the Studied Problem

Since MRA is a linear algorithm, its R*(%) for a studied problem expresses the nonlinearity of the y = y(x) to be solved, i.e. the nonlinearity of the studied problem. For the linear algorithm (MRA) as well as the nonlinear algorithms (BPNN, R-SVM, C-SVM, NBAY and BAYSD), the R*(%) for a studied problem expresses the accuracy of the y = y(x) obtained by that algorithm, i.e. the solution accuracy of the studied problem as solved by that algorithm. The y = y(x) created by BPNN is an implicit expression (it cannot be written as a usual mathematical formula), whereas those of the other five algorithms are explicit expressions (they can be written as usual mathematical formulas).
3 DATA PREPROCESSING

Data preprocessing methods are various (data cleaning, data integration, data transformation, data reduction, etc.), and they are applied before data mining. Maimon and Rokach (2010) give a detailed summary of data preprocessing[5], and Han and Kamber (2006) describe the related methods in detail[6]. This paper discusses only the use of the aforementioned six algorithms (MRA, BPNN, R-SVM, C-SVM, NBAY and BAYSD) in data cleaning and data reduction.
3.1 Data Cleaning

Realistic data are often noisy, incomplete and inconsistent. The main jobs of data cleaning are to fill in missing data, smooth noisy data, identify or eliminate abnormal data, and resolve inconsistencies. For example, in the case study of Zhu and Shi (2013) and Shi et al. (2014), C-SVM was applied to 2242 learning samples, and it was found that R(%)i of 27 learning samples was not zero. After the yi* of those 27 samples were corrected, C-SVM was run again, yielding R1(%) = 0[1][2].
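The mechanics of this kind of cleaning can be sketched as follows (our illustration, not the cited study's code): fit a C-SVM on all learning samples, flag every sample whose fitted class disagrees with its recorded class (i.e. R(%)i is nonzero), check and correct those labels against the source records, and refit until R1(%) = 0. The hyperparameters here are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def flag_suspect_samples(X, y):
    """Return indices of learning samples whose fitted class differs from
    the recorded class; these are candidates for label checking/correction."""
    model = SVC(kernel="rbf", C=100.0)  # hyperparameters are placeholders
    model.fit(X, y)
    return np.flatnonzero(model.predict(X) != y)

# suspects = flag_suspect_samples(X_learn, y_class)
# After correcting y_class[suspects] against the source records, refit and
# verify that the fitting residual R1(%) has dropped to 0.
```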
3.2 Data Reduction

Analyzing the complicated content of a large-scale database usually takes a great deal of time, which often makes the analysis unrealistic and unfeasible, particularly when interactive DM is needed. Data reduction techniques are used to obtain a simplified data set from the original huge data set while keeping the simplified data set's integrity with respect to the original. DM performed on the simplified data set is obviously more efficient, and the mined results are almost the same as those from the original data set. The main data-reduction strategies are: a) data aggregation, such as building data cubes, mainly used in constructing data cubes for data warehouse operations; b) dimension reduction, mainly used to detect and eliminate unrelated, weakly correlated or redundant attributes or dimensions; c) data compression, compressing the data set with coding techniques (e.g. minimum coding length or wavelets); and d) numerosity reduction, replacing the original data with simpler representations (e.g. parametric models, or non-parametric models such as clustering, sampling and histograms).

Here we discuss only dimension-reduction, which reduces the number of independent variables, and sample-reduction, which reduces the number of samples. MRA and BAYSD can serve as pioneering dimension-reduction tools[3], because both algorithms give the dependence of the predicted value (y) on the independent variables (x1, x2, …, xm) in decreasing order. However, because MRA performs data analysis in linear correlation whereas BAYSD does so in nonlinear correlation, in applications with very strong nonlinearity the dimension-reduction ability of BAYSD is higher than that of MRA. For example, the case study of Zhu and Shi (2013) and Shi et al. (2014) is a 16-D problem (x1, x2, …, x15, y)[1][2]. Under MRA, y depends least on x1, so we tried deleting x1 and running C-SVM; the results were R1(%) = 0.036 and R2(%) = 8.13, whereas without this deletion they were R1(%) = 0 and R2(%) = 6.30, which indicates that this dimension-reduction failed. With BAYSD, however, even when x13, x7, x6, x14, x12, x3, x2 and x9 are all removed, R*(%) is 16.81, which shows that the dimension of this studied problem can be reduced from 16-D to 8-D. Why did the dimension-reduction of BAYSD succeed while that of MRA failed? Because the nonlinearity of the studied problem is strong (the R*(%) of MRA is 52.14), and BAYSD is a nonlinear algorithm while MRA is a linear one.

In addition, R-mode cluster analysis (RCA) can serve as a pioneering dimension-reduction tool, and Q-mode cluster analysis (QCA) as a pioneering sample-reduction tool[3]. Note that both RCA and QCA are linear algorithms. A "pioneering" tool is one whose result, successful or not, needs a nonlinear tool (BPNN for regression problems, C-SVM for classification problems) for a second validation, so as to determine how many samples and how many independent variables can be reduced; a sketch of this two-stage procedure follows below. Why is a second validation needed? Because of the complexity of geoscience rules, the correlations between different classes of geoscience data are nonlinear in most cases.
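The sketch below is our construction, not the paper's code: ranking variables by the absolute standardized coefficients of a linear regression is only one plausible MRA-style ordering of the dependence of y on x1, …, xm. A candidate reduction is kept only if the second validation by a nonlinear tool, here C-SVM for a classification problem, does not worsen the residuals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC

def mra_ranking(X, y):
    """Rank independent variables by |standardized regression coefficient|,
    strongest dependence of y first (one MRA-style ordering; an assumption)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    coef = LinearRegression().fit(Xs, y).coef_
    return np.argsort(-np.abs(coef))  # column indices, strongest first

def second_validation(X, y, keep, X_new, y_new):
    """Nonlinear check of a candidate dimension-reduction with C-SVM:
    return R1(%) and R2(%) computed on the retained columns only."""
    model = SVC(kernel="rbf").fit(X[:, keep], y)
    r1 = np.mean(np.abs((model.predict(X[:, keep]) - y) / y)) * 100.0
    r2 = np.mean(np.abs((model.predict(X_new[:, keep]) - y_new) / y_new)) * 100.0
    return r1, r2
```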
4 CASE STUDY: OIL LAYER EVALUATION

4.1 Studied Problem

The objective of this case study is the prediction of formation flow capacity, which has practical value when oil test data are limited. Using data of 14 samples from the Tahe Oilfield in the Tarim Basin, China, each containing 7 independent variables (x1 = oil viscosity, x2 = oil outcome, x3 = choke size, x4 = oil pressure, x5 = oil density, x6 = gas/oil ratio, and x7 = water cut) and an oil test result (y* = formation flow capacity), Kang et al. (2007) adopted BPNN for the prediction of formation flow capacity[7]; in fact, they used only 6 independent variables, omitting x1 (oil viscosity). In our case study, among these 14 samples, 12 are taken as learning samples and one as the prediction sample, with all 7 independent variables (Table 1).
4.2 Input Known Parameters

They are the values of the known variables xi (i = 1, 2, …, 7) for the 12 learning samples and the one prediction sample, and the value of the prediction variable y* for the 12 learning samples (Table 1). Note: y* = formation flow capacity for the regression calculation, and y* = oil layer classification for the classification calculation.
TABLE 1 INPUT DATA FOR FORMATION FLOW CAPACITY OF THE TAHE OILFIELD

| Sample type | Sample No. | Well No. | x1 a | x2 a | x3 a | x4 a | x5 a | x6 a | x7 a | FFC b | OLC c |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Learning samples | 1 | TK631 | 35.6 | 133.98 | 6 | 9.56 | 0.96 | 0 | 8.5 | 19.50 | 3 |
| | 2 | S65 | 8.0 | 298 | 8 | 7.3 | 0.95 | 0 | 0.29 | 23.00 | 3 |
| | 3 | TK413 | 78.5 | 81.57 | 4 | 7.8 | 0.96 | 68 | 0.08 | 3.90 | 4 |
| | 4 | TK404 | 78.0 | 219 | 6.9 | 11.69 | 0.95 | 43.9 | 0 | 67.77 | 2 |
| | 5 | TK409 | 28.3 | 160 | 7.9 | 5.16 | 0.96 | 52.9 | 0.1 | 26.07 | 2 |
| | 6 | S67 | 11.7 | 413 | 8 | 10.35 | 0.97 | 0 | 0.2 | 102.50 | 1 |
| | 7 | S74 | 22.6 | 136 | 6 | 9.28 | 0.98 | 22 | 0 | 82.11 | 2 |
| | 8 | TK609 | 25.5 | 116 | 7 | 6.1 | 0.96 | 0 | 7.41 | 21.21 | 3 |
| | 9 | TK313 | 1.9 | 201.7 | 6 | 17 | 0.90 | 88 | 0.5 | 126.00 | 1 |
| | 10 | TK607 | 24.6 | 203.3 | 8 | 9.6 | 0.95 | 37.78 | 0.03 | 191.31 | 1 |
| | 11 | TK444 | 9.2 | 197.9 | 6 | 8.75 | 0.99 | 0 | 0 | 130.19 | 1 |
| | 12 | TK629 | 22.6 | 83.4 | 6 | 4.3 | 0.98 | 0 | 0.11 | 6.993 | 3 |
| Prediction sample | 13 | TK442 | 17.9 | 232 | 7 | 7.5 | 0.98 | 0 | 7.13 | (4.80) | (3) |

a Relative parameters for formation flow capacity: x1 = oil viscosity, mPa·s; x2 = oil outcome, t/d; x3 = choke size, mm; x4 = oil pressure, MPa; x5 = oil density, g/cm³; x6 = gas/oil ratio; x7 = water cut, %.
b y* = formation flow capacity (FFC, 1/(m·µm²)) or oil layer classification (OLC) determined by the oil test; numbers in parentheses are not input data but are used for calculating R(%).
c When y* = OLC, it is the oil layer classification (1 = high-productivity oil layer, 2 = intermediate-productivity oil layer, 3 = low-productivity oil layer, 4 = dry layer) determined by the oil test (Table 2).
TABLE 2 OIL LAYER CLASSIFICATION BASED ON FORMATION FLOW CAPACITY

| Oil layer classification | Formation flow capacity (1/(m·µm²)) | y* |
|---|---|---|
| High-productivity oil layer | >100 | 1 |
| Intermediate-productivity oil layer | (25, 100] | 2 |
| Low-productivity oil layer | [4, 25] | 3 |
| Dry layer | <4 | 4 |
4.3 Regression Calculation

R-SVM, BPNN and MRA are adopted for the regression calculation.

1) Learning Process. Using the 12 learning samples (Table 1), R-SVM[8][3], BPNN[3] and MRA[3] each construct a function of formation flow capacity (y) with respect to the 7 independent variables (x1, x2, …, x7), i.e. each yields its own fitting formula. Substituting the values of x1, x2, …, x7 of the 12 learning samples (Table 1) into the fitting formulas of R-SVM, BPNN and MRA, respectively, the formation flow capacity (y) of each learning sample is obtained. Table 3 shows the results of the learning process.

2) Prediction Process. Substituting the values of x1, x2, …, x7 of the one prediction sample (Table 1) into the fitting formulas of R-SVM, BPNN and MRA, respectively, the formation flow capacity (y) of the prediction sample is obtained. Table 3 shows the results of the prediction process.

It can be seen from R*(%) = 516.65 of MRA (Table 4) that the nonlinearity of the relationship between the predicted value y and its independent variables (x1, x2, …, x7) is very strong. The solution accuracies of R-SVM and MRA are very low, and that of BPNN is moderate. Since this case study is a very strongly nonlinear problem, R-SVM, BPNN and MRA are not applicable.
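For readers who want to reproduce the shape of this experiment, the sketch below runs off-the-shelf analogues of the three algorithms on the Table 1 data (scikit-learn's SVR, MLPRegressor and LinearRegression; the library and hyperparameters are our choices, so the numbers will not match Tables 3 and 4 exactly).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# 12 learning samples of Table 1: columns x1..x7, target y* = FFC.
X = np.array([
    [35.6, 133.98, 6.0,  9.56, 0.96,  0.0,  8.50],
    [ 8.0, 298.0,  8.0,  7.30, 0.95,  0.0,  0.29],
    [78.5,  81.57, 4.0,  7.80, 0.96, 68.0,  0.08],
    [78.0, 219.0,  6.9, 11.69, 0.95, 43.9,  0.00],
    [28.3, 160.0,  7.9,  5.16, 0.96, 52.9,  0.10],
    [11.7, 413.0,  8.0, 10.35, 0.97,  0.0,  0.20],
    [22.6, 136.0,  6.0,  9.28, 0.98, 22.0,  0.00],
    [25.5, 116.0,  7.0,  6.10, 0.96,  0.0,  7.41],
    [ 1.9, 201.7,  6.0, 17.00, 0.90, 88.0,  0.50],
    [24.6, 203.3,  8.0,  9.60, 0.95, 37.78, 0.03],
    [ 9.2, 197.9,  6.0,  8.75, 0.99,  0.0,  0.00],
    [22.6,  83.4,  6.0,  4.30, 0.98,  0.0,  0.11],
])
y = np.array([19.50, 23.00, 3.90, 67.77, 26.07, 102.50,
              82.11, 21.21, 126.00, 191.31, 130.19, 6.993])
# Prediction sample 13 and its observed FFC (used only to compute R2(%)).
x13, y13 = np.array([[17.9, 232.0, 7.0, 7.50, 0.98, 0.0, 7.13]]), 4.80

for name, model in [
        ("R-SVM", SVR(kernel="rbf")),
        ("BPNN",  MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)),
        ("MRA",   LinearRegression())]:
    model.fit(X, y)
    r1 = np.mean(np.abs((model.predict(X) - y) / y)) * 100.0   # Eq. (5)
    r2 = abs((model.predict(x13)[0] - y13) / y13) * 100.0      # Eq. (4)
    print(f"{name}: R1(%)={r1:.2f}  R2(%)={r2:.2f}  R*(%)={(r1 + r2) / 2:.2f}")
```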
TABLE 3 PREDICTION RESULTS OF FORMATION FLOW CAPACITY OF THE TAHE OILFIELD

Formation flow capacity (1/(m·µm²)), by regression algorithm:

| Sample type | Sample No. | Oil test y* | R-SVM y | R-SVM R(%)i | BPNN y | BPNN R(%)i | MRA y | MRA R(%)i |
|---|---|---|---|---|---|---|---|---|
| Learning samples | 1 | 19.50 | 45.38 | 132.7 | 19.5 | 0.2 | 30.1 | 54.6 |
| | 2 | 23.00 | 46.95 | 104.1 | 41.7 | 81.2 | 33.2 | 44.2 |
| | 3 | 3.90 | 46.17 | 1083.9 | 3.90 | 0 | -21.9 | 662.4 |
| | 4 | 67.77 | 47.23 | 30.3 | 70.3 | 3.7 | 92.0 | 35.7 |
| | 5 | 26.07 | 46.61 | 78.8 | 27.4 | 5.0 | 59.1 | 126.8 |
| | 6 | 102.50 | 47.44 | 53.7 | 140 | 36.2 | 103 | 0.4 |
| | 7 | 82.11 | 46.86 | 42.9 | 85.0 | 3.5 | 118 | 43.7 |
| | 8 | 21.21 | 45.26 | 113.4 | 28.8 | 35.9 | 7.9 | 62.8 |
| | 9 | 126.00 | 47.77 | 62.1 | 136 | 7.9 | 136 | 7.9 |
| | 10 | 191.31 | 47.22 | 75.3 | 191 | 0 | 119 | 37.7 |
| | 11 | 130.19 | 46.94 | 63.9 | 144 | 10.5 | 107 | 18.2 |
| | 12 | 6.993 | 46.11 | 559.4 | 6.60 | 5.6 | 17.6 | 151.0 |
| Prediction sample | 13 | 4.80 | 45.74 | 833.1 | 4.82 | 0.51 | 49.4 | 929.5 |
TABLE 4 COMPARISON AMONG THE APPLICATIONS OF THREE REGRESSION ALGORITHMS TO FORMATION FLOW CAPACITY OF THE TAHE OILFIELD

| Algorithm | Fitting formula | R1(%) | R2(%) | R*(%) | Dependence of y on (x1, …, x7), in decreasing order | Time consumed on PC (Intel Core 2) | Solution accuracy |
|---|---|---|---|---|---|---|---|
| R-SVM | Nonlinear, explicit | 200.05 | 852.99 | 526.51 | N/A | 3 s | Very low |
| BPNN | Nonlinear, implicit | 15.81 | 0.51 | 8.16 | N/A | 30 s | Moderate |
| MRA | Linear, explicit | 103.78 | 929.52 | 516.65 | x4, x3, x5, x6, x1, x2, x7 | <1 s | Very low |

R1(%), R2(%) and R*(%) are the mean absolute relative residuals defined in Section 2.1.
4.4 Classification Calculation

C-SVM, NBAY, BAYSD and MRA are adopted for the classification calculation.

1) Learning Process. Using the 12 learning samples (Table 1), C-SVM[8][3], NBAY[3], BAYSD[3] and MRA[3] each construct a function of oil layer classification (y) with respect to the 7 independent variables (x1, x2, …, x7), i.e. each yields its own fitting formula. Substituting the values of x1, x2, …, x7 of the 12 learning samples (Table 1) into the fitting formulas of C-SVM, NBAY, BAYSD and MRA, respectively, the oil layer classification (y) of each learning sample is obtained. Table 5 shows the results of the learning process.

2) Prediction Process. Substituting the values of x1, x2, …, x7 of the one prediction sample (Table 1) into the fitting formulas of C-SVM, NBAY, BAYSD and MRA, respectively, the oil layer classification (y) of the prediction sample is obtained. Table 5 shows the results of the prediction process.

It can be seen from R*(%) = 18.8 of MRA (Table 6) that the nonlinearity of the relationship between the predicted value y and its independent variables (x1, x2, …, x7) is strong. The solution accuracy of C-SVM is very high: not only R1(%) = 0 but also R2(%) = 0, so the total mean absolute relative residual R*(%) = 0, which coincides with practice. Since this case study is a strongly nonlinear problem, only C-SVM is applicable; NBAY and BAYSD are not applicable due to very low solution accuracy.
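Analogously, the classification experiment can be sketched with scikit-learn (our illustration: SVC stands in for C-SVM and GaussianNB for NBAY; BAYSD has no direct scikit-learn counterpart and is omitted). The arrays X and x13 are those defined in the regression sketch of Section 4.3; again, the numbers need not reproduce Tables 5 and 6.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Target y* = oil layer classification (OLC) of Table 1; X and x13 as in the
# regression sketch of Section 4.3. The observed class of sample 13 is 3.
y_cls = np.array([3, 3, 4, 2, 2, 1, 2, 3, 1, 1, 1, 3])

for name, model in [("C-SVM", SVC(kernel="rbf", C=100.0)),
                    ("NBAY",  GaussianNB())]:
    model.fit(X, y_cls)
    r1 = np.mean(np.abs((model.predict(X) - y_cls) / y_cls)) * 100.0  # Eq. (5)
    r2 = abs((model.predict(x13)[0] - 3) / 3) * 100.0                 # Eq. (4)
    print(f"{name}: R1(%)={r1:.2f}  R2(%)={r2:.2f}")
```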
TABLE 5 PREDICTION RESULTS FROM OIL LAYER CLASSIFICATION OF THE TAHE OILFIELD

Oil layer classification, by classification algorithm:

| Sample type | Sample No. | Oil test y* a | C-SVM y | C-SVM R(%)i | NBAY y | NBAY R(%)i | BAYSD y | BAYSD R(%)i | MRA y b | MRA R(%)i |
|---|---|---|---|---|---|---|---|---|---|---|
| Learning samples | 1 | 3 | 3 | 0 | 1 | 66.7 | 3 | 0 | 3 | 0 |
| | 2 | 3 | 3 | 0 | 1 | 66.7 | 3 | 0 | 3 | 0 |
| | 3 | 4 | 4 | 0 | 2 | 50 | 4 | 0 | 4 | 0 |
| | 4 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| | 5 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| | 6 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| | 7 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 1 | 50 |
| | 8 | 3 | 3 | 0 | 1 | 66.7 | 3 | 0 | 3 | 0 |
| | 9 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| | 10 | 1 | 1 | 0 | 2 | 100 | 1 | 0 | 1 | 0 |
| | 11 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| | 12 | 3 | 3 | 0 | 1 | 66.7 | 3 | 0 | 3 | 0 |
| Prediction sample | 13 | 3 | 3 | 0 | 1 | 66.7 | 1 | 66.7 | 2 | 33.3 |

a y* = oil layer classification (1 = high-productivity oil layer, 2 = intermediate-productivity oil layer, 3 = low-productivity oil layer, 4 = dry layer) determined by the oil test (Table 2).
b The y values calculated by MRA are converted from real numbers to integers by rounding.
TABLE 6 COMPARISON AMONG THE APPLICATIONS OF FOUR CLASSIFICATION ALGORITHMS TO OIL LAYER CLASSIFICATION OF THE TAHE OILFIELD

| Algorithm | Fitting formula | R1(%) | R2(%) | R*(%) | Dependence of y on (x1, …, x7), in decreasing order | Time consumed on PC (Intel Core 2) | Solution accuracy |
|---|---|---|---|---|---|---|---|
| C-SVM | Nonlinear, explicit | 0 | 0 | 0 | N/A | 5 s | Very high |
| NBAY | Nonlinear, explicit | 34.72 | 66.7 | 50.7 | N/A | <1 s | Very low |
| BAYSD | Nonlinear, explicit | 0 | 66.7 | 33.3 | x1, x7, x4, x3, x5, x6, x2 | 1 s | Very low |
| MRA | Linear, explicit | 4.17 | 33.3 | 18.8 | x2, x4, x1, x5, x3, x6, x7 | <1 s | Strong nonlinearity |

R1(%), R2(%) and R*(%) are the mean absolute relative residuals defined in Section 2.1; for MRA, R*(%) expresses the nonlinearity of the studied problem (Section 2.2).
4.5 Summary

In summary, using data for the formation flow capacity and oil layer classification in the Tahe Oilfield, based on seven independent variables (oil viscosity, oil outcome, choke size, oil pressure, oil density, gas/oil ratio, and water cut) and an oil test result for 13 samples, of which 12 are taken as learning samples and one as the prediction sample, regression algorithms (R-SVM, BPNN, MRA) are adopted for the formation flow capacity, and classification algorithms (C-SVM, NBAY and BAYSD) are used for the oil layer classification. It is found that a) since the formation flow capacity is a very strongly nonlinear problem, R-SVM, BPNN and MRA are not applicable; and b) since the oil layer classification is a strongly nonlinear problem, only C-SVM is applicable, while NBAY and BAYSD are not applicable due to very low solution accuracy.
5 CONCLUSIONS

DM is a useful technique for exploiting the value of data assets in PUP. Through many DM applications to PUP[1][2][3][4], we have found that: a) the preferable algorithm for regression problems is BPNN, followed by R-SVM and MRA, and the preferable algorithm for classification problems is C-SVM, followed by BAYSD; b) C-SVM can also be applied in data cleaning; c) both MRA and BAYSD can also be applied in dimension-reduction, with BAYSD the preferable one; and d) R-mode cluster analysis (RCA) can be applied in dimension-reduction, while Q-mode cluster analysis (QCA) can be applied in sample-reduction. It is noted that both RCA and QCA are linear algorithms. Finally, a case study (oil layer evaluation) indicates that in this case R-SVM, BPNN and MRA are not applicable for regression, whereas C-SVM is applicable for classification.
REFERENCES

[1] Zhu, Y., Shi, G. "Identification of lithologic characteristics of volcanic rocks by support vector machine." Acta Petrolei Sinica, 34(2), 312-322, 2013 (in Chinese)
[2] Shi, G., Zhu, Y., Mi, S., Ma, J., Wan, J. "A Big Data Mining in Petroleum Exploration and Development." CSCanada, 7(2), 1-8, 2014
[3] Shi, G. "Data Mining and Knowledge Discovery for Geoscientists." Elsevier Inc., USA, 2013
[4] Shi, G. "Optimal prediction in petroleum geology by regression and classification methods." Sci J Inf Eng, 5(2), 14-32, 2015
[5] Maimon, O., Rokach, L. "The Data Mining and Knowledge Discovery Handbook," 2nd edition. Springer, New York, NY, USA, 2010
[6] Han, J.W., Kamber, M. "Data Mining: Concepts and Techniques," 2nd edition. Morgan Kaufmann, San Francisco, CA, USA, 2006
[7] Kang, Z., Guo, C., Wu, W. "Technique of dynamic descriptions to the crack and cave carbonate rock reservoir in the Tahe oil field, Xinjiang, China." Journal of Chengdu University of Technology (Science & Technology Edition), 34(2), 143-146, 2007 (in Chinese)
[8] Chang, C., Lin, C. "LIBSVM: a library for support vector machines, Version 3.1." Retrieved from http://www.csie.ntu.edu.tw/~cjlin/libsvm/, 2011
AUTHORS

1 Dawei Li is a senior engineer of PetroChina, born in Hebei Province, China, on May 5th, 1969. He got his doctoral degree in petroleum geology from China University of Geosciences, Beijing, China in 1996. Since then, he has worked as an engineer of geophysics, petroleum geology and IT in different departments of PetroChina (such as the Bureau of Geophysical Prospecting and the Research Institute of Petroleum Exploration and Development). He has published several books and more than 40 articles. Two of the major published books are "Tectonic Types of Oil and Gas Basins in China" (Beijing: Petroleum Industry Press, 2003) and "Neotectonism and Hydrocarbon Accumulation in Huanghua Depression, China" (in Chinese; Wuhan, Hubei Province: China University of Geosciences Press, 2006). Two of his important published articles are "Ten critical factors for the evaluation of enterprise informatization benefits" (in Chinese, 2014) and "Discussion on informatization in PetroChina" (in Chinese, 2009). His present job is to build and manage PUP information management and DM systems for PetroChina. Dr. Li is a member of the China Geological Society and the China Petroleum Society.

2 Guangren Shi is a professor of PetroChina, born in Shanghai, China in February, 1940. His research covers two fields: basin modeling (petroleum system) and data mining for geosciences. He has published 3 books and about 12 articles on data mining.