
Scientific Journal of Information Engineering February 2016, Volume 6, Issue 1, PP.19-28

The Use of Cluster Analysis and Classifiers for Determination and Prediction of Oil Family Type Based on Geochemical Data

Qian ZHANG†, Guangren Shi

Research Institute of Petroleum Exploration and Development, PetroChina, P. O. Box 910, Beijing 100083, China


Abstract

Q-mode cluster analysis (QCA), multiple regression analysis (MRA) and three classifiers have been applied to an oil family type (OFT) problem using geochemical data, which has practical value when experimental data are limited. The three classifiers are support vector classification (SVC), naïve Bayesian (NBAY), and Bayesian successive discrimination (BAYSD). QCA is first used to determine OFT, and then SVC, NBAY and BAYSD are used to predict OFT. Of the four regression/classification algorithms, only MRA is linear, whereas SVC, NBAY and BAYSD are nonlinear. When all four algorithms are applied to the same real-world problem, they generally produce different solution accuracies. To address this, the solution accuracy is expressed by the sum of mean absolute relative residual (MARR) of all samples, R(%), and three criteria are proposed: 1) nonlinearity degree of a studied problem based on R(%) of MRA (weak if R(%)<10, and strong if R(%)≥10); 2) solution accuracy of a given algorithm application based on its R(%) (high if R(%)<10, and low if R(%)≥10); and 3) results availability of a given algorithm application based on its R(%) (applicable if R(%)<10, and inapplicable if R(%)≥10). A case study (the OFT problem), which is a classification problem, is used to validate the proposed approach. The calculation results indicate that a) when a problem is weakly nonlinear, SVC, NBAY and BAYSD are all applicable, and BAYSD is better than SVC and NBAY; and b) BAYSD and SVC can be applied to dimensionality reduction.

Keywords: Cluster Analysis; Multiple Regression Analysis; Support Vector Machine; Naïve Bayesian; Bayesian Successive Discrimination; Problem Nonlinearity; Solution Accuracy; Results Availability; Dimensionality Reduction

1 INTRODUCTION

In recent years, regression/classification algorithms have seen preliminary success in petroleum geochemistry[1, 2], but these algorithms and the cluster analysis algorithm have not yet been applied to oil family type (OFT) prediction. Peters et al. (2007) employed decision trees to predict OFT using samples that each contain 21 geochemical factors[3]. Using all the samples presented in [3], we first employ the Q-mode cluster analysis (QCA) algorithm to determine OFT, and then employ the four regression/classification algorithms below to predict OFT. This approach proved successful and has not been reported in the literature before.

Besides the QCA algorithm for determining OFT, one regression algorithm and three classification algorithms have been applied to predict OFT. The regression algorithm is multiple regression analysis (MRA), and the three classification algorithms are support vector classification (SVC), naïve Bayesian (NBAY), and Bayesian successive discrimination (BAYSD). Of these four algorithms, only MRA is linear, whereas the other three are nonlinear. When all four algorithms are applied to the same real-world problem, they generally produce different solution accuracies. To address this, the solution accuracy is expressed by the sum of mean absolute relative residual (MARR) of all samples, R(%), and three criteria are proposed: 1) nonlinearity degree of a studied problem based on R(%) of MRA (weak if R(%)<10, and strong if R(%)≥10); 2) solution accuracy of a given algorithm application based on its R(%) (high if R(%)<10, and low if R(%)≥10); and 3) results availability of a given algorithm application based on its R(%) (applicable if R(%)<10, and inapplicable if R(%)≥10) (Table 1). A case study (the OFT problem) has been used to validate the proposed approach.

TABLE 1. PROPOSED THREE CRITERIA

| Range of R(%) | Criterion 1: nonlinearity degree of a studied problem, based on R(%) of MRA | Criterion 2: solution accuracy of a given algorithm application, based on its R(%) | Criterion 3: results availability of a given algorithm application, based on its R(%) |
|---|---|---|---|
| R(%)<10 | Weak | High | Applicable |
| R(%)≥10 | Strong | Low | Inapplicable |
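The three criteria of Table 1 amount to a single threshold test on R(%). A minimal Python sketch is given below; the function names and the sample R(%) values are illustrative assumptions, not part of the original work.

```python
# Threshold taken from Table 1: R(%) < 10 versus R(%) >= 10.
THRESHOLD = 10.0


def nonlinearity_degree(r_mra: float) -> str:
    """Criterion 1: judged from R(%) of MRA only."""
    return "weak" if r_mra < THRESHOLD else "strong"


def solution_accuracy(r: float) -> str:
    """Criterion 2: judged from R(%) of the algorithm being assessed."""
    return "high" if r < THRESHOLD else "low"


def results_availability(r: float) -> str:
    """Criterion 3: judged from R(%) of the algorithm being assessed."""
    return "applicable" if r < THRESHOLD else "inapplicable"


if __name__ == "__main__":
    # Hypothetical R(%) values, chosen to show both sides of the threshold.
    for r in (4.2, 10.0, 12.5):
        print(r, nonlinearity_degree(r), solution_accuracy(r), results_availability(r))
```

Note that the boundary value R(%)=10 falls on the "strong / low / inapplicable" side, as Table 1 specifies R(%)≥10 for that row.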

2 METHODOLOGY

The methodology consists of three major parts: definitions commonly used by regression/classification algorithms; an introduction to the five algorithms; and dimensionality reduction.

2.1 Definitions Commonly Used by Regression/Classification Algorithms

The aforementioned regression/classification algorithms share the same sample data. The essential difference between the two types of algorithms is that the output of a regression algorithm is a real-type value that in general differs from the real number given in the corresponding learning sample, whereas the output of a classification algorithm is an integer-type value that must be one of the integers defined in the learning samples. From the viewpoint of dataology, an integer-type value is called a discrete attribute, while a real-type value is called a continuous attribute.

The four algorithms (MRA, SVC, NBAY, BAYSD) use the same known parameters and share the same unknown to be predicted. They differ only in approach and calculation results.

Assume that there are n learning samples, each associated with m+1 numbers (x1, x2, …, xm, y*) and a set of observed values (xi1, xi2, …, xim, yi*), with i=1, 2, …, n. In principle n>m, but in actual practice n>>m. The n samples associated with m+1 numbers are defined as n vectors:

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}, y_i^*) \quad (i = 1, 2, \ldots, n) \tag{1}$$


where n is the number of learning samples; m is the number of independent variables in samples; xi is the ith learning sample vector; xij is the value of the jth independent variable in the ith learning sample, j=1, 2, …, m; and yi* is the observed value of the ith learning sample. Eq. 1 is the expression of learning samples.

Let x0 be the general form of a vector (xi1, xi2, …, xim). The principles of MRA, NBAY and BAYSD are the same, i.e., to construct an expression, y=y(x0), such that Eq. 2 is minimized. Naturally, these three algorithms use different approaches and obtain calculation results of differing accuracy.

$$\sum_{i=1}^{n} \left[ y(\mathbf{x}_{0i}) - y_i^* \right]^2 \tag{2}$$



where y=y(x0i) is the calculation result of the dependent variable in the ith learning sample; and the other symbols have been defined in Eq. 1.

However, the principle of the SVC algorithm is to construct an expression, y=y(x0), that maximizes the margin based on support vector points so as to obtain the optimal separating line.

This y=y(x0) is called the fitting formula obtained in the learning process. The fitting formulas of different algorithms are different. In this paper, y is defined as a single variable.

The flowchart is as follows: the 1st step is the learning process, which uses n learning samples to obtain a fitting formula; the 2nd step is the learning validation, which substitutes the n learning samples (xi1, xi2, …, xim) into the fitting formula to get prediction values (y1, y2, …, yn), so as to verify the fitness of an algorithm; and the 3rd step is the prediction process, which substitutes k prediction samples expressed with Eq. 3 into the fitting formula to get prediction values (yn+1, yn+2, …, yn+k).

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \quad (i = n+1, n+2, \ldots, n+k) \tag{3}$$


where k is the number of prediction samples; xi is the ith prediction sample vector; and the other symbols have been defined in Eq. 1. Eq. 3 is the expression of prediction samples.

Of the four algorithms, only MRA is linear, whereas the other three are nonlinear, because MRA constructs a linear function whereas the other three construct nonlinear functions.

To express the calculation accuracy of the predicted variable y for learning and prediction samples when the four algorithms are used, the following four types of residuals are defined.

The absolute relative residual for each sample, R(%)i (i=1, 2, …, n, n+1, n+2, …, n+k), is defined as

$$R(\%)_i = \left| \frac{y_i - y_i^*}{y_i^*} \right| \times 100 \tag{4}$$


where yi is the calculation result of the dependent variable in the ith sample; and the other symbols have been defined in Eqs. 1 and 3. R(%)i is the fitting residual expressing the fitness for a sample in the learning or prediction process. Note that yi* must not be zero, because it appears in the denominator of Eq. 4. Therefore, for a regression algorithm, a sample is deleted if its yi*=0; and for a classification algorithm, positive integers are taken as the values of yi*.

The MARR for all learning samples, R1(%), is defined as

$$R_1(\%) = \sum_{i=1}^{n} R(\%)_i / n \tag{5}$$


where all symbols have been defined in Eqs. 1 and 4. R1(%) is the fitting residual expressing the fitness of the learning process.

The MARR for all prediction samples, R2(%), is defined as

$$R_2(\%) = \sum_{i=n+1}^{n+k} R(\%)_i / k \tag{6}$$


where all symbols have been defined in Eqs. 3 and 4. R2(%) is the fitting residual expressing the fitness of the prediction process.

The sum of MARR of all samples, R(%), is defined as

$$R(\%) = \sum_{i=1}^{n+k} R(\%)_i / (n + k) \tag{7}$$


where all symbols have been defined in Eqs. 1, 3 and 4. If there are no prediction samples (k=0), then R(%)=R1(%). R(%) is the fitting residual expressing the fitness of the learning and prediction processes together.

When the four algorithms (SVC, NBAY, BAYSD, MRA) are used to solve a real-world problem, they often produce different solution accuracies. To address this, the three criteria in Table 1 have been proposed.
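The residual definitions of this section can be sketched in a few lines of Python. The function names are assumptions made for illustration; the two data lists are the y* column and the rounded MRA output of Table 3 for samples 1 to 37, with the first 30 treated as learning samples.

```python
def abs_relative_residual(y, y_star):
    """Per-sample residual R(%)_i = |(y_i - y_i*) / y_i*| * 100; y_i* must be nonzero."""
    if y_star == 0:
        raise ValueError("y* must not be zero")
    return abs((y - y_star) / y_star) * 100.0


def marr(y, y_star, n):
    """Return (R1, R2, R); the first n samples are the learning samples."""
    res = [abs_relative_residual(a, b) for a, b in zip(y, y_star)]
    k = len(res) - n                                # number of prediction samples
    r1 = sum(res[:n]) / n                           # MARR of learning samples
    r2 = sum(res[n:]) / k if k > 0 else None        # MARR of prediction samples
    r = sum(res) / (n + k)                          # over all n + k samples
    return r1, r2, r


# y* and rounded MRA output for samples 1..37, transcribed from Table 3.
Y_STAR = [4, 4, 4, 3, 3, 3, 3, 4, 2, 2, 2, 4, 3, 3, 2, 1, 3, 3, 3, 1,
          1, 4, 1, 3, 1, 1, 4, 3, 4, 2, 1, 2, 3, 3, 3, 4, 4]
Y_MRA = [4, 4, 3, 3, 3, 3, 3, 4, 2, 2, 2, 4, 2, 2, 2, 1, 3, 3, 3, 1,
         1, 4, 1, 3, 1, 1, 4, 3, 4, 2, 0, 2, 2, 2, 3, 4, 4]

r1, r2, r = marr(Y_MRA, Y_STAR, n=30)
print(f"R1={r1:.2f}  R2={r2:.2f}  R={r:.2f}")
```

With these inputs the total R(%) rounds to 6.98, which agrees with the value reported for MRA in Section 3.2.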

2.2 Introduction of Five Algorithms

The five algorithms (QCA, MRA, SVC, NBAY, BAYSD) are described in detail by Shi (2013), and the latter four have been concisely introduced elsewhere[1, 4–11]. Hence, only QCA is concisely introduced here.

Cluster analysis has been widely applied since the 1970s[1, 12]. Q-mode and R-mode cluster analyses are among the most popular methods in cluster analysis and remain very useful tools in some fields. Cluster analysis is a classification analysis approach for a data set, and belongs to multiple linear statistical analysis.

As mentioned above, regression/classification algorithms need n learning samples expressed with Eq. 1 and k prediction samples expressed with Eq. 3. From Eqs. 1 and 3, each learning sample must have the data of both the independent variables (xi) and a dependent variable (yi*), while each prediction sample only has the data of the independent variables (xi). Cluster analysis, however, has neither a learning process nor a prediction process. Assume that

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}) \quad (i = 1, 2, \ldots, n^*) \tag{8}$$


where n* is the number of samples; and the other symbols have been defined in Eq. 1. Equation 8 expresses n* sample vectors. Cluster analysis conducted on the n* sample vectors is called Q-mode cluster analysis (QCA).

Transposing the matrix defined by the right-hand side of Eq. 8, we obtain

$$\mathbf{x}_j = (x_{1j}, x_{2j}, \ldots, x_{n^*j}) \quad (j = 1, 2, \ldots, m) \tag{9}$$


Equation 9 expresses m independent variable vectors. Cluster analysis conducted on the m independent variable vectors is called R-mode cluster analysis (RCA). Obviously, the matrices defined by the right-hand sides of Eqs. 8 and 9 are transposes of each other.

Cluster analysis thus has two functions, QCA and RCA, applied to samples and independent variables, respectively[1]. QCA yields a cluster pedigree chart of the n* samples (xi) that indicates the order of dependence between samples; therefore, QCA may serve as a promising sample-reduction tool. RCA yields a cluster pedigree chart of the m independent variables (xj) that indicates the order of dependence between independent variables; therefore, RCA may serve as a promising independent-variable-reduction (i.e., dimension-reduction) tool.

The cluster analysis introduced in this paper refers only to QCA. The procedure of RCA is essentially the same as that of QCA; the only difference is that the input data for QCA are expressed with Eq. 8, while the input data for RCA are expressed with Eq. 9.

The commonly used procedure for cluster analysis at present is the aggregation procedure. As the name implies, the aggregation procedure merges many classes into fewer classes until only one class remains. For example, QCA merges n* classes into one class, while RCA merges m classes into one class.

The procedure of QCA is performed in the following four steps[1, 12]:

1) Determine the number of cluster objects, i.e., the total number of classes is n* at the beginning of cluster analysis.

2) Select one method, such as standard difference standardization, maximum difference standardization, or maximum difference normalization, to normalize the original data expressed by Eq. 8.

3) Select a cluster statistic (e.g., distance coefficient, analog coefficient, or correlation coefficient) and a class-distance method (e.g., the shortest distance, the longest distance, or the weighted mean), and apply them to every pair of classes among the n* classes to obtain the order of dependence between classes. Merge the closest two classes (for the distance coefficient, the pair with minimum class-distance; for the analog coefficient or correlation coefficient, the pair with maximum class-distance) into a new class composed of two sample vectors, so that the total number of classes becomes (n*−1).

4) Apply the selected method to every pair of classes among the (n*−1) classes to obtain the order of dependence between classes, and merge the closest two classes into a new class, so that the total number of classes becomes (n*−2). Repeat this operation until the total number of classes becomes 1, i.e., the n* classes have been merged into one class, and then stop.

Up to now, nine clustering methods have been introduced: three class-distance methods (the shortest-distance method, the longest-distance method, and the weighted mean method) under each of the three cluster statistics (distance coefficient, analog coefficient, and correlation coefficient). Even if the same kind of data normalization (one of standard difference standardization, maximum difference standardization, and maximum difference normalization) is applied to the original data, the nine methods give different calculation results. Which method is the best? A suitable one is chosen depending on the practical application and the specific data.

It must be pointed out that, since cluster analysis proceeds by successive "two-in-one" aggregation, the number of new classes is one at the last aggregation, two at the previous one, four at the one before that, and so on. Except for the last aggregation, the number of new classes is therefore an even number, so the number of classes obtained in the sample classification by QCA is an even number.
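The four-step aggregation procedure described in this section can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of [1]: it uses z-score standardization for "standard difference standardization", the correlation coefficient as the cluster statistic, and a size-weighted mean of similarities as one possible reading of the "weighted mean" class-distance method. The function names and the four-sample data set are hypothetical.

```python
import math


def standardize(rows):
    """Z-score each variable (column); assumed form of standard difference standardization."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
            for c, mu in zip(cols, means)]
    return [[(v - mu) / sd for v, mu, sd in zip(r, means, stds)] for r in rows]


def correlation(a, b):
    """Correlation coefficient between two sample vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den


def qca(rows, n_classes=1):
    """Merge the two most similar classes repeatedly until n_classes remain.

    Returns (merges, classes): merges holds (member indices, similarity) in
    merge order; classes holds the member indices of each remaining class.
    """
    z = standardize(rows)                                   # step 2
    classes = {i: [i] for i in range(len(z))}               # step 1: n* classes
    sim = {(i, j): correlation(z[i], z[j])
           for i in classes for j in classes if i < j}
    merges, next_id = [], len(z)
    while len(classes) > n_classes:                         # steps 3 and 4
        (a, b), s = max(sim.items(), key=lambda kv: kv[1])  # closest pair
        na, nb = len(classes[a]), len(classes[b])
        new_sim = {}
        for c in classes:
            if c in (a, b):
                continue
            sa = sim[(min(a, c), max(a, c))]
            sb = sim[(min(b, c), max(b, c))]
            # Size-weighted mean similarity between class c and the new class.
            new_sim[(c, next_id)] = (na * sa + nb * sb) / (na + nb)
        sim = {key: v for key, v in sim.items() if a not in key and b not in key}
        sim.update(new_sim)
        classes[next_id] = sorted(classes.pop(a) + classes.pop(b))
        merges.append((classes[next_id], s))
        next_id += 1
    return merges, sorted(classes.values())


# Hypothetical data: samples 0 and 1 resemble each other, as do samples 2 and 3.
DEMO = [[1.0, 5.0, 2.0], [1.2, 5.5, 2.1], [4.0, 1.0, 6.0], [4.2, 0.8, 6.5]]
_, two_classes = qca(DEMO, n_classes=2)
print(two_classes)
```

Stopping the aggregation at a chosen number of classes corresponds to cutting the cluster pedigree chart at that level; running it to one class returns the full merge order.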

2.3 Dimensionality Reduction

Dimensionality reduction means reducing the number of dimensions of a data space as far as possible while leaving the results of the studied problem unchanged. Its benefits are to reduce the amount of data so as to speed up calculation, to reduce the number of independent variables so as to extend the range of application, and to reduce the misclassification ratio of prediction samples so as to enhance processing quality.

MRA and BAYSD can each serve as a promising dimension-reduction tool, because both give the dependence of the predicted value (y) on the independent variables (x1, x2, …, xm) in decreasing order. However, because MRA analyzes linear correlation whereas BAYSD analyzes nonlinear correlation, BAYSD is the preferable tool in applications, and MRA is applicable only when the studied problem is linear. The tool is called "promising" because whether the reduction succeeds must be validated by a high-class nonlinear tool (e.g., SVC), which determines how many independent variables can be removed. For instance, the case study below can be reduced from a 22-D problem to a 21-D problem by using BAYSD and SVC.

3 CASE STUDY: THE DETERMINATION AND PREDICTION OF OIL FAMILY TYPE

The objective of this case study is to determine and predict oil family type (OFT) using geochemical data, which has practical value when experimental data are limited. The data consist of 37 samples from the overseas oilfields presented in [3]. Each sample contains 21 independent variables (x1 = C19/C23, x2 = C22/C21, x3 = C24/C23, x4 = C26/C25, x5 = Tet/C23, x6 = C27T/C27, x7 = C28/H, x8 = C29/H, x9 = X/H, x10 = OL/H, x11 = C31R/H, x12 = GA/31R, x13 = C35S/C34S, x14 = S/H, x15 = %C27, x16 = %C28, x17 = %C29, x18 = δ13Csaturated, x19 = δ13Caromatic, x20 = C26T/Ts, x21 = C28/C29) and one dependent variable [y* = oil family type (OFT)]. Peters et al. (2007) employed decision trees to predict the OFT[3]. In this case study, the first 30 samples are taken as learning samples and the last 7 samples as prediction samples (Table 2), first for the OFT determination by QCA and then for the OFT prediction by SVC, NBAY and BAYSD.

TABLE 2. INPUT DATA FOR OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS (MODIFIED FROM [3])

| No. | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | x21 | y* (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.53 | 0.43 | 0.74 | 1.28 | 4.16 | 0.11 | 0.04 | 0.53 | 0.29 | 0.02 | 0.29 | 0.08 | 0.5 | 0.35 | 13.7 | 26.7 | 59.6 | -28.15 | -25.89 | 0.18 | 0.46 | 4 |
| 2 | 2.68 | 0.51 | 0.73 | 1.13 | 6.82 | 0.03 | 0.02 | 0.69 | 0.11 | 0.04 | 0.27 | 0.12 | 1.06 | 0.31 | 27.2 | 26.9 | 45.9 | -30.18 | -28.28 | 0.16 | 0.59 | 4 |
| 3 | 0.23 | 0.5 | 0.74 | 0.98 | 0.67 | 0.02 | 0.03 | 0.68 | 0.12 | 0.02 | 0.29 | 0.17 | 0.65 | 0.52 | 27.2 | 21.8 | 51 | -30.54 | -29.19 | 0.61 | 0.43 | 4 |
| 4 | 0.12 | 0.38 | 0.87 | 0.61 | 0.45 | 0.46 | 0.07 | 0.54 | 0.05 | 0.01 | 0.36 | 0.09 | 0.96 | 0.8 | 36.5 |  | 30.5 | -32.01 | -31.23 | 0.6 | 1.08 | 3 |
| 5 | 0.27 | 0.44 | 0.75 | 0.79 | 0.55 | 0.08 | 0.02 | 0.45 | 0.1 | 0.01 | 0.31 | 0.09 | 0.81 | 0.67 | 36.4 | 33.1 | 30.5 | -31.5 | -30.4 | 0.43 | 1.09 | 3 |
| 6 | 0.31 | 0.52 | 0.88 | 1.01 | 0.5 | 0.09 | 0.03 | 0.4 | 0.18 | 0.03 | 0.25 | 0.17 | 0.7 | 0.91 | 33.9 | 34.7 | 31.4 | -30.94 | -29.65 | 0.51 | 1.11 | 3 |
| 7 | 0.78 | 0.49 | 0.8 | 1.02 | 0.88 | 0.2 | 0.03 | 0.46 | 0.17 | 0.02 | 0.24 | 0.14 | 0.61 | 0.55 | 27.7 | 32.5 | 39.8 | -29.12 | -27.36 | 0.42 | 0.84 | 3 |
| 8 | 4.76 | 0.38 | 0.59 | 1.5 | 5.3 | 0.21 | 0.12 | 0.59 | 0.32 | 0.02 | 0.26 | 0.18 | 0.49 | 1.03 | 9.2 | 27.6 | 63.2 | -26.95 | -24.92 | 0.29 | 0.44 | 4 |
| 9 | 0.09 | 0.31 | 0.85 | 0.97 | 0.19 | 0.13 | 0.12 | 0.6 | 0.11 | 0.03 | 0.29 | 0.48 | 1.06 | 0.6 | 14.8 |  |  | -33.57 | -33.29 | 1.84 | 0.35 | 2 |
| 10 | 0.04 | 0.3 | 0.81 | 0.96 | 0.18 | 0.2 | 0.18 | 0.61 | 0.06 | 0.03 | 0.34 | 0.33 | 1.22 | 0.82 | 13.2 | 19.5 | 67.3 | -34.78 | -34.75 | 2.43 | 0.29 | 2 |
| 11 | 0.05 | 0.31 | 0.83 | 0.99 | 0.13 | 0.32 | 0.38 | 0.66 | 0.2 | 0.07 | 0.31 | 0.4 | 1.08 | 2.13 | 14.7 | 22.44 | 62.9 | -34.63 | -34.02 | 3.57 | 0.36 | 2 |
| 12 | 2.91 | 0.28 | 0.76 | 0.98 | 1.38 | 0.15 | 0.1 | 0.52 | 0.07 | 0.21 | 0.23 | 0.13 | 0.4 | 0.39 | 16.9 | 34.4 | 48.6 | -28.46 | -27.13 | 0.3 | 0.74 | 4 |
| 13 | 0.05 | 0.46 | 0.75 | 0.77 | 0.22 | 0.06 | 0.04 | 0.55 | 0.08 | 0.01 | 0.34 | 0.14 | 1.09 | 0.51 | 30 | 33.7 | 36.3 | -30.38 | -29.6 | 1.37 | 0.94 | 3 |
| 14 | 0.06 | 0.48 | 0.75 | 0.74 | 0.25 | 0.14 | 0.04 | 0.51 | 0.08 | 0.03 | 0.39 | 0.07 | 1.24 | 0.57 | 28.8 | 38.8 | 32.4 | -29.99 | -29.21 | 1.11 | 1.2 | 3 |
| 15 | 0.03 | 0.3 | 0.92 | 0.76 | 0.14 | 0.04 | 0.03 | 0.4 | 0.15 | 0.01 | 0.34 | 0.1 | 1.18 | 0.63 | 28.7 | 27.6 | 43.7 | -31.41 | -30.22 | 2.13 | 0.63 | 2 |
| 16 | 0.04 | 0.66 | 0.58 | 0.78 | 0.21 | 0.06 | 0.07 | 0.85 | 0.05 | 0.01 | 0.36 | 0.28 | 1.18 | 0.3 | 30.4 | 29.3 | 40.3 | -30.31 | -29.89 | 1.91 | 0.73 | 1 |
| 17 | 0.13 | 0.41 | 0.75 | 0.98 | 0.46 | 0.03 | 0.03 | 0.38 | 0.17 | 0.01 | 0.31 | 0.09 | 0.59 | 0.57 | 30.7 | 28.5 | 40.8 | -31.23 | -30.24 | 0.65 | 0.7 | 3 |
| 18 | 0.36 | 0.39 | 0.89 | 0.84 | 0.64 | 0.18 | 0.11 | 0.43 | 0.1 | 0.08 | 0.3 | 0.09 | 0.87 | 0.58 | 27.2 | 40 | 32.9 | -28.93 | -27.94 | 0.48 | 1.24 | 3 |
| 19 | 0.89 | 0.41 | 0.85 | 1.02 | 0.76 | 0.05 | 0.06 | 0.51 | 0.12 | 0.02 | 0.28 | 0.06 | 0.5 | 0.62 | 27.2 | 33.4 | 39.4 | -28.87 | -27.22 | 0.56 | 0.87 | 3 |
| 20 | 0.05 | 0.66 | 0.6 | 0.83 | 0.43 | 0 | 0.02 | 1.11 | 0.03 | 0.01 | 0.39 | 0.23 | 1.03 | 0.2 | 30 | 23.7 | 46.3 | -29.3 | -28.91 | 1.19 | 0.51 | 1 |
| 21 | 0.02 | 0.68 | 0.56 | 0.79 | 0.2 | 0.03 | 0.13 | 0.75 | 0.02 | 0 | 0.37 | 0.37 | 1.2 | 0.25 | 32.3 | 29.2 | 38.5 | -29.98 | -29.95 | 1.72 | 0.77 | 1 |
| 22 | 0.07 | 0.53 | 0.7 | 1.11 | 0.36 | 0.02 | 0.03 | 0.48 | 0.16 | 0.01 | 0.28 | 0.23 | 0.54 | 0.6 | 33.2 | 20.4 | 46.5 | -29.45 | -27.71 | 0.71 | 0.44 | 4 |
| 23 | 0.14 | 0.81 | 0.32 | 0.85 | 1.53 | 0.04 | 0.23 | 2.51 | 0.11 | 0 | 0.33 | 0.42 | 1.25 | 0.46 | 34 | 17.1 | 49 | -30.09 | -28.04 | 0.28 | 0.35 | 1 |
| 24 | 0.09 | 0.4 | 0.78 | 0.84 | 0.31 | 0.02 | 0.03 | 0.34 | 0.12 | 0.02 | 0.26 | 0.11 | 0.67 | 0.8 | 28.8 | 40.8 | 30.4 | -28.24 | -26.93 | 0.55 | 1.35 | 3 |
| 25 | 0.14 | 0.81 | 0.32 | 0.85 | 1.53 | 0.04 | 0.23 | 2.51 | 0.11 | 0 | 0.33 | 0.42 | 1.25 | 0.46 | 34 | 17.1 | 49 | -30.09 | -28.04 | 0.28 | 0.35 | 1 |
| 26 | 0.03 | 0.87 | 0.54 | 0.78 | 0.22 | 0.01 | 0.02 | 0.87 | 0.01 | 0 | 0.54 | 0.33 | 0.98 | 0.28 | 36.2 | 17.3 | 46.5 | -29.92 | -28.03 | 2.01 | 0.37 | 1 |
| 27 | 0.07 | 0.53 | 0.7 | 1.11 | 0.36 | 0.02 | 0.03 | 0.48 | 0.16 | 0.01 | 0.28 | 0.23 | 0.54 | 0.6 | 33.2 | 20.4 | 46.5 | -29.45 | -27.71 | 0.71 | 0.44 | 4 |
| 28 | 0.17 | 0.43 | 0.84 | 0.67 | 0.66 | 0.28 | 0.08 | 0.49 | 0.1 | 0 | 0.35 | 0.07 | 0.98 | 0.6 | 36.3 | 32.6 | 31.1 | -30.36 | -29.51 | 0.41 | 1.05 | 3 |
| 29 | 0.22 | 0.25 | 0.64 | 1.44 | 0.48 | 0.02 | 0.02 | 0.6 | 0.19 | 0.01 | 0.23 | 0.35 | 0.49 | 0.2 | 21 | 40.3 | 38.7 | -32.18 | -30.43 | 0.57 | 1.04 | 4 |
| 30 | 0.04 | 0.31 | 0.75 | 0.94 | 0.09 | 0.02 | 0.05 | 0.56 | 0.1 | 0.02 | 0.25 | 0.84 | 0.66 | 0.94 | 25.9 | 21.6 | 52.5 | -30.57 | -30.11 | 3 | 0.53 | 2 |
| 31 | 0.02 | 1.18 | 0.44 | 0.74 | 0.17 | 0 | 0.01 | 1.13 | 0.01 | 0.01 | 0.54 | 0.33 | 0.98 | 0.28 | 36.2 | 17.3 | 46.5 | -29.92 | -28.77 | 2.46 | 0.38 | (1) |
| 32 | 0.04 | 0.3 | 0.7 | 1.06 | 0.07 | 0.02 | 0.07 | 0.62 | 0.1 | 0.01 | 0.26 | 0.66 | 0.62 | 1.08 | 20.6 | 19.3 | 60.1 | -30.43 | -29.91 | 3.16 | 0.32 | (2) |
| 33 | 0.4 | 0.41 | 0.88 | 0.86 | 0.54 | 0.16 | 0.21 | 0.52 | 0.78 | 0.03 | 0.29 | 0.21 | 0.71 | 2.32 | 32.8 | 35.1 | 32 | -29.29 | -27.84 | 0.47 | 1.1 | (3) |
| 34 | 0.25 | 0.46 | 0.87 | 0.8 | 0.78 | 0.23 | 0.2 | 0.45 | 0.13 | 0.01 | 0.31 | 0.08 | 0.79 | 0.79 | 33.6 | 32.8 | 33.6 | -29.69 | -28.78 | 0.32 | 0.98 | (3) |
| 35 | 0.43 | 0.48 | 0.79 | 0.91 | 0.94 | 0.1 | 0.13 | 0.44 | 0.21 | 0.01 | 0.31 | 0.1 | 0.69 | 0.78 | 32.2 | 31.3 | 36.5 | -28.63 | -27.48 | 0.24 | 0.87 | (3) |
| 36 | 0.1 | 0.33 | 0.7 | 0.95 | 0.35 | 0.04 | 0.02 | 0.67 | 0.07 | 0 | 0.27 | 0.22 | 0.63 | 0.39 | 30.9 | 20.5 | 48.6 | -31.48 | -29.27 | 1 | 0.42 | (4) |
| 37 | 0.22 | 0.25 | 0.64 | 1.44 | 0.48 | 0.02 | 0.02 | 0.6 | 0.19 | 0.01 | 0.23 | 0.35 | 0.49 | 0.2 | 21 | 40.3 | 38.7 | -32.18 | -30.43 | 0.57 | 1.04 | (4) |

a x1 to x21 are the 21 geochemical factors related to y*: x1 = C19/C23, x2 = C22/C21, x3 = C24/C23, x4 = C26/C25, x5 = Tet/C23, x6 = C27T/C27, x7 = C28/H, x8 = C29/H, x9 = X/H, x10 = OL/H, x11 = C31R/H, x12 = GA/31R, x13 = C35S/C34S, x14 = S/H, x15 = %C27, x16 = %C28, x17 = %C29, x18 = δ13Csaturated, x19 = δ13Caromatic, x20 = C26T/Ts, x21 = C28/C29.

b y* = OFT = oil family type (1~4), determined first by QCA and then predicted by SVC, NBAY and BAYSD; a number in parentheses is not input data, but is used for calculating R(%)i.


3.1 The Calculation and Results of QCA

Using the 37 samples without y* (Table 2), QCA[1, 12] was run with standard difference standardization for data normalization, the correlation coefficient as the cluster statistic, and the weighted mean as the class-distance method, yielding a cluster pedigree chart of the 37 samples (Fig. 1).

Figure 1 illustrates how the 37 classes (i.e., 37 samples) are merged step by step into one class, showing the order of dependence between samples. Four classes (an even number of classes) can be distinguished in Fig. 1, on the basis of which the class-value (1~4) of each sample is filled in the y* column of Table 2.

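3.2 The Calculation and Results of SVC, NBAY, BAYSD and MRA

Using the 30 learning samples with y* (Table 2), SVC[1, 13], NBAY[1, 14, 15], BAYSD[1] and MRA[1, 16, 17] each construct a function of OFT (y) with respect to the 21 independent variables (x1, x2, …, x21), giving four functions in total. Substituting the values of the 21 independent variables of the 30 learning samples and 7 prediction samples (Table 2) into the four functions, the OFT (y) of each sample is obtained (Table 3).

TABLE 3. PREDICTION RESULTS FROM OIL FAMILY TYPE (OFT) OF OVERSEAS OILFIELDS

| Sample type | Sample No. | y* (a) | SVC y | SVC R(%) | NBAY y | NBAY R(%) | BAYSD y | BAYSD R(%) | MRA y (b) | MRA R(%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Learning | 1 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 2 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 3 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 3 | 25 |
| Learning | 4 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 5 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 6 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 7 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 8 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 9 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| Learning | 10 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| Learning | 11 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| Learning | 12 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 13 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2 | 33.3 |
| Learning | 14 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2 | 33.3 |
| Learning | 15 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| Learning | 16 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| Learning | 17 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 18 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 19 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 20 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| Learning | 21 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| Learning | 22 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 23 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| Learning | 24 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 25 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| Learning | 26 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| Learning | 27 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 28 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Learning | 29 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |
| Learning | 30 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| Prediction | 31 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 100 |
| Prediction | 32 | 2 | 2 | 0 | 2 | 0 | 2 | 0 | 2 | 0 |
| Prediction | 33 | 3 | 3 | 0 | 4 | 33.3 | 4 | 33.3 | 2 | 33.3 |
| Prediction | 34 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 2 | 33.3 |
| Prediction | 35 | 3 | 3 | 0 | 3 | 0 | 3 | 0 | 3 | 0 |
| Prediction | 36 | 4 | 4 | 0 | 4 | 0 | 2 | 50 | 4 | 0 |
| Prediction | 37 | 4 | 4 | 0 | 4 | 0 | 4 | 0 | 4 | 0 |

a y* = OFT = oil family type (1~4) determined by cluster analysis.

b The results y of MRA are converted from real numbers to integers by rounding.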
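The three-step procedure of Section 2.1 (learning, learning validation, prediction) and the rounding rule of Table 3, footnote b, can be sketched for MRA as follows. This is a toy illustration under stated assumptions: the least-squares fit is solved through the normal equations with Gaussian elimination, which is one common way to do it and is not claimed to be the implementation of [1]; the function names and the small two-variable data set are hypothetical.

```python
def fit_mra(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bm*xm via the normal equations."""
    rows = [[1.0] + [float(v) for v in r] for r in X]       # prepend intercept column
    p = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    for c in range(p):                                      # forward elimination
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    coef = [0.0] * p
    for c in range(p - 1, -1, -1):                          # back substitution
        s = sum(A[c][k] * coef[k] for k in range(c + 1, p))
        coef[c] = (b[c] - s) / A[c][c]
    return coef


def predict_mra(coef, X):
    """Evaluate the fitted linear formula for each sample."""
    return [coef[0] + sum(c * x for c, x in zip(coef[1:], row)) for row in X]


# Hypothetical toy data in which the class value equals 1 + x1.
X_LEARN = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 2.0]]
Y_LEARN = [1, 2, 2, 3, 3]
X_PRED = [[3.0, 1.0], [0.0, 2.0]]

coef = fit_mra(X_LEARN, Y_LEARN)                            # step 1: learning
y_fit = [round(v) for v in predict_mra(coef, X_LEARN)]      # step 2: learning validation
y_new = [round(v) for v in predict_mra(coef, X_PRED)]       # step 3: prediction
print(y_fit, y_new)
```

Python's built-in `round` rounds exact halves to the nearest even integer; whether the original rounding rule treats halves the same way is not stated, so this detail is an assumption.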

Table 4, read against the criteria in Table 1, shows that a) the nonlinearity degree of the studied problem is weak, since R(%) of MRA is 6.98; and b) the solution accuracies of SVC, NBAY and BAYSD are high and their results are applicable, since their R(%) values are 0, 0.9 and 2.25, respectively.
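As a check, these R(%) values can be reproduced from the nonzero per-sample residuals listed in Table 3 by averaging over all n+k=37 samples (SVC has no nonzero residual, hence R(%)=0):

$$R_{\mathrm{NBAY}}(\%) = \frac{33.3}{37} \approx 0.90, \qquad R_{\mathrm{BAYSD}}(\%) = \frac{33.3 + 50}{37} \approx 2.25, \qquad R_{\mathrm{MRA}}(\%) = \frac{25 + 4 \times 33.3 + 100}{37} \approx 6.98$$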



TABLE 4 (legible entries only)

Column headings: Fitting formula; R(%); Dependence of the predicted value (y) on independent variables, in decreasing order; Time consumed on PC (Intel Core 2); Problem nonlinearity; Solution accuracy; Results availability.

Fitting formula: Nonlinear, explicit (three entries); Linear, explicit (one entry).

Time consumed on PC: <1 s (two entries).

Dependence orders:

1) x2, x21, x20, x8, x3, x14, x16, x9, x15, x6, x13, x10, x4, x19, x18, x12, x5, x17, x11, x7, x1

2) x13, x12, x11, x4, x8, x20, x2, x16, x15, x10, x14, x3, x19, x18, x1, x9, x17, x7, x21, x6, x5

3.3 Dimension-reduction Succeeded from 22-D to 21-D Problem by Using BAYSD and SVC

BAYSD gives the dependence of the predicted value (y) on the 21 independent variables, in decreasing order: x2, x21, x20, x8, x3, x14, x16, x9, x15, x6, x13, x10, x4, x19, x18, x12, x5, x17, x11, x7, x1 (Table 4). Following this order, x2 is deleted first and SVC is rerun; the results of SVC are the same as before, i.e., R(%)=0, so the 22-D problem (x1, x2, …, x21, y) can become a 21-D problem (x1, x3, x4, …, x21, y). In the same way, it is found that this 21-D problem cannot become a 20-D problem by deleting x21, because the results of SVC change to R(%)=0.9, which is greater than the previous R(%)=0 (Table 4). Therefore, the 22-D problem can finally be reduced only to a 21-D problem.

Compared with SVC, the major advantages of BAYSD are (Table 4): a) BAYSD runs much faster than SVC; b) the BAYSD program is easy to code, whereas the SVC program is very complicated to code; and c) BAYSD can serve as a promising dimension-reduction tool. So BAYSD is better than SVC when the nonlinearity of the studied problem is weak.
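The deletion-and-validation loop described above can be sketched as follows. The function `reduce_dimensions` and the stand-in evaluator `reported_svc_r` are hypothetical names introduced for illustration; the evaluator does not run SVC, it only returns the R(%) values stated in the text (0 with all variables or with x2 deleted, 0.9 once x21 is also deleted). In practice it would be replaced by a routine that retrains the validating classifier on the kept variables.

```python
def reduce_dimensions(deletion_order, variables, evaluate):
    """Try deleting variables one at a time, in the given order.

    evaluate(kept) must return R(%) of the validating classifier for the kept
    variables. A deletion is accepted only if R(%) does not increase; the
    search stops at the first rejected deletion.
    """
    kept = list(variables)
    best = evaluate(kept)
    for var in deletion_order:
        trial = [v for v in kept if v != var]
        r = evaluate(trial)
        if r > best:
            break
        kept, best = trial, r
    return kept, best


ALL_VARS = [f"x{i}" for i in range(1, 22)]

# Order given by BAYSD in the text; deletions are tried in this order.
BAYSD_ORDER = ["x2", "x21", "x20", "x8", "x3", "x14", "x16", "x9", "x15", "x6",
               "x13", "x10", "x4", "x19", "x18", "x12", "x5", "x17", "x11", "x7", "x1"]


def reported_svc_r(kept):
    """Stand-in for rerunning SVC: returns only the R(%) values stated in the text."""
    if "x21" not in kept:
        return 0.9      # x2 and x21 deleted
    return 0.0          # all 21 variables, or only x2 deleted


kept, r = reduce_dimensions(BAYSD_ORDER, ALL_VARS, reported_svc_r)
removed = sorted(set(ALL_VARS) - set(kept))
print(f"{len(kept) + 1}-D problem, R(%)={r}, removed: {removed}")
```

With the reported values the loop accepts the deletion of x2 and rejects the deletion of x21, leaving 20 independent variables plus y, i.e., the 21-D problem.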

4 CONCLUSIONS

Through the aforementioned case study, five major conclusions can be drawn:

1) For a classification problem, QCA can be applied to determine the dependent variable (y), while SVC, NBAY and BAYSD can be applied to predict y;

2) The sum of MARR, R(%), of MRA can be used to measure the nonlinearity degree of a studied problem, and thus MRA should be run first;

3) The proposed three criteria [nonlinearity degree of a studied problem based on R(%) of MRA, solution accuracy of a given algorithm application based on its R(%), and results availability of a given algorithm application based on its R(%)] are practical;

4) If a classification problem has weak nonlinearity, in general, SVC, NBAY and BAYSD are applicable, and BAYSD is better than SVC and NBAY;

5) BAYSD and SVC can be applied to dimensionality reduction.

ACKNOWLEDGMENT This work was supported by the Research Institute of Petroleum Exploration and Development (RIPED) and PetroChina.



REFERENCES

[1] Shi G. "Data Mining and Knowledge Discovery for Geoscientists." Elsevier Inc, USA, 2013

[2] Shi G. "Prediction of methane inclusion types using support vector machine." Sci J Earth Sci 5(2): 18–27, 2015

[3] Peters KE, Ramos LS, Zumberge JE, Valin ZC, Scotese CR, Gautier DL. "Circum-Arctic petroleum systems identified using decision-tree chemometrics." AAPG Bulletin 91(6): 877–913, 2007

[4] Shi G, Yang X. "Optimization and data mining for fracture prediction in geosciences." Procedia Comput Sci 1(1): 1353–1360, 2010

[5] Zhu Y, Shi G. "Identification of lithologic characteristics of volcanic rocks by support vector machine." Acta Petrolei Sinica 34(2): 312–322, 2013

[6] Shi G, Zhu Y, Mi S, Ma J, Wan J. "A big data mining in petroleum exploration and development." Adv Petrol Expl Devel 7(2): 1–8, 2014

[7] Zhang Q, Shi G. "Economic evaluation of waterflood using regression and classification algorithms." Adv Petrol Expl Devel 9(1): 1–8, 2015

[8] Shi G. "Optimal prediction in petroleum geology by regression and classification methods." Sci J Inf Eng 5(2): 14–32, 2015

[9] Li D, Shi G. "Data mining in petroleum upstream: the use of regression and classification algorithms." Sci J Earth Sci 5(2): 33–40, 2015

[10] Shi G, Ma J, Ba D. "Optimal selection of classification algorithms for well log interpretation." Sci J Con Eng 5(3): 37–50, 2015

[11] Mi S, Shi G. "The use of regression and classification algorithms for layer productivity prediction in naturally fractured reservoirs." J Petrol Sci Res 4(2): 65–78, 2015

[12] Everitt BS, Landau S, Leese M, Stahl D. "Cluster Analysis, 5th edition." John Wiley & Sons, Chichester, England, UK, 2011

[13] Chang … from, 2011

[14] Tan P, Steinbach M, Kumar V. "Introduction to Data Mining." Pearson Education, Boston, MA, USA, 2005

[15] Han J, Kamber M. "Data Mining: Concepts and Techniques, 2nd Ed." Morgan Kaufmann, San Francisco, CA, USA, 2006

[16] Sharma MSR, O'Regan M, Baxter CDP, Moran K, Vaziri H, Narayanasamy R. "Empirical relationship between strength and geophysical properties for weakly cemented formations." J Petro Sci Eng 72(1-2): 134–142, 2010

[17] Singh J, Shaik B, Singh S, Agrawal VK, Khadikar PV, Deeb O, Supuran CT. "Comparative QSAR study on para-substituted aromatic sulphonamides as CAII inhibitors: information versus topological (distance-based and connectivity) indices." Chem Biol Drug Design 71, 244–259, 2008


Qian Zhang, a senior engineer at PetroChina, was born in Heilongjiang Province, China, in September 1983. She graduated from the Beijing Institute of Technology in 2010 with a Ph.D. in applied mathematics. She joined the Research Institute of Petroleum Exploration and Development (RIPED) of PetroChina in 2010 and works as a petroleum evaluation engineer on national and world energy assessment projects. Her recent accomplishments include petroleum resources evaluation software development and the application of data mining algorithms in petroleum evaluation. She has published 6 articles, of which 3 are SCI-indexed and 3 are on data mining. She obtained the 1st class award of science-technology achievement from RIPED (2015).

Guangren Shi, a professor at PetroChina, was born in Shanghai, China, in February 1940. His research covers two major fields: basin modeling (petroleum system) and data mining for geosciences. He has published 8 books, of which 3 are in English and 3 are on data mining, and 75 articles, of which 4 are SCI-indexed and 15 are on data mining.
