The Use of Cluster Analysis and Classifiers for Determination and Prediction of Oil Family Type Base

Scientific Journal of Information Engineering February 2016, Volume 6, Issue 1, PP.19-28

The Use of Cluster Analysis and Classifiers for Determination and Prediction of Oil Family Type Based on Geochemical Data Qian ZHANG†, Guangren Shi Research Institute of Petroleum Exploration and Development, PetroChina, P. O. Box 910, Beijing 100083, China †

Email: zhangvqian@petrochina.com.cn

Abstract The Q-mode cluster analysis (QCA), the multiple regression analysis (MRA) and three classifiers have been applied to an oil family type (OFT) problem using geochemical data, which has practical value when experiment data is less limited. The three classifiers are the support vector classification (SVC), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD). QCA is used to determine OFT at first, and then SVC, NBAY and BAYSD are used to predict OFT. In the four regression/classification algorithms, only MRA is a linear algorithm whereas SVC, NBAY and BAYSD are nonlinear algorithms. In general, when all these four algorithms are used to solve a real-world problem, they often produce different solution accuracies. Toward this issue, the solution accuracy is expressed with the sum of mean absolute relative residual (MARR) of all samples, R(%). And three criteria have been proposed: 1) nonlinearity degree of a studied problem based on R(%) of MRA (weak if R(%)<10, and strong if R(%)≥10); 2) solution accuracy of a given algorithm application based on its R(%) (high if R(%)<10, and low if R(%)≥10); and 3) results availability of a given algorithm application based on its R(%) (applicable if R(%)<10, and inapplicable if R(%)≥10). A case study (the OFT problem) has been used to validate the proposed approach. The case study is a classification problem. The calculation results indicate that a) when a case study is a weakly nonlinear problem, SVC, NBAY and BAYSD are all applicable, and BAYSD is better than SVC and NBAY; and b) BAYSD and SVC can be applied to dimensionality reduction. Keywords: Cluster Analysis; Multiple Regression Analysis; Support Vector Machine; Naïve Bayesian; Bayesian Successive Discrimination; Problem Nonlinearity; Solution Accuracy; Results Availability; Dimensionality Reduction

1 INTRODUCTION In the recent years, the regression/classification algorithms have seen the preliminary success in petroleum geochemistry[1, 2], but the application of these algorithms and the cluster analysis algorithm to oil family type (OFT) prediction have not started yet. Peters et al (2007) employed the decision trees to predict the OFT using the data of samples in which each sample contains 21 geochemical factors[3]. Using all the samples presented in [3], we employ the Q-mode cluster analysis (QCA) algorithm to determine OFT at first, and then employ four regression/classification algorithms below to predict OFT. This work is successful and has not been seen in literatures before. Besides QCA algorithm for determining OFT, one regression algorithm and three classification algorithms have been applied to predict OFT. The one regression algorithm is the multiple regression analysis (MRA), while the other three classification algorithms are the support vector classification (SVC), the naïve Bayesian (NBAY), and the Bayesian successive discrimination (BAYSD). In these four regression/classification algorithms, only MRA is a linear algorithm whereas the other three are nonlinear algorithms. In general, when all these four algorithms are used to solve a real-world problem, they often produce different solution accuracies. Toward this issue, the solution - 19 http://www.sjie.org

Turn static files into dynamic content formats.

Create a flipbook