Identification of Liver Disease using Classification Technique
Rishi Tiwari1, A. K. Shrivastav2, Akhilesh Kumar Shrivas3
1 Dept of IT, Dr. C.V. Raman University, Bilaspur, Chhattisgarh, India
2 Dept of Physics, Dr. C.V. Raman University, Bilaspur, Chhattisgarh, India
3 Dept of IT, Dr. C.V. Raman University, Bilaspur, Chhattisgarh, India
Abstract: In medical science, there is a large amount of information related to patients and their medical conditions. Discovering relevant patterns in this large volume of data can help identify and diagnose disease conditions. Data mining is one of the important techniques for analyzing and evaluating useful data to identify particular diseases. In this research work, various classification techniques, namely C4.5, CART, Random Forest (RF), BayesNet, Support Vector Machine (SVM), Multilayer Perceptron (MLP) and their ensembles, have been used for analysis of liver patient data and classification of liver patients. The ensemble of C4.5, CART and RF and the ensemble of CART, RF and SVM give better results than the individual models, with accuracies of 71.69% and 71.52% respectively.

Keywords: Liver Disease, Classification, Ensemble Model.

I. INTRODUCTION
Liver disease is one of the critical problems for human beings in medical science. The liver [1] is the largest internal organ and glandular organ in the human body, playing a major role in metabolism and serving several vital functions. The main reasons for the increasing number of liver patients are consumption of alcohol, inhalation of harmful gases, and intake of contaminated food and drugs. Liver disease may not cause any symptoms at an early stage, or the symptoms may be vague, like weakness and loss of energy. Symptoms depend on the type and the extent of liver disease. Classification techniques play a very important role in the diagnosis of disease in medical science. A classifier can classify the data as liver or non-liver. Various authors have worked in the field of classification of liver and non-liver data. H. Jin et al. (2014) [2] used various classification techniques like naïve Bayes, MLP, decision tree and k-NN as classifiers for classification of liver patients.
They compared their proposed naïve Bayes classifier with the others and reported that it gave higher classification accuracy. P. Saxena et al. (2013) [3] used various clustering algorithms for clustering of liver patients. They compared the clustering algorithms and found that k-means is the simplest and fastest of them. A. Gulia et al. (2014) [1] suggested classification algorithms like J48, MLP, SVM, Random Forest and Bayes Net for classification of liver patients. The Random Forest technique gave the highest accuracy, 71.35%, as the best model with a reduced number of features. J. Pahareeya et al. (2014) [9] proposed Genetic Programming (GP) for classification of liver patient data. The proposed model gave higher classification accuracy than Random Forest, Multiple Linear Regression (MLR) and Support Vector Machine (SVM).

International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 02, Issue 01; January - 2016 [ISSN: 2455-1457]. @IJRTER-2016, All Rights Reserved

II. ARCHITECTURE OF MODEL
Figure 1 shows the architecture of the proposed model for classification of liver patients. In this model, the Indian Liver Patient Data Set (ILPD) [4] is applied, with 10-fold cross validation, to various individual models as well as ensemble models. Ensemble models play a very important role in increasing model performance. An ensemble model is developed by combining two or more models. In this work, we propose two ensemble models, the ensemble of C4.5, CART and RF and the ensemble of CART, RF and SVM, which give better accuracy than the individual models for classification of liver patients.
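As an illustration of the 10-fold cross validation used in this architecture, the following is a minimal Python sketch of how sample indices can be partitioned into ten folds, each serving once as the test set. It is not the WEKA procedure used in this work: the function names `k_fold_indices` and `train_test_splits` and the random seed are hypothetical, the split is not stratified by class, and the sample count 583 is taken as the sum of the entries of each confusion matrix in Table 4.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split indices 0..n_samples-1 into k disjoint folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Assumption: 583 samples, the total of the counts reported in Table 4.
splits = list(train_test_splits(583, k=10))
```

Each model would be trained on the `train` indices and evaluated on the `test` indices of every split, and the per-fold results pooled to obtain the reported measures.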
[Figure 1 diagram: the ILPD data set with 10-fold cross validation feeds the classifiers C4.5, CART, RF, Bayes Net and SVM and the ensembles C4.5+CART+RF and CART+RF+SVM; the outputs are accuracy, sensitivity and specificity.]
Figure 1: Architecture of proposed model

III. TECHNIQUE USED
Data mining based classification plays a very important role in the classification of liver patient disease. This section defines the various techniques used for classification of liver patients.

3.1 C4.5
C4.5 [6] is an extension of ID3 that handles unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. In building a decision tree, we can deal with training sets that have records with unknown attribute values by evaluating the gain, or the gain ratio, for an attribute using only the records where that attribute's value is available. We can classify records that have unknown attribute values by estimating the probability of the various possible results. C4.5 produces trees with a variable number of branches per node. When a discrete variable is chosen as the splitting attribute in C4.5, there will be one branch for each value of the attribute.

3.2 Classification and Regression Tree (CART)
CART (Classification and Regression Tree) [6] is one of the popular data mining techniques for building decision trees. It builds a binary decision tree by splitting the records at each node according to a function of a single attribute. CART uses the Gini index for determining the best split. The initial split produces two nodes, each of which we now attempt to split in the same manner as the root node. Once again, we examine all the input fields to find the candidate splitters. If no split can be found that significantly decreases the diversity of a given node, we label it as a leaf node. Eventually, only leaf nodes remain and we have grown the full decision tree. The full tree may generally not be the tree that does the best job of classifying a new set of records, because of overfitting.

3.3 Random Forest
Random Forest (or RF) [5] is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees.
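The mode-of-votes rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `majority_vote` and `ensemble_predict` and the three prediction lists are hypothetical, and the paper does not state which combination rule WEKA applied to the proposed ensembles, so majority voting is shown here as one common choice rather than the method actually used.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label (the mode) among individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_model_predictions):
    """Combine several models' prediction lists sample by sample."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]

# Hypothetical outputs of three trees (or three classifiers) on four samples.
tree_a = ["liver", "liver", "non-liver", "liver"]
tree_b = ["liver", "non-liver", "non-liver", "liver"]
tree_c = ["non-liver", "liver", "non-liver", "liver"]

combined = ensemble_predict([tree_a, tree_b, tree_c])
```

With an odd number of voters and two classes, as in this sketch, a tie cannot occur.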
Random Forests are often used when we have very large training datasets and a very large number of input variables (hundreds or even thousands of input variables). A random forest model is typically made up of tens or hundreds of decision trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler.

3.4 Bayesian Net
Bayesian classifiers [7] are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Studies comparing classification algorithms have found a simple Bayesian classifier, known as the naive Bayesian classifier, to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

3.5 Support Vector Machine (SVM)
Support vector machines (SVMs) [8] are supervised learning methods that generate input-output mapping functions from a set of labelled training data. The mapping function can be either a classification function (used to categorize the input data) or a regression function (used to estimate the desired output). For classification, nonlinear kernel functions are often used to transform the input data to a high-dimensional feature space in which the input data becomes more separable (i.e., linearly separable) compared to the original input space. SVMs belong to a family of generalized linear models which achieve a classification or regression decision based on the value of a linear combination of features. They are also said to belong to "kernel methods". In addition to their solid mathematical foundation in statistical learning theory, SVMs have demonstrated highly competitive performance in numerous real-world applications, such as medical diagnosis, bioinformatics, face recognition, image processing and text mining, which has established SVMs as one of the most popular, state-of-the-art tools for knowledge discovery and data mining.
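The two ideas in the SVM description, a decision based on a linear combination of features and a nonlinear kernel, can be illustrated with the short Python sketch below. The weights, bias and feature vectors are hypothetical and are not learned from the ILPD data, and the Gaussian (RBF) kernel is shown only as one widely used nonlinear kernel; the paper does not state which kernel was used in the experiments.

```python
import math

def linear_decision(weights, bias, features):
    """Classify by the sign of the linear combination w . x + b."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else -1

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical separating hyperplane in a two-feature space.
weights, bias = [2.0, -1.0], -0.5
label_a = linear_decision(weights, bias, [1.0, 0.5])  # score = 1.0
label_b = linear_decision(weights, bias, [0.0, 1.0])  # score = -1.5

# Kernel similarity: 1 for identical points, decaying with distance.
same_point = rbf_kernel([1.0, 2.0], [1.0, 2.0])
far_point = rbf_kernel([0.0, 0.0], [1.0, 1.0])
```

A trained SVM replaces the plain dot product with kernel evaluations against its support vectors, which is what allows a linear decision rule to separate data that is not linearly separable in the original input space.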
3.6 Multilayer Perceptron
MLP [6] is a development from the simple perceptron in which extra hidden layers (layers additional to the input and output layers, not connected externally) are added. More than one hidden layer can be used. The network topology is constrained to be feed forward, i.e., loop-free. Generally, connections are allowed from the input layer to the first (and possibly only) hidden layer, from the first hidden layer to the second and so on, until the last hidden layer connects to the output layer. The presence of these layers allows an ANN to approximate a variety of non-linear functions. The actual construction of the network, as well as the determination of the number of hidden layers and of the overall number of units, is sometimes a trial-and-error process, determined by the nature of the problem at hand. The transfer function is generally a sigmoid function.

IV. PERFORMANCE MEASURES
Performance can be evaluated using some well known statistical measures like accuracy, sensitivity and specificity. These measures are calculated from the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) counts. The confusion matrix [7] for two classes, which is defined by TP, TN, FP and FN, is given in Table 1. Table 2 shows the equations of the performance measures, where N represents the total number of samples.

Table 1: Confusion matrix for positive and negative samples
Actual Vs. Predicted    Positive               Negative
Positive                True Positive (TP)     False Negative (FN)
Negative                False Positive (FP)    True Negative (TN)
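The four cells of Table 1 can be counted directly from lists of actual and predicted labels. The following Python sketch shows one way to do so; the function name `confusion_counts`, the label strings and the five example samples are hypothetical and serve only to illustrate the definitions, with "liver" assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive="liver"):
    """Count TP, FN, FP and TN for a two-class problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            # Actual positive: correct prediction is TP, otherwise FN.
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Actual negative: predicting positive is FP, otherwise TN.
            counts["FP" if p == positive else "TN"] += 1
    return counts

# Hypothetical labels for five samples.
actual = ["liver", "liver", "liver", "non-liver", "non-liver"]
predicted = ["liver", "non-liver", "liver", "liver", "non-liver"]
counts = confusion_counts(actual, predicted)
```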
Table 2: Performance measures
Measure        Equation
Accuracy       (TP+TN)/N
Sensitivity    TP/(TP+FN)
Specificity    TN/(TN+FP)
V. RESULT AND DISCUSSIONS
This research work used the WEKA data mining software [10] in a Windows 7 environment. The main motive of this research work is to develop a robust model for classification of liver patients. We used 10-fold cross validation for partitioning the data. We used various classification techniques, namely C4.5, CART, Random Forest (RF), BayesNet, Support Vector Machine (SVM) and Multilayer Perceptron, for analysis of liver patient data. An ensemble model is one of the important techniques for improving the performance of a model. In this work, the individual models did not give sufficiently high accuracy for classification of liver patients. We therefore developed two ensemble models, C4.5+CART+RF and CART+RF+SVM, which give 71.69% and 71.52% accuracy respectively as the best models. Table 3 shows the accuracy of the individual and ensemble models for classification of liver patients. Figure 2 gives a graphical presentation of the accuracy of the individual and ensemble models. Table 4 shows the confusion matrices of the best models. Most of the misclassified samples are non-liver patients, as shown in Table 4. The performance measures accuracy, sensitivity and specificity are calculated from the confusion matrix using the formulas in Table 2 and are shown in Table 5.

Table 3: Accuracy of models with 10-fold cross validation
Model            Accuracy (%)
C4.5             68.78
CART             71.01
Random Forest    71.18
SVM              71.35
Bayesian Net     67.23
MLP              70.84
C4.5+CART+RF     71.69
CART+RF+SVM      71.52
[Bar chart: accuracy in percentage (y-axis, 0 to 100) for each model (x-axis); the plotted values 68.78, 71.01, 71.18, 71.35, 67.23, 70.84, 71.69 and 71.52 are those listed in Table 3.]
Figure 2: Accuracy of different models with 10-fold cross validation
Table 4: Confusion matrix of best model
C4.5+CART+RF
Actual Vs. Predicted    Liver    Non-Liver
Liver                   376      40
Non-Liver               125      42

CART+RF+SVM
Actual Vs. Predicted    Liver    Non-Liver
Liver                   415      1
Non-Liver               165      2
Table 5: Performance measures of best models
Performance measure    C4.5+CART+RF    CART+RF+SVM
Accuracy (%)           71.69           71.52
Sensitivity (%)        90.38           99.75
Specificity (%)        25.14           1.19
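The values in Table 5 can be reproduced from the counts in Table 4 with the formulas of Table 2. The Python sketch below does this with "liver" taken as the positive class. The function names are hypothetical, and the truncation to two decimals is an assumption inferred from the numbers: rounding instead of truncating would give, for example, 71.70 rather than 71.69 for the accuracy of C4.5+CART+RF.

```python
def truncated_percent(numerator, denominator):
    """Percentage truncated (not rounded) to two decimals, via integer arithmetic."""
    return (10000 * numerator // denominator) / 100

def performance_measures(tp, fn, fp, tn):
    """Accuracy, sensitivity and specificity as defined in Table 2."""
    n = tp + fn + fp + tn  # N, the total number of samples
    return {
        "accuracy": truncated_percent(tp + tn, n),
        "sensitivity": truncated_percent(tp, tp + fn),
        "specificity": truncated_percent(tn, tn + fp),
    }

# Counts taken from Table 4 (rows: actual class, columns: predicted class).
c45_cart_rf = performance_measures(tp=376, fn=40, fp=125, tn=42)
cart_rf_svm = performance_measures(tp=415, fn=1, fp=165, tn=2)
```

The low specificity of both ensembles reflects the observation made above that most misclassified samples are non-liver patients.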
[Bar chart: accuracy, sensitivity and specificity in percentage (y-axis, 0 to 100) for the two best models, C4.5+CART+RF and CART+RF+SVM, grouped by performance measure (x-axis).]
Figure 3: Various performance measures of best models
VI. CONCLUSION AND FUTURE WORK
In every organization, data mining is one of the important techniques for extracting useful information. Medical science is an important area where data mining techniques can be applied for classification of data. In this research work, we have used various data mining based classification techniques for classification of liver patient data. The main focus of this research work is to develop a robust ensemble model which gives high classification accuracy. We recommend the ensemble of C4.5, CART and RF and the ensemble of CART, RF and SVM for classification of liver patients. In future, we can apply optimization techniques to increase the performance of the model.

REFERENCES
1. A. Gulia, R. Vohra and P. Rani, "Liver Patient Classification Using Intelligent Techniques", International Journal of Computer Science and Information Technologies, Vol. 5, pp. 5110-5115, 2014.
2. H. Jin, S. cheon Kim and J. Kim, "Decision Factors on Effective Liver Patient Data Prediction", International Journal of Bio-Science and Bio-Technology, Vol. 6, No. 4, pp. 167-17, 2014.
3. P. Saxena and S. Lahre, "Analysis of various clustering algorithms of data mining on health informatics", International Journal of Computer & Communication Technology, Vol. 4, Issue 2, pp. 108-112, 2013.
4. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml/datasets.html].
5. R. Parimala and R. Nallaswamy, "A Study of Spam e-mail Classification using Feature Selection Package", Global Journal of Computer Science and Technology, Vol. 11, ISSN: 0975-4172, 2011.
6. A. K. Pujari, "Data Mining Techniques", Universities Press (India) Private Limited, 4th ed., ISBN: 81-7371-380-4, 2001.
7. J. Han and M. Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, San Francisco, 2nd ed., ISBN-13: 978-1-55860-901-3, 2006.
8. D. L. Olson and D. Delen, "Advanced Data Mining Techniques", USA, Springer Publishing, ISBN: 978-3-54076916-3, 2008.
9. J. Pahareeya, R. Vohra, J. Makhijani and S. Patsariya, "Liver Patient Classification using Intelligence Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 2, pp. 295-299, 2014.
10. Web source: http://www.cs.waikato.ac.nz/~ml/weka/ last accessed on Oct. 2016.