e-ISSN: 2582-5208 International Research Journal of Modernization in Engineering Technology and Science Volume:03/Issue:03/March-2021
Impact Factor- 5.354
www.irjmets.com
COMPREHENSIVE ANALYSIS OF HEART DISEASE PREDICTION USING SCIKIT-LEARN
Mir Nawaz Ahmad*1
*1Department of Computer Science and Engineering, SSM College of Engineering, Parihaspora 193121, Jammu and Kashmir, India
ABSTRACT
Heart disease, a cardiovascular disorder, is the leading cause of death for both men and women and a primary source of morbidity and mortality today. Researchers therefore continue to support healthcare experts in assessing this complicated condition with data mining methods. Although the healthcare sector is rich in data, this data is not mined properly to detect hidden patterns and draw conclusions from them. The main objective of this study is to identify such hidden patterns by applying multiple scikit-learn methods that can help establish the presence of cardiovascular disease in individuals. Numerous classification methods have been used to detect such patterns in the medical field. A dataset comprising 14 features was examined for the prediction system. The dataset, from the UCI repository, contains widely used medical attributes, including blood pressure, cholesterol level and chest pain, along with 11 other features used to predict heart disease. However, the dataset contains features and anomalies that do not give good results, so data preprocessing and feature engineering are used to handle them. The commonly used and effective classification methods employed in this paper are Decision Tree, k-Nearest Neighbor, Extra Trees Classifier, Random Forest, Support Vector Machine, Naïve Bayes, Logistic Regression, AdaBoost Classifier, Voting Classifier and Ridge CV. I evaluate these scikit-learn models using Accuracy, Precision, Recall and F1-score. According to the experimental results, the Extra Trees Classifier reaches an accuracy of 93.44%, while all other models score lower. Based on this experimental investigation, the Extra Trees Classifier, with the best accuracy, is considered the most useful method for heart disease prediction.
Keywords: Heart Disease, Scikit-learn, Ensemble learning, Machine Learning, Extra Tree Classifier, Feature Engineering.
I. INTRODUCTION
The heart is a vital organ that pumps blood, with its life-giving nourishment and oxygen, to the cells of the human body. If the pumping action of the heart becomes ineffective, crucial organs such as the kidneys and brain suffer, and death occurs within a minute if the heart stops functioning. Cardiovascular disease is considered among the most complex and life-threatening diseases in the world. Life itself depends entirely on the efficient functioning of the heart. Symptoms of heart disease include shortness of breath, physical fatigue, swollen feet and tiredness, as discussed in [1]. Diagnosis and treatment of heart disease are very complex, particularly in developing countries, because of limited access to diagnostic devices and other resources, which affects accurate prediction and treatment. This makes cardiovascular disease a significant concern to be managed. However, it is hard to identify heart disease because of the many contributing risk factors, such as diabetes, high blood pressure, high cholesterol, obesity, abnormal pulse rate and several others. The conventional invasive methods of diagnosing heart disease rely on analysis of the patient's medical history, the physical examination report and the symptoms as assessed by medical professionals. The diagnosis is frequently delayed as a result of human error. Because of these limitations, researchers have turned to modern approaches such as Data Mining and Machine Learning to predict the disease. Machine learning plays a significant part in building an intelligent model for health systems that detects heart disease [2] from available patient datasets containing the risk factors associated with the disease, and it can assist medical professionals in the diagnosis.
Researchers have proposed several software packages, tools and algorithms for building an effective medical decision support system. Machine learning allows systems to learn and act accordingly. It enables a system to learn how even a very complex model predicts from the data, and it can estimate complicated statistics on big data. Machine learning based heart disease prediction systems are therefore likely to be accurate and to reduce the probability of errors. The value of machine learning technology is well recognized in the healthcare industry, which has a large pool of data. It can help health professionals predict the disease accurately and contribute to improving treatment. Various scikit-learn models, such as decision tree, k-nearest neighbor, logistic regression, random forest, Extra Trees classifier and support vector machine, are employed to predict whether or not an individual has heart disease. However, medical data tend to come from smaller sets of observations than those usually preferred for adequate training and testing of models built with machine learning algorithms. Without sufficiently large datasets, it is difficult to decide whether a model generalizes to previously unseen data. Using artificial intelligence to overcome the limitations inherent in small clinical research datasets can be an adequate remedy that protects patient privacy and enables the use of machine learning algorithms. Larger datasets allow sufficiently sized training and testing partitions, so that the machine learning algorithm can learn from experience through exposure to a large set of observations and then be tested on a second large set of observations that it has never seen.
II. LITERATURE REVIEW
S. P. Rajamhona [3] developed a neural network-based heart disease prediction system with an accuracy of 80.46%. Salma Banu N. K. [4] analysed the prediction of heart disease at an early stage using data mining and big data analytics, applying Naive Bayes, Decision Tree and SVM, with an accuracy of 84.1%. Purusothama [5] developed models based on various classification techniques to design a risk prediction model for heart disease and obtained an accuracy of 83.66%. The hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms [6] reported an accuracy of 77.00%. Alizadehsani [7] built various classification models for coronary artery disease (CAD) based on SMO, Bagging, Artificial Neural Networks and Naive Bayes, with a claimed accuracy of 89.00%.
III. PROPOSED SYSTEM DESIGN
This paper proposes a system with robust prediction algorithms that takes the augmented UCI repository dataset as input and predicts heart problems, together with a comprehensive report generation module. The aim is to implement a self-learning protocol in which past inputs and disease outcomes determine the future likelihood of heart disease for a particular user. The proposed model uses strong preprocessing and feature engineering tools so that errors in classification and prediction are avoided when a system is developed from the dataset. A large number of training examples is used to make the prediction increasingly accurate. In this paper, I have used ten scikit-learn algorithms to train the model:
1. Logistic Regression
2. K-Nearest Neighbors (KNN)
3. Naive Bayes
4. Decision Tree Classifier
5. Extra Trees Classifier
6. Random Forest Classifier
7. Support Vector Machines
8. AdaBoost Classifier
9. Voting Classifier
10. Ridge CV
The working of these algorithms is explained in the sections ahead. I have also calculated the confusion matrix for every model. I then aggregated all the necessary measures to find the best of the ten scikit-learn models.
IV. PROCESS FLOW DIAGRAM
[Figure: process flow diagram (not reproduced)]
V. DATASET DESCRIPTION
The dataset used to implement the proposed work is taken from the UCI repository [8]. The nature of the dataset is defined by its dimensionality: it consists of 13 columns, which represent the features of the dataset. A complete overview is shown in the figure given below. Like other datasets, it may contain redundancy, noise and missing values, and proper steps need to be taken to take care of them. Before delving into the main task, these mandatory steps are applied to the UCI repository dataset, as discussed ahead. A general overview of the raw dataset is shown in the table below:
Table 1: UCI repository dataset
VI. METHODOLOGY
Data Preprocessing and Feature selection
The dataset used for implementation is taken from the UCI repository. It contains attributes whose names are unclear, as well as unrelated attributes that do not help the performance of the proposed work. Every dataset contains anomalies such as missing values, redundancy or other problems, and these must be handled by specific steps before delving into the main process. Data preprocessing and feature selection remove or fix the missing entries of the dataset. Rows containing incomplete records and redundant entries are removed from the raw dataset. Sampling is also applied to the dataset to enhance the performance of the algorithm. Because the data are not on a uniform scale, some values and features dominate others, which, if not rectified, harms performance and accuracy. Scaling and standardization are therefore used to convert the data into new vectors: different attributes use different units, so each attribute is normalized to a single uniform scale. Further, new features and combinations of feature pairs are formed from the original dataset so that the best features can be chosen for the models to achieve high performance on the various measures. This results in a large dataset with many features that are not present in the raw dataset. I have used several feature selection techniques to select the best features from the augmented dataset to gain more accuracy:
1. Pearson correlation
2. SelectFromModel with LinearSVC
3. SelectFromModel with Lasso
4. SelectKBest with Chi-2
5. Recursive Feature Elimination (RFE) with Logistic Regression
6. Recursive Feature Elimination (RFE) with Random Forest
7. VarianceThreshold
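The scaling and feature selection steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are a synthetic stand-in generated with `make_classification` (not the UCI dataset), the choice of `k=8` features is hypothetical, and only three of the seven listed selectors are shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the heart-disease table (NOT the UCI data):
# 300 rows, 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Standardization: zero mean and unit variance for every column.
X_std = StandardScaler().fit_transform(X)

# VarianceThreshold drops columns that are constant (variance of zero).
X_var = VarianceThreshold(threshold=0.0).fit_transform(X_std)

# chi2 needs non-negative inputs, so rescale to [0, 1] first.
X_pos = MinMaxScaler().fit_transform(X)
kbest = SelectKBest(score_func=chi2, k=8).fit(X_pos, y)

# Recursive Feature Elimination wrapped around Logistic Regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=8).fit(X_std, y)

# Keep only the features that both selectors agree on.
agreed = np.flatnonzero(kbest.get_support() & rfe.get_support())
print("features kept by both selectors:", agreed)
```

Intersecting the selectors is one possible way to combine them; the paper does not state how the outputs of the seven techniques were merged.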
Scikit-learn Model Selection
01. Support Vector Machines
A Support Vector Machine (SVM) is a supervised machine learning algorithm used for both classification and regression problems, although today it is mostly used for classification. In the SVM algorithm, each data item is plotted as a point in n-dimensional space, where n is the number of features and the value of each feature is the value of a particular coordinate. Classification is then performed by finding the hyper-plane that best separates the two classes. Support vectors are the individual observations that lie closest to the hyper-plane. The SVM classifier is the frontier that best segregates the two classes, as shown in Figure 1 below:
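As a minimal sketch of the idea, the snippet below fits a linear SVM and inspects its support vectors. The data are synthetic (an assumption, not the UCI dataset), and the linear kernel, `C=1.0` and the 80/20 split are illustrative choices that the paper does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scale first: the separating hyper-plane depends on feature magnitudes.
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)

svc = svm.named_steps["svc"]
print("support vectors per class:", svc.n_support_)
# The sign of this score tells on which side of the hyper-plane a point lies.
print("first scores:", svm.decision_function(X_test[:3]))
test_accuracy = svm.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```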
Figure 1: Support Vector Machine (SVM)
The confusion matrices obtained from the SVM model for both the training and the test data are shown below; all the necessary measures can be derived from them:
02. Decision Tree Classifier
A Decision Tree is a supervised machine learning technique that can be used for both classification and regression problems, although it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent the features of the dataset, branches represent the decision rules, and each leaf node represents an outcome. It is a graphical representation of all the possible solutions to a problem or decision under the given conditions.
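To make the node, branch and leaf structure concrete, the following sketch fits a shallow tree and prints its rules as text. The data are synthetic, the feature names are hypothetical placeholders, and `max_depth=2` is chosen only to keep the printout short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with five features.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
names = [f"feature_{i}" for i in range(5)]  # hypothetical column names

tree = DecisionTreeClassifier(max_depth=2, criterion="gini", random_state=0)
tree.fit(X, y)

# Internal nodes test a feature, branches are the rules, leaves give the class.
rules = export_text(tree, feature_names=names)
print(rules)
```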
Figure 2: Decision Tree classifier
The confusion matrix obtained from the Decision Tree classifier is shown below; all the necessary measures can be derived from it:
03. Random Forest Classifier
Random Forest is an ensemble learning technique that combines multiple classifiers to solve a complex problem and improve the model's performance. It contains many decision trees built on various subsets of the given dataset and combines their outputs to improve predictive accuracy. Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote. A larger number of trees in the forest generally leads to higher accuracy and helps to prevent over-fitting.
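The combination step can be checked by hand. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, which is a slightly softer form of the majority rule described above. The sketch below uses synthetic data and a hypothetical forest of 25 trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Collect each tree's class probabilities and average them by hand.
per_tree = np.stack([t.predict_proba(X) for t in forest.estimators_])
manual_proba = per_tree.mean(axis=0)

# Compare the hand-computed average with the forest's own output.
matches = np.allclose(manual_proba, forest.predict_proba(X))
print("manual average equals forest output:", matches)
```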
Figure 3: Random Forest Classifier
The confusion matrix obtained from the Random Forest classifier is shown below; all the necessary measures can be derived from it:
04. K-Nearest Neighbor (KNN)
K-Nearest Neighbor (KNN) is one of the simplest machine learning algorithms based on the supervised learning technique. It assumes similarity between a new case and the available cases and puts the new case into the category it is most similar to. It stores all the available data and classifies a new data point based on similarity, so when new data appears it can easily be assigned to a well-suited category. KNN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
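The store-and-compare behaviour fits in a few lines of plain Python. This is a sketch for illustration, not the scikit-learn implementation used in the paper; the six training records and the two feature names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # KNN has no training phase: it keeps the data and measures distances.
    distances = [
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled records: (age, cholesterol) -> 1 = disease.
train_X = [(0.2, 0.1), (0.3, 0.2), (0.25, 0.3), (0.8, 0.9), (0.7, 0.8), (0.9, 0.7)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, query=(0.75, 0.85)))  # near the label-1 group
print(knn_predict(train_X, train_y, query=(0.22, 0.15)))  # near the label-0 group
```

Because the vote depends on raw distances, features on larger scales would dominate, which is why scaling matters for this model.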
Figure 4: K-Nearest Neighbor
The confusion matrix obtained from the K-Nearest Neighbors classifier is shown below; all the necessary measures can be derived from it:
05. Extra Trees Classifier
The Extra Trees Classifier is another ensemble learning technique; it combines the results of multiple de-correlated decision trees collected in a forest to output its classification result. It is very similar to a Random Forest and differs only in how the decision trees in the forest are constructed. Each tree in the Extra Trees forest is built from the original training sample. At each test node, each tree is given a random sample of k features from the feature set, from which it must select the best feature to split the data according to some criterion (e.g. the Gini index). This random sampling of features leads to the creation of multiple de-correlated decision trees.
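The difference in tree construction shows up in scikit-learn's defaults, as the sketch below illustrates on synthetic data. The hyperparameters (`n_estimators=100`, `max_features="sqrt"`) are illustrative assumptions, not the settings used in the paper, and the printed scores say nothing about the UCI results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

extra = ExtraTreesClassifier(n_estimators=100, max_features="sqrt",
                             criterion="gini", random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                criterion="gini", random_state=0)

# By default Extra Trees fits every tree on the full training set,
# while Random Forest draws a bootstrap sample per tree.
print("bootstrap:", extra.bootstrap, forest.bootstrap)

results = {}
for name, model in [("extra trees", extra), ("random forest", forest)]:
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(results[name], 3))
```

In scikit-learn, Extra Trees additionally draws the candidate split thresholds at random instead of searching for the optimal cut point, which is a further source of de-correlation between trees.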
Figure 5: Extra Tree Classifier
The confusion matrix obtained from the Extra Trees Classifier is shown below; all the necessary measures can be derived from it:
06. AdaBoost Classifier
AdaBoost, short for Adaptive Boosting, is an ensemble boosting classifier. It is an iterative ensemble method that builds a strong classifier by combining multiple poorly performing (weak) classifiers, so that the combined classifier achieves high accuracy. The basic idea behind AdaBoost is to set the weights of the classifiers and to re-weight the training samples in each iteration so that observations that are unusual or hard to classify are predicted accurately.
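The iterative nature of boosting can be observed by scoring the ensemble after every round. The sketch below assumes synthetic data and 50 boosting rounds (a hypothetical setting); it relies on scikit-learn's default weak learner, a decision tree of depth 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Default weak learner: a depth-1 decision tree ("stump").
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Test accuracy after each boosting round; each new weak learner is
# trained with more weight on the samples misclassified so far.
staged = list(boost.staged_score(X_test, y_test))
print("after 1 round   :", round(staged[0], 3))
print("after all rounds:", round(staged[-1], 3))
```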
Figure 6: AdaBoost Classifier
The confusion matrix obtained from the AdaBoost classifier is shown below; all the necessary measures can be derived from it:
07. Logistic Regression
Logistic regression is a popular machine learning algorithm that falls under supervised learning. Although its name suggests regression, it is mainly used for classification problems. It predicts a categorical dependent variable from a set of independent variables. Its output is a probability, so it always lies between 0 and 1, which makes it suitable wherever the probability of belonging to one of two classes is required. It is based on maximum likelihood estimation: the parameters are chosen so that the observed data are most probable. In logistic regression, the weighted sum of the inputs is passed through an activation function that maps values to the range between 0 and 1. This activation function is known as the sigmoid function, and the curve it produces is called a sigmoid curve or S-curve.
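The weighted sum followed by the sigmoid can be written out directly. In this sketch the coefficients are hypothetical values chosen for illustration; they are not fitted to any data, and the 0.5 threshold is the conventional default rather than something stated in the paper.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Weighted sum of the inputs passed through the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two scaled features; not fitted to any data.
weights, bias = [1.5, -0.8], 0.1
p = predict_proba([0.9, 0.2], weights, bias)
label = int(p >= 0.5)  # threshold the probability to obtain a class
print(round(p, 3), label)
```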
Figure 7: Logistic Regression
The confusion matrix obtained from the Logistic Regression model is shown below; all the necessary measures can be derived from it:
08. Naive Bayes
Naive Bayes is a simple machine learning algorithm based on Bayes' Theorem with an assumption of independence among predictors. That is, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The model is easy to build and particularly useful for very large datasets. Along with its simplicity, it is known to outperform even highly sophisticated classification methods. It also provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c): P(c|x) = P(x|c) P(c) / P(x).
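A small worked example may help to show how the posterior is assembled under the independence assumption. All probabilities below are hypothetical numbers chosen only to illustrate the arithmetic; they are not estimated from the UCI dataset.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic.
prior = {"disease": 0.45, "healthy": 0.55}             # P(c)
likelihood = {                                          # P(x_i | c)
    "disease": {"chest_pain": 0.70, "high_bp": 0.60},
    "healthy": {"chest_pain": 0.20, "high_bp": 0.30},
}

def posterior(observed):
    """Return P(c | x) for each class, assuming independent features."""
    # Numerator of Bayes' theorem: P(c) times the product of P(x_i | c).
    joint = {}
    for c in prior:
        p = prior[c]
        for feature in observed:
            p *= likelihood[c][feature]
        joint[c] = p
    evidence = sum(joint.values())                      # P(x)
    return {c: joint[c] / evidence for c in joint}

post = posterior(["chest_pain", "high_bp"])
print(post)
```

The "naive" part is the inner loop: the per-feature likelihoods are simply multiplied, as if chest pain and high blood pressure were unrelated within each class.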
Figure 8: Naïve Bayes
The confusion matrix obtained from the Naive Bayes classifier is shown below; all the necessary measures can be derived from it:
09. Voting Classifier
A Voting Classifier is a scikit-learn model that trains an ensemble of several models and predicts the output class that those models favour most. It simply aggregates the findings of each classifier passed into the Voting Classifier and predicts the output class with the majority of votes. Instead of creating separate models and finding the accuracy of each one, we build a single model that trains all of them and predicts the output based on their combined majority vote for each class.
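A minimal sketch of majority voting with scikit-learn's `VotingClassifier` follows. The three base models are an arbitrary illustrative choice, since the paper does not say which classifiers were passed to the ensemble, and the data are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

voter = VotingClassifier(
    estimators=[  # hypothetical choice of base models
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",  # majority of predicted labels; "soft" averages probabilities
)
voter.fit(X_train, y_train)
predictions = voter.predict(X_test)
print("test accuracy:", round(voter.score(X_test, y_test), 3))
```

An odd number of base models avoids ties in a two-class hard vote.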
Figure 9: Voting Classifier
The confusion matrix obtained from the Voting Classifier is shown below; all the necessary measures can be derived from it:
10. Ridge Classifier
The Ridge Classifier, based on the Ridge regression method, converts the target labels into the values -1 and 1 and then treats the task as a regression problem. The class with the highest predicted value is taken as the target class, and for multiclass data multi-output regression is applied [9].
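For the two-class case, the regression-then-sign rule can be verified directly. The sketch assumes that the model the paper calls "Ridge CV" corresponds to scikit-learn's `RidgeClassifierCV`; the candidate `alphas` and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifierCV

# Synthetic stand-in data: 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Cross-validation picks the regularization strength from these candidates.
ridge = RidgeClassifierCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)

# The two classes are regressed as -1 and +1; the sign of the
# regression output then gives the predicted class.
scores = ridge.decision_function(X)
by_sign = ridge.classes_[(scores > 0).astype(int)]
same = np.array_equal(by_sign, ridge.predict(X))
print("chosen alpha:", ridge.alpha_)
print("sign rule matches predict():", same)
```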
Figure 10: Ridge Classifier
The confusion matrix obtained from the Ridge Classifier is shown below; all the necessary measures can be derived from it:
VII. PERFORMANCE EVALUATION
This section covers the overall analysis of the experiments conducted in this research. The first step is selecting the best model from the table based on each model's accuracy metric. I have also calculated other statistical measures, namely Precision, Recall and F1-score, for all the models used in this paper. Based on these measures, we can select the best scikit-learn model for the problem at hand, i.e. heart disease prediction.
ACCURACY
Accuracy is a metric for evaluating classification models. Informally, accuracy is the fraction of predictions the model got right. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, accuracy is calculated as:
Accuracy = (TN + TP) / (TN + TP + FN + FP)
Table 2: Accuracy of all models
Classifier                    Accuracy (%)
Logistic Regression           88.52
K-Nearest Neighbors (KNN)     86.89
Naive Bayes                   86.88
AdaBoost Classifier           90.16
Decision Tree Classifier      78.69
Extra Trees Classifier        93.44
Random Forest Classifier      90.16
Support Vector Machines       90.16
Voting Classifier             90.16
Ridge CV                      88.52
PRECISION
Precision is the fraction of predicted positives that are actually positive. It is also called the positive predictive value. Precision is calculated as:
Precision = TP / (TP + FP)
Table 3: Precision of all models
Classifier                    Precision (%)
Logistic Regression           85.18
K-Nearest Neighbors (KNN)     84.61
Naive Bayes                   80.00
AdaBoost Classifier           85.71
Decision Tree Classifier      72.41
Extra Trees Classifier        92.30
Random Forest Classifier      83.33
Support Vector Machines       88.46
Voting Classifier             88.46
Ridge CV                      91.30
RECALL
Recall, or sensitivity, measures the proportion of actual positives that are correctly identified. It is also called the true positive rate. Recall is calculated as:
Recall = TP / (TP + FN)
Table 4: Recall of all models
Classifier                    Recall (%)
Logistic Regression           88.46
K-Nearest Neighbors (KNN)     84.61
Naive Bayes                   92.30
AdaBoost Classifier           92.30
Decision Tree Classifier      80.76
Extra Trees Classifier        92.30
Random Forest Classifier      96.15
Support Vector Machines       88.46
Voting Classifier             88.46
Ridge CV                      80.76
F1 MEASURE/SCORE
The F1 score is the harmonic mean of precision and recall. It is calculated as:
F1 Measure = 2TP / (2TP + FP + FN)
or, equivalently,
F1 Measure = 2 * (Precision * Recall) / (Precision + Recall)
Table 5: F1 Score of all models
Classifier                    F1 Measure (%)
Logistic Regression           86.79
K-Nearest Neighbors (KNN)     84.61
Naive Bayes                   85.71
AdaBoost Classifier           88.88
Decision Tree Classifier      76.36
Extra Trees Classifier        92.30
Random Forest Classifier      89.28
Support Vector Machines       88.46
Voting Classifier             88.46
Ridge CV                      85.71
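The four measures defined in this section follow from the four cells of a confusion matrix, as the sketch below shows. The counts are hypothetical: the paper reports no raw counts, and these were picked for a 61-sample test set (itself an assumption) because they agree with the reported Extra Trees figures up to rounding or truncation.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1

# Hypothetical counts (not taken from the paper's confusion matrices).
acc, prec, rec, f1 = classification_metrics(tp=24, fp=2, fn=2, tn=33)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
# prints: accuracy=0.9344 precision=0.9231 recall=0.9231 f1=0.9231
```

The two F1 formulas give the same value here, since 2 * (0.9231 * 0.9231) / (0.9231 + 0.9231) is again 0.9231.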
Table 2 shows the accuracy of the various classifiers. The Extra Trees Classifier reaches 93.44%, the highest accuracy among them. One reason it is attractive is speed: the Extra Trees Classifier runs much faster than the Random Forest Classifier, probably about three times faster. It also performs well in the presence of noisy features and is less prone to overfitting the data. It builds additional randomized trees whose votes help produce the best predictions.
VIII. CONCLUSION AND FUTURE WORK
This paper deals with various techniques for predicting heart disease with substantial accuracy. In this research, I performed experiments and evaluated the resulting statistics. They suggest that the Extra Trees Classifier ranks as the top classifier for heart disease prediction because it offers higher precision and the shortest decision time. The highest accuracy belongs to the Extra Trees Classifier: on the UCI repository dataset it reaches 93.44%, whereas the other models have lower accuracy. In conclusion, and in light of the literature review, we believe that only a competitive benchmark has been achieved by the proposed model for patients with heart disease. Hence, combinations of models and more advanced models are needed to raise the accuracy further and help predict heart disease more accurately.
IX. REFERENCES
[1] C. A. Devi, S. P. Rajamhoana, K. Umamaheswari, R. Kiruba, K. Karunya, and R. Deepika, "Analysis of neural networks based heart disease prediction system," in Proc. 11th Int. Conf. Hum. Syst. Interact. (HSI), Gdansk, Poland, Jul. 2018, pp. 233–239.
[2] Mai Shouman, Tim Turner, Rob Stocker, "Using data mining techniques in heart disease diagnosis and treatment", IEEE Japan-Egypt Conference on Electronics, Communications and Computers, 2012.
[3] S. P. Rajamhona, 2018. Analysis of neural network based heart disease prediction system. Cleveland and Statlog heart disease database, UCI repository.
[4] Salma Banu N. K., 2016. Prediction of heart disease at early stage using data mining and big data analytics. UCI repository.
[5] Purusothama et al., 2015. Different classification techniques to design risk prediction model for heart disease. UCI repository.
[6] Amin Ul Haq, Jian Ping Li, Muhammad Hammad Memon, Shah Nazir, and Ruinan Sun. A Hybrid Intelligent System Framework for the Prediction of Heart Disease Using Machine Learning Algorithms.
[7] Alizadehsani, Roohallah, Jafar Habibi, Mohammad Javad Hosseini, Hoda Mashayekhi, Reihane Boghrati, Asma Ghandeharioun, Behdad Bahadorian, and Zahra Alizadeh Sani. "A data mining approach for diagnosis of coronary artery disease." Computer methods and programs in biomedicine 111, no. 1 (2013): 52-61.
[8] UCI repository dataset for Heart Disease. https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[9] Classification Example with Ridge Classifier in Python. https://www.datatechnotes.com/2020/07/classification-example-with-ridge-classifier-in-python.html