Predicting Diabetes Using a Machine Learning Approach
Using a machine learning (ML) approach, we can now assess a patient's diabetes risk. Learn more about how these algorithms are dramatically changing health care.
Diabetes is one of the deadliest diseases in the world. It is not only a disease in itself but also a cause of other conditions such as heart attacks, blindness, and kidney disease. In the usual detection process, patients visit a diagnostic center, consult their physician, and sit tight for a day or more to get their reports. They also have to pay every time they want a diagnosis report.
With the rise of machine learning approaches, we have the potential to solve this problem, and we have developed a system using data mining that can tell whether a patient has diabetes. Furthermore, predicting the disease early allows patients to be treated before it becomes critical. Data mining has the potential to extract hidden knowledge from large amounts of diabetes-related data. For that reason, it has an important role in diabetes research, now more than ever.
The goal of this research is to develop a system that can measure the patient's diabetic risk level with high accuracy. This research focuses on developing a system based on three classification methods: Decision Tree, Naive Bayes, and Support Vector Machine algorithms.
Currently, the models achieve accuracies of 84.6667%, 76.6667%, and 77.3333% for the Decision Tree, Naive Bayes, and SMO Support Vector Machine, respectively. These results are validated using receiver operating characteristic (ROC) curves. The developed ensemble method uses the votes given by the individual algorithms to give the final result. This voting system reduces algorithm-specific misclassifications, which helps to get a more accurate estimate of the disease. We used the WEKA data mining tool for data preprocessing and experimental analysis. The ensemble method shows a significant improvement in accuracy compared with other existing methods.
Methodology
These algorithms do not work alone; we have developed an ensemble method that uses the votes given by the individual algorithms to give the final result. The system accepts a result only when at least two of the three models give the same prediction; that is, it returns the decision of the majority. This voting system reduces algorithm-specific misclassifications and helps to get a more accurate estimate of the disease.
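The voting rule can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual code: the function name majority_vote and the label strings are assumptions made for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by a strict majority of models, or None if there is none."""
    label, count = Counter(predictions).most_common(1)[0]
    # With three voters, a strict majority means at least two matching predictions.
    return label if count > len(predictions) / 2 else None

# Hypothetical outputs for one patient, in the order:
# decision tree, Naive Bayes, SVM.
votes = ["diabetic", "non-diabetic", "diabetic"]
print(majority_vote(votes))  # prints "diabetic"
```

With two classes and three models a majority always exists; the None branch only matters if the same rule is reused with more classes or an even number of models.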
Decision Tree (J48 Algorithm)
A decision tree is a tree structure that has the appearance of a flowchart. It can be used for classification and estimation, and it is represented by nodes: the root and internal nodes are tests on attributes, and leaf nodes hold class labels. To classify a new instance, the algorithm creates a decision tree based on the attribute values of the available training data set. Each node of the tree is generated by choosing the attribute with the highest information gain among all attributes. If an attribute gives an unambiguous result, the branch for that attribute is terminated and the target value is assigned to it. The following diagram shows the sample decision tree.
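To make the information gain criterion concrete, here is a small sketch that computes it for discrete attributes. The toy attribute values and labels are invented for illustration and are not taken from the study's data. J48 itself applies further refinements, such as thresholds for numeric attributes, gain ratio, and pruning, which are omitted here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a discrete attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented toy data: four patients, two candidate attributes.
labels = ["yes", "yes", "no", "no"]
glucose = ["high", "high", "normal", "normal"]  # separates the classes perfectly
bmi = ["high", "normal", "high", "normal"]      # tells us nothing about the class

gains = {"glucose": information_gain(glucose, labels),
         "bmi": information_gain(bmi, labels)}
best = max(gains, key=gains.get)  # attribute that would be tested at this node
```

In this toy case the glucose attribute yields pure subsets, so it has the higher gain and would be chosen for the node.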
A 12-fold cross-validation technique was used to build the model. The original sample is randomly divided into 12 equal-sized sub-samples of size n/12. A single sub-sample is held out as validation data to test the model, and the remaining (12 − 1) sub-samples are used as training data. This is repeated 12 times, so that each sub-sample is used for testing once, and the average accuracy is taken.
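The fold bookkeeping can be sketched as follows. This is an illustrative outline only: the helper names and the evaluate callback (which would train a model on the training indices and return its accuracy on the test indices) are assumptions, not part of the original system.

```python
import random

def k_fold_indices(n, k=12, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k folds of (almost) equal size
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validate(n, evaluate, k=12):
    """Average the score returned by the hypothetical `evaluate(train, test)` callback."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)
```

Every sample appears in exactly one test fold, so each record contributes to the reported accuracy exactly once.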
Naive Bayes Algorithm
It is based on the Bayes rule of conditional probability. It uses all the features in the data and analyzes them individually, as though they were equally important and independent of each other. The construction process for Naive Bayes can be parallelized. It can be applied to a large dataset in real time because it avoids complex iterative parameter estimation. To create the model using this algorithm, we used a 70:30 percentage split: 70% of the data set was used to train the model and the other 30% was used to test it.
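The sketch below shows a 70:30 split together with a very small Gaussian Naive Bayes classifier. It assumes numeric features modelled with one normal distribution per class and feature; the source does not state which variant was used, and all function names and toy values here are invented for illustration.

```python
import math
import random

def train_test_split(rows, labels, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into training and test portions."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = round(len(order) * train_fraction)
    train, test = order[:cut], order[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature (mean, variance) for every class."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant avoids division by zero for constant features.
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[cls] = (len(members) / len(rows), stats)
    return model

def predict(model, row):
    """Pick the class with the highest log posterior under the independence assumption."""
    def log_posterior(cls):
        prior, stats = model[cls]
        score = math.log(prior)
        for x, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Invented toy records: [glucose, BMI], label 1 = diabetic, 0 = not diabetic.
rows = [[85.0, 22.0], [90.0, 24.0], [88.0, 23.0], [95.0, 25.0], [92.0, 22.5],
        [160.0, 35.0], [170.0, 38.0], [155.0, 33.0], [165.0, 36.0], [150.0, 34.0]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
```

Because each feature contributes its own term to the sum, the per-feature statistics can be computed independently of one another.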
SMO (Sequential Minimal Optimization)
This algorithm is commonly used to solve quadratic programming problems that arise during SVM (Support Vector Machines) training. SMO uses heuristics to divide the training problem into smaller problems that can be analytically solved. It replaces all missing values and converts the nominal attributes into binary. Also, all features are normalized by default, which helps speed up the training process. Here, too, this model
Dataset used:
Data were obtained from the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases.
Procedure:
1. Load the previous datasets into the system.
2. Pre-process the data by integrating the WEKA tool. The following operations are performed on the dataset:
   A. Replace the missing values.
   B. Normalize the values.
3. The user inputs data into the system to determine whether they have the disease.
4. Build three models using the J48 Decision Tree, Naive Bayes, and SMO Support Vector Machine algorithms and train them on the data set.
5. Test the dataset using the three models.
6. Get the evaluation results.
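The two pre-processing operations named in the procedure, replacing missing values and normalization, can be illustrated as below. This sketch assumes mean imputation and min-max scaling to the range [0, 1]; the source does not specify which variants were applied, and the function names and sample values are hypothetical.

```python
def replace_missing_with_mean(rows):
    """Replace None entries with the mean of the observed values in the same column."""
    means = []
    for column in zip(*rows):
        observed = [x for x in column if x is not None]
        means.append(sum(observed) / len(observed))
    return [[m if x is None else x for x, m in zip(row, means)] for row in rows]

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) if hi > lo else 0.0
             for x, lo, hi in zip(row, lows, highs)] for row in rows]

# Invented sample: [glucose, BMI], with one missing BMI value.
raw = [[100.0, None], [150.0, 30.0], [200.0, 40.0]]
filled = replace_missing_with_mean(raw)
scaled = min_max_normalize(filled)
```

Imputation is done before scaling so that the minimum and maximum of each column are computed on complete data.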
Closing Point: Considering these results, each model has more than 70% accuracy. In addition, because the final decision comes from a vote across all three algorithms, the conclusion is more reliable than that of any single model. We also plan to gather more data from different districts of the country in order to develop more accurate and simpler prediction patterns.