An Efficient Technique using Data Mining Algorithms for Predicting Disease

Integrated Intelligent Research (IIR)

International Journal of Data Mining Techniques and Applications Volume: 07, Issue: 01, June 2018, Page No.161-165 ISSN: 2278-2419

An Efficient Technique using Data Mining Algorithms for Predicting Disease – A Review

M. Muthuraman1, S. Ravichandran2
1 Research Scholar and Assistant Professor, PG and Research Department of Computer Science, H.H. The Rajah's College (A), Pudukkottai, Tamilnadu - 622001
2 Assistant Professor and Head, PG and Research Department of Computer Science, H.H. The Rajah's College (A), Pudukkottai, Tamilnadu - 622001
Email: muthuraman70@gmail.com, rajahsravis@gmail.com

Abstract - Data mining is the process of searching high volumes of data for useful information. The most popular data mining techniques are rule mining, clustering, classification and sequence pattern mining. Data mining holds great promise for the health industry, enabling health systems to use data and analytics systematically to identify inefficiencies and best practices that improve care and reduce costs. Some experts believe that the opportunity to improve healthcare and reduce costs at the same time may account for 30% of total health care spending. This could be a win-win situation. However, because of the complexity of healthcare and the slow adoption of technology, the industry lags behind others in implementing effective data analysis and extraction techniques. Detecting a disease usually requires a number of tests from the patient, but data mining techniques can reduce the number of tests required. Reducing the number of tests plays an important role in saving time and improving performance. This paper examines how data mining techniques are used for predicting lung diseases.

1. INTRODUCTION

Data mining is an important step in discovering hidden predictive information in large data sets. It is a new and innovative technology with great potential to help companies focus on the most important data in their data warehouses. Data mining is the process of analysing data to find interesting patterns, so that stored data does not become a data grave [1]. It allows users to analyze data of many different sizes. In practice, it is the process of finding correlations or patterns among many fields in a large relational database. The data mining process includes several steps: data cleaning, data integration, data selection, data transformation, knowledge representation, and pattern evaluation [2]. The data may be collected from various applications including science and engineering, management, business houses, government administration and so on. Data patterns may be mined from spatial, time-related, text, biological, multimedia, web and legacy databases. Data mining tasks include concept description, association, classification, prediction, clustering, trend analysis, deviation analysis and similarity analysis [3]. Data consists of a voluminous set of facts and a number of dimensions. Dimensions are the entities about which an organization maintains records, and they may be hierarchical. Data mining, also called knowledge discovery in databases, refers to discovering information from vast datasets. It comprises the procedures used to work on large sets of data to find hidden patterns and relationships that support decision making [4].

2. DATA MINING METHODS

Vanitha et al. [5] describe the data mining techniques used in mining tasks. Association, classification, clustering and prediction are some of these techniques. Data mining models fall into two main categories: descriptive and predictive. A descriptive model presents the data in a form that is essentially a summary of the data points; it finds patterns in the data and captures the relationships between the attributes the data represents. It includes methods such as clustering, association rules, summarization, and sequence discovery. A predictive model makes predictions about data values using known results found in different datasets. Classification, regression and time series analysis are the predictive data mining models.

2.1 Classification

Smita et al. [6] specify that classification is the most commonly applied technique in data mining. It finds rules that partition data into groups. Classification is a supervised learning task: the categories depend on a target variable, and new data is assigned to one of those categories. Common techniques include decision trees, neural networks, and genetic algorithms. Applications include credit risk analysis, fraud detection, banking, and business modeling.

2.2 Clustering

Clustering groups similar data objects into one cluster, while dissimilar objects fall into other clusters. It is a way of finding similarities in data in order to organize it, to categorize it for model construction, to compress it, and so on. Clustering procedures are classified as partitioning techniques, hierarchical techniques, and density-based and grid-based techniques. The datasets may be numerical or categorical. K-Means and hierarchical clustering are some of the well-known data clustering algorithms, as given in Rajasekaran et al. [7].

2.3 Association Rule Mining

Han et al. [8] describe association rule mining as a well-researched method for finding interesting relations between variables in large databases. In this technique, the presence of one item is related to the presence of another in terms of cause and effect. The main objective is to discover all rules whose support and confidence are greater than or equal to the minimum support and minimum confidence set for the database. Support is the percentage of all transactions that contain both items. Confidence measures how strongly the presence of one item depends on the presence of another. Patterns with low confidence and support are not significant.

2.4 Regression

Crino et al. [9] describe regression as another predictive data mining model, also known as a supervised learning technique. This technique analyses the dependency of some attribute values on the values of other attributes, mainly those present in the same item. In this technique the target values are known.
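
The support and confidence measures described in Section 2.3 can be made concrete with a small worked example. The sketch below is illustrative only: the patient records, the item names, and the threshold values are invented for this example and do not come from any of the reviewed studies.

```python
from typing import List, Set


def count_containing(transactions: List[Set[str]], items: Set[str]) -> int:
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def support(transactions: List[Set[str]], items: Set[str]) -> float:
    """Fraction of all transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions: List[Set[str]],
               antecedent: Set[str], consequent: Set[str]) -> float:
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    base = count_containing(transactions, antecedent)
    if base == 0:
        return 0.0
    return count_containing(transactions, antecedent | consequent) / base


# Hypothetical records: each set lists findings observed for one patient.
records = [
    {"smoker", "cough", "lung_disease"},
    {"smoker", "cough"},
    {"smoker", "lung_disease"},
    {"cough"},
    {"smoker", "cough", "lung_disease"},
]

# Rule under test: smoker -> lung_disease
rule_support = support(records, {"smoker", "lung_disease"})          # 3 of 5 records
rule_confidence = confidence(records, {"smoker"}, {"lung_disease"})  # 3 of 4 smokers

# Assumed thresholds; a rule is kept only if it meets both of them.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7
is_significant = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE

print(f"support = {rule_support:.2f}")        # support = 0.60
print(f"confidence = {rule_confidence:.2f}")  # confidence = 0.75
print(f"significant = {is_significant}")      # significant = True
```

Counting matching records directly, rather than dividing one support value by another, keeps the confidence computation exact for small examples like this one.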

3. DISEASE PREDICTION ALGORITHMS – A REVIEW

Aravinthan and Vanitha [10] describe a prediction algorithm designed to build a generic predictor or estimator that determines the user's next action. The approach creates a dictionary of area identifiers, treats them as character symbols, and uses the dictionary to collect statistics based on context or phrases in the motion history. The integration of the mining tools described and demonstrated in that article allows for a more comprehensive MATLAB data mining approach than was previously available. That work suggests that data mining becomes an increasingly straightforward task as the appropriate tools for a given analysis become apparent.

T. Revathi and S. Jeevitha [11] analyzed data mining algorithms for heart disease prediction using heart disease-related clinical data. They compared the results of Neural Network, Naïve Bayes, and Decision Tree algorithms; the Neural Network achieved good accuracy. Devendra Ratnaparkhi, Tushar Mahajan, and Vishal Jadhav [12] proposed a framework for predicting heart disease using Naïve Bayes and compared the results with neural network and decision tree algorithms. In their study, the Naïve Bayes algorithm provided good prediction. K. Manimekalai [13] surveyed various mining techniques for heart disease prediction. According to the experimental results, an SVM classifier with a genetic algorithm provides better precision when compared with Naïve Bayes, C5.0, Neural Network, KNN, J48, decision tree and fuzzy algorithms.

Lambodar Jena and Narendra Ku. Kamila [14] worked with a chronic disease dataset from the UCI Repository. They used Naïve Bayes, multi-layer perceptrons, support vector machines, J48, conjunctive rules, and decision trees to compare classification accuracy. They reported that multi-layer perceptron algorithms provide better classification accuracy and better predictive performance for chronic kidney disease. Basma Boukenze et al. [15] were concerned with the evolution of healthcare systems in big data analysis. In their work, they applied support vector machine (SVM), decision tree (C4.5) and Bayesian network machine learning algorithms. The chronic kidney disease dataset from the UCI Machine Learning Repository was used to distinguish patients with chronic renal failure from those without chronic kidney disease. The C4.5 classifier offered shorter runtime and higher accuracy. Pushpa M. Patil [16] reviewed different research articles on the prediction of chronic kidney disease using data mining classifiers. The review presents the basic concepts of the decision tree, Bayesian classification, rule-based classification, the back propagation algorithm, the support vector machine (SVM) and the K-nearest neighbor classifier. The most accurate classifiers reported are multilayer perceptrons, random forests, Naïve Bayes, SVM, K-nearest neighbors and radial basis functions.

D. Sindhuja and R. Jemina Priyadarsini [17] described classification techniques for analyzing liver disorders. They compared the advantages and shortcomings of C4.5, Naïve Bayes, decision tree, support vector machine, back propagation neural network, and classification and regression tree algorithms. They suggest that C4.5 provides better performance than the other algorithms. A. S. Aneeshkumar and C. Jothi Venkateswaran [18] presented a method for classifying liver diseases using data mining techniques. They used fuzzy-based classification, which provided more accuracy in the diagnosis of liver disease. According to the study on the diagnosis of hepatitis using a decision tree algorithm by V. Shankar Sowmien, V. Sugumaran, C. P. Karthikeyan and T. R. Vijayaram [19], the C4.5 decision tree algorithm provides valid results in the prediction of liver disease. Dr. S. Vijayarani and S. Dhayanand [20] proposed a method to diagnose liver disease. They used SVM and Naïve Bayes methods on the Indian liver patient dataset. When the performance of the algorithms was compared, SVM provided more accuracy, while Naïve Bayes had the shorter execution time.
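
Several of the studies above rank classifiers by both accuracy and execution time. A minimal sketch of such a comparison follows. It makes several assumptions that are not taken from the reviewed papers: the data is synthetic (two numeric features, two well-separated classes), and the two classifiers are small pure-Python versions of Gaussian Naïve Bayes and k-nearest neighbours, standing in for the SVM and other models the cited authors used. None of the numbers it prints correspond to results in those studies.

```python
import math
import random
import time
from collections import Counter


def make_data(n, seed):
    """Synthetic two-class rows (features, label); not real patient data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        centre = 4.0 * label  # class 0 centred at 0, class 1 centred at 4
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label))
    return rows


class GaussianNB:
    """Naive Bayes with one Gaussian per feature and class."""

    def fit(self, rows):
        by_label = {}
        for x, y in rows:
            by_label.setdefault(y, []).append(x)
        self.stats = {}
        for y, xs in by_label.items():
            cols = list(zip(*xs))
            means = [sum(c) / len(c) for c in cols]
            # Small constant avoids a zero variance.
            varis = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(cols, means)]
            self.stats[y] = (math.log(len(xs) / len(rows)), means, varis)
        return self

    def predict(self, x):
        def log_posterior(y):
            prior, means, varis = self.stats[y]
            return prior + sum(
                -0.5 * math.log(2 * math.pi * s) - (v - m) ** 2 / (2 * s)
                for v, m, s in zip(x, means, varis))
        return max(self.stats, key=log_posterior)


class KNN:
    """Majority vote among the k training rows closest to the query."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, rows):
        self.rows = rows
        return self

    def predict(self, x):
        nearest = sorted(self.rows, key=lambda r: math.dist(r[0], x))[:self.k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]


def evaluate(model, train, test):
    """Return (accuracy on test, seconds spent on training plus prediction)."""
    start = time.perf_counter()
    model.fit(train)
    correct = sum(model.predict(x) == y for x, y in test)
    return correct / len(test), time.perf_counter() - start


train, test = make_data(300, seed=0), make_data(100, seed=1)
results = {name: evaluate(model, train, test)
           for name, model in [("naive_bayes", GaussianNB()), ("knn", KNN(k=5))]}
for name, (accuracy, seconds) in results.items():
    print(f"{name}: accuracy={accuracy:.2f}, time={seconds * 1000:.1f} ms")
```

Timing training and prediction together on a held-out test set mirrors the accuracy-versus-runtime trade-off the studies discuss; on real clinical data a cross-validated estimate would be more reliable than a single split.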
Pragati Agrawal and Amit Kumar Dewangan [21] focused on the use of data mining algorithms for the diagnosis of diabetes. They examined k-fold cross-validation and classification methods including K-Nearest Neighbor (KNN), support vector machines (SVM), LDA support vector machines and artificial neural networks for diabetes diagnosis. They propose that the support vector machine provides greater accuracy on the diabetic patient data set. Ms. Nilam Chandgude and Prof. Suvarna Pawar [22] described the many classification algorithms used for the diagnosis of diabetes. They used Neural Network, Decision Tree, Naïve Bayes, Support Vector Machine, ID3, C4.5 and CART algorithms, compared their performance, and found that CART provides better accuracy than the other algorithms. Thirumal P. C. and Nagarajan N. [23] presented various data mining methods for predicting diabetes. The Pima Indians diabetes dataset was used for the analysis. After pre-processing the data, algorithms such as the Naïve Bayes classifier, C4.5, SVM and KNN were applied. The C4.5 algorithm provided the highest accuracy and KNN the lowest. K. Rajalakshmi and Dr. S. S. Dhenakaran [24] analysed knowledge discovery prediction algorithms in the healthcare domain, comparing various data mining techniques on the prediction of different diseases. The SVM algorithm performed well in predicting diabetes.

S. Ougiaroglou et al. [25] proposed using a non-fixed number of nearest neighbours rather than a fixed value of k. A large value of k increases the computational cost and time for large data sets. To address this problem, they applied three heuristics that allow the algorithm to break early: when a predefined condition is fulfilled, the heuristic stops the search, which saves computational time. Y. Cai et al. [26] proposed another variant of the KNN algorithm, which uses shared nearest neighbours to classify documents. It uses a similarity measure to find the neighbours of a new entity. A threshold is set, and only that number of nearest neighbours can vote on the classification of an unknown entity.

S. Vijayarani et al. [27] used SVM and ANN to detect chronic kidney disease, with the aim of comparing the performance of the two algorithms based on accuracy and run time. The experimental results show that ANN performs better than SVM. Gopika and Vanitha [29] aim to reduce diagnostic time and improve diagnostic accuracy through classification algorithms. Their study deals with classifying the different stages of severity in chronic kidney disease. The experiment is carried out with different algorithms, namely back propagation neural networks, radial basis functions and random forests.
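
The early-break idea attributed to Ougiaroglou et al. [25] above can be illustrated with a simple k-nearest-neighbour classifier that stops voting once the outcome is settled. This is a sketch under stated assumptions, not the authors' method: the single stopping rule used here (stop when the leading class can no longer be overtaken by the remaining votes) is a stand-in chosen for clarity, the parameter names `k_max` and `k_min` are hypothetical, and the training points are invented.

```python
import math
from collections import Counter


def adaptive_knn(train, query, k_max=9, k_min=3):
    """Classify `query`, examining neighbours nearest-first.

    Assumed stopping rule (illustrative, not taken from [25]): after at
    least `k_min` neighbours have voted, stop as soon as the leading class
    is ahead by more votes than there are neighbours left to examine.
    Returns (predicted_label, neighbours_used).
    """
    ordered = sorted(train, key=lambda row: math.dist(row[0], query))[:k_max]
    votes = Counter()
    for used, (_, label) in enumerate(ordered, start=1):
        votes[label] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = len(ordered) - used
        if used >= k_min and lead - runner_up > remaining:
            return ranked[0][0], used  # result can no longer change
    return votes.most_common(1)[0][0], len(ordered)


# Hypothetical training points: (features, label).
train = [
    ([1.0, 1.0], "healthy"), ([1.2, 0.9], "healthy"), ([0.8, 1.1], "healthy"),
    ([1.1, 1.3], "healthy"), ([0.9, 0.7], "healthy"),
    ([5.0, 5.0], "disease"), ([5.2, 4.8], "disease"),
    ([4.9, 5.3], "disease"), ([5.1, 5.1], "disease"),
]

label_a, used_a = adaptive_knn(train, [1.0, 1.0], k_max=9)
label_b, used_b = adaptive_knn(train, [5.0, 5.0], k_max=5)
print(label_a, used_a)  # healthy 5
print(label_b, used_b)  # disease 3
```

With this rule the prediction is always the same as a plain majority vote over `k_max` neighbours; only the number of votes examined changes. The sketch still sorts all training points up front, so a real saving in computation would need neighbours to be retrieved incrementally, for example from a spatial index.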

4. CONCLUSION

The data mining techniques listed above show that data mining is a powerful and essential approach for working with data. It gives proper and targeted outcomes from large data sources. This paper covers some of the techniques and tools of data mining. It has become necessary to use data mining techniques to support decision making and prediction in healthcare (to identify the kind of disease), in education (to identify the behaviour of teaching and learning) and in other applications. Our survey indicates that, on average, ANN and fuzzy clustering provide better classification accuracy and dimensionality reduction. These methods eliminate useless and distorting data. This research will contribute to reliable and faster automatic disease diagnosis systems, in which easy diagnosis of disease will save lives.

References

[1] Larsson EG, Selén Y. Linear regression with a sparse parameter vector. IEEE Transactions on Signal Processing. 2007 Feb;55(2):451-60.
[2] Wu J, Huo Q. A study of minimum classification error (MCE) linear regression for supervised adaptation of MCE-trained continuous-density hidden Markov models. IEEE Transactions on Audio, Speech, and Language Processing. 2007 Feb;15(2):478-88.
[3] Zhang S, Zhang L, Qiu K, Lu Y, Cai B. Variable selection in logistic regression model. Chinese Journal of Electronics. 2015 Oct 1;24(4):813-7.
[4] Zhang J, Jiang J. Rank-Optimized Logistic Matrix Regression toward Improved Matrix Data Classification. Neural Computation. 2018 Feb;30(2):505-25.
[5] Vanitha M., Malathi R., Lavanya T., Muthuraman M. (2009), “Data Mining for Network Intrusion Detection System in Real Time”, Journal of Scientific Transactions in Environment and Technovation, Vol. 3(2), pp. 74-78. ISSN: 0973-9157.
[6] Smita, Priti and Sharma, “Use of Data Mining in Various Field: A Survey Paper”, IOSR Journal of Computer Engineering, 8727 Volume 16, Issue 3, Ver. V (May-Jun. 2014).
[7] Rajasekaran S. Efficient parallel hierarchical clustering algorithms. IEEE Transactions on Parallel and Distributed Systems. 2005 Jun;16(6):497-502.
[8] J. Han and M. Kamber, “Data Mining, Concepts and Techniques”, Morgan Kaufmann, 2000.
[9] Crino S, Brown DE. Global optimization with multivariate adaptive regression splines. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 2007 Apr;37(2):333-40.
[10] Aravinthan K. and Vanitha M. (Feb 2016), “A Comparative Study on Prediction of Heart Disease using Cluster and Rank based Approach”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 4, pp. 421-424, ISSN: 2278-1021 (Impact Factor: 5.332).
[11] T. Revathi, S. Jeevitha, “Comparative Study on Heart Disease Prediction System Using Data Mining Techniques”, Volume 4, Issue 7, ISSN (Online): 2319-7064, July 2015.
[12] Devendra Ratnaparkhi, Tushar Mahajan, Vishal Jadhav, “Heart Disease Prediction System Using Data Mining Technique”, International Research Journal of Engineering and Technology (IRJET), Volume: 02, Issue: 08, e-ISSN: 2395-0056, p-ISSN: 2395-0072, Nov-2015.
[13] K. Manimekalai, “Prediction of Heart Diseases using Data Mining Techniques”, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 4, Issue 2, ISSN (Online): 2320-9801, ISSN (Print): 2320-9798, February 2016.
[14] Lambodar Jena, Narendra Ku. Kamila, “Distributed Data Mining Classification Algorithms for Prediction of Chronic-Kidney-Disease”, International Journal of Emerging Research in Management & Technology, Volume-4, Issue-11, ISSN: 2278-9359, November 2015.
[15] Basma Boukenze, Hajar Mousannif and Abdelkrim Haqiq, “Performance of Data Mining Techniques to Predict in Healthcare Case Study: Chronic Kidney Failure Disease”, International Journal of Database Management Systems (IJDMS), Vol. 8, No. 3, June 2016.
[16] Pushpa M. Patil, “Review on Prediction of Chronic Kidney Disease using Data Mining Techniques”, International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 5, ISSN 2320-088X, May 2016.

[17] D. Sindhuja, R. Jemina Priyadarsini, “A Survey on Classification Techniques in Data Mining for Analyzing Liver Disease Disorder”, International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 5, ISSN 2320-088X, May 2016.
[18] A. S. Aneeshkumar, Dr. C. Jothi Venkateswaran, “A novel approach for Liver disorder Classification using Data Mining Techniques”, Engineering and Scientific International Journal (ESIJ), Volume 2, Issue 1, ISSN 2394-7179 (Print), ISSN 2394-7187 (Online), January - March 2015.
[19] V. Shankar Sowmien, V. Sugumaran, C. P. Karthikeyan, T. R. Vijayaram, “Diagnosis of Hepatitis using Decision tree algorithm”, International Journal of Engineering and Technology (IJET), Vol. 8, No. 3, e-ISSN: 0975-4024, p-ISSN: 2319-8613, Jun-Jul 2016.
[20] Dr. S. Vijayarani, Mr. S. Dhayanand, “Liver Disease Prediction using SVM and Naïve Bayes Algorithms”, International Journal of Science, Engineering and Technology Research (IJSETR), Volume 4, Issue 4, ISSN: 2278-7798, April 2015.
[21] Pragati Agrawal, Amit Kumar Dewangan, “A Brief Survey on the Techniques used for the Diagnosis of Diabetes-Mellitus”, International Research Journal of Engineering and Technology (IRJET), Volume: 02, Issue: 03, e-ISSN: 2395-0056, p-ISSN: 2395-0072, June 2015.
[22] Ms. Nilam Chandgude, Prof. Suvarna Pawar, “A survey on diagnosis of diabetes using various classification algorithm”, International Journal on Recent and Innovation Trends in Computing and Communication, Volume: 3, Issue: 12, ISSN: 2321-8169, pp. 6706-6710, December 2015.
[23] Thirumal P. C., Nagarajan N., “Utilization of Data Mining Techniques for Diagnosis of Diabetes Mellitus - A Case Study”, ARPN Journal of Engineering and Applied Sciences, Vol. 10, No. 1, ISSN 1819-6608, January 2015.
[24] K. Rajalakshmi, Dr. S. S. Dhenakaran, “Analysis of Datamining Prediction Techniques in Healthcare Management System”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 4, ISSN: 2277 128X, April 2015.
[25] S. Ougiaroglou, A. Nanopoulos, A. N. Papadopoulos, Y. Manolopoulos, and T. Welzer-Druzovec, “Adaptive k-Nearest-Neighbor Classification Using a Dynamic Number of Nearest Neighbors”, in Advances in Databases and Information Systems, Y. Ioannidis, B. Novikov, and B. Rachev, Eds. Springer Berlin Heidelberg, 2007, pp. 66-82.
[26] Y. Cai, D. Ji, and D. Cai, “A KNN Research Paper Classification Method Based on Shared Nearest Neighbor”, in Proceedings of the 8th NTCIR Workshop Meeting, 2008.
[27] S. Vijayarani, S. Dhayanand, “Kidney disease Prediction Using SVM and ANN Algorithms”, International Journal of Computing and Business Research (IJCBR), ISSN (Online): 2229-6166, Vol. 6, Issue 2, 2015.
[28] S. Ramya, N. Radha, “Diagnosis of Chronic Kidney Disease Using Machine Learning Algorithms”, International Journal of Innovative Research in Computer and Communication Engineering, Vol. 4, Issue 1, 2016.
[29] S. Gopika, Dr. M. Vanitha, “Survey on Prediction of Kidney Disease by using Data Mining Techniques”, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 6, Issue 1, 2017.
