IDL - International Digital Library Of Technology & Research Volume 1, Issue 5, May 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Prediction of Heart Disease Using Classification Mining Technique on Spark

Rashmi G Saboji
Computer Science & Engineering
C.M.R Institute of Technology
Bangalore, India
rashmikaneri@gmail.com

Abstract:

This paper addresses the growing volume of health care data that is accumulated digitally every day. The healthcare industry is becoming very data intensive. Worldwide digital healthcare data is estimated at 500 petabytes (1 petabyte = 10^15 bytes) and is expected to reach 25 exabytes (1 exabyte = 10^18 bytes) in 2020 [6]. In this paper, heart disease is the disease selected from among the variety of diseases in healthcare. The purpose of this work is to predict the diagnosis of heart disease with a reduced number of attributes. Each dataset stored in HDFS is classified based on its attributes. This prediction solution, using random forest on Apache Spark, gives health care analysts the opportunity to deploy it on an ever-changing, scalable big data landscape for insightful decision making.

Keywords: Spark, HDFS, Heart disease, Random forest, verification

1. INTRODUCTION

The health care system is rapidly adopting electronic health records (EHR), which will drastically increase the quantity of clinical data available digitally. Concurrently, fast progress is being made in clinical analytics, that is, techniques for analyzing large volumes of data and deriving new insights from that analysis, which is known as big data analytics. As a result, we can use the opportunities provided by big data to reduce the cost of health care as well as to diagnose diseases. In this paper, heart disease is the disease selected from among the variety of diseases in healthcare. Heart disease is a general name for a variety of diseases. Heart disease symptoms may vary depending on the specific type of heart disease.

Hospitals use hospital database systems to store and manage their patient data. These systems generate large volumes of data, but the data are rarely used to support insightful clinical decision making. Using big data with data mining algorithms makes it possible to identify healthcare trends, prevent diseases, diagnose diseases, and so on.

2. OBJECTIVES

The purpose of this work is to predict the diagnosis of heart disease with a reduced number of attributes. Each dataset stored in HDFS is classified based on its attributes. Thirteen attributes and one class are involved in predicting heart disease. This prediction solution, using random forest on Apache Spark, gives health care analysts the opportunity to deploy it on an ever-changing, scalable big data landscape for insightful decision making. The scope of this project mainly concerns the data analysis part, with the aim of improving on the following four issues:

1. Complexity of the analysis: For some analysis algorithms, the computing time increases dramatically even with small amounts of data growth.
2. Accuracy in prediction: Different data mining algorithms for classification, clustering, regression and association achieve different levels of accuracy when it comes to prediction.
3. Scale of the data: Even for simple data analysis, it could take several days, even months, to obtain the result when the data is very large (e.g. zettabyte scale).
4. Parallelization of the computing model: For computationally intense problems, we can parallelize the analysis so that the problem is solved by distributing tasks over many computers.
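The idea of distributing an analysis over many workers can be illustrated on a single machine. The following is a minimal sketch, not the paper's Spark code: it assumes the records are already reduced to 0/1 class labels, splits them into partitions, counts classes per partition in a thread pool, and merges the partial counts. The function names and the sample labels are hypothetical, and threads are used only to show the split-and-merge structure, not to claim a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def count_classes(partition):
    """Count class-0 and class-1 labels in one partition of records."""
    counts = {0: 0, 1: 0}
    for label in partition:
        counts[label] += 1
    return counts


def split_into_partitions(records, n_parts):
    """Split a list into at most n_parts contiguous chunks of near-equal size."""
    size = max(1, (len(records) + n_parts - 1) // n_parts)
    return [records[i:i + size] for i in range(0, len(records), size)]


def parallel_class_counts(labels, n_parts=4):
    """Map count_classes over the partitions, then merge the partial counts."""
    partitions = split_into_partitions(labels, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_classes, partitions))
    total = {0: 0, 1: 0}
    for partial in partials:
        for label, n in partial.items():
            total[label] += n
    return total


# Made-up labels for illustration only (0 = absent, 1 = present).
sample_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
result = parallel_class_counts(sample_labels, n_parts=3)
print(result)
```

On a cluster, the partitions would live on different machines and the merge step would combine results sent back by each worker; the structure of the computation stays the same.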

3. METHODOLOGY

We evaluate heart disease prediction and its accuracy on the Spark and Hadoop platform because the attribute datasets are likely to scale out to several machines. The whole solution is represented at an abstract architecture level in Fig 3.1. First, the Spark ecosystem needs to be understood in order to take advantage of its functionality and its support for machine learning libraries. Fig 3.2 shows the Spark ecosystem, including its underlying resource manager, YARN, and its distributed file system, HDFS. The next task is to collect heart disease datasets in CSV files. These datasets need to be processed, which means that each record is labeled with a class.

Fig 3.1: System Architecture

The main objective of this work is to identify the key patterns or features from the medical data using the classifier model. The attributes that are more relevant to heart disease diagnosis can be observed. This will help the medical practitioners to understand the root causes of disease in depth.

Fig 3.2: Spark Eco System

The class is simply a numerical representation of the heart disease prediction based on the attribute values. Class 0 means absence of heart disease and class 1 means presence of heart disease. Once the data is collected in CSV files, it is stored in HDFS, which provides strong fault tolerance. Then the data is extracted and parsed in order to handle missing attribute values. Random forest is then used to predict the class of newly arrived, unlabeled data. The same algorithm is used to measure accuracy as the training dataset grows, which addresses the issue of scalability on big data. Finally, we check the computation time of the algorithm on Spark to address the issue of computational complexity. The results also show that the error rate decreases, and accuracy correspondingly increases, as the training dataset grows.
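To make the prediction step concrete, the following is a simplified, standard-library sketch of the idea behind random forest: each tree is trained on a bootstrap sample with a random subset of features, and the forest predicts by majority vote. It is not the Spark implementation used in this work. Each "tree" here is only a one-level decision stump, the feature subset is drawn once per stump rather than at every split, and the six training rows (age, resting blood pressure, maximum heart rate) are invented for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)


def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no usable split: always predict the majority class
        return (None, None, majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if row[f] <= t else right_label


def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Invented toy data: (age, resting blood pressure, maximum heart rate).
train_rows = [(40, 120, 170), (45, 125, 165), (42, 118, 175),
              (60, 150, 120), (65, 160, 110), (62, 155, 115)]
train_labels = [0, 0, 0, 1, 1, 1]

forest = fit_forest(train_rows, train_labels)
print(predict_forest(forest, (38, 110, 180)))
print(predict_forest(forest, (70, 170, 100)))
```

A production forest grows deeper trees and is trained on far more records; the sketch only shows why combining many randomized weak learners gives a single class prediction for an unlabeled record.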

4. IMPLEMENTATION

The heart disease datasets are collected from the UCI Machine Learning Repository [12], a widely used repository that contains datasets from different locations. These datasets are used for data mining and machine learning purposes. For heart disease prediction, data is collected from Cleveland, Switzerland and Hungary. Further information about the sources and attributes is given below.

1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

The "num" attribute indicates the presence of heart disease in the patient. Its value ranges from 0 (no presence) to 4. Most experiments with the Cleveland database focus on distinguishing absence ("num" value 0) from presence ("num" values 1 to 4). For our experimentation, we use two classes for prediction: 0 for absent and 1 for present. For privacy reasons, patients' personal identification information is replaced with dummy values.

For prediction and accuracy:

1. Datasets are extracted from HDFS.
2. Datasets are parsed to fill in missing values, which provides complete supervised datasets.
3. The datasets are divided into training and testing datasets in a 70:30 proportion.
4. The training dataset is used to train the random forest model with optimized parameters.
5. The model is evaluated on the test instances and the test error is computed.
6. Predictions are obtained by applying the model to the test instances.
7. Accuracy is evaluated by comparing the original label of each test record with the value predicted by the algorithm.
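The data preparation and evaluation steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code used in this work: the inline CSV and its values are invented, "?" is assumed to mark a missing value, missing cells are filled with the column mean (the paper does not say which fill strategy was used), and a majority-class predictor stands in for the trained random forest so that the accuracy and test error computation can be shown end to end.

```python
import csv
import io
import random

# Invented sample; the last column "num" ranges from 0 to 4 as in the paper.
SAMPLE_CSV = """age,trestbps,chol,num
63,145,233,0
67,160,286,2
67,120,?,1
37,130,250,0
41,?,204,0
56,120,236,0
62,140,268,3
57,120,354,0
63,130,254,2
53,140,203,1
"""


def load_rows(text, missing="?"):
    """Parse CSV text into rows of floats, using None for missing cells."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [[None if cell == missing else float(cell) for cell in line]
            for line in reader if line]
    return header, rows


def impute_column_means(rows):
    """Replace None in each column with the mean of that column's known values."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        known = [r[c] for r in rows if r[c] is not None]
        means.append(sum(known) / len(known))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]


def binarize(num):
    """Map "num" to two classes: 0 stays 0 (absent), 1 to 4 become 1 (present)."""
    return 0 if num == 0 else 1


def train_test_split(items, train_fraction=0.7, seed=42):
    """Shuffle and split into training and testing parts (70:30 by default)."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of test records whose predicted label equals the original label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


header, raw_rows = load_rows(SAMPLE_CSV)
rows = impute_column_means(raw_rows)
data = [(r[:-1], binarize(r[-1])) for r in rows]
train, test = train_test_split(data)

# Stand-in model (assumption): always predict the majority class of the training set.
train_labels = [y for _, y in train]
majority = max(set(train_labels), key=train_labels.count)
predicted = [majority for _ in test]
actual = [y for _, y in test]

acc = accuracy(predicted, actual)
test_error = 1.0 - acc
print(f"train={len(train)} test={len(test)} accuracy={acc:.2f} test_error={test_error:.2f}")
```

For simplicity the column means are computed over the whole dataset before splitting; a stricter evaluation would compute them on the training part only and reuse them for the test part.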

5. OUTCOMES

The graphs below show the outcome of the random forest implementation on Spark. The same prediction model was also built using Naïve Bayes, but Fig 5.1 shows that Naïve Bayes does not reach the accuracy of random forest. Fig 5.1 depicts the difference in accuracy between random forest and Naïve Bayes as the size of the training dataset stored in HDFS increases.

Fig 5.2 shows the increase in accuracy as the number of training records used for prediction increases.
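Because error rate is the complement of accuracy, the trend in Fig 5.2 can be restated as a falling error rate. The short sketch below uses only the three points labeled in that figure (88% at 200 records, 96% at 400, 98% at 600); the variable and function names are illustrative.

```python
# (number of records, accuracy in percent), as labeled in Fig 5.2
points = [(200, 88), (400, 96), (600, 98)]


def error_rate_percent(accuracy_percent):
    """Error rate is the complement of accuracy."""
    return 100 - accuracy_percent


errors = [error_rate_percent(acc) for _, acc in points]
for (n_records, acc), err in zip(points, errors):
    print(f"{n_records} records: accuracy {acc}%, error rate {err}%")
```

The error rate drops from 12% to 4% to 2% as the training set grows from 200 to 600 records.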

[Figure: "Accuracy Performance Comparison" chart. Series: Random Forest, Naïve Bayes. Horizontal axis values: 200, 400, 600. Vertical axis: accuracy (%).]

Fig 5.1: Accuracy comparison chart

[Figure: Random Forest accuracy versus number of records. Labeled points: 200 records, 88%; 400 records, 96%; 600 records, 98%.]

Fig 5.2: Accuracy graph of Random Forest

CONCLUSION AND FUTURE ENHANCEMENT

Using big data analytics, the healthcare data generated continuously in the medical field can be processed faster for predicting diseases, without additional overhead. As the data grows, there will not be any decrease in the performance of the system. The experimental outcome on heart disease attributes supports the accuracy of prediction and the efficiency of computation time with respect to scalable, ever-growing data. In the future, the approach can be validated on larger supervised datasets running on a cluster setup with even more optimized algorithm parameters. It can also be validated on unsupervised datasets to check prediction accuracy.

REFERENCES

[1] Mu-Hsing Kuo, Dillon Chrimes, Belaid Moa, Wei Hu, "Design and Construction of a Big Data Analytics Framework for Health Applications", 2015 IEEE International Conference on Smart City/SocialCom/SustainCom together with DataCom 2015 and SC2 2015.
[2] K. Rajalakshmi and K. Nirmala, "In Heart Disease Prediction with MapReduce by using Weighted Association Classifier and K-Means", Indian Journal of Science and Technology, Vol 9(19), DOI: 10.17485/ijst/2016/v9i19/93827, May 2016.
[3] Purushottam, Prof. (Dr.) Kanak Saxena, Richa Sharma, "Efficient Heart Disease Prediction System using Decision Tree", International Conference on Computing, Communication and Automation (ICCCA2015).
[4] Jian Fu, Junwei Sun, Kaiyuan Wang, "SPARK - A Big Data Processing Platform for Machine Learning", 2016 International Conference on Industrial Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration.
[5] Patil R Priya, Kinariwala A S, "Automated Diagnosis of Heart Disease using Random Forest Algorithm", International Journal of Advance Research, Ideas and Innovations in Technology.
[6] Sun J., Reddy C.K., Big Data Analytics for Healthcare. Tutorial presentation at the SIAM International Conference on Data Mining, Austin, TX, 2013.
[7] Hughes G., How big is 'Big Data' in healthcare? URL: http://blogs.sas.com/content/hls/2011/10/21/how-big-is-big-datainhealthcare/ [accessed 2014-9-26].
[8] Herland et al., "A review of data mining using big data in health informatics", Journal of Big Data 2014, 1:2.
[9] https://hortonworks.com/apache/hdfs/
[10] https://spark.apache.org/
[11] http://data-flair.training/blogs/hadoop-mapreducevs-apache-spark/
[12] https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[13] https://www.stat.berkeley.edu/~breiman/RandomForests/
[14] Ankush Verma, Ashik Hussain Mansuri, and Dr. Neelesh Jain, "Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison", 2016 Symposium on Colossal Data Analysis and Networking (CDAN).
[15] Amit Nandi, Spark for Python Developers.


Copyright@IDL-2017

