International Journal of Engineering and Techniques - Volume 2 Issue 3, May – June 2016
RESEARCH ARTICLE
OPEN ACCESS
Classification of Thyroid Disease with Feature Selection Technique
Amit Kumar Dewangan1, Akhilesh Kumar Shrivas2, Prem Kumar3
1,2,3 Dr. C.V. Raman University, Kota, Bilaspur, Chhattisgarh, India
Abstract:
In medical science, accurate classification plays an important role in the diagnosis of thyroid diseases. Diagnosing a health condition is a challenging and important task because life is directly related to health. Classification is one of the important applications of data mining. In this research work, we have used various classification techniques to classify thyroid data. CART gives the highest accuracy, 99.47%, and is the best model. Feature selection plays a very important role in making a model computationally efficient and in increasing its performance. This research work focuses on the Info Gain and Gain Ratio feature selection techniques, which remove irrelevant features from the original data set and make the model computationally more efficient. We have applied both feature selection techniques to the best model, i.e. CART. Our proposed CART-Info Gain and CART-Gain Ratio models give 99.47% and 99.20% accuracy with 25 and 3 features, respectively.
Keywords: Thyroid, Classification, Feature Selection.
I. INTRODUCTION
Nowadays, data is increasing every day in every organization. Healthcare is one of the most important fields, where large amounts of patient data are stored every day. Data mining techniques play a very important role in the classification of such data, and classification is one of the important data mining applications that can be applied to medical data. In this research work, we have used various data mining based classification techniques to classify thyroid data.
Various authors have worked in the field of data classification. F. S. Gharehchopogh, et al. [1] proposed a Multilayer Perceptron (MLP) for thyroid disease classification and obtained 98.6% accuracy. M. R. Nazari Kousarrizi, et al. [2] suggested a support vector machine (SVM) for classification of thyroid diseases using two data sets: the first was collected from the UCI repository, and the second is real data gathered by the Intelligent System Laboratory of K. N. Toosi University of Technology from Imam Khomeini hospital. The proposed technique gave 98.62% accuracy with 3 features on the first data set. S. Gaikwad, et al. [3] suggested random forest for classification of thyroid data; the suggested model gave 96.63% accuracy. D. Snthikumar, et al. [4] suggested various classification techniques such as Naïve Bayes (NB) and k-Nearest Neighbor (K-NN). S. Panday, et al. [5] used various classifiers, namely C4.5, Random Forest, Multilayer Perceptron and Bayes Net, for classification of thyroid data. The C4.5 classifier gave 99.47% accuracy with a 5-feature subset and was the most robust classifier. A. Upadhyay, et al. [6] used a classification tree (CT) and Clark & Niblett (CN2) for classification of thyroid diseases; CN2 gave the better result. Two decision tree classifiers, C4.5 and C5.0, have also been applied to the classification of thyroid data; the C5.0 model gave 95% accuracy, which is better than the C4.5 classifier. D. Kerana Hanirex, et al. [7] suggested the NNge model for classification of thyroid data. The NNge classifier gave 96.44% accuracy with a reduced number of features. Md F. Kabir, et al. [8] suggested a naïve Bayes classifier for classification of thyroid diseases. The proposed model gave 94.13% accuracy as the best classifier. D. Lavanya, et al. [9] suggested the CART classifier and compared it with other decision tree classifiers, C4.5 and ID3, for classification of thyroid data. CART achieved the highest accuracy, 94.68%, as the best model.
ISSN: 2395-1303
II. METHODS AND MATERIALS
This section describes the data mining based classification techniques and the feature selection techniques used for classification of thyroid data. It also describes the data set and the performance measures, which play a major role in this research work.
A. C4.5
C4.5 [10] is an extension of ID3 that accounts for unavailable values, continuous attribute value ranges, pruning of decision trees and rule derivation. When building a decision tree, we can deal with training sets that have records with unknown attribute values by evaluating the gain, or the gain ratio, of an attribute using only the records for which that attribute's value is available. We can classify records that have unknown attribute values by estimating the probability of the various possible results. Unlike CART, which generates a binary decision tree, C4.5 produces a tree with a variable number of branches per node. When a discrete variable is
http://www.ijetjournal.org