IDL - International Digital Library Of Technology & Research Volume 1, Issue 5, May 2017
Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Effect of Data Size on Feature Set Using Classification in Health Domain
Uttham H1*, Gowramma2
1 PG-Student, 2 Associate Professor, Dept. of Computer Science & Engineering, D.B.I.T, Bangalore, Karnataka, India.
1* utthamhmanju@gmail.com, 2 gowramma@gmail.com
ABSTRACT: In the health domain, a critical issue is the prediction of disease at an early stage. Prediction of disease depends mainly on the experience of the physician, so many machine learning approaches have been applied to disease prediction. Existing approaches concentrate on either prediction or feature selection. The aim of this paper is to present the effect of data size and feature set on the prediction of disease in the health domain using Naïve Bayes. The study shows how each attribute, or combination of attributes, behaves on datasets of different sizes.

Keywords: Machine Learning, Classification, Naïve Bayes, feature selection.

1. INTRODUCTION

In the health domain, diagnosis of disease is a very challenging task. Early prediction can be made based on lab tests. Using the lab test report, the physician decides whether or not the patient has the disease, but prediction of disease by a physician depends mainly on experience. A physician with more experience may predict well, while a physician with less experience may predict wrongly. To overcome this problem, machine learning offers many approaches, such as KNN, SVM and ANN, to predict correctly. Machine learning is a branch of science that allows a machine to make decisions. To make decisions, the machine has to learn by itself or from experience. There are three types of learning: supervised learning, unsupervised learning and reinforcement learning.

The aim of this study is to find the effect on performance of different feature sets, using WEKA, on different sizes of the Pima Indian Diabetes Dataset. A critical challenge in medical science is to attain the correct diagnosis. For a correct diagnosis, many tests are generally done, and all of these test procedures are said to be necessary in order to reach the ultimate diagnosis. On the other hand, too many tests could complicate the main diagnosis process and make the results difficult to obtain. This kind of difficulty could be resolved with the aid of machine learning, which can be used directly to obtain the result by means of several classification techniques. Machine learning covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioural tendency by experience". Zoologists and psychologists study learning in animals and humans [1]. The extraction of important information from a large pile of data and its correlations is often the advantage of using machine learning.
Humans are constantly discovering new knowledge about tasks. There is a constant stream of new events in the world, and continuing redesign of artificial intelligence systems to conform to new knowledge is impractical, but machine learning methods might be able to track much of it [1]. A substantial amount of research has been done with machine learning algorithms such as Bayes network, Multilayer Perceptron, decision trees and pruning like J48graft and C4.5, Single Conjunctive Rule Learners like FLR and JRip, Fuzzy Inference System and Adaptive Neuro-Fuzzy Inference System.

2. RELATED WORK

A good number of studies have been reported in the literature on the diagnosis of different diseases. Sapna and Tamilarasi [2] proposed a technique based on diabetic neuropathy. This nerve disorder is caused by diabetes mellitus, and long-term diabetic patients can develop diabetic neuropathies very easily. There is a fifty percent (50%) probability of having such diseases, which affect many nerves of the body. For example, the body wall and limbs (served by the somatic nerves) could be affected, as could internal organs such as the heart and stomach (served by the autonomic nerves). In their paper, the risk factors and symptoms of diabetic neuropathy are used to form the fuzzy relation equation. The fuzzy relation equation is linked with the perception of composition of binary relations; that is, they used a Multilayer Perceptron NN with a Fuzzy Inference System.

Radha and Rajagopalan [4] introduced an application of fuzzy logic to the diagnosis of diabetes. It describes the fuzzy sets and linguistic variables that contribute to the diagnosis of disease, particularly diabetes. Fuzzy logic is a computational paradigm that provides a mathematical tool for dealing with uncertainty. The paper also presents a computer-based fuzzy logic approach with maximum and minimum relationships and membership values, consisting of the components that specify the fuzzy set framework. Data from forty patients were collected to strengthen this relationship.

Carnimeo and Giaquinto [6] proposed automatic detection of diabetic symptoms in retinal images by using a multilayer perceptron neural network. The network is trained using algorithms for evaluating the optimal global threshold, which can minimize pixel classification errors. System performance is evaluated by means of an adequate index that provides a percentage measure in the detection of suspect eye regions based on a neuro-fuzzy subsystem.

Ensan, Yaghmaee and Bagheri [7] proposed a fuzzy clustering technique (FACT) which determines the appropriate number of clusters based on the pattern essence. Different experiments for algorithm evaluation were performed, which showed better performance compared to the widely used K-means clustering algorithm. Data was taken from the UCI Machine Learning Repository [3].

3. DATA SET DESCRIPTION

The characteristics of the data set used in this research are summarized in Table 1. The detailed description of the data set, which contains 768 instances, is available at the UCI repository [3].

Table 1. Characteristics of data sets

Dataset                       Pima Indian diabetes
No. of examples               768
Input attributes              8
Output classes                Two
Total number of attributes    Nine
Missing attributes status     No
Noisy attributes status       No

Sl number   Attribute
0           Number of times pregnant
1           Plasma glucose concentration at 2 hours in an oral glucose tolerance test
2           Diastolic blood pressure (mm Hg)
3           Triceps skin fold thickness (mm)
4           2-hour serum insulin (mu U/ml)
5           Body mass index (weight in kg/(height in m)^2)
6           Diabetes pedigree function
7           Age (years)
8           Class variable (0 or 1)

4. METHODOLOGY

In this paper, we use machine learning techniques, namely the Naïve Bayes classification technique, for classification of the diabetes data.
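To make the classifier concrete, the following is a minimal sketch of a naive Bayes classifier for numeric attributes, written in Python rather than the Java/WEKA code used in the study. It assumes that each attribute is normally distributed within a class, which is one common way to handle continuous attributes (see [5]). The function names `fit_gaussian_nb`, `posterior` and `predict` are hypothetical, the rows at the bottom are toy values rather than Pima data, and this is not WEKA's implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-attribute (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):  # iterate attribute by attribute
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (prior, stats)
    return model


def posterior(model, row):
    """Return P(c|x) for every class, normalised so the values sum to 1."""
    log_scores = {}
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # log of the normal density: the attribute likelihood P(x_i | c)
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        log_scores[label] = score
    top = max(log_scores.values())
    unnormalised = {c: math.exp(s - top) for c, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}


def predict(model, row):
    """Pick the class with the highest posterior probability."""
    post = posterior(model, row)
    return max(post, key=post.get)


# Toy values for illustration only (two attributes, two classes).
toy_rows = [[1.0, 20.0], [1.2, 22.0], [0.8, 19.0], [5.0, 40.0], [5.2, 42.0], [4.8, 41.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_model = fit_gaussian_nb(toy_rows, toy_labels)
```

Working in log space keeps the product of many small likelihoods from underflowing; the normalisation at the end plays the role of dividing by P(x).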
4.1. Naïve Bayes

The Naïve Bayes [5] classifier provides a simple approach, with clear semantics, to representing and learning probabilistic knowledge. It is termed naïve because it relies on two important simplifying assumptions: it assumes that the predictive attributes are conditionally independent given the class, and it assumes that no hidden or latent attributes influence the prediction process. The Naïve Bayes classifier is a simple supervised-learning probabilistic classifier based on Bayes' theorem:

\[ P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{1} \]

\[ P(c \mid x) \propto P(x_1 \mid c)\,P(x_2 \mid c)\cdots P(x_n \mid c)\,P(c) \tag{2} \]

where x_1, ..., x_n are the predictors in the feature set, P(c|x) is the posterior probability of the class (high-risk or low-risk) given the predictors, calculated as in (2), P(c) is the prior probability of the class, P(x|c) is the likelihood, which is the probability of the predictor given the class, and P(x) is the prior probability of the predictor.

5. PERFORMANCE METRICS

We measure the performance of the classifiers with respect to different performance metrics: precision value, recall value and F-measure value.

Precision value (p): provides correctness. The precision is calculated with respect to a particular class and is defined as

\[ p = \frac{\text{correctly classified positives}}{\text{total predicted as positive}} \]

Recall value (r): provides completeness. The recall is calculated with respect to a particular class and is defined as

\[ r = \frac{\text{correctly classified positives}}{\text{total positives}} \]

F-Measure (F): the harmonic mean of the precision value and the recall value. The F-measure is calculated with respect to a particular class and is defined as

\[ F = \frac{2\,r\,p}{r + p} \]

6. EXPERIMENTAL WORK
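As an illustration of the metrics defined in Section 5, the sketch below computes precision, recall and F-measure for one class from lists of actual and predicted labels. The function name `precision_recall_f` and the example labels are hypothetical; returning 0.0 when a denominator is zero is an assumption of this sketch, not something the study specifies.

```python
def precision_recall_f(actual, predicted, positive):
    """Return (precision, recall, F-measure) with respect to the class `positive`."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    predicted_pos = sum(1 for p in predicted if p == positive)
    actual_pos = sum(1 for a in actual if a == positive)

    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up labels: 1 = diabetic, 0 = non-diabetic.
example_actual = [1, 1, 1, 0, 0, 0, 1, 0]
example_predicted = [1, 0, 1, 1, 0, 0, 1, 0]
example_scores = precision_recall_f(example_actual, example_predicted, positive=1)
```

Here three of the four predicted positives are correct and three of the four actual positives are found, so all three metrics equal 0.75.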
This experiment was carried out with open source tools in a Windows environment using the Eclipse software. We used Java code and the libraries available in WEKA. The experiment follows the procedure below.

We divide our data set into training sets and testing sets to apply supervised learning. We use the Naive Bayes classifier to explore our data set, primarily because previous work has shown that this algorithm presents a good trade-off between simplicity and accuracy. Patients are classified into one of two classes: (i) 'diabetic' or (ii) 'non-diabetic'. We use 10-fold cross validation in training and then apply the model to our testing set.

First consider 100% of the data, that is, all instances. For each possible subset of features (for example, with 8 attributes there are 2^8 - 1 = 255 possible non-empty subsets), apply 10-fold cross validation to build the model and note down the precision value, recall value and F-score value. Repeat the experiment for 90%, 80%, 70%, 60% and 50% of the data. By conducting this experiment we learn how each feature, or combination of features, acts on different sizes of data.

7. RESULT ANALYSIS AND DISCUSSION

In this paper, we examine the effect of data size on feature set using the Naïve Bayes classifier. For each attribute set (for example, with 8 attributes, 2^8 - 1 = 256 - 1 = 255 subsets are possible), a graph is generated for each subset, which shows the performance of that subset on different data sizes.

[Figure: effect of features (0, 1, 2, 4) on different data sizes]
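The count of 2^8 - 1 = 255 feature subsets quoted above can be checked with a short sketch that enumerates every non-empty combination of the attribute indices 0 to 7. The helper name `nonempty_subsets` is hypothetical, and the study's Java/WEKA code may enumerate the subsets in a different order.

```python
from itertools import combinations


def nonempty_subsets(n_attributes):
    """Yield every non-empty subset of the attribute indices 0..n_attributes-1."""
    indices = range(n_attributes)
    for size in range(1, n_attributes + 1):
        for subset in combinations(indices, size):
            yield subset


subsets = list(nonempty_subsets(8))
count = len(subsets)  # 255, i.e. 2**8 - 1
```

Each tuple, such as `(0, 1, 2, 4)`, names the attribute columns that would be kept when training and evaluating one model.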
[Figure: effect of attribute subset (2, 4, 5, 6) on different data sizes]
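Each point in these graphs corresponds to one data size, from 100% down to 50% of the 768 instances, evaluated with 10-fold cross validation. The sketch below shows one way such a reduced sample and its folds could be produced. How the reduced samples were actually drawn is not stated in the study, so the seeded random sampling here is an assumption. The names `take_fraction`, `k_fold_splits` and `FRACTIONS` are hypothetical, and WEKA's own cross validation may randomise or stratify differently.

```python
import random


def take_fraction(n_instances, fraction, seed=0):
    """Return a reproducible random sample of instance indices (assumed sampling scheme)."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    keep = round(n_instances * fraction)
    return sorted(indices[:keep])


def k_fold_splits(indices, k=10):
    """Yield (train, test) index lists; every index appears in exactly one test fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


FRACTIONS = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]

# One list of ten (train, test) splits per data size.
splits_by_fraction = {
    fraction: list(k_fold_splits(take_fraction(768, fraction)))
    for fraction in FRACTIONS
}
```

For every data size, a model would be trained on each `train` list and scored on the matching `test` list, and the metrics averaged over the ten folds.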
8. CONCLUSION AND FUTURE WORK

The objective of this study is to evaluate the effect of data size on feature set and to investigate the performance using the Naïve Bayes algorithm based on WEKA. The experiment shows how each attribute, or combination of attributes, affects performance on different data sizes; that is, how each possible subset of attributes affects prediction performance on different sizes of data.

As future work, the same experiment can be conducted on different data sets, for example a heart attack data set and a diabetes data set; from those experiments, the attributes that commonly affect prediction can be combined. We can also work with different classification algorithms.

9. REFERENCES

[1] N. J. Nilsson, "Introduction to Machine Learning," 2010. http://ai.stanford.edu/~nilsson/mlbook.html
[2] M. S. Sapna and D. A. Tamilarasi, "Fuzzy Relational Equation in Preventing Neuropathy Diabetic," International Journal of Recent Trends in Engineering, Vol. 2, No. 4, 2009, p. 126.
[3] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/MLRepository.html
[4] R. Radha and S. P. Rajagopalan, "Fuzzy Logic Approach for Diagnosis of Diabetes," Information Technology Journal, Vol. 6, No. 1, pp. 96-102. doi:10.3923/itj.2007.96.102
[5] G. H. John and P. Langley, "Estimating Continuous Distributions in Bayesian Classifiers," Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, San Francisco, 1995, pp. 338-345.
[6] L. Carnimeo and A. Giaquinto, "An Intelligent System for Improving Detection of Diabetic Symptoms in Retinal Images," IEEE International Conference on Information Technology in Biomedicine, Ioannina, 26-28 October 2006.
[7] F. Ensan, M. H. Yaghmaee and E. Bagheri, "Fact: A New Fuzzy Adaptive Clustering Technique," The 11th IEEE Symposium on Computers and Communications, Sardinia, 26-29 June 2006, pp. 442-447. doi:10.1109/ISCC.2006.73