

IDL - International Digital Library Of Technology & Research Volume 1, Issue 5, May 2017

Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

Effect of Data Size on Feature Set Using Classification in Health Domain

Uttham H1*, Gowramma2
1 PG-Student, 2 Associate Professor, Dept. of Computer Science & Engineering, D.B.I.T, Bangalore, Karnataka, India.
1* utthamhmanju@gmail.com, 2 gowramma@gmail.com

ABSTRACT: In the health domain, a critical issue is the prediction of disease at an early stage. Prediction of disease depends mainly on the experience of the physician, so many machine learning approaches have been applied to disease prediction. Existing approaches concentrate on either prediction or feature selection. The aim of this paper is to present the effect of data size and feature set on the prediction of disease in the health domain using Naïve Bayes. It shows how each attribute, or combination of attributes, behaves on datasets of different sizes.

Keywords: Machine Learning, Classification, Naïve Bayes, feature selection.

1. INTRODUCTION

In the health domain, diagnosis of disease is a very challenging task. Early prediction can be made based on some lab tests. Using the lab test report, the physician decides whether the patient has the disease or not, but prediction of disease by a physician depends mainly on experience. If the physician has more experience, he may predict well; if the physician has less experience, he may predict wrongly. To overcome this problem, machine learning has many approaches, like KNN, SVM, and ANN, to predict correctly. Machine learning is a branch of science that allows a machine to make decisions. To make decisions, the machine has to learn by itself or by experience. There are three types of learning: supervised learning, unsupervised learning, and reinforcement learning.

The aim of this study is to find the effect on performance of different feature sets, using WEKA, on different sizes of the Pima Indian Diabetes Dataset. A critical challenge in medical science is to attain the correct diagnosis. For a correct diagnosis, many tests are generally done. All of these test procedures are said to be necessary in order to reach the ultimate diagnosis. On the other hand, too many tests could complicate the main diagnosis process and make it difficult to obtain the results. This kind of difficulty could be resolved with the aid of machine learning, which can be used directly to obtain the result with the aid of several classification techniques.

Machine learning covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioural tendency by experience." Zoologists and psychologists study learning in animals and humans [1]. The extraction of important information from a large pile of data and its correlations is often the advantage of using machine learning.

Humans are constantly discovering new knowledge about tasks. There is a constant stream of new events in the world, and continuing redesign of Artificial Intelligence systems to conform to new knowledge is impractical, but machine-learning methods might be able to track much of it [1]. A substantial amount of research has been done with machine learning algorithms such as Bayes network, Multilayer Perceptron, decision tree and pruning like J48graft and C4.5, Single Conjunctive Rule Learner like FLR and JRip, Fuzzy Inference System, and Adaptive Neuro-Fuzzy Inference System.

2. RELATED WORK

A good number of studies have been reported in the literature on the diagnosis of different diseases. Sapna and Tamilarasi [2] proposed a technique based on diabetic neuropathy. This nerve disorder is caused by diabetes mellitus. Long-term diabetic patients could develop diabetic neuropathies very easily. There is a fifty percent (50%) probability of having such diseases, which affect many nerves of the body. For example, the body wall and limbs (served by what are called somatic nerves) could be affected, as could internal organs like the heart and stomach (served by what are known as autonomic nerves). In that paper, the risk factors and symptoms of diabetic neuropathy are used to make the fuzzy relation equation. The fuzzy relation equation is linked with the perception of composition of binary relations, which means they used a Multilayer Perceptron NN with a Fuzzy Inference System.

Carnimeo and Giaquinto [6] proposed automatic detection of diabetic symptoms in retinal images by using a multilayer perceptron neural network. The network is trained using algorithms for evaluating the optimal global threshold, which can minimize pixel classification errors. System performance is evaluated by means of an adequate index that provides a percentage measure in the detection of eye suspect regions, based on a neuro-fuzzy subsystem.

Radha and Rajagopalan [4] introduced an application of fuzzy logic to the diagnosis of diabetes. It describes the fuzzy sets and linguistic variables that contribute to the diagnosis of disease, particularly diabetes. Fuzzy logic is a computational paradigm that provides a mathematical tool for dealing with uncertainty. The paper also presents a computer-based fuzzy logic system with maximum and minimum relationship and membership values, consisting of the components specifying the fuzzy set framework. Data from forty patients were collected to strengthen this relationship.

Ensan, Yaghmaee, and Bagheri [7] proposed a fuzzy clustering technique (FACT), which determines the appropriate number of clusters based on the pattern essence. Different experiments for algorithm evaluation were performed, which showed better performance compared to the widely used K-means clustering algorithm. Data was taken from the UCI Machine Learning Repository [3].

3. DATA SET DESCRIPTION

The characteristics of the data set used in this research are summarized in Table 1. Detailed descriptions of the data set, which contains 768 instances, are available at the UCI repository [3].

Table 1. Characteristics of data sets

Dataset                       Pima Indian diabetes
No. of examples               768
Input attributes              8
Output classes                2
Total number of attributes    9
Missing attributes status     No
Noisy attribute status        No

Sl number   Attributes
0           Number of times pregnant
1           Plasma glucose concentration at 2 hours in an oral glucose tolerance test
2           Diastolic blood pressure (mm Hg)
3           Triceps skin fold thickness (mm)
4           2-hour serum insulin (mu U/ml)
5           Body mass index (weight in kg/(height in m)^2)
6           Diabetes pedigree function
7           Age (years)
8           Class variable (0 or 1)

4. METHODOLOGY

In this paper, we use machine learning techniques, namely the Naïve Bayes classification technique, for classification of the diabetes data.


4.1. Naïve Bayes

The Naïve Bayes [5] classifier provides a simple approach, with clear semantics, to representing and learning probabilistic knowledge. It is termed naïve because it relies on two important simplifying assumptions: it assumes that the predictive attributes are conditionally independent given the class, and it assumes that no hidden or latent attributes influence the prediction process.

The Naïve Bayes classifier is a simple supervised probabilistic classifier based on Bayes' theorem:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{1}$$

$$P(c \mid x) \propto P(x_1 \mid c)\,P(x_2 \mid c)\cdots P(x_n \mid c)\,P(c) \tag{2}$$

where P(c|x) is the posterior probability of the class (high-risk or low-risk) given the predictors x_1, ..., x_n, calculated as in (2); P(c) is the prior probability of the class; P(x|c) is the likelihood, which is the probability of the predictor given the class; and P(x) is the prior probability of the predictor.

5. PERFORMANCE METRICS

We measure the performance of the classifiers with respect to different performance metrics: precision, recall, and F-measure.

Precision (p): provides correctness. Precision is calculated with respect to a particular class and is defined as

$$p = \frac{\text{correctly classified positives}}{\text{total predicted as positive}}$$

Recall (r): provides completeness. Recall is calculated with respect to a particular class and is defined as

$$r = \frac{\text{correctly classified positives}}{\text{total positives}}$$

F-Measure (F): the harmonic mean of precision and recall. The F-measure is calculated with respect to a particular class and is defined as

$$F = \frac{2\,r\,p}{r + p}$$

6. EXPERIMENTAL WORK
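As a rough illustration of the three quantities recorded in the experiments, the precision, recall, and F-measure definitions can be sketched in a few lines of Python. This is not the authors' Java/WEKA code; the function name `precision_recall_f` and the label lists are hypothetical and chosen only for the example.

```python
def precision_recall_f(y_true, y_pred, positive=1):
    """Return (precision, recall, F-measure) for the class `positive`."""
    # Correctly classified positives: predicted positive and actually positive.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)  # total predicted as positive
    actual_pos = sum(1 for t in y_true if t == positive)     # total positives
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f_measure = 2 * recall * precision / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical labels (1 = diabetic, 0 = non-diabetic), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
p, r, f = precision_recall_f(y_true, y_pred)
print(p, r, f)
```

Returning 0.0 when a denominator is zero is a design choice made for this sketch; other tools may report an undefined value instead.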


This experiment was carried out with open source tools in a Windows environment using the Eclipse software. We used the Java code and libraries available in WEKA. The experiment follows the procedure below.

We divide our data set into training sets and testing sets to apply supervised learning. We use the Naive Bayes classifier to explore our data set, primarily because previous work has shown that this algorithm presents a good trade-off between simplicity and accuracy. Patients are classified into one of two classes: (i) 'diabetic' or (ii) 'non-diabetic'. We use 10-fold cross validation in training and then apply the model to our testing set.

Consider 100% of the data, meaning the full set of instances. Then, for each possible subset of features (for example, if there are 8 attributes then there are 2^8 - 1 = 255 possible subsets), apply 10-fold cross validation to build the model and note down the precision, recall, and F-score values. Repeat the experiment for 90%, 80%, 70%, 60%, and 50% of the data. By conducting this experiment, we learn how each feature, or combination of features, acts on different sizes of data.

7. RESULT ANALYSIS AND DISCUSSION

In this paper, we examine the effect of data size on feature set using the Naïve Bayes classifier. For each attribute set, for example, if there are 8 attributes then 2^8 - 1 = 256 - 1 = 255 subsets are possible. For each subset a graph is generated, which shows the performance of each subset. The following figure shows the effect of features (0,1,2,4) on different data sizes.
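The count of 2^8 - 1 = 255 non-empty attribute subsets can be checked with a short Python sketch. The helper name `nonempty_subsets` is hypothetical, and the indices 0 to 7 are assumed to follow the attribute numbering of the data set description.

```python
from itertools import combinations


def nonempty_subsets(n_attributes):
    """Yield every non-empty subset of attribute indices 0..n_attributes-1."""
    indices = range(n_attributes)
    for size in range(1, n_attributes + 1):
        for subset in combinations(indices, size):
            yield subset


subsets = list(nonempty_subsets(8))
print(len(subsets))  # 255
```

Each subset is a sorted tuple of indices, so the feature sets named in the text, such as (0, 1, 2, 4), appear exactly once in the list.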


The graph below shows the effect of the attribute subset (2,4,5,6) on different data sizes.
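The data-size experiment behind such graphs can be sketched as follows. This is a minimal, self-contained Python illustration and not the authors' WEKA pipeline: it assumes Gaussian likelihoods for numeric attributes, uses simple interleaved folds instead of WEKA's cross-validation, takes the first n instances of an already shuffled list as the reduced data set, and runs on synthetic data, since the Pima records are not reproduced here. All function names are hypothetical.

```python
import math
import random


def fit_gaussian_nb(rows, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    model = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        stats = []
        for col in zip(*members):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9  # avoid zero variance
            stats.append((mean, var))
        model[c] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the largest log posterior (P(x) is a shared constant)."""
    def log_posterior(c):
        prior, stats = model[c]
        total = math.log(prior)
        for v, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


def cross_val_f_measure(rows, labels, features, k=10, positive=1):
    """k-fold cross validation using only the given feature indices."""
    data = [[r[i] for i in features] for r in rows]
    tp = pred_pos = act_pos = 0
    for fold in range(k):
        test_idx = set(range(fold, len(data), k))  # interleaved fold (an assumption)
        train = [(data[i], labels[i]) for i in range(len(data)) if i not in test_idx]
        model = fit_gaussian_nb([x for x, _ in train], [y for _, y in train])
        for i in test_idx:
            guess = predict(model, data[i])
            tp += guess == positive and labels[i] == positive
            pred_pos += guess == positive
            act_pos += labels[i] == positive
    p = tp / pred_pos if pred_pos else 0.0
    r = tp / act_pos if act_pos else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


# Synthetic stand-in data: 200 instances with 8 hypothetical numeric attributes.
rng = random.Random(0)
rows, labels = [], []
for _ in range(200):
    y = rng.randint(0, 1)
    rows.append([rng.gauss(2.0 * y, 1.0) for _ in range(8)])
    labels.append(y)

# Repeat the evaluation on 100%, 90%, ..., 50% of the data for one feature subset.
results = {}
for percent in (100, 90, 80, 70, 60, 50):
    n = len(rows) * percent // 100
    results[percent] = cross_val_f_measure(rows[:n], labels[:n], features=(2, 4, 5, 6))

for percent, f in results.items():
    print(percent, round(f, 3))
```

Plotting `results` for each of the 255 subsets would give one curve per subset; how the reduced data sets were actually drawn in the original experiment is not stated, so the prefix sampling here is only one possible choice.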


8. CONCLUSION AND FUTURE WORK

The objective of this study is to evaluate the effect of data size on feature set and to investigate the performance using the Naïve Bayes algorithm based on WEKA. The experiment shows how each attribute, or combination of attributes, affects prediction performance on different data sizes, i.e., for each possible subset of attributes.

As future work, we can conduct the same experiment on different data sets, for example a heart attack dataset and a diabetes dataset. From the experiment we can combine the common attributes that affect prediction. We can also work with different classification algorithms.

9. REFERENCES

[1] N. J. Nilsson, "Introduction to Machine Learning," 2010. http://ai.stanford.edu/~nilsson/mlbook.html

[2] M. S. Sapna and D. A. Tamilarasi, "Fuzzy Relational Equation in Preventing Neuropathy Diabetic," International Journal of Recent Trends in Engineering, Vol. 2, No. 4, 2009, p. 126.

[3] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/MLRepository.html

[4] R. Radha and S. P. Rajagopalan, "Fuzzy Logic Approach for Diagnosis of Diabetes," Information Technology Journal, Vol. 6, No. 1, pp. 96-102. doi:10.3923/itj.2007.96.102

[5] G. H. John and P. Langley, "Estimating Continuous Distributions in Bayesian Classifiers," Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, San Francisco, 1995, pp. 338-345.

[6] L. Carnimeo and A. Giaquinto, "An Intelligent System for Improving Detection of Diabetic Symptoms in Retinal Images," IEEE International Conference on Information Technology in Biomedicine, Ioannina, 26-28 October 2006.

[7] F. Ensan, M. H. Yaghmaee and E. Bagheri, "Fact: A New Fuzzy Adaptive Clustering Technique," The 11th IEEE Symposium on Computers and Communications, Sardinia, 26-29 June 2006, pp. 442-447. doi:10.1109/ISCC.2006.73


Copyright@IDL-2017

