
Proc. of Int. Conf. on Control, Communication and Power Engineering 2010

Supervised Feature Subset Selection Based On Extended Fuzzy Relative Information Measure For Boundary Samples

K. Sarojini1*, K. Thangavel2, and D. Devakumari3
1 S.N.R. Sons College/Department of Computer Applications, Coimbatore, India-641 015. Email: saromaran@gmail.com
2 Periyar University/Department of Computer Science, Salem, India-636 011. Email: drktvelu@yahoo.com
3 Govt. Arts & Science College/Department of Computer Science, Dharmapuri, India. Email: ramdevshri@yahoo.com

Abstract—Feature subset selection is an essential preprocessing task in data mining. This paper presents a new method called Extended Fuzzy Relative Information Measure for Boundary Samples (EFRIMBS) for supervised feature subset selection. The proposed algorithm uses boundary samples instead of the full set of samples. First, discretization algorithms such as K-Means, Fuzzy C-Means and Median as initial centroid of K-Means are applied to discretize the numeric features and construct the membership functions of each fuzzy set of a feature. Then the proposed EFRIMBS is applied to select the feature subset, focusing on boundary samples. J.D. Shie and S.M. Chen's fuzzy entropy measure (FE) for feature subset selection is also applied with the same discretization algorithms, and its results are compared with those of the proposed algorithm. While the FE based feature selection algorithm is efficient only for smaller datasets, the proposed method is efficient for both small and large datasets, selecting a minimum number of features for the feature subset. The experimental results on UCI datasets show that the proposed method produces better results than the existing one.

Index Terms—feature selection, fuzzy entropy, extended fuzzy relative information measure, boundary samples.

I. INTRODUCTION

Feature selection refers to choosing a subset [1] of attributes from the set of original attributes. This paper presents supervised feature subset selection based on EFRIMBS. First, discretization algorithms are used to construct the membership function of each fuzzy set of a feature. Then the feature subset is selected based on the proposed EFRIM, focusing on boundary samples. The proposed method can select a feature subset with a minimum number of features. The experiments are carried out using data sets taken from the UCI Repository of Machine Learning Databases (ftp://ftp.ics.res.uci.edu/pub/machinelearning-databases/). The proposed feature subset selection method selects a feature subset with fewer features for small and large datasets when compared to the fuzzy entropy based feature subset selection method. The rest of this paper is organized as follows. Section II briefly reviews related work on feature subset selection. Section III presents the proposed EFRIMBS for feature subset selection. An experimental analysis is performed in Section IV. Section V concludes the paper.

II. LITERATURE REVIEW

A. Fuzzy Entropy Measure

Fuzzy sets are powerful mathematical tools for modeling and controlling uncertain systems. Entropy measures the impurity of a collection. Some existing entropy measures are found in [2, 3, 4, 5]. In [4], Lee et al. presented a fuzzy entropy based on class degree.

B. Information Gain Measure and FRIM

Information gain [6, 7] measures the purity of a collection. The FRIM [8] can be used to measure the degree of similarity between two fuzzy sets.

C. Fuzzy Entropy Based Feature Selection

In [5], Jen-Da Shie and Shyi-Ming Chen presented a method for feature subset selection based on a fuzzy entropy measure. It considers boundary samples while selecting features, but it measures only the impurity of a feature, and it is efficient only for smaller datasets.

D. Discretization and Boundary Samples

Discretization methods are used to reduce the values of a continuous feature into intervals. Discretization makes learning more accurate and faster [9, 10]. In dimension reduction problems [5], boundary samples play an important part in affecting the results. When a two-dimensional feature space is reduced to a one-dimensional feature space, the entropy of the data increases because some information is omitted during the reduction. This means that samples incorrectly classified by one feature could be correctly classified by other features. Boundary samples are the samples incorrectly classified by a feature, so feature subset selection should focus on boundary samples, as the sketch below illustrates.
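To make the role of boundary samples concrete, the following minimal Python sketch identifies them for a single fuzzified feature. The max-membership assignment and the majority-class rule per fuzzy set are illustrative assumptions on our part, not the exact construction of [5].

```python
import numpy as np

def boundary_samples(memberships, labels):
    """Return indices of the boundary samples of one feature.

    memberships: (n_samples, n_fuzzy_sets) membership grades of each
                 sample in each fuzzy set of the feature.
    labels:      (n_samples,) integer class labels.
    """
    dominant = memberships.argmax(axis=1)       # dominant fuzzy set per sample
    n_sets = memberships.shape[1]
    set_class = np.full(n_sets, -1)
    for a in range(n_sets):
        members = labels[dominant == a]
        if members.size:                        # majority class of this fuzzy set
            set_class[a] = np.bincount(members).argmax()
    predicted = set_class[dominant]
    return np.where(predicted != labels)[0]     # misclassified = boundary samples
```

For two overlapping classes along one feature, the samples in the overlap region receive the wrong majority label and are returned as boundary samples.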

III. PROPOSED WORK

A. Methodology

The proposed method, EFRIMBS, treats feature subset selection as a dimension reduction problem which focuses on the "boundary samples" instead of the full set of samples. Combining both the fuzzy entropy and the information gain measure, the proposed EFRIMBS is set up; it measures the purity of the boundary samples. EFRIM(f, C) is called the extended fuzzy relative information measure of a feature f with respect to the class set C. It is defined as:

EFRIM(f, C) = (H(C) − H(f | C)) / H(C) = (FE(C) − FE(f | C)) / FE(C)    (1)

The fuzzy entropy of a feature, FE(f | C), is defined as:

FE(f | C) = Σ_{Ã ∈ V} (S_Ã / S) · FE(Ã)    (2)

where FE(Ã) denotes the fuzzy entropy of the fuzzy set Ã, V denotes the set of fuzzy sets of feature f, S_Ã denotes the sum of the membership grades of the samples belonging to the fuzzy set Ã, and S denotes the sum of the membership grades of the samples belonging to every fuzzy set of feature f. The fuzzy entropy of the class set C with respect to the two class labels is

FE(C) = −((P+) log2(P+) + (P−) log2(P−))    (3)

where P+ and P− are the proportions of samples in the positive and negative classes.



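As a concrete reading of Eqs. (1)-(3), the sketch below computes FE(f | C), FE(C) and EFRIM(f, C) for a two-class problem. The class-degree form of FE(Ã) is our assumption, loosely following [4, 5], so the code is illustrative rather than a definitive implementation of the measure.

```python
import numpy as np

def fe_of_fuzzy_set(memberships, labels):
    """FE(A~) of one fuzzy set, using class degrees as probabilities."""
    total = memberships.sum()
    if total == 0:
        return 0.0
    fe = 0.0
    for c in np.unique(labels):
        cd = memberships[labels == c].sum() / total   # class degree of class c
        if cd > 0:
            fe -= cd * np.log2(cd)
    return fe

def fe_feature(memberships, labels):
    """FE(f|C) of Eq. (2): membership-weighted sum over the fuzzy sets in V."""
    S = memberships.sum()                             # total membership, all sets
    return sum((memberships[:, a].sum() / S)
               * fe_of_fuzzy_set(memberships[:, a], labels)
               for a in range(memberships.shape[1]))

def fe_class(labels):
    """FE(C) of Eq. (3), assuming binary labels with priors P+ and P-."""
    p_pos = np.mean(labels == 1)
    fe = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:
            fe -= p * np.log2(p)
    return fe

def efrim(memberships, labels):
    """EFRIM(f, C) of Eq. (1): relative reduction of fuzzy entropy."""
    fc = fe_class(labels)
    return (fc - fe_feature(memberships, labels)) / fc if fc else 0.0
```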

B. The Proposed Algorithm

The proposed work involves two phases, Phase I and Phase II.

Phase I

The first phase uses the K-Means, Fuzzy C-Means and Median as initial centroid of K-Means clustering algorithms to discretize the numerical attributes and constructs triangular membership functions to fuzzify all numeric features. The steps are given here.

Step 1: Initially, set the number K of clusters to 2.
Step 2: Use the (K-Means, Fuzzy C-Means or Median as initial centroid of K-Means) clustering algorithm to generate K cluster centers based on the values of a feature, where K ≥ 2.
Step 3: Construct the membership functions of the fuzzy sets as triangular membership functions based on these K cluster centers, respectively.
Step 4: Calculate the fuzzy entropy of feature f using the class degree.
Step 5: Calculate the information gain of feature f:

IG(X | Y) = H(X) − H(X | Y)    (4)

Step 6: Calculate EFRIM for feature f using the fuzzy entropy and the information gain.
Step 7: If the decreasing rate of the EFRIM of the feature is larger than the threshold value Tc given by the user, where Tc ∈ [0, 1], then let K = K + 1 and go to Step 2. Otherwise, let K = K − 1 and stop.
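A minimal sketch of the Phase I fuzzification (Steps 2 and 3), assuming triangular membership functions that peak at the sorted cluster centers and extend as shoulders at the two ends of the range; the scikit-learn KMeans call is used for brevity, and Fuzzy C-Means or median-initialized K-Means could be substituted in Step 2.

```python
import numpy as np
from sklearn.cluster import KMeans

def triangular_memberships(values, k=2, random_state=0):
    """Fuzzify one numeric feature with k triangular fuzzy sets.

    Each fuzzy set peaks at a cluster center and falls linearly to zero
    at the neighboring centers; the first and last sets are extended so
    that extreme values keep full membership.
    """
    values = np.asarray(values, dtype=float)
    centers = np.sort(
        KMeans(n_clusters=k, n_init=10, random_state=random_state)
        .fit(values.reshape(-1, 1))
        .cluster_centers_.ravel())
    grades = np.zeros((values.size, k))
    for a, c in enumerate(centers):
        up = np.ones_like(values)       # rising edge (left shoulder at the low end)
        down = np.ones_like(values)     # falling edge (right shoulder at the high end)
        if a > 0:
            up = (values - centers[a - 1]) / (c - centers[a - 1])
        if a < k - 1:
            down = (centers[a + 1] - values) / (centers[a + 1] - c)
        grades[:, a] = np.clip(np.minimum(up, down), 0.0, 1.0)
    return grades, centers
```

Step 7 would then drive this routine: repeat with K = K + 1 while the decreasing rate of EFRIM exceeds Tc, otherwise fall back to K − 1 clusters and stop.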

Phase II

Assume that a set R of samples is divided into a set C of classes, where R = {r1, r2, . . . , rc}, F denotes the set of candidate features and FS denotes the selected feature subset. The algorithm for feature subset selection is as follows:

Step 1: Based on [5], construct the extension matrix EMf. For each feature f, the membership grades of its values in the fuzzy sets are defined in EMf. For these fuzzy sets, calculate the class degree with respect to each class and the fuzzy entropy. Then calculate the fuzzy entropy and information gain of each feature, and from these measures calculate the extended fuzzy relative information measure EFRIM(f).
Step 2: Put the feature with the maximum EFRIM(f) into the selected feature subset FS and remove it from the set F of candidate features:
  Let EFS = EFRIM(f), where f = argmax_{f ∈ F} EFRIM(f);
  Let FS = {f};
  Let F = F − {f};
Step 3: Repeatedly put the feature which can increase the EFRIMBS of the feature subset into FS until no such feature exists:
  Repeat {
    For each f ∈ F, do {
      Based on [5], calculate EM_{FS ∪ {f}} according to the maximum class degree threshold value Tr given by the user, where Tr ∈ [0, 1]: EM_{FS ∪ {f}} = CEM(FS, f, Tr);
      Based on [5], calculate the fuzzy entropy BSFFE(FS, f) of the feature subset FS ∪ {f} focusing on the boundary samples;
      Based on (1), calculate EFRIMBS(FS, f) of the feature subset FS ∪ {f} focusing on the boundary samples;
    };
    Let f = argmax_{f ∈ F} EFRIMBS(FS, f), one of the features that maximizes EFRIMBS(FS, f);
    Let D = EFRIMBS(FS, f) − EFS;
    Let EFS = EFRIMBS(FS, f);
    Let F = F − {f};
  } Until (EFS = 0 or D ≤ 0 or F = {});
FS is the selected feature subset.
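The Phase II loop amounts to a greedy forward selection, sketched below under the assumption that a helper efrimbs_of_subset(FS) computes the EFRIMBS of a subset on its boundary samples (via the CEM and BSFFE constructions of [5]); that helper and the other names here are hypothetical.

```python
def select_features(features, efrimbs_of_subset):
    """Greedy forward selection following the Phase II steps."""
    F = set(features)
    # Steps 1-2: seed FS with the single feature of maximum EFRIM.
    f0 = max(F, key=lambda f: efrimbs_of_subset({f}))
    FS = {f0}
    e_fs = efrimbs_of_subset(FS)                # EFS
    F.discard(f0)
    # Step 3: repeatedly add the feature that most increases EFRIMBS.
    while F and e_fs > 0:                       # stop when EFS = 0 or F is empty
        f = max(F, key=lambda g: efrimbs_of_subset(FS | {g}))
        d = efrimbs_of_subset(FS | {f}) - e_fs  # D = EFRIMBS(FS, f) - EFS
        if d <= 0:                              # stop when D <= 0
            break
        FS.add(f)
        F.discard(f)
        e_fs += d                               # EFS = EFRIMBS(FS, f)
    return FS
```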

IV. EXPERIMENTAL ANALYSIS AND DISCUSSION

The proposed method has been implemented using MatLab version 7.0 and is compared with the fuzzy entropy based feature selection method with respect to the number of features in the selected subset. The datasets used here are first discretized to transform the numerical data into categorical data. Then the proposed and fuzzy entropy algorithms are run on the discretized datasets to find the subsets of features. The numbers of selected features are presented in Table I. From Table I, it is found that the number of features selected by the proposed method is smaller than that of the fuzzy entropy based feature selection method. For almost all datasets, the proposed algorithm selects very few features under all discretization methods, whereas fuzzy entropy based feature selection gives good results only for smaller datasets. The threshold value Tc is used for constructing the membership functions, and the threshold value Tr is used for feature subset selection; the values of Tc and Tr used for the different datasets are shown in Table II. Figure 1 shows the average number of selected features for the proposed and FE based feature selection methods. The analysis shows that the K-Means and FCM based proposed EFRIMBS method gives a very small number of features for feature selection when compared to the FE based feature selection method. Through this analysis, the proposed EFRIMBS with FCM based feature selection is shown to be very efficient in selecting a minimum number of features.




TABLE I. A COMPARATIVE ANALYSIS FOR FEATURE SELECTION (NUMBER OF SELECTED FEATURES)

Data set | Raw  | Fuzzy Entropy Based Reduction        | Proposed EFRIMBS Based Reduction
         | Data | K-Means | FCM | Median as Initial    | K-Means | FCM   | Median as Initial
         |      |         |     | Centroid             |         |       | Centroid
WPBC     | 33   | 3       | 3   | 33                   | 2       | 2     | 2
SONAR    | 60   | 19      | 37  | 40                   | 2       | 2     | 2
WINE     | 13   | 3       | 3   | 13                   | 4       | 2     | 3
ECOLI    | 7    | 3       | 3   | 7                    | 3       | 2     | 2
IONO     | 34   | 34      | 3   | 34                   | 2       | 2     | 34
PIMA     | 8    | 2       | 2   | 8                    | 2       | 2     | 2
IRIS     | 4    | 2       | 2   | 4                    | 2       | 2     | 4
MPG      | 7    | 3       | 3   | 7                    | 3       | 3     | 7
AVERAGE  | 20.75| 8.625   | 7   | 18.25                | 2.5     | 2.125 | 7

TABLE II. THE THRESHOLD VALUES Tc AND Tr FOR DIFFERENT DATASETS

Data set | Threshold value (Tc) | Threshold value (Tr)
WPBC     | 0.5                  | 0.8
SONAR    | 0.2                  | 0.75
WINE     | 0.2                  | 0.5
ECOLI    | 0.2                  | 0.3
IONO     | 0.2                  | 0.7
PIMA     | 0.2                  | 0.75
IRIS     | 0.2                  | 0.9
MPG      | 0.03                 | 0.6

Fig. 1: Comparison of Selected Features. The chart plots the average number of selected features (y-axis: Number of Features) for the raw data and for the fuzzy entropy based and proposed methods under K-Means, FCM and Median as initial centroid discretization (Raw 20.75; FE: K-Means 8.625, FCM 7, Median 18.25; Proposed: K-Means 2.5, FCM 2.125, Median 7).

V. CONCLUSIONS

This paper presents a novel method for feature subset selection based on the proposed EFRIMBS. First, the numerical attributes are discretized using benchmark clustering algorithms such as K-Means, Fuzzy C-Means and Median as initial centroid of K-Means. Then the proposed EFRIMBS is applied for feature subset selection focusing on boundary samples. The performance of the proposed feature subset selection based on EFRIMBS is compared with that of the FE based feature selection method. The experimental results show that fuzzy entropy based feature selection yields a minimum number of features only for smaller datasets, whereas the proposed EFRIMBS with the FCM method is efficient for both smaller and larger datasets, selecting very few features for the feature subset.

REFERENCES

[1] S.M. Chen, "A new approach to handling fuzzy decision making problems", IEEE Trans. Syst. Man Cybern. 18(6):1012-1016, 1988.
[2] E.C.C. Tsang, D.S. Yeung, X.Z. Wang, "OFFSS: optimal fuzzy-valued feature subset selection", IEEE Trans. Fuzzy Syst. 11(2):202-213, 2003.
[3] S.M. Chen, C.H. Chang, "A new method to construct membership functions and generate weighted fuzzy rules from training instances", Cybern. Syst. 36(4):397-414, 2005.
[4] H.M. Lee, C.M. Chen, J.M. Chen, Y.L. Jou, "An efficient fuzzy classifier with feature selection based on fuzzy entropy", IEEE Trans. Syst. Man Cybern. Part B Cybern. 31(3):426-432, 2001.
[5] J.D. Shie, S.M. Chen, "Feature subset selection based on fuzzy entropy measures for handling classification problems", Appl. Intell. 28:69-82, 2008.
[6] J.D. Shie, S.M. Chen, "A new approach for handling classification problems based on fuzzy information gain measures", Proc. 2006 IEEE Int. Conf. on Fuzzy Systems, Vancouver, BC, Canada, pp. 5427-5434, 2006.
[7] Lei Yu, Huan Liu, "Redundancy based feature selection for microarray data", Proc. 2004 ACM SIGKDD, pp. 737-742, 2004.
[8] Shi-Fei Ding, Shi-Xiong Xia, Feng-Xiang Jin, Zhong-Zhi Shi, "Novel fuzzy information proximity measures", Journal of Information Science 33(6):678-685, 2007.
[9] Marzuki, F. Ahmad, "Data mining discretization methods and performance", Proc. Int. Conf. on Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia, June 2007.
[10] H. Liu et al., "Discretization: an enabling technique", Data Mining and Knowledge Discovery 6:393-423, 2002.


