International Journal of Computer & Organization Trends – Volume 3 Issue 11 – Dec 2013

An Efficient Extended Attribute Selection Method for Classification

S. Rajeev 1, Mrs. N. Rajeswari 2
1 M.Tech Student, CSE Department, Gudlavalleru Engineering College, Gudlavalleru, Krishna (Dt)
2 Associate Professor, CSE Department, Gudlavalleru Engineering College, Gudlavalleru, Krishna (Dt)

Abstract – In recent years, many applications of data mining deal with high-dimensional data (a very large number of features), which imposes a high computational cost as well as the risk of overfitting. In these cases it is common practice to adopt a feature selection method to improve generalization accuracy. Data quantity is the main issue in the small data set problem, because insufficient data usually does not lead to robust classification performance. How to extract more effective information from a small data set is therefore of considerable interest. The present study is devoted not only to investigating the most relevant subset of features with minimum cardinality for achieving high predictive performance, by adopting the CFS and ChiSqr filtered feature selection techniques, but also to evaluating the goodness of subsets with different cardinalities and the quality of the two filtered feature selection algorithms in terms of the F-measure and Receiver Operating Characteristic (ROC) values generated by an improved SVM classifier with a radial basis kernel function. The results show that the proposed method has superior classification performance when compared to principal component analysis (PCA), kernel principal component analysis (KPCA), and kernel independent component analysis (KICA) with a Gaussian kernel in the support vector machine (SVM) classifier. The results of the present study effectively support the well known fact that predictive accuracy increases when a minimum number of features is used. The expected outcomes show a reduction in computational time and construction cost in both the training and classification phases.

Keywords – Attribute selection, Rules, Stemming, Probability.

I. INTRODUCTION

As the world grows in complexity, overwhelming us with the data it generates, data mining becomes the only practical means of elucidating the patterns that underlie it [1]. Manual data analysis becomes tedious as the size of the data grows and the number of dimensions increases, so the means of data analysis ought to be computerised.


Knowledge Discovery from Data (KDD) is the automated process of discovering knowledge from databases. KDD comprises several steps, namely data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge representation. Data mining is one step in this overall discovery process and can be described as the extraction, or mining, of knowledge from a large amount of data [2]. Data mining is essentially a form of knowledge discovery needed for solving problems in a specific domain. It can also be described as the non-trivial process of automatically collecting useful hidden information from data, returned in the form of rules, concepts, patterns and so forth [3]. The knowledge extracted through data mining allows the user to look for interesting patterns and regularities deeply buried in the data in order to support the selection process.

Data to be mined may contain many irrelevant attributes, and these should be removed. Moreover, many mining algorithms do not work efficiently with large numbers of features or attributes, so feature selection techniques should be applied before any mining algorithm is run. The main objectives of feature selection are to avoid overfitting, to improve model performance, and to obtain faster and more cost-effective models. Selecting optimal features adds an extra layer of complexity to modelling: instead of simply finding optimal parameters for the full set of features, an optimal feature subset is found first and then the model parameters are optimised. Attribute selection methods can be broadly divided into filter and wrapper approaches. In the filter approach, attribute selection is independent of the data mining algorithm that is later applied to the selected attributes; the relevance of features is assessed by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated and low-scoring features are removed. The subset of features left after feature removal is then presented as input to the classification algorithm.


The advantages of filter techniques are that they easily scale to high-dimensional datasets, are computationally simple and fast, and, because the filter is independent of the mining algorithm, feature selection needs to be performed only once, after which different classifiers can be evaluated. Their disadvantage is that they ignore the interaction with the classifier: most proposed techniques are univariate, meaning that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance than other kinds of feature selection techniques. To overcome the problem of ignoring feature dependencies, a range of multivariate filter techniques has been introduced, aiming to incorporate feature dependencies to some extent. Wrapper methods embed the model hypothesis search within the feature subset search. In the wrapper approach, the attribute selection method uses the result of the data mining algorithm to determine how good a given attribute subset is. In this setup a search procedure over the space of possible feature subsets is defined, and various subsets of features are generated and evaluated. The defining characteristic of the wrapper approach is that the quality of an attribute subset is measured directly by the performance of the data mining algorithm applied to that subset. The wrapper approach therefore tends to be much slower than the filter approach, since the data mining algorithm is run for every attribute subset considered during the search; if several different data mining algorithms are to be applied to the data, the wrapper approach becomes even more computationally expensive. Data mining algorithms can follow three different learning approaches: supervised, unsupervised or semi-supervised. In supervised learning, the algorithm works with examples whose labels are known. The labels may be nominal values in the case of a classification task, or numerical values for a regression task. In unsupervised learning, in contrast, the labels of the examples in the dataset are unknown, and the algorithm typically aims at grouping examples according to the similarity of their attribute values, characterising a clustering task. Finally, semi-supervised learning is usually used when a small subset of labelled examples is available together with a large number of unlabelled examples [1].
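To make the filter/wrapper contrast above concrete, the sketch below uses scikit-learn and a stand-in dataset (both assumptions; the paper itself reports Weka-style output): the filter step scores each feature with the χ² statistic independently of any classifier, while the wrapper step judges one hypothetical candidate subset by the cross-validated accuracy of the SVM that would eventually be used.

```python
# A minimal sketch of filter- versus wrapper-style feature selection.
# scikit-learn and the iris data are placeholders, not the paper's setup.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # stand-in dataset (non-negative features)

# Filter approach: score each feature independently of any classifier,
# then keep the k highest-scoring ones.
filter_selector = SelectKBest(score_func=chi2, k=2)
X_filtered = filter_selector.fit_transform(X, y)
print("chi-squared scores per feature:", filter_selector.scores_)

# Wrapper approach: judge a candidate subset by the accuracy of the
# classifier that will ultimately be used (here an SVM).
candidate_subset = [0, 2]                    # hypothetical candidate feature indices
scores = cross_val_score(SVC(kernel="rbf"), X[:, candidate_subset], y, cv=5)
print("wrapper estimate for subset", candidate_subset, ":", scores.mean())
```

A real wrapper search would repeat the last evaluation for many candidate subsets, which is why it is so much more expensive than the one-off filter scoring.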

II. LITERATURE SURVEY

In early research on feature construction, some systems focused on decision-tree-based algorithms. BACON is a program that discovers relationships among real-valued features of instances using multiply and divide operators. FRINGE constructs new features by conjoining pairs of features at the fringe of each of the positive branches of the decision trees. During each iteration, the newly constructed features and the existing features are used as the input space for the algorithm. CITRE uses a variety of operands, such as root, fringe and root-fringe, to construct new features, all of which use conjunction as the operator. Other attribute construction methods include genetic-based algorithms, which can be divided into two major categories: wrapper and non-wrapper approaches. In the wrapper genetic programming approach, in which the final learner is used as an indicator of the appropriateness of the constructed attributes, the constructed attributes are fed into the classifier and the classifier accuracy is used as a guide to rank them. The non-wrapper genetic programming approach is performed as a preprocessing phase; since no particular classifier is involved in evaluating the constructed attributes, it is expected to be more efficient and its results more general. The information gain (IG) and information gain ratio (IGR) are the fitness functions commonly used for constructing attributes.
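Since information gain is cited above as a common fitness function, the short sketch below (an illustration of the standard definition, not code from any of the surveyed systems; the toy attribute and class values are invented) computes IG as the reduction in class entropy obtained by splitting on a nominal attribute.

```python
# Illustrative information-gain computation for a nominal attribute.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(attribute_values, labels):
    # IG = H(class) - sum over attribute values of weighted subset entropies
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for att, lab in zip(attribute_values, labels) if att == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: the attribute separates the classes perfectly,
# so the gain equals the full class entropy (about 0.918 bits here).
attr = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
cls  = ["no",    "no",    "yes",  "yes",  "yes",      "yes"]
print(information_gain(attr, cls))
```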

Attribute Selection

Since not all the measured variables are important for understanding the underlying phenomena, dimension reduction is often possible in many cases.


There may be variables whose variance is less than the measurement noise; these are considered irrelevant to the model.

Feature Extraction

Feature extraction is a technique that projects the original features into a lower-dimensional feature space in order to reduce the number of data dimensions and improve analytical efficiency.
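As a rough sketch of such a projection (assuming scikit-learn's PCA and KernelPCA as stand-ins for the extractors compared later, and a placeholder dataset rather than the data used in this study), the fragment below maps the original attributes onto a smaller number of components.

```python
# Minimal feature-extraction sketch: project the original attributes onto a
# lower-dimensional space. scikit-learn and the dataset are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)            # scale before projecting

pca = PCA(n_components=10)                       # linear projection
X_pca = pca.fit_transform(X)

kpca = KernelPCA(n_components=10, kernel="rbf")  # non-linear (Gaussian-kernel) projection
X_kpca = kpca.fit_transform(X)

print(X.shape, "->", X_pca.shape, "and", X_kpca.shape)
```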

The figure shows the flow chart of the proposed method, which starts from collecting the data, building the MTD functions and computing the overlap area of the MTD functions, and then moves on to class-possibility attribute transformation, attribute construction, attribute merging and, finally, SVM model building.

III. PROPOSED SYSTEM

CHI (χ² STATISTIC): This method measures the lack of independence between a term and the category. Chi-squared is a common statistical test that measures divergence from the distribution expected if one assumes that the feature occurrence is actually independent of the class value. As a statistical test, it is known to behave erratically for very small expected counts, which are common in text classification, both because of rarely occurring word features and sometimes because of having few positive training examples for a concept. In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are the occurrence of the term and the occurrence of the class. Feature selection using the χ² statistic is analogous to performing a hypothesis test on the distribution of the class as it relates to the values of the feature in question. The null hypothesis is that there is no correlation: each value is as likely to have instances in any one class as in any other class. Under the null hypothesis, if p of the instances have a given value and q of the instances are in a specific class, then (p · q)/n instances have the given value and are in that class (n is the total number of instances in the dataset). This is because p/n instances have the value and q/n instances are in the class, and if the probabilities are independent (i.e. the null hypothesis holds) their joint probability is their product. Given the null hypothesis, the χ² statistic measures how far the observed counts O are from the expected counts E, summed over the cells of the term–class contingency table:

χ² = Σ (O − E)² / E
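To make the expected-count reasoning concrete, the toy computation below applies the formula above to a hypothetical 2x2 term–class contingency table; the counts are invented for illustration only.

```python
# Hypothetical 2x2 contingency table: rows = feature value present/absent,
# columns = class positive/negative. Counts are invented for illustration.
observed = [[30, 10],   # value present
            [20, 40]]   # value absent

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n   # (p * q) / n, as in the text
        chi2 += (observed[i][j] - expected) ** 2 / expected

print("chi-squared statistic:", round(chi2, 3))
```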

Confusion Matrix: A confusion matrix (Kohavi and Provost, 1998) contains information about the actual and predicted classifications produced by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. The table below shows the confusion matrix for a two-class classifier. The entries in the confusion matrix have the following meaning in the context of our study:


a is the number of correct predictions that an instance is negative,

b is the number of incorrect predictions that an instance is positive,

c is the number of incorrect predictions that an instance is negative, and

d is the number of correct predictions that an instance is positive.

                        Predicted
                   Negative   Positive
Actual   Negative      a          b
         Positive      c          d

Several standard terms have been defined for the two-class matrix:



The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

AC = (a + d) / (a + b + c + d)

The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

TP = d / (c + d)

The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, as calculated using the equation:

FP = b / (a + b)


The true negative rate (TN) is defined as the proportion of negative cases that were classified correctly, as calculated using the equation:

TN = a / (a + b)

The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation:

FN = c / (c + d)

Finally, precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

P = d / (b + d)
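The equations above can be collected into a few lines of code; the counts a, b, c, d below are hypothetical and serve only to illustrate the formulas (the F-measure reported later combines precision and recall).

```python
# Compute the standard two-class metrics from confusion-matrix counts.
# a, b, c, d follow the definitions above; the values are hypothetical.
a, b, c, d = 50, 10, 5, 35    # TN, FP, FN, TP counts

accuracy       = (a + d) / (a + b + c + d)
recall_tp_rate = d / (c + d)
fp_rate        = b / (a + b)
tn_rate        = a / (a + b)
fn_rate        = c / (c + d)
precision      = d / (b + d)
f_measure      = 2 * precision * recall_tp_rate / (precision + recall_tp_rate)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall_tp_rate:.3f} F-measure={f_measure:.3f}")
```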

LINEAR AND POLYNOMIAL SVM

A support vector machine is primarily a two-class classifier. It is possible to solve multi-class problems with support vectors by treating each single class as a separate problem. The SVM aims to maximise the width of the margin between classes, that is, the empty area between the decision boundary and the nearest training patterns. Given a set of points {x_i} in n-dimensional space with corresponding classes {y_i : y_i ∈ {−1, +1}}, the training algorithm attempts to place a hyperplane between the points where y_i = +1 and the points where y_i = −1. Once this has been achieved, a new pattern x can be classified by testing which side of the hyperplane the point lies on.

Radial basis function kernel with width σ:

K(x, x′) = exp(−‖x − x′‖² / (2σ²))
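A brief sketch of training such a classifier follows, using scikit-learn's SVC with an RBF kernel as a stand-in for the improved SVM used in this work; the dataset, the gamma value (which plays the role of 1/(2σ²)) and the train/test split are placeholders.

```python
# Train a two-class SVM with an RBF (Gaussian) kernel on placeholder data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", gamma=0.01, C=1.0)     # gamma corresponds to 1 / (2 * sigma**2)
clf.fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))
print(confusion_matrix(y_test, y_pred))        # rows: actual, columns: predicted
```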


IV. RESULTS

Attribute Evaluator (nominal, 40 class): Feature Filter ROBUST
selected attributes:
 0.16219   7  transform_6
 0.13081   3  transform_2
 0.13081   2  transform_1
 0.11552  13  transform_12
 0.09478   4  transform_3
 0.05674   8  transform_7
 0.0359   12  transform_11
 0.03445  25  transform_24
 0.0261   27  transform_26
 0.02276   5  transform_4
 0.02202  18  transform_17
 0.02077  21  transform_20
 0.02017  19  transform_18


 0.01635  28  transform_27
 0.01265  15  transform_14
 0.01195  38  transform_37
 0.01098  22  transform_21
 0.00908  36  transform_35
 0.00841  10  transform_9
 0.00795  30  transform_29
 0.00756  14  transform_13
 0.00677  17  transform_16
 0         6  transform_5
 0         1  transform_0
 0        11  transform_10
 0         9  transform_8
 0        34  transform_33
 0        32  transform_31
 0        33  transform_32
 0        39  transform_38
 0        35  transform_34
 0        37  transform_36
 0        23  transform_22
 0        16  transform_15
 0        20  transform_19
 0        29  transform_28
 0        31  transform_30
 0        24  transform_23
 0        26  transform_25

Selected attributes: 7,3,2,13,4,8,12,25,27,5,18,21,19,28,15,38,22,36,10,30,14,17,6,1,11,9,34,32,33,39,35,37,23,16,20,29,31,24,26 : 39

Attribute Evaluator (nominal, 40 class): KPCA Feature Extractor ROBUST
selected attributes:
 0.2185   13  transform_12
 0.2108   27  transform_26
 0.1929    7  transform_6
 0.1813   38  transform_37
 0.1691   30  transform_29
 0.1586   18  transform_17
 0.1432    4  transform_3
 0.1408   28  transform_27
 0.1378    3  transform_2
 0.1378    2  transform_1
 0.1087   36  transform_35
 0.0871   25  transform_24
 0.0794   21  transform_20
 0.0722    8  transform_7
 0.0589   12  transform_11
 0.05     22  transform_21
 0.0361   19  transform_18
 0.028    10  transform_9
 0.0277   17  transform_16
 0.0229    5  transform_4
 0.0204   15  transform_14
 0.0203   14  transform_13
 0         6  transform_5
 0         1  transform_0
 0        11  transform_10
 0         9  transform_8
 0        34  transform_33
 0        32  transform_31
 0        33  transform_32
 0        39  transform_38
 0        35  transform_34
 0        37  transform_36
 0        23  transform_22
 0        16  transform_15
 0        20  transform_19
 0        29  transform_28
 0        31  transform_30
 0        24  transform_23
 0        26  transform_25

Selected attributes: 13,27,7,38,30,18,4,28,3,2,36,25,21,8,12,22,19,10,17,5,15,14,6,1,11,9,34,32,33,39,35,37,23,16,20,29,31,24,26 : 39

ROBUST selected attributes:
 0.8297   1  0.33 transform_2 - 0.33 transform_1 - 0.287 transform_4 - 0.23 transform_8 + 0.226 transform_6 ...
 0.6938   2  0.328 transform_16 + 0.327 transform_21 + 0.32 transform_10 + 0.299 transform_28 + 0.272 transform_31 ...
 0.5835   3  0.374 transform_25 + 0.37 transform_30 + 0.342 transform_20 + 0.327 transform_34 + 0.256 transform_37 ...
 0.5195   4  -0.434 transform_19 - 0.38 transform_8 - 0.352 transform_14 + 0.264 transform_3 + 0.212 transform_34 ...
 0.4644   5  0.362 transform_12 - 0.314 transform_18 + 0.301 transform_9 - 0.285 transform_11 + 0.235 transform_5 ...
 0.4172   6  0.473 transform_12 - 0.297 transform_5 + 0.274 transform_31 - 0.272 transform_9 - 0.264 transform_27 ...
 0.3752   7  -0.448 transform_7 + 0.393 transform_3 - 0.376 transform_27 + 0.305 transform_9 - 0.298 transform_4 ...
 0.3354   8  0.528 transform_3 - 0.435 transform_9 - 0.314 transform_7 + 0.314 transform_27 - 0.275 transform_4 ...
 0.2977   9  -0.441 transform_26 + 0.371 transform_34 - 0.364 transform_33 - 0.357 transform_20 - 0.334 transform_29 ...
 0.2644  10  0.653 transform_13 + 0.393 transform_33 - 0.288 transform_26 + 0.226 transform_8 - 0.205 transform_14 ...
 0.2318  11  0.478 transform_13 + 0.453 transform_26 - 0.43 transform_33 - 0.34 transform_19 - 0.191 transform_14 ...
 0.2021  12  0.79 transform_32 - 0.252 transform_27 + 0.234 transform_28 - 0.219 transform_35 - 0.206 transform_31 ...
 0.173   13  -0.58 transform_22 - 0.505 transform_15 + 0.244 transform_27 + 0.238 transform_9 - 0.199 transform_10 ...
 0.1447  14  0.857 transform_17 - 0.198 transform_12 + 0.181 transform_15 + 0.173 transform_11 - 0.169 transform_18 ...
 0.1166  15  0.775 transform_29 - 0.307 transform_26 - 0.302 transform_33 - 0.22 transform_17 + 0.186 transform_25 ...
 0.0889  16  0.654 transform_23 - 0.574 transform_24 + 0.258 transform_29 - 0.192 transform_15 - 0.158 transform_32 ...
 0.062   17  0.719 transform_15 + 0.642 transform_22 + 0.165 transform_16 - 0.113 transform_27 - 0.092 transform_10 ...
 0.036   18  0.597 transform_36 + 0.518 transform_38 - 0.409 transform_37 - 0.349 transform_35 - 0.187 transform_23 ...

SVM KERNEL FUNCTION
Classifier for classes: 0, 1

    -0.2701 * pima_transform_1
 +   0.2701 * pima_transform_2
 +  -0.2045 * pima_transform_3
 +  -0.0656 * pima_transform_4
 +  -0.4121 * pima_transform_5
 +   0.6822 * pima_transform_6
 +  -0.139  * pima_transform_7
 +   0.0734 * pima_transform_8
 +  -0.3321 * pima_transform_9
 +  -0.0799 * pima_transform_10
 +   0.1083 * pima_transform_11
 +   0.5739 * pima_transform_12
 +  -0.2121 * pima_transform_13
 +   0.2855 * pima_transform_14
 +  -0.2516 * pima_transform_15
 +   0.1717 * pima_transform_16
 +   0.4647 * pima_transform_17
 +  -0.3564 * pima_transform_18
 +  -0.4977 * pima_transform_19
 +   0.7833 * pima_transform_20
 +   0.5947 * pima_transform_21
 +  -0.4231 * pima_transform_22
 +  -1.1781 * pima_transform_23
 +   0.8217 * pima_transform_24
 +   0.0638 * pima_transform_25
 +   0.7195 * pima_transform_26
 +   0.981  * pima_transform_27
 +  -0.3862 * pima_transform_28
 +   0.6545 * pima_transform_29
 +  -0.5907 * pima_transform_30
 +   0.2454 * pima_transform_31
 +  -0.6317 * pima_transform_32
 +  -0.7543 * pima_transform_33
 +   0.1636 * pima_transform_34
 +   1.1231 * pima_transform_35
 +  -0.8777 * pima_transform_36
 +   1.0819 * pima_transform_37
 +  -0.9183 * pima_transform_38
 -   0.5258

Number of kernel evaluations: 191468 (72.575%)
Selected attributes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18 : 18
Time taken to build model: 0.24 seconds
Time taken to test model on training data: 0.04 seconds

=== Confusion Matrix ===
    a    b   <-- classified as
  468   32 |  a = 0
   90  178 |  b = 1


V. CONCLUSION AND FUTURE SCOPE

This system successfully extends the attributes of large datasets. Feature selection plays a vital role in this work by selecting the most important, highest-ranked attributes for the classifier. The experiments show that our approach is able

to identify meaningful classification rules within an


acceptable execution time. This framework develops a new algorithm based on coherent rules, so that users can mine items without domain knowledge, and it can mine items more efficiently than association rules. Implication in propositional logic is a good alternative basis for the definition of association; rules based on this definition can be searched for and discovered within feasible time.

REFERENCES
[1] S. Beniwal and J. Arora, "Classification and Feature Selection Techniques in Data Mining," International Journal of Engineering Research & Technology (IJERT), vol. 1, issue 6, August 2012.
[2] C. Kadu, "Hybrid Approach to Improve Pattern Discovery in Text Mining," International Journal of Advanced Research in Computer and Communication Engineering.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 478-499, 1994.
[4] H. Ahonen, O. Heinonen, M. Klemettinen, and A. I. Verkamo, "Applying Data Mining Techniques for Descriptive Phrase Extraction in Digital Document Collections," Proc. IEEE Int'l Forum on Research and Technology Advances in Digital Libraries (ADL '98), pp. 2-11, 1998.
[5] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[7] T. Chau and A. K. C. Wong, "Pattern Discovery by Residual Analysis and Recursive Partitioning," IEEE Trans. Knowledge and Data Eng., vol. 11, pp. 833-852, Nov./Dec. 1999.
[8] N. Jindal, B. Liu, and E.-P. Lim, "Finding Unusual Review Patterns Using Unexpected Rules."
