Enhancement in Boosting Efficiency via Feature Selection and Clustering of Input Dataset


IJIRST – International Journal for Innovative Research in Science & Technology | Volume 3 | Issue 02 | July 2016 | ISSN (online): 2349-6010

Dincy George, M. Tech. Student, Department of Computer Science & Engineering, NCERC Thrissur, India

Naveen Raja S M, Assistant Professor, Department of Computer Science & Engineering, NCERC Thrissur, India

Hafzal Rahman M J, Assistant Professor, Department of Computer Science & Engineering, NCERC Thrissur, India

Abstract
Learning is an inevitable aspect of the field of Artificial Intelligence, and supervised learning systems are of great importance in the current era. Boosting is an iterative technique for improving the predictive accuracy of such systems. It works by learning multiple functions, with each function built on the output labels of its predecessor. Real-world datasets still pose problems with label noise and, in the case of complex datasets, overfitting. To mitigate these issues, the dataset is clustered using efficient algorithms, grouping the most similar member data together, and the clustered dataset is then fed into the boosting process. This improves predictive accuracy and lessens overfitting. This work first analyses the variation in predictive accuracy of popular boosting techniques on clustered and non-clustered datasets. After the analysis, a feature selection based approach is proposed to mitigate the identified issues, which can enhance the efficiency of the system in terms of both time and memory.

Keywords: Artificial Intelligence, Clustering Algorithms, Label Noise, Supervised Learning Systems

I. INTRODUCTION

Machine learning is a prominent domain of computer science, with types distinguished by the nature of learning: supervised, unsupervised and reinforcement learning are the major categories. Supervised learning systems are continuously monitored to ensure accuracy and efficiency, so accuracy prediction is an important aspect of such systems. Boosting is an iterative process capable of improving predictive accuracy. Each subsequent function in a boosting technique relies on the values generated by the previous functions, and each function predicts the succeeding data instances using a weighted vote. A more refined decision boundary on the training data can thus be obtained by combining multiple functions, which is more effective than using a single function. Even though boosting has a wide range of advantages, it still has limitations. It falls short where the labels of the provided instances are actually wrong, where irrelevant features exist in the training dataset, and where overfitting of the training data occurs due to complex and long functions. These common issues are mitigated by using a clustered dataset as the input for boosting, which can augment the way in which boosting learns functions [1]. This paper first analyses the variation in predictive accuracy between clustered and non-clustered datasets. It then proposes a feature selection based approach that keeps only relevant data points, improving predictive accuracy while maintaining or improving the time and memory efficiency of the clustered boosting approach.

II. RELATED WORK

Boosting is done by two different methods: by resampling or by reweighting. Both methods exhibit a similar mode of execution: the probability (weight) increases for incorrectly classified instances while it decreases for correct ones. There are different ways in which boosting and clustering are used together in supervised learning systems: boosting can be used to bump up the efficiency of clustering by predicting accuracy; clustering and boosting can be used together to augment the competence of supervised learning systems; and clustering can be utilized to improve the boosting technique itself. Certain works employ boosting and clustering to improve the splitting points of decision trees [5]. Combinations of clustering and boosting generally use the k-means algorithm for creating clusters [6]; such applications focus mainly on mitigating issues associated with label noise. Regularized boosting [4] is a popular cluster based boosting approach which regularizes the margin or other such additional information. It works by removing instances prone to be noisy. However, it operates on all the training data without clustering, because of which it may fail to identify troublesome areas.
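As a concrete illustration of the reweighting method mentioned at the start of this section, the following Python sketch shows the standard AdaBoost weight update, in which the weights of misclassified instances grow and those of correctly classified instances shrink. The function name and the clipping guard are illustrative choices, not taken from the paper.

```python
import numpy as np

def adaboost_weight_update(weights, y_true, y_pred):
    """One AdaBoost reweighting step: raise the weights of misclassified
    instances, lower those of correctly classified ones, then renormalize."""
    incorrect = y_true != y_pred
    # weighted error of the current weak learner, clipped for numerical stability
    error = np.clip(np.sum(weights[incorrect]) / np.sum(weights), 1e-10, 1 - 1e-10)
    # vote (alpha) of this learner; larger when its error is smaller
    alpha = 0.5 * np.log((1.0 - error) / error)
    # multiply by exp(+alpha) when wrong, exp(-alpha) when right
    weights = weights * np.exp(np.where(incorrect, alpha, -alpha))
    return weights / np.sum(weights), alpha
```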


III. CLUSTERED BOOSTING

This approach forms clusters from the dataset by combining the most similar objects together, and then feeds these clusters into the corresponding boosting technique. AdaBoost, short for Adaptive Boosting, is an algorithm for constructing a strong classifier by combining multiple weak classifiers, and is one of the simplest and most conventional boosting techniques. On benchmark datasets, clustered input exhibits a much higher accuracy level than plain AdaBoost. The clustered approach is better at choosing when to perform boosting within clusters to tackle troublesome areas, and when to abstain from boosting to deal with problematic training data or label noise. It thereby drives boosting to finer precision than AdaBoost and addresses the issues of boosting on all the training data at once. It works with high efficiency on benchmark datasets; however, on real-time datasets concerns such as class imbalance could affect it, and this needs to be handled by fine-tuning the minority label estimate in datasets prone to such issues.
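A minimal sketch of this cluster-then-boost idea, using k-means for clustering and AdaBoost within each cluster, is given below. The per-cluster structure and function names are simplifying assumptions for illustration; the exact procedure of the cited cluster-based boosting work differs in how it decides when to boost.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier

def clustered_boosting(X, y, n_clusters=5):
    """Group similar instances with k-means, then fit a separate
    AdaBoost ensemble on each cluster's portion of the training data."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    models = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        labels_in_cluster = np.unique(y[mask])
        if labels_in_cluster.size < 2:
            # pure cluster: nothing to boost, just remember its label
            models[c] = ("const", labels_in_cluster[0])
        else:
            clf = AdaBoostClassifier(n_estimators=50).fit(X[mask], y[mask])
            models[c] = ("ada", clf)
    return km, models

def clustered_predict(km, models, X):
    """Route each test instance to the model of its nearest cluster."""
    out = []
    for i, c in enumerate(km.predict(X)):
        kind, m = models[c]
        out.append(m if kind == "const" else m.predict(X[i:i + 1])[0])
    return np.array(out)
```

Routing test instances through the same clustering that partitioned the training data keeps each ensemble specialized to its own region of the input space, which is the source of the accuracy gain described above.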

Fig. 1: Graph of the clustered approach

Since the formation of clusters reduces the number of data points to be considered, both the time complexity and the memory usage of the clustered method are found to be lower than those of the method without clustering.

Fig. 2: Graph of the clustered approach

Even though the complexities are reduced, the prediction is not highly accurate beyond the gain over normal boosting techniques, which may be due to the inclusion of irrelevant data points in a cluster. Regularized boosting works with two different algorithms, namely BrownBoost and AdaBoostKL. The number of functions learned by BrownBoost is comparatively smaller than that of AdaBoost, and it was found to stop too early [7], sometimes with only the initial function. AdaBoostKL had issues with its weight values: the sum of the weights learned for the subsequent functions is sometimes smaller than the weight of the initial function, indicating that the subsequent functions had no relevance in predicting the label because of the dominant vote of the initial function.

Fig. 3: Graph of the clustered approach

Time complexity is reduced for the clustered dataset because clustering lowers the amount of data to be processed.

Fig. 4: Graph of the clustered approach

The higher weight values learned by the initial function increase both the values in subsequent iterations and the number of iterations, so no visible improvement in memory usage is found in regularized boosting even after clustering. Clustering simplifies the boosting process by breaking up the training data into clusters of similar instances. Selective boosting can then be applied more precisely to the instances within the same cluster, which helps distinguish between instances that are merely difficult for the initial function and instances that genuinely have label noise. The BrownBoost algorithm, by contrast, is not capable of boosting on all the training data without the risk of excluding instances that do not have label noise. In comparison, clustered boosting therefore achieves higher efficiency than BrownBoost, while AdaBoostKL's performance comes close to that of clustered boosting because of its selective nature. Boosting with clustered datasets is found to be more efficient than conventional techniques that use clusters only for preprocessing: it handles the major shortcomings of boosting by filtering out functions that fit nuances of the training data and the overfitting caused by incorrect instances. From this survey it can be concluded that boosting with a clustered dataset accomplishes superior predictive accuracy to prominent boosting techniques such as AdaBoost [2] and Regularized Boost [3]. But it is possible to reduce the time and space complexity further, and the accuracy of prediction also needs to be improved in order to have a precise forecasting system. This can be achieved by completely eliminating irrelevant data points from the clusters.
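The selective, per-cluster decision described above can be sketched as a simple rule on the initial function's error within each cluster. The thresholds and the rule itself are hypothetical simplifications for illustration; the cited cluster-based boosting work uses its own criteria for when to boost and when to abstain.

```python
import numpy as np

def boost_or_abstain(initial_clf, X_cluster, y_cluster, low=0.1, high=0.45):
    """Hypothetical rule: abstain when the initial function already fits
    the cluster well (little to gain) or errs so badly that label noise
    is likely (boosting would fit the noise); boost otherwise."""
    error = np.mean(initial_clf.predict(X_cluster) != y_cluster)
    if error < low:
        return "abstain"   # cluster already well learned
    if error > high:
        return "abstain"   # likely label noise; boosting risks overfitting
    return "boost"         # troublesome but learnable region
```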


IV. FEATURE SELECTION BEFORE CLUSTERING

Feature selection is the process of selecting a set of relevant features from the thousands of existing features. It is used where models must be simplified for easier interpretation, to reduce training time, and to improve generalization so as to reduce overfitting. Therefore, to overcome the issues identified in the analysis, feature selection can be performed before clustering: the feature-selected dataset is given as input to the desired clustering algorithm, after which the clustered boosting process continues as before. Any suitable filter based feature selection approach can be used to select the relevant features, as this reduces the amount of data. When feature selection is applied, the amount of data to be processed is reduced, and the inclusion of irrelevant data points in a cluster can also be avoided. With less data to process, the processing time drops considerably, and the comparatively high memory utilization is likewise avoided. Because the selected set of data is obtained before clustering, the accuracy of prediction also remains high. Thus, performing feature selection prior to clustering can enhance boosting performance; a sketch of this pipeline is given below.
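A minimal sketch of the proposed pipeline, assuming a filter based selector (SelectKBest with the ANOVA F-test is one common filter; the paper does not fix a particular method), followed by k-means clustering and per-cluster AdaBoost as in Section III:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier

def fs_clustered_boosting(X, y, k_features=10, n_clusters=5):
    """Proposed order of operations: filter-based feature selection
    first, then clustering, then boosting within each cluster."""
    selector = SelectKBest(score_func=f_classif, k=min(k_features, X.shape[1]))
    X_sel = selector.fit_transform(X, y)   # keep only the relevant features
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_sel)
    models = {}
    for c in range(n_clusters):
        mask = km.labels_ == c
        labels_in_cluster = np.unique(y[mask])
        if labels_in_cluster.size < 2:
            models[c] = ("const", labels_in_cluster[0])
        else:
            clf = AdaBoostClassifier(n_estimators=50).fit(X_sel[mask], y[mask])
            models[c] = ("ada", clf)
    return selector, km, models
```

Selecting features before clustering means the clusters are formed in the reduced feature space, which is what keeps irrelevant dimensions from pulling dissimilar instances into the same cluster.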

V. CONCLUSION

Boosting is of high significance in the field of artificial intelligence, and the accuracy of its outcome is therefore an important aspect. To enhance the efficiency of boosting, a cluster based approach was introduced, which our analysis found to have certain shortcomings. These were mitigated by introducing feature selection prior to clustering, which helped improve accuracy while minimizing time and memory complexities. Thus feature selection and clustering, when combined, can produce a highly efficient and accurate outcome for conventional boosting processes. In this work we considered only datasets with numerical data, which works well for the UCI benchmark datasets; for real world datasets, the existence of non-numerical data also needs to be considered.

REFERENCES

[1] L. Dee Miller and Leen-Kiat Soh, "Cluster-Based Boosting," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 6, June 2015.
[2] Shuquiong Wu and Hiroshi Nagahashi, "Parameterized AdaBoost: Introducing a Parameter to Speed up the Training of Real AdaBoost," IEEE Signal Processing Letters, vol. 21, no. 6, June 2014.
[3] Ke Chen and Shihai Wang, "Semi-Supervised Learning via Regularized Boosting Working on Multiple Semi-Supervised Assumptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, January 2011.
[4] B. Wu and R. Nevatia, "Cluster boosted tree classifier for multi-view, multi-pose object detection," in Proc. IEEE Int. Conf. Comput. Vis., 2007, pp. 1-8.
[5] D.-S. Kim, Y.-M. Baek, and W.-Y. Kim, "Reducing overfitting of AdaBoost by clustering-based pruning of hard examples," in Proc. Int. Conf. Ubiquitous Inform. Manage. Commun., 2013, pp. 1-3.
[6] M. Warmuth, K. Glocer, and G. Ratsch, "Boosting algorithms for maximizing the soft margin," in Proc. Int. Conf. Neural Inform. Process. Syst., 2007, pp. 1-8.
