INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
Hierarchical fuzzy rule based classification systems with genetic rule selection To Filter Unwanted Messages shaik masthan baba1, shaik aslam2 , K.Naresh babu3 1 Computer science and engineering. 2 Computer science and engineering 3 Computer science and engineering 1
masthan201@gmail.com,2 aslambasha592@gmail.com ,3 naresh.kosuri@gmail.com
Abstract: Social networking sites that facilitate communication of information between users allow users to post messages as an important function. Unnecessary posts could spam a user’s wall, which is the page where posts are displayed, thus disabling the user from viewing relevant messages. The aim of this paper is to improve the performance of fuzzy rule based classification systems on imbalanced domains, increasing the granularity of the fuzzy partitions on the boundary areas between the classes, in order to obtain a better separability. We propose the use of a hierarchical fuzzy rule based classification system using the neural network learning model to filter out unwanted messages from Online Social Networking (OSN) user walls, which is based on the refinement of a simple linguistic fuzzy model by means of the extension of the structure of the knowledge base in a hierarchical way and the use of a genetic rule selection process in order to get a compact and accurate model.
Keywords:On-line Social Networks, Classification,Fuzzy rule based classification systems,Imbalanced data-sets , Genetic rule selection
1.INTRODUCTION Online Social Networks (OSNs) are today one of the most popular interactive medium to communicate, share and disseminate a considerable amount of human life information. Daily and continuous communications imply the exchange of several types of content, including free text, image, audio and video data. The huge and dynamic character of these data creates the premise for the employment of web content mining strategies aimed to automatically discover useful information dormant within the data. They are instrumental to provide an active support in complex and sophisticated tasks involved in OSN management, such as for instance access control or information filtering. Information filtering has been greatly explored for what concerns textual documents and more recently, web content [1-3].Information filtering can therefore be used to give users the ability to automatically control the messages written on their own walls, by filtering out unwanted messages. Indeed, today OSNs provide very little support to prevent unwanted messages on user walls [4-6]. No content-based preferences are supported and therefore it is not possible to prevent undesired messages, Providing this service is not only a matter of using previously
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 92
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
defined web content mining techniques for a different application, rather it requires to design ad hoc classification strategies. The aim of the present work is therefore to propose and experimentally evaluate an automated system, called Filtered Wall (FW), able to filter unwanted messages from OSN user walls [7-9]. We exploit Machine Learning (ML) text categorization techniques to automatically assign with each short text message a set of categories based on its content. The major efforts in building a robust short text classifier (STC) are concentrated in the extraction and selection of a set of characterizing and discriminate features. The solutions investigated in which we inherit the learning model and the elicitation procedure for generating pre-classified data..In particular, we base the overall short text classification strategy on Radial Basis Function Networks (RBFN) for their proven capabilities in acting as soft classifiers, in managing noisy data and intrinsically vague classes We insert the neural model within a hierarchical two level classification strategy [8-10]. In the first level,the RBFNcategorizes short messages as Neutral and Nonneutral;in the second stage, No neutral messages are classified producing gradual estimates of appropriateness to each of the considered category. Besides classification facilities, the system provides a powerful rule layer exploiting a flexible language to specify Filtering Rules (FRs), by which users can state what contents should not be displayed on their walls. FRs can support a variety of different filtering criteria that can be combined and customized according to the user needs. In addition, the system provides the support for user-defined Blacklists (BLs), that is, lists of users that are temporarily prevented to post any kind of messages on a user wall [11-13]. The experiments we have carried out show the effectiveness of the developed
filtering techniques the proposal of a system to automatically filter unwanted messages from OSN user walls on the basis of both message content and the message creator relationships and characteristics.
2. LITERATURE REVIEW The main contribution of this paper is the design of a system providing customizable content-based message filtering for OSNs, based on ML techniques. As we have pointed out in the introduction, to the best of our knowledge we are the first proposing such kind of application for OSNs. However, our work has relationships both with the state of the art in content-based filtering, as well as with the field of policybased personalization for OSNs and, more in general, web contents. Therefore, in what follows, we survey the literature in both these fields. 2.1 Content-based filtering: Information filtering systems are designed to classify a stream of dynamically generated information dispatched asynchronously by an information producer and present to the user those information that are likely to satisfy his/her requirements [3]. In content-based filtering each user is assumed to operate independently. As a result, a content-based filtering system selects information items based on the correlation between the content of the items and the user preferences as opposed to a collaborative filtering system that chooses items based on the correlation between people with similar preferences. Documents processed in content-based filtering are mostly textual in nature and this makes content-based filtering close to text classification. The activity of filtering can be modeled, in fact, as a case of single label, binary classification, partitioning incoming
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 93
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
documents into relevant and non relevant categories [4]. More complex filtering systems include multi-label text categorization automatically labeling messages into partial thematic categories. Content-based filtering is mainly based on the use of the ML paradigm according to which a classifier is automatically induced by learning from a set of pre-classified examples. A remarkable variety of related work has recently appeared, which differ for the adopted feature extraction methods, model learning, and collection of samples [5], [6], [7], [8],[9]. The feature extraction procedure maps text into a compact representation of its content and is uniformly applied to training and generalization phases. The application of content-based filtering on messages posted on OSN user walls poses additional challenges given the short length of these messages other than the wide range of topics that can be discussed. Short text classification has received up to now few attention in the scientific community. Recent work highlights difficulties in defining robust features, essentially due to the fact that the description of the short text is concise, with many misspellings, non standard terms and noise. Focusing on the OSN domain, interest in access control and privacy protection is quite recent. As far as privacy is concerned, current work is mainly focusing on privacypreserving data mining techniques, that is, protecting information related to the network, i.e., relationships/nodes, while performing social network analysis [5]. Works more related to our proposals are those in the field of access control. In this field, many different access control models and related mechanisms have been proposed so far (e.g., [6,2,10]), which mainly differ on the expressivity of the access control policy language and on the way access control is enforced (e.g., centralized vs. decentralized). Most of these models express access control
requirements in terms of relationships that the requestor should have with the resource owner. We use a similar idea to identify the users to which a filtering rule applies. However, the overall goal of our proposal is completely different, since we mainly deal with filtering of unwanted contents rather than with access control. As such, one of the key ingredients of our system is the availability of a description for the message contents to be exploited by the filtering mechanism as well as by the language to express filtering rules. In contrast, no one of the access control models previously cited exploit the content of the resources to enforce access control. We believe that this is a fundamental difference. Moreover, the notion of black- lists and their management are not considered by any of these access control models 2.2 Policy-based personalization of OSN contents Recently, there have been some proposals exploiting classification mechanisms for personalizing access in OSNs. For instance, in [11] a classification method has been proposed to categorize short text messages in order to avoid overwhelming users of microblogging services by raw data. The system described in [11] focuses on Twitter2 and associates a set of categories with each tweet describing its content. The user can then view only certain types of tweets based on his/her interests. In contrast, Golbeck and Kuter [12] propose an application, called FilmTrust, that exploits OSN trust relationships and provenance information to personalize access to the website. However, such systems do not provide a filtering policy layer by which the user can exploit the result of the classification process to decide how and to which extent filtering out unwanted information. In contrast, our filtering policy language allows the setting
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 94
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
of FRs according to a variety of criteria, that do not consider only the results of the classification process but also the relationships of the wall owner with other OSN users as well as information on the user profile. Moreover, our system is complemented by a flexible mechanism for BL management that provides a further opportunity of customization to the filtering procedure. The only social networking service we are aware of providing filtering abilities to its users is MyWOT, a social networking service which gives its subscribers the ability to: 1) rate resources with respect to four criteria: trustworthiness, vendor reliability, privacy, and child safety; 2) specify preferences determining whether the browser should block access to a given resource, or should simply return a warning message on the basis of the specified rating. Despite the existence of some similarities, the approach adopted by MyWOT is quite different from ours. In particular, it supports filtering criteria which are far less flexible than the ones of Filtered Wall since they are only based on the four above-mentioned criteria. Moreover, no automatic classification mechanism is provided to the end user. Our work is also inspired by the many access control models and related policy languages and enforcement mechanisms that have been proposed so far for OSNs, since filtering shares several similarities with access control. Actually, content filtering can be considered as an extension of access control, since it can be used both to protect objects from unauthorized subjects, and subjects from inappropriate objects. In the field of OSNs, the majority of access control models proposed so far enforce topology-based access control, according to which access control requirements are expressed in terms of relationships that the requester should have with the resource owner. We use a similar idea to identify the users to which a
FR applies. However, our filtering policy language extends the languages proposed for access control policy specification in OSNs to cope with the extended requirements of the filtering domain. Indeed, since we are dealing with filtering of unwanted contents rather than with access control, one of the key ingredients of our system is the availability of a description for the message contents to be exploited by the filtering mechanism. In contrast, no one of the access control models previously cited exploit the content of the resources to enforce access control. Moreover, the notion of BLs and their management are not considered by any of the above-mentioned access control models. 3. ANALYSIS OF PROBLEM: The use of effective and appropriate methods in facilitating projects enhances its effectiveness and efficiency. The method will be applied in system analysis and design method where an existing system is studied to proffer better options to solving existing problems. Indeed, today OSNs provide very little support to prevent unwanted messages on user walls. For example, Facebook allows users to state who is allowed to insert messages in their walls (i.e., friends, friends of friends, or defined groups of friends). However, no content-based preferences are supported and therefore it is not possible to prevent undesired messages, such as political or vulgar ones, no matter of the user who posts them. However, no content-based preferences are supported and therefore it is not possible to prevent undesired messages, no matter of the user who posts them. Providing this service is not only a matter of using previously defined web content mining techniques for a different application, rather it requires to design ad hoc classification strategies. This is because wall messages are constituted by short text
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 95
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
for which traditional classification methods have serious limitations since short texts do not provide sufficient word occurrences. 4.IMBALANCED DATA-SETS IN CLASSIFICATION In this section, we will first introduce the problem of imbalanced data-sets. Then, we will describe the preprocessing technique that we have applied in order to deal with the imbalanced data-sets: the SMOTE algorithm [7]. Finally, we will present the evaluation metrics for this kind of classification problem. 4.1. The problem of imbalanced data-sets Learning from imbalanced data is an important topic that has recently appeared in the machine learning community.When treating with imbalanced data-sets, one or more classes might be represented by a large number of examples whereas the others are represented by only a few.We focus on the binary-class imbalanced data-sets, where there is only one positive and one negative class. We consider the positive class as the one with the lowest number of examples and the negative class the one with the highest number of examples. Furthermore, in this work we use the IR, defined as the ratio of the number of instances of the majority class and the minority class, to organize the different data-sets according to their IR.The problem of imbalanced data-sets is extremely significant because it is implicit in most real world applications, such as fraud detection [16], text classification, risk management or medical applications.In classification, this problem (also named the ‘‘class imbalance problem”) will cause a bias on the training of classifiers and will result in the lower sensitivity of detecting the minority class examples. For this reason, a large number of approaches have been previously proposed to deal with
the class imbalance problem. These approaches can be categorized into two groups: the internal approaches that create new algorithms or modify existing ones to take the class imbalance problem into consideration [3] and external approaches that preprocess the data in order to diminish the effect cause by their class imbalance [4,15].The internal approaches have the disadvantage of being algorithm specific, whereas external approaches are independentof the classifier used and are, for this reason, more versatile. Furthermore, in our previous work on this topic [18] we analyzed the cooperation of some preprocessing methods with FRBCSs, showing a good behaviour for the oversampling methods,specially in the case of the SMOTE methodology.According to this, we will employ in this paper the SMOTE algorithm in order to deal with the problem of imbalanced data-sets. This method is detailed in the next subsection. 4.2. Preprocessing imbalanced data-sets. The SMOTE algorithm As mentioned before, applying a preprocessing step in order to balance the class distribution is a positive solution to the imbalance data-set problem [4]. Specifically, in this work we have chosen an over-sampling method which is a reference in this area: the SMOTE algorithm [7].In this approach the minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbours. Depending upon the amount of oversampling required, neighbours from the k-nearest neighbours are randomly chosen. This process is illustrated in Fig. 1,where xi is the selected point, xi1 to xi4 are some selected nearest neighbours and r1 to r4 the synthetic data points created by the randomized interpolation. The implementation employed
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 96
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
in this work uses only one nearest neighbour using the euclidean distance, and balance both classes to the 50% distribution.Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbour. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features.This approach effectively forces the decision region of the minority class to become more general. An example is detailed in Fig. 2.In short, its main idea is to form new minority class examples by interpolating between several minority class examples that lie together. Thus, the overfitting problem is avoided and causes the decision boundaries for the minority class to spread further into the majority class space.
of the positive (negative) label being true. In other words, they assess the effectiveness of the algorithm on a single class.
4.3. Evaluation in imbalanced domains The measures of the quality of classification are built from a confusion matrix (shown in Table 1) which records correctly and incorrectly recognized examples for each class.The most used empirical measure, accuracy (1), does not distinguish between the number of correct labels of different classes, which in the framework of imbalanced problems may lead to erroneous conclusions. For example a classifier that obtains an accuracy of 90% in a data-set with an IR value of 9, might not be accurate if it does not cover correctly any minority class instance.
Fig.1.An illustration on how to create the synthetic data points in the SMOTE algorithm.
Because of this, instead of using accuracy, more correct metrics are considered. Two common measures, sensitivity and specificity (2,3), approximate the probability
The metric used in this work is the geometric mean of the true rates [3], which can be defined as
Fig.2.Example of SMOTE application
This metric attempts to maximize the accuracy of each one of the two classes with a good balance. It is a performance metric that links both objectives.
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 97
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
Table.1. Confusion matrix for a two-class problem.
Class Positive class Negative class
Positive prediction True positive(TP) False positive(FP)
Negative Prediction False Negative(FN) True Negative(TN)
5. Hierarchical rule base genetic rule selection process In the previous section we have mentioned that an excessive number of rules may not produce a good performance and it makes difficult to understand the model behaviour. We may find different types of rules in a large fuzzy rule set: irrelevant rules, which do not contain significant information; redundant rules, whose actions are covered by other rules; erroneous rules, which are wrong defined and distort the performance of the FRBCS; and conflicting rules, which perturb the performance of the FRBCS when they coexist with others.In this work, we consider the CHC genetic model [14] in order to make the rule selection process, since it has achieved good results for binary selection problems [6]. In the following, the main characteristics of this genetic approach are presented. 1. Coding scheme and initial gene pool: It is based on a binary coded GA where each gene indicates whether a rule is selected or not (alleles ‘1’ or ‘0’, respectively). Considering that N rules are contained in the preliminary/candidate rule set, the chromosome C = (c1, . . . ,cN) represents a subset of rules composing the final HRB, such that:
with Ri being the corresponding ith rule in the candidate rule set and HRB being the final hierarchical rule base.The initial
pool is obtained with an individual having all genes with value ‘1’ and the remaining individuals generated at random in {0, 1}, so that the initial HRB is taking into account in the genetic selection process. 2. Chromosome evaluation: The fitness function must be in accordance with the framework of imbalanced data-sets. Thus, we will use, as presented in Section 2.3, the geometric mean of the true rates, defined in (4) as:
3. Crossover operator: The half uniform crossover scheme (HUX) is employed. In this approach, the two parents are combined to produce two new offspring. The individual bits in the string are compared between the two parents and exactly half of the non-matching bits are swapped. Thus the Hamming distance (the number of differing bits) is first calculated.This number is divided by two. The resulting number is how many of the bits that do not match between the two parents will be swapped. 4. Restarting approach: To get away from local optima, this algorithm uses a restart approach. In this case, the best chromosome is maintained and the remaining are generated at random in {1,0}. The restart procedure is applied when a threshold value is reached, which means that all the individuals coexisting in the population are very similar. 5. Evolutionary model: The CHC genetic model makes use of a ‘‘Population-based Selection” approach. N parents and their corresponding offspring are combined to select the best N individuals to take part of the next population. The CHC approach makes use of an incest prevention mechanism and a restarting process to provoke diversity in the population,instead of the well-known mutation operator.
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 98
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
This incest prevention mechanism will be considered in order to apply the HUX operator, i.e., two parents are crossed if their hamming distance divided by 2 is higher than a predetermined threshold, L. The threshold value is initialized as: L = (#Genes/4.0). Following the original CHC scheme, L is decremented by one when the population does not change in one generation. The algorithm restarts when L is below zero. We will stop the genetic process if more than 3 restarts are performed without including any new chromosome in the population.
6. CONCLUSION In this paper, we have proposed an HFRBCS approach for classification with imbalanced data-sets. Our aim was to employ a hierarchical model to obtain a good balance among different granularity levels. A fine granularity is applied in the boundary areas, and a thick granularity may be applied in the rest of the classification space providing a good generalization. Thus,this approach enhances the classification performance in the overlapping areas between the minority and majority classes.Furthermore, we have made use of the SMOTE algorithm in order to balance the training data before the rule learning generation phase. This preprocessing step enables the obtention of better fuzzy rules than using the original data-sets and therefore, we improve the global performance of the fuzzy model to filter out unwanted messages from Online Social Networking (OSN). 7.REFERENCES [1] R. Alcalá, J. Alcalá-Fdez, F. Herrera, J. Otero, Genetic learning of accurate and compact fuzzy rule based systems based on the 2-tuples linguistic representation,
International Journal of Approximate Reasoning 44 (2007) 4564. [2] A. Asuncion, D. Newman, 2007. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. URL: <http://www.ics.uci.edu/~mlearn/MLReposi tory.html>. [3] R. Barandela, J.S. Sánchez, V. García, E. Rangel, Strategies for learning in class imbalance problems, Pattern Recognition 36 (3) (2003) 849–851. [4] G.E.A.P.A. Batista, R.C. Prati, M.C. Monard, A study of the behaviour of several methods for balancing machine learning training data, SIGKDD Explorations 6 (1) (2004) 20–29. [5] P. Campadelli, E. Casiraghi, G. Valentini, Support vector machines for candidate nodules classification, Letters on Neurocomputing 68 (2005) 281–288. [6] J.R. Cano, F. Herrera, M. Lozano, Using evolutionary algorithms as instance selection for data reduction in kdd: an experimental study, IEEE Transactions on Evolutionary Computation 7 (6) (2003) 561–575. [7] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of Artificial Intelligent Research 16 (2002) 321–357. [8] N.V. Chawla, N. Japkowicz, A. Kolcz, Editorial: special issue on learning from imbalanced data-sets, SIGKDD Explorations 6 (1) (2004) 1–6. [9] Z. Chi, H. Yan, T. Pham, Fuzzy algorithms with applications to image processing and pattern recognition, World Scientific, 1996. [10] J.-N. Choi, S.-K. Oh, W. Pedrycz, Structural and parametric design of fuzzy inference systems using hierarchical fair competition-based parallel genetic algorithms and information granulation, International Journal of Approximate Reasoning 49 (3) (2008) 631–648.
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 99
www.iaetsd.in
INTERNATIONAL CONFERENCE ON DEVELOPMENTS IN ENGINEERING RESEARCH, ICDER - 2014
[11] O. Cordón, M.J. del Jesus, F. Herrera, A proposal on reasoning methods in fuzzy rule-based classification systems, International Journal of Approximate Reasoning 20 (1) (1999) 21–45. [12] O. Cordón, F. Herrera, I. Zwir, Linguistic modeling by hierarchical systems of linguistic rules, IEEE Transactions on Fuzzy Systems 10 (1) (2002) 2–20. [13] J. Demšar, Statistical comparisons of classifiers over multiple data-sets, Journal of Machine Learning Research 7 (2006) 1–30. [14] L.J. Eshelman, 1991. Foundations of Genetic Algorithms. Morgan Kaufman, Ch. The CHC Adaptive Search Algorithm: How to have Safe Search when Engaging in Nontraditional Genetic Recombination, pp. 265–283. [15] A. Estabrooks, T. Jo, N. Japkowicz, A multiple resampling method for learning from imbalanced data-sets, Computational Intelligence 20 (1) (2004) 18–36. [16] T. Fawcett, F.J. Provost, Adaptive fraud detection, Data Mining and Knowledge Discovery 1 (3) (1997) 291–316. [17] A. Fernández, S. García, M.J. del Jesus, F. Herrera, An analysis of the rule weights and fuzzy reasoning methods for linguistic rule based classification systems applied to problems with highly imbalanced data-sets, in: International Workshop on Fuzzy Logic and Applications (WILF07), Lecture Notes on Computer Science, vol. 4578, SpringerVerlag, 2007, pp. 170–179. [18] A. Fernández, S. García, M.J. del Jesus, F. Herrera, A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets, Fuzzy Sets and Systems 159 (18) (2008) 2378–2398. [19] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, Journal of the American Statistical Association 32 (1937) 675–701.
[20] S. García, D. Molina, M. Lozano, F. Herrera, A study on the use of nonparametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the CEC’2005 special session on real parameter optimization. Journal of Heuristics, in press, doi: 10.1007/s10732008-9080-4.
INTERNATIONAL ASSOCIATION OF ENGINEERING & TECHNOLOGY FOR SKILL DEVELOPMENT 100
www.iaetsd.in