Pattern Anal Applic DOI 10.1007/s10044-011-0265-3
THEORETICAL ADVANCES
A unifying methodology for the evaluation of neural network models on novelty detection tasks Guilherme A. Barreto • Rewbenio A. Frota
Received: 27 April 2010 / Accepted: 31 December 2011 © Springer-Verlag London Limited 2012
Abstract An important issue in data analysis and pattern classification is the detection of anomalous observations and their influence on the classifier’s performance. In this paper, we introduce a novel methodology to systematically compare the performance of neural network (NN) methods applied to novelty detection problems. Initially, we describe the most common NN-based novelty detection techniques. Then we generalize to the supervised case a recently proposed unsupervised novelty detection method for computing reliable decision thresholds. We illustrate how to use the proposed methodology to evaluate the performance of supervised and unsupervised NN-based novelty detectors on a real-world benchmarking data set, assessing their sensitivity to training parameters, such as data scaling, number of neurons, training epochs and size of the training set.
Keywords Novelty detection · Self-organizing maps · Multilayer neural networks · Bootstrap · Decision intervals
G. A. Barreto (✉) · R. A. Frota
Department of Teleinformatics Engineering, Federal University of Ceará, Fortaleza, CE, Brazil
e-mail: guilherme@deti.ufc.br
R. A. Frota e-mail: rewbenio@uol.com.br
1 Introduction
Novelty detection is the problem of reporting the occurrence of novel events or data. Because it applies across many disciplines in engineering and science, novelty detection is also called anomaly detection, intruder detection, fault detection or outlier detection. As such, it has been the focus of increasing attention in many pattern recognition applications whose success depends on building a reliable model for the data, such as machine monitoring [23, 31, 44], image processing [34], remote sensing [54], medical diagnosis [30, 42], mobile robotics [36, 47], multimedia applications [9, 40], computer network security [13, 48, 53], homeland security [4], telecommunications [7, 15], time series data analysis [6, 37, 51], among others. This interest is in part due to the fact that, for a wide range of real-world problems, it is crucial to be able to detect patterns that do not match well with the stored data representation. Several neural, system-theoretic, statistical and hybrid approaches to novelty detection have been proposed over the years, but it is becoming usual to formulate novelty detection tasks as one of the following pattern classification problems:
– Single-class: The data available for learning a representation of the expected behavior of the system of interest comprise only one class of data vectors, usually representing normal activity of the system. This type of data is also referred to as positive examples. The goal is to indicate whether a given input vector corresponds to normal or abnormal behavior.
– Multi-class: The training set contains data vectors of different classes. The data should be representative of positive (normal) and negative (abnormal) behavior, to build an overall representation of the known system behavior, even (and especially) in faulty operation [3]. The goal is to classify the input vector into one or none of the existing classes.
Thus, the design of novelty detectors can be generally stated as the task in which a description of what is already known about the system is learned by fitting a set of normal
and/or abnormal data vectors, so that subsequent unseen patterns are evaluated by comparing a measure of novelty against decision thresholds. The main challenges are then the collection of reliable data, the definition of an appropriate learning machine (i.e., the classifier) and the computation of decision thresholds. As pointed out in [19, 32, 33, 35], considerable efforts have been devoted to the design of powerful classifiers and decision threshold computation techniques, while much less attention has been paid to data-related issues, such as the occurrence of outliers and data-scaling methods, and their influence on the performance of the classifiers. Concerning the quality of the collected data, most works in novelty detection assume, implicitly or explicitly, that the training data are outlier-free or that the outliers are known in advance. Since outliers may arise for a number of reasons, such as measurement error, mechanical/electrical faults, unexpected behavior of the system (e.g., fraudulent behavior), or simply natural statistical deviations within the data set, those assumptions are unrealistic. It is worth mentioning that the data-labeling process, even if performed by an expert, is also error-prone. Even if we assume that the data are outlier-free, it is very difficult, if not impossible, to know in advance whether the sampled data, in terms of the number of positive and/or negative examples, suffice to give a reliable description of the underlying statistical nature of the system. For example, for some applications, the number of negative (abnormal) examples can be very small, since they are rare or difficult (e.g., expensive) to collect. However, it is well known that to achieve a good classification performance, the number of examples per class should ideally be balanced [49]. This is particularly true for powerful nonlinear classifiers, such as multilayered neural networks [29].
In this case, it is recommended to consider the few negative examples available as outliers, treating the novelty detection task as a single-class classification problem, in which the classifier is trained with positive (normal) examples only. The outliers are then used to test the performance of the novelty detection system. Some authors, however, argue that the inclusion of outliers during training can be beneficial for the novelty detection system, improving its robustness as a whole [5, 34, 45]. If known outliers are unavailable, these authors suggest generating artificial outliers for that purpose. Bearing in mind the aforementioned issues concerning the design of a robust novelty detection system, the contributions of this paper are manifold:
1. Proposal of a generic methodology to compute decision thresholds that can be applied to a wide range of supervised and unsupervised neural architectures.
2. Proposal of a data-cleaning strategy for outlier removal based on the proposed methodology.
3. Comparison of the proposed methodology with existing techniques for computing decision thresholds using different neural network paradigms.
4. Evaluation of the proposed methodology in the presence of known and unknown outliers and for different data-scaling strategies.
The remainder of the paper is organized as follows. In Sect. 2, we briefly present the novelty detection task as a hypothesis testing procedure. Then, in Sect. 3, we describe how the standard neural network architectures, such as the SOM, MLP and RBF networks, have been used for novelty detection purposes. In Sects. 4 and 5, we generalize to the supervised case the recent application of the bootstrap resampling technique to unsupervised novelty detection, so as to compute reliable decision thresholds, and propose an outlier removal procedure based on it. Finally, in Sects. 6 and 7 we evaluate the performances of the neural network methods through simulations on a breast cancer data set and discuss the obtained results. We conclude the paper in Sect. 8.
2 Novelty detection as hypothesis testing
Before starting to describe neural network approaches for novelty detection, it is worth presenting the novelty detection task under the formalism of statistical hypothesis testing to establish criteria to measure the performance of the neural models. First of all, it is necessary to define a null hypothesis, i.e., the hypothesis to be tested. For our purposes, $H_0$ is stated as follows:
– $H_0$: The input vector $\mathbf{x}_{new}$ reflects KNOWN activity,
where by the adjective known we mean that the vector $\mathbf{x}_{new}$ represents normal behavior, if we are dealing with single-class classification problems. If we have a multi-class classification problem, the adjective known means that the input vector belongs to one of the already learned classes.
The so-called alternative hypothesis, denoted as $H_1$, is given by:
– $H_1$: The input vector reflects UNKNOWN activity,
so that, in this case, the input vector carries novel information, which in general is indicative of abnormal behavior of the system being analyzed.
Thus, when formulating a conclusion regarding the condition of the system based on the definitions of $H_0$ and $H_1$, two types of errors are possible:
– Type I error: This error occurs when $H_0$ is rejected when it is, in fact, true. The probability of making a type I error is denoted by the significance level, $\alpha$, whose value is set by the investigator in relation to the consequences of such an error. That is, we want to make the significance level as small as possible to protect the null hypothesis and to prevent, as far as possible, the investigator from inadvertently making false claims. Type I error is also referred to as False Alarm, False Detection or False Positive.
– Type II error: This error occurs when $H_0$ is accepted when it should be rejected. The probability of making a type II error is denoted by $\beta$ (which is generally unknown). Type II error is also referred to as Absence of Alarm or False Negative. A type II error is frequently due to the sample size $N$ being too small.
Novelty detection systems are usually evaluated by the number of false alarms and absences of alarm they produce. The ideal novelty detector would have $\alpha = 0$ and $\beta = 0$, but this is not possible in practice. So, one tries to manage the $\alpha$ and $\beta$ error probabilities based on the overall consequences (e.g., high costs, death, machine breakdown, virus infection, etc.) for the system being analyzed. For example, reporting false alarms too frequently may lead to the following situation: system operators would gradually lose faith in the classifier’s decisions to the point that they would refuse to believe that an actual problem is occurring. In medical testing, the absence of alarms (false negatives) provides false reassurance, to both patients and physicians, that patients are free of diseases which are actually present. This in turn leaves people with a mistaken understanding of their condition and without appropriate advice and treatment. The difficulty is that, for any fixed sample size $N$, a decrease in $\alpha$ causes an increase in $\beta$. Conversely, an increase in $\alpha$ causes a decrease in $\beta$. To decrease both $\alpha$ and $\beta$ to acceptable levels, we may increase the sample size $N$. Also, for any fixed $\alpha$, an increase in the sample size $N$ will cause a reduction in $\beta$, i.e., a greater number of samples reduces the probability of accepting the null hypothesis when it is false. Usually, in neural-based novelty detection the number of samples is fixed and strongly related to the number of neurons. If one increases the number of neurons to decrease both $\alpha$ and $\beta$, the computational cost also increases rapidly. This can be problematic if the novelty detection systems are supposed to work in real time, such as in intruder detection or spam detection software. Even for offline applications, higher computational efforts demand higher computational power, increasing the costs of the hardware. An alternative is to increase the number of samples, a technique which usually demands low computational efforts. In this paper, we increase the number of samples of the variable of interest by means of statistical resampling techniques, such as the bootstrap [12].
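As a small illustration of how the false-alarm (type I) and absence-of-alarm (type II) rates can be estimated from a labeled test set, consider the following minimal Python sketch. The function name `error_rates` and the toy labels are assumptions introduced here for illustration; they are not taken from the experiments of this paper.

```python
def error_rates(is_novel_true, is_novel_pred):
    """Estimate the type I and type II error rates of a novelty detector.

    H0 is "the input is known/normal"; predicting novelty means rejecting H0.
    Returns (alpha_hat, beta_hat): the fraction of truly normal inputs that
    raised an alarm, and the fraction of truly novel inputs that did not.
    """
    pairs = list(zip(is_novel_true, is_novel_pred))
    false_alarms = sum(1 for t, p in pairs if not t and p)
    missed = sum(1 for t, p in pairs if t and not p)
    n_normal = sum(1 for t, _ in pairs if not t)
    n_novel = sum(1 for t, _ in pairs if t)
    alpha_hat = false_alarms / n_normal if n_normal else float("nan")
    beta_hat = missed / n_novel if n_novel else float("nan")
    return alpha_hat, beta_hat


# Toy example (hypothetical labels): four normal inputs, two novel ones.
truth = [False, False, False, False, True, True]
alarm = [False, True, False, False, True, False]
alpha_hat, beta_hat = error_rates(truth, alarm)
```

With these toy labels, one of the four normal inputs raises a false alarm and one of the two novel inputs is missed, so the two estimates are 0.25 and 0.5, respectively.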
3 Neural methods for novelty detection
Artificial neural network (ANN) algorithms have been successfully applied to a wide range of novelty detection tasks, mainly due to their nonparametric¹ nature and their powerful generalization performance. In this section, we briefly review the most common ANN approaches to novelty detection. It is not our intention to provide a comprehensive survey of possible approaches, but rather to give an introduction to the issue.
3.1 Optimal linear associative memory
One of the first approaches to novelty detection, called the Novelty Filter, was proposed by Kohonen and Oja [27]. The following mathematical development is a special case of the Optimal Linear Associative Memory (OLAM) [24], which builds a linear mapping $\mathbf{y} = \mathbf{M}\mathbf{x}$ from a finite set of input–output pairs $(\mathbf{x}_i, \mathbf{y}_i)$, $i = 1, \ldots, m$. For novelty detection purposes, we are interested in the autoassociative OLAM, the case for which $\mathbf{x}_i = \mathbf{y}_i$. Thus, given a set of $n$-dimensional vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m \in \mathbb{R}^n$, it is possible to compute the $n$-by-$n$ matrix $\mathbf{M}$ as follows:
$$\mathbf{M} = \mathbf{X}\mathbf{X}^{*}, \tag{1}$$
where the columns of the $n$-by-$m$ matrix $\mathbf{X}$ are the training vectors $\mathbf{x}_i \in \mathbb{R}^n$ and $\mathbf{X}^{*} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ denotes the pseudoinverse matrix of $\mathbf{X}$ (this closed form assumes linearly independent training vectors). Let the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ span some unique linear subspace $\mathcal{L}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m)$ of $\mathbb{R}^n$, or alternatively,
$$\mathcal{L} = \mathcal{L}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m) = \left\{ \mathbf{x} \;\middle|\; \mathbf{x} = \sum_{i=1}^{m} c_i \mathbf{x}_i \right\}, \tag{2}$$
where $c_1, \ldots, c_m$ are arbitrary real scalars from the domain $(-\infty, \infty)$. It can be shown that the matrix $\mathbf{M}$ is a projection operator. The operator $\mathbf{M}$ projects $\mathbb{R}^n$ onto $\mathcal{L}$. There is another operator, called the dual operator, that projects $\mathbb{R}^n$ onto $\mathcal{L}^{\perp}$, which is the orthogonal complement space $\{\mathbf{x} \in \mathbb{R}^n : \mathbf{x}^T\mathbf{y} = 0, \forall \mathbf{y} \in \mathcal{L}\}$. It can be shown that the dual operator is given by $\mathbf{I} - \mathbf{M}$, where $\mathbf{I}$ denotes the $n \times n$ identity matrix. Every vector in $\mathbb{R}^n$ can be uniquely decomposed as follows:
¹ By nonparametric we mean methods that make no or very few assumptions about the statistical distribution of the data.
$$\mathbf{x} = \mathbf{M}\mathbf{x} + (\mathbf{I} - \mathbf{M})\mathbf{x} = \hat{\mathbf{x}} + \tilde{\mathbf{x}}, \tag{3}$$
in which the projection $\hat{\mathbf{x}}$ measures what is known about the input $\mathbf{x}$ relative to the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ stored in matrix $\mathbf{M}$, as shown in (1). The projection $\tilde{\mathbf{x}}$ is called the novelty vector, since it measures what is maximally unknown or novel in the input vector $\mathbf{x}$. Thus, the magnitude of $\tilde{\mathbf{x}}$ can be used for novelty detection purposes. In such applications, the larger the norm $\|\tilde{\mathbf{x}}\|$, the less certain we are of judging the vector $\mathbf{x}$ as belonging to the linear subspace $\mathcal{L}$.
3.2 The self-organizing map
The Self-Organizing Map (SOM) [25, 26] is one of the most popular neural network architectures. It belongs to the category of unsupervised competitive learning algorithms and it is usually designed to build an ordered representation of spatial proximity among vectors of an unlabeled data set. The SOM has been widely applied to pattern recognition and classification tasks, such as clustering, vector quantization, data compression and data visualization. In these applications, the weight vectors are called prototypes or centroids of clusters of input vectors, and are usually obtained through a process of learning. The neurons in the SOM are put together in an output layer, $\mathcal{A}$, in one-, two- or even three-dimensional arrays. Each neuron $i \in \mathcal{A}$ has a weight vector $\mathbf{w}_i \in \mathbb{R}^n$ with the same dimension as the input vector $\mathbf{x} \in \mathbb{R}^n$. The network weights are trained according to a competitive-cooperative learning scheme in which the weight vectors of a winning neuron and its neighbors in the output array are updated after the presentation of an input vector. Roughly speaking, the functioning of this type of learning algorithm is based on the concept of winning neuron, defined as the neuron whose weight vector is the closest to the current input vector. Using Euclidean distance, the simplest strategy to find the winning neuron, $i^*(t)$, is given by
$$i^*(t) = \arg\min_{\forall i} \|\mathbf{x}(t) - \mathbf{w}_i(t)\| \tag{4}$$
where $\mathbf{x}(t) \in \mathbb{R}^n$ denotes the current input vector, $\mathbf{w}_i(t) \in \mathbb{R}^n$ is the weight vector of neuron $i$, and $t$ denotes the current iteration of the algorithm. Accordingly, the weight vectors are adjusted by the following recursive equation:
$$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta(t)\, h(i^*, i; t)\, [\mathbf{x}(t) - \mathbf{w}_i(t)], \tag{5}$$
where $h(i^*, i; t)$ is a Gaussian function which controls the degree of change imposed on the weight vectors of those neurons in the neighborhood of the winning neuron:
$$h(i^*, i; t) = \exp\left( -\frac{\|\mathbf{r}_i(t) - \mathbf{r}_{i^*}(t)\|^2}{\sigma^2(t)} \right) \tag{6}$$
where $\sigma(t)$ defines the radius of the neighborhood function, and $\mathbf{r}_i(t)$ and $\mathbf{r}_{i^*}(t)$ are, respectively, the coordinates of neurons $i$ and $i^*$ in the array. The learning rate, $0 < \eta(t) < 1$, should decrease gradually with time to guarantee convergence of the weight vectors to stable states. In this paper, we use $\eta(t) = \eta_0 (\eta_T/\eta_0)^{t/T}$, where $\eta_0$ and $\eta_T$ are the initial and final values of $\eta(t)$, respectively. The variable $\sigma(t)$ should also decrease with time similarly to the learning rate $\eta(t)$. The SOM has several features which make it a valuable tool in data-mining applications [46]. For instance, the use of a neighborhood function imposes an order on the weight vectors, so that, at the end of the training phase, input vectors that are close in the input space are mapped onto the same winning neuron or onto winning neurons that are close in the output array. This is the so-called topology-preserving property of the SOM, which has been particularly useful for data visualization purposes [14]. Once the SOM converges, the set of ordered weight vectors summarizes important statistical characteristics of the input. The SOM should reflect variations in the statistics of the input distribution: regions in the input space $\mathcal{X}$ from which sample vectors $\mathbf{x}$ are drawn with a high probability of occurrence are mapped onto larger domains of the output space $\mathcal{A}$, and therefore with better resolution than regions in $\mathcal{X}$ from which sample vectors are drawn with a low probability of occurrence. This density-matching property is very important for novelty detection purposes. For example, once the SOM is trained with unlabeled vectors consisting only of data representing the normal state of the system being analyzed, we can use the quantization error
$$e(\mathbf{x}, \mathbf{w}_{i^*}, t) = \|\mathbf{x}(t) - \mathbf{w}_{i^*}(t)\| = \sqrt{\sum_{j=1}^{n} \left( x_j(t) - w_{i^* j}(t) \right)^2}, \tag{7}$$
between the current input vector $\mathbf{x}(t)$ and the winning weight vector $\mathbf{w}_{i^*}(t)$, as a measure of the degree of proximity of $\mathbf{x}(t)$ to the distribution of “normal” data vectors encoded by the weight vectors of the SOM. Roughly speaking, if $e(\mathbf{x}, \mathbf{w}_{i^*}, t)$ is larger than a certain threshold $\rho$, we assume that the current input is far from the region of the input space representing normal behavior as modeled by the SOM weights, thus revealing a novelty or an anomaly in the system being monitored. Several procedures to compute the threshold $\rho$ have been developed in recent years, most of them based on well-established statistical techniques (see e.g., [19, 32]). In the
next sections we describe some of these techniques in the context of SOM-based novelty detection.
3.3 Novelty detection using the SOM
In this section we describe two SOM-based approaches for novelty detection. The first one is based on the computation of a single decision threshold, while the other is based on the computation of two decision thresholds. Although they were proposed for specific neural architectures (e.g., the SOM), we will show later in Sect. 4 that the novelty detection techniques reported in this section can be equally applied to other learning architectures/paradigms (e.g., OLAM, MLP and RBF networks).
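Before turning to the threshold computation itself, the SOM quantities of Sect. 3.2 can be made concrete with a minimal Python sketch of the winner search (4), the update rule (5) and (6), and the quantization error (7). This is only an illustration under stated assumptions: a one-dimensional output array with integer neuron coordinates, weights initialized at randomly chosen training vectors, and hypothetical default values for the learning-rate and neighborhood schedules. None of these settings is claimed to match the configuration used in the simulations of this paper.

```python
import math
import random


def winner(x, weights):
    """Eq. (4): index of the weight vector closest to x (Euclidean distance)."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


def quantization_error(x, weights):
    """Eq. (7): distance between x and the winning weight vector."""
    return math.dist(x, weights[winner(x, weights)])


def train_som(data, n_neurons, epochs=50, eta0=0.5, eta_T=0.01,
              sigma0=None, sigma_T=0.5, seed=0):
    """Train a one-dimensional SOM with the update rule of Eqs. (5)-(6).

    Assumptions of this sketch: neuron i has array coordinate r_i = i, and
    both eta(t) and sigma(t) decay exponentially from initial to final value.
    """
    rng = random.Random(seed)
    weights = [list(rng.choice(data)) for _ in range(n_neurons)]
    sigma0 = sigma0 if sigma0 is not None else n_neurons / 2
    total_steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # random presentation order
            eta = eta0 * (eta_T / eta0) ** (t / total_steps)
            sigma = sigma0 * (sigma_T / sigma0) ** (t / total_steps)
            i_star = winner(x, weights)
            for i, w in enumerate(weights):
                h = math.exp(-((i - i_star) ** 2) / sigma ** 2)  # Eq. (6)
                for j in range(len(w)):
                    w[j] += eta * h * (x[j] - w[j])              # Eq. (5)
            t += 1
    return weights


# Toy data (hypothetical): 200 "normal" two-dimensional vectors.
rng = random.Random(1)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
som = train_som(normal, n_neurons=10)
e_typical = quantization_error((0.0, 0.0), som)
e_far = quantization_error((10.0, 10.0), som)
```

In this toy setting, an input located far from the training cloud yields a much larger quantization error than an input at its center, which is the behavior exploited by the threshold-based tests.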
3.3.1 Computing a single decision threshold
In Höglund et al. [20], the SOM is trained with data representing the activity of normal users within a computer network. The threshold $\rho$ is determined by computing the statistical p value associated with the distribution of training quantization errors. The p value is the probability of observing a value of the test statistic $e(\mathbf{x}, \mathbf{w}_{i^*}, t)$ as extreme as or more extreme than the observed value, assuming that the null hypothesis is true. This novelty detection procedure is implemented as follows:
Step 1: After training is finished, the quantization errors $(e_1, e_2, \ldots, e_m)$ for the training vectors are computed using (7), for $t = 1, \ldots, m$.
Step 2: The quantization error, $e_{new}$, for a new input vector is computed.
Step 3: The p value for the new input vector, denoted by $P_{new}$, is computed as follows. Let $B$ be the number of values in the set $\{e_1, e_2, \ldots, e_m\}$ that are greater than $e_{new}$. Thus,
$$\rho = P_{new} = \frac{B}{m}, \tag{8}$$
where $m$ is the number of training data vectors available.
Step 4: If $\rho > \alpha$, then $H_0$ is accepted; otherwise it is rejected. A significance level $\alpha = 0.05$ is commonly used.
Step 5: Steps 2–4 are repeated for every new input vector.
According to the authors, the above algorithm is very reliable and presented acceptable rates of false negatives and false positives; they concluded that these errors were caused by normal changes in user profiles. Similar approaches have been applied to novelty detection in cellular networks [28], time series modeling [16] and machine monitoring [17].
A single-threshold SOM-based method for fault detection in rotating machinery is presented by Tanaka et al. [44]. The procedure follows the same steps described previously, except that in this case the novelty threshold is computed as follows:
– For each neuron in the immediate neighborhood of the winning neuron (also called 1-neighborhood neurons), one computes the distances
$$D_{i^* j} = \|\mathbf{w}_{i^*} - \mathbf{w}_j\|, \quad \forall j \neq i^*,$$
from the winning weight vector to all the other weight vectors.
– The novelty threshold is taken as the maximum value of these distances:
$$\rho = \max_{\forall j \in V_1} \{D_{i^* j}\}, \tag{9}$$
where $V_1$ is the set of 1-neighborhood neurons of the current winning neuron. Thus, if $e_{new} > \rho$ then the input vector carries novel or anomalous information, i.e., the null hypothesis should be rejected.
3.3.2 Computing double decision thresholds
In this section, we describe techniques that compute two thresholds for evaluating the degree of novelty in the input vector. The rationale behind this approach is that, for certain applications, not only a very high quantization error but also a very small one is indicative of novelty. One can argue that a small quantization error means that the input vector is almost surely normal. This is true if no outliers are present in the data. However, in more realistic scenarios, there is no guarantee that the training data is outlier-free, and a given neuron could be representing exactly the region the outliers belong to. For further elaboration on this reasoning and its consequences for novelty/anomaly/outlier detection, the interested reader is referred to Muñoz and Muruzábal [38].
In [7], the authors proposed a novel technique to detect faults in cellular systems by computing the Bootstrap Prediction Interval (BOOPI) for the distribution of quantization errors. The lower and upper limits of the BOOPI define the novelty thresholds. Several competitive models were analyzed and the SOM provided the best results, generating the lowest false alarm rate. To implement this procedure, a sample of $M$ bootstrap instances $\{e^b_1, e^b_2, \ldots, e^b_M\}$ is drawn with replacement from the original sample of $m$ ($m \ll M$) quantization errors $(e_1, e_2, \ldots, e_m)$, where each
instance has equal probability to be sampled. Then, the lower and upper limits of the BOOPI method are computed via percentiles.² It has been shown that prediction (or confidence) intervals can be computed from the bootstrap samples without making any assumption about the distribution of the original data, provided the number $M$ of bootstrap samples is large, e.g., $M > 1{,}000$ [11, 12, 41]. For a given significance level $\alpha$ (e.g., $\alpha = 0.05$), we are interested in an interval within which we expect to find a percentage $100(1 - \alpha)$ of the normal values of the quantization error. Hence, we compute the lower and upper limits of this interval as follows:
– Lower limit ($\rho^-$): This is the $100(\alpha/2)$th percentile.
– Upper limit ($\rho^+$): This is the $100(1 - \alpha/2)$th percentile.
This interval $[\rho^-, \rho^+]$ can then be used to classify a new state vector as normal or abnormal by means of a simple decision rule:
$$\text{IF } e_{new} \in [\rho^-, \rho^+] \ \text{ THEN } \mathbf{x}_{new} \text{ is NORMAL} \ \text{ ELSE } \mathbf{x}_{new} \text{ is ABNORMAL} \tag{10}$$
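The p value test of (8) and the percentile-based interval behind rule (10) can be sketched in a few lines of Python. The sketch below is a minimal illustration and not the implementation of [7] or [20]: the function names, the linear-interpolation percentile, the default number of bootstrap instances and the toy quantization errors are all assumptions made here.

```python
import random


def p_value(e_new, train_errors):
    """Eq. (8): fraction of training quantization errors greater than e_new."""
    b = sum(1 for e in train_errors if e > e_new)
    return b / len(train_errors)


def percentile(sorted_values, q):
    """q-th percentile (0 <= q <= 100) of sorted data, linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def boopi(train_errors, alpha=0.05, n_boot=2000, seed=0):
    """Percentile interval [rho_minus, rho_plus] from a bootstrap sample."""
    rng = random.Random(seed)
    boot = sorted(rng.choices(train_errors, k=n_boot))  # with replacement
    return (percentile(boot, 100 * alpha / 2),
            percentile(boot, 100 * (1 - alpha / 2)))


def is_normal(e_new, interval):
    """Decision rule (10): NORMAL if e_new lies inside the interval."""
    rho_minus, rho_plus = interval
    return rho_minus <= e_new <= rho_plus


# Toy quantization errors (hypothetical values, for illustration only).
rng = random.Random(42)
errors = [abs(rng.gauss(1.0, 0.2)) for _ in range(100)]
interval = boopi(errors)
accept_h0 = p_value(10.0, errors) > 0.05   # single-threshold test, Step 4
```

Note that the single-threshold test reacts only to unusually large errors, whereas the interval of rule (10) also flags unusually small ones, which is the point made at the beginning of this section.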
In this paper we propose a double-threshold method that can be viewed as an alternative to the BOOPI approach. Instead of computing the $100(\alpha/2)$th and $100(1 - \alpha/2)$th percentiles for the $M$ bootstrap samples of the quantization errors, we use the well-known statistical box-plot technique³ to determine the interval $[\rho^-, \rho^+]$ based solely on the original set of quantization errors $(e_1, \ldots, e_m)$. As will be shown in the simulations, the box-plot approach proved to be one of the most robust approaches to novelty detection. It is worth noting that the proposed approach using the box plot is very similar to the one introduced by [38]. However, there are two important differences: (i) in our case, the interval $[\rho^-, \rho^+]$ is computed from the original set of training quantization errors $(e_1, \ldots, e_m)$, while in [38] the interval is computed from a cleaned training data set from which the outliers were removed; (ii) to detect and remove outliers, the method by [38] demands the additional computation of the MID matrix⁴ and the Sammon’s mapping [22], which makes it unsuitable for online applications due to the excessive computational burden required.⁵
² The percentile of a distribution of values is a number $N_\alpha$ such that a percentage $100(1 - \alpha)$ of the population values are less than or equal to $N_\alpha$. For example, the 75th percentile (also referred to as the 0.75 quantile) is a value ($N_\alpha$) such that 75% of the values of the variable fall below that value.
³ In box plots, ranges or distribution characteristics of values of a selected variable (or variables) are plotted separately for groups of cases defined by values of a categorical (grouping) variable. The central tendency (e.g., median or mean), and range or variation statistics (e.g., quartiles, standard errors, or standard deviations) are computed for each group of cases and the selected values are presented in the selected box plot. Outlier data points can also be plotted.
⁴ The Median Interneuron Distance (MID) matrix is defined as that whose $m_{ij}$ entry is the median of the Euclidean distances between the weight vector $\mathbf{w}_i$ and all neurons within its L-neighborhood.
⁵ The Sammon’s mapping is a nonlinear mapping that maps a set of input vectors onto a plane, trying to approximately preserve the relative distances between the input vectors. It is widely used to visualize the SOM ordering by mapping the values of weight vectors onto a plane. Sammon’s mapping can be applied directly to data sets, but it is computationally very intensive.
3.4 Multilayer feedforward supervised networks
The most popular supervised ANN algorithm, the multilayer Perceptron (MLP), learns an input–output mapping through minimization of some objective function, usually the mean squared error. Due to their popularity, MLPs have also been widely used for novelty detection purposes. Two main approaches are common: (1) if examples of normal and abnormal behaviors are available, the MLP is used as a nonlinear classifier [5, 34, 45]; (2) if only data representing normal behavior is available, then the MLP is commonly used as an auto-associator [21]. These possibilities are described in more detail next.
The single-hidden-layer MLP classifier implements very general nonlinear discriminant functions [49]. Usually, if there are $q$ classes of data, $(C_1, C_2, \ldots, C_q)$, we will need $q$ output neurons. These neurons are then trained to produce output values $y_i$, $i = 1, \ldots, q$, that encode the different classes. For example, neuron $i$ should produce an output value $y_i$ close to 1 if the input vector belongs to class $C_i$; otherwise, $y_i = 0$ (or $y_i = -1$). For testing the classification performance, we assign a new input vector $\mathbf{x}$ to class $C_k$, if $k$ is the index of the neuron with the highest output value:
$$k = \arg\max_{\forall i} \{y_i\}. \tag{11}$$
For novelty detection purposes, given a new input vector, once we find the neuron with the highest output value as in (11), we verify if $y_k$ is below a preset threshold ($\rho$). If so, then a novelty is declared. This approach was used by Markou and Singh [34], Augusteijn and Folkert [5] and Vasconcelos et al. [45]. Augusteijn and Folkert [5], however, argued that this classification scheme is unsuitable for novelty detection, since it takes into account only the information carried by a single output neuron. Hence, they suggested taking the entire output pattern into account, so that the distance between this output pattern and each one of the target patterns (used during training) is computed, and if the smallest of these distances is above a preset threshold then the input pattern is considered to be novel. To improve the MLP performance in novelty/outlier detection tasks, Vasconcelos et al. [45] suggested applying the Gaussian Multilayer Perceptron (GMLP) [10], which uses Gaussian activation functions for hidden neurons, instead of the usual sigmoids. This simple modification provided better results, due to the fact that Gaussian activation functions force the receptive field of a neuron to be more selective, being activated only for a narrow partition of the input space.
The MLP is also commonly used for novelty detection tasks as an autoassociative architecture [21, 39]. The autoassociative MLP (AAMLP) is designed to learn an input–output mapping in which the target vectors are the input vectors themselves. This is usually implemented through a hidden layer whose number of neurons is much lower than the dimension of the input vectors [18]. The AAMLP is trained to reconstruct as well as possible a training set consisting of vectors representing normal behavior. In this sense, the autoassociative MLP can be viewed as a nonlinear version of the Novelty Filter presented in Sect. 3.1. Hence, it should be able to adequately reconstruct subsequent normal input vectors, but should perform poorly on the task of reconstructing abnormal (novel) ones. Thus, the detection of novel or anomalous input patterns reduces to the task of assessing how well such vectors are reconstructed by the autoassociative MLP. Quantitatively, this procedure consists in computing an upper bound for the reconstruction error of all the training set vectors at the end of training. For testing purposes, this upper bound is usually relaxed a little by a certain percentage. New input patterns are subsequently classified by checking whether the reconstruction error of the new input pattern is above the relaxed upper bound, thus revealing novel data, or below it (if the data is normal).
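The three decision rules just described for MLP-based detectors operate only on network outputs, so they can be sketched independently of how the network is trained. The following minimal Python sketch assumes that the output vector and the reconstruction errors are already available; the function names, the threshold values and the 10% relaxation are hypothetical choices made here for illustration.

```python
import math


def novelty_by_max_output(y, rho=0.5):
    """Winner-take-all rule of Eq. (11) plus a threshold on the winning output.

    Returns (k, is_novel), where k is the index of the largest output and
    is_novel is True when that output falls below the threshold rho.
    """
    k = max(range(len(y)), key=lambda i: y[i])
    return k, y[k] < rho


def novelty_by_target_distance(y, targets, rho):
    """Whole-output-pattern rule: novel if even the nearest target pattern
    is farther than rho from the output vector y."""
    return min(math.dist(y, t) for t in targets) > rho


def relaxed_bound(train_recon_errors, slack=0.10):
    """Autoassociative rule: largest training reconstruction error,
    relaxed by a fraction `slack` (assumed value)."""
    return (1.0 + slack) * max(train_recon_errors)


# Hypothetical three-class example with one-of-q target patterns.
targets = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
y_clear = (0.9, 0.1, 0.05)       # output close to the first target
y_ambiguous = (0.6, 0.55, 0.1)   # two outputs compete

k, novel_single = novelty_by_max_output(y_ambiguous)
novel_pattern = novelty_by_target_distance(y_ambiguous, targets, rho=0.5)
bound = relaxed_bound([0.1, 0.3, 0.2])
```

For the ambiguous output, the largest component (0.6) still exceeds the threshold, so the single-neuron rule declares no novelty, whereas the distance to the nearest target pattern (about 0.69) exceeds 0.5 and the whole-pattern rule does. This illustrates the kind of case that motivates taking the entire output pattern into account.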
Another popular supervised multilayer ANN, the Radial Basis Function (RBF) network, has been used for novelty detection [2]. In such applications, RBF networks have been used in a way very similar to the MLP network. However, the same limitations presented by MLP-based novelty detectors also apply to RBF-based ones, and the method proposed by Augusteijn and Folkert [5] can be used instead. An alternative is proposed by Li et al. [31]. Let the output value of neuron $i$ be given by
$$y_i(\mathbf{x}) = \mathbf{w}_i^T \boldsymbol{\phi}(\mathbf{x}) + b_i, \tag{12}$$
where $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}) \ \cdots \ \phi_q(\mathbf{x})]^T$ is the vector of Gaussian basis function activations and $b_i$ is the bias of the output neuron $i$. Li et al. [31] developed a method to set the threshold values for each output neuron of an RBF network as follows:
$$\rho_i = b_i + \varepsilon_i, \tag{13}$$
where $b_i$ is the bias of neuron $i$ computed during training and $0 < \varepsilon_i \ll 1$ is a very small positive constant required to make the classifier robust to noise and disturbances while causing little increase in the misclassification rate. Using this method, outputs may be readily interpreted as an ‘unknown fault’ when none of the ‘normal’ or ‘fault’ output neurons exceeds the threshold $\rho_i$.
4 A unifying methodology for performance comparison
As pointed out by Markou and Singh [33], there are a number of studies and proposals in the field of novelty detection, but comparative works are rare. To the best of our knowledge, only a few papers have compared different neural models on the same data set [1, 34, 48, 53]. None of them provided comprehensive results on which one is less sensitive to training parameters (e.g., number of training epochs, number of neurons, etc.), which one is more robust to outliers, and which data preprocessing method provides better results. In this paper, we try to answer some of these questions by providing a general methodology to compare the performance of supervised and unsupervised neural-based novelty detection systems on a common basis. The rationale behind the proposal of a general methodology is the observation that the decision thresholds of neural-based novelty detectors are computed heuristically, without clearly stated principles. For example, a commonly used heuristic for MLP- or RBF-based novelty detectors is to set the decision threshold to $\rho = 0.5$. That is, if all the outputs of the network fall below this value then an unknown (novel) activity is detected. The same argument applies to unsupervised methods, but to a lesser extent. In general, the computation of decision thresholds involves statistical tools, such as p value, bootstrap resampling, box plot, percentiles, among others.
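To illustrate how such heuristic output thresholds behave, the minimal Python sketch below contrasts the fixed value 0.5 with per-neuron thresholds in the style of (13) on a toy RBF network following (12). Everything in it is assumed for illustration: the Gaussian basis form with a common width, the hand-set centers, weights and biases, and a single value for the small constant added to every bias.

```python
import math


def gaussian_basis(x, centers, width):
    """Gaussian activations phi(x); the exact basis form is assumed here."""
    return [math.exp(-math.dist(x, c) ** 2 / (2.0 * width ** 2)) for c in centers]


def rbf_outputs(x, centers, width, weights, biases):
    """Eq. (12): y_i(x) = w_i^T phi(x) + b_i for each output neuron i."""
    phi = gaussian_basis(x, centers, width)
    return [sum(w * p for w, p in zip(w_i, phi)) + b_i
            for w_i, b_i in zip(weights, biases)]


def is_unknown(outputs, thresholds):
    """Unknown activity: no output neuron exceeds its own threshold."""
    return all(y <= rho for y, rho in zip(outputs, thresholds))


# Hypothetical two-class network with one Gaussian unit per class.
centers = [(0.0, 0.0), (5.0, 5.0)]
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.05, 0.05]

fixed = [0.5, 0.5]                        # heuristic threshold of 0.5
per_neuron = [b + 1e-3 for b in biases]   # Eq. (13) with a common small constant

y_mid = rbf_outputs((1.5, 0.0), centers, 1.0, weights, biases)
y_far = rbf_outputs((50.0, 50.0), centers, 1.0, weights, biases)
```

Far from every center the basis activations vanish and each output collapses to its bias, so both rules report an unknown input. For the intermediate point, the first output (about 0.37) lies below 0.5 but well above its bias, so only the fixed heuristic reports an unknown input. The two heuristics can therefore disagree on the same network, which is part of the motivation for computing thresholds from the data instead.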
In this paper, we argue that most of the techniques described for SOM-based novelty detectors can be adapted to MLP- and RBF-based novelty detectors in a straightforward manner. To that end, once the neural method to be evaluated is defined, the approach we propose for computing decision thresholds is a combination of the four steps listed below.

Step 1: Define the output variable, $z_t$, to be evaluated at a given time step $t$. It is worth emphasizing that $z_t$ should reflect the statistical variability of the training data. For that purpose, we give some possibilities next.
OLAM: The Euclidean norm of the novelty vector, $z_t = \|\tilde{\mathbf{x}}(t)\|$, is a possible choice (see Sect. 3.1).
SOM: The quantization error, $z_t = \|\mathbf{x}(t) - \mathbf{w}_i(t)\|$, is the usual choice (see Sect. 3.2).
MLP/RBF: In this case, we have two situations.
– For single-output networks, it can be the output value of the network itself, i.e., $z_t = y(t)$.
– For multi-output networks, it can be the norm of the difference between the vector of desired outputs, $\mathbf{d}(t)$, and the vector of actual outputs, $\mathbf{y}(t)$. Then, we have $z_t = \|\mathbf{d}(t) - \mathbf{y}(t)\|$. If the autoassociative MLP is being used, $z_t$ can be chosen as the reconstruction error.

Step 2: Once the learning machine is trained, compute the values of $z_t$ for each vector of the training set, $Z = (z_1, z_2, \ldots, z_m)$.

Step 3: Generate a sample of $M$ bootstrap instances $Z^b = (z^b_1, z^b_2, \ldots, z^b_M)$, drawn with replacement from the original sample $(z_1, z_2, \ldots, z_m)$, where each original value $z_i$ has equal probability of being sampled.

Step 4: Compute the threshold for novelty detection tests using the bootstrap samples $Z^b$. In this case, we again have two possibilities:
– For single-threshold methods, one can choose, e.g., the p value approach proposed by Höglund et al. [20], shown in (8), or the one proposed by Tanaka et al. [44], shown in (9).
– For double-threshold methods, one can compute decision intervals through percentiles [7] or by the box-plot method, both described in Sect. 3.3.2.
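The four steps above can be sketched in a few lines for the SOM case with a double-threshold (percentile) decision interval. This is only an illustrative sketch: the prototype vectors stand in for trained SOM weights, the synthetic training data, the number of bootstrap instances (2000), the significance level `alpha = 0.05` and all function names are assumptions, and the interpolation rule used for the percentiles is one common convention rather than the one prescribed in [7].

```python
import math
import random


def quantization_error(x, prototypes):
    """Step 1 (SOM case): distance from x to its closest prototype."""
    return min(math.dist(x, w) for w in prototypes)


def bootstrap_sample(z, n_boot, rng):
    """Step 3: draw n_boot values from z with replacement (uniform weights)."""
    return [rng.choice(z) for _ in range(n_boot)]


def percentile(values, p):
    """Empirical percentile, 0 <= p <= 100, with linear interpolation."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def percentile_interval(zb, alpha=0.05):
    """Step 4 (double threshold): central (1 - alpha) percentile interval."""
    return percentile(zb, 100 * alpha / 2), percentile(zb, 100 * (1 - alpha / 2))


def is_novel(z_new, interval):
    """Flag z_new as novel when it falls outside the decision interval."""
    lower, upper = interval
    return z_new < lower or z_new > upper


rng = random.Random(0)
prototypes = [(0.2, 0.2), (0.8, 0.8)]  # hypothetical stand-in for trained SOM weights
train = [(rng.gauss(cx, 0.05), rng.gauss(cy, 0.05))
         for cx, cy in prototypes for _ in range(100)]

z = [quantization_error(x, prototypes) for x in train]   # Step 2
zb = bootstrap_sample(z, 2000, rng)                       # Step 3
interval = percentile_interval(zb, alpha=0.05)            # Step 4

print(interval)
print(is_novel(quantization_error((0.5, 0.5), prototypes), interval))
```

Only `quantization_error` is specific to the SOM; replacing it with the network output or the reconstruction error gives the same procedure for MLP, RBF or autoassociative models, which is the point of defining a common output variable $z_t$.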
The advantages of the proposed methodology are listed below.
– Reliability: It is a statistically well-founded approach, since it is based on the bootstrap method. In addition, if one adopts the BOOPI, the computed thresholds correspond exactly to prediction (confidence) intervals for the output variable $z_t$.
– Nonparametric: No assumptions about the statistical properties of the output variable are made at any stage of the procedure.
– Generality: It allows the comparison of supervised and unsupervised learning methods on a common basis.
– Robustness: The bootstrap technique allows the generation of a large number of samples, improving the estimates of parameters.
– Simplicity: The method is logical and very easy to understand and apply.

An interesting by-product of the proposed methodology is the development of a simple data-cleaning strategy, as described in the next section.
5 Data preparation strategies

Ypma and Duin [52] comment on the usual unavailability of samples that accurately describe all the faults in the system and claim that the best solution is to build a representation of the normal operation of the system and measure faults as deviations from this normality. In [43], this problem is addressed using Vapnik's principle of never solving a problem that is more general than the one we are actually interested in. Following this principle, if we are interested only in detecting novelty, it is not always necessary to estimate a full density model of the data.

As pointed out earlier in this paper, a common approach to novelty detection (for some authors the genuine one!) is to treat the problem as a single-class modeling/classification problem, in which we are interested in building a good representation for only a restricted class and then creating a method to test whether novel events are members of this class. In this kind of method, the training set should ideally be outlier-free. The supporters of this viewpoint argue that a novelty detection system may have its performance improved if associated with some mechanism of outlier cleaning.

The proposed methodology can be used to perform automatic data cleaning by removing anomalous (undesirable) vectors from the training set and then retraining the neural model with the cleaned data set. The proposed data-cleaning procedure is detailed in the following.

Step 1: Choose a neural model and then compute decision thresholds according to the methodology described in Sect. 4.
Step 2: Apply novelty tests using the training data vectors. Exclude those vectors classified as “abnormal” from the original training set.
Step 3 (optional): Retrain the network with the cleaned set.
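A minimal sketch of the cleaning steps above is given below. It assumes a box-plot style decision interval built from Tukey's fences with the conventional factor `k = 1.5`; that factor, the toy one-dimensional data and the `score` function (a stand-in for the trained model's output variable $z_t$) are assumptions for illustration and may differ from the box-plot rule of Sect. 3.3.2.

```python
import statistics


def boxplot_fences(z, k=1.5):
    """Tukey-style fences [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumption."""
    q1, _, q3 = statistics.quantiles(z, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean_training_set(vectors, score, k=1.5):
    """Steps 1-2: score every training vector and drop those outside the fences."""
    z = [score(x) for x in vectors]
    lower, upper = boxplot_fences(z, k)
    kept, removed = [], []
    for x, zi in zip(vectors, z):
        (kept if lower <= zi <= upper else removed).append(x)
    return kept, removed


# Toy training set with one obviously anomalous vector.
vectors = [[0.10], [0.12], [0.09], [0.11], [0.10], [0.13], [0.08], [0.95]]


def score(x):
    """Hypothetical output variable: distance to a single prototype at 0.1."""
    return abs(x[0] - 0.1)


kept, removed = clean_training_set(vectors, score)
print(len(kept), removed)
# Step 3 (optional) would retrain the model on `kept` and recompute the thresholds.
```

Because the thresholds are computed from the same training data that is being cleaned, retraining on `kept` also shifts the fences, which is one reason the retraining step is recommended.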
It is worth mentioning that network retraining is optional, although highly recommended, because the presence of outliers may have considerably distorted the decision surfaces and, hence, the positions of the decision thresholds. In Sect. 6, we report simulations showing the benefits of this data-cleaning procedure.

5.1 Data-scaling strategies

Data scaling is an important issue that is usually underemphasized in applications of neural methods to novelty detection. In this paper, we also evaluate the neural algorithms in this respect, assessing the influence of different methods on the performances of the novelty detectors. For this purpose, three techniques are utilized. Two of them normalize each component of the input vectors, while the
third one normalizes the length of the vectors and decorrelates their components.

Soft normalization: The distributions of the components $x_j$ are standardized to zero mean and unit variance:

$$x_j^{new} = \frac{x_j - \bar{x}_j}{\sigma_j}, \qquad (14)$$

where

$$\bar{x}_j = \frac{1}{m}\sum_{i=1}^{m} x_{ij} \quad \text{and} \quad \sigma_j = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m} (x_{ij} - \bar{x}_j)^2}, \qquad (15)$$

with the sums taken over the $m$ data vectors and $x_{ij}$ denoting the $j$-th component of the $i$-th vector.

Hard normalization: The components $x_j$ are rescaled to the [0, 1] range:

$$x_j^{new} = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}, \qquad (16)$$

in which $\max(x_j)$ and $\min(x_j)$ are the maximum and minimum values of $x_j$, respectively.

Whitening and sphering: The original set of data vectors $\{\mathbf{x}_i\}_{i=1}^{m}$ is transformed to a new set of vectors $\{\mathbf{v}_i\}_{i=1}^{m}$ whose components are uncorrelated and have unit variance. In other words, the covariance matrix of the transformed vectors becomes the identity matrix, $E\{\mathbf{v}_i \mathbf{v}_i^T\} = \mathbf{I}$. This is usually performed through the eigenvalue decomposition (EVD) of the covariance matrix $\mathbf{R}_x = E\{\mathbf{x}_i \mathbf{x}_i^T\}$ of the original data vectors [49].
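The two component-wise normalizations of Eqs. (14)–(16) can be sketched as follows. The function names and the toy data are assumptions; the sketch also assumes that no component is constant over the data set, since a constant component would make the denominators zero.

```python
import statistics


def soft_normalize(vectors):
    """Eqs. (14)-(15): per-component zero mean and unit (sample) variance."""
    columns = list(zip(*vectors))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]  # 1/(m-1) estimator, as in Eq. (15)
    return [[(x - mu) / sd for x, mu, sd in zip(v, means, stds)] for v in vectors]


def hard_normalize(vectors):
    """Eq. (16): rescale each component to the [0, 1] range."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(v, lows, highs)]
            for v in vectors]


data = [[1.0, 10.0], [5.0, 4.0], [10.0, 1.0]]  # toy vectors with attributes in [1, 10]
print(hard_normalize(data))
print(soft_normalize(data))
```

In a novelty detection setting, the statistics (means, standard deviations, minima and maxima) would normally be estimated on the training set and then reused unchanged for test vectors; the sketch computes them on the data it is given only to stay short.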
6 Simulations and results

In this section, we evaluate the performance of the neural network methods discussed in Sect. 3 through simulations on a breast cancer data set [50], available through the UCI Machine Learning Repository [8]. This data set was chosen because biomedical applications require high accuracy due to the human factors involved. False positive and false negative errors in diagnosis have different implications for the person being analyzed, but both should be reduced. Unsupervised and supervised architectures were assessed under the proposed methodology by their robustness to outliers and their sensitivity to training parameters, such as data scaling, number of neurons, training epochs and size of the training set.

Let a cancer detection test be performed under the null hypothesis $H_0$ that the person is healthy (i.e., normal behavior). If cancer is not detected (false negative), the person returns home and usually stops worrying about health for a while, until the next visit to the doctor. This may lead to a serious situation, since the detection of a malignant tumor in the earlier stages of development is crucial for the success of the treatment. If a false cancer is detected (false positive), the person will require further investigation of the disease and will eventually discover that the previous diagnosis was wrong. In this case, besides additional costs for new exams, the person is exposed to undesirable psychological stress while waiting for conclusive evidence of cancer.

For SOM-based novelty detectors, the output variable $z_t$ is the quantization error. For MLP- and RBF-based novelty detectors, the output variable $z_t$ is the output of the network itself, except for the AAMLP, for which we selected the norm of the reconstruction error. For all the neural architectures, the decision thresholds were determined from the bootstrap samples of $z_t$ using the following methods: p value, box-plot and BOOPI. Additionally, for SOM-based novelty detectors, Tanaka's method (here denoted by TNK) is also used to compute decision thresholds.

All tests were performed using one-dimensional SOMs (see footnote 6). The MLP has a single hidden layer and is trained with the standard backpropagation algorithm with a momentum term. The logistic function was adopted for all neurons. The RBF network consisted of a first layer of Gaussian basis functions whose centers $\mathbf{c}_i$ were computed by the SOM algorithm. A single radius is defined for all the Gaussians, computed as a fraction of the maximum distance among all the centers, i.e., $\sigma = d_{\max}(\mathbf{c}_i, \mathbf{c}_j)/\sqrt{2q}, \; \forall i \neq j$, where $q$ is the number of basis functions and $d_{\max}(\mathbf{c}_i, \mathbf{c}_j) = \max_{\forall i \neq j}\{\|\mathbf{c}_i - \mathbf{c}_j\|\}$.

In the simulations, we are interested in the evaluation of the following issues:
– Novelty detection using the aforementioned neural network techniques.
– Performance improvement through the proposed outlier removal procedure.
– Performance sensitivity to different data preprocessing methodologies.
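The single-radius rule used for the RBF network in this section can be sketched as follows. The centers below are hypothetical (in the simulations they are computed by the SOM algorithm), and the Gaussian form $\exp(-\|\mathbf{x} - \mathbf{c}_i\|^2 / 2\sigma^2)$ used for the activations is a common convention assumed here for illustration.

```python
import math
from itertools import combinations


def shared_radius(centers):
    """sigma = d_max / sqrt(2q), with d_max the largest distance between two centers."""
    q = len(centers)
    if q < 2:
        raise ValueError("at least two centers are needed")
    d_max = max(math.dist(a, b) for a, b in combinations(centers, 2))
    return d_max / math.sqrt(2 * q)


def gaussian_activations(x, centers, sigma):
    """Vector of Gaussian basis function activations for input x (assumed form)."""
    return [math.exp(-math.dist(x, c) ** 2 / (2 * sigma ** 2)) for c in centers]


centers = [(0.0, 0.0), (3.0, 4.0)]  # hypothetical centers; d_max = 5, q = 2
sigma = shared_radius(centers)
print(sigma)
print(gaussian_activations((0.0, 0.0), centers, sigma))
```

With this rule the radius shrinks as more basis functions are added for a fixed spread of the centers, so the individual Gaussians become more localized.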
6.1 Novelty detection approaches

Unsupervised ANNs

The first set of simulations aims at comparing the novelty detection ability of the neural methods. The entire data set consisted of 699 nine-dimensional feature vectors, whose attributes $x_i, i = 1, \ldots, 9$, are the following: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. All the attributes assume values within the range [1, 10]. The hard normalization method was used to rescale the data to the [0, 1] range.

Footnote 6: We have tested different SOM topologies and numbers of neurons. Confirming previous work on novelty detection (see reference [7]), the best results were obtained for 1D-SOMs, which have the additional advantage of being computationally lighter than 2D-SOMs.
Sixteen instances, each containing a single missing attribute value, were excluded from the original data set. Of the remaining 683 vectors, 444 corresponded to benign tumors and 239 to malignant ones. Of the 444 “normal” vectors, 355 (about 80%) were selected for training purposes. Of the remaining 89 “normal” vectors, 30 were replaced by “abnormal” vectors, randomly chosen from the set of 239 “abnormal” vectors. The inclusion of abnormal vectors in the testing set was necessary in order to evaluate the false negative (Error II) rates. If only examples of normal vectors were present in the testing set, we could estimate only the false positive (Error I) rates. For each combination of neural network model and decision threshold computation strategy, this procedure was repeated for 100 simulation runs, and the final error rates were averaged accordingly.

The false negative rates achieved by SOM-based novelty detectors as a function of the number of neurons are shown in Fig. 1. Each neural model in this figure was trained for 100 epochs. It can be noted that the pair (SOM, box-plot) produced the lowest rates, followed very closely by the pair (SOM, p value). The pairs (SOM, BOOPI) and (SOM, TNK) provided the worst rates.

The second set of simulations evaluates the sensitivity of the SOM-based novelty detectors to changes in the number of training epochs, as shown in Fig. 2. The training parameters used were the same as those used for the first set of simulations, except that the number of neurons was set to 40. The overall performances remain the same as in Fig. 1, with the pair (SOM, box-plot) achieving the lowest false negative rates.
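The two error rates discussed above can be computed from a detector's decisions on a labeled test set as sketched below. The test set composition (59 normal and 30 abnormal vectors) follows the text; the detector decisions are invented for illustration, and normalizing each rate by the size of its own class is an assumption, since the exact denominators are not stated here.

```python
def error_rates(flagged_novel, truly_abnormal):
    """Return (false positive %, false negative %) for a labeled test set."""
    pairs = list(zip(flagged_novel, truly_abnormal))
    n_normal = sum(1 for _, abnormal in pairs if not abnormal)
    n_abnormal = len(pairs) - n_normal
    # Error I: a normal vector is flagged as novel.
    fp = sum(1 for flag, abnormal in pairs if flag and not abnormal)
    # Error II: an abnormal vector is accepted as normal.
    fn = sum(1 for flag, abnormal in pairs if not flag and abnormal)
    return 100.0 * fp / n_normal, 100.0 * fn / n_abnormal


# Test set as in the text: 89 held-out normal vectors, 30 of them replaced by abnormal ones.
labels = [False] * 59 + [True] * 30
# Hypothetical decisions: 2 normal vectors flagged, 3 abnormal vectors missed.
decisions = [True] * 2 + [False] * 57 + [True] * 27 + [False] * 3

fp_rate, fn_rate = error_rates(decisions, labels)
print(round(fp_rate, 2), round(fn_rate, 2))
```

Averaging the two returned values over repeated random splits gives figures comparable in spirit to those reported over 100 simulation runs. The function requires both classes to be present, which mirrors the remark that a test set containing only normal vectors allows only Error I to be estimated.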
As a double-threshold decision test, the pair (SOM, box-plot) can detect outliers in regions of high quantization error (QE) as well as in regions of low QE. As discussed in Sect. 3.3.2, outliers of the latter type (referred to as unknown outliers) can be the result of erroneous labeling. If unknown outliers are present in the training set, some neurons may be attracted to these spurious patterns, so that in the future some outliers will probably fire these neurons, producing low quantization errors. Only novelty decisions based on double-threshold methods, such as the box-plot or the BOOPI, can detect outliers in this case. The pair (SOM, BOOPI), which in theory could also detect outliers in the low-QE region, performed better only than the pair (SOM, TNK). This may be due to the fact that the great majority of outliers are in the region of high QE, probably because of the low occurrence of unknown outliers (such as mislabeled normal data) in the training set, thus implicitly revealing the good quality of the data set.

It is interesting to note that the performance of the pair (SOM, TNK) gets worse as the number of training epochs increases. This may occur because of the very nature of Tanaka's test. Once the SOM network has more time to converge, it better fits the data manifold. The quantization error $e_{new}$ then tends to decrease even further, while the novelty threshold $\rho$ computed through (9) tends to stabilize, remaining constant. So, as the network achieves a better representation of the data, it becomes increasingly rare to observe $e_{new} > \rho$, and hence the test is almost never positive for novelty, even when the presented data vector is truly novel. This contradicts the common-sense expectation that the better the representation of the data, the better the network's result.

Finally, another interesting conclusion drawn from Figs. 1 and 2 is that, for a large number of neurons or a very long training period, the pairs (SOM, box-plot) and (SOM, p value) produced very similar false negative rates. This may be because increasing the number of neurons of the SOM or the number of training epochs decreases the mean value of the quantization error, so that few real outliers fall below the decision threshold computed according to the p value method.

Fig. 1 False negative rates (in percentage) as a function of the number of neurons for SOM-based novelty detectors (SOM, box-plot), (SOM, p value), (SOM, BOOPI) and (SOM, TNK) (axes: Number of Neurons vs. False Negative Rates (%); curves: BOOPI, p value, box-plot, Tanaka)

Fig. 2 False negative rates (in percentage) as a function of the number of training epochs for SOM-based novelty detectors (SOM, box-plot), (SOM, p value), (SOM, BOOPI) and (SOM, TNK) (axes: Number of Training Epochs vs. False Negative Rates (%); curves: BOOPI, p value, box-plot, Tanaka)

The third set of simulations aims at comparing the accuracy of SOM-based novelty detectors with respect to the size of the training and testing sets. The purpose of this test is to give a rough idea of which method requires less data to achieve high accuracy. Results are shown in Fig. 3 only for the pairs (SOM, box-plot), (SOM, p value) and (SOM, BOOPI). The results for the pair (SOM, TNK) were very poor and are not shown in the figure, to improve the visualization of results for the other three pairs. By analyzing this figure, we observe no relevant changes in performance even as the sample size varied significantly. For these tests, the number of neurons and the number of training epochs were set to 40 and 100, respectively.

Supervised ANNs

The same tests described above for SOM-based novelty detectors are repeated here for the supervised methods (MLP, autoassociative MLP, GMLP and RBF). The first set of simulations evaluates the false negative rate as a function of the number of hidden neurons. For these tests, each MLP network was trained for 1,000 epochs with normal data vectors only. The learning
rate and the momentum factor were set to 0.35 and 0.5, respectively. For the sake of clarity, the results are shown only for the p value (Fig. 4) and the box-plot (Fig. 5) decision threshold methods, since they provided the best overall results. The best performances were achieved by the pairs (MLP, p value) and (RBF, box-plot). These figures also illustrate that, for this specific data set, the pair (RBF, p value) did not perform well, while the pair (RBF, box-plot) achieved a very good performance. Note that the same network (i.e., RBF) achieved rather different performances for two different novelty detection techniques (i.e., box-plot and p value). This result is in accordance with the main rationale behind this paper, which is to combine different neural architectures with different novelty detection techniques in order to compare their performances on a common basis for a specific data set. Once the evaluation stage is finished, the user can choose the best combination of neural architecture and novelty detection technique for the given data set.

Fig. 3 False negative rates as a function of the training set size for SOM-based novelty detectors (SOM, box-plot), (SOM, p value) and (SOM, BOOPI). The poor results for the pair (SOM, TNK) are not shown (axes: Size of the Training Data (% of total Data Set) vs. False Negative Rates (%); curves: box-plot, p value, BOOPI)

Fig. 4 False negative rates for supervised novelty detectors (MLP, autoassociative MLP, GMLP and RBF) as a function of the number of hidden neurons, using the p value decision threshold method

Fig. 5 False negative rates for supervised novelty detectors (MLP, autoassociative MLP, GMLP and RBF) as a function of the number of hidden neurons, using the box-plot decision threshold method

To illustrate the influence of negative examples (see footnote 7) during training on the performance of unsupervised and supervised novelty detectors, we evaluated the pairs (SOM, box-plot) and (MLP, p value) for different numbers of abnormal vectors. For the sake of completeness, a standard two-class MLP classifier (normal and abnormal) was included in the performance comparison. This classifier has 2 output neurons and 30 hidden neurons and is trained for 100 epochs, with targets $[1 \; 0]^T$ for normal data and $[0 \; 0]^T$ for abnormal data. Figure 6 shows the results. It is important to point out that the performance of the 2-class MLP classifier is very sensitive to the frequency of negative examples in the training set, achieving its best rates when the two classes have the same percentage representation in the training set. This is a problem with the use of such a classifier for novelty detection, since negative examples are often rare, incomplete or even non-existent. Despite its good performance, its practical application is restricted to cases in which negative examples are abundant (as in the present case) or in which the detection technique includes artificially created negative examples, as proposed in [34].

Fig. 6 False positive rates (%) versus percentage of negative examples for the pairs (SOM, box-plot), (MLP, p value) and for the standard 2-class MLP classifier (axes as printed: Percentage of Outliers in Data vs. False Negative Rates (%); curves: (SOM, box-plot), (MLP, p value), MLP classifier)

Footnote 7: By negative examples, we mean normal data vectors whose original label was changed from normal (+1) to abnormal (-1) in order to simulate the class of negative examples. This class can be understood as the one containing novel (or abnormal) examples, also called outliers.

In a nutshell, it is advantageous to work with a single-class novelty detection system, such as the pairs (RBF, box-plot) or (MLP, p value), instead of a standard 2-class MLP or RBF classifier, which would require a considerable amount of negative examples and a longer training period. Finally, Table 1 summarizes the best performances of the algorithms for novelty detection evaluated in this paper for the breast cancer data set.

Table 1 Best models (NN + novelty test): mean and variance values of the false negative and false positive rates (%) over 100 independent simulations

Model                       False negative      False positive
                            Mean     Var        Mean     Var
RBF + box-plot              0.1      0.1        9.9      11.8
MLP + p value               0.6      0.5        3.4      3.5
2-class MLP                 0.9      0.9        3.5      3.2
SOM + box-plot              2.0      1.0        3.7      2.9
Linear filter + box-plot    3.3      41.0       5.2      11.9

6.2 The role of data cleaning

With respect to the proposed data-cleaning procedure, all methods showed better performance after it was applied. In Fig. 7, for the pair (SOM, box-plot) with 40 neurons, one can see a reduction in the false positive rates after the application of the data-cleaning procedure. This was the best performance among the SOM-based approaches, achieving an average false positive rate below 3%.
Fig. 7 False positive rates (%) versus the number of training epochs for the pair (SOM, box-plot) with 40 neurons, after cleaning the data (axes as printed: Number of Training Epochs vs. False Negative Rates (%); curves: After Data Cleaning, No Data Cleaning)
6.3 The role of data preprocessing

Three preprocessing methods were investigated to improve the novelty detection system: two component-wise normalizations and a sphering/whitening transformation. The appropriate use of one of these procedures can prevent problems like masking (which increases the false negative error, or type II error) or swamping (which increases the false positive error, or type I error), which are very common in distance-based algorithms like the SOM. It is virtually impossible to indicate prior to experimentation which one of these three preprocessing methods is the best, since their performances depend on the data set available and on the chosen neural architectures.

To illustrate how the preprocessing can influence the performance of a given novelty detection method, Fig. 8 shows the results obtained for the (SOM, box-plot) pair as a function of the number of neurons in the SOM. For this specific data set and the chosen novelty detection method, hard normalization outperformed the other two preprocessing methods. First, this may be due to the fact that hard normalization preserves the distance relations among data vectors. Second, whitening and soft normalization are more prone to the masking problem (which increases the false negative error), because in the test (with the presence of outliers) the outliers will distort $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, thus providing lower values for the Mahalanobis (or Euclidean after transformation) distance, and the tests will not detect some outliers.

7 Further discussion
For SOM-based novelty detection methods, the best performances were achieved using the box-plot method [38] for decision threshold computation, followed by the p value [20] and BOOPI [7] methods. The test for novelty proposed in [44] showed the worst performance among all the SOM-based novelty detectors. In particular, its performance degrades as the number of training iterations increases. This occurs because of the very nature of the test. Once the SOM network is better trained (more training iterations) or covers the data region better, we can see that $d_{ext}$, the distance between the actual data vector and its best matching neuron, tends to decrease, while $d_{int}$, the distance between the best matching neuron and its nearest neighbor, tends to stabilize and becomes almost constant for the entire network. So, as the network achieves a better representation of the data, it becomes increasingly rare to see $d_{int} < d_{ext}$, and the test is almost never positive for novelty, even when the presented data are novel. This contradicts the common-sense expectation in neural networks that the better the representation of the data, the better the network's result.
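The comparison between $d_{ext}$ and $d_{int}$ described above can be sketched as follows. This is only an illustration of the verbal description in this section, not a transcription of Eq. (9): in particular, taking the "nearest neighbor" of the best matching neuron to be the closest other prototype in input space (rather than a neighbor on the map lattice) is an assumption, as are the prototype values and the function name.

```python
import math


def tanaka_style_test(x, prototypes):
    """Return True (novel) when d_ext > d_int for the input vector x."""
    # Index of the best matching neuron (closest prototype to x).
    i_star = min(range(len(prototypes)), key=lambda i: math.dist(x, prototypes[i]))
    bmu = prototypes[i_star]
    d_ext = math.dist(x, bmu)
    # Assumed definition: distance from the winner to its closest other prototype.
    d_int = min(math.dist(bmu, w) for i, w in enumerate(prototypes) if i != i_star)
    return d_ext > d_int


prototypes = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # hypothetical 1D-SOM weights
print(tanaka_style_test((0.1, 0.1), prototypes))   # input close to a prototype
print(tanaka_style_test((0.0, 5.0), prototypes))   # input far from every prototype
```

The sketch makes the failure mode visible: the threshold $d_{int}$ depends only on the spacing of the prototypes, so once that spacing stabilizes, the test fires only for inputs farther from their winner than the winner is from its neighbor.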
Fig. 8 False positive rates (%) for three preprocessing methods applied to the (SOM, box-plot) pair, as a function of the number of neurons in the SOM (axes as printed: Number of Neurons vs. False Negative Rates (%); curves: Soft Normalization, Hard Normalization, Whitening Transf.)
8 Conclusion

In this paper, we have carried out a comprehensive analysis of supervised and unsupervised ANN-based methods for dealing with novelties/abnormalities/outliers in the data. The goal of the paper was not to carry out a performance comparison among neural architectures using several data sets; rather, we aimed at proposing a novel methodology with which supervised and unsupervised neural network architectures can be compared, under a common framework, on novelty detection tasks. Thus, we have chosen one particular data set to show how to perform such a comparison. We analyzed two working hypotheses: removing outliers from the data, or handling them directly (detecting and labeling them for classification purposes).

For this purpose, we have introduced a unifying methodology to compare the performances of supervised and unsupervised neural network methods applied to novelty detection tasks. We have also extended the application of techniques for computing double decision thresholds to supervised architectures and proposed an outlier removal procedure based on them. Finally, under the proposed framework, we have evaluated the performance of different neural network methods through simulations on a breast cancer data set, assessing their robustness to outliers and their sensitivity to training parameters, such as data scaling, number of neurons, training epochs and size of the training set.
An important issue also evaluated in our experiments was the data preparation strategy, which included an outlier removal procedure and a preprocessing technique. Currently, we are evaluating the proposed methodology on the detection of novelties in time series data. We are also extending it to include recurrent neural network models, such as Elman and echo-state-network (ESN) models.

Acknowledgements The authors thank FUNCAP for supporting this research.
References

1. Addison JFD, Wermter S, MacIntyre J (1999) Effectiveness of feature extraction in neural network architectures for novelty detection. In: Proceedings of the 9th international conference on artificial neural networks (ICANN’99). IEEE Press, Washington, DC, pp 976–981
2. Albrecht S, Busch J, Kloppenburg M, Metze F, Tavan P (2000) Generalized radial basis functions networks for classification and novelty detection: self-organization of optimal bayesian decision. Neural Netw 13:1075–1093
3. Alhoniemi E, Hollmén J, Simula O, Vesanto J (1999) Process monitoring and modeling using the self-organizing map. Integr Comput Aided Eng 6(1):3–14
4. Appiani E, Buslacchi G (2009) Computational intelligence solutions for homeland security. Adv Soft Comput 53:43–52
5. Augusteijn MF, Folkert BA (2002) Neural network classification and novelty detection. Int J Remote Sens 23(14):2891–2902
6. Barreto GA, Aguayo L (2009) Time series clustering for anomaly detection using competitive neural networks. In: Principe JC, Miikkulainen R (eds) Advances in self-organizing maps, vol LNCS-5629. Springer, Berlin, pp 28–36
7. Barreto GA, Mota JCM, Souza LGM, Frota RA, Aguayo L (2005) Condition monitoring of 3G cellular networks through competitive neural models. IEEE Trans Neural Netw 16(5):1064–1075
8. Blake CL, Merz CJ (1998) UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html
9. Cristani M, Bicego M, Murino V (2007) Audio-visual event recognition in surveillance video sequences. IEEE Trans Multimed 9(2):257–266
10. Dawson MRW, Schopflocher DP (1992) Modifying the generalized delta rule to train networks of nonmonotonic processors for pattern classification. Connect Sci 4(1):19–31
11. DiCiccio TJ, Efron B (1996) Bootstrap confidence intervals. Stat Sci 11(3):189–228
12. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, Boca Raton
13. Fisch D, Hofmann A, Sick B (2010) On the versatility of radial basis function neural networks: a case study in the field of intrusion detection. Inform Sci 180(12):2421–2439
14. Flexer A (2001) On the use of self-organizing maps for clustering and visualization. Intell Data Anal 5(5):373–384
15. Frota RA, Barreto GA, Mota JCM (2007) Anomaly detection in mobile communication networks using the self-organizing map. J Intell Fuzzy Syst 18(5):493–500
16. Gonzalez F, Dasgupta D (2002) Neuro-immune and self-organizing map approaches to anomaly detection: a comparison. In: Proceedings of the first international conference on artificial immune systems, Canterbury, UK, pp 203–211
17. Harris T (1993) A Kohonen SOM based machine health monitoring system which enables diagnosis of faults not seen in the training set. In: Proceedings of the international joint conference on neural networks (IJCNN’93), vol 1, pp 947–950
18. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
19. Hodge VJ, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22:85–126
20. Höglund AJ, Hätönen K, Sorvari AS (2000) A computer host-based user anomaly detection system using the self-organizing map. In: Proceedings of the IEEE-INNS-ENNS international joint conference on neural networks (IJCNN’00), vol 5, Como, Italy, pp 411–416
21. Japkowicz N, Myers C, Gluck M (1995) A novelty detection approach to classification. In: Proceedings of the 14th international joint conference on artificial intelligence (IJCAI’95), pp 518–523
22. Sammon JW Jr (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput C-18:401–409
23. King S, Bannister PR, Clifton DA, Tarassenko L (2009) Probabilistic approach to the condition monitoring of aerospace engines. Proc IMechE Pt G J Aerosp Eng 223(5):533–541
24. Kohonen T (1989) Self-organization and associative memory, 3rd edn. Springer, Berlin
25. Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
26. Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, Berlin
27. Kohonen T, Oja E (1976) Fast adaptive formation of orthogonalizing filters and associative memory in recurrent networks of neuron-like elements. Biol Cybernet 25:85–95
28. Laiho J, Kylväjä M, Höglund A (2002) Utilisation of advanced analysis methods in UMTS networks. In: Proceedings of the IEEE vehicular technology conference (VTS/spring), Birmingham, Alabama, pp 726–730
29. Lawrence S, Burns I, Back AD, Tsoi AC, Giles CL (1998) Neural network classification and unequal prior class probabilities. In: Orr G, Müller K-R, Caruana R (eds) Neural networks: tricks of the trade, vol 1524. Lecture Notes in Computer Science, Springer, Berlin, pp 299–314
30. Lee H-J, Cho S, Cho M-S (2008) Supporting diagnosis of attention-deficit hyperactive disorder with novelty detection. Artif Intell Med 42(3):199–212
31. Li Y, Pont MJ, Jones NB (2002) Improving the performance of radial basis function classifiers in condition monitoring and fault diagnosis applications where ‘unknown’ faults may occur. Pattern Recogn Lett 23(5):569–577
32. Markou M, Singh S (2003) Novelty detection: a review, part 1: statistical approaches. Signal Proc 83(12):2481–2497
33. Markou M, Singh S (2003) Novelty detection: a review, part 2: neural network based approaches. Signal Proc 83(12):2499–2521
34. Markou M, Singh S (2006) A neural network-based novelty detector for image sequence analysis. IEEE Trans Pattern Anal Mach Intell 28(10):1664–1677
35. Marsland S (2003) Novelty detection in learning systems. Neural Comput Surv 3:157–195
36. Marsland S, Shapiro J, Nehmzow U (2002) A self-organising network that grows when required. Neural Netw 15(8–9):1041–1058
37. Modenesi AP, Braga AP (2009) Analysis of time series novelty detection strategies for synthetic and real data. Neural Proc Lett 30(1):1–17
38. Muñoz A, Muruzábal J (1998) Self-organising maps for outlier detection. Neurocomputing 18:33–60
39. Petsche T, Marcantonio A, Darken C, Hanson SJ, Kuhn GM, Santoso I (1996) A neural network autoassociator for induction motor failure prediction. In: Touretzky D, Mozer M, Hasselmo M (eds) Advances in neural information processing systems, vol 8. MIT Press, Cambridge, pp 924–930
40. Piciarelli C, Micheloni C, Foresti GL (2008) Trajectory-based anomalous event detection. IEEE Trans Circuit Syst Video Technol 18(11):1544–1554
41. Reich Y, Barai SV (1999) Evaluating machine learning models for engineering problems. Artif Intell Eng 13:257–272
42. Rose CJ, Taylor CJ (2004) A generative statistical model of mammographic appearance. In: Rueckert D, Hajnal J, Yang G-Z (eds) Proceedings of the 2004 medical image understanding and analysis (MUIA’04), pp 89–92
43. Schölkopf B, Williamson RC, Smola AJ, Shawe-Taylor J, Platt JC (2000) Support vector method for novelty detection. In: Solla SA, Leen TK, Müller K-R (eds) Advances in neural information processing systems, vol 12. MIT Press, Cambridge, pp 582–588
44. Tanaka M, Sakawa M, Shiromaru I, Matsumoto T (1995) Application of Kohonen’s self-organizing network to the diagnosis system for rotating machinery. In: Proceedings of the IEEE international conference on systems, man and cybernetics (SMC’95), vol 5, pp 4039–4044
45. Vasconcelos GC, Fairhurst MC, Bisset DL (1995) Investigating feedforward neural networks with respect to the rejection of spurious patterns. Pattern Recogn Lett 16:207–212
46. Vesanto J, Ahola J (1999) Hunting for correlations in data using the self-organizing map. In: Proceedings of the international ICSC congress on computational intelligence methods and applications (CIMA99), pp 279–285
47. Vieira Neto H, Nehmzow U (2007) Visual novelty detection with automatic scale selection. Robotics Autonomous Syst 55(9):693–701
48. Vu D, Vemuri VR (2002) Computer network intrusion detection: a comparison of neural networks methods. J Differ Equ Dyn Syst
49. Webb A (2002) Statistical pattern recognition, 2nd edn. Wiley, New York
50. Wolberg WH, Mangasarian OL (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proc Natl Acad Sci USA 87:9193–9196
51. Yamanishi K, Maruyama Y (2007) Dynamic model selection with its applications to novelty detection. IEEE Trans Inform Theory 53(6):2180–2189
52. Ypma A, Duin RPW (1997) Novelty detection using self-organising maps. In: Kasabov N, Kozma R, Ko K, O’Shea R, Goghill G, Gedeon T (eds) Progress in connectionist-based information systems, vol 2. Springer, Berlin, pp 1322–1325
53. Zhang Z, Li J, Manikopoulos CN, Jorgenson J, Ucles J (2001) HIDE: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification. In: Proceedings of the IEEE workshop on information assurance and security, pp 85–90
54. Zhou J, Cheng L, Bischof WF (2007) Online learning with novelty detection in human-guided road tracking. IEEE Trans Geosci Remote Sens 45(12):3967–3977