Nonstationary Time Series Prediction Using Local Models Based on Competitive Neural Networks

Guilherme A. Barreto¹, João C.M. Mota¹, Luis G.M. Souza², and Rewbenio A. Frota²

¹ Department of Teleinformatics Engineering, Federal University of Ceará, CP 6005, CEP 60455-760, Fortaleza, Ceará, Brazil
{guilherme, mota}@deti.ufc.br, http://www.deti.ufc.br/~guilherme
² Instituto Atlântico: Research & Development in Telecom & IT, Rua Chico Lemos, 946, CEP 60822-780, Fortaleza, Ceará, Brazil
{gustavo, rewbenio}@atlantico.com.br
Abstract. In this paper, we propose a general approach for the application of competitive neural networks to nonstationary time series prediction. The underlying idea is to combine the simplicity of the standard least-squares (LS) parameter estimation technique with the information compression power of unsupervised learning methods. The proposed technique builds the regression matrix and the prediction vector required by the LS method from the weight vectors of the K first winning neurons (i.e., those most similar to the current input vector). Since only a few neurons are used to build the predictor for each input vector, this approach develops local representations of a nonstationary time series suitable for prediction tasks. Three competitive algorithms (WTA, FSCL and SOM) are tested and their performances compared with the conventional approach, confirming the efficacy of the proposed method.
1 Introduction

A scalar time series consists of n observations of a single variable y measured sequentially in time: {y(t), y(t − 1), …, y(t − n + 1)}. Time series prediction (or forecasting) is the engineering task whose goal is to find mathematical models that supply estimates for future values of the variable y [2]. This is possible because, in general, successive values of a series depend on each other over a period dictated by the underlying process responsible for the generation of the series, which can be linear or nonlinear in nature. Several approaches to the prediction task have been studied over the years [13], such as the widely used autoregressive (AR) and moving average (MA) models, as well as their combinations in the ARMA and ARIMA models [2], [3]. Among nonlinear models, successful applications using artificial neural networks (ANNs) have been reported elsewhere [4], [5], [10], [13]. In general, existing time series methods can be classified roughly into global and local models [11]. In global models, a single mathematical model learns the dynamics of the observed series. In local models, the time series is divided into shorter
segments, each one characterized by (usually linear) models simpler than the one required by the global approach. The segmentation of the series is usually performed by clustering algorithms, such as K-means [8] or the Self-Organizing Map (SOM) [7], [12]. In this case, the scalar time series is transformed into a set of data vectors by means of a sliding time window formed by a fixed number of consecutive samples of the series. Then, the parameters of a given local model are computed using only the data associated with the segment it models. The type of model to be used depends on the underlying dynamics of the time series under analysis. Global models are suitable for the prediction of stationary series, while local models are preferred for modeling nonstationary ones. This paper introduces a general design technique for building local models for the prediction of nonstationary time series that can be used with any type of competitive neural network. The method is tested with three competitive algorithms, which are evaluated based on their predictive capacity. The remainder of the paper is organized as follows. Section 2 presents the neural competitive learning algorithms to be used. In Section 3 we review the standard linear parameter estimation problem and introduce a new technique for local modeling of time series through competitive networks. In Section 4 we report computer simulations involving the methods and networks described in the previous sections. The article is concluded in Section 5.
2 Competitive Neural Networks

Competitive learning comprises one of the main classes of unsupervised ANNs, in which only one neuron or a small group of neurons, called winning neurons, is activated according to the proximity of their weight vectors to the current input vector [4], [6], [10]. This type of algorithm is used in pattern recognition and classification tasks, such as clustering and vector quantization. In these applications, the weight vectors are called the prototypes of the set of input patterns. In the simplest competitive algorithm, known as WTA (Winner-Take-All), only one neuron has its weights updated. Training can be described in two basic steps:

1. Search for the winning neuron, i*(t), associated with the input vector x(t):
i^*(t) = \arg\min_{\forall i} \| \mathbf{x}(t) - \mathbf{w}_i(t) \|    (1)
2. Update the weight vector, w_{i*}(t), of the winning neuron:

\Delta \mathbf{w}_{i^*}(t) = \alpha(t)\,[\mathbf{x}(t) - \mathbf{w}_{i^*}(t)]    (2)
where 0 < α(t) < 1 is the learning rate, which should decrease with time for convergence purposes. In this paper, we adopt α(t) = α0(1 − t/T), where α0 is the initial value of α(t) and T is the maximum number of training iterations. A limitation of the WTA is its high sensitivity to weight initialization, a problem that leads to the occurrence of dead units, i.e., neurons that are never selected as winners. To avoid this, simple modifications
to the original WTA algorithm have been proposed to give all neurons a chance to become a winner at some stage of training. The first algorithm of interest, called Frequency-Sensitive Competitive Learning (FSCL) [1], modifies Equation (1) by introducing a weighting factor that strongly penalizes those neurons that have been selected too often:

i^*(t) = \arg\min_{\forall i} \left\{ f_i(t) \cdot \| \mathbf{x}(t) - \mathbf{w}_i(t) \| \right\}    (3)

where f_i(t) = [c_i/t]^z, with c_i the number of times neuron i has been selected as winner up to iteration t, and z ≥ 1 a constant exponent. The weight adjustment follows Equation (2). The second algorithm is the well-known Self-Organizing Map (SOM) [6], which modifies Equation (2) so that the weights of the neurons in the neighborhood of the winning neuron are also adjusted:

\Delta \mathbf{w}_i(t) = \alpha(t)\, h(i^*, i; t)\, [\mathbf{x}(t) - \mathbf{w}_i(t)]    (4)
where h(i*, i; t) is a Gaussian weighting function defined by:

h(i^*, i; t) = \exp\!\left( - \frac{\| \mathbf{r}_i(t) - \mathbf{r}_{i^*}(t) \|^2}{\sigma^2(t)} \right)    (5)

in which σ(t) defines the width of the neighborhood, while r_i(t) and r_{i*}(t) are, respectively, the positions of neurons i and i* in the SOM grid. The variable σ(t) should also decrease with time as σ(t) = σ0(1 − t/T), where σ0 is its initial value.
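For concreteness, the sketch below implements the three update rules in NumPy; the array shapes, the one-dimensional neuron grid used for the SOM neighborhood and all function and variable names are our own assumptions, not part of the original formulation.

```python
import numpy as np

def train_competitive(X, n_neurons, algorithm="som", alpha0=0.9, sigma0=25.0, z=1.0, seed=0):
    """Single-pass WTA / FSCL / SOM training over the input vectors in X (shape (T, d))."""
    rng = np.random.default_rng(seed)
    T, d = X.shape
    W = rng.uniform(X.min(), X.max(), size=(n_neurons, d))  # random weight initialization
    wins = np.zeros(n_neurons)                              # victory counters c_i (FSCL)
    grid = np.arange(n_neurons, dtype=float)                # neuron positions r_i (SOM)

    for t in range(T):
        x = X[t]
        alpha = alpha0 * (1.0 - t / T)                      # alpha(t) = alpha0 (1 - t/T)
        dists = np.linalg.norm(x - W, axis=1)               # ||x(t) - w_i(t)||, Eq. (1)

        if algorithm == "fscl":
            f = (wins / (t + 1.0)) ** z                     # f_i(t) = [c_i / t]^z, Eq. (3)
            winner = int(np.argmin(f * dists))
        else:
            winner = int(np.argmin(dists))
        wins[winner] += 1

        if algorithm == "som":
            sigma = max(sigma0 * (1.0 - t / T), 1e-3)       # sigma(t) = sigma0 (1 - t/T)
            h = np.exp(-(grid - grid[winner]) ** 2 / sigma ** 2)  # Gaussian neighborhood, Eq. (5)
            W += alpha * h[:, None] * (x - W)               # SOM update, Eq. (4)
        else:                                               # WTA and FSCL update the winner only
            W[winner] += alpha * (x - W[winner])            # Eq. (2)
    return W
```

In this sketch a single pass over X defines T; the paper instead fixes T (10⁴ iterations in Section 4), which can be emulated by cycling through the training vectors.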
3 Building Local Models via Competitive Neural Networks

Currently, the construction of local models for time series prediction via competitive ANNs is based directly on the training data set [7], [8], [12]. In this formulation, the network is used only to separate the input vectors per neuron. It is assumed that the input vector at instant t is given by:

\mathbf{x}(t) = [\, y(t+1) \mid y(t) \;\cdots\; y(t - n_y + 1) \,]^T    (6)
             = [\, x_1(t) \;\; x_2(t) \;\cdots\; x_{n_y+1}(t) \,]^T    (7)
where n_y > 1 is the length of the window used to build the input vectors from consecutive samples of the time series. For a series with N samples, it is possible to obtain N − n_y input vectors. After training is completed, the same set of input vectors used for training is presented once again in order to separate them per neuron. No weight adjustment is performed at this stage. We denote by x^i(t) an input vector x(t) for which neuron i is the winner. Then, with each neuron i we associate a linear AR model whose parameters are computed using only the corresponding vectors x^i(t). An autoregressive (AR) linear model of order n_y is represented by:

\hat{y}(t+1) = a_0 + \sum_{j=1}^{n_y} a_j\, y(t - j + 1)    (8)
where a_j are the coefficients of the model. The prediction errors (or residuals), e(t) = y(t) − \hat{y}(t), are used to evaluate the accuracy of the model by means of the Normalized Root Mean Square Error (NRMSE):

\mathrm{NRMSE} = \sqrt{ \frac{\sum_{k=1}^{N} e^2(k)}{N \cdot \sigma_y^2} } = \sqrt{ \frac{\hat{\sigma}_e^2}{\sigma_y^2} }    (9)

where σ_y² is the variance of the original series, \hat{σ}_e² is the variance of the residuals and N is the length of the sequence of residuals. To calculate the coefficients a_j we use the well-known Least-Squares (LS) method [4], [5], [10]. Thus, the coefficients of the AR model associated with neuron i are computed as follows:

\mathbf{a}^i = \left( \mathbf{R}_i^T \mathbf{R}_i \right)^{-1} \mathbf{R}_i^T \mathbf{p}_i    (10)

where the prediction vector p_i and the regression matrix R_i are built from the vectors {x^i(t_1), x^i(t_2), …, x^i(t_{N_i})} for which neuron i was the winner at instants {t_1, t_2, …, t_{N_i}}, respectively. By means of Equations (7) and (8), we have:

\mathbf{p}_i = [\, x_1^i(t_1) \;\; x_1^i(t_2) \;\cdots\; x_1^i(t_{N_i}) \,]^T    (11)

\mathbf{R}_i = \begin{bmatrix} 1 & x_2^i(t_1) & \cdots & x_{n_y+1}^i(t_1) \\ 1 & x_2^i(t_2) & \cdots & x_{n_y+1}^i(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_2^i(t_{N_i}) & \cdots & x_{n_y+1}^i(t_{N_i}) \end{bmatrix}    (12)

Once the coefficients a^i are computed by Equation (10), we use Equation (8) to predict new values for the time series.
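To make the conventional procedure concrete, the sketch below builds the windowed vectors of Equations (6)–(7), assigns them to their winning neurons and fits one AR model per neuron by least squares (Equations (10)–(12)). The helper names are ours, `train_competitive` refers to the sketch in Section 2, and `np.linalg.lstsq` is used in place of the explicit normal-equation inverse.

```python
import numpy as np

def build_vectors(y, n_y):
    """Sliding window of Eqs. (6)-(7): x(t) = [y(t+1) | y(t) ... y(t - n_y + 1)]^T."""
    return np.array([y[t - n_y + 1: t + 2][::-1] for t in range(n_y - 1, len(y) - 1)])

def fit_local_ar_per_neuron(X, W):
    """Conventional approach: one AR model per neuron, estimated by LS (Eq. (10))."""
    winners = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    models = {}
    for i in np.unique(winners):
        Xi = X[winners == i]                                   # vectors x^i(t) won by neuron i
        p_i = Xi[:, 0]                                         # prediction vector, Eq. (11)
        R_i = np.column_stack([np.ones(len(Xi)), Xi[:, 1:]])   # regression matrix, Eq. (12)
        a_i, *_ = np.linalg.lstsq(R_i, p_i, rcond=None)        # a^i = (R_i^T R_i)^-1 R_i^T p_i
        models[i] = a_i
    return models

def nrmse(residuals, y):
    """Normalized root mean square error, Eq. (9)."""
    return np.sqrt(np.sum(residuals ** 2) / (len(residuals) * np.var(y)))
```

For example, `models = fit_local_ar_per_neuron(X, train_competitive(X, 50))` yields one coefficient vector per active neuron; at prediction time, the model of the winning neuron for the current window is applied via Equation (8).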
In this paper, we propose the use of the weight vectors of a competitive ANN to build the local predictors, since these weights constitute the prototypes of the training data vectors. For this purpose, at each time t, we need to find the K neurons {i_1^*(t), i_2^*(t), …, i_K^*(t)} whose weight vectors are closest to the input vector:

i_1^*(t) = \arg\min_{\forall i} \| \mathbf{x}(t) - \mathbf{w}_i(t) \|    (13)

i_2^*(t) = \arg\min_{\forall i \neq i_1^*} \| \mathbf{x}(t) - \mathbf{w}_i(t) \|    (14)

\vdots

i_K^*(t) = \arg\min_{\forall i \neq \{i_1^*, \ldots, i_{K-1}^*\}} \| \mathbf{x}(t) - \mathbf{w}_i(t) \|    (15)

where i_1^*(t) is the first winning neuron, i_2^*(t) is the second winning neuron, and so on, up to the K-th winning neuron, i_K^*(t). Then, we apply the LS method to the weight vectors of these K neurons to compute the coefficients of Equation (8). Thus, the prediction vector p and the regression matrix R are now given by:
\mathbf{p} = [\, w_{i_1^*,1}(t) \;\; w_{i_2^*,1}(t) \;\cdots\; w_{i_K^*,1}(t) \,]^T    (16)

\mathbf{R} = \begin{bmatrix} 1 & w_{i_1^*,2}(t) & w_{i_1^*,3}(t) & \cdots & w_{i_1^*,n_y+1}(t) \\ 1 & w_{i_2^*,2}(t) & w_{i_2^*,3}(t) & \cdots & w_{i_2^*,n_y+1}(t) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & w_{i_K^*,2}(t) & w_{i_K^*,3}(t) & \cdots & w_{i_K^*,n_y+1}(t) \end{bmatrix}    (17)

where w_{i,j} corresponds to the j-th element of the weight vector of neuron i, with i ∈ {i_1^*(t), i_2^*(t), …, i_K^*(t)} and j = 1, …, n_y + 1. The index i is dropped from p and R because we no longer refer to a local model associated with a single neuron i, but rather to a local model associated with the K winning neurons for each vector x(t). The main advantages of the proposed method are listed below (a sketch of the resulting predictor follows this list):

• Lower computational cost: The conventional method trains the ANN, then separates the data per neuron and finally calculates the coefficients of the various local AR models. In the proposed method, a single local model is built for each input vector using the weight vectors of its K winning neurons; a separate data-separation stage is no longer needed.

• Greater numerical stability: When only K of the N available neurons are used (K << N), the matrix inversion required by Equation (10) is performed on a regression matrix of lower dimension than in the traditional method.

• Greater robustness: In competitive ANNs, the weight vector of neuron i converges to the centroid of the cluster of input vectors for which this neuron was the winner [4], [6], [10]. It is well known that such averaging alleviates the effects of noise, so local models built from the weight vectors tend to be more robust to distortions in the input vectors.

• Greater generality: Previous approaches to building local models rely on properties inherent to the SOM algorithm [7], [12]. The methodology proposed in this paper can be used with any competitive ANN, not only the SOM.
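As an illustration of Equations (13)–(17), a minimal one-step predictor built from the K winning weight vectors could look as follows. This is our own sketch, not the authors' code; `W` is a weight matrix trained as in Section 2, and, as a simplifying assumption, the winner search compares only the known past-sample part of each prototype with the current window.

```python
import numpy as np

def predict_one_step(x_past, W, K):
    """Predict y(t+1) from x_past = [y(t), y(t-1), ..., y(t - n_y + 1)].

    W has shape (n_neurons, n_y + 1): prototypes of the full vectors of Eq. (6),
    whose first component plays the role of y(t+1).
    """
    # Eqs. (13)-(15): the K prototypes closest to the current window
    order = np.argsort(np.linalg.norm(x_past - W[:, 1:], axis=1))[:K]
    Wk = W[order]
    p = Wk[:, 0]                                     # prediction vector, Eq. (16)
    R = np.column_stack([np.ones(K), Wk[:, 1:]])     # regression matrix, Eq. (17)
    a, *_ = np.linalg.lstsq(R, p, rcond=None)        # LS solution, Eq. (10)
    return a[0] + a[1:] @ x_past                     # local AR prediction, Eq. (8)
```

Each call solves a small K × (n_y + 1) least-squares problem, in line with the numerical-stability argument above.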
3.1 Nonstationary Time Series

The time series used in the simulations is shown in Fig. 1. This series was generated from the Mackey-Glass differential equation with time delay t_d [9]:

\frac{dy(t)}{dt} = \gamma_{t_d} = -0.1\, y(t) + \frac{0.2\, y(t - t_d)}{1 + [y(t - t_d)]^{10}}    (18)
where y(t) is the value of the time series at time t. Equation (18) models the dynamics of the production of white blood cells in patients with leukemia. A nonstationary time series is obtained by composing three stationary modes of operation of Equation (18), termed A, B and C, corresponding to the delays t_d = 17, 23 and 30, respectively. After iterating for 400 instants of time in mode A, the dynamics of the series is switched to a mixture of modes A, B and C, given by:
Fig. 1. The simulated nonstationary time series used in this paper.

\frac{dy(t)}{dt} = a\, \gamma_{17} + b\, \gamma_{23} + c\, \gamma_{30}    (19)
where a = 0.6, b = 0.3 and c = 0.1, and γ_{t_d} denotes the right-hand side of Equation (18) with delay t_d. The system runs in this combined mode for the following 400 instants of time, after which the dynamics is switched to mode B (t = 801, …, 1200). The system then changes to a new mixture of modes, with a = 0.2, b = 0.3 and c = 0.5, until it reaches t = 1600. Finally, from t = 1601, …, 2000, the system runs in mode C. The first 1500 samples of the generated series are used for training the networks and the remaining samples are used for testing. To discretize Equation (19) we used the simple Euler method,

\frac{dy(t)}{dt} \approx \frac{y(t) - y(t-1)}{\Delta}    (20)

where ∆ is the discretization step; in this work ∆ = 1 was adopted.
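For reference, a sketch of how such a series can be generated with the Euler step of Equation (20); the initial condition, the constant pre-history for t < t_d and the function names are our own assumptions.

```python
import numpy as np

def gamma(y, t, td):
    """Right-hand side of Eq. (18) for delay td, given the history y[0..t]."""
    y_del = y[t - td] if t >= td else y[0]          # assumed constant pre-history
    return -0.1 * y[t] + 0.2 * y_del / (1.0 + y_del ** 10)

def mackey_glass_nonstationary(n=2000, y0=1.2, delta=1.0):
    """Switched/mixed Mackey-Glass series of Section 3.1, integrated by Euler (Eq. (20))."""
    # (a, b, c) weights of the modes td = 17, 23, 30 over successive 400-sample segments
    segments = [(1.0, 0.0, 0.0),   # t = 1..400    : mode A (td = 17)
                (0.6, 0.3, 0.1),   # t = 401..800  : mixture of Eq. (19)
                (0.0, 1.0, 0.0),   # t = 801..1200 : mode B (td = 23)
                (0.2, 0.3, 0.5),   # t = 1201..1600: second mixture
                (0.0, 0.0, 1.0)]   # t = 1601..2000: mode C (td = 30)
    y = np.empty(n + 1)
    y[0] = y0                                       # assumed initial condition
    for t in range(n):
        a, b, c = segments[min(t // 400, len(segments) - 1)]
        dydt = a * gamma(y, t, 17) + b * gamma(y, t, 23) + c * gamma(y, t, 30)
        y[t + 1] = y[t] + delta * dydt              # Euler step, Eq. (20)
    return y[1:]

series = mackey_glass_nonstationary()
y_train, y_test = series[:1500], series[1500:]      # split used in the simulations
```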
4 Simulations

The simulations aim to evaluate the performance of the proposed local prediction method, comparing the results obtained with the three competitive ANNs presented in Section 2 against the conventional approach of [7] and [12]. For this purpose, we adopted the one-step-ahead prediction task, in which we estimate only the next value y(t+1) of the time series. In this kind of prediction, the input of the network consists of actual values of the series, not estimated ones; in other words, the predictions are not fed back to the input of the network. For all simulations, we used 50 neurons, α0 = 0.9, σ0 = 25 and T = 10⁴. The first set of simulations assesses the quality of learning of the WTA, FSCL and SOM algorithms in terms of the number of dead units they generate. This issue is very important, since the proposed method depends on the distribution of the weight vectors in the data space: the better the clustering of the input vectors, the better will be
the predictive capacity of the local prediction model. This quality can be roughly evaluated by counting the number of times each neuron is selected as the winning neuron during the training of a given ANN. An approximately uniform distribution of the number of victories among the neurons is indicative of good learning. As expected, the worst performance was presented by the WTA (Fig. 2a), while the FSCL had the best performance (Fig. 2b). For this simulation, we set n_y = 5.
Fig. 2. Distribution of victories per neuron for (a) the WTA and (b) the FSCL.
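A victory histogram of this kind can be computed directly from a set of training vectors X and a weight matrix W (an illustrative snippet of ours, counting victories over the trained map rather than during training):

```python
import numpy as np

def victory_counts(X, W):
    """Number of victories per neuron; zero entries indicate dead units."""
    winners = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    return np.bincount(winners, minlength=len(W))
```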
The second test assesses the "speed" of knowledge acquisition by the networks. This issue can be evaluated by measuring the influence of the length of the training set on the prediction error. Each ANN is initially trained using only the first 100 samples of the series; this length is then increased in steps of 100 samples until reaching the maximum of 1500 samples. The time series used for testing remains the same in all cases. For this simulation, we set K = 20 and n_y = 5. For each length of the training set, each ANN is trained 3 times and the average value of the prediction error on the test set is calculated. The results are shown in Fig. 3a, in which we again verify that the WTA performed worse than the FSCL and SOM algorithms. The third test evaluates the influence of the memory parameter (n_y) on the final prediction error, for a fixed value of K = 20. Fig. 3b shows a typical result for the proposed method using the SOM. In this figure, the error decays until reaching its minimum at n_y = 5 and starts to grow for n_y ≥ 6. The same behavior of the error is observed for the other two ANNs, varying only the point where the error reaches its minimum: WTA (n_y = 11) and FSCL (n_y = 7). The fourth test evaluates the influence of the number of winning neurons (K) on the final value of the prediction error during the testing stage. We fixed n_y = 5 for all neural networks. Fig. 4a shows a typical result for the FSCL algorithm. In this figure, the error presents an initial irregular pattern, stabilizing around K = 20. Beyond this value, it is not worth increasing K because no substantial reduction of the error is observed. The same general behavior is observed for the WTA and the SOM, varying only the point where the error stabilizes: WTA (K = 25) and SOM (K = 17). The results in Figs. 3 and 4 strongly suggest that good training implies lower
values for the parameters n_y and K. Thus, among the three competitive ANNs with which we tested the proposed local method, the SOM and FSCL algorithms performed much better than the WTA algorithm.
Fig. 3. Influence of (a) the size of the training set and (b) the order n_y on the prediction error.
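The sweeps behind Figs. 3b and 4a can be scripted with the helpers sketched earlier (an illustrative driver of ours; `train_competitive`, `build_vectors`, `predict_one_step` and `nrmse` are the functions assumed in the previous sections):

```python
import numpy as np

def sweep_memory_order(y_train, y_test, n_y_values, K=20, n_neurons=50):
    """Test-set NRMSE as a function of the memory parameter n_y (cf. Fig. 3b)."""
    errors = {}
    for n_y in n_y_values:
        W = train_competitive(build_vectors(y_train, n_y), n_neurons, algorithm="som")
        preds, targets = [], []
        for t in range(n_y - 1, len(y_test) - 1):
            x_past = y_test[t - n_y + 1: t + 1][::-1]    # [y(t), ..., y(t - n_y + 1)]
            preds.append(predict_one_step(x_past, W, K))
            targets.append(y_test[t + 1])
        residuals = np.array(targets) - np.array(preds)
        errors[n_y] = nrmse(residuals, np.array(targets))
    return errors
```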
A typical sequence of predicted values is shown in Fig. 4b, in which the solid line represents the actual time series, while the small open circles represent the estimated values. For this simulation, we set K = 17 and n_y = 5. As can be seen, the obtained estimates are very close to the actual values.
Fig. 4. (a) Influence of the number of winning neurons K on the prediction error for the FSCL. (b) A time series predicted by the SOM algorithm using the proposed approach.
Numerical values of the prediction error comparing the method proposed in this paper with the one in [12] are shown in Table 1. In this table, SOM-K denotes the proposed local method based on the K winning weight vectors of the SOM, while SOM-D
refers to the local model based on the original data vectors. Ten runs of training and testing were carried out, for which the minimum, maximum and mean values, as well as the standard deviation of the prediction error, were computed. The results show that the SOM-K approach performed better than the SOM-D.

Table 1. NRMSE values for the proposed approach (WTA, FSCL and SOM-K) and the conventional method (SOM-D) described in [12].

Neural Network   Minimum   Maximum   Mean     Standard deviation
WTA              0.1453    0.4358    0.2174   0.0872
FSCL             0.0091    0.0177    0.0114   0.0032
SOM-K            0.0055    0.0058    0.0057   0.0002
SOM-D            0.0076    0.0132    0.0093   0.0027
5 Conclusion

In this paper, a new methodology for applying competitive neural networks to nonstationary time series prediction tasks was proposed. The traditional approach uses competitive ANNs only to separate the input data vectors per neuron of the network; local autoregressive models are then built using only the subset of data vectors associated with a given neuron. The method proposed in this paper instead uses the weight vectors of the K winning neurons found for the current input vector. Since only a few neurons are used to build the predictor for each input vector, this approach develops local representations of a nonstationary time series suitable for prediction tasks. The advantages of the proposed technique are its greater generality, lower computational cost, greater robustness to noise and greater numerical stability. Three competitive algorithms (WTA, FSCL and SOM) were tested and their performances compared with the conventional approach, confirming the efficacy of the proposed method. In future work, the authors aim to compare the performance of the methods presented in this paper with supervised neural networks, such as the MLP and the RBF, in time series prediction tasks.

Acknowledgments. The authors thank the financial support of CNPq (DCR grant 305275/02-0) and Instituto Atlântico Research and Development Center in Telecom & IT.
References

1. Ahalt, S., Krishnamurthy, A., Chen, P. and Melton, D.: Competitive learning algorithms for vector quantization, Neural Networks 3 (1990) 277–290
2. Box, G. and Jenkins, G.: Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco (1970)
3. Barreto, G. A. and Andrade, M. G.: Robust Bayesian approach for AR(p) models applied to streamflow forecasting, Journal of Applied Statistical Science 13 (2004)
4. Haykin, S.: Neural Networks: A Comprehensive Foundation, Macmillan/IEEE Press (1994)
5. Haykin, S. and Principe, J.: Making sense of a complex world: Using neural networks to dynamically model chaotic events such as sea clutter, IEEE Signal Processing Magazine 15 (1998) 66–81
6. Kohonen, T.: Self-Organizing Maps, 2nd extended edn, Springer-Verlag, Berlin, Heidelberg (1997)
7. Koskela, T., Varsta, M., Heikkonen, J. and Kaski, S.: Time series prediction using recurrent SOM with local linear models, International Journal of Knowledge-Based Intelligent Engineering Systems 2 (1998) 60–68
8. Lehtokangas, M., Saarinen, J., Kaski, K. and Huuhtanen, P.: A network of autoregressive processing units for time series modeling, Applied Mathematics and Computation 75 (1996) 151–165
9. Mackey, M. C. and Glass, L.: Oscillations and chaos in physiological control systems, Science 197 (1977) 287–289
10. Principe, J. C., Euliano, N. R. and Lefebvre, W. C.: Neural and Adaptive Systems: Fundamentals Through Simulations, John Wiley & Sons (2000)
11. Principe, J. C., Wang, L. and Motter, M. A.: Local dynamic modeling with self-organizing maps and applications to nonlinear system identification and control, Proceedings of the IEEE 86 (1998) 2240–2258
12. Vesanto, J.: Using the SOM and local models in time series prediction, Proceedings of the Workshop on Self-Organizing Maps (WSOM'97), Espoo, Finland (1997) 209–214
13. Weigend, A. and Gershenfeld, N.: Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading (1993)