2004 IEEE Workshop on Machine Learning for Signal Processing
ON THE CLASSIFICATION OF MENTAL TASKS: A PERFORMANCE COMPARISON OF NEURAL AND STATISTICAL APPROACHES
Guilherme A. Barreto, Rewbenio A. Frota and Fátima N. S. de Medeiros
Department of Teleinformatics Engineering, Federal University of Ceará
Campus do Pici, 60455-760, Fortaleza, Ceará, Brazil
Phone: +55 85 288 9467. Fax: +55 85 288 9468
E-mails: rewbenio, fsombra, guilherme@deti.ufc.br
Abstract. Electroencephalogram (EEG) signals represent an important class of biological signals whose behavior can be used to diagnose anomalies in brain activity. The goal of this paper is to find a concise representation of EEG data, corresponding to 5 mental tasks performed by different individuals, for classification purposes. For that, we propose the use of Welch's periodogram as a powerful feature extractor and compare the performance of SOM- and MLP-based neural classifiers with that of the standard Bayes optimal classifier. The results show that Welch's periodogram allows all classifiers to achieve higher classification rates (73%-100%) than those presented so far in the literature (≤71%).
1. INTRODUCTION

The EEG signal is a useful tool in clinical medicine and research. For instance, it can be used to determine the global activity of the cerebral cortex and, to some extent, to locate abnormal activity in relatively small cortical areas. It also serves as an important auxiliary source of information for the diagnosis of sleep disturbances and epilepsy, and to differentiate between coma and brain death [9]. In engineering-oriented scenarios, EEG signals are used for the classification of mental tasks performed by subjects [3, 2, 12] and the design of man-machine interfaces [11, 12]. For a suitable utilization in the aforementioned applications, it is worth having a good representation of EEG data, which has been obtained, for example, by principal component analysis [15], autoregressive (AR) models [2], the wavelet transform [4] and power spectral density (PSD) analysis [12, 7]. All of them have provided acceptable results in extracting and classifying different patterns from EEG signals. However, especially for the discrimination of several mental tasks, the classification rates are not satisfactory. This is mainly due to the noisy and
nonstationary nature of EEG signals, which are very often disturbed by electric interference caused by the power line, by movements of the eyes and of the electrodes on the scalp of the subject, as well as by vocalization of thoughts and loss of concentration during the recording of brain activity [4]. In this paper we argue that a simple but powerful preprocessing method, capable of handling both the noisy and nonstationary natures of EEG signals while maintaining the "useful" information, can alleviate the burden placed on the classifier design. For this, we use Welch's periodogram [18], a classical PSD estimation method, and analyze the benefits obtained by comparing the performance of SOM- and MLP-based neural classifiers with that of the standard Bayes optimal classifier. The results show that Welch's periodogram allows all classifiers to achieve higher classification rates than those presented so far in the literature. The remainder of the paper is organized as follows. In Section 2, the EEG data acquisition process is described. In Section 3, we briefly present Welch's averaged modified periodogram method. The classifiers whose performances are analyzed in this paper are described in Section 4. In Section 5 we present the simulation results and discuss them with respect to those reported in the literature, focusing on the pros and cons of the proposed approach. We conclude the paper in Section 6.
2. DATA ACQUISITION
The data set used in this study comprises EEG signals from five subjects performing five different mental tasks. The subjects are seated in an Industrial Acoustics Company sound-controlled booth with dim lighting and noiseless fans (for ventilation). An Electro-Cap elastic electrode cap is used to record EEG signals from positions C3, C4, P3, P4, O1 and O2, defined by the 10-20 system of electrode placement [5]. The impedance of all electrodes is kept below 5 kΩ. These recordings have been used before by [2, 8, 12] and are available online at http://www.cs.colostate.edu/~anderson. Fig. 1 shows the electrode placement and the measurement procedure, which is made with reference to electrically linked mastoids, A1 and A2. The electrodes are connected through a bank of amplifiers (Grass 7P511), whose band-pass analog filters are set at 0.1 to 100 Hz. The data are sampled at 250 Hz with a Lab Master 12-bit A/D converter mounted on a computer. Before each recording session, the system is calibrated with a known voltage. Signals are recorded for 10 seconds during each task, and each task is repeated for a varying number of sessions, held on different weeks. Since the sampling rate is 250 Hz, each EEG signal provides 2500 samples per channel. Thus, each mental task is described by the signals obtained from each one of the electrodes (also called channels). In a session, each subject is requested to perform the following five mental tasks:
Baseline Task: the subject is asked to relax as much as possible, make as few movements as possible and think of nothing in particular.
Figure 1: Signal acquisition.

Letter Task: the subject is asked to mentally compose a letter to a known person (e.g. father, mother or friend) without vocalizing or making any movements.

Multiplication Task: the subject is asked to solve nontrivial multiplication problems, such as 87 times 69, without vocalizing or making any movements.

Visual Counting Task: the subject is asked to visualize black Arabic numerals on a white background, sequentially in ascending order, the previous numeral being erased before the next is written.
Geometric Figure Rotation Task: the subject is asked to visualize three-dimensional figures rotating about an axis.

Due to its noisy nature, it is difficult to classify brain activity just by visual inspection. Furthermore, it is well known that EEG signals are highly nonstationary [14], ruling out the use of most classical frequency-domain techniques. Most time-domain techniques, like AR models, are also prone to the same criticism, since they strongly rely on the hypothesis of stationarity of the signal [4]. This difficulty can be alleviated if we assume piecewise stationarity of the EEG signals [2]. In this paper we also make use of the assumption of piecewise stationarity to justify the successful application of Welch's periodogram method [18] as a preprocessing method for EEG data classification. This is possible because, to apply Welch's procedure, the original nonstationary EEG signal is segmented into shorter (quasi-)stationary sequences. The periodogram of each sequence is computed and then averaged for the final result. A by-product of this procedure is a reduction in the length of the sequences to be processed by the classifiers, as we briefly describe next.
3. FEATURE EXTRACTION VIA WELCH'S PERIODOGRAM

Welch's procedure [18] to estimate the PSD of a stochastic signal combines windowing and averaging in order to obtain a smooth spectrum estimate without the random fluctuations resulting from the estimation process itself [16].
The original data sequence of each channel is divided into a number K of possibly overlapping segments. A window u[n] is defined over each of these segments and the corresponding periodograms are computed and then averaged. If x^(k)[n] represents the sample x[n] of the k-th data segment (of length N), then the modified periodogram for that segment is defined as

$$ \hat{P}^{(k)}(\omega) = \frac{1}{N}\left|\sum_{n=0}^{N-1} u[n]\, x^{(k)}[n]\, e^{-j\omega n}\right|^{2}, \qquad (1) $$

where ω = 2πf (in rad/s) is the angular frequency and the window u should obey the following (normalization) property: $\frac{1}{N}\sum_{n=0}^{N-1} u^{2}[n] = 1$. Then the estimate of the PSD of the signal, for each frequency ω, is taken as the average of the K modified periodograms:

$$ \hat{S}_x(\omega) = \frac{1}{K}\sum_{k=1}^{K} \hat{P}^{(k)}(\omega). \qquad (2) $$
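To make Eqs. (1)-(2) concrete, the following minimal NumPy sketch segments a signal, applies a window normalized as above, and averages the modified periodograms. The function name and the default parameter values (segment length of 250 samples and a 256-point FFT, matching the choices reported later in the paper) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def welch_psd(x, seg_len=250, overlap=0, nfft=256, window=None):
    """Sketch of Welch's averaged modified periodogram, Eqs. (1)-(2)."""
    if window is None:
        window = np.hanning(seg_len)           # any window; normalized below
    # Enforce the normalization (1/N) * sum(u[n]^2) = 1 assumed in Eq. (1)
    window = window / np.sqrt(np.mean(window ** 2))
    step = seg_len - overlap
    n_segs = (len(x) - overlap) // step
    periodograms = []
    for k in range(n_segs):
        seg = x[k * step: k * step + seg_len]
        X = np.fft.rfft(window * seg, n=nfft)  # zero-padded to nfft points
        periodograms.append(np.abs(X) ** 2 / seg_len)   # Eq. (1)
    return np.mean(periodograms, axis=0)       # Eq. (2): average over K segments

# Example: a 10 s EEG-like signal at 250 Hz yields nfft/2 + 1 = 129 frequency bins
x = np.random.randn(2500)
print(welch_psd(x).shape)   # (129,)
```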
After preprocessing a total of six EEG signals, corresponding to channels C3, C4, P3, P4, O1 and O2, each containing 2500 samples, we obtain six periodograms of length Nf/2 + 1 = 129, where Nf = 256 is the number of points of the FFT (Fast Fourier Transform) used to calculate the PSD estimates; since N < Nf, zero-padding is used in the FFT computations. The feature vectors for training and testing the classifiers are then composed of the periodogram samples collected at a given frequency ω:
$$ \mathbf{x}(\omega) = [x_1(\omega)\;\; x_2(\omega)\;\; x_3(\omega)\;\; x_4(\omega)\;\; x_5(\omega)\;\; x_6(\omega)]^{T} = [\hat{S}_x^{C3}(\omega)\;\; \hat{S}_x^{C4}(\omega)\;\; \hat{S}_x^{P3}(\omega)\;\; \hat{S}_x^{P4}(\omega)\;\; \hat{S}_x^{O1}(\omega)\;\; \hat{S}_x^{O2}(\omega)]^{T} \qquad (3) $$
Since we are interested in the classification of 5 mental tasks and 129 samples per periodogram are available, we have 5 x 129 = 645 feature vectors per subject. Thus, since we collected data from 5 subjects, a set of 645 x 5 = 3225 feature vectors is made available for the design of the classifiers.
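A minimal sketch of the feature-vector assembly of Eq. (3), using SciPy's welch routine in place of the hand-rolled function above; the function name, the array layout and the window choice are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy.signal import welch

def build_feature_vectors(eeg, fs=250, nperseg=250, nfft=256):
    """eeg is a (6, 2500) array holding one task recording (channels C3, C4,
    P3, P4, O1, O2). Returns a (129, 6) array whose rows are the feature
    vectors x(w) of Eq. (3), one per frequency bin."""
    psd = []
    for channel in eeg:                           # one periodogram per channel
        _, S = welch(channel, fs=fs, nperseg=nperseg,
                     noverlap=0, nfft=nfft)       # Welch PSD, 129 bins
        psd.append(S)
    return np.column_stack(psd)                   # row w -> 6-dimensional x(w)

# One task of one subject gives 129 feature vectors; 5 tasks -> 645; 5 subjects -> 3225
eeg = np.random.randn(6, 2500)
print(build_feature_vectors(eeg).shape)   # (129, 6)
```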
4. NEURAL AND STATISTICAL CLASSIFIERS

For all the classifiers to be described next, the goal is to classify an incoming feature vector into 1 out of 5 classes, corresponding to the 5 mental tasks of interest. Then, a given feature vector x ∈ R^q is said to belong to class C_k if the following general condition is observed:

$$ g_k(\mathbf{x}) > g_i(\mathbf{x}), \qquad \forall i \neq k, \qquad (4) $$

where g_i(·) is the discriminant function associated to class C_i [17]. In order to help the evaluation of the classifiers, we develop an analysis based on the type of discriminant function each one implements.
Thus, in statistical pattern recognition, the condition in (4) is generally implemented through the criterion for optimal classification:

$$ P(C_k|\mathbf{x}) > P(C_i|\mathbf{x}), \qquad \forall i \neq k, \qquad (5) $$

where P(C_i|x) is the posterior probability that, given the feature vector x, it belongs to class C_i. By means of Bayes' rule, condition (5) can be written as:

$$ p(\mathbf{x}|C_k)P(C_k) > p(\mathbf{x}|C_i)P(C_i), \qquad \forall i \neq k, \qquad (6) $$

where p(x|C_i) is the likelihood function of class C_i, which gives the probability that it is this class that best "explains" the vector x, and P(C_i) gives the prior probability of class C_i.
4.1 The Quadratic Gaussian Classifier (QGC). A classifier designed according to (6) is called a Bayes optimal classifier [13]. Assuming equal prior probabilities and Gaussian likelihood functions for all classes, we get:

$$ p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{q/2}|\mathbf{C}_i|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\mathbf{C}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right\}, \qquad (7) $$

where μ_i = E[x|C_i] is the mean vector, C_i = E[(x − μ_i)(x − μ_i)^T |C_i] is the covariance matrix of a given class C_i, and |C_i| is the determinant of C_i. Taking the natural logarithm of both sides of (7) and eliminating terms that are independent of the index i, since they have no influence on the decision rule in (4), we can write the discriminant function of class C_i as:

$$ g_i(\mathbf{x}) = \ln p(\mathbf{x}|C_i) = -\frac{1}{2}\ln(|\mathbf{C}_i|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\mathbf{C}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i). \qquad (8) $$

If we further assume a diagonal form for C_i and a common variance σ² for all components of x, i.e. C_i = σ²I, the discriminant function reduces to:

$$ g_i(\mathbf{x}) = -\frac{1}{2\sigma^{2}}\|\mathbf{x}-\boldsymbol{\mu}_i\|^{2}, \qquad (9) $$

where ‖·‖ is the Euclidean vector norm. It is well known that the discriminant function in (8) generates quadratic decision surfaces between classes, while the one in (9) generates linear decision surfaces [13, 17]. We refer to the latter as the Linear Gaussian Classifier (LGC).
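For concreteness, a minimal NumPy sketch of the discriminant in Eq. (8) together with the decision rule (4) follows; the class interface, the small regularization term added to each covariance matrix, and the assumption of equal priors are our own illustrative choices, not part of the paper.

```python
import numpy as np

class QuadraticGaussianClassifier:
    """Sketch of the QGC of Section 4.1 (Eqs. 7-8); names are illustrative."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_, self.covs_ = {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.means_[c] = Xc.mean(axis=0)
            # Small ridge keeps the covariance invertible for short samples
            self.covs_[c] = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        return self

    def discriminant(self, x, c):
        d = x - self.means_[c]
        cov = self.covs_[c]
        # Eq. (8): g_i(x) = -0.5*ln|C_i| - 0.5*(x - mu_i)^T C_i^{-1} (x - mu_i)
        return -0.5 * np.linalg.slogdet(cov)[1] - 0.5 * d @ np.linalg.solve(cov, d)

    def predict(self, X):
        # Decision rule (4): pick the class with the largest discriminant
        return np.array([max(self.classes_,
                             key=lambda c: self.discriminant(x, c)) for x in X])
```

Replacing the per-class covariances by σ²I turns the same predict rule into the nearest-mean (linear Gaussian) classifier of Eq. (9).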
4.2 The Multilayer Perceptron (MLP). The MLP classifier, using the logistic function as the nonlinearity, implements very general nonlinear discriminant functions and computes the posterior probabilities in condition (5) directly [13], provided that the size of the training set is large enough and the learning process does not get stuck at a local minimum. For training the MLP, a 5-dimensional target vector represents the desired class C_k of a given feature vector: (class 1) → [1 0 0 0 0], (class 2) → [0 1 0 0 0], ..., (class 5) → [0 0 0 0 1]. However, to speed up convergence, we offset the lower and upper limits of the target vectors by some amount ε. Then we make the following replacements: 0 → ε and 1 → 1 − ε, where we adopt ε = 0.05. For testing purposes, assuming that an output neuron k is indeed approximating the a posteriori class probability density function p(C_k|x), we use condition (5) to decide the class assigned to the current input vector x.

4.3 The Self-Organizing Map (SOM). The Self-Organizing Map (SOM) [10] is an unsupervised neural algorithm widely used in clustering, vector quantization and pattern recognition tasks. Neurons in this network are placed in an output layer, A, geometrically arranged in arrays of one, two or three dimensions. In addition, each neuron i ∈ A has a vector of weights w_i with the same dimension as the input vector x. The SOM learning procedure can be summarized as follows:
1. Search for the winning neuron, i*(t):

$$ i^{*}(t) = \arg\min_{i \in A}\{g_i(\mathbf{x}(t))\}, \qquad \text{where } g_i(\mathbf{x}(t)) = \|\mathbf{x}(t) - \mathbf{w}_i(t)\|. \qquad (10) $$

2. Weight updating procedure:

$$ \Delta\mathbf{w}_i(t) = \alpha(t)\, h(i^{*}, i; t)\, [\mathbf{x}(t) - \mathbf{w}_i(t)], \qquad (11) $$
where α(t) is the learning rate and h(i*, i; t) = exp(−‖r_i(t) − r_{i*}(t)‖²/σ²(t)) is a Gaussian neighborhood function, in which r_i(t) and r_{i*}(t) are respectively the positions of neurons i and i* in the output array. The variables 0 < α(t), σ(t) < 1 should decay in time for convergence purposes. For classification purposes we have to assign class labels to the neurons of the SOM. This is done in a post-training stage, called the labelling phase [10], in which all the training vectors are presented once more to the SOM and the corresponding winners are found. By voting, the class label of neuron i is the class label of the training vectors for which it was the winner in most of the cases. No weight updating is performed at this stage. During testing, the class of an incoming feature vector will be the class of the corresponding winning neuron. It is worth noting that the class decision rule in (10) is computationally equivalent to that in (5) and (9), because the discriminant function depends only on the Euclidean distance. Thus, the SOM-based classifier (SBC) is essentially a linear Gaussian classifier. However, the LGC uses only one discriminant function per class, while the SBC generally uses more than one neuron per class (the exact number depends on the labelling phase). Hence, the SBC is better understood as a piecewise linear Gaussian classifier [17].
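A minimal sketch of the labelling phase and of the SBC decision rule described above, assuming a SOM that has already been trained and stored as a weight matrix W of shape (number of neurons, q), with integer class labels 0..4; the helper names and the handling of neurons that never win are illustrative assumptions.

```python
import numpy as np

def label_som_neurons(W, X_train, y_train, n_classes=5):
    """Labelling phase: each neuron gets the majority class label of the
    training vectors for which it was the winner (rule (10))."""
    votes = np.zeros((W.shape[0], n_classes), dtype=int)
    for x, y in zip(X_train, y_train):
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        votes[winner, y] += 1
    # Neurons that never win default to class 0 in this sketch
    return votes.argmax(axis=1)

def sbc_predict(W, neuron_labels, X_test):
    """SBC decision: the class of a test vector is that of its winning neuron."""
    winners = [np.argmin(np.linalg.norm(W - x, axis=1)) for x in X_test]
    return neuron_labels[np.array(winners)]
```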
5. SIMULATION RESULTS
The PSD of each EEG signal was estimated by using Welch's periodogram, as described in Section 3, over equally spaced non-overlapping segments of the signal. A Gaussian window was chosen for computing Welch's periodogram. The length of each segment, and hence of the Gaussian window, was defined as N = 250 points, corresponding to one second (1 s) of brain activity. Each set of periodogram samples is normalized to unit variance and the square root of the amplitudes is taken before using them to build the feature vectors (the distribution of sample values of a random variable can be made more Gaussian by applying a simple square-root transformation). We further computed periodogram samples for 6 different values of the standard deviation of the Gaussian window, thus augmenting the number of feature vectors to 6 x 3225 = 19350. From this total, 80% are used for training the classifiers and 20% for testing them (hold-out validation). To estimate the classification rates we performed 100 runs of the training/testing procedure, randomly selecting feature vectors for the training and testing sets at each run. Classification rates are given per subject and per standard deviation of the Gaussian window (in brackets), as shown in Tables 1 to 3. The overall classification rate of a classifier is averaged over the five subjects. For these simulations the training parameters were the following:
MLP: 6 input neurons, 30 hidden neurons and 5 output neurons were used. The learning rate and momentum factor were set to 0.35 and 0.85, respectively. Hidden and output neurons have logistic activation functions. The Levenberg-Marquardt method was used to train the MLP according to the input-output representation described in Sections 3 and 4 (see the code sketch after this parameter list). A training run is stopped when MSE ≤ 0.001 or 1000 epochs are reached. Only training runs which converged (i.e. MSE ≤ 0.001) were used to compute the classification rates during the testing phase. Convergence occurred for 60% of the training runs.
SOM: 6 input neurons and 112 output neurons in a 28 x 4 rectangular grid were used. Initial and final learning rates were set to 0.1 and 0.0001, respectively, adopting an exponential decay between these values. Training is carried out in batch mode and each run lasted 1000 epochs.
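As a rough, non-authoritative stand-in for the MLP configuration above, one could use scikit-learn's MLPClassifier; note that scikit-learn provides neither Levenberg-Marquardt training nor the ε-offset target coding of Section 4.2, so the solver and stopping criteria below only approximate the reported setup.

```python
from sklearn.neural_network import MLPClassifier

# Approximate stand-in for the 6-30-5 logistic MLP described above.
# scikit-learn has no Levenberg-Marquardt solver; 'lbfgs' is used instead,
# and integer class labels replace the offset 5-dimensional target vectors.
mlp = MLPClassifier(hidden_layer_sizes=(30,),
                    activation='logistic',
                    solver='lbfgs',
                    max_iter=1000,
                    tol=1e-3)
# Usage with the hold-out protocol of this section:
# mlp.fit(X_train, y_train); rate = mlp.score(X_test, y_test)
```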
From the tables we infer that the MLP classifier performed much better than the QGC and SBC algorithms. It is worth noting that even the SBC algorithm performed, on average, better than previously reported classifiers for the case of 5 mental tasks. For instance, in [1], Anderson and Sijercic were able to identify which of five mental tasks a person is performing with only 70% accuracy for two out of the four subjects tested, and near 40% for the other two, also using the MLP. Using a smaller number of mental tasks, Anderson et al. [2] achieved classification rates ranging from 86.1% to 91.4%, but only for the classification of two mental tasks using the MLP classifier. Palaniappan et al. [12] reported classification rates higher than 94%, but only for three mental tasks, using the Fuzzy ARTMAP classifier. The superior performance of the MLP can be explained by remembering that the designs of the QGC and SBC algorithms (Section 4) are based on the assumption of Gaussianity of the data. However, Johnson and Long [6] recently demonstrated that the probability density of the spectral estimates computed by Welch's periodogram is highly non-Gaussian. Since the MLP classifier makes no assumption about the distribution of the preprocessed data, it is able to perform better in the non-Gaussian case, building very general nonlinear discriminant functions. In turn, the QGC performs better than the SBC because it extracts extra information from the data by computing the covariance matrices of each class.

TABLE 1: CLASSIFICATION RATES AND STANDARD DEVIATIONS (QGC ALGORITHM).
TABLE 2: CLASSIFICATION RATES AND STANDARD DEVIATIONS (SBC ALGORITHM).
TABLE 3: CLASSIFICATION RATES AND STANDARD DEVIATIONS (MLP ALGORITHM).
The last set of simulations aims to give a rough idea of the "speed of knowledge acquisition" of each algorithm, i.e., how much information from the inputs it needs in order to achieve a reasonable classification rate. Figure 2 shows the results, which were obtained by averaging the classification rates (computed on the remaining testing vectors) over 100 training runs for each size (in percentage of the total number of vectors) of the training set. It is worth noting that both the MLP and the QGC require less information to discriminate the data than the SBC algorithm. The MLP classifier achieved higher classification rates than the QGC and SBC algorithms. As a general conclusion for this simulation, one can say that the usual approach of partitioning the data vectors into two sets, using 75-80% of the available vectors for training and the remaining 20-25% for testing, may be too conservative for some classifiers.
Figure 2: Average classification rates of the QGC, SBC and MLP classifiers as a function of the training-set size (in percent of the total number of feature vectors).
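A minimal sketch of the repeated hold-out protocol used throughout this section and for the learning-curve experiment of Figure 2; the helper name and the fit/predict classifier interface (as in the sketches of Section 4) are assumptions rather than the authors' code.

```python
import numpy as np

def average_rate(classifier, X, y, train_frac=0.8, n_runs=100, seed=0):
    """Mean test classification rate over n_runs random train/test splits."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        n_train = int(train_frac * len(X))
        tr, te = idx[:n_train], idx[n_train:]
        classifier.fit(X[tr], y[tr])
        rates.append(np.mean(classifier.predict(X[te]) == y[te]))
    return np.mean(rates)

# Learning curve: classification rate as a function of the training-set size
# curve = [average_rate(clf, X, y, train_frac=f) for f in np.arange(0.1, 0.9, 0.1)]
```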
6. CONCLUSION
This paper aimed to find a concise representation of EEG data, corresponding to five (5) mental tasks performed by different individuals, for classification purposes. For that, we proposed the use of Welch's periodogram method as a powerful feature extractor and compared the performance of SOM- and MLP-based neural classifiers with that of the Quadratic Gaussian (optimal) classifier. The results have shown that Welch's periodogram allowed all classifiers to achieve higher classification rates than those presented so far in the literature.
Acknowledgment
The authors would like to thank CNPq (DCR 30.527.5/2002-0) and the Federal University of Ceará (UFC) for the financial support.
REFERENCES

[1] C. Anderson and Z. Sijercic, "Classification of EEG signals from four subjects during five mental tasks," in Proceedings of the International Conference on Engineering Applications of Neural Networks (EANN), London, England, 1996, pp. 407-414.

[2] C. W. Anderson, E. A. Stolz and S. Shamsunder, "Multivariate autoregressive models for classification of spontaneous electroencephalographic signals during mental tasks," IEEE Transactions on Biomedical Engineering, vol. 45, no. 3, pp. 277-286, 1998.
[3] D. Garrett, D. A. Peterson, C. W. Anderson and M. H. Thaut, "Comparison of linear, nonlinear and feature selection methods for EEG signal classification," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 2, pp. 141-144, 2003.

[4] N. Hazarika, J. Z. Chen, A. C. Tsoi and A. Sergejew, "Classification of EEG signals using the wavelet transform," Signal Processing, vol. 59, pp. 61-72, 1997.

[5] H. Jasper, "The ten-twenty electrode system of the international federation," Electroencephalography and Clinical Neurophysiology, vol. 10, pp. 371-375, 1958.

[6] P. E. Johnson and D. G. Long, "The probability density of spectral estimates based on modified periodogram averages," IEEE Transactions on Signal Processing, vol. 47, no. 5, pp. 2429-2438, 1999.

[7] S.-L. Joutsiniemi, S. Kaski and T. A. Larsen, "Self-organizing map in recognition of topographic patterns of EEG spectra," IEEE Transactions on Biomedical Engineering, vol. 42, no. 11, pp. 1062-1068, 1995.

[8] Z. A. Keirn and J. I. Aunon, "A new mode of communication between man and his surroundings," IEEE Transactions on Biomedical Engineering, vol. 37, pp. 1209-1214, 1990.

[9] R. E. Kingsley, Concise Text of Neuroscience, Baltimore: Lippincott, Williams and Wilkins, 2nd edn., 2000.

[10] T. Kohonen, Self-Organizing Maps, Berlin: Springer-Verlag, 2nd edn., 1997.

[11] K.-R. Müller, C. W. Anderson and G. E. Birch, "Linear and nonlinear methods for brain-computer interfaces," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 2, pp. 165-169, 2003.

[12] R. Palaniappan, R. Paramesran, S. Nishida and N. Saiwaki, "A new brain-computer interface design using Fuzzy ARTMAP," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 10, no. 3, pp. 141-148, 2002.

[13] J. C. Principe, N. R. Euliano and W. C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations, John Wiley & Sons, 2000.

[14] R. M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, Wiley-Interscience, 2002.

[15] A. C. K. Soong and Z. J. Koles, "Principal-component localization of the sources of the background EEG," IEEE Transactions on Biomedical Engineering, vol. 42, no. 1, 1995.

[16] C. W. Therrien, Discrete Random Signals and Statistical Signal Processing, New Jersey: Prentice-Hall, 1992.

[17] A. Webb, Statistical Pattern Recognition, Wiley & Sons, 2002.

[18] P. D. Welch, "The use of the fast Fourier transform for the estimation of power spectra," IEEE Transactions on Audio and Electroacoustics, vol. 15, pp. 70-73, 1967.