EFFICIENT CLASSIFICATION SYSTEM BASED ON LOW ERROR RATE by Neha Rathore

Spring’09 EE559-Project: Mathematical Pattern Recognition Neha Rathore-5994499980

[MATHEMATICAL PATTERN RECOGNITION] Project Assignment – 27 April 2009

EE559 Mathematical Pattern Recognition Project Report

Abstract
In this project, we design a classification system that aims to achieve a low error rate. We have been given two datasets, the modified Wine data and the USPS dataset, each with training and test data. The report is divided into three sections. Section 1 covers the baseline performance on the training data with the “Linear Bayes Classifier (LDC)”, followed by the effect of dimensionality reduction on the classification results. In Section 2, we study different classifiers and test their performance on our training set. In this report we have used support vector machines, K-nearest neighbor, Parzen, Quadratic and Perceptron classifiers. Finally, in Section 3, we classify the test data with the best classifier chosen in Section 2 and comment on the final results. We used PRTools in MATLAB to analyze the given data.

Introduction
Datasets:
(i): Modified Wine Data: The dataset comes from the chemical analysis of wines grown in one region of Italy but derived from three different cultivars. The given dataset has thirteen features:
1. Alcohol
2. Malic Acid
3. Ash
4. Alcalinity of Ash
5. Magnesium
6. Total Phenols
7. Flavanoids
8. Nonflavanoid Phenols
9. Proanthocyanins
10. Color Intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline
The number of classes is three and the data is preprocessed. The training data has 128 labeled samples from 3 classes. The test data has 50 unlabeled samples. Frequency of occurrence (over all 178 samples): class 1: 59; class 2: 71; class 3: 48. Number of features: 13.

(ii): USPS Dataset: The data consists of 16x16 normalized and deslanted images of digits scanned from actual letters handled by the US Postal Service. The training data has 7291 labeled prototypes from 10 classes (one for each digit), with 256 features per prototype. The test data has 2007 unlabeled prototypes. Frequency of occurrence: the 7291 training observations and 2007 test observations are distributed as follows:

Counts:
          0     1     2     3     4     5     6     7     8     9   Total
Train  1194  1005   731   658   652   556   664   645   542   644    7291
Test    359   264   198   166   200   160   170   147   166   177    2007

Proportions:
          0     1     2     3     4     5     6     7     8     9
Train  0.16  0.14  0.10  0.09  0.09  0.08  0.09  0.09  0.07  0.09
Test   0.18  0.13  0.10  0.08  0.10  0.08  0.08  0.07  0.08  0.09

Number of features: 256, representing the grayscale values (0-255) of the pixels.
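The proportions in the table above follow directly from the counts. A minimal sketch of that arithmetic is given below; the variable and function names are illustrative and are not part of the project code.

```python
# Class counts copied from the USPS table above (digits 0-9).
train_counts = [1194, 1005, 731, 658, 652, 556, 664, 645, 542, 644]
test_counts = [359, 264, 198, 166, 200, 160, 170, 147, 166, 177]


def class_proportions(counts):
    """Return each class count divided by the total number of samples."""
    total = sum(counts)
    return [c / total for c in counts]


train_props = class_proportions(train_counts)
test_props = class_proportions(test_counts)

for digit, (p_train, p_test) in enumerate(zip(train_props, test_props)):
    print(f"digit {digit}: train {p_train:.2f}, test {p_test:.2f}")
```

These class frequencies are also what is used as priors when a dataset carries no explicit priors, as the PR_Warning in the output of Section 1 indicates.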

Aim: We need to find a method of classification that achieves the lowest error rate by means of distribution-free classification, non-parametric techniques and parametric techniques.

Section 1 –Baseline Performance
In this section we set the Linear Bayes classifier (LDC) as the benchmark and run it on the labeled samples of the training datasets mod_wine_train.mat and zip_train.mat. We perform 5-fold cross validation to estimate the performance and repeat it 10 times. The mean is taken as the “baseline”, and we report the error and standard deviation for both datasets.

%-----------------------------OUTPUT-----------------------------%
PR_Warning: getprior: crossval: No priors found in dataset, class frequencies are used instead
mod_wine_train - baseline error: 0.028906
Baseline standard deviation for mod_wine_train: 0.0082762
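The repeated 5-fold protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the project itself used PRTools in MATLAB, the nearest-mean classifier here is a simple stand-in and not the LDC, the toy data is hypothetical, and whether the reported standard deviation is the sample or population version is not known (the sample version is used here).

```python
import random
import statistics


def nearest_mean_fit(X, y):
    """Stand-in classifier (NOT the LDC): store the mean vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def nearest_mean_predict(means, x):
    """Assign x to the class whose mean is closest in squared Euclidean distance."""
    return min(means, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))


def kfold_error(X, y, k, rng):
    """One k-fold run: every sample is used for testing exactly once."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    wrong = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = nearest_mean_fit([X[i] for i in train], [y[i] for i in train])
        wrong += sum(nearest_mean_predict(model, X[i]) != y[i] for i in fold)
    return wrong / len(X)


def repeated_kfold(X, y, k=5, repeats=10, seed=0):
    """Mean and sample standard deviation of the k-fold error over several repetitions."""
    rng = random.Random(seed)
    errors = [kfold_error(X, y, k, rng) for _ in range(repeats)]
    return statistics.mean(errors), statistics.stdev(errors)


# Toy data: two well-separated 2-D classes (hypothetical, for illustration only).
X = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 5.0] for i in range(10)]
y = [1] * 10 + [2] * 10
mean_err, std_err = repeated_kfold(X, y)
print(f"baseline error: {mean_err:.4f}, standard deviation: {std_err:.4f}")
```

Repeating the whole k-fold run with a fresh shuffle is what makes a standard deviation of the error estimate available at all; a single run yields only one number.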

%-----------------------------OUTPUT-----------------------------%
PR_Warning: getprior: crossval: No priors found in dataset, class frequencies are used instead
zip_train - baseline error: 0.079111
Baseline standard deviation for zip_train: 0.00057757
%-----------------------------------------------------------------%

Dimensionality Reduction: We now apply some basic dimensionality reduction techniques to reduce the number of features in the classification data and study the effect on classification in terms of error metrics. Feature selection aims at energy compaction: the features containing most of the information are selected, which reduces both the redundancy in the data and the execution time. A good feature selection technique can give very good classification using a minimal number of features. In our datasets, we reduce the number of features from 13 to 1 for mod_wine_train and from 256 to 1 for zip_train. We do this by removing one dimension at a time and calculating the error of the Linear Bayes classifier. We then plot graphs to analyze the classification performance versus the dimensionality of the data, together with scatter plots showing the decision boundaries.

Section 2 –Comparing Classifiers: The most common classifiers in each class are:
• Distribution-free classifiers: Multiple-class Perceptron; Pseudoinverse; Widrow-Hoff; Ho-Kashyap; Support Vector Machines (SVM).
• Parametric techniques: Mahalanobis distance; Maximum-Likelihood (ML) estimate; Minimum Mean Square Error (MMSE) estimate; Maximum a posteriori (MAP) estimate; Bayes classifier.
• Non-parametric techniques: Parzen window estimation; K-nearest neighbor estimation.
In our case, we choose the Perceptron, Ho-Kashyap, SVM, normal-density-based linear (ldc) and quadratic (qdc) classifiers, Parzen window estimation (parzenc) and K-nearest neighbor (knnc) to test performance on the two datasets.

LDC: MOD WINE TRAIN DATASET
[Figure: 0.8 train / 0.2 test]

[Figure: 0.1 train / 0.9 test]

Effect of feature set or dimension reduction: We have used two methods of dimension reduction: standard principal component analysis (pca) and backward feature selection (featselb). We reduce the dimension from the maximum number of dimensions down to 1 and observe the effect of dimension reduction by means of the error curve. The classification is tested with the baseline classifier (ldc) and the results are plotted. We notice that, for backward feature selection, the data is reorganized and separated in such a way that linear separation is possible in low dimensions. Hence, we have a good enough error rate up to a dimension of 9, and it shoots up to 0.4 from a minimum of 0.8 after the number of dimensions is further increased. For PCA, we see that the classifier tries to find linear boundaries for the given scatter plot. We tried training fractions of 0.8, 0.6, 0.4 and 0.2, with the remainder used as the test set, and found that the boundaries remain more or less unchanged, along with the error. We then drastically reduced the training fraction to 0.1 and trained the classifier. The boundaries achieved are better, with a minimum error of 0.07 up to a dimension of 5, and the error shoots up beyond that. This is to be expected, since so little training data was used.

ZIP-TRAIN DATASET

[Figure: 0.8 train / 0.2 test]

[Figure: 0.5 train / 0.5 test]

Effects of dimensionality reduction on zip_train: Since the process took a very long time to run for 10 repetitions of the error estimation, we did not test many combinations of training and test fractions. Instead, we used just 1 repetition for the error. We see that the data for zip_train is very scattered. Also, the features of an image are the spatial distribution of its gray level values. When we reduce the dimensions one by one, we see that the error becomes almost constant after about 25 dimensions. A possible reason is that beyond 25 dimensions there is a lot of overlap in the gray level values. Since the training set consists of normalized images that are more or less black and white, there is a high probability that the features of the images in the dataset lie very close to each other. This reduces the probability of error.

Also, reducing the dimensions one by one means that we shrink the image and keep only the first n features. This might truncate important features of the image. For example, for the digit 2 it is very important to consider the last row, as it gives the lower bar and distinguishes it from a handwritten 7. A better approach would be to take the DCT of the image and then reduce the dimension in a similar way. Since the DCT places the low frequencies at the beginning of the transformed image and the high frequencies at the end, cutting the high frequencies first would result in better recognition rates. Another approach is to build the feature set, after preprocessing, from quantities such as the number of end points, horizontal and vertical symmetry, the number of horizontal and vertical bars, and so on. Such a feature set would give a more holistic view of the character and might result in better recognition on the test set. The current scheme does not take these factors into account and classifies only according to the spatial distribution of the gray levels, which results in large errors.

Aside: LDC computes the linear classifier between the classes of dataset A by assuming normal densities with equal covariance matrices. The joint covariance matrix is the average of the class covariance matrices, weighted by the a priori probabilities. Since LDC assumes equal covariance matrices, it treats the data from a more generalized point of view.

QDC: MOD WINE TRAIN DATASET

[Figures: J=1; J=2; J=4]

We see that the error curve is best for J=1, and the boundaries cover a fairly good number of the data points of each class. In PRTools, qdc is described as “Bayes-Normal-2, untrained mapping --> qdc”. It takes into account the probability distribution of the training data; hence, for a larger number of dimensions it gives a lower error rate.

ZIP-TRAIN DATASET

The error fluctuates as the number of dimensions is increased. This is because, for each increase in the number of dimensions, the classifier is trained again on a new probability distribution, and hence the error takes such values depending on the spatial distribution of the gray levels.
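The difference between the two normal-density classifiers discussed so far can be written compactly. The block below uses the standard textbook discriminant functions, with assumed notation that does not appear in the report: class means \mu_i, class covariance matrices \Sigma_i, priors P(\omega_i) and c classes. It is meant as a summary of the general technique, not as a statement of the exact PRTools implementation.

```latex
% QDC: each class keeps its own covariance matrix, giving quadratic boundaries
g_i^{\mathrm{QDC}}(x) = -\tfrac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)
                        - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert + \ln P(\omega_i)

% LDC: one joint covariance matrix, the prior-weighted average of the class covariances
\Sigma = \sum_{i=1}^{c} P(\omega_i)\,\Sigma_i

% With a shared \Sigma the quadratic term in x cancels between classes,
% leaving a discriminant that is linear in x
g_i^{\mathrm{LDC}}(x) = \mu_i^{T}\Sigma^{-1}x
                        - \tfrac{1}{2}\mu_i^{T}\Sigma^{-1}\mu_i + \ln P(\omega_i)
```

A sample x is assigned to the class with the largest g_i(x) in both cases; only the covariance model differs.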

UDC: MOD WINE TRAIN DATASET

UDC computes a quadratic classifier between the classes in dataset A, assuming normal densities with uncorrelated features. Since it assumes uncorrelated data, it assumes a best case. Hence the error decreases as we increase the dimension, and it is lower than for the other classifiers. In reality, however, the data is correlated, so this might not give a good estimate of the recognition rate.

ZIP-TRAIN DATASET

Since the classifier assumes uncorrelated data, it classifies moderately well with as few as 25 features. But as the number of dimensions increases, the classifier faces a higher error rate due to the similarity in the symbol set.
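The uncorrelated-features assumption amounts to giving every class a diagonal covariance matrix, so the class density factorizes into one-dimensional Gaussians. A minimal sketch of such a classifier follows; it illustrates the idea only and is not the PRTools udc code. The function names, the variance floor and the toy data are assumptions made for this example.

```python
import math


def fit_uncorrelated_gaussians(X, y):
    """Per class: prior, per-feature mean and per-feature variance (diagonal covariance)."""
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        n, d = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        # Small floor keeps every variance strictly positive (an assumption of this sketch).
        variances = [
            max(sum((r[j] - means[j]) ** 2 for r in rows) / n, 1e-9) for j in range(d)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def log_discriminant(x, prior, means, variances):
    """Log prior plus the sum of independent 1-D Gaussian log densities."""
    score = math.log(prior)
    for v, m, s2 in zip(x, means, variances):
        score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
    return score


def predict(model, x):
    """Pick the class with the largest log discriminant."""
    return max(model, key=lambda c: log_discriminant(x, *model[c]))


# Hypothetical toy data: two classes with two features each.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 0.5], [4.2, 0.7], [3.8, 0.3]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_uncorrelated_gaussians(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [4.1, 0.4]))
```

Because only d variances per class are estimated instead of a full d-by-d covariance matrix, this model needs far fewer parameters, which is one reason it remains usable on the 256-feature zip_train data.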

PARZEN WINDOW: MOD WINE TRAIN DATASET

The Parzen classifier places a window around the data and, with the help of a smoothing parameter, assigns classification regions. PRTools computes the optimum smoothing parameter H for the Parzen classifier between the classes in dataset A. We see that the error first decreases and then increases. This may be because of overlapping data in the higher dimensions.

ZIP-TRAIN DATASET
Out of memory error!

KNN: MOD WINE TRAIN DATASET

[Figures: 3 nearest neighbors; 10 nearest neighbors]

We tested the data with a 0.8 training fraction and a 0.2 test fraction, using 3 and 10 nearest neighbors. The error patterns remain almost the same: the error first decreases, then increases, then decreases again and finally increases. Since KNN takes the majority of the prototypes among the nearest neighbors, it is possible that in higher dimensions more prototypes of the same class lie close to each other, which results in better recognition in higher dimensions. The tipping point in this case might be at more than 11 dimensions.

ZIP-TRAIN DATASET

The error for zip_train remains the same after 25 dimensions. This is because the number of classes is high and the symbol set is similar, so it is hard to distinguish the classes from the given data (due to overlapping feature values). Hence, this error pattern is well justified.
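The majority-vote rule that the KNN discussion relies on can be sketched in a few lines. This is an illustration of the general technique with squared Euclidean distance, not the PRTools knnc code; the function name and toy prototypes are assumptions made for this example.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training prototypes."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy prototypes: two classes in two dimensions.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [1, 1, 1, 2, 2, 2]

print(knn_predict(train_X, train_y, [0.5, 0.5], k=3))
print(knn_predict(train_X, train_y, [5.5, 5.5], k=3))
```

With two classes an odd k avoids tied votes; with more classes, as in zip_train, ties are still possible and need a tie-breaking convention.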

SVM: MOD WINE TRAIN DATASET

The minimum error is 0.05 for 11 dimensions. This is achievable because the data is not overlapping and can be classified with linear boundaries at an acceptable error rate. We have seen similar boundaries for the linear classifiers above. The application was too heavy for MATLAB, however, and the system could not take the load of classifying.

ZIP-TRAIN DATASET
The dataset was too heavy for testing the error 10 times; the system ran out of memory.

MULTICLASS PERCEPTRON: MOD WINE TRAIN DATASET

[Figure: 0.2 train / 0.8 test]

The perceptron gives good boundaries, with a minimum error of approximately 0.05 for 6 dimensions. This is expected behavior, as the data is not linearly separable. For a smaller number of dimensions, the perceptron can classify better, as there is less overlap in the data. The perceptron does converge as expected.

ZIP-TRAIN DATASET

Since the application was heavy, we decremented the dimension in steps of 16. Hence the error curve shows missing values for particular dimensions. We see that the error is extremely high for 1 dimension, as a large number of prototypes might have the same grey level value at pixel location one. Afterwards, however, the error remains more or less constant, because the perceptron is inefficient at recognizing data that is heavily overlapped or densely scattered.
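For reference, a common form of the multiclass perceptron keeps one weight vector per class and, on each mistake, moves the true class towards the sample and the wrongly predicted class away from it. The sketch below follows that textbook rule; it is not the implementation used in the project, and the function names, learning rate, epoch limit and toy data are assumptions made for this example.

```python
def predict_augmented(W, xa):
    """Return the class whose weight vector gives the largest score for xa."""
    return max(W, key=lambda c: sum(w * v for w, v in zip(W[c], xa)))


def train_multiclass_perceptron(X, y, epochs=1000, lr=1.0):
    """One weight vector per class; update only when a sample is misclassified."""
    classes = sorted(set(y))
    dim = len(X[0]) + 1  # +1 for the bias term
    W = {c: [0.0] * dim for c in classes}
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, y):
            xa = list(x) + [1.0]  # augmented sample
            pred = predict_augmented(W, xa)
            if pred != target:
                mistakes += 1
                for j, v in enumerate(xa):
                    W[target][j] += lr * v  # reward the true class
                    W[pred][j] -= lr * v    # penalize the predicted class
        if mistakes == 0:  # a full pass without errors: converged
            break
    return W


def predict(W, x):
    return predict_augmented(W, list(x) + [1.0])


# Hypothetical, linearly separable toy data with three classes.
X = [[0, 0], [1, 0], [0, 1], [10, 0], [11, 0], [10, 1], [0, 10], [1, 10], [0, 11]]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
W = train_multiclass_perceptron(X, y)
print([predict(W, x) for x in X])
```

The early exit only triggers when the training data is linearly separable; on overlapping data the loop runs until the epoch limit and the final weights depend on where it stops.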

Training data error (0.8, 0.2):

Classifier                 Mod_wine_train            Zip_train
                           Error          Dimension  Error                    Dimension
LDC (PCA+LDC)              ~0.2           11         0.08 (BEST)              48
QDC                        ~0.05          7          0.09                     25
UDC                        ~0.02 (BEST)   11         0.08                     25
KNN                        ~0.215         12         0.025 (5 repetitions)    25
PERCEPTRON                 ~0.08          6          0.09 (decrement of 16)   25
PARZEN WINDOW              ~0.16          9          Out of memory!           -
SUPPORT VECTOR MACHINES    ~0.05          6          Out of memory!           -

Test data error: cannot be determined for any of the seven classifiers, since the test data is unlabeled.

The baseline classifier presented good results. For the mod_wine_train data, the UDC outperforms the LDC, as it makes more assumptions about the correlation of the data. Similarly, the QDC also gives better performance than the LDC, as the data is not exactly linearly separable but is fairly separable. The QDC is able to draw quadratic (for example circular) boundaries, which fit the classes better. However, for zip_train, the LDC gives the best performance, reaching its lowest error rate at a higher number of dimensions. This may be because a 0.8 fraction of the data is used for training, so there is a high probability that the remaining 0.2 fraction used for testing lies within the trained boundaries that separate the closely placed class clusters.

Confusion matrix for mod_wine_train, UDC:
  6   2   0
  0  10   0
  0   0   6

Confusion matrix for mod_wine_train, LDC:
  8   0   0
  0  10   0
  0   0   6

Confusion matrix for zip_train, LDC:
201    0    0    0    0    0    0    0    0    0
  0  128    5    2    0    2    1    7    0    1
  0    0  118    0    2    0    4    4    1    2
  2    3    0  114    2    2    0    1    5    1
  0    1   10    3   91    1    0    3    0    2
  3    3    0    1    2  116    0    4    0    3
  0    1    0    4    0    0  112    2   10    0
  0    0    5    3    3    1    0   94    1    1
  1    0    0    3    0    0    3    1  120    0
  1    0    0    1    1    2    0    2    1  230

Confusion matrix for zip_train, KNN:
201    0    0    0    0    0    0    0    0    0
  1  143    0    0    0    0    2    0    0    0
  0    1  129    1    0    0    0    0    0    0
  4    1    0  121    0    0    2    0    2    0
  1    1    3    0  102    1    0    0    1    2
  1    0    0    0    0  130    0    0    0    1
  0    0    0    0    0    0  129    0    0    0
  0    1    6    0    3    1    1   96    0    0
  0    0    0    4    0    0    3    0  121    0
  0    0    1    0    1    1    0    0    0  235
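An error rate can be read off any of the confusion matrices above as the fraction of samples that lie off the main diagonal. A minimal sketch, using the two mod_wine_train matrices as input (the function name is illustrative):

```python
def error_rate(confusion):
    """Fraction of samples that fall off the main diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


# 3x3 matrices for mod_wine_train as listed above.
udc_wine = [[6, 2, 0], [0, 10, 0], [0, 0, 6]]
ldc_wine = [[8, 0, 0], [0, 10, 0], [0, 0, 6]]

print(f"UDC: {error_rate(udc_wine):.3f}")  # 2 of 24 samples misclassified
print(f"LDC: {error_rate(ldc_wine):.3f}")  # no samples misclassified
```

Each matrix reflects a single train/test split, so the value obtained this way need not match the averaged errors reported in the summary table.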

Section 3 (Problem 3): As the data provided is unlabeled, there is no way to know the test error. The classification results have been submitted by e-mail.

General comparison of classifiers for the mod_wine_train data:

Errors for individual classifiers (test results):
clsf_1 : klm - ldc          0.167
clsf_2 : NN-FFS - ldc       0.042
clsf_3 : LDC-FFS - ldc      0.042
clsf_4 : ldc                0.000
clsf_5 : 1-NN               0.250

Errors for combining rules (test results):
clsf_1 : Product combiner   0.000
clsf_2 : Mean combiner      0.125
clsf_3 : Median combiner    0.000
clsf_4 : Maximum combiner   0.250
clsf_5 : Minimum combiner   0.000
clsf_6 : Voting combiner    0.000
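The combining rules in the list above differ only in how the per-classifier outputs for one sample are merged before the final decision. The sketch below illustrates the six rules on posterior estimates; it is not the PRTools combiner code, and the function name and the posterior values are hypothetical.

```python
import math
import statistics
from collections import Counter


def combine(posteriors, rule):
    """posteriors[m][c] = posterior of class c from classifier m, for one sample."""
    n_classes = len(posteriors[0])
    if rule == "vote":
        # Each classifier votes for its own most probable class.
        votes = Counter(max(range(n_classes), key=p.__getitem__) for p in posteriors)
        return votes.most_common(1)[0][0]
    reducers = {
        "product": math.prod,
        "mean": statistics.mean,
        "median": statistics.median,
        "max": max,
        "min": min,
    }
    reduce_fn = reducers[rule]
    # Merge the classifiers' outputs class by class, then pick the best class.
    scores = [reduce_fn([p[c] for p in posteriors]) for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


# Hypothetical posteriors from three classifiers for a three-class problem.
p = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.1, 0.8, 0.1]]
for rule in ("product", "mean", "median", "max", "min", "vote"):
    print(rule, combine(p, rule))
```

On this toy input the median and voting rules choose class 0 while the other four choose class 1, which shows how the rules can disagree on the same set of base classifiers.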