Real-Time Face Recognition System Using Radon and DCT Transforms
Neha Rathore Shivani Pandya [EE586- Project Report]
Contents
Abstract
Introduction
System Description
  Hardware
  Software
Selection of Algorithms
  Preprocessing
  Downsampling
  Normalization
    Wide-Sense histogram equalization
    Full Scale linear Scaling
  Feature Selection
  PCA
  Linear Discriminant Analysis (LDA)
  Elastic Graph Bunch Matching
    Radon
    Discrete Cosine Transform
    Feature extraction using Radon transform and DCT
  Classification
    Euclidean Distance
    K-Nearest Neighbor Classifier
Implementation and Results
  Database Collection
  Dimensionality reduction
    Downsampling
    Selection of downsample image size
    Radon Transform
    DCT Transform
    Selection of number of coefficients
    Selection of Classifier (ED results and KNN)
    Selection of value of K
Challenges
  Dimensionality
  Processing time for Radon for different image sizes
    Scatter plot of data
Limitations with Real Time Response
    Displaying classification result
    Imposter model
Quantitative Results
Conclusion
Future Work
References
Abstract We propose a real-time face recognition system suitable for small businesses and home security systems. The goal of this project is to build a system that works in real-world situations where the user is not constrained by lighting conditions, slight variations in pose, or in-plane rotation. The system is designed for good recognition rates and addresses common problems in face recognition systems such as illumination and rotation. For a system based purely on face recognition, it is particularly difficult to achieve good recognition rates without some kind of clustering or linear separation of the data. The system is a prototype of a real-time face recognition system that uses the Radon transform and 2D DCT for feature extraction and KNN for classification, giving an acceptable performance of 86% on a small database. The performance of the system is presented in terms of recognition rates for various combinations of feature extraction techniques. The system shows a recognition rate as high as 98% for the offline data, which consisted of 30 test images, 20 validation images and 10 training images.
Introduction In today's age of technology, small-scale security systems are in high demand. Such a security system should be non-obtrusive and require little user interaction. Facial recognition satisfies these needs because showing one's face does not require any specific action from the user. If, in addition, the system can recognize a user in an uncontrolled environment, it becomes a non-obtrusive means of recognition.
Some of the desirable features of such systems include ease of use, low error rates, low cost of implementation, portability and ease of integration. This report describes a prototype implementation of face recognition and verification algorithms in a stand-alone system using the TI TMS320C6713 floating point processor. The system is organized to capture an image sequence, find the features of the face in the images, and recognize and identify a person from a database of 20 people in indoor building lighting conditions. For each person, an image database is collected in two sessions, possibly on different days, to capture the maximum variance in illumination and face poses; 30 images per session were stored in the database for each person.
One of the main challenges in a face recognition system is finding informative and discriminative features for each class of images. A 2D-DCT of face images is very sensitive to pose variations, whereas the most commonly used techniques, such as PCA and LDA, are computationally very expensive for a small hardware system. Face recognition based on Gaussian mixture models also poses a difficulty: singularities arise because the feature set is large compared to the small number of training prototype images. Hence, for this system we used a combination of the Radon transform and the 2D DCT to obtain features that capture the low frequency information that is crucial to a face recognition system. The property of the Radon transform to enhance the low frequency components, which are useful for face recognition, has been exploited to derive effective facial features. The data compaction property of the DCT yields a lower-dimensional feature vector. The proposed technique computes Radon projections in different orientations and captures the directional features of the face images. Further, the DCT applied to the Radon projections provides frequency features. The technique is invariant to in-plane rotation (tilt) and robust to zero-mean white noise. The system is also tested for the combinations 2D-DCT only (10 coefficients), 2D-DCT only (5 coefficients), and Radon + DCT (10 coefficients), using a simple Euclidean distance based classifier and a k-nearest neighbor classifier for different values of K.
System Description Hardware: The system is implemented using a floating point DSP processor, the TI TMS320C6713, along with the DSP STAR TFT LCD Video Daughtercard (VM3224K2) and a Color TeleCamera NCK41CV camera. The DSP board has 16 KB of internal RAM, 16 MB SDRAM and 512 MB external RAM. The system is designed as a standalone application and does not need intervention from the computer once it has been loaded for the first time. The training feature set is computed offline on the board from the training image set and stored in SDRAM at the time of loading the program.
Input from the camera: The camera gives images at a rate of 30, 15 or 7 frames per second in 16 bit YUV format. It is a low resolution camera and does not perform well in low lighting conditions, where it introduces a lot of noise, which makes recognition very difficult. Although the system is programmed to achieve good performance even in moderate lighting conditions, the camera is still operated in good lighting conditions.
Software: The face recognition algorithm is as shown in the figure:
System Layout
The system first captures the image of the person through the camera in the 16 bit YUV format. This image is then used to extract the gray level image of the person which is basically the Y value of the image obtained from the camera. This format is then converted to 8 bit format and 8 bit gray level input image is obtained.
This image is then preprocessed to make it suitable for the recognition step. First, the image is downsampled from 128x128 to 64x64, followed by normalization of the image to take care of illumination changes. Since the most common method, histogram equalization, introduces artificial grey level values, we preferred contrast enhancement by the linear scaling method. Once the image is normalized by preprocessing, it is sent to the feature extraction step, where the Radon transform of the image is first calculated for 32 rotation angles and a 32x44 sized image is obtained.
Then a 2D-DCT of this image is performed to capture the low frequency components important for face recognition. During training, this process is followed to collect the feature set of the training set and store it in a file. This feature file is then loaded into the system along with the program and used as the database.
In the test stage, this process produces the feature vector, which is then compared to the n feature vectors of the training set, and a distance measure is calculated from each of the training feature vectors. This is followed by the k-nearest neighbor classification, which sorts the distance values and gives the closest k neighbors; the final classification depends on the frequency of the class indices among these neighbors in the sorted array.
Selection of Algorithms There are various algorithms for face recognition that use either the eigenfaces approach, the geometric features approach or the appearance based approach for extracting features from the given image set.
Preprocessing: Each method of feature selection has its own limitations in terms of being illumination variant, pose variant or dependent on similarity with the training set. For this reason, before applying any feature extraction method we first normalize the image to reduce the effect of lighting and rotations. Similarly, high dimensional data poses a problem in terms of processing time, redundant information and a very large feature set, leading to the "curse of dimensionality" during the classification stage. Hence, we downsample the image to reduce the number of dimensions.
Downsampling As mentioned above, downsampling is an efficient way of reducing redundant information in an image, which might otherwise lead to unnecessary feature values that do not contribute much to the final classification. Downsampling largely preserves the information content of the image while reducing the number of dimensions. There are typically two ways of downsampling an image: bilinear interpolation and pixel averaging.
Bilinear interpolation leads to a sharper image, as each pixel value is reconstructed from the values of its neighbors, thereby taking into account the variations among the neighboring pixels. In image processing tasks, bilinear interpolation is usually the preferred method for image reconstruction. In downsampling by pixel averaging, we simply sum the values of the n pixels in a neighborhood and divide by n, thereby replacing those n pixels with their average. As a result of averaging, the sharp features of the image are lost, resulting in blurring. Apart from this, the only disadvantage of the method is the rounding error that occurs when the average of the pixels is not an integer value. However, this is a very small error and does not lead to a significant difference in the pixel values.
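As an illustration of downsampling by pixel averaging, the following is a minimal Python sketch, not the code that ran on the DSP board. The function name downsample_by_averaging and the choice of non-overlapping 2x2 blocks (which turns a 128x128 image into 64x64) are assumptions made for this example.

```python
def downsample_by_averaging(image, factor=2):
    """Shrink a 2-D grayscale image (list of rows) by averaging
    non-overlapping factor x factor blocks of pixels."""
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be multiples of factor")
    block_size = factor * factor
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            total = sum(image[r + dr][c + dc]
                        for dr in range(factor)
                        for dc in range(factor))
            # Round to the nearest integer gray level; this is the small
            # rounding error mentioned in the text.
            out_row.append((total + block_size // 2) // block_size)
        out.append(out_row)
    return out


small = downsample_by_averaging([[10, 20, 30, 40],
                                 [10, 20, 30, 40],
                                 [50, 60, 70, 80],
                                 [50, 60, 70, 81]])
# small is [[15, 35], [55, 75]]
```

The last block averages to 75.25, which is stored as 75, showing how small the rounding error is relative to the 0-255 range.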
EXAMPLE: Downsampling Original 128x128 image
Pixel Averaging
Bilinear Interpolation
We know that the edges of an image are represented as high frequency components in the frequency domain, while smooth regions of the image such as the cheeks, nose and forehead correspond to low frequency components. As we need to capture the low frequency components of the image for good face recognition, the pixel averaging method is more suitable in our case.
Normalization Normalization of the image is important to take care of the illumination changes.
HISTOGRAM EQUALIZATION: The most common method to normalize an image is histogram equalization, which redistributes the grey levels in the image such that we attain a uniform grey level distribution (pdf) for the image.
Wide-Sense histogram equalization In this method we stretch the original histogram to cover the whole 0-255 range of gray levels. This technique does not guarantee equal number of pixels in each gray level, but gives a contrast enhanced version of input image. We use the following formula:
\[ O_i = \left[ \sum_{j=0}^{i} N_j \right] \times \frac{\text{Max. Intensity Level}}{\text{No. of Pixels}} \]
Here, Max. Intensity Level is the maximum intensity level that a pixel can take; for example, if the image is an 8 bit grayscale image, it is 255. If the image is of size N × N, then the No. of Pixels is N^2. The expression in the bracket is the CDF value (cumulative pixel count) for the input gray level. This is how the new intensity level is calculated for each old intensity level.
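The mapping described by this formula can be sketched in a few lines of Python. This is an illustrative sketch only; the function name equalize_histogram and the integer rounding rule are assumptions, not details taken from the project code.

```python
def equalize_histogram(image, max_level=255):
    """Map each gray level i to
    O_i = (cumulative pixel count up to i) * max_level / number of pixels."""
    n_pixels = sum(len(row) for row in image)
    hist = [0] * (max_level + 1)
    for row in image:
        for p in row:
            hist[p] += 1
    mapping = []
    cumulative = 0
    for count in hist:
        cumulative += count  # bracketed term of the formula (CDF count)
        # Integer arithmetic with round-half-up (assumed rounding rule).
        mapping.append((cumulative * max_level + n_pixels // 2) // n_pixels)
    return [[mapping[p] for p in row] for row in image]


equalized = equalize_histogram([[50, 50], [100, 200]])
# equalized is [[128, 128], [191, 255]]
```

Note how the output levels depend only on the cumulative counts and not on the original gray values, which is why the method can place grey levels where the input had none.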
LINEAR SCALING: The problem with the histogram equalization method is that it introduces artificial grey levels in different locations of the image, depending on the gray level distribution. This is not desirable for our face recognition system. Hence, for our system we chose to normalize the image by linear scaling, which stretches the pdf of the image so that it covers the whole gray scale range while keeping the variations in the image intact.
Full Scale linear Scaling There are three common linear scaling methods. The first is called Linear Image Scaling, in which the processed image is linearly mapped over its entire range. The second is called Linear Image Scaling with Clipping, where the extreme amplitude values of the processed image are clipped to maximum and minimum limits. The last is called Absolute Value Scaling, which utilizes an absolute value transformation for visualizing an image with negatively valued pixels. The second technique is often subjectively preferable, especially for images in which a relatively small number of pixels exceed the limits.
For our purpose, we implement Linear Image Scaling, the first of these methods, applied over the full gray scale range. The idea of linear scaling is illustrated below.
This process involves mapping the histogram of the input image in such a way that the histogram of the output image covers the entire range [0-255] of gray scale levels. The main challenge here is to determine the mapping range. Low contrast images can be the result of poor illumination or a lack of dynamic range in the imaging sensor, and they have a very low dynamic range. Thus the primary idea is to increase the dynamic range of these images, that is, to stretch the range linearly from low to high. We have the equation
\[ G = H(F) = G_{\min} + \frac{G_{\max} - G_{\min}}{F_{\max} - F_{\min}} \, (F - F_{\min}) \]
where (Fmin, Fmax) are the minimum and maximum grey levels occupied by the input image, and (Gmin, Gmax) are the minimum and maximum grey levels desired for the output image.
In a way, this equation has the form of a line y = mx + c, where m is the slope and c is the intercept on the y axis. In our case the slope is given by the quantity (Gmax - Gmin) / (Fmax - Fmin). When we set Gmin = 0 and Gmax = 255, we cover the entire range for 8 bit images; hence the process is called full range linear scaling.
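The scaling equation translates almost directly into code. The sketch below is an assumption-laden illustration in Python (the function name full_scale_linear and the handling of a completely flat image are not taken from the report).

```python
def full_scale_linear(image, g_min=0, g_max=255):
    """Stretch the occupied gray range [F_min, F_max] of the input image
    linearly onto [g_min, g_max]."""
    flat = [p for row in image for p in row]
    f_min, f_max = min(flat), max(flat)
    if f_max == f_min:
        # A flat image has no contrast to stretch; returning g_min
        # everywhere is an assumption made for this sketch.
        return [[g_min for _ in row] for row in image]
    slope = (g_max - g_min) / (f_max - f_min)
    return [[int(round(g_min + slope * (p - f_min))) for p in row]
            for row in image]


stretched = full_scale_linear([[100, 120], [140, 150]])
# stretched is [[0, 102], [204, 255]]
```

The input only occupies the levels 100 to 150, so the slope is 255 / 50 = 5.1 and the relative spacing of the gray levels is preserved, unlike in histogram equalization.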
EXAMPLE: NORMALIZATION
[Figure: original bright, dark and midtone images, shown with their histogram equalized and linearly scaled versions]
Analysis: The figure above shows image normalization by the histogram equalization and linear scaling methods. As mentioned before, we see that the histogram equalization method introduces unwanted effects such as contouring and also introduces unwanted grey levels. On the other hand, the linear scaling method enhances the contrast in such a way that it suppresses any sudden occurrence of bright light and also takes care of poor lighting conditions. Hence our selection of the linear scaling method for image normalization is well justified.
Feature Selection
PCA: PCA is one of the most successful techniques used in face recognition algorithms. The purpose of PCA is to reduce the large dimensionality of the data space to the smaller intrinsic dimensionality of the feature space that is needed to describe the data. This is possible when there is a strong correlation between the observed variables. The main idea of using PCA for face recognition is to express the large 1D vector of pixels constructed from a 2D facial image in terms of the compact principal components of the feature space. For a set of D-dimensional column vectors \{x_i\}_{i=1}^{n}, PCA uses the M dominant eigenvectors of the sample covariance matrix, which is formed as follows:
\[ C = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T \]
where \mu is the sample mean and each v_j is an eigenvector of the covariance matrix C with associated eigenvalue \lambda_j:
\[ C v_j = \lambda_j v_j \]
Linear Discriminant Analysis (LDA) LDA is closely related to PCA in that both look for linear combinations that best explain the data. LDA models the difference between the classes to make the class clusters more separable. Suppose that each of C classes has a mean µi and the same covariance Σ. Then the between-class variability may be defined by the sample covariance of the class means:
\[ \Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T \]
The class separation in a direction \vec{w} in this case will be given by:
\[ S = \frac{\vec{w}^T \Sigma_b \vec{w}}{\vec{w}^T \Sigma \vec{w}} \]
Linear Discriminant Analysis (LDA) finds the vectors in the underlying space that best discriminate among classes. For all samples of all classes, the between-class scatter matrix SB and the within-class scatter matrix SW are defined. The goal is to maximize SB while minimizing SW, in other words, to maximize the ratio ∆SB / ∆SW.
Elastic Graph Bunch Matching The EGBM approach uses the structural information of a face, which reflects the fact that images of the same subject tend to translate, scale, rotate, and deform in the image plane. It makes use of a labeled graph: edges are labeled with distance information and nodes are labeled with wavelet coefficients in jets. This feature model graph can then be used to generate an image graph. The model graph can be translated, scaled, rotated and deformed during the matching process. This can make the system robust to large variations in the images.
Radon The Radon transform has been used to derive enhanced low frequency components, which are useful in face recognition. The Radon transform of a 2 dimensional function f(x, y) is defined as:
\[ R(r, \theta) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, \delta(r - x\cos\theta - y\sin\theta) \, dx \, dy \]
where r is the distance from the center of the image, \theta \in [0, \pi], and \delta is the Dirac delta function.
Radon is an efficient way of extracting frequency components in different directions. The line integrals of the face image, computed during the Radon transform, amplify the low frequency components, which are useful for face recognition. The Radon transform can give very good dimensionality reduction when a proper number of angles (orientations in the range 0-179) is chosen. The Radon transform can achieve lossless compression. It also provides rotation invariance, which is a very important factor in real time recognition systems.
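To make the definition concrete, here is a small pure-Python sketch of a discrete Radon transform. It is only an approximation chosen for readability: every pixel is added to the projection bin nearest to r = x cos θ + y sin θ, whereas the implementation described later in this report rotates the image and sums its columns. The function name radon_transform and the nearest-bin rule are assumptions of this example.

```python
import math


def radon_transform(image, angles_deg):
    """Approximate R(r, theta) by accumulating each pixel into the
    projection bin nearest to r = x*cos(theta) + y*sin(theta),
    with x and y measured from the image center."""
    rows, cols = len(image), len(image[0])
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    # Enough bins to hold the longest (diagonal) projection.
    half = int(math.ceil(math.hypot(rows, cols) / 2.0))
    n_bins = 2 * half + 1
    sinogram = []
    for angle in angles_deg:
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        projection = [0.0] * n_bins
        for y in range(rows):
            for x in range(cols):
                r = (x - cx) * cos_t + (y - cy) * sin_t
                projection[int(math.floor(r + 0.5)) + half] += image[y][x]
        sinogram.append(projection)
    return sinogram


sinogram = radon_transform([[1, 2, 3],
                            [4, 5, 6],
                            [7, 8, 9]], [0, 45, 90])
# At 0 degrees the projection holds the column sums 12, 15, 18.
```

Each row of the result is one projection, so 32 angles produce a 32-row Radon image; every projection sums to the total image intensity because each pixel is counted exactly once per angle.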
Discrete Cosine Transform The 2D DCT is an efficient way of transforming the image such that there is a good distinction between the high and low frequency components. Because of the way the DCT arranges its coefficients spatially, it allows a proper selection of low or high frequency coefficients. The DCT of an N_1 \times N_2 image x is calculated as follows:
\[ X_{k_1,k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1,n_2} \cos\left[ \frac{\pi}{N_1} \left( n_1 + \frac{1}{2} \right) k_1 \right] \cos\left[ \frac{\pi}{N_2} \left( n_2 + \frac{1}{2} \right) k_2 \right] \]
The first DCT coefficient is the DC coefficient and is proportional to the average value of the image. It mainly carries the illumination information and is hence sometimes discarded to remove illumination changes. The DCT has an excellent energy compaction property and divides the image into regions of low and high frequency. It also facilitates feature selection through zig-zag scanning or other methods. An 8x8 block DCT allows capturing of the local frequency distribution and speeds up the overall performance.
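The DCT formula can be evaluated directly for a single block, as in the Python sketch below. It is a literal, unoptimized evaluation without any normalization factors, intended only to illustrate the formula; the function name dct_2d is an assumption, and a real-time system would use a fast algorithm instead.

```python
import math


def dct_2d(block):
    """Direct evaluation of the 2D DCT sum for an N1 x N2 block
    (unnormalized, exactly as written in the formula)."""
    n1, n2 = len(block), len(block[0])
    out = [[0.0] * n2 for _ in range(n1)]
    for k1 in range(n1):
        for k2 in range(n2):
            total = 0.0
            for a in range(n1):
                for b in range(n2):
                    total += (block[a][b]
                              * math.cos(math.pi / n1 * (a + 0.5) * k1)
                              * math.cos(math.pi / n2 * (b + 0.5) * k2))
            out[k1][k2] = total
    return out


# A flat 8x8 block of gray level 10 has all of its energy in the DC term.
coeffs = dct_2d([[10] * 8 for _ in range(8)])
```

For the flat block, coeffs[0][0] equals the sum of all pixels (640) and every other coefficient is zero up to floating point error, which illustrates why the DC term tracks overall illumination.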
OUR APPROACH We are implementing the face recognition system on a DSP board with limited computation and memory capabilities, which constrains our selection of algorithms. PCA and LDA, though very standard methods for face recognition, require computation of the correlation matrix, which is expensive on such hardware. Hence, we decided to implement the Radon transform and the DCT as our feature extraction algorithms.
Feature extraction using Radon transform and DCT (RDCT) Facial features derived in the proposed approach are the frequency components in different directions. The line integrals of face image, during the computation of the Radon transform, amplify low frequency components, which are useful in face recognition. Radon space image for 0–179 orientations is shown in Figure. DCT is used to derive the frequency features from the Radon space. The figure reveals excellent energy compaction property of DCT. Significant coefficients (10) of DCT are concatenated to form the facial feature vector.
In our project, we have used 32 Radon orientations ranging from 0-180 degrees in steps of 5 degrees. The image was divided into blocks of 8x8 for the DCT and 10 coefficients from each block were chosen. The number of DCT coefficients was chosen keeping in mind the computational requirements, data separability and ease of classification. There are many ways one can choose the DCT coefficients of a block, such as methods based on the maximum magnitude of the coefficients, maximum energy, or variance of the coefficients.
There are also published results showing different recognition rates as an effect of overlapping blocks and of the percentage of overlap when calculating the 2D-DCT. In our case, the DCT coefficients of non-overlapping blocks of an image are computed and ordered using zigzag scanning. The main reason behind this is to keep the system computationally inexpensive and to speed up the recognition process. The DCT coefficients extracted from each block are concatenated to obtain the feature vector.
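The zigzag selection and concatenation step can be sketched as follows. This Python example assumes the DCT coefficients of each 8x8 block have already been computed, and it assumes the JPEG-style zigzag order, which the report does not spell out; the names zigzag_order and block_features are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zigzag order,
    starting at the DC coefficient in the top-left corner."""
    order = []
    for s in range(2 * n - 1):
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are walked upwards, odd ones downwards
        # (JPEG-style scan, assumed here).
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order


def block_features(coeff_blocks, per_block=10):
    """Concatenate the first per_block zigzag coefficients of every block."""
    order = zigzag_order(len(coeff_blocks[0]))[:per_block]
    features = []
    for block in coeff_blocks:
        features.extend(block[r][c] for r, c in order)
    return features


# 20 dummy blocks whose entries encode their own position (10 * row + col).
blocks = [[[10 * r + c for c in range(8)] for r in range(8)] for _ in range(20)]
features = block_features(blocks)
# len(features) is 200; the first ten values are
# [0, 1, 10, 20, 11, 2, 3, 12, 21, 30]
```

With 10 coefficients taken from each of 20 blocks, the sketch reproduces the 200-value feature vector size quoted in this report.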
Classification
Euclidean Distance: Euclidean distance is the straight-line distance between two points in one or more dimensions. The Euclidean distance between two points p and q in n-dimensional space is given by:
\[ d = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2 + \cdots + (p_n - q_n)^2} \]
The classifier assigns the test sample to the class that is at the smallest distance from it. Each class prototype is defined as the mean of the feature values of all the prototypes of that particular class.
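A minimal Python sketch of this minimum-distance classifier is given below. The helper names euclidean, class_means and classify_min_distance, as well as the toy two-dimensional data, are assumptions made for illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def class_means(samples, labels):
    """Compute one prototype per class: the mean of its feature vectors."""
    sums, counts = {}, {}
    for vec, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify_min_distance(test, means):
    """Return the label of the class prototype closest to the test vector."""
    return min(means, key=lambda label: euclidean(test, means[label]))


means = class_means([[0, 0], [2, 0], [10, 10], [12, 10]], ["A", "A", "B", "B"])
predicted = classify_min_distance([3, 1], means)
# predicted is "A"
```

Only one distance per class has to be computed at test time, which is what makes this classifier computationally light.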
K-Nearest Neighbor Classifier: The k-Nearest Neighbor (k-NN) method assumes that all instances correspond to points in the n-dimensional space. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. KNN is an instance based learning method. A test sample is assigned to the class that receives the majority of votes among its k nearest neighbors.
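The voting rule can be sketched in Python as follows. The function name knn_classify, the toy data and the tie-breaking behaviour (a tie goes to the class of the nearer neighbor) are assumptions of this example, not details taken from the project code.

```python
import math
from collections import Counter


def knn_classify(test, train_vectors, train_labels, k=3):
    """Classify test by majority vote among its k nearest training vectors."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(test, vec))), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    # Count the class labels of the k closest training vectors.
    votes = Counter(label for _, label in distances[:k])
    # most_common keeps first-seen order among equal counts, so ties
    # fall to the class whose neighbor is nearer (assumed tie rule).
    return votes.most_common(1)[0][0]


train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]]
labels = ["A", "A", "A", "B", "B"]
nearest = knn_classify([4, 4], train, labels, k=3)
# nearest is "B": two of the three closest vectors belong to class B
```

With k = 5 the same test point would be labeled "A", because all five training vectors vote and class A has three of them; this shows why the value of K has to be selected with care.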
Implementation and Results Database Collection: The database was collected for 20 people, with a minimum of 30 images per person. Each subject was asked to fit his/her face into the 128 x 128 box displayed on the LCD. No instructions were given to the subjects in terms of pose or distance from the camera. The only constraint was to provide a frontal image without off-plane rotations, as the selected algorithm does not support these variations. A subject is also quite likely to adopt the natural distance and pose at which he/she is most comfortable at the time of testing or training; hence no further instructions were given, so that the subjects could follow their natural pose at the time of testing.
The database was collected in two sessions for 11 people to capture variance in pose, illumination and zoom factor. The subject was asked to provide a frontal pose while the camera, capturing images at 15 frames per second, buffered the images for 2 seconds. Once the program had collected 30 images, it saved the images in the database, creating a unique name for each file.
[Figure: Sample images from the database (training images and test images)]
Dimensionality reduction
Downsampling The original LCD size was 240x320, i.e. 76.8 Kbytes of data per frame. Taking into consideration the 16 Kbyte size of the internal RAM, processing 1 frame of this size would take about 4.8 seconds plus the processing time of the algorithms. In a real time system, such high processing times are discouraged as they make the system very slow. Since in our case we buffer the images and then process them for classification, taking images of such large dimensions is impractical. Hence, we chose a smaller image size of 128x128, i.e. 16.4 Kbytes, which brings the processing time down to 1 second per frame. However, considering that a video based recognition system would use more than 1 frame for classification, a processing time of 1 second per frame is still large. An image size of 128x128, when displayed on the LCD, provides enough resolution and image size for the user to see and adjust his face in front of the camera. Hence, an image size of 128x128 for display and cropping purposes seems reasonable.
Selection of downsample image size The cropped image is buffered, and the frames for 2 seconds are saved. This enables the system to capture variations in the pose of the face. For a frame rate of 15 fps, we get 30 frames, leading to a total buffer size of 491.5 Kbytes. Processing all the frames of such a large dataset is time consuming. Hence, a need for further reduction in dimension arises. We know that a larger image contains a lot of redundant information, and hence reducing the size of the image does not harm the classification as long as the image is still recognizable. We initially chose an image size of 24x24, but it posed a lot of difficulty, as the image was too small and contained much less information. Since the system is based only on face recognition, we decided to go with a larger size of 64x64 to capture most of the information of the face while keeping the dimension of the feature vector small enough for the processor to handle without much delay.
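The memory figures quoted in this section can be checked with a few lines of arithmetic. The Python sketch below assumes one byte per pixel (8 bit gray level) and 1 Kbyte = 1000 bytes, which is what the quoted figures appear to use; the variable names are chosen for this example only.

```python
BYTES_PER_PIXEL = 1          # assumption: 8 bit gray level image

lcd_frame = 240 * 320 * BYTES_PER_PIXEL        # full LCD frame
cropped_frame = 128 * 128 * BYTES_PER_PIXEL    # cropped face box
frames_buffered = 15 * 2                       # 15 fps for 2 seconds
buffer_128 = frames_buffered * cropped_frame   # buffer of cropped frames
buffer_64 = frames_buffered * 64 * 64 * BYTES_PER_PIXEL  # after downsampling

print(lcd_frame / 1000)      # 76.8 Kbytes
print(cropped_frame / 1000)  # 16.384 Kbytes
print(buffer_128 / 1000)     # 491.52 Kbytes
print(buffer_64 / 1000)      # 122.88 Kbytes
```

Downsampling each buffered frame from 128x128 to 64x64 therefore cuts the buffer to a quarter of its size.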
Radon Transform We learnt that in-plane rotations can be handled well by the Radon transform, which projects the image along various orientations and amplifies the low frequency components by taking line integrals along the columns for each orientation. Since low frequency components are the most important components for face recognition, the Radon transform is a good choice for reducing dimensions and compacting the information contained in the image. Since we used 32 Radon orientations, we obtained a final image size of 32x44. To prevent clipping of data while rotating, the image was zero padded by 12 pixels on each side to make the resulting image 88x88. The padding size was decided by taking into consideration the diagonal length, which is 90. Since the Radon transform is followed by the DCT, where blocking of the image takes place, we chose 88 as the column width so that the image size is divisible by 8 along each side. This makes us lose 2 pixels from each side, but considering that the face of a person is mostly located in the center of the frame, this loss should not cause a serious error in classification. The Radon transform gives us a final image of 32x44, which is then sent to the DCT stage.
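The padding step can be illustrated with a short Python sketch. The function name zero_pad and the constant test image are assumptions for this example; the padding width of 12 pixels per side and the 64x64 input size follow the text.

```python
import math


def zero_pad(image, pad=12):
    """Surround a 2-D image (list of rows) with pad rows/columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


face = [[128] * 64 for _ in range(64)]   # dummy 64x64 image
padded = zero_pad(face)                  # 88x88, a multiple of 8 per side
diagonal = math.hypot(64, 64)            # about 90.5 pixels
```

The padded size of 88 is the largest multiple of 8 below the diagonal length of about 90, which is the trade-off between avoiding clipping and keeping the image divisible into 8x8 blocks.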
DCT Transform A 2D DCT of an image accumulates the low frequency components of the image in the top-left corner. A local, block based DCT helps separate the frequency components of the image on a local basis. This helps us select the coefficients from important blocks containing the eyes, lips, nose, etc. without any loss of data. It has also been reported that a local DCT captures the frequency content better for face recognition purposes. Hence, we decided to use a local 2D DCT with a block size of 8x8. We divide the Radon image into blocks of 8x8 and calculate the DCT coefficients for each block. We have 64 coefficients per block, leading to 4096 total coefficients.
LOW frequency vs. High frequency components: When the DCT is chosen as a feature selection method, we can choose either the high frequency components or the low frequency components as features. If we choose the high frequency components, the regions containing edge information are selected as features. As the edges represent the shape of the face and its features, this is sometimes considered a good approach because it imitates the face very closely. On the other hand, the low frequency components represent an approximation of the image, so they can be considered a source of classification errors. However, choosing high frequency components makes the dependency on the training set too high and can lead to high classification errors if the subject doesn't present an image similar to his training set. Choosing the low frequency components means that we are capturing most of the energy of the image, which can in turn lead to better classification results. Hence, we chose the low frequency components as our feature set. Picking 10 coefficients from each block gives us a feature vector of 200 coefficients, i.e. 0.8 Kbytes for each image. This is a dimensionality reduction of 98%.
Selection of number of coefficients Number of Radon angles: The selection of the number of Radon angles was an important decision. We wanted to cover the whole range of 0-179 but at the same time keep the computational complexity, in terms of time and processing, very low. We noticed that the 128x128 images were taking 1 second to rotate for each angle. This was due to the small size of the internal RAM. Hence, in total, each frame took about 32 seconds for the Radon transform. This hurdle was overcome when we reduced the size of the image to 64x64. Now each image took only 8 seconds to complete the recognition process, including the 32 rotations of the Radon transform. Although 8 seconds is quite high for a real time system, we decided to go with it and focus on optimizing the loops to reduce the time spent on rotations. For the 32 angles, we took angles in steps of 5 degrees to cover the 0-179 range of orientations. However, we learnt later that a better approach could have been to take the angles from 0-30 in steps of two and then flip the results to obtain the corresponding orientations in the negative direction. For example, flipping the orientation of 10 degrees would give a resulting orientation of -10 degrees, which is in effect 170 degrees. Also, in a practical situation, a person would only rotate his face by up to about 30 degrees in each direction. Hence, this approach sounded very intuitive and might have given better results. It gives us 60 Radon angles, resulting in a final image of 60x44. This was not a significant gain in dimensionality reduction in our case, and hence we decided to stay with the original 32 Radon angles.
Number of DCT coefficients:
Selection of the number of DCT coefficients from each block was another important decision. Too many coefficients would give a large feature vector, and too few a very small one. Since the images are gray level images, the symbol set consisted of the 0-255 gray level values. After feature extraction, it is very likely that the feature vectors lie very close to each other. Hence, to differentiate between the data, a correct selection of coefficients was necessary. Taking 16 coefficients per block gave a final feature vector of 1024 coefficients, resulting in a training set of 20 vectors of 1024 values each. This is a very large number, considering that we need to calculate the distance of the incoming vector from each vector in the training feature set. Hence, 16 coefficients per block was an expensive choice for our existing system. On the other hand, since the system relies on the face alone, taking too few coefficients would also be impractical.
To capture the low frequency components properly, we take 10 DCT coefficients per block in a zigzag manner such that we have the highest magnitude coefficients in our feature set. Coefficients with larger magnitude affect the classification rate more than coefficients with lower magnitude. Taking 10 DCT coefficients per block resulted in a feature vector of 200 values. This seemed a reasonable choice. Another choice was taking 5 DCT coefficients per block, which resulted in a feature vector of 100 values. This again is a reasonable number for our existing system.
The system has also been tested for DCT-only classification. For this purpose, DCT coefficients are extracted from the original 64x64 normalized images, resulting in a feature vector of size 640 for 10 coefficients per block and 320 for 5 coefficients per block.
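The zigzag selection of low frequency coefficients can be illustrated with the sketch below. It is not the project's code: the 8x8 block size, the orthonormal DCT-II scaling and the helper names `dct2`, `zigzag_indices` and `block_features` are assumptions chosen for the example.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def zigzag_indices(n):
    """(row, col) pairs of an n x n block in JPEG-style zigzag order."""
    order = []
    for d in range(2 * n - 1):
        diag = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        # Odd diagonals run top-right to bottom-left, even ones the other way.
        order.extend(diag if d % 2 else diag[::-1])
    return order

def block_features(block, num_coeffs=10):
    """First num_coeffs DCT coefficients of one block in zigzag order."""
    coeffs = dct2(block)
    return [coeffs[r][c] for r, c in zigzag_indices(len(block))[:num_coeffs]]
```

Concatenating `block_features` over all blocks gives the feature vector. With 10 coefficients per block, 20 blocks yield 200 values and 64 blocks (a 64x64 image cut into 8x8 blocks) yield 640 values, matching the sizes quoted above.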
Selection of Classifier (ED results and KNN):
For this system we wanted to choose a classifier that is inexpensive in terms of computation yet yields a good classification rate. The Euclidean distance classifier is computationally very light but puts all the emphasis on the single minimum distance, which results in misclassification when the test image varies only a little from the training images. For the Euclidean distance classifier the recognition rate on the validation data was 97%, whereas on the test data it was 63%. The k-nearest neighbor classifier returns the most frequent class among the k smallest distances from the test sample. So with k-NN, even if the test sample yields its lowest distance to a prototype of a wrong class, the sample is still likely to be classified correctly as long as that class occurs only once among the k neighbors. k-NN is computationally heavier than the Euclidean distance classifier, but it avoids the singularity problem of GMM, which arises from the small database and the large number of features; a number of features small enough to give a non-singular GMM fails to capture the details of the face.
Selection of value of K
A proper selection of the value of k is very important for a k-NN classifier. Too high a value of k will give noisy classification, whereas too low a value might end up considering only the very nearest neighbor, which might or might not be the correct class. We have noticed by experimentation that test samples that are very different from the training data generally produce larger distances than more similar data. Consequently, they are more likely to cause a misclassification. The following diagrams show the effect of k on classification.
FIGURE: k-NN classification for different values of k (radius in increasing order).
Let's say the incoming vector belongs to class RED. As we see, if the vector resembles the training images it will lie very close to the feature vectors of class RED. Taking a smaller radius for k helps in this case, as class RED has the maximum frequency within the circle and the unknown vector is classified as RED. Now we increase the size of the radius, under the assumption that more vectors of the same class should fall inside the circle. However, we see that the prototypes of class GREEN are now equal in number to those of class RED. Hence the classifier falls back on the class indices and classifies the unknown vector as RED or GREEN, whichever has the lower class index. If GREEN is chosen, this is clearly a wrong classification even though the RED prototypes lie very close to the unknown vector. As we increase the size of the radius further, the classification area becomes more and more noisy and might result in more misclassifications than correct classifications.
In our experiments, we noticed that the classification was affected by the value of k in a similar way. We also sometimes noticed that the correct class was within the first 10 nearest neighbors of the input vector, but the classification was still incorrect. This was probably due to the reasons mentioned above.
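The effect of k described above can be reproduced with a small sketch. The function `knn_classify` and the toy data are hypothetical and only illustrate majority voting with ties resolved toward the lowest class index; the numbers are made up and are not taken from the project database.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (class_id, feature_vector) pairs.
    Returns the majority class among the k nearest prototypes;
    ties go to the lowest class index."""
    by_distance = sorted(train, key=lambda item: math.dist(item[1], query))
    votes = Counter(class_id for class_id, _ in by_distance[:k])
    best = max(votes.values())
    return min(c for c, n in votes.items() if n == best)

# Toy data (made up): four class-13 prototypes close to the query,
# six class-10 and five class-4 prototypes further away.
train = ([(13, [0.1 * i, 0.0]) for i in range(1, 5)]
         + [(10, [3.0 + 0.1 * i, 0.0]) for i in range(6)]
         + [(4, [5.0 + 0.1 * i, 0.0]) for i in range(5)])
query = [0.0, 0.0]

small_k = knn_classify(train, query, 5)    # 13: the close prototypes dominate
large_k = knn_classify(train, query, 15)   # 10: the larger, distant class wins
```

With k=5 the four nearby prototypes of class 13 outvote the single class-10 neighbor, while with k=15 every prototype votes and the most populous class wins even though it is further from the query.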
Also, when the test image is very different from the training images, the incoming vector will lie very far from the training vectors (as shown in the figure) and hence will always be misclassified. A larger variation in the training set can help in this situation. However, in our case, due to restrictions in data availability, we could not include many variant images in the training set. Hence, the system gives good classification if the user presents an image close to the training set. Following is an example of test set misclassification due to large values of K.
FIGURE: Classifier log output for the test images ./Database/ashwin/video_ashwin_10.raw and ./Database/ashwin/video_ashwin_23.raw, listing the class of each sorted neighbor. In both logs the sorted values 0 to 3 are at class 13. Ashwin is misclassified as class 10 in spite of being in the top 4 neighbors.
The classification result here shows the classification of class 13 (“Ashwin”) for different values of K. We see that for k=15 this class is misclassified as class 10; for k=5, however, it is classified correctly.
Challenges
Dimensionality
As mentioned before, a large image or feature vector size made the recognition process very slow. Also, since we buffer the images before processing them, a very large image size was constantly leading to buffer or stack overflow. When the stack overflowed, the LCD displayed garbage values. To solve this, we had to reduce the size of the image from the original 240x320 to 64x64 and flush the buffer after every classification result.
For a very large feature vector size, we noticed that the data is noisier and closely scattered. Hence, for large feature vectors the data for the different classes overlapped, resulting in a lot of misclassifications. This was true even for the case of 640 DCT coefficients. Hence, for the DCT-only case, we decided to go with 5 DCT coefficients per block, resulting in a feature vector of 320 coefficients.
Small size of internal RAM and memory issues
Processing time for Radon for different image sizes
Scatter plot of data: RADON, 10 images per class
FIGURE: Scatter plot of original feature set of radon and DCT, 200 coefficients and 10 images per class.
FIGURE: Scatter plot of mean image of radon and DCT of 10 training images and feature set of 200 coefficients per class.
The graphs above show the scatter plot for 10 images per class and the scatter plot for the mean image of each class. We observe that in the scatter plot of 10 images per class the feature vectors overlap heavily. Since our symbol set consisted of only the 0-255 grey level values, this translates to closely lying feature values, which in turn contributes to misclassification. However, in the mean image graph we see that the values are visible and close to being distinct, which in turn means a lower misclassification rate.
Median image
FIGURE: Scatter plot of median image of radon and DCT of 10 training images and feature set of 200 coefficients per class.
A general analysis of the median image has been presented above. During experimentation we noticed that the input was very often being classified as class id 11 (“Shivani”). The probable reason for this is that, when we take the Euclidean distance over all the features of the original feature set, the values for this class lie very close to those of the other classes: the original feature vector is in itself not very distinct and matches a lot of classes in the training set, hence influencing the final decision. As seen from the median graph, the median sufficiently separated the data for class 11, which was highly scattered in the mean image.
Means of radon+DCT
FIGURE: A comparative chart for set of one mean per class for radon and DCT of 10 training images.
Median of radon
FIGURE: A comparative chart for set of one median per class for radon and DCT of 10 training images.
FIGURE: comparative chart for set of one mean per class as compared to median for that class for radon and DCT of 10 training images.
For analysis purposes, we tried projecting the data onto one dimension in terms of means and medians. If we consider the Euclidean distance, then the class that is most likely to be misclassified is the one having its mean value very close to that of another class. The bar graphs above show the mean and median values, so bars at the same level would create confusion at classification because the test vector is at the same distance from each of these feature vectors; hence the misclassification rate would be high. In most cases the mean and median values are similar for the respective classes. However, when the variation between training images is too high, the median presents a better representation of the class values, as it tends to fall among the most frequent values, whereas the mean is a reconstructed mid value for the variation of the training set.
FIGURE: comparative chart for set of 10 means per class for radon and DCT of 10 training images.
FIGURE: comparative chart for set of one mean, median, and median of 10 means per class for radon and DCT of 10 training images.
As mentioned above, the bars shown are either the median or the mean value. Classes at the same level are more likely to be misclassified than classes showing some difference in mean or median values. We have taken the mean of 10 images, the median of 10 images and the median of 10 means for the 10 training images of each class. From the graph above, we notice that the median as a 1D representation of the data is a better metric for Euclidean distance.
We also notice that there are certain examples, like class 11 (“shivani”), class 12 (“neha”) and class 18 (“pranav”), where there is a significant difference between the mean and median images. The reason for this is that the training sets for these classes have a large variance. Since the variance is too high, the mean of the images is increased whereas the median remains unchanged. These classes are more likely to be misclassified if we take the distance from the mean image as a metric.
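The difference between a mean and a median prototype under high training variance can be seen in a small sketch. The helper `class_prototype` and the toy vectors are hypothetical; the values are made up to mimic one strongly deviating training image (for example a heavily zoomed one) among ten.

```python
from statistics import mean, median

def class_prototype(vectors, reducer):
    """Collapse a class's training vectors into a single prototype,
    applying the reducer (mean or median) feature by feature."""
    return [reducer(column) for column in zip(*vectors)]

# Toy class (made up): nine similar training vectors and one outlier.
training = [[10.0, 5.0]] * 9 + [[100.0, 50.0]]

mean_proto = class_prototype(training, mean)      # pulled toward the outlier
median_proto = class_prototype(training, median)  # stays with the majority
```

Here the mean prototype moves to [19.0, 9.5] while the median prototype stays at [10.0, 5.0], so a test vector resembling the nine typical images is closer to the median prototype than to the mean prototype.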
We also notice that Radon and DCT together cannot handle such variations in the image set either. When observed carefully, we see the images differ primarily in zoom and secondarily in illumination. The second condition is handled by the algorithm, but the first still remains a problem.
Classification based on only DCT coefficients.
FIGURE: Scatter plot of 10 images per class for feature vector of 640 DCT coefficients for each image.
As mentioned earlier, the scatter plot of 640 DCT coefficients is too dense. Also, we see a number of coefficients at the same level, which implies less distinctness and more redundancy in the information provided by the DCT coefficients. This highly overlapped data poses a great problem for classification rates.
FIGURE: comparative chart for set of one mean per class for DCT of 10 training images.
FIGURE: comparative chart for set of one mean and one median per class for DCT of 10 training images.
We notice from the above graph that the mean values of the DCT training vectors are quite distinct, and the median and mean values are similar even where the training data is highly variant, as in the above-mentioned cases. Since no two bars are at the same level, this scheme can provide better recognition rates. Both mean and median serve as sufficient metrics for distance calculations. We noticed that class 11 (“shivani”) was quite often being classified as class 6 (“shengakai”) in real-time testing. The probable reason for this is the very small difference between the mean values of the two classes, so that slight variations in the test image lead to misclassification. Hence, this provides a good intuitive explanation for the misclassification results.
FIGURE: comparative chart for set of one median per class for DCT of 10 training images.
Again, as mentioned above, the median of images provides a good distinction in the feature set. Intuitively, the incoming feature vector is more likely to fall near the median values, giving a correct classification, than near the slightly variant images away from the median.
FIGURE: comparative chart for set of one mean, one median and median of means per class for DCT and radon plus DCT of 10 training images.
RADON_DCT VS DCT
For analysis purposes, we plotted the mean of the Radon and DCT feature vector, the median of the Radon and DCT feature vector, the mean of the DCT feature vector, the median of the DCT feature vector, and the medians of means for 10 images in each scenario. We notice that, as expected, the feature set values of the DCT are much higher than those of the Radon feature set. We also observe that the DCT feature set presents more distinctness in the values of the clusters, thereby facilitating correct classification. The Radon transform, on the other hand, had values lying very close together, making the system very sensitive to slight changes.
The reason for the smaller variation in the Radon transform could be that, for each of the 32 rotations, it sums up the values of the columns as line integrals. This in a way suppresses the variations in the image and brings the values close together. This might be a good technique for clustering of images. However, in our case we need large variations in the feature set, as we are not implementing any discriminant algorithm. Radon transform and DCT in conjunction with a discriminant algorithm would provide excellent results, as the data for each class would be closely scattered, making the in-class variation small, while the values for each class would be well separated from other classes, making the inter-class variation large. This is a desirable situation for any face recognition system. Unfortunately, in our case, due to the large covariance matrix computation involved, implementation of any discriminant analysis method was not possible. Hence, in our case, DCT provides better recognition rates than the combination of Radon and DCT.
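The trade-off between in-class and inter-class variation can be made concrete with a simple one-dimensional separability score. This is only an illustrative diagnostic, not something computed in the project: the function `separability`, its Fisher-style ratio and the toy values are assumptions.

```python
from statistics import mean, pvariance

def separability(classes):
    """classes: dict mapping class_id -> list of scalar feature values.
    Returns the variance of the class means (inter-class variation)
    divided by the average within-class variance (in-class variation).
    Larger values mean the classes are easier to tell apart."""
    class_means = [mean(values) for values in classes.values()]
    between = pvariance(class_means)
    within = mean([pvariance(values) for values in classes.values()])
    return between / within

# Toy data (made up): same class centres, different in-class spread.
tight = {1: [0.9, 1.0, 1.1], 2: [4.9, 5.0, 5.1]}
loose = {1: [0.0, 1.0, 2.0], 2: [4.0, 5.0, 6.0]}

score_tight = separability(tight)
score_loose = separability(loose)
```

Both toy sets have the same class centres, but the tightly clustered one scores far higher. A discriminant method aims to raise this kind of ratio by shrinking the in-class spread and widening the gap between class centres at the same time.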
Flow diagram of the code:
VM322K2 video input (16 bit YUV) -> CAM (8 bit Y, 240x320) -> CROP (128x128) -> buffer (3 sec) -> Downsample (64x64) -> Preprocessing (64x64) -> Radon (32x44) -> Block (20) -> 2D-DCT -> 10 coeff per block -> feature_concat[200] -> Send to kNN for classification.
Limitations with REAL TIME RESPONSE
One of the major problems we encountered was not being able to capture the image properly during real time testing. The dataset gave excellent results in offline testing, while for the same pose the real time classification was drastically worse. We learnt that the algorithm was working fine; the image, however, was not being captured properly in the first stage itself. As we were initially taking just one image for classification, we implemented the downsampling function in the interrupt (while) loop. Once a particular number of frames had been collected, the last frame stored in the crop array was processed and downsampled. However, we later learnt that the image in the final frame was most of the time corrupted or overwritten, because the interrupt was still enabled and running while we were performing the recognition task. Hence, the image we used for classification was sometimes blurred or distorted due to interlace effects or the sudden halt of the interrupt while the camera was still storing the image. In the very few cases when the image happened to be captured properly, the classifier gave satisfactory results. To solve this problem, we went back to the old scheme of buffering images and then processing them. To avoid any corrupted frame, we choose a middle frame for classification rather than the first or last frame. Also, we disabled the interrupt every time we went into the recognition step.
The second problem was the camera input. We noticed that the classification was also affected by the choice of camera. One of the camera units gave a low resolution image that was noisy at the edges, while the other camera gave a clear, high resolution image. Since we used the latter camera for collecting the database, we used the same camera for the test stage.
Displaying classification result
The classification result is displayed on the LCD as the name of the person being classified. For this, instead of storing the name messages as images for each person, we stored them as 84x320 2D arrays. This was a reasonable decision because, had we saved them as images, the program would eventually have had to read the messages and store them in arrays anyway, which would have been an extra overhead. This also saved us a considerable amount of loading time. Messages for the initial user instructions were also saved in the system.
Imposter model.
Making an imposter model requires a large amount of data with large variations. Due to constraints and the lack of availability of such large databases, we leave the imposter model as future work, making the current recognition system a “one out of k classes” classifier.
Quantitative Results
The quantitative results in terms of recognition rates are presented below:

Feature set extraction method and classification          Correct recognition rate for test database
32 Radon angles + 10 DCT coefficients/block, k-NN (10)    86%
32 Radon angles + 10 DCT coefficients/block, E.D.         63%
5 DCT coefficients (320) + k-NN (5)                       98%
5 DCT coefficients (320) + k-NN (15)                      74%
10 DCT coefficients (640) + k-NN (10)                     93%
16 DCT coefficients (1024) + E.D.                         60%
We tested the system in real time as well as on the test dataset. The above results show the correct recognition rates for the various combinations of feature selection methods. As expected, the Euclidean distance presents a very low recognition rate for both Radon and DCT based methods. For the 16 DCT coefficient based method the Euclidean distance gives a recognition rate as low as 60%. This is an expected result due to the increased amount of overlap in the data and redundancy in the coefficient values. At the same time, selecting 5 DCT coefficients per block along with a k-NN classifier with k=5 gives a recognition rate as high as 98%. To see the effect of the value of k, we calculated the recognition rate for the same 5 DCT coefficients per block with a k-NN classifier with k=15 nearest neighbors. The recognition rate in this case dropped to 74%. We noticed during classification that for most of the misclassified samples the correct class was present amongst the first 5-7 neighbors. This explains the good recognition rate for k=5 and the low recognition rate for k=15. Hence, choosing an optimum value of k for the k-NN classifier is also a very important issue.
We also noticed that the recognition rates for the test database of 50 images were quite high considering the training set size of 10 images. Also, as shown above, the training and test datasets are very distinct and have a lot of pose variation. The algorithm gives good recognition rates on the test database for offline on-board testing. However, achieving such high recognition rates for real-time recognition is still a problem due to the variance in poses from the training set.
Conclusion
The system is designed to work well with a database containing varying images of each subject. After looking at the real time classification, we believe that face detection is needed prior to feature extraction to take the zoom factor into account. Including a discriminant analysis method would boost the real time recognition rate significantly, as explained earlier. We also noticed that Radon + DCT is a good approach for energy compaction and for decreasing the in-class variance, but it does not contribute towards increasing the inter-class variance and hence depends on a linear discriminant method to give a good recognition rate. In contrast, DCT alone provides good inter-class variance but poor in-class variance and is very sensitive to changes in pose and training set.
Future Work
We would like to implement one of the discriminant analysis methods or a weighting function so that we can create a voting scheme for a set of images. We would also like to test different feature selection techniques like Haar wavelet transforms, WHT and EBGM.