

Poster Paper Proc. of Int. Colloquiums on Computer Electronics Electrical Mechanical and Civil 2011

Application of Discrete Wavelet Transforms and Artificial Neural Networks in Recognizing Spoken Digits

Sonia Sunny 1, David Peter S 2, K. Poulose Jacob 3

1 Dept. of Computer Science, Cochin University of Science & Technology, Kochi, India. Email: sonia.deepak@yahoo.co.in
2 School of Engineering, 3 Dept. of Computer Science, Cochin University of Science & Technology, Kochi, India. Email: {davidpeter, kpj}@cusat.ac.in

parameters, and decisions are made based on some kind of minimum-distortion rule. Of these two stages, feature extraction is key, because better features improve the recognition rate. The paper is organized as follows. The following section gives a brief review of the feature extraction stage, followed by the concepts of discrete wavelet transforms. The classification stage and a description of artificial neural networks are explained in the next section. The subsequent section presents the spoken digits database. Section VI summarizes the experiments, Section VII explores the results obtained, and conclusions are given in the last section.

Abstract— This paper discusses the scope of speech recognition techniques in recognizing spoken digits. Speech recognition is a fascinating application of Digital Signal Processing offering unparalleled opportunities. Digits in Malayalam, one of the South Indian languages, are used to create the database. Features are extracted using Discrete Wavelet Transforms (DWT). Training, testing, and pattern recognition are performed using Artificial Neural Networks (ANN). The recognition rate for the spoken digits is 85.19%. The experimental results show the efficiency of this hybrid approach.

Index Terms— Discrete Wavelet Transforms, Artificial Neural Networks, Feature Extraction, Speech Recognition, Multi-Layer Perceptron.

III. FEATURE EXTRACTION STAGE
Feature extraction is the process of extracting the relevant features from the input signals for further processing. The extracted feature vectors contain only the information about the given utterance that is essential for its correct recognition, and the technique selected plays a vital role in the speech recognition rate. Researchers have experimented with many different types of features for speech recognition. Most speech-based studies rely on Fourier Transforms (FT), Short-Time Fourier Transforms (STFT), Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), or prosodic parameters. The literature reveals that with these parameters, the feature vector dimensions and computational complexity are considerably higher. Moreover, many of these methods assume the signal is stationary within a given time frame, which makes it difficult to analyze localized events correctly. The computational complexity can be successfully reduced using wavelet transforms, since the feature vector is much smaller than with other methods.

I. INTRODUCTION
Speech recognition is a very difficult task because of the differences in the way people speak. Human speech is parameterized over many variables, such as amplitude, pitch, and phonetic emphasis, that vary from speaker to speaker. The ultimate aim of research on Automatic Speech Recognition (ASR) is to make machines understand spoken sounds and words and convert them to text. It is one of the intensive areas of research [1]. Speech recognition is widely gaining attention because it allows natural interaction between computers and human beings without the use of a keyboard. Many parameters affect the accuracy of a speech recognition system. Automatic recognition of spoken digits is one of the challenging tasks in the field of speech recognition [2]. Spoken digit recognition is needed in many applications that take numbers as input, such as automated banking systems, airline reservations, voice-dialing telephones, and automatic data entry [3]. Digits in Malayalam, one of the four major Dravidian languages of southern India, are chosen for recognition.

A. Discrete Wavelet Transform
The wavelet transform is a multi-resolution, multi-scale analysis that has been shown to be very well suited for speech processing. The DWT is a special case of the wavelet transform that provides a compact representation of a signal in time and frequency and can be computed efficiently. The Discrete Wavelet Transform is defined by the following equation:

II. SYSTEM OVERVIEW
The speech recognition process can be separated into different modules; here we divide it into two stages. The front-end processing is the feature extraction stage, in which short-time temporal or spectral parameters of the speech signal are extracted. The second is the classification stage, in which the derived parameters are compared with stored reference

© 2011 ACEEE  DOI: 02.CEMC.2011.01.570

W(j, k) = Σ_n x(n) 2^(-j/2) ψ(2^(-j) n − k)        (1)


A. Artificial Neural Networks
Artificial neural networks have been investigated for many years in the hope that speech recognition can be performed the way humans do it. In this work, we use the Multi-Layer Perceptron (MLP), a feed-forward network trained with the back-propagation algorithm (FFBP). The MLP consists of an input layer, one or more hidden layers, and an output layer. The input is presented to the network and propagates through the weights and nonlinear activation functions toward the output layer, and the error is corrected in the backward direction using the error back-propagation algorithm. After extensive training, the network eventually establishes the input-output relationship through its adjusted weights. The trained network is then tested with the test dataset.
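The MLP training loop described above can be sketched in pure Python. This is a minimal illustration, not the paper's implementation: the layer sizes here (2 inputs, 3 hidden units, 1 output) and the learning rate are arbitrary choices for clarity, whereas the paper's network uses 12-dimensional inputs and 5 hidden units.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class MLP:
    """Tiny feed-forward network trained with back-propagation
    (squared-error loss, sigmoid activations)."""

    def __init__(self, n_in, n_hid, seed=0):
        rnd = random.Random(seed)
        # one weight row per hidden unit; the extra weight is the bias
        self.w1 = [[rnd.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.w2 = [rnd.uniform(-1, 1) for _ in range(n_hid + 1)]

    def forward(self, x):
        xb = x + [1.0]                       # append bias input
        self.h = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w1]
        self.y = sigmoid(sum(w * v for w, v in zip(self.w2, self.h + [1.0])))
        return self.y

    def backprop(self, x, target, lr=0.5):
        """One forward pass plus one gradient-descent weight update."""
        y = self.forward(x)
        d_out = (y - target) * y * (1 - y)   # output-layer delta
        xb = x + [1.0]
        for j, hj in enumerate(self.h):      # hidden-layer deltas
            d_hid = d_out * self.w2[j] * hj * (1 - hj)
            for i, xi in enumerate(xb):
                self.w1[j][i] -= lr * d_hid * xi
        for j, hj in enumerate(self.h + [1.0]):
            self.w2[j] -= lr * d_out * hj
        return 0.5 * (y - target) ** 2       # current sample loss
```

Repeatedly calling `backprop` on the training set drives the loss down, which is the "extensive training" step the text describes; in practice, a digit recognizer would use one output unit per digit class.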

where ψ(t) is the basic analyzing function called the mother wavelet. The functions with different regions of support used in the transformation are derived from the mother wavelet. The DWT yields a time-scale representation of the signal by means of digital filtering: the original signal passes through two complementary filters, a low-pass and a high-pass filter. In speech signals, the low-frequency components, known as the approximation coefficients (from the low-pass filter h[n]), are of greater importance than the high-frequency components, known as the detail coefficients (from the high-pass filter g[n]), because the low frequencies characterize a signal more than its high frequencies [4]. The wavelet decomposition tree is shown in Figure 1.

V. DIGITS DATABASE FOR MALAYALAM
For this experiment, a spoken digit database is created for the Malayalam language using 14 speakers: six male and eight female. The samples stored in the database are recorded with a high-quality studio microphone at a sampling rate of 8 kHz (band-limited to 4 kHz). Recognition covers the ten Malayalam digits from 0 to 9 under the same configuration. The database consists of a total of 140 utterances of the digits. The spoken digits are preprocessed, numbered, and stored in the appropriate classes in the database. The spoken digits and their International Phonetic Alphabet (IPA) transcriptions are shown in Table I.

Figure 1. Wavelet decomposition tree

The low-frequency sequence of the first level is taken as the input to the second stage. The discrete-time signal is subjected to successive low-pass and high-pass filtering to obtain the DWT [5]; this is known as the Mallat algorithm [6]. At each decomposition level, the half-band filters produce signals spanning only half the frequency band. The filtering and decimation process continues until the desired level is reached. The main advantage of the wavelet transform is that it has a varying window size, broad at low frequencies and narrow at high frequencies, leading to optimal time-frequency resolution in all frequency ranges [7]. The successive high-pass and low-pass filtering of the signal is given by the following equations:

y_high[k] = Σ_n x[n] g[2k − n]        (2)
y_low[k] = Σ_n x[n] h[2k − n]         (3)
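The filter-and-decimate steps above can be sketched in a few lines of pure Python. Haar filters are used here for brevity (the paper's experiment uses db4), and a simplified index convention is adopted for the convolution; this is an illustrative sketch of the Mallat analysis filter bank, not a production DWT.

```python
import math

# Haar analysis filters: h[n] (low-pass -> approximation),
# g[n] (high-pass -> detail). The paper uses db4 instead.
H = [1 / math.sqrt(2), 1 / math.sqrt(2)]
G = [1 / math.sqrt(2), -1 / math.sqrt(2)]

def filter_downsample(x, taps):
    """Convolve with the filter taps, then keep every second sample
    (the half-band filtering and decimation of equations (2)-(3))."""
    return [sum(taps[n] * x[k + n] for n in range(len(taps)))
            for k in range(0, len(x) - len(taps) + 1, 2)]

def analysis_step(x):
    """One decomposition level: approximation and detail coefficients."""
    return filter_downsample(x, H), filter_downsample(x, G)

def wavedec(x, levels):
    """Successive decomposition: each level refilters the approximation
    branch, as in the wavelet decomposition tree of Figure 1."""
    details = []
    approx = x
    for _ in range(levels):
        approx, d = analysis_step(approx)
        details.append(d)
    return approx, details
```

A constant signal passes entirely through the low-pass branch: its detail coefficients are zero at every level, which is exactly the "low frequencies characterize the signal" behavior the text relies on.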

TABLE I. NUMBERS STORED IN THE DATABASE AND THEIR IPA FORMAT


where y_high[k] and y_low[k] are the outputs of the high-pass and low-pass filters, obtained by subsampling by 2.

IV. THE CLASSIFICATION STAGE
Speech recognition is basically a pattern recognition problem, and pattern recognition is becoming increasingly important in the age of automation and information handling and retrieval. Since neural networks are good at pattern recognition, many early researchers applied them to speech pattern recognition. In this study, too, we use a neural network as the classifier. Artificial neural networks are well suited for speech recognition because of their fault tolerance and non-linear properties. An ANN is an adaptive system that changes its structure based on external or internal information that flows through the network [8].

VI. EXPERIMENT
The Daubechies 4 (db4) mother wavelet is used for feature extraction. Daubechies wavelets are among the most popular wavelets and form a foundation of wavelet signal processing. The speech samples in the database are successively decomposed into approximation and detail coefficients. The approximation coefficients from the eighth level are used to create the feature vector for each spoken digit; twelve approximation coefficients are obtained at this level. The resulting feature vectors are given to an artificial neural network for classification. The database is divided into two sets, one for training and one for testing: ten samples were taken for training and four for testing. The same feature extraction procedure is used to create the test data, again yielding feature vectors of size twelve. The number of hidden-layer units chosen was five.
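The eight-level feature extraction can be made concrete with a small sketch. The paper does not state the frame length, so the 3072-sample input below is a hypothetical choice that makes the arithmetic exact (3072 / 2^8 = 12); Haar averaging also stands in for db4, whose longer filters would change the coefficient counts slightly at the boundaries.

```python
import math

def haar_approx(x):
    """One low-pass/downsample step (Haar averaging; the paper uses db4)."""
    s = 1 / math.sqrt(2)
    return [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]

def feature_vector(signal, levels=8):
    """Keep only the approximation branch for `levels` stages, as the
    experiment does; each stage halves the number of coefficients."""
    coeffs = signal
    for _ in range(levels):
        coeffs = haar_approx(coeffs)
    return coeffs
```

The twelve coefficients returned for each utterance would then be fed to the MLP as its input vector, matching the network dimensions reported in the experiment.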

VIII. CONCLUSION
In this paper, an automatic speech recognition system for spoken Malayalam digits is designed using wavelet transforms and neural networks. The study yields good identification performance with high recognition accuracy. The computational complexity and the feature vector size are greatly reduced by using discrete wavelet transforms, which makes the wavelet transform an elegant tool for analyzing non-stationary signals like speech. This experiment used a limited number of samples; the recognition rate can be increased by increasing the number of samples. The experimental results show that this hybrid architecture of discrete wavelet transforms and neural networks can effectively extract features from the speech signal for automatic speech recognition.

VII. RESULTS
Using the MLP network, the classifier successfully recognizes the spoken digits. After testing, the accuracy for each spoken digit is obtained; the overall recognition accuracy is 85.19%. The decomposition levels of spoken digit 1 and the graph of its feature vector are shown in Figures 2 and 3.

REFERENCES
[1] L. Rabiner, B. H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, NJ, 1993.
[2] Y. Ajami Alotaibi, "Investigating Spoken Arabic Digits in Speech Recognition Setting", Information Sciences, Vol. 173, pp. 115-139, June 2005.
[3] C. Kurian, K. Balakrishnan, "Speech Recognition of Malayalam Numbers", World Congress on Nature and Biologically Inspired Computing, pp. 1475-1479, December 2009.
[4] S. Kadambe, P. Srinivasan, "Application of Adaptive Wavelets for Speech", Optical Engineering, Vol. 33(7), pp. 2204-2211, July 1994.
[5] G. K. Kharate, A. A. Ghatol, P. P. Rege, "Selection of Mother Wavelet for Image Compression on Basis of Image", IEEE ICSCN 2007, India, pp. 281-285, February 2007.
[6] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, pp. 674-693, 1989.
[7] Elif Derya Ubeyli, "Combined Neural Network Model Employing Wavelet Coefficients for ECG Signals Classification", Digital Signal Processing, Vol. 19, pp. 297-308, 2009.
[8] V. R. Vimal Krishnan, A. Jayakumar, P. Babu Anto, "Speech Recognition of Isolated Malayalam Words Using Wavelet Features and Artificial Neural Network", 4th IEEE International Symposium on Electronic Design, Test and Applications, pp. 240-243, 2008.

Figure 2. Different decomposition levels of spoken digit 1

Figure 3. Feature vector for Malayalam digit



