International Journal of Research in Advent Technology, Vol.2, No.8, August 2014 E-ISSN: 2321-9637
Speech Enhancement of Punjabi Language at Phoneme Level using Digital Signal Processing Techniques

Jaismine Jassal1, Manjot Kaur Gill2
1M.Tech. student, Dept. of Computer Science and Engineering, Guru Nanak Dev Engg. College, Ludhiana
2Assistant Professor, Dept. of Information Technology, Guru Nanak Dev Engg. College, Ludhiana
Email: jassal.priya@yahoo.com1, gill.manjot@gmail.com2

Abstract- This paper presents an overview of several of the most commonly used methods for enhancing degraded speech. Common methods such as Spectral Subtraction, the Wiener Filter, the Kalman Filter and the RASTA Filter are explained, together with a Proposed Method that combines features of these methods. Each method uses certain Digital Signal Processing (DSP) techniques; framing, windowing, the DFT (Discrete Fourier Transform), the FFT (Fast Fourier Transform), noise detection and the SNR are the common steps and parameters shared by all of them. The methods are applied to Punjabi phonemes extracted from recorded words.

Keywords- Noise, speech enhancement, phonemes, SNR (Signal to Noise Ratio).
1. INTRODUCTION
Speech signals in real-world scenarios are often corrupted by various types of degradation. The most common degradations are background noise, reverberation and speech from competing speaker(s). Degraded speech is poor in terms of both quality and intelligibility, so it needs to be processed to enhance its perceptual quality and intelligibility. Several methods have been proposed in the literature for this purpose, in which degraded speech is processed in the frequency domain to achieve enhancement. In this work, different types of environmental noise were added to speech, and the results were computed and compared.

This paper provides an overview of some commonly used methods, a comparison between them, and the proposed method. The rest of the paper is organised as follows: Section 2 reviews methods for processing speech degraded by background noise. Section 3 describes the Punjabi language and its phonemes. Section 4 covers the methodology followed. Section 5 presents the comparative results and discussion of the methods applied to the phonemes. Section 6 concludes the paper.

2. ENHANCEMENT OF NOISY SPEECH
Background noise is the most common factor that degrades the quality and intelligibility of speech. The term background noise refers to any unwanted signal that is added to the desired signal. Background noise can be stationary or non-stationary and is assumed to be uncorrelated with, and additive to, the speech signal. Mathematically, speech degraded by background noise can be expressed as the sum of clean speech and background noise (Krishnamoorthy and Prasanna, 2010):

$$ s(n) = x(n) + p(n) \tag{1} $$

where s(n), x(n) and p(n) denote the noisy speech, the clean speech and the background noise respectively. In the frequency domain it can be represented as

$$ S(f) = X(f) + P(f) \tag{2} $$

where f is the index of the frequency bin. The problem of enhancing noisy speech has received considerable attention in the literature, and a variety of methods have been proposed to address it. An overview of each is given below.

2.1. Spectral Subtraction
Spectral Subtraction is a very popular method for enhancing speech that has been degraded by additive noise. It is a spectral amplitude estimation method for restoring signals degraded by additive noise, in which the phase distortion can be ignored because the human ear is assumed to be insensitive to phase (Saeed, 2005). The method restores the signal by subtracting an estimate of the noise spectrum from the noisy signal spectrum (Saeed, 2005). The noise in the degraded speech is estimated from the 'pauses' or 'quiet' periods in the speech signal, when no speech is being uttered and only noise is present. The noise spectrum is then usually updated as more frames of noise or silence appear in the signal. However, since noise is random by nature, the resulting spectrum can become negative when Spectral Subtraction is applied, so negative values must be set to a non-negative value. This clamping can itself distort the signal, but it reduces the distortion caused when the spectrum turns negative. Spectral Subtraction takes place in the frequency domain rather than in the time domain in which the signal is given. The transformation to the frequency domain is usually done with a Discrete Fourier Transform (DFT); here the Fast Fourier Transform (FFT) is used instead. The FFT computes the same result as the DFT, only efficiently, so it is quicker and uses fewer resources, making the system more efficient (Paul, 2009).
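The spectral subtraction steps described above (a noise estimate from quiet frames, magnitude subtraction, clamping of negative values and reuse of the noisy phase) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-overlapping rectangular frames and that the first `noise_frames` frames contain noise only (both hypothetical choices), and it uses a naive DFT in place of an FFT so that only the Python standard library is needed. Per frequency bin it computes |X(f)| = max(|S(f)| - |P(f)|, 0), with |P(f)| averaged over the noise-only frames.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) DFT; an FFT returns the same values, only faster."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(); keeps the real part because the input signal is real."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def spectral_subtraction(noisy, frame_len, noise_frames):
    """Magnitude spectral subtraction over non-overlapping frames (hypothetical helper).

    noisy        -- list of samples; its length is assumed to be a multiple of frame_len
    noise_frames -- number of leading frames assumed to contain noise only
    """
    frames = [noisy[i:i + frame_len] for i in range(0, len(noisy), frame_len)]
    # Average noise magnitude per bin, estimated from the leading 'quiet' frames.
    noise_mag = [0.0] * frame_len
    for frame in frames[:noise_frames]:
        for k, value in enumerate(dft(frame)):
            noise_mag[k] += abs(value) / noise_frames
    enhanced = []
    for frame in frames:
        cleaned = []
        for k, value in enumerate(dft(frame)):
            # Subtract the noise magnitude and clamp negative results to zero.
            mag = max(abs(value) - noise_mag[k], 0.0)
            # Rebuild the bin with the noisy phase (the ear is assumed phase-insensitive).
            cleaned.append(cmath.rect(mag, cmath.phase(value)))
        enhanced.extend(idft(cleaned))
    return enhanced
```

For example, a frame that matches the noise estimate exactly is reduced to silence, while a frame processed against an all-zero noise estimate is returned unchanged.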
2.2. Wiener Filtering Method
The Wiener Filter is an improvement on Spectral Subtraction. In signal processing, the Wiener Filter produces an estimate of a desired or target random process by linear time-invariant filtering of an observed noisy process, assuming known stationary signal and noise spectra and additive noise. It minimizes the mean square error between the estimated random process and the desired process; its goal is to filter out the noise that has corrupted a signal (Paul, 2009).

2.3. Kalman Filtering Method
The next method of improving the signal is Kalman Filtering. The Kalman Filter is an adaptive least-squares-error filter that provides an efficient, recursive computational solution for estimating a signal in the presence of Gaussian noise. It is an algorithm that makes optimal use of imprecise data on a linear (or nearly linear) system with Gaussian errors to continuously update the best estimate of the system's current state (Gannot et al, 1998). Kalman Filter theory is based on a state-space approach, in which a state equation models the dynamics of the signal-generation process and an observation equation models the noisy and distorted observation signal. The method is, however, best suited to reducing white noise, in line with the Kalman assumptions: in deriving the Kalman equations it is normally assumed that the measurement noise (the additive noise observed in the observation vector) is uncorrelated and normally distributed, which implies that the noise is white. Different methods have nevertheless been developed to adapt the Kalman approach to coloured noise (Gannot et al, 1998).
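The state-space formulation above can be made concrete with a scalar example. The sketch below is a minimal illustration under stated assumptions, not the algorithm of Gannot et al. (1998) or of this paper: it models clean speech as a first-order autoregressive process x(n) = a x(n-1) + w(n), observed through equation (1) as s(n) = x(n) + p(n), with white Gaussian w(n) and p(n) whose variances q and r are assumed known. The function name `kalman_ar1` and its parameters are hypothetical. Each step predicts the state, forms the innovation, weights it by the Kalman gain to obtain a correction, and updates the a priori error covariance to the a posteriori one.

```python
def kalman_ar1(observations, a, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for x(n) = a*x(n-1) + w(n), s(n) = x(n) + p(n).

    a      -- AR(1) coefficient of the (hypothetical) speech model, assumed known
    q      -- variance of the process noise w(n)
    r      -- variance of the additive measurement noise p(n)
    x0, p0 -- initial state estimate and its error covariance
    Returns the a posteriori estimates x_hat(n|n).
    """
    x_post, p_post = x0, p0
    estimates = []
    for s in observations:
        # Predict: a priori state estimate and error covariance.
        x_prior = a * x_post
        p_prior = a * a * p_post + q
        # Innovation: the part of the observation the prediction did not explain.
        innovation = s - x_prior
        # The Kalman gain weights the innovation; gain * innovation is the correction.
        gain = p_prior / (p_prior + r)
        x_post = x_prior + gain * innovation
        # Update the a priori covariance to the a posteriori covariance.
        p_post = (1.0 - gain) * p_prior
        estimates.append(x_post)
    return estimates
```

In practice the AR coefficients and noise variances would have to be estimated from the noisy signal, and speech is usually modelled with a higher-order AR process; the scalar case only shows the structure of the recursion. With a = 1, q = 0 and r = 1 the filter reduces to a running average, so fifty observations of 1.0 give an estimate of 50/51.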
2.4. RASTA Method
The next technique is RASTA, i.e. Relative Spectral analysis. RASTA filtering is used to compensate for linear channel distortions and can be applied in either the log-spectral or the cepstral domain, since linear channel distortions appear as an additive constant in both domains. In effect, the filter band-passes the time trajectory of each feature coefficient: the high-pass portion of the equivalent band-pass filter alleviates the effect of the convolutional noise introduced by the channel, while the low-pass portion smooths frame-to-frame spectral changes (Urmila and Vilas, n.d).
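The band-pass behaviour described above can be illustrated with the RASTA transfer function commonly attributed to Hermansky and Morgan, H(z) = 0.1 z^4 (2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.98 z^-1). The paper does not give its filter coefficients, so these values are an assumption (some implementations use a pole of 0.94 instead of 0.98), and `rasta_filter` is a hypothetical name. The sketch applies a causal version, with the z^4 advance dropped so the output is delayed by four frames, to the time trajectory of a single log-spectral or cepstral coefficient. Because the numerator taps sum to zero, a constant offset, which is how a linear channel appears in these domains, is removed after a short transient.

```python
def rasta_filter(trajectory, pole=0.98):
    """Causal RASTA band-pass filter applied along time to one feature coefficient.

    y(n) = pole * y(n-1) + sum_k b[k] * x(n-k), with b = 0.1 * [2, 1, 0, -1, -2].
    The numerator taps sum to zero, so the filter has zero gain at DC.
    """
    b = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    y = 0.0
    for n in range(len(trajectory)):
        # FIR (numerator) part: a weighted difference of the most recent frames (high-pass).
        acc = sum(b[k] * trajectory[n - k] for k in range(len(b)) if n - k >= 0)
        # IIR (denominator) part: leaky integration that smooths frame-to-frame changes.
        y = pole * y + acc
        out.append(y)
    return out
```

For example, adding a constant of 3.0 to every frame of a trajectory (a fixed channel gain in the log domain) changes the filtered output only during the initial transient.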
2.5. The Proposed Method for Speech Enhancement
The Proposed Method uses features of the Wiener and Kalman Filtering methods. The connection is not a simple cascade; the blocks interact. This combination of the Wiener and Kalman approaches can be termed a hybrid approach, used to improve performance even at low SNRs (0-15 dB). The method is designed to enhance speech (phonemes, in our case) degraded by noise, and it contains certain features of the Wiener filter together with some of the parameters and features of the Kalman filtering technique.

The Wiener features include doubling the magnitude and eliminating negative magnitudes: the estimated noise can sometimes be larger than the current signal, which leaves a negative magnitude. This would lead to poor-quality sound, so the values are limited to positive values, which also reduces musical noise. It was also necessary to keep the code flexible so that a range of values could be tested for the different parameters. The features taken from the Kalman filter are the innovation process, the Kalman gain and the recursive update. The Kalman gain acts as a coefficient on the innovation sequence, and their product gives a correction factor that is used to update the initial prediction of the state vector. The final, optimal estimate is the sum of the initial predicted value and the correction factor. Likewise, the a priori error covariance is updated to give the a posteriori error covariance at time n. The SNR was also used along with these features. Tests were conducted using the combination of all these factors to obtain better enhancement than the individual filtering methods discussed above.
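The clamping of negative magnitude estimates described above also appears in a common frequency-domain form of the Wiener filter of Section 2.2. In the sketch below, which illustrates that Wiener-side feature and is not the authors' hybrid, the clean-speech power in each bin is estimated as Px(f) = max(|S(f)|^2 - Pp(f), 0), where Pp(f) is an estimated noise power spectrum assumed to come from noise-only frames as in Spectral Subtraction. The gain H(f) = Px(f) / (Px(f) + Pp(f)) is then applied to the noisy spectrum, keeping its phase. The function names are hypothetical.

```python
def wiener_gains(noisy_power, noise_power):
    """Per-bin Wiener gain H(f) = Px(f) / (Px(f) + Pp(f)).

    noisy_power -- |S(f)|^2 for one frame
    noise_power -- estimated noise power spectrum Pp(f)
    Px(f) is estimated as |S(f)|^2 - Pp(f), clamped at zero when the noise
    estimate exceeds the noisy power (no negative magnitudes).
    """
    gains = []
    for ps, pp in zip(noisy_power, noise_power):
        px = max(ps - pp, 0.0)
        total = px + pp
        gains.append(px / total if total > 0.0 else 0.0)
    return gains


def apply_gains(spectrum, gains):
    """Scale each complex DFT bin of the noisy frame by its real gain (phase unchanged)."""
    return [g * s for g, s in zip(gains, spectrum)]
```

With this estimate every gain lies between 0 and 1, so bins dominated by noise are attenuated towards zero instead of receiving negative magnitudes; for instance, noisy powers [4.0, 1.0, 0.5] against a noise power of 1.0 give gains [0.75, 0.0, 0.0].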
3. PUNJABI LANGUAGE PHONEMES
Phonemes are the smallest segmental units of sound that form contrasts between utterances (Phonemes, n.d). Punjabi has 38 consonants, 10 non-nasal vowels and 10 nasal vowels, as shown in Figure 1 (Vivek and Meenakshi, 2013).

Figure 1. Punjabi Consonants and Vowels

Consonants are further divided into aspirated and non-aspirated consonants (Phonemes, n.d). The aspirated consonants are ( h, B, P, T, J, C, D, K, d, G, Q ), whereas the non-aspirated consonants ( p, b, q, t, s, j, c, h, d, r, V, S, g, l, n, x, v, X ) have a single-character sound. The ten non-nasal vowels are divided into two forms, i.e. independent vowels ( A, Aw, au, aU, ie, eI, AY, a o, AO ) and dependent vowels ( w, i, I, u, U, y, Y, o, O ). There are three nasal symbols ( N, M, ` ) that produce a double sound, and three paireens ( h, v, r ).

4. METHODOLOGY
Step 1. Input: Word-level input is fed into the system; the word is recorded using a microphone.
Step 2. Phoneme extraction: Words are broken into phonemes with the help of Sound Forge 5.0.
Step 3. Add noise: Different types of noise are added. Random noise generated in Matlab (7.12) has the same length as the signal (phoneme); other noises, such as cars, aircraft, household, bells and water, are also added after being truncated to the length of the speech (phoneme).
Step 4. DSP techniques: Techniques such as truncation, digital filtering, blocking into frames, windowing, noise detection, SNR (Signal-to-Noise Ratio) computation and the FFT (Fast Fourier Transform) are applied before the filtering methods.
Step 5. Filtering methods: The methods explained in Section 2 are applied, and the results are computed and compared.
Step 6. Output: Enhanced speech.
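Steps 3 and 4 (truncating the noise and the phoneme to a common length and adding the noise at a chosen SNR) can be sketched as follows. The paper describes the truncation and the addition but does not state how the noise is scaled to reach a given SNR; the scaling below uses the common definition SNR = 10 log10(P_clean / P_noise), with mean power taken over the truncated length, and is an assumption. The function names are hypothetical.

```python
import math


def power(signal):
    """Mean power of a signal."""
    return sum(v * v for v in signal) / len(signal)


def snr_db(clean, noise):
    """SNR in dB: 10*log10(P_clean / P_noise)."""
    return 10.0 * math.log10(power(clean) / power(noise))


def mix_at_snr(clean, noise, target_snr_db):
    """Truncate both signals to a common length, scale the noise to the target SNR, add.

    Returns (noisy, scaled_noise); both have length min(len(clean), len(noise)).
    """
    length = min(len(clean), len(noise))          # 'Len' in the paper's truncation step
    x, p = clean[:length], noise[:length]
    # Choose g so that 10*log10(P_x / (g^2 * P_p)) == target_snr_db.
    g = math.sqrt(power(x) / (power(p) * 10.0 ** (target_snr_db / 10.0)))
    scaled = [g * v for v in p]
    noisy = [xv + pv for xv, pv in zip(x, scaled)]  # s(n) = x(n) + p(n), equation (1)
    return noisy, scaled
```

Calling `mix_at_snr` with target values of 20, 30 and 40 dB reproduces the range of SNR conditions used in the tests; any noise whose mean power is non-zero can be used.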
5. RESULTS AND DISCUSSION
Different types of noise were used at different levels of SNR (Signal to Noise Ratio); in addition, random noise generated in Matlab was tested at different SNR values. Throughout the development of the algorithms, tests were carried out continuously to verify that the filters were operating as required. These tests involved listening to the filtered speech and examining its spectrogram and the graphs of the speech signals that had passed through the filters, which helped track the progress of each filtering method. Once the algorithms were working, they were set up so that the SNR of the signal could be changed, which made it possible to choose an SNR value and run the filters to see how well they performed under different levels of noise. The test set consisted of different speech samples, and each sample was processed at SNR values ranging from 20 dB to 40 dB. Most listening tests were held in a relaxed setting at the PC, using either headphones or speakers.

First, a phoneme is selected; then a noise is chosen and added to it. The phoneme was extracted from the recorded word using Sound Forge 5.0. The lengths of the phoneme and the noise were then made equal by truncation in Matlab, using equation (3):

$$ \mathrm{Len} = \min\big(\mathrm{length}(x),\ \mathrm{length}(p)\big) \tag{3} $$

where Len stores the minimum of the lengths of the clean signal and the noise signal. The noisy signal is then computed with the Matlab addition operator, as shown in equation (4); the index range (1:Len) shortens both the clean and the noise signal to Len samples:

$$ s(n) = x(n) + p(n), \qquad n = 1, 2, \ldots, \mathrm{Len} \tag{4} $$

In the labelling of each figure, SS, WF, RF, KF and PM denote the Spectral Subtraction method, the Wiener Filtering method, the RASTA Filtering method, the Kalman Filtering method and the Proposed Method respectively. The graph of the original signal, i.e. the phoneme (ey), is shown in Figure 2.

Figure 2. Original Signal

5.1. Graph of random noise for each method at different values of SNR
The graphs are plotted in Matlab 7.12 using the function 'plot', with the syntax:

plot(X,Y);    (5)

which creates a 2-D line plot of the data in Y versus the corresponding values in X. X and Y can both be vectors of equal length, both be matrices of equal size, or be one vector and one matrix, with one dimension of the matrix equal to the length of the vector.

5.1.1. Comparison at SNR 20.0 dB
Figure 3. (a) SS (b) WF (c) RF (d) KF (e) PM at 20 dB, showing clean signal (blue), noisy signal (red) and filtered signal (green).

5.1.2. Comparison at SNR 30.0 dB
Figure 4. (a) SS (b) WF (c) RF (d) KF (e) PM at 30 dB, showing clean signal (blue), noisy signal (red) and filtered signal (green).

5.1.3. Comparison at SNR 40.0 dB
Figure 5. (a) SS (b) WF (c) RF (d) KF (e) PM at 40 dB, showing clean signal (blue), noisy signal (red) and filtered signal (green).

5.2. Graph of birds005.wav noise for each method at different values of SNR
The previous graphs clearly show that the Proposed method produces the best result compared with the other filters, and that each filter works better as the SNR increases. Other types of noise were also introduced, truncated to the length of the phoneme. The results for each filter with the birds005.wav noise, at SNR values from 20 dB to 40 dB, are shown below.

5.2.1. Comparison at SNR 20.0 dB
Figure 6. (a) SS (b) WF (c) RF (d) KF (e) PM at 20 dB, showing clean signal (blue), noisy signal (red) and filtered signal (green).

5.2.2. Comparison at SNR 30.0 dB
Figure 7. (a) SS (b) WF (c) RF (d) KF (e) PM at 30 dB, showing clean signal (blue), noisy signal (red) and filtered signal (green).

5.2.3. Comparison at SNR 40.0 dB
Figure 8. (a) SS (b) WF (c) RF (d) KF (e) PM at 40 dB, showing clean signal (blue), noisy signal (red) and filtered signal (green).

5.3. Spectrogram of birds005.wav noise for each method at different values of SNR
Another tool used to compare the filters is the spectrogram, a visual representation of the spectrum of frequencies in a sound or other signal as it varies with time or some other variable. Spectrograms are also known as spectral waterfalls, voiceprints or voicegrams (Spectrogram, n.d). They can be used to identify spoken words phonetically, and they are used extensively in fields such as music, sonar, radar, speech processing and seismology (Spectrogram, n.d).

5.3.1. Comparison at SNR 20.0 dB
Figure 9. (a) Original Signal (b) SS (c) WF (d) RF (e) KF (f) PM at 20 dB

5.3.2. Comparison at SNR 30.0 dB
Figure 10. (a) Original Signal (b) SS (c) WF (d) RF (e) KF (f) PM at 30 dB

5.3.3. Comparison at SNR 40.0 dB
Figure 11. (a) Original Signal (b) SS (c) WF (d) RF (e) KF (f) PM at 40 dB

The spectrograms show more precisely that each filter performs better as the SNR (Signal to Noise Ratio) increases, and that the Proposed method performs better even at the lowest SNR value tested (20 dB), as described in the earlier comparison.

6. CONCLUSION
After studying and comparing all the filtering techniques, it is clear that the proposed method gives better results both for the random noise and for the other recorded noises used in these tests. As discussed in the previous section, even at low SNR values the results of the Proposed method are better than those of the other four filters. Table 1 rates each method from 1 to 5, meaning very poor, poor, bad, good and very good respectively. In Table 1 the five methods are labelled SS, WF, RF, KF and PM, denoting the Spectral Subtraction method, the Wiener Filtering method, the RASTA Filtering method, the Kalman Filtering method and the Proposed Method respectively. The ratings are based on the results computed and the comparisons shown in the previous section.

Table 1: Rating for each method based on the testing results

| Noise type       | SS | WF | RF | KF | PM |
|------------------|----|----|----|----|----|
| Randn            | 2  | 4  | 4  | 3  | 5  |
| cars002.wav      | 2  | 4  | 4  | 3  | 5  |
| household018.wav | 3  | 3  | 4  | 3  | 5  |
| aircraft003.wav  | 3  | 4  | 4  | 2  | 5  |
| animals006.wav   | 2  | 3  | 4  | 2  | 4  |
| birds005.wav     | 2  | 3  | 3  | 2  | 4  |

REFERENCES
[1] P. Krishnamoorthy, S. R. Mahadeva Prasanna (2010), Temporal and Spectral Processing Methods for Processing of Degraded Speech: A Review.
[2] Paul Coffey (2009), Enhancement of Speech in Noisy Condition, Project Report, B.E. Electronic Engineering, National University of Ireland.
[3] Phonemes (n.d), Available from: https://www.princeton.edu/~achaney/tmve/wiki100k/docs/Phoneme.html
[4] S. Gannot, D. Burshtein, E. Weinstein (1998), Iterative and Sequential Kalman Filter-Based Speech Enhancement Algorithms, IEEE Transactions on Speech and Audio Processing, vol. 6, no. 4, pp. 373-385.
[5] Saeed V. Vaseghi (2005), Advanced Digital Signal Processing and Noise Reduction, Third edition.
[6] Spectrogram (n.d), Available from: en.wikipedia.org/wiki/Spectrogram
[7] Urmila Shrawankar, Vilas Thakare (n.d), Techniques for Feature Extraction in Speech Recognition System: A Comparative Study, Available from: arxiv.org/ftp/arxiv/papers/1305/1305.1145.pdf
[8] Vivek Sharma, Meenakshi Sharma (2013), A Quantitative Study of the Automatic Speech Recognition Technique, International Journal of Advances in Science and Technology, vol. 1, issue 1.