Imperial Journal of Interdisciplinary Research (IJIR) Vol-3, Issue-2, 2017 ISSN: 2454-1362, http://www.onlinejournal.in
Study & Analysis of Speech Recognition Techniques
Indu Bala1 & Dr. Kailash Bahl2
1Research Scholar, Computer Science and Engineering, PIET, Patiala, Punjab, India
2Professor, Computer Science and Engineering, PIET, Patiala, Punjab, India
Abstract: This paper describes a study and analysis of the Dynamic Time Warp (DTW) and Hidden Markov Model (HMM) approaches to isolated speech recognition, vector quantization for the HMM and DTW techniques, an analysis of the speech recognition techniques on the basis of the accuracy of different models, and a comparison of the HMM and DTW techniques on the basis of different parameters.
Keywords: HMM, DTW, vector quantization, asymmetrical DTW algorithm, symmetrical DTW algorithm
1. Introduction
Speech is one of the most important tools for communication between humans and their environment. Speech recognition has made it possible for machines to understand human languages. With the increase in the power and resources of computer technology, building natural-sounding synthetic voices has progressed from a knowledge-based activity to a data-based one. Rather than handcrafting each phonetic unit and its applicable contexts, high-quality synthetic voices may be built from sufficiently diverse single-speaker databases of natural speech. Speech recognition is nowadays regarded by market projections as one of the more promising technologies of the future. A wide variety of approaches have been proposed to recognize isolated words, based on standard statistical pattern recognition techniques. The most popular technique in the 1980s was the template-based recognizer approach, which uses Dynamic Programming (DP) as the method for comparing patterns. Research in the field of speech recognition has produced many algorithmic procedures. A review of the past literature suggests that HMM and DTW are the most efficient techniques for automatic speech recognition (ASR).
2. Dynamic Time Warp
In order to understand DTW, two concepts need to be dealt with. Features: the information in each signal must be represented in some manner. Distances: some form of metric must be used in order to obtain a match path. There are two types of distances: the computational difference between a feature of one signal and a feature of the other is called the local distance, and the overall computational difference between an entire signal and another signal of possibly different length is called the global distance. For linear predictive analysis, the feature vector consists of the prediction coefficients (or a transformation of them). Since the feature vector may have multiple elements, some means of calculating the local distance is required. The distance between two feature vectors is calculated using the Euclidean distance metric.
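As a minimal sketch of this local distance computation (assuming feature vectors are numpy arrays of LPC coefficients; the function name is illustrative, not taken from the original system):

import numpy as np

def local_distance(x, y):
    # Euclidean distance between two feature vectors
    # (e.g. vectors of LPC prediction coefficients).
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))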
2.1.1 Symmetrical DTW Algorithm
Speech is a time-dependent process. Several utterances of the same word are likely to have different durations, and utterances of the same word with the same duration will differ in the middle, due to different parts of the word being spoken at different rates. To obtain a global distance between two speech patterns, a time alignment must be performed. The best matching template is the one for which there is the lowest-distance path aligning the input pattern to the template. A simple global distance score for a path is simply the sum of the local distances that make up the path. All possible paths could be evaluated, but this is extremely inefficient, as the number of possible paths is exponential in the length of the input. Instead, consider what constraints can be imposed on the matching process:
1. Matching paths cannot go backwards in time;
2. Every frame in the input must be used in a matching path;
3. Local distance scores are combined by adding, to give a global distance.
Every frame in the template and the input must be used in a matching path. This means that if a path passes through point (i, j) in the time-time matrix (where i indexes the input pattern frame and j the template frame), then the previous point must have been (i-1, j-1), (i-1, j) or (i, j-1). The key idea in dynamic programming is that, at point (i, j), we simply continue with the lowest-distance path from (i-1, j-1), (i-1, j) or (i, j-1). This algorithm is known as Dynamic Programming (DP). When applied to template-based speech recognition, it is often referred to as Dynamic Time Warp (DTW). DP is guaranteed to find the lowest-distance path through the matrix while minimizing the amount of computation. The DP algorithm operates in a time-synchronous manner: each column of the time-time matrix is considered in succession (equivalent to processing the input frame by frame) so that, for a template of length N, the maximum number of paths being considered at any time is N. If D(i, j) is the global distance up to (i, j) and the local distance at (i, j) is given by d(i, j), then

D(i, j) = min[D(i-1, j-1), D(i-1, j), D(i, j-1)] + d(i, j).

Given that D(1, 1) = d(1, 1), this is the basis for an efficient recursive algorithm for computing D(i, j). The final global distance D(n, N) gives the overall matching score of the template with the input. The input word is then recognized as the word corresponding to the template with the lowest matching score.
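The recursion above can be sketched directly in code. The following is a minimal illustration (numpy-based; function and variable names are illustrative), assuming each pattern is a sequence of feature vectors:

import numpy as np

def dtw_symmetric(inp, template):
    # Global distance via D(i, j) = min(D(i-1, j-1), D(i-1, j), D(i, j-1)) + d(i, j),
    # with the boundary chosen so that D(1, 1) = d(1, 1).
    n, N = len(inp), len(template)
    D = np.full((n + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, N + 1):
            d = np.linalg.norm(inp[i - 1] - template[j - 1])  # Euclidean local distance
            D[i, j] = min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]) + d
    return D[n, N]

The input word would then be recognized as the template yielding the lowest returned score.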
2.1.2 Asymmetrical DTW Algorithm
Although the basic DP algorithm has the benefit of symmetry (i.e. all frames in both the input and the reference must be used), it has the side effect of penalizing horizontal and vertical transitions relative to diagonal ones. One way to avoid this effect is to double the contribution of d(i, j) when a diagonal step is taken; this has the effect of charging no penalty for moving horizontally or vertically rather than diagonally. Independent penalties dh and dv can then be applied to horizontal and vertical moves, and the global cost up to (i, j) is given by

D(i, j) = min[D(i-1, j-1) + 2d(i, j), D(i-1, j) + d(i, j) + dh, D(i, j-1) + d(i, j) + dv].

This approach will favor shorter templates over longer templates, so a further refinement is to normalize the final distance score by the template length to redress the balance. Alternatively, the allowable path transitions can be restricted to:
1. (i-1, j-2) to (i, j) – extended diagonal (skips a template frame);
2. (i-1, j-1) to (i, j) – standard diagonal;
3. (i-1, j) to (i, j) – horizontal (duplicates a template frame).
In this case, each frame of the input pattern is used once and only once. This means that template-length normalization can be dispensed with, and it is not required to add the local distance in twice for diagonal path transitions. This approach is referred to as asymmetric dynamic programming; a code sketch of this restricted-transition scheme follows.
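A minimal sketch of the restricted-transition (asymmetric) recursion, under the same assumptions as the symmetric sketch above (names are illustrative):

import numpy as np

def dtw_asymmetric(inp, template):
    # Each input frame i is consumed exactly once; from row i-1 the path may
    # skip a template frame, move diagonally, or repeat the same template frame.
    n, N = len(inp), len(template)
    D = np.full((n + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, N + 1):
            d = np.linalg.norm(inp[i - 1] - template[j - 1])
            skip = D[i - 1, j - 2] if j >= 2 else np.inf  # extended diagonal
            D[i, j] = min(skip, D[i - 1, j - 1], D[i - 1, j]) + d
    return D[n, N]  # no template-length normalization needed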
3. Comparison between Asymmetrical and Symmetrical DTW Algorithms
The question now arises as to which of the above algorithms is superior. The main parameter for the comparison between the two algorithms is complexity.
3.1 Complexity
Complexity is the combination of two things: space complexity and time complexity. Space complexity is determined by the space the algorithm occupies in computer memory, whereas time complexity is determined by the time required to execute the algorithm. The overall complexity of the symmetric algorithm is O((n + |E|) log n), where n is the number of nodes, and the overall complexity of the asymmetric algorithm is O(|E| log |E|), where E is the set of edges in the graph G. The time complexity of the asymmetric algorithm is better than that of the symmetric algorithm of Dynamic Time Warp, because the time complexity depends on the number of comparisons needed to find the shortest path between the source and the destination, and in the asymmetric algorithm the number of comparisons is less than in the symmetric algorithm.
4. Hidden Markov Model
The Hidden Markov Model approach is a widely used statistical method for characterizing the spectral properties of the frames of a pattern. The underlying assumption of the HMM is that the speech signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined (estimated) in a precise and well-defined manner. A Hidden Markov Model is a collection of states connected by transitions. Each transition carries two sets of probabilities: a transition probability, which provides the probability of taking the transition, and an output probability, which defines the conditional probability of emitting each output symbol from a finite alphabet, given that the transition is taken.
Hidden Markov Model Algorithm:

def shortest_paths(v, cost, n):
    # dist[j] is set to the length of the shortest path from vertex v to
    # vertex j in a digraph G with n vertices; dist[v] is set to zero.
    # G is represented by its cost adjacency matrix cost[i][j]
    # (float('inf') where no edge exists).
    S = [False] * n                       # initialize S
    dist = [cost[v][j] for j in range(n)]
    S[v] = True
    dist[v] = 0.0
    for _ in range(n - 1):                # determine n-1 paths from v
        # choose u from among the vertices not in S such that dist[u] is minimum
        u = min((j for j in range(n) if not S[j]), key=lambda j: dist[j])
        S[u] = True
        for w in range(n):                # update distances
            if not S[w] and dist[w] > dist[u] + cost[u][w]:
                dist[w] = dist[u] + cost[u][w]
    return dist

A Hidden Markov Model is characterized by the following:
1. The set of states of the model;
2. The number of distinct observation symbols per state;
3. A set of transition probabilities;
4. The output probability matrix;
5. The initial state distribution.
It is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (it is hidden) but can be observed only through another set of stochastic processes that produce the sequence of observations.
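To make these parameter sets concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of scoring an observation sequence against a discrete-output HMM defined by a transition matrix A, output probability matrix B, and initial distribution pi, using the standard forward algorithm:

import numpy as np

def forward_probability(A, B, pi, observations):
    # Forward algorithm: P(O | model) for a discrete-symbol HMM.
    # A:  (N, N) state transition probabilities
    # B:  (N, M) output probabilities over M distinct observation symbols
    # pi: (N,)   initial state distribution
    # observations: sequence of symbol indices (e.g. vector quantizer codes)
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())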
4.1 Vector Quantization
Vector quantization is an important step in the speech recognition system. In this quantization method, multi-valued tuples (vectors) are grouped together and expressed by a single code. It is used for waveform time series and characteristic patterns of speech (vector representations of LPC parameters, etc.), and since a low-bit-rate representation of the vectors is possible, it serves as a method for reducing the amount of data. The HMM is built upon the sequence of symbols, which are the codebook indices coming from the vector quantizer. Thus, even if the feature vector is changed, the HMM implementation remains unchanged.
Codebook and its Size Selection
A vector quantization codebook of size 256 is used. The calculations were performed with different codebook sizes (8, 16, 32, 64, 128, 256). It is apparent from the calculation results that as the size of the codebook increases, the accuracy of the recognition system increases. Therefore, a codebook size of 256 is used for the comparison between the DTW and HMM techniques.
Figure 2.1: Dependence of recognizer accuracy on codebook size
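As a minimal sketch of the quantization step (assuming a codebook already trained, e.g. with k-means or the LBG algorithm; names are illustrative):

import numpy as np

def quantize(vectors, codebook):
    # Map each feature vector to the index of its nearest codeword,
    # producing the discrete symbol sequence consumed by the HMM.
    vectors, codebook = np.asarray(vectors), np.asarray(codebook)
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)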
Number of States
The recognition rate improved with an increase in the number of states of the HMM. Experiments were performed using up to seven states, and the recognition rate was highest with seven states among the experiments performed. Since there are other parameters to be tuned for each change in the number of states, experiments with a larger number of states were not performed.
Figure 2.2: Effect of the number of HMM states on recognition accuracy (%)
Initial Estimates
Training starts with initial estimates for the model parameters and iteratively re-estimates these parameters to optimally train the HMM. There is no theoretical way either to determine or to ensure that the initial estimates will give the optimum final trained parameters. From the experimental results, it is evident that any random initial estimates (subject to the statistical constraints) of the state distribution and the transition probabilities aij result in correct parameters after a few iterations. However, the same is not true for the symbol probabilities. A good initial estimate for the symbol probabilities gives better speech recognition performance. Such an estimate can be achieved by manually segmenting the observation sequence into states. However, a uniform probability distribution over all the observation symbols is a good alternative for the initial estimates.
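A minimal sketch of such uniform initial estimates (illustrative only; the statistical constraint is simply that each row is a valid probability distribution):

import numpy as np

def uniform_initial_estimates(num_states, num_symbols):
    # Uniform starting guesses for the HMM parameters prior to re-estimation.
    A = np.full((num_states, num_states), 1.0 / num_states)    # transitions
    B = np.full((num_states, num_symbols), 1.0 / num_symbols)  # symbol probabilities
    pi = np.full(num_states, 1.0 / num_states)                 # initial distribution
    return A, B, pi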
4.2 Analysis of Speech Recognition Techniques
Both the Dynamic Time Warp approach and the Hidden Markov Model approach for isolated word recognition of a small vocabulary were implemented. In the DTW approach, the time warping technique is combined with linear predictive coding analysis. In the HMM approach, the well-known techniques of vector quantization and hidden Markov modeling are combined with linear predictive coding analysis. This is done in the framework of a standard statistical pattern recognition model. To provide some basis for comparing the performance of the DTW recognizer with the HMM recognizer, the same speech data was tested on both recognizers. The recognition accuracies of the DTW recognizer and the HMM recognizer are given in Table 2.3.

Table 2.3: Performance of the DTW and HMM isolated word recognition systems

Word       | DTW accuracy (%) | HMM accuracy (%)
Hindi      | 96.0             | 92.0
English    | 96.0             | 76.0
Physics    | 100              | 96.0
Chemistry  | 100              | 100
Computer   | 80.0             | 96.0
Average    | 94.4             | 92.0

Figure 2.4: Graphical representation of the accuracy of DTW vs. HMM

A direct comparison of the results of the HMM-based recognizer with those of the DTW-based recognizer, with a single template for each vocabulary word, shows that the HMM-based recognizer performs only a little worse than the DTW-based recognizer. The results also show that when the DTW recognizer has incorrectly identified a word, the HMM recognizer has most of the time identified it correctly (see the results for the utterances English and Computer). The systems were trained with one utterance per word from a single speaker. These results are for a typical speaker; hence there is considerable variation in the recognition accuracies of different words. Still, an inference can be drawn about the overall behavior of these approaches.
5. Conclusion
The fact that the performance of the HMM recognizer is somewhat worse than that of the DTW-based recognizer appears to be primarily due to the insufficiency of the HMM training data. One consequence of this inadequacy was the need to apply constraints to parameters whose values fell below a threshold value. With the increase in the size of the codebook, the accuracy of the HMM-based recognizer improves. The performance of the HMM recognizer also depends on the number of states of the model; the number of states should be such that they can model the word. The time and space complexity of the HMM approach is less than that of the DTW approach, because during HMM testing only the probability of each model producing the observed sequence needs to be computed, whereas in DTW testing the distance of the input pattern from every reference pattern is computed, which is computationally more expensive.