Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models
Abstract: This paper presents a novel decoding framework for acoustic models (AMs) based on end-to-end neural networks (e.g., connectionist temporal classification). The end-to-end training of AMs has recently demonstrated high accuracy and efficiency in automatic speech recognition (ASR). When using the trained AM in decoding, although a language model (LM) is implicitly involved in such an end-to-end AM, it is still essential to integrate an external LM trained on a large text corpus to achieve the best results. Although it lacks theoretical justification, most studies empirically adopt a naive interpolation of the end-to-end AM score and the external LM score. In this paper, we propose a more theoretically sound decoding framework derived from maximizing the posterior probability of a word sequence given an observation. As a consequence of the theory, a subword LM is newly introduced to seamlessly integrate the external LM score with the end-to-end AM score. Our proposed method can be realized by a small modification of the conventional weighted finite-state transducer (WFST)-based implementation, without heavily increasing the graph size. We tested the proposed decoding framework in ASR experiments on the Wall Street Journal corpus and the Corpus of Spontaneous Japanese. The results showed that the proposed framework achieved significant and consistent improvements over the conventional interpolation-based decoding framework.
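To make the contrast with the interpolation-based baseline concrete, the following is a minimal sketch of how a MAP-based decoding score of this kind can be derived. It assumes the end-to-end AM outputs a posterior $P(S \mid X)$ over the subword sequence $S$ corresponding to a word sequence $W$; the weights $\lambda_1$ and $\lambda_2$ are illustrative tuning parameters, and the exact factorization used in the paper body may differ. Under these assumptions, MAP decoding starts from
\[
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X) = \operatorname*{arg\,max}_{W} P(X \mid W)\,P(W),
\]
and applying Bayes' rule at the subword level, $P(X \mid W) \propto P(S \mid X) / P(S)$ up to a term independent of $W$, yields the decoding score
\[
\hat{W} = \operatorname*{arg\,max}_{W} \; \log P(S \mid X) - \lambda_1 \log P_{\mathrm{sub}}(S) + \lambda_2 \log P_{\mathrm{LM}}(W),
\]
in which the subword LM $P_{\mathrm{sub}}(S)$ enters naturally as a subtracted prior term, in contrast to the conventional interpolation score $\log P(S \mid X) + \lambda \log P_{\mathrm{LM}}(W)$.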